Yup, makes sense. I'm just searching now for another example of the alert not firing, and I'll respond once I get that data.

jolson wrote:I'm noticing a couple of '8' values in that document, which means that it's *possible* the subsystem is clogging up around those times. I want to see if this is a possibility.
Let's adjust our crontab and add a timestamp (sorry, I should have had you do this yesterday). After adding the timestamp, I would like you to try to find any alerts in your Audit Log that are not firing properly. After we find a couple that do not fire properly, we can compare the times of high jobs/poller counts with misfiring alerts. Does that make sense?

Code: Select all
* * * * * /bin/date >> /root/pseveryminute
* * * * * /bin/ps -ef | /bin/egrep "jobs|poller" | grep -v grep | wc -l >> /root/pseveryminute
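Once the two cron jobs have been appending for a while, the alternating date/count lines in /root/pseveryminute can be paired up for easier correlation. A sketch using coreutils `paste`, which joins consecutive stdin lines when given two `-` operands (assumes the file strictly alternates date line, then count line):

```shell
# Pair each timestamp line with the process count that follows it.
# Assumes /root/pseveryminute alternates: date line, then count line,
# as produced by the two cron entries above.
paste - - < /root/pseveryminute
```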
Some alerts not firing
Re: Some alerts not firing
Ok, I found another one that didn't fire.
Log message:
Code: Select all
Field Value
@timestamp 2015-09-16T06:09:31.551Z
@version 1
CSTTError 149152495
Mitch_ErrorNum 79452
Mitch_Message %%MAPPMatrixProcessor%%loadTheMie%%%%%%%%
Mitch_Message1 [jboss_jbc9] Correlated error.
Mitch_Priority NONFATAL
Mitch_host pjbc9n3lxv jboss_jbc9: node3:
Mitch_org_host papp16lxv:jboss_jbc9:node1
Mitch_prog %%CORESERVICES%%MCM_HELPER_SERVICE%%
Mitchell_time 2015-09-15T23:09:31.318-07:00
_id AU_UxbnXCYWYHvT22t4m
_index logstash-2015.09.16
_type syslog
facility 0
facility_label kernel
host 172.*.*.*
message <11>pjbc9n3lxv jboss_jbc9: node3: [ID 21089 Sep 15, 2015 11:09:31 PM com.services.core.errorlog.server.SyslogWriter write#012SEVERE: %%149152495%%79452%%NONFATAL%%papp16lxv:jboss_jbc9:node1%%CORESERVICES%%MCM_HELPER_SERVICE%%2015-09-15T23:09:31.318-07:00%%MAPPMatrixProcessor%%loadTheMie%%%%%%%%[jboss_jbc9] Correlated error.
priority 0
severity 0
severity_label Emergency
syslog5424_pri 11
syslog_message [ID 21089 Sep 15, 2015 11:09:31 PM com.services.core.errorlog.server.SyslogWriter write#012SEVERE: %
tags _grokparsefailure_sysloginput,Mitchell_Jboss_Log,_grokparsefailure
type syslog

Code: Select all
2015-09-16T06:20:08.887-07:00 ALERT Alert ID DJXSAoheTFmsd3Tw_5YBwA returned OK: 0 matching entries found |logs=0;0;0
Field Value
_id AU_WT_a3CYWYHvT2210V
_index nagioslogserver_log
_type ALERT
created 1442409608887
message Alert ID DJXSAoheTFmsd3Tw_5YBwA returned OK: 0 matching entries found |logs=0;0;0
source Nagios Log Server
type ALERT

Code: Select all
Wed Sep 16 06:07:01 PDT 2015
6
Wed Sep 16 06:08:01 PDT 2015
6
Wed Sep 16 06:09:01 PDT 2015
6
Wed Sep 16 06:10:01 PDT 2015
6
Wed Sep 16 06:11:01 PDT 2015
6
Wed Sep 16 06:12:01 PDT 2015
6
Wed Sep 16 06:13:01 PDT 2015
6
Wed Sep 16 06:14:01 PDT 2015
4
Wed Sep 16 06:15:01 PDT 2015
6
Wed Sep 16 06:16:01 PDT 2015
5
Wed Sep 16 06:17:01 PDT 2015
5
Wed Sep 16 06:18:01 PDT 2015
6
Wed Sep 16 06:19:01 PDT 2015
5
Wed Sep 16 06:20:01 PDT 2015
6
Wed Sep 16 06:21:01 PDT 2015
6
Wed Sep 16 06:22:01 PDT 2015
6
Wed Sep 16 06:23:01 PDT 2015
6
Wed Sep 16 06:24:01 PDT 2015
5
Wed Sep 16 06:25:01 PDT 2015
6

Re: Some alerts not firing
Hmmm... it *could* be the number of poller/jobs running that causes this problem, but I'm not convinced. A few questions for you:
1. I assume that other alerts are running fine in roughly the same time period. Is that a correct statement?
2. Do any alerts appear to be consistently failing, or is it random?
Re: Some alerts not firing
jolson wrote:Hmmm... it *could* be the number of poller/jobs running that causes this problem, but I'm not convinced. A few questions for you:
1. I assume that other alerts are running fine in roughly the same time period. Is that a correct statement?
2. Do any alerts appear to be consistently failing, or is it random?
Yes, other alerts are firing fine, and it does seem like it's random at this point. I haven't seen anything close to a repeat so far.
Re: Some alerts not firing
Let's start a tail of your jobs.log to see if any relevant information comes up:
Code: Select all
tail -f /usr/local/nagioslogserver/var/jobs.log
Is there anything out of the ordinary? The tail may need to run until an alert has failed. I'm hoping it provides us with a better idea as to what might be wrong here.
Every minute, you should see a result like this:
Code: Select all
Running command run_alerts with args ' ' for job id: run_all_alerts
SUCCESS
Another question: What timezone are you currently in?
Re: Some alerts not firing
jolson wrote:Another question: What timezone are you currently in?
We are in the PST timezone.
Also, is there a way to disable the truncating of the log files while we troubleshoot this? The tail won't really be of use without a timestamp, since we need to compare when the missing alert occurred against this jobs log file... Actually, while typing this I wrote a quick bash script that solves that problem.
Code: Select all
#!/bin/bash
# Print a timestamp, then the current contents of jobs.log, every 5 seconds.
while true; do
    DATE=$(date +"%Y-%m-%d_%H%M%S")
    printf '%s\n' "$DATE"
    cat /usr/local/nagioslogserver/var/jobs.log
    printf '\n'
    sleep 5
done
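A lighter-weight alternative sketch: instead of re-dumping the whole file every few seconds, prefix each new jobs.log line with a timestamp as it arrives. (This assumes the same jobs.log path used above; the read loop itself is plain POSIX shell.)

```shell
# Timestamp each line of jobs.log output as it is written,
# so log entries can be matched against the alert audit log later.
tail -f /usr/local/nagioslogserver/var/jobs.log | while IFS= read -r line; do
    printf '%s %s\n' "$(date +"%Y-%m-%d_%H%M%S")" "$line"
done
```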
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: Some alerts not firing
Hi Jklre,
So, 1 lost per day - no correlation.
I feel like we're running up against a bug. With so many alerts configured, only 1 getting lost a day, and the consistent behavior of the jobs subsystem, I just don't see a load or "clogging" issue. That's my opinion, of course; right now I don't see a silver bullet.
Are you continuing to add alerts? I suspect that the problem is somehow related to your large number of alerts; I'm just not seeing how. I wonder whether, as the number of alerts you have configured goes up, we'll see an increase in missed alerts.
Re: Some alerts not firing
jdalrymple wrote:Are you continuing to add alerts? I suspect that the problem is somehow related to your great number of alerts, I'm just not seeing how.
Now that we have all of our rules ported over, the number of them should remain rather static (around 2000). One a day is pretty much an educated guess at this point, since the alerts only fire when there's an issue, so there's really no fixed number for them. If the alerts aren't firing, there's really no way to test them. So we are currently comparing the output of our previous log tool against the results of Nagios Log Server, and that's where we're seeing the discrepancy.
Re: Some alerts not firing
I understand.
Finding the correlation is going to be key here. Let's let the system stew over the weekend; if you can collect any missed alerts and send them to us Monday, we'll review them and hopefully the relationship will become apparent. Right now I just don't see it, because all your alerts appear pretty similar to begin with.
Re: Some alerts not firing
jdalrymple wrote:Let's let the system stew over the weekend and if you can collect any missed alerts and send them to us Monday we'll review and hopefully the relationship will become apparent.
Ok, I've been doing some digging and found another example.
It looks like yesterday there was a large chunk of missing alerts, from 6:20pm 9/30 to 12:27am 10/1.
I'm trying to think of what else these alerts have in common.
Here's the dashboard for this alert.
Here's the audit log.
2015-10-01T00:27:08.195-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned CRITICAL: 3 matching entries found |logs=3;0;0
2015-10-01T00:11:25.774-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T23:55:50.769-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T23:40:17.170-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T23:24:38.186-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T23:09:01.865-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T22:53:27.024-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T22:37:45.788-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T22:22:09.106-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T22:06:31.000-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T21:50:55.810-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T21:35:21.279-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T21:19:38.029-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T21:04:01.687-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T20:48:26.205-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T20:32:46.639-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T20:17:08.491-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T20:01:31.414-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T19:45:56.475-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T19:30:21.223-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T19:14:39.150-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T18:59:02.141-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T18:43:26.805-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T18:27:46.094-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T18:12:08.162-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T17:56:30.697-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
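The roughly 15-16 minute cadence above suggests a simple check: compute the gap between consecutive audit entries and flag anything well beyond the check interval. A sketch (assumes GNU `date`; `timestamps.txt` is a hypothetical file holding one ISO-8601 timestamp per line, newest first, as extracted from an audit log like the one above):

```shell
# Flag gaps between consecutive audit-log entries that exceed 20 minutes,
# which would mark a window where alert runs went missing.
# Assumes GNU date; timestamps.txt holds one ISO-8601 timestamp per line,
# newest first (timestamps.txt is a stand-in, not a real NLS file).
prev=""
while IFS= read -r ts; do
    now=$(date -d "$ts" +%s)          # convert to epoch seconds
    if [ -n "$prev" ]; then
        gap=$(( (prev - now) / 60 ))  # entries are newest-first
        [ "$gap" -gt 20 ] && echo "gap of ${gap} min before $ts"
    fi
    prev=$now
done < timestamps.txt
```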
Code: Select all
2015-09-30_181758
Running command run_alerts with args ' ' for job id: run_all_alerts
SUCCESS
Processed 0 node jobs.
Processed 1 global jobs.
2015-09-30_183158
Processed 0 node jobs.
Processed 0 global jobs.
Running command run_alerts with args ' ' for job id: run_all_alerts
SUCCESS
2015-09-30_183359
Running command run_alerts with args ' ' for job id: run_all_alerts
SUCCESS
Processed 0 node jobs.
Processed 1 global jobs.

Code: Select all
Wed Sep 30 18:08:01 PDT 2015
6
Wed Sep 30 18:09:01 PDT 2015
6
Wed Sep 30 18:10:01 PDT 2015
6
Wed Sep 30 18:11:01 PDT 2015
6
Wed Sep 30 18:12:01 PDT 2015
6
Wed Sep 30 18:13:01 PDT 2015
5
Wed Sep 30 18:14:01 PDT 2015
4
Wed Sep 30 18:15:01 PDT 2015
6
Wed Sep 30 18:16:01 PDT 2015
6
Wed Sep 30 18:17:01 PDT 2015
5
Wed Sep 30 18:18:01 PDT 2015
6
Wed Sep 30 18:19:01 PDT 2015
6
Wed Sep 30 18:20:01 PDT 2015
6
Wed Sep 30 18:21:01 PDT 2015
6
Wed Sep 30 18:22:01 PDT 2015
6
Wed Sep 30 18:23:01 PDT 2015
6
Wed Sep 30 18:24:01 PDT 2015
6
Wed Sep 30 18:25:01 PDT 2015
6
Wed Sep 30 18:26:01 PDT 2015
6
Wed Sep 30 18:27:02 PDT 2015
6
Wed Sep 30 18:28:01 PDT 2015
6
Wed Sep 30 18:29:01 PDT 2015
6
Wed Sep 30 18:30:01 PDT 2015
5
Wed Sep 30 18:31:01 PDT 2015
6
Wed Sep 30 18:32:01 PDT 2015
6