Page 2 of 4

Re: Some alerts not firing

Posted: Wed Sep 16, 2015 12:07 pm
by Jklre
jolson wrote:I'm noticing a couple of '8' values in that document, which means that it's *possible* the subsystem is clogging up around those times. I want to see if this is a possibility.

Let's adjust our crontab and add a timestamp (sorry, I should have had you do this yesterday):

Code: Select all

* * * * * /bin/date >> /root/pseveryminute
* * * * * /bin/ps -ef | /bin/egrep "jobs|poller" | grep -v grep | wc -l >> /root/pseveryminute
After adding the timestamp, I would like you to try to find any alerts in your Audit Log that are not firing properly. After we find a couple that do not fire properly, we can compare the times of high jobs/poller counts with misfiring alerts. Does that make sense?
Yup, makes sense. I'm just searching now for another example of the alert not firing, and I'll respond once I get that data.

Re: Some alerts not firing

Posted: Wed Sep 16, 2015 12:32 pm
by Jklre
Jklre wrote:
jolson wrote:I'm noticing a couple of '8' values in that document, which means that it's *possible* the subsystem is clogging up around those times. I want to see if this is a possibility.

Let's adjust our crontab and add a timestamp (sorry, I should have had you do this yesterday):

Code: Select all

* * * * * /bin/date >> /root/pseveryminute
* * * * * /bin/ps -ef | /bin/egrep "jobs|poller" | grep -v grep | wc -l >> /root/pseveryminute
After adding the timestamp, I would like you to try to find any alerts in your Audit Log that are not firing properly. After we find a couple that do not fire properly, we can compare the times of high jobs/poller counts with misfiring alerts. Does that make sense?
Yup, makes sense. I'm just searching now for another example of the alert not firing, and I'll respond once I get that data.
Ok I found another one that didn't fire.

Log message:

Code: Select all

Field	Action	Value	Search
 @timestamp 	  	2015-09-16T06:09:31.551Z	 
 @version	  	1	 
 CSTTError	  	149152495	  
 Mitch_ErrorNum	  	79452	 
 Mitch_Message	  	%%MAPPMatrixProcessor%%loadTheMie%%%%%%%%	 
 Mitch_Message1	  	[jboss_jbc9] Correlated error.
 Mitch_Priority	  	NONFATAL	 
 Mitch_host	  	pjbc9n3lxv jboss_jbc9: node3:	 
 Mitch_org_host	  	papp16lxv:jboss_jbc9:node1	 
 Mitch_prog	  	%%CORESERVICES%%MCM_HELPER_SERVICE%%	 
 Mitchell_time	  	2015-09-15T23:09:31.318-07:00	 
 _id	  	AU_UxbnXCYWYHvT22t4m	 
 _index	  	logstash-2015.09.16	 
 _type	  	syslog	 
 facility	  	0	 
 facility_label	  	kernel	 
 host	  	172.*.*.*	 
 message	  	<11>pjbc9n3lxv jboss_jbc9: node3: [ID 21089 Sep 15, 2015 11:09:31 PM com.services.core.errorlog.server.SyslogWriter write#012SEVERE: %%149152495%%79452%%NONFATAL%%papp16lxv:jboss_jbc9:node1%%CORESERVICES%%MCM_HELPER_SERVICE%%2015-09-15T23:09:31.318-07:00%%MAPPMatrixProcessor%%loadTheMie%%%%%%%%[jboss_jbc9] Correlated error.
 priority	  	0	 
 severity	  	0	 
 severity_label	  	Emergency	 
 syslog5424_pri	  	11	 
 syslog_message	  	 [ID 21089 Sep 15, 2015 11:09:31 PM com.services.core.errorlog.server.SyslogWriter write#012SEVERE: %	 
 tags	  	_grokparsefailure_sysloginput,Mitchell_Jboss_Log,_grokparsefailure	 
 type	  	syslog
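For what it's worth, the _grokparsefailure tags above suggest the stock patterns choke on the '%%'-delimited SyslogWriter payload. A minimal sketch of splitting that payload by hand (the sample string is just lifted from the parsed columns above; field positions are guesses):

```shell
# Split a '%%'-delimited SyslogWriter payload into numbered fields,
# skipping the empty fields produced by leading/trailing separators.
msg='%%149152495%%79452%%NONFATAL%%papp16lxv:jboss_jbc9:node1%%CORESERVICES%%MCM_HELPER_SERVICE%%'
echo "$msg" | awk -F'%%' '{ for (i = 2; i <= NF; i++) if ($i != "") print i-1 ": " $i }'
```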
Audit log:

Code: Select all

2015-09-16T06:20:08.887-07:00	ALERT	Alert ID DJXSAoheTFmsd3Tw_5YBwA returned OK: 0 matching entries found |logs=0;0;0

Field	Action	Value	Search
 _id	  	AU_WT_a3CYWYHvT2210V	 
 _index	  	nagioslogserver_log	 
 _type	  	ALERT	 
 created 	  	1442409608887	 
 message	  	Alert ID DJXSAoheTFmsd3Tw_5YBwA returned OK: 0 matching entries found |logs=0;0;0	 
 source	  	Nagios Log Server	 
 type	  	ALERT
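One thing worth noting when lining these entries up: the created field above is epoch milliseconds, so it can be cross-checked against the ISO timestamp with GNU date (a sketch; it drops the milliseconds):

```shell
# "created" from the audit entry above, in epoch milliseconds.
created=1442409608887
# Trim to seconds and render in UTC; this matches the
# 2015-09-16T06:20:08.887-07:00 audit entry (13:20:08 UTC).
date -u -d "@$((created / 1000))" +%Y-%m-%dT%H:%M:%SZ
```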
pseveryminute file

Code: Select all

Wed Sep 16 06:07:01 PDT 2015
6
Wed Sep 16 06:08:01 PDT 2015
6
Wed Sep 16 06:09:01 PDT 2015
6
Wed Sep 16 06:10:01 PDT 2015
6
Wed Sep 16 06:11:01 PDT 2015
6
Wed Sep 16 06:12:01 PDT 2015
6
Wed Sep 16 06:13:01 PDT 2015
6
Wed Sep 16 06:14:01 PDT 2015
4
6
Wed Sep 16 06:15:01 PDT 2015
Wed Sep 16 06:16:01 PDT 2015
5
Wed Sep 16 06:17:01 PDT 2015
5
Wed Sep 16 06:18:01 PDT 2015
6
Wed Sep 16 06:19:01 PDT 2015
5
Wed Sep 16 06:20:01 PDT 2015
6
Wed Sep 16 06:21:01 PDT 2015
6
Wed Sep 16 06:22:01 PDT 2015
6
Wed Sep 16 06:23:01 PDT 2015
6
Wed Sep 16 06:24:01 PDT 2015
5
Wed Sep 16 06:25:01 PDT 2015
6
Let me know if you need some additional information. Thanks
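As an aside, the interleaved date/count output above can be scanned for off-baseline minutes instead of eyeballing it; a minimal awk sketch, assuming 6 is the normal count on this box:

```shell
# Print any minute whose poller/jobs count is not the baseline of 6.
# Lines that are pure digits are counts; everything else is a timestamp.
awk '
    /^[0-9]+$/ { if ($1 != 6) print ts " -> count " $1; next }
               { ts = $0 }
' /root/pseveryminute
```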

Re: Some alerts not firing

Posted: Wed Sep 16, 2015 4:38 pm
by jolson
Hmmm... it *could* be the number of poller/jobs running that causes this problem, but I'm not convinced. A few questions for you:
1. I assume that other alerts are running fine in roughly the same time period. Is that a correct statement?
2. Do any alerts appear to be consistently failing, or is it random?

Re: Some alerts not firing

Posted: Thu Sep 17, 2015 10:46 am
by Jklre
jolson wrote:Hmmm... it *could* be the number of poller/jobs running that causes this problem, but I'm not convinced. A few questions for you:
1. I assume that other alerts are running fine in roughly the same time period. Is that a correct statement?
2. Do any alerts appear to be consistently failing, or is it random?

Yes, other alerts are firing fine, and it does seem random at this point. I haven't seen anything close to a repeat so far.

Re: Some alerts not firing

Posted: Thu Sep 17, 2015 3:53 pm
by jolson
Let's start a tail of your jobs.log to see if any relevant information comes up:

Code: Select all

tail -f /usr/local/nagioslogserver/var/jobs.log
Every minute, you should see a result like this:
Running command run_alerts with args ' ' for job id: run_all_alerts
SUCCESS
Is there anything out of the ordinary? The tail may need to run until an alert has failed. I'm hoping it gives us a better idea of what might be wrong here.

Another question: What timezone are you currently in?

Re: Some alerts not firing

Posted: Thu Sep 17, 2015 5:44 pm
by Jklre
jolson wrote:Let's start a tail of your jobs.log to see if any relevant information comes up:

Code: Select all

tail -f /usr/local/nagioslogserver/var/jobs.log
Every minute, you should see a result like this:
Running command run_alerts with args ' ' for job id: run_all_alerts
SUCCESS
Is there anything out of the ordinary? The tail may need to run until an alert has failed. I'm hoping it gives us a better idea of what might be wrong here.

Another question: What timezone are you currently in?
We are in the PST timezone.

Also, is there a way to disable the truncating of the log files while we troubleshoot this? The tail won't really be of use without a timestamp, since we need to compare when the missing alert occurred against this jobs log file... Actually, while typing this I wrote a quick bash script that solves that problem.

Code: Select all

#!/bin/bash
# Every 5 seconds, print a timestamp followed by the current jobs.log.
while true; do
	printf '%s\n' "$(date +"%Y-%m-%d_%H%M%S")"
	cat /usr/local/nagioslogserver/var/jobs.log
	printf '\n'
	sleep 5
done
So far it seems like we are averaging one lost alert per day (from what I can find). We don't have too many alerts firing when these missing alerts occur. My concern is that if we have a big incident, we will lose quite a large number of these alerts.
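An alternative sketch that stamps each new jobs.log line as it arrives, rather than re-printing the whole file every 5 seconds (same log path as in the script above):

```shell
#!/bin/bash
# Prefix every new line of jobs.log with an ISO-style timestamp.
tail -F /usr/local/nagioslogserver/var/jobs.log | while IFS= read -r line; do
    printf '%s %s\n' "$(date +%Y-%m-%dT%H:%M:%S)" "$line"
done
```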

Re: Some alerts not firing

Posted: Fri Sep 18, 2015 1:12 pm
by jdalrymple
Hi Jklre,

So 1 lost per day - no correlation.

I feel like we're running up against a bug. With so many alerts configured, only one getting lost per day, and the consistent behavior of the jobs subsystem, I just don't see a load or "clogging" issue. That's my opinion, of course; right now I don't see a silver bullet.

Are you continuing to add alerts? I suspect that the problem is somehow related to your great number of alerts, I'm just not seeing how. I wonder whether we'll see an increase in missed alerts as the number of alerts you have configured goes up.

Re: Some alerts not firing

Posted: Fri Sep 18, 2015 1:34 pm
by Jklre
jdalrymple wrote:Hi Jklre,

So 1 lost per day - no correlation.

I feel like we're running up against a bug. With so many alerts configured, only one getting lost per day, and the consistent behavior of the jobs subsystem, I just don't see a load or "clogging" issue. That's my opinion, of course; right now I don't see a silver bullet.

Are you continuing to add alerts? I suspect that the problem is somehow related to your great number of alerts, I'm just not seeing how. I wonder whether we'll see an increase in missed alerts as the number of alerts you have configured goes up.

Now that we have all of our rules ported over, the number of them should remain fairly static (around 2000). One a day is pretty much an educated guess at this point, since the alerts only fire when there's an issue, so there's really no fixed number for them. If the alerts aren't firing, there is really no way to test them. So we are currently comparing the output of our previous log tool against the results of Nagios Log Server, and that's where we see the discrepancy.

Re: Some alerts not firing

Posted: Fri Sep 18, 2015 2:34 pm
by jdalrymple
I understand.

Finding the correlation is going to be key here. Let's let the system stew over the weekend; if you can collect any missed alerts and send them to us Monday, we'll review them and hopefully the relationship will become apparent. Right now I just don't see it, because all of your alerts look pretty similar to begin with.

Re: Some alerts not firing

Posted: Thu Oct 01, 2015 12:51 pm
by Jklre
jdalrymple wrote:I understand.

Finding the correlation is going to be key here. Let's let the system stew over the weekend; if you can collect any missed alerts and send them to us Monday, we'll review them and hopefully the relationship will become apparent. Right now I just don't see it, because all of your alerts look pretty similar to begin with.

Ok, I've been doing some digging and found another example.

It looks like yesterday there was a large chunk of missing alerts, from 6:20 pm 9/30 to 12:27 am 10/1.

I'm trying to think of what else these alerts have in common.

Here's the dashboard for this alert (attachment: 1.jpg).
Here's the audit log.

2015-10-01T00:27:08.195-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned CRITICAL: 3 matching entries found |logs=3;0;0
2015-10-01T00:11:25.774-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T23:55:50.769-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T23:40:17.170-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T23:24:38.186-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T23:09:01.865-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T22:53:27.024-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T22:37:45.788-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T22:22:09.106-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T22:06:31.000-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T21:50:55.810-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T21:35:21.279-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T21:19:38.029-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T21:04:01.687-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T20:48:26.205-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T20:32:46.639-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T20:17:08.491-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T20:01:31.414-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T19:45:56.475-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T19:30:21.223-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T19:14:39.150-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T18:59:02.141-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T18:43:26.805-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T18:27:46.094-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T18:12:08.162-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
2015-09-30T17:56:30.697-07:00 ALERT Alert ID b_X6MCWxQv2zGCdkeuIgOA returned OK: 0 matching entries found |logs=0;0;0
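Since this alert runs roughly every 15 minutes, a gap would jump out if you difference consecutive timestamps; a rough sketch using GNU date, assuming the audit lines above are saved newest-first to a file named audit.log (hypothetical name):

```shell
#!/bin/bash
# Print the interval, in minutes, between consecutive audit-log checks.
# Expects one ISO-8601 timestamp at the start of each line, newest first.
prev=""
while read -r ts _; do
    cur=$(date -d "$ts" +%s)
    [ -n "$prev" ] && echo "$ts  gap: $(( (prev - cur) / 60 )) min"
    prev=$cur
done < audit.log
```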

Code: Select all

2015-09-30_181758
Running command run_alerts with args ' ' for job id: run_all_alerts
SUCCESS
Processed 0 node jobs.
Processed 1 global jobs.

2015-09-30_183158
Processed 0 node jobs.
Processed 0 global jobs.

Running command run_alerts with args ' ' for job id: run_all_alerts
SUCCESS

2015-09-30_183359
Running command run_alerts with args ' ' for job id: run_all_alerts
SUCCESS
Processed 0 node jobs.
Processed 1 global jobs.
pseveryminute file

Code: Select all

Wed Sep 30 18:08:01 PDT 2015
6
Wed Sep 30 18:09:01 PDT 2015
6
Wed Sep 30 18:10:01 PDT 2015
6
Wed Sep 30 18:11:01 PDT 2015
6
Wed Sep 30 18:12:01 PDT 2015
6
Wed Sep 30 18:13:01 PDT 2015
5
Wed Sep 30 18:14:01 PDT 2015
4
Wed Sep 30 18:15:01 PDT 2015
6
Wed Sep 30 18:16:01 PDT 2015
6
Wed Sep 30 18:17:01 PDT 2015
5
Wed Sep 30 18:18:01 PDT 2015
6
Wed Sep 30 18:19:01 PDT 2015
6
Wed Sep 30 18:20:01 PDT 2015
6
Wed Sep 30 18:21:01 PDT 2015
6
Wed Sep 30 18:22:01 PDT 2015
6
Wed Sep 30 18:23:01 PDT 2015
6
Wed Sep 30 18:24:01 PDT 2015
6
Wed Sep 30 18:25:01 PDT 2015
6
Wed Sep 30 18:26:01 PDT 2015
6
Wed Sep 30 18:27:02 PDT 2015
6
Wed Sep 30 18:28:01 PDT 2015
6
Wed Sep 30 18:29:01 PDT 2015
6
Wed Sep 30 18:30:01 PDT 2015
5
Wed Sep 30 18:31:01 PDT 2015
6
Wed Sep 30 18:32:01 PDT 2015
6