Some alerts not firing

Posted: Wed Sep 09, 2015 4:56 pm
by Jklre
It looks like we have a few sporadic alerts that are not firing.

For example, for one of our alerts (ID "AU7QO0hUMpN7f10cl0Gq"), the audit reports show it as running, but it returns 0 results.

One occurrence was on 09-08 at 19:30 hours.

See here:
MQ0.jpg
But you can see in the audit log for the alert:
MQ2.jpg
There were actually quite a few missed for this specific alert. We got one alert on 09-05:
mq1.jpg
The alert is configured to run every 15 minutes, with thresholds of 0 (warning) and 0 (critical).

Here is the query for the alert:

{"query":{"filtered":{"query":{"bool":{"should":[{"query_string":{"query":"*"}}]}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1438011721702,"to":1438012621702}}},{"fquery":{"query":{"query_string":{"query":"Mitch_Message:(\" ALERT*CRIT: QM ECS: Channel State: Chl\" AND \"is Retrying\")"}},"_cache":true}}]}}}}}
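For anyone checking the time window in the query above, a minimal shell sketch can decode the two epoch-millisecond bounds of the range filter. It assumes GNU date (for the -d @SECONDS form) and that the bounds are UTC epoch milliseconds; the variable names are illustrative only.

```shell
#!/bin/bash
# Epoch-millisecond bounds copied from the range filter in the query above.
from_ms=1438011721702
to_ms=1438012621702

# Width of the window in whole minutes (1 minute = 60000 ms).
window_min=$(( (to_ms - from_ms) / 60000 ))
echo "window: ${window_min} minutes"

# Human-readable bounds in UTC (the -d @SECONDS form requires GNU date).
date -u -d "@$(( from_ms / 1000 ))" '+from: %Y-%m-%d %H:%M:%S UTC'
date -u -d "@$(( to_ms / 1000 ))"   '+to:   %Y-%m-%d %H:%M:%S UTC'
```

The window works out to 15 minutes, the same as the alert's run interval, and the printed bounds can be compared against the times of the missed alerts.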

Any ideas on how we can troubleshoot this?

Thank you.

Re: Some alerts not firing

Posted: Thu Sep 10, 2015 2:56 pm
by tmcdonald
What lookback period do you have set for the alert?

Re: Some alerts not firing

Posted: Thu Sep 10, 2015 4:57 pm
by Jklre
tmcdonald wrote:What lookback period do you have set for the alert?
The lookback period is also 15 minutes.

Re: Some alerts not firing

Posted: Fri Sep 11, 2015 1:58 pm
by jolson
I know that you have many, many alerts. Could you tell me how many alerts there are exactly?

My hope is that the subsystem isn't getting clogged with all of the alerts - but it is a possibility.

Let's see the results of the following:

Code:

ps -ef | egrep "jobs|poller" | grep -v grep | wc -l

The reported number should be between 2 and 4. I'd like to see if you have any hanging jobs/poller processes.
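As a rough sketch, the same check can be wrapped in a small script that also says whether the count falls in the 2 to 4 range. The function names count_workers and classify are made up for illustration and are not part of any Nagios tooling.

```shell
#!/bin/bash
# Count processes whose ps line mentions "jobs" or "poller".
# The [j] and [p] brackets keep grep from matching its own command line,
# which replaces the separate "grep -v grep" step.
count_workers() {
    ps -ef | grep -cE '[j]obs|[p]oller'
}

# Classify a count against the 2-4 range quoted above.
classify() {
    if [ "$1" -ge 2 ] && [ "$1" -le 4 ]; then
        echo "ok"
    else
        echo "check"
    fi
}

n=$(count_workers)
echo "$n $(classify "$n")"
```

A result of "check" only means the count is outside the expected range at that moment; it does not by itself prove that processes are hanging.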

Re: Some alerts not firing

Posted: Fri Sep 11, 2015 4:59 pm
by Jklre
There are 1925 alerts at the moment

[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4

[root@pnls02lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4

Re: Some alerts not firing

Posted: Mon Sep 14, 2015 10:27 am
by jolson
That output looks fine. What I'd like you to do is set up a cron job to run the 'ps -ef' commands every minute or so.

Log in as root and enter cron edit mode:

Code:

crontab -e

Add the following line to your crontab:

Code:

* * * * * /bin/ps -ef | /bin/egrep "jobs|poller" | grep -v grep | wc -l >> /root/pseveryminute

After about a day, send the 'pseveryminute' file over to us. You shouldn't see numbers higher than 10 or so; if you do, it's likely that the jobs subsystem is stressed in some way and we'll need to work around that. I've had cases where a client had 200+ jobs/pollers running at one time, and I want to make sure that's not the case here.
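Once the file has a day of samples, something like the following could summarise it before sending it over. This is a sketch only: summarize_counts is a hypothetical helper, and the default limit of 10 simply mirrors the "10 or so" figure above.

```shell
#!/bin/bash
# Summarise a file that holds one process count per line.
# Prints the number of samples, the highest count, and how many
# samples were above the limit (default 10).
summarize_counts() {
    local file=$1 limit=${2:-10}
    awk -v limit="$limit" '
        /^[ \t]*[0-9]+[ \t]*$/ {
            n++
            if ($1 + 0 > max) max = $1 + 0
            if ($1 + 0 > limit) over++
        }
        END { printf "samples=%d max=%d over_limit=%d\n", n, max, over }
    ' "$file"
}

# Usage (path taken from the crontab entry above):
# summarize_counts /root/pseveryminute
```

Lines that are not plain numbers are skipped, so the summary still works if other output ends up in the file.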

Re: Some alerts not firing

Posted: Mon Sep 14, 2015 1:15 pm
by Jklre
I went ahead and set this up. Let's see what comes from this. I'll send this over tomorrow and we can see what we can find.

thanks

Re: Some alerts not firing

Posted: Mon Sep 14, 2015 2:28 pm
by jolson
Looking forward to your results! :)

Re: Some alerts not firing

Posted: Tue Sep 15, 2015 12:02 pm
by Jklre
jolson wrote:Looking forward to your results! :)
Here is the file we set up yesterday.

Re: Some alerts not firing

Posted: Tue Sep 15, 2015 12:08 pm
by jolson
I'm noticing a couple of '8' values in that document, which means that it's *possible* the subsystem is clogging up around those times. I want to see if this is a possibility.

Let's adjust our crontab and add a timestamp (sorry, I should have had you do this yesterday):

Code:

* * * * * /bin/date >> /root/pseveryminute
* * * * * /bin/ps -ef | /bin/egrep "jobs|poller" | grep -v grep | wc -l >> /root/pseveryminute

After adding the timestamp, I would like you to try to find any alerts in your Audit Log that are not firing properly. After we find a couple that do not fire properly, we can compare the times of high jobs/poller counts with misfiring alerts. Does that make sense?
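One possible variation, offered only as a sketch: because the two crontab entries run as separate jobs, the order of their lines in the file is not strictly guaranteed, so a single script can write the timestamp and the count on one line instead. The script path /root/pscount.sh, the function names, and the default limit of 4 are all assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper, e.g. saved as /root/pscount.sh and run from cron with:
#   * * * * * /root/pscount.sh >> /root/pseveryminute
# Keeping the date format inside the script avoids having to escape "%"
# characters, which crontab treats specially.

# Print "<timestamp> <count>" for a ps listing read from stdin.
stamp_count() {
    local count
    count=$(grep -cE 'jobs|poller')
    echo "$(date '+%Y-%m-%d %H:%M') ${count}"
}

# Print the samples in file $1 whose count (last field) exceeds limit $2.
high_samples() {
    awk -v limit="${2:-4}" '$NF + 0 > limit' "$1"
}

ps -ef | grep -v grep | stamp_count
```

With the log in this shape, running high_samples /root/pseveryminute lists the minutes with elevated counts, which can then be lined up against the misfiring alerts in the Audit Log.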