Some alerts not firing

Jklre · Post by **Jklre** » Wed Sep 09, 2015 4:56 pm

It looks like we have a few sporadic alerts that are not firing.

For example, one of our alerts (ID "AU7QO0hUMpN7f10cl0Gq") shows as running in the audit reports, but it returns 0 results.

One occurrence was on 09-08 at 19:30 hours.

See here:

MQ0.jpg

But you can see in the audit log for the alert:

MQ2.jpg

There were actually quite a few missed runs for this specific alert. We got one alert on 09-05:

mq1.jpg

The alert is configured to run every 15 minutes, with thresholds of 0 for warning and 0 for critical.

Here is the query for the alert:

{"query":{"filtered":{"query":{"bool":{"should":[{"query_string":{"query":"*"}}]}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1438011721702,"to":1438012621702}}},{"fquery":{"query":{"query_string":{"query":"Mitch_Message:(\" ALERT*CRIT: QM ECS: Channel State: Chl\" AND \"is Retrying\")"}},"_cache":true}}]}}}}}

Any ideas of how we can troubleshoot this?

Thank you.

tmcdonald · Post by **tmcdonald** » Thu Sep 10, 2015 2:56 pm

What lookback period do you have set for the alert?

Jklre · Post by **Jklre** » Thu Sep 10, 2015 4:57 pm

tmcdonald wrote:What lookback period do you have set for the alert?

The lookback period is also 15 minutes.
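
As a quick cross-check (a minimal sketch, not something posted in the thread), the "from" and "to" values in the range filter of the alert query above can be subtracted to confirm the window that was actually queried. The function name `window_minutes` is made up for illustration; the two numbers are copied from the query.

```shell
#!/bin/sh
# Length of a query window in whole minutes, given "from" and "to"
# as epoch timestamps in milliseconds.
# window_minutes is a hypothetical helper name, not part of any Nagios tool.
window_minutes() {
    echo $(( ($2 - $1) / 60000 ))
}

# Values copied from the @timestamp range filter in the alert query.
window_minutes 1438011721702 1438012621702
```

This prints 15, which matches the 15 minute lookback.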

jolson · Post by **jolson** » Fri Sep 11, 2015 1:58 pm

I know that you have many, many alerts. Could you tell me how many alerts there are exactly?

My hope is that the subsystem isn't getting clogged with all of the alerts - but it is a possibility.

Let's see the results of the following:

Code: Select all

ps -ef | egrep "jobs|poller" | grep -v grep | wc -l

The reported number should be between 2 and 4. I'd like to see whether you have any hanging jobs/poller processes.
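
If the count comes back higher than expected, it can help to list the matching processes instead of only counting them. The following is a sketch under the assumption of a Linux host with a procps-style `ps`; the function name `filter_jobs_pollers` is hypothetical.

```shell
#!/bin/sh
# Filter a process listing (read from stdin) down to jobs/poller entries.
# Bracketing the first letters keeps the pattern from matching this grep
# command's own entry in the process list, so "grep -v grep" is not needed.
filter_jobs_pollers() {
    grep -E '[j]obs|[p]oller'
}

# Show pid, elapsed running time, and full command line for each match.
# A long elapsed time on a jobs/poller entry would suggest a hanging process.
ps -eo pid,etime,args | filter_jobs_pollers || echo "no jobs/poller processes found"
```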

Jklre · Post by **Jklre** » Fri Sep 11, 2015 4:59 pm

jolson wrote:I know that you have many, many alerts. Could you tell me how many alerts there are exactly?

My hope is that the subsystem isn't getting clogged with all of the alerts - but it is a possibility.

Let's see the results of the following:
Code: Select all
ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
The reported number should be between 2 and 4, I'd like to see if you have any hanging jobs/poller processes.

jolson · Post by **jolson** » Mon Sep 14, 2015 10:27 am

That output looks fine. What I'd like you to do is set up a cron job to run the 'ps -ef' commands every minute or so.

Log in as root and enter cron edit mode:

Code: Select all

crontab -e

Add the following line to your crontab:

Code: Select all

* * * * * /bin/ps -ef | /bin/egrep "jobs|poller" | grep -v grep | wc -l >> /root/pseveryminute

After about a day, send the 'pseveryminute' file over to us. You shouldn't see numbers higher than 10 or so; if you do, it's likely that the jobs subsystem is stressed in some way and we'll need to work around that. I've had cases where a client had 200+ jobs/pollers running at one time, and I want to make sure that's not the case here.
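
Before sending the file, it can be summarized locally. This is only a sketch: the function name `summarize_counts` is made up, and it assumes the file contains one count per line as written by the cron entry above. The limit of 10 is the "10 or so" figure mentioned above.

```shell
#!/bin/sh
# Summarize a file of per-minute counts: number of samples, highest value,
# and how many samples exceeded the given limit.
# summarize_counts is a hypothetical helper name.
summarize_counts() {
    awk -v limit="$2" '
        # Only consider lines that hold a bare number (the wc -l output).
        /^[ \t]*[0-9]+[ \t]*$/ {
            n++
            if ($1 + 0 > max + 0) max = $1
            if ($1 + 0 > limit + 0) high++
        }
        END { printf "samples=%d max=%d over_limit=%d\n", n, max, high }
    ' "$1"
}

log=/root/pseveryminute
if [ -r "$log" ]; then
    summarize_counts "$log" 10
fi
```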

Jklre · Post by **Jklre** » Mon Sep 14, 2015 1:15 pm

jolson wrote:That output looks fine. What I'd like you to do is set up a cron job to run the 'ps -ef' commands every minute or so.

Log in as root and enter cron edit mode:
Code: Select all
crontab -e
Add the following line to your crontab:
Code: Select all
* * * * * /bin/ps -ef | /bin/egrep "jobs|poller" | grep -v grep | wc -l >> /root/pseveryminute
After about a day, send the results of the 'pseveryminute' file over to us - note that you shouldn't see numbers higher than 10 or so - if you do, it's likely that the jobs subsystem is stressed in some way and we'll need to workaround that. I've had cases where a client had 200+ jobs/pollers running at one time - I want to make sure that's not the case here.

I went ahead and set this up. Let's see what comes of it. I'll send the file over tomorrow and we can see what we find.

thanks

jolson · Post by **jolson** » Mon Sep 14, 2015 2:28 pm

Looking forward to your results!

Jklre · Post by **Jklre** » Tue Sep 15, 2015 12:02 pm

jolson wrote:Looking forward to your results!

Here is the file we set up yesterday.

jolson · Post by **jolson** » Tue Sep 15, 2015 12:08 pm

I'm noticing a couple of '8' values in that document, which means that it's *possible* the subsystem is clogging up around those times. I want to see if this is a possibility.

Let's adjust our crontab and add a timestamp (sorry, I should have had you do this yesterday):

Code: Select all

* * * * * /bin/date >> /root/pseveryminute
* * * * * /bin/ps -ef | /bin/egrep "jobs|poller" | grep -v grep | wc -l >> /root/pseveryminute

After adding the timestamp, I would like you to try to find any alerts in your Audit Log that are not firing properly. After we find a couple that do not fire properly, we can compare the times of high jobs/poller counts with misfiring alerts. Does that make sense?
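
To make that comparison easier, the timestamped file can be reduced to just the minutes with a high count. The sketch below assumes each `date` line is written immediately before its count line; cron starts both entries in the same minute, so that ordering is likely but not guaranteed. The function name `high_count_times` and the limit of 8 (the value noticed in the first file) are illustrative choices.

```shell
#!/bin/sh
# Pair each timestamp line with the count line that follows it and print
# only the pairs whose count is at or above the given limit.
# high_count_times is a hypothetical helper name.
high_count_times() {
    awk -v limit="$2" '
        # A bare number is a count: report it with the most recent timestamp.
        /^[ \t]*[0-9]+[ \t]*$/ {
            if ($1 + 0 >= limit + 0 && stamp != "") print stamp, $1
            next
        }
        # Anything else is treated as the output of /bin/date.
        { stamp = $0 }
    ' "$1"
}

log=/root/pseveryminute
if [ -r "$log" ]; then
    high_count_times "$log" 8
fi
```

The times it prints can then be checked against the Audit Log entries for alerts that ran but returned 0 results.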

Nagios Support Forum
