Some alerts not firing

Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Some alerts not firing

Post by Jklre »

It looks like we have a few sporadic alerts that are not firing.

For example, for one of our alerts (ID "AU7QO0hUMpN7f10cl0Gq"), the audit reports show it as running, but it returns 0 results.

One occurrence was on 09-08 at 19:30.

See here:
MQ0.jpg
But you can see in the audit log for the alert:
MQ2.jpg
There were actually quite a few missed for this specific alert. We got one alert on 09-05:
mq1.jpg
The alert is configured to run every 15 minutes, with thresholds of 0 for warning and 0 for critical.

Here is the query for the alert:

{"query":{"filtered":{"query":{"bool":{"should":[{"query_string":{"query":"*"}}]}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1438011721702,"to":1438012621702}}},{"fquery":{"query":{"query_string":{"query":"Mitch_Message:(\" ALERT*CRIT: QM ECS: Channel State: Chl\" AND \"is Retrying\")"}},"_cache":true}}]}}}}}

Any ideas of how we can troubleshoot this?

Thank you.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Some alerts not firing

Post by tmcdonald »

What lookback period do you have set for the alert?
Former Nagios employee
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Some alerts not firing

Post by Jklre »

tmcdonald wrote:What lookback period do you have set for the alert?
The lookback period is also 15 minutes.
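As a quick cross-check, the @timestamp range stored in the alert query posted earlier in the thread can be decoded with shell arithmetic to see whether it spans the same 15 minutes. This is only a sketch: the two bounds are copied from that query, 'window_minutes' and 'ms_to_utc' are hypothetical helper names, and 'ms_to_utc' assumes GNU date.

```shell
#!/bin/sh
# Bounds copied from the @timestamp range in the alert query (epoch milliseconds).
FROM_MS=1438011721702
TO_MS=1438012621702

# Width of a time window in whole minutes, given its two bounds in milliseconds.
window_minutes() {
    echo $(( ($2 - $1) / 60000 ))
}

# Convert epoch milliseconds to a UTC date string.
# Assumption: GNU date (the -d "@seconds" form is not portable to BSD date).
ms_to_utc() {
    date -u -d "@$(( $1 / 1000 ))" '+%Y-%m-%d %H:%M:%S'
}

window_minutes "$FROM_MS" "$TO_MS"   # prints 15
```

The stored range is 900000 ms wide, which matches a 15 minute lookback. Running 'ms_to_utc "$FROM_MS"' shows the date that the stored lower bound corresponds to.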
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Some alerts not firing

Post by jolson »

I know that you have many, many alerts. Could you tell me how many alerts there are exactly?

My hope is that the subsystem isn't getting clogged with all of the alerts - but it is a possibility.

Let's see the results of the following:

Code: Select all

ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
The reported number should be between 2 and 4. I'd like to see whether you have any hanging jobs/poller processes.
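A variation that may help spot hanging processes: instead of only counting the matches, list each one with its elapsed run time. This is a minimal sketch rather than an official Nagios tool. It assumes a procps-style 'ps' that accepts '-o pid,etime,args', and 'filter_jobs_pollers' is a hypothetical helper name.

```shell
#!/bin/sh
# Filter ps output (read from stdin) down to jobs/poller processes.
# The [j] and [p] brackets keep grep from matching its own command line,
# so no separate "grep -v grep" step is needed.
filter_jobs_pollers() {
    grep -E '[j]obs|[p]oller'
}

# Usage (assumes procps ps): show PID, elapsed time and command line.
#   ps -eo pid,etime,args | filter_jobs_pollers
# Count only, equivalent in spirit to the one-liner above:
#   ps -ef | filter_jobs_pollers | wc -l
```

A process whose elapsed time is far longer than the alert interval would be a candidate for a hung job.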
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Some alerts not firing

Post by Jklre »

There are 1925 alerts at the moment

[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4

[root@pnls02lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Some alerts not firing

Post by jolson »

That output looks fine. What I'd like you to do is set up a cron job to run the 'ps -ef' commands every minute or so.

Log in as root and enter cron edit mode:

Code: Select all

crontab -e
Add the following line to your crontab:

Code: Select all

* * * * * /bin/ps -ef | /bin/egrep "jobs|poller" | grep -v grep | wc -l >> /root/pseveryminute
After about a day, send the contents of the 'pseveryminute' file over to us. Note that you shouldn't see numbers higher than 10 or so. If you do, it's likely that the jobs subsystem is stressed in some way and we'll need to work around that. I've had cases where a client had 200+ jobs/pollers running at one time, and I want to make sure that's not the case here.
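Once the file has collected a day of samples, a short awk summary can show at a glance whether any value exceeded the limit. This is a sketch under the assumption that the file holds one integer per line; 'summarize_counts' is a hypothetical helper name, and the default limit of 10 is taken from the guidance above.

```shell
#!/bin/sh
# Summarize a file that contains one process count per line.
# Usage: summarize_counts FILE [LIMIT]
# Prints: samples=<n> max=<m> over=<k>, where "over" is the number of
# samples strictly greater than LIMIT (default 10).
summarize_counts() {
    file=$1
    limit=${2:-10}
    awk -v limit="$limit" '
        /^[0-9]+$/ {                      # ignore anything that is not a bare number
            n++
            if ($1 + 0 > max + 0)   max = $1
            if ($1 + 0 > limit + 0) over++
        }
        END { printf "samples=%d max=%d over=%d\n", n, max, over }
    ' "$file"
}

# Example: summarize_counts /root/pseveryminute
```

A non-zero "over" value would point to minutes worth comparing against the audit log.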
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Some alerts not firing

Post by Jklre »

I went ahead and set this up. Let's see what comes from this. I'll send the file over tomorrow and we can see what we find.

Thanks.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Some alerts not firing

Post by jolson »

Looking forward to your results! :)
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Some alerts not firing

Post by Jklre »

jolson wrote:Looking forward to your results! :)
Here is the file we set up yesterday.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Some alerts not firing

Post by jolson »

I'm noticing a couple of '8' values in that document, which means that it's *possible* the subsystem is clogging up around those times. I'd like to confirm whether that is actually happening.

Let's adjust our crontab and add a timestamp (sorry, I should have had you do this yesterday):

Code: Select all

* * * * * /bin/date >> /root/pseveryminute
* * * * * /bin/ps -ef | /bin/egrep "jobs|poller" | grep -v grep | wc -l >> /root/pseveryminute
After adding the timestamp, I would like you to try to find any alerts in your Audit Log that are not firing properly. After we find a couple that do not fire properly, we can compare the times of high jobs/poller counts with misfiring alerts. Does that make sense?
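One caveat: cron runs the two entries as separate jobs, so the order of the date line and the count line in the file may not be guaranteed. A possible alternative is a small wrapper script that writes the timestamp and the count on the same line. The sketch below is an assumption-laden example, not part of Nagios Log Server: the function names, the script path and the log path in the usage comment are all hypothetical.

```shell
#!/bin/sh
# Print one sample as "<timestamp> <count>" so the two values stay together.
# grep -c counts matching lines; the [j]/[p] brackets stop grep from
# counting its own command line.
log_sample() {
    count=$(ps -ef | grep -c -E '[j]obs|[p]oller')
    echo "$(date '+%Y-%m-%dT%H:%M:%S') $count"
}

# List samples whose count is at or above a threshold.
# Usage: high_samples THRESHOLD FILE
high_samples() {
    awk -v t="$1" '$2 + 0 >= t + 0' "$2"
}

# Hypothetical crontab entry, with a script that ends in a call to log_sample
# saved as /root/pssample.sh:
#   * * * * * /root/pssample.sh >> /root/pssamples.log
# Later, to list the minutes with 8 or more processes:
#   high_samples 8 /root/pssamples.log
```

Keeping the date format inside a script also avoids having to escape '%' characters, which crontab treats specially. The timestamps printed by 'high_samples' can then be compared directly with the times of missed alerts in the audit log.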