Some alerts not firing
It looks like a few of our alerts are sporadically not firing.
For example, one of our alerts (ID "AU7QO0hUMpN7f10cl0Gq") shows as running in the audit reports, but it returns 0 results.
One occurrence was on 09-08 at 19:30.
See here:
But you can see in the audit log for the alert:
There were actually quite a few misses for this specific alert. We got one alert on 09-05.
The alert is configured to run every 15 minutes with a threshold of 0 warning / 0 critical.
Here is the query for the alert:
Code: Select all
{
  "query": {
    "filtered": {
      "query": {
        "bool": { "should": [ { "query_string": { "query": "*" } } ] }
      },
      "filter": {
        "bool": {
          "must": [
            { "range": { "@timestamp": { "from": 1438011721702, "to": 1438012621702 } } },
            {
              "fquery": {
                "query": { "query_string": { "query": "Mitch_Message:(\" ALERT*CRIT: QM ECS: Channel State: Chl\" AND \"is Retrying\")" } },
                "_cache": true
              }
            }
          ]
        }
      }
    }
  }
}
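As a sanity check on that query (an observation about the pasted JSON, not from the audit logs): the from/to epoch-millisecond values should span exactly the 15-minute lookback, which can be verified with shell arithmetic:

```shell
# Check that the query window matches the configured 15-minute lookback.
from=1438011721702
to=1438012621702
span_min=$(( (to - from) / 60000 ))   # milliseconds -> minutes
echo "window: ${span_min} minutes"    # prints: window: 15 minutes
```

If consecutive runs in the audit log are ever more than 15 minutes apart, events falling between two windows would never match, which would look exactly like an alert that "ran but returned 0 results".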
Any ideas of how we can troubleshoot this?
Thank you.
Re: Some alerts not firing
What lookback period do you have set for the alert?
Former Nagios employee
Re: Some alerts not firing
tmcdonald wrote:What lookback period do you have set for the alert?
The lookback period is also 15 minutes.
Re: Some alerts not firing
I know that you have many, many alerts. Could you tell me how many alerts there are exactly?
My hope is that the subsystem isn't getting clogged with all of the alerts - but it is a possibility.
Let's see the results of the following:
Code: Select all
ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
The reported number should be between 2 and 4; I'd like to see if you have any hanging jobs/poller processes.
Re: Some alerts not firing
jolson wrote:Could you tell me how many alerts there are exactly? Let's see the results of the following:
Code: Select all
ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
There are 1925 alerts at the moment.
[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
[root@pnls02lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
Re: Some alerts not firing
That output looks fine. What I'd like you to do is set up a cron job to run the 'ps -ef' command every minute or so.
Log in as root and enter cron edit mode:
Code: Select all
crontab -e
Add the following line to your crontab:
Code: Select all
* * * * * /bin/ps -ef | /bin/egrep "jobs|poller" | grep -v grep | wc -l >> /root/pseveryminute
After about a day, send the results of the 'pseveryminute' file over to us. Note that you shouldn't see numbers higher than 10 or so; if you do, it's likely that the jobs subsystem is stressed in some way and we'll need to work around that. I've had cases where a client had 200+ jobs/pollers running at one time - I want to make sure that's not the case here.
Re: Some alerts not firing
jolson wrote:That output looks fine. What I'd like you to do is set up a cron job to run the 'ps -ef' command every minute or so.
I went ahead and set this up. Let's see what comes from this. I'll send it over tomorrow and we can see what we find.
thanks
Re: Some alerts not firing
jolson wrote:Looking forward to your results!
Here is the file we set up yesterday.
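As an aside (a sketch, assuming 'pseveryminute' holds one count per line as the cron job above produces): rather than scanning the file by eye, awk can flag any sample above the ~10 threshold mentioned earlier:

```shell
# Demo data standing in for /root/pseveryminute: one count per minute.
printf '3\n4\n8\n4\n11\n' > /tmp/pseveryminute
# Print the line (minute) number and value of any sample above 10.
awk '$1 > 10 { print "minute " NR ": " $1 " jobs/poller processes" }' /tmp/pseveryminute
```

On the demo data this prints only the fifth sample, since 11 is the sole value above 10.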
Re: Some alerts not firing
I'm noticing a couple of '8' values in that file, which means it's *possible* the subsystem is clogging up around those times. I want to rule that out.
Let's adjust our crontab and add a timestamp (sorry, I should have had you do this yesterday):
Code: Select all
* * * * * /bin/date >> /root/pseveryminute
* * * * * /bin/ps -ef | /bin/egrep "jobs|poller" | grep -v grep | wc -l >> /root/pseveryminute
After adding the timestamp, try to find any alerts in your Audit Log that are not firing properly. Once we find a couple that don't fire properly, we can compare the times of high jobs/poller counts with the misfiring alerts. Does that make sense?
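A possible variant (an alternative not from the thread, offered as a sketch): pairing each timestamp with its count on a single line makes the later correlation step easier, though note that cron treats '%' specially, so it must be escaped:

```shell
# Hypothetical alternative crontab entry: timestamp and count on one line.
# In a crontab, '%' is special (it starts stdin/newline), so escape it as '\%'.
* * * * * echo "$(/bin/date '+\%F \%T') $(/bin/ps -ef | /bin/egrep 'jobs|poller' | grep -v grep | wc -l)" >> /root/pseveryminute
```

With this form, each line of 'pseveryminute' is directly comparable against an Audit Log timestamp.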