Wrong Alerts from Alerting function in NLS 1.4.4

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
comfone
Posts: 127
Joined: Fri May 01, 2015 3:28 am

Wrong Alerts from Alerting function in NLS 1.4.4

Post by comfone »

Hi,
We have around 44 alerts configured in our NLS. The alerts trigger an alarm via NRDP to our Nagios XI instance. From there we send either email or text message notifications.
Our problem is that NLS triggers false alerts. The status of the alert switches to critical, but when I show the dashboard of this alert, the condition is not given, and when I click on "run the alert now" it immediately switches back to "OK".
A false alert was triggered on each of the last two nights at almost the same time (3 a.m. CEST).
This specific alert has a lookback period of 90 min and a check interval of 30 min. The warning and critical thresholds are both set to 1:.
Our Command Subsystem looks like this:
cleanup_cmdsubsys 1 hour
backups 1 day
backup_maintenance 1 day
run_all_alerts 1 minute
run_update_check 1 day

Are there known issues with the Alerting function? Is there anything I can improve or configure to prevent this from happening again?
Best regards,
Philipp
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: Wrong Alerts from Alerting function in NLS 1.4.4

Post by mcapra »

comfone wrote: when I show the dashboard of this alert, the condition is not given
Could you elaborate on this? When you say the condition is not given, do you mean that the Nagios Log Server Alerts page does not display anything?

Can you also share the output of this command and tell me the name of the alert that originally produced this issue:

Code: Select all

curl -XGET 'http://localhost:9200/nagioslogserver/alert/_search?size=100'
Feel free to PM the results of that curl command as it may contain sensitive information.
Former Nagios employee
https://www.mcapra.com/
comfone

Re: Wrong Alerts from Alerting function in NLS 1.4.4

Post by comfone »

Hi,
thanks for your answer, and sorry for my unclear explanation.
By "condition is not given" I meant that NLS receives the expected log information regularly, so there is no (obvious) reason for the alert to switch to critical. The alerts page works, and when I click on the "Show alert in dashboard" button, the dashboard shows me the expected log entries for the defined lookback time.

Attached you can find a text file containing the curl output. During the past few days, two alerts triggered false alarms: SSG-STATISTICS-ApplicationAlive and SSG-ETDR-ApplicationAlive.

Br,
Philipp
mcapra

Re: Wrong Alerts from Alerting function in NLS 1.4.4

Post by mcapra »

The alerts and their queries look reasonable enough. Let's check the audit log. Can you share the output of the following commands, executed from the CLI of your Nagios Log Server machine:

Code: Select all

curl -XGET 'http://localhost:9200/nagioslogserver_log/ALERT/_search?size=100' -d '{"query":{"filtered":{"filter":{"range":{"created":{"from":1490717756000,"to":1490976686000}}},"query":{"query_string":{"query":"SSG-STATISTICS-ApplicationAlive"}}}}}'
curl -XGET 'http://localhost:9200/nagioslogserver_log/ALERT/_search?size=100' -d '{"query":{"filtered":{"filter":{"range":{"created":{"from":1490717756000,"to":1490976686000}}},"query":{"query_string":{"query":"SSG-ETDR-ApplicationAlive"}}}}}'
If I could also get the logs from the destination Nagios XI machine, that might also be helpful. They can typically be found here:

Code: Select all

/usr/local/nagios/var/archives/nagios-03-31-2017-00.log
/usr/local/nagios/var/archives/nagios-03-30-2017-00.log
/usr/local/nagios/var/archives/nagios-03-29-2017-00.log
/usr/local/nagios/var/archives/nagios-03-28-2017-00.log
comfone

Re: Wrong Alerts from Alerting function in NLS 1.4.4

Post by comfone »

Please find the files attached.
mcapra

Re: Wrong Alerts from Alerting function in NLS 1.4.4

Post by mcapra »

I see one CRITICAL from right around the time your indices would have been rotating. It may or may not be related, but it's definitely noteworthy:

Code: Select all

{
	"_index": "nagioslogserver_log",
	"_type": "ALERT",
	"_id": "AVshufVRl8iePEtZwRgh",
	"_score": 1.4803797,
	"_source": {
		"created": 1490919486800,
		"created_by": "System",
		"type": "ALERT",
		"message": "Alert Name SSG-ETDR-ApplicationAlive returned CRITICAL: 0 matching entries found |logs=0;1:;1:",
		"source": "Nagios Log Server"
	}
}
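For readers unfamiliar with the performance data at the end of that message: `logs=0;1:;1:` reads as value 0, warning range `1:`, critical range `1:`. In the Nagios plugin range notation, `1:` means "alert when the value is below 1", which is why zero matching entries is reported as CRITICAL. The following is a minimal sketch of that range logic, not the code Nagios Log Server actually runs; the function names `parse_range`, `is_alert` and `status` are hypothetical.

```python
import math

def parse_range(spec):
    """Parse a Nagios-style range such as '10', '1:', '~:10', '10:20' or '@10:20'."""
    inside = spec.startswith("@")      # '@' inverts the range: alert when INSIDE it
    if inside:
        spec = spec[1:]
    if ":" in spec:
        low_s, high_s = spec.split(":", 1)
        low = -math.inf if low_s == "~" else float(low_s) if low_s else 0.0
        high = float(high_s) if high_s else math.inf
    else:
        low, high = 0.0, float(spec)   # a bare number N means the range 0..N
    return low, high, inside

def is_alert(value, spec):
    """True when the value violates the range, i.e. should raise the alert."""
    low, high, inside = parse_range(spec)
    within = low <= value <= high
    return within if inside else not within

def status(value, warning, critical):
    """Critical takes precedence over warning, as in the usual plugin convention."""
    if is_alert(value, critical):
        return "CRITICAL"
    if is_alert(value, warning):
        return "WARNING"
    return "OK"

print(status(0, "1:", "1:"))  # CRITICAL (0 matching entries, as in the audit log)
print(status(1, "1:", "1:"))  # OK
```

With both thresholds set to `1:`, a single lookback window that returns no hits is enough to flip the alert, so any query that briefly returns zero results would look exactly like the audit log entry above.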
Can you share the output of the following command executed from the CLI of one of your Nagios Log Server instances:

Code: Select all

curl -XGET 'http://localhost:9200/logstash-*/_search?size=100' -d '{"query":{"filtered":{"query":{"bool":{"should":[{"query_string":{"query":"*"}}]}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1490918286000,"to":1490919486800}}},{"fquery":{"query":{"query_string":{"query":"type:(\"SSG - ETDRS\")"}},"_cache":true}},{"terms":{"logsource":["csvdb008"]}},{"terms":{"Data.raw":["OK"]}}]}}}}}'
Can you also share a screenshot of your "Administration -> Backup & Maintenance" page?
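As a side note, the `created` value and the `@timestamp` range in the query above are epoch milliseconds. Converting them makes the timing easier to judge. This is a small sketch using only the Python standard library; the helper name `ms_to_utc` is hypothetical, and the fixed UTC+2 offset for CEST is an assumption that holds for late March 2017.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
CEST = timezone(timedelta(hours=2), "CEST")  # assumption: fixed UTC+2 offset

def ms_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)

alert_created = ms_to_utc(1490919486800)   # "created" in the audit log entry
window_start = ms_to_utc(1490918286000)    # "from" in the @timestamp range

print(window_start.isoformat())                     # 2017-03-30T23:58:06+00:00
print(alert_created.isoformat())                    # 2017-03-31T00:18:06.800000+00:00
print(alert_created.astimezone(CEST).isoformat())   # 2017-03-31T02:18:06.800000+02:00
```

The queried window runs from shortly before to shortly after midnight UTC. Assuming the default daily `logstash-YYYY.MM.DD` index naming, that is the point at which a new daily index is created, which fits the observation that the CRITICAL appeared around index rotation.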
comfone

Re: Wrong Alerts from Alerting function in NLS 1.4.4

Post by comfone »

Hi,
please find the curl output attached.
Thanks a lot for your support!
Br,
Philipp
comfone

Re: Wrong Alerts from Alerting function in NLS 1.4.4

Post by comfone »

...and the screenshot.
mcapra

Re: Wrong Alerts from Alerting function in NLS 1.4.4

Post by mcapra »

This looks to be some sort of false positive caused by the indices rotating in/out or the indices being optimized around that time. I might be wrong, but that's my initial assessment.

If you want to open an email ticket with [email protected] and do a remote session, we can do that too. Reproducing it consistently and proving the above is going to be pretty difficult, though.

The other option is to adjust your max check attempts for the passive services in Nagios XI to help eliminate some of those false positives. This is likely an issue with Elasticsearch and not something that's going to be immediately solvable with simple modifications to the alerting logic.
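To illustrate why raising max check attempts helps, here is a rough simulation of the soft/hard state idea in Nagios: a non-OK result only becomes a HARD state (and so triggers notifications) once it has been seen on `max_check_attempts` consecutive results. This is a simplified sketch, not Nagios code; it assumes passive service results are counted the same way as active ones, and the function name `hard_problem_events` is hypothetical.

```python
def hard_problem_events(results, max_check_attempts):
    """Return the indices at which the service would enter a HARD non-OK state."""
    hard_events = []
    attempt = 0               # consecutive non-OK results seen so far
    in_hard_problem = False   # True once a HARD problem state has been reached
    for i, state in enumerate(results):
        if state == "OK":
            # Any OK result resets the counter and clears the problem.
            attempt = 0
            in_hard_problem = False
            continue
        if in_hard_problem:
            continue          # already in a HARD problem; no new transition
        attempt += 1
        if attempt >= max_check_attempts:
            in_hard_problem = True
            hard_events.append(i)
    return hard_events

# One isolated false CRITICAL (index 1) and one real outage (indices 4 and 5).
results = ["OK", "CRITICAL", "OK", "OK", "CRITICAL", "CRITICAL", "OK"]

print(hard_problem_events(results, 1))  # [1, 4]
print(hard_problem_events(results, 2))  # [5]
```

With a value of 2, the isolated CRITICAL never reaches a HARD state, while the sustained problem still does, one result later. The trade-off is that real problems are reported one alert interval later than before.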
comfone

Re: Wrong Alerts from Alerting function in NLS 1.4.4

Post by comfone »

Hi,
thanks a lot for your efforts!
I will consider adjusting the "check attempts" in Nagios XI.
Br,
Philipp
Locked