Re: CPU usage high and checks delayed

Posted: Tue Nov 07, 2017 1:51 pm
by kyang
Does the XI notification log show these notifications being sent around the time you receive them? Is that how you can tell it's catching up on old alerts, or is it something else?

Do the problems at the location that went down still exist?

Re: CPU usage high and checks delayed

Posted: Tue Nov 07, 2017 1:57 pm
by snapon_admin
The problems at that location no longer exist. When I say it's catching up, I mean that I'll look at a service check and it will say "Next check 12:40" when it's currently 12:55. The checks still seem to be running, but they're behind. And because they're trying to catch up, I keep getting high CPU usage and high load. Another way I can tell that things are behind is that the scheduled events over time graph will start off looking pretty normal after I restart Nagios but will eventually drop to almost nothing, and the Monitoring Engine check statistics will show 0s for 1-min, 5-min, and 15-min active checks.
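
The "Next check" symptom above can also be measured from the status file instead of checking services one by one. This is only a rough sketch: it assumes the usual `servicestatus { ... }` blocks with a `next_check=<epoch>` line, and the function name `count_overdue_checks` is made up for illustration.

```shell
# Count service checks whose next_check time is already in the past.
# Usage: count_overdue_checks STATUS_FILE [NOW_EPOCH]
count_overdue_checks() {
    status_file=$1
    now=${2:-$(date +%s)}
    awk -v now="$now" '
        $1 == "servicestatus" { in_service = 1; next }   # start of a service block
        $1 == "}"             { in_service = 0; next }   # end of any block
        in_service && index($1, "next_check=") == 1 {
            next_check = substr($1, 12) + 0               # text after "next_check="
            total++
            # next_check=0 is treated here as "nothing scheduled", not as overdue
            if (next_check > 0 && next_check < now + 0) overdue++
        }
        END { printf "%d of %d service checks overdue\n", overdue, total }
    ' "$status_file"
}
```

Something like `count_overdue_checks /usr/local/nagios/var/status.dat` would be the typical call, but that path is an assumption; the real location is whatever `status_file` points to in nagios.cfg. A few overdue checks right after a restart is probably normal, so the number is only meaningful when it keeps growing.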

Re: CPU usage high and checks delayed

Posted: Tue Nov 07, 2017 2:12 pm
by snapon_admin
Here's a look at our Active Service checks graph. The top one is the last 48 hours with the time of the outage I explained highlighted (approximately 2 PM CST) and the bottom graph is what it typically looks like (showing the last 7 days).

Re: CPU usage high and checks delayed

Posted: Tue Nov 07, 2017 2:16 pm
by snapon_admin
Additional info on check latency and scheduled events over time.

Re: CPU usage high and checks delayed

Posted: Tue Nov 07, 2017 3:06 pm
by snapon_admin
This happens pretty much any time we have a major outage anywhere. I really just need a way to tell Nagios to stop playing catch up and start running new checks. There's gotta be a queue somewhere I can clear or something right?

Re: CPU usage high and checks delayed

Posted: Tue Nov 07, 2017 4:42 pm
by dwasswa
Hi @snapon_admin,

Let's try clearing out the system and then restarting it, because it seems to be stuck.

Please run the commands below:

Code:

service nagios stop
service ndo2db stop
service crond stop
pkill -9 -u nagios

If your server is using the Postgres database, you would run the command below:

Code:

echo "truncate table xi_events; truncate table xi_meta; truncate table xi_eventqueue;" | psql nagiosxi nagiosxi

If you are using MySQL, you would run the command below:

Code:

echo "truncate table xi_events; truncate table xi_meta; truncate table xi_eventqueue;" | mysql -u root -pnagiosxi nagiosxi

Then:

Code:

service crond start
service ndo2db start
service nagios start
service npcd restart

Please follow the steps above and let me know if it solves your issue.

If it does, you can run the same commands the next time this happens.
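
Since the same sequence may need to be repeated after each outage, it could be wrapped in a small script. The sketch below only replays the commands from this post in the same order; the names `run`, `reset_xi_queues`, and `DRY_RUN` are hypothetical, and it defaults to printing the commands rather than executing them.

```shell
#!/bin/sh
# Sketch: replay the reset sequence from this post as one script.
# DRY_RUN=1 (the default here) only prints each command; DRY_RUN=0 runs them (as root).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        sh -c "$*"
    fi
}

reset_xi_queues() {
    sql='truncate table xi_events; truncate table xi_meta; truncate table xi_eventqueue;'

    # Pick the database client before touching any service.
    case "$1" in
        postgres) db_cmd="echo \"$sql\" | psql nagiosxi nagiosxi" ;;
        mysql)    db_cmd="echo \"$sql\" | mysql -u root -pnagiosxi nagiosxi" ;;
        *)        echo "usage: reset_xi_queues postgres|mysql" >&2; return 1 ;;
    esac

    run "service nagios stop"
    run "service ndo2db stop"
    run "service crond stop"
    run "pkill -9 -u nagios"
    run "$db_cmd"
    run "service crond start"
    run "service ndo2db start"
    run "service nagios start"
    run "service npcd restart"
}
```

Calling `reset_xi_queues mysql` (or `reset_xi_queues postgres`) with the default `DRY_RUN=1` just lists what would be executed, which makes it easy to review before running it for real with `DRY_RUN=0`.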

Re: CPU usage high and checks delayed

Posted: Tue Nov 07, 2017 4:47 pm
by tmcdonald
In addition to what @dwasswa posted, if you truly want to "reset" all the checks so they start running immediately, you could remove the status.dat and retention.dat files, which carry the state information, but this is a very heavy approach. It will remove comments, states (so everything goes back to pending), downtime, etc., so it's somewhat of a nuclear option. If that is what you want to do, this is the closest you can get to "I just added all these hosts and services from scratch then applied my configs", with the benefit of keeping your performance data.
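
For anyone trying the nuclear option, a slightly safer variant is to move the two files aside rather than delete them, so they can be put back if needed. This is a sketch of that swapped-in approach, not an official procedure: the function name `set_aside_state_files` is invented, and the default directory `/usr/local/nagios/var` is an assumption (the real paths are set by `status_file` and `state_retention_file` in nagios.cfg).

```shell
# Move status.dat and retention.dat aside instead of deleting them outright.
# Usage: set_aside_state_files [VAR_DIR]   (VAR_DIR default is an assumed path)
set_aside_state_files() {
    var_dir=${1:-/usr/local/nagios/var}
    stamp=$(date +%Y%m%d%H%M%S)
    for name in status.dat retention.dat; do
        if [ -f "$var_dir/$name" ]; then
            # Keep a timestamped copy so the old state can be restored by hand.
            mv "$var_dir/$name" "$var_dir/$name.bak.$stamp" &&
                echo "moved $var_dir/$name to $name.bak.$stamp"
        else
            echo "not found: $var_dir/$name"
        fi
    done
}
```

Nagios should be stopped before calling this (`service nagios stop`) and started again afterwards, since the daemon writes its retention data on shutdown and would otherwise recreate the file.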

Re: CPU usage high and checks delayed

Posted: Tue Nov 07, 2017 5:28 pm
by snapon_admin
I have a ticket open for this on the new ticketing system so for the sake of organization and rapid/fluid response we may want to lock this thread up and keep replies in one place. I have replied on the ticket (334811).