CPU usage high and checks delayed

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
snapon_admin
Posts: 952
Joined: Mon Jun 10, 2013 10:39 am
Location: Kenosha, WI

Post by snapon_admin »

A couple of days ago our CPU usage shot up (hovering around 80%) and load increased significantly as well. I'm not sure what caused it, but checks on the server are lagging behind and can't seem to catch up (for example, one check shows a next check time of 9:52 and it's currently 9:59), which keeps the load from stabilizing. Any thoughts on what I can do to get my checks to catch up and alleviate this issue? I feel like I've had this issue before, but I can't recall exactly what fixed it.
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: CPU usage high and checks delayed

Post by dwhitfield »

Can you post the output of ps -eo pcpu,args --sort=-%cpu|head?

Also, to clarify, the CPU usage is on the Nagios server, not on servers you are monitoring, correct?
snapon_admin
Posts: 952
Joined: Mon Jun 10, 2013 10:39 am
Location: Kenosha, WI

Re: CPU usage high and checks delayed

Post by snapon_admin »

[root@lisl-ngos-01-pv ~]# ps -eo pcpu,args --sort=-%cpu|head
%CPU COMMAND
78.1 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
31.0 /usr/bin/perl -w /usr/local/nagios/libexec/check_nwc_health --hostname 10.94.19.2 --community Sn4p0nC0r3s --mode cpu-load --units % --warning 80 --critical 90
27.0 /usr/bin/perl -w /usr/local/nagios/libexec/check_nwc_health --hostname 10.160.19.2 --community Sn4p0nC0r3s --mode memory-usage --units % --warning 80 --critical 90
17.6 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
16.8 /usr/sbin/httpd
13.6 /usr/sbin/httpd
13.6 /usr/sbin/httpd
12.8 /usr/sbin/httpd
12.7 /usr/sbin/httpd

And correct, usage is high on the XI server.
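
With several httpd workers and plugin runs in the list, it can help to sum the %CPU column per executable before deciding what to chase first. The following is only a rough sketch: the function name `cpu_by_executable` is made up for illustration, and the sample is the output above with the arguments trimmed. Keep in mind that the `pcpu` value from ps is CPU time divided by how long the process has been running, so short-lived check plugins can look hotter than they are.

```python
from collections import defaultdict

# Sample in the shape produced by: ps -eo pcpu,args --sort=-%cpu | head
# (values taken from the output posted above, arguments trimmed)
SAMPLE = """\
%CPU COMMAND
78.1 /usr/libexec/mysqld --basedir=/usr
31.0 /usr/bin/perl -w /usr/local/nagios/libexec/check_nwc_health --mode cpu-load
27.0 /usr/bin/perl -w /usr/local/nagios/libexec/check_nwc_health --mode memory-usage
17.6 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
16.8 /usr/sbin/httpd
13.6 /usr/sbin/httpd
13.6 /usr/sbin/httpd
12.8 /usr/sbin/httpd
12.7 /usr/sbin/httpd
"""


def cpu_by_executable(ps_output):
    """Sum the %CPU column per executable (first word of the args column)."""
    totals = defaultdict(float)
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue  # blank or malformed line
        pcpu, exe = parts[0], parts[1]
        totals[exe] += float(pcpu)
    # Highest consumer first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for exe, total in cpu_by_executable(SAMPLE):
        print(f"{total:6.1f}  {exe}")
```

On this sample mysqld still leads, but the five httpd workers together come in second, ahead of the two check_nwc_health runs (both grouped under /usr/bin/perl).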
bwallace
Posts: 1145
Joined: Tue Nov 17, 2015 1:57 pm

Re: CPU usage high and checks delayed

Post by bwallace »

To put things in perspective, what are the system specs (number of CPUs, drive space, etc.)?
Also, please post screenshots of 'System Status' and 'Monitoring Engine Status' under Admin > System Information.

- Around the time of the spike, what was recorded in the event log?
Home > Monitoring Process > Event Log

- nagios.log files from when spikes occur could be helpful as well.

* Don't forget to scrub any sensitive data prior to posting, thanks!
Be sure to check out the Knowledgebase for helpful articles and solutions!
snapon_admin
Posts: 952
Joined: Mon Jun 10, 2013 10:39 am
Location: Kenosha, WI

Re: CPU usage high and checks delayed

Post by snapon_admin »

8 CPUs, 16GB RAM, 200GB disk space, 52G available.

Info from my home dashboard with sensitive data removed:
[attachment: dashboard stats.png]
Event log. You can see the load spiked up to 50 there; usually it hovers around 5-10:
[attachment: event log.png]
WillemDH
Posts: 2320
Joined: Wed Mar 20, 2013 5:49 am
Location: Ghent

Re: CPU usage high and checks delayed

Post by WillemDH »

Hey Snapon,

I had similar CPU spikes in the past. Both times it turned out to be storage related. One time, our SAN AST (automatic storage tiering) seemed to have stopped working, leaving the datastore where the Nagios server was running on slower-than-usual storage.
The other time, something was going wrong with inter-datacenter replication. The replication issue caused performance issues on the datastore where the XI server was running.

The issues you are describing could be caused by many things though. One thing I would also recommend checking is the load on servers running on the same ESXi host and/or the same VMware datastore.

(All of the above only applies to a virtual XI server, of course. ;) )

Good luck in hunting the issue.

Willem
Nagios XI 5.8.1
https://outsideit.net
bwallace
Posts: 1145
Joined: Tue Nov 17, 2015 1:57 pm

Re: CPU usage high and checks delayed

Post by bwallace »

Appreciate the tips, WillemDH - thanks!
Snapon - I don't have the answer right now but I think we may be closer.

mysql is running hot (from top)
5574 mysql 20 0 4269m 132m 4732 S 84.6 0.8 41:37.83 mysqld

db maint is supposed to run every hour on XI, but the profile does not collect the dbmaint.log, found here:

/usr/local/nagiosxi/var

Could you post this?

I'd like to review dbmaint.log to see what kind of job dbmaint has been doing; perhaps it is not pruning tables as it's supposed to?

Note that the npcd.log has a lot of these errors:
NPCD: WARN: MAX load reached: load 61.120000/60.000000 at i=1

...but this is an outcome of the high load, so we should try to figure that out before increasing the npcd max load threshold.
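
To see how far past the limit the load actually went, the warnings can be pulled out of npcd.log with a small script. This is a sketch only: the regular expression is written against the single line quoted above, the helper name `load_overshoots` is hypothetical, and real log lines may carry a timestamp prefix (which `search` tolerates).

```python
import re

# Pattern for the warning format quoted above, e.g.
# "NPCD: WARN: MAX load reached: load 61.120000/60.000000 at i=1"
NPCD_WARN = re.compile(
    r"MAX load reached: load (?P<load>[\d.]+)/(?P<limit>[\d.]+)"
)


def load_overshoots(lines):
    """Return (load, limit) pairs for each 'MAX load reached' line."""
    pairs = []
    for line in lines:
        match = NPCD_WARN.search(line)
        if match:
            pairs.append((float(match.group("load")), float(match.group("limit"))))
    return pairs


if __name__ == "__main__":
    sample = ["NPCD: WARN: MAX load reached: load 61.120000/60.000000 at i=1"]
    for load, limit in load_overshoots(sample):
        print(f"load {load:.2f} exceeded limit {limit:.2f} by {load - limit:.2f}")
```

To run it on a real file, pass an open file object instead of the sample list, since iterating a file yields its lines.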
snapon_admin
Posts: 952
Joined: Mon Jun 10, 2013 10:39 am
Location: Kenosha, WI

Re: CPU usage high and checks delayed

Post by snapon_admin »

Yeah I saw that mysqld was pegging it out, just not sure how to fix it. I did look into the SAN possibilities that Willem mentioned and the team that manages that isn't seeing any issues.

dbmaint.log:

[root@lisl-ngos-01-pv var]# cat dbmaint.log
CREATING: /usr/local/nagiosxi/var/dbmaint.lock
CLEANING ndoutils TABLE 'commenthistory'...
SQL: DELETE FROM nagios_commenthistory WHERE entry_time < FROM_UNIXTIME(1444161902)
CLEANING ndoutils TABLE 'processevents'...
SQL: DELETE FROM nagios_processevents WHERE event_time < FROM_UNIXTIME(1444161902)
CLEANING ndoutils TABLE 'externalcommands'...
SQL: DELETE FROM nagios_externalcommands WHERE entry_time < FROM_UNIXTIME(1475093102)
CLEANING ndoutils TABLE 'logentries'...
SQL: DELETE FROM nagios_logentries WHERE logentry_time < FROM_UNIXTIME(1467921902)
CLEANING ndoutils TABLE 'notifications'...
SQL: DELETE FROM nagios_notifications WHERE start_time < FROM_UNIXTIME(1467921902)
CLEANING ndoutils TABLE 'contactnotifications'...
SQL: DELETE FROM nagios_contactnotifications WHERE start_time < FROM_UNIXTIME(1467921902)
CLEANING ndoutils TABLE 'contactnotificationmethods'...
SQL: DELETE FROM nagios_contactnotificationmethods WHERE start_time < FROM_UNIXTIME(1467921902)
CLEANING ndoutils TABLE 'statehistory'...
SQL: DELETE FROM nagios_statehistory WHERE state_time < FROM_UNIXTIME(1412625902)
CLEANING ndoutils TABLE 'timedevents'...
SQL: DELETE FROM nagios_timedevents WHERE event_time < FROM_UNIXTIME(1475697602)
CLEANING ndoutils TABLE 'systemcommands'...
SQL: DELETE FROM nagios_systemcommands WHERE start_time < FROM_UNIXTIME(1475697602)
CLEANING ndoutils TABLE 'servicechecks'...
SQL: DELETE FROM nagios_servicechecks WHERE start_time < FROM_UNIXTIME(1475697602)
CLEANING ndoutils TABLE 'hostchecks'...
SQL: DELETE FROM nagios_hostchecks WHERE start_time < FROM_UNIXTIME(1475697602)
CLEANING ndoutils TABLE 'eventhandlers'...
SQL: DELETE FROM nagios_eventhandlers WHERE start_time < FROM_UNIXTIME(1475697602)
LASTOPT:  1475694902
INTERVAL: 60
NOW:      1475697902
OPTTIME:  1475698502
CLEANING nagiosxi TABLE 'commands'...
SQL: DELETE FROM xi_commands WHERE processing_time < 1475669102::abstime::timestamp without time zone
CLEANING nagiosxi TABLE 'events'...
SQL: DELETE FROM xi_events WHERE processing_time < 1475669102::abstime::timestamp without time zone
SQL1: SELECT xi_meta.meta_id FROM xi_meta LEFT JOIN xi_events ON xi_meta.metaobj_id=xi_events.event_id WHERE metatype_id='1' AND event_id IS NULL
SQL2: DELETE FROM xi_meta WHERE meta_id IN (SELECT xi_meta.meta_id FROM xi_meta LEFT JOIN xi_events ON xi_meta.metaobj_id=xi_events.event_id WHERE metatype_id='1' AND event_id IS NULL)
CLEANING nagiosxi TABLE 'auditlog'...
SQL: DELETE FROM xi_auditlog WHERE log_time < 1473105902::abstime::timestamp without time zone
CLEANING nagiosql TABLE 'logbook'...
SQL: DELETE FROM tbl_logbook WHERE time < FROM_UNIXTIME(1475669102)




Repair Complete: Removing Lock File
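
One way to sanity-check a dbmaint.log like this is to turn each FROM_UNIXTIME cutoff into a retention window relative to the run time. The sketch below assumes the `NOW:` line is the reference time the cutoffs were computed from (the log does not say so explicitly, though the differences come out as round numbers). The function name `retention_windows` is made up for illustration, and the sample is a subset of the lines above. The PostgreSQL-style `::abstime` statements for the xi_ tables are not matched by this pattern.

```python
import re
from datetime import timedelta

# A few lines copied from the dbmaint.log above.
SAMPLE = """\
SQL: DELETE FROM nagios_commenthistory WHERE entry_time < FROM_UNIXTIME(1444161902)
SQL: DELETE FROM nagios_externalcommands WHERE entry_time < FROM_UNIXTIME(1475093102)
SQL: DELETE FROM nagios_logentries WHERE logentry_time < FROM_UNIXTIME(1467921902)
SQL: DELETE FROM nagios_statehistory WHERE state_time < FROM_UNIXTIME(1412625902)
SQL: DELETE FROM nagios_servicechecks WHERE start_time < FROM_UNIXTIME(1475697602)
NOW:      1475697902
"""

DELETE_RE = re.compile(r"DELETE FROM (\w+) WHERE \w+ < FROM_UNIXTIME\((\d+)\)")
NOW_RE = re.compile(r"^NOW:\s+(\d+)", re.MULTILINE)


def retention_windows(log_text):
    """Map each pruned table to (NOW - cutoff) as a timedelta."""
    now_match = NOW_RE.search(log_text)
    if now_match is None:
        raise ValueError("no NOW: line found in log text")
    now = int(now_match.group(1))
    return {
        table: timedelta(seconds=now - int(cutoff))
        for table, cutoff in DELETE_RE.findall(log_text)
    }


if __name__ == "__main__":
    for table, window in retention_windows(SAMPLE).items():
        print(f"{table:28s} keeps {window}")
```

On the sample lines this works out to 365 days for commenthistory, 7 days for externalcommands, 90 days for logentries, 730 days for statehistory, and 5 minutes for servicechecks. Those look like deliberate retention settings rather than a stalled job, which fits the read that the pruning statements themselves are unremarkable.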
snapon_admin
Posts: 952
Joined: Mon Jun 10, 2013 10:39 am
Location: Kenosha, WI

Re: CPU usage high and checks delayed

Post by snapon_admin »

Just as an FYI as well, it was getting worse so I just rebooted the server. Didn't seem to help at all.
bwallace
Posts: 1145
Joined: Tue Nov 17, 2015 1:57 pm

Re: CPU usage high and checks delayed

Post by bwallace »

So the dbmaint.log looks fine, IMO. But no change after a reboot?
A post-reboot profile would be quite helpful actually, could you provide a new profile? PM it to me, thanks.