Scheduled events over time piling up on "NOW"

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
johndoe
Posts: 114
Joined: Fri Oct 28, 2011 10:14 am

Re: Scheduled events over time piling up on "NOW"

Post by johndoe »

Code: Select all

[root@XXX ~]# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg | head -2
Nagios Core 3.5.0
[root@XXX ~]# tail -50 /var/log/messages | grep ndo
[root@XXX ~]# service nagios status
nagios (pid 2584) is running...
[root@XXX ~]# service ndo2db status
ndo2db (pid 2618) is running...

File sent via PM to avoid disclosing any sensitive information. I checked and there seemed to be none, but you can never be too careful. Feel free to suggest any other improvements to the actual configs.
Nagios XI 2012R2.8c Running on Ubuntu 12.04 Using 99% passive checks for monitoring
Monitoring nearly 800 Passive services spread through roughly 40 machines
Running on an 8 core, KVM virtualized VM, with 15 GB of RAM and using RAMDisk
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Scheduled events over time piling up on "NOW"

Post by lmiltchev »

I didn't see anything weird in the nagios.cfg file, besides the fact that the sections were out of order. Anyway, go to:

Admin->System Profile->Download Profile

Save the "profile.zip" file and PM it to me.
Be sure to check out our Knowledgebase for helpful articles and solutions!
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Scheduled events over time piling up on "NOW"

Post by lmiltchev »

It seems like you have a crashed table in the nagiosql database.

140320 13:20:01 [ERROR] /usr/libexec/mysqld: Table './nagiosql/tbl_logbook' is marked as crashed and last (automatic?) repair failed

Run the following commands:

Code: Select all

cd /usr/local/nagiosxi/scripts
./repairmysql.sh nagios
./repairmysql.sh nagiosql
service nagios stop
killall nagios
service ndo2db stop
service ndo2db start
service nagios start
Check if nagios service is running:

Code: Select all

service nagios status
johndoe
Posts: 114
Joined: Fri Oct 28, 2011 10:14 am

Re: Scheduled events over time piling up on "NOW"

Post by johndoe »

Did that, same problem...
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Scheduled events over time piling up on "NOW"

Post by scottwilkerson »

johndoe,

I revisited the code that builds the "Monitoring Engine Event Queue" graph. What you are seeing is most likely a product of how the data is queried combined with your large number of passive checks: almost all of your hosts/services report in very frequently, and the "next_check" time is probably further in the future than the time a real check result will actually come in...

This is going to be ugly, but this is the SQL used to pull the data, which is then massaged into the XML that populates the graph:

Code: Select all

SELECT COUNT(*) AS total_events,next_check, NOW() as time_now,
	TIMESTAMPDIFF(SECOND,NOW(),next_check) AS seconds_from_now,
	(TIMESTAMPDIFF(SECOND,NOW(),next_check) DIV 10) AS bucket
	FROM nagios_hoststatus
	WHERE TRUE 
	AND (TIMESTAMPDIFF(SECOND,NOW(),next_check) < 300)
	AND instance_id = '1'
	AND UNIX_TIMESTAMP(next_check) != 0
	GROUP BY instance_id, bucket
	UNION
	SELECT COUNT(*) AS total_events,next_check, NOW() as time_now,
	TIMESTAMPDIFF(SECOND,NOW(),next_check) AS seconds_from_now,
	(TIMESTAMPDIFF(SECOND,NOW(),next_check) DIV 10) AS bucket
	FROM nagios_servicestatus
	WHERE TRUE 
	AND (TIMESTAMPDIFF(SECOND,NOW(),next_check) < 300)
	AND instance_id = '1'
	AND UNIX_TIMESTAMP(next_check) != 0
	GROUP BY instance_id, bucket	
	ORDER BY bucket ASC LIMIT 10000

Long story short, I think this is just a product of your environment. You might be able to massage the graph by reducing the check_interval on your checks (even though you don't do active checks).
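
To make the bucketing in that query easier to picture, here is a minimal Python sketch of what the `seconds_from_now DIV 10` grouping does to a handful of `next_check` offsets. It is only an illustration, not the Nagios XI code: the function name `bucket_counts` and the sample offsets are hypothetical, and it assumes MySQL's `DIV` truncates toward zero (which is why `int(s / width)` is used instead of Python's flooring `//`).

```python
from collections import Counter

def bucket_counts(seconds_from_now, horizon=300, width=10):
    """Count events per bucket, mimicking the query's WHERE and GROUP BY.

    seconds_from_now: offsets of next_check relative to NOW(), in seconds
                      (negative means next_check is already in the past).
    horizon:          mirrors the `< 300` filter in the query.
    width:            mirrors the `DIV 10` bucket width.
    """
    counts = Counter()
    for s in seconds_from_now:
        # The query has an upper bound only, so overdue (negative)
        # offsets are kept as well.
        if s < horizon:
            # Assumption: DIV truncates toward zero, like int() here.
            counts[int(s / width)] += 1
    return sorted(counts.items())

# Hypothetical sample: three overdue checks, a few due soon, two far out.
offsets = [-45, -8, -3, 0, 4, 9, 12, 250, 299, 300, 900]
print(bucket_counts(offsets))
```

Under that truncation assumption, every offset from -9 to 9 seconds lands in bucket 0, so the bucket drawn at "NOW" collects slightly overdue entries as well as imminent ones. Whether that accounts for the pile-up in a given environment would need to be confirmed against the actual `nagios_hoststatus` and `nagios_servicestatus` data.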