I have a very strange problem on my Nagios server.
I run Nagios 3.2.0, compiled from source, on a plain RHEL 5.5 server.
It had been working fine since it was first installed about a year ago.
However, about 12 hours ago it stopped working. The admin interface loads fine and I can browse and use every option in the menu, but the web interface is not updating the status of any of the service checks on my 60 servers.
In fact, the checks are not being executed at all: I see no Nagios activity in the logs, and it does not check any local or remote service. These are simple checks, such as a TCP response on port 80 or a mail check on port 25, and NRPE shows no activity for the internal load average or disk space checks either.
When I run the NRPE or TCP checks by hand from the shell, they work fine and report good results:
[root@server.myserver.com:~]/home/nagios/libexec/check_tcp -H REMOTE.SRV.IP -p 80
TCP OK - 0.001 second response time on port 80|time=0.000535s;;;0.000000;10.000000
[root@server.myserver.com:~]
[root@server.myserver.com:~]/home/nagios/libexec/check_nrpe -H REMOTE.SRV.IP -c check_load
OK - load average: 2.35, 6.33, 4.46|load1=2.350;15.000;30.000;0; load5=6.330;10.000;25.000;0; load15=4.460;5.000;20.000;0;
[root@server.myserver.com:~]
This is the last thing Nagios wrote to the log before it froze:
[1289196000] CURRENT SERVICE STATE: server223_01;Particion /mnt/disk2;OK;HARD;1;DISK OK - free space: /mnt/disk2 193225 MB (88% inode=99%):
[1289196279] Auto-save of retention data completed successfully.
[1289199879] Auto-save of retention data completed successfully.
[1289203479] Auto-save of retention data completed successfully.
[1289207079] Auto-save of retention data completed successfully.
[1289210679] Auto-save of retention data completed successfully.
[1289214279] Auto-save of retention data completed successfully.
[1289217879] Auto-save of retention data completed successfully.
[1289221479] Auto-save of retention data completed successfully.
[1289221859] Caught SIGTERM, shutting down...
[1289221859] Successfully shutdown... (PID=18009)
[1289221860] Nagios 3.2.0 starting... (PID=29329)
[1289221860] Local time is Mon Nov 08 07:11:00 CST 2010
[1289221860] LOG VERSION: 2.0
[1289221860] Finished daemonizing... (New PID=29330)
And the same sequence again after a later restart:
[1289238578] Caught SIGTERM, shutting down...
[1289238578] Successfully shutdown... (PID=13234)
[1289238579] Nagios 3.2.0 starting... (PID=13293)
[1289238579] Local time is Mon Nov 08 11:49:39 CST 2010
[1289238579] LOG VERSION: 2.0
[1289238579] Finished daemonizing... (New PID=13294)
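The bracketed numbers in these log lines are Unix epoch seconds, which makes the gaps between entries hard to read at a glance. The following is a minimal sketch for converting them, assuming GNU date (the `-d @SECONDS` form); the function name `nagios_log_time` is made up for illustration and is not part of Nagios. The "Local time is ..." line that Nagios logs at startup gives a cross-check for the conversion.

```shell
# Hypothetical helper: rewrite the leading [epoch] stamp of each Nagios log
# line read from stdin as a readable local time. Assumes GNU date.
nagios_log_time() {
    while IFS= read -r line; do
        case "$line" in
            \[[0-9]*\]*)
                epoch=${line#\[}        # drop the leading "["
                epoch=${epoch%%\]*}     # keep only what precedes the first "]"
                rest=${line#*\] }       # message text after "] "
                printf '[%s] %s\n' "$(date -d "@$epoch" '+%Y-%m-%d %H:%M:%S %Z')" "$rest"
                ;;
            *)
                # Lines without a leading stamp are passed through unchanged.
                printf '%s\n' "$line"
                ;;
        esac
    done
}
```

It can be fed the log on stdin, for example `nagios_log_time < nagios.log`, where the log path depends on the installation.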
The nagios process is running:
[root@server.myserver.com:~]pidof nagios
13294
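The PID reported by pidof can be compared against the last "Finished daemonizing" line in the log to confirm that the running process is the one from the most recent start. This is only a sketch: the function name `last_daemon_pid` is hypothetical, and it reads the log from stdin because the log path is not given here.

```shell
# Hypothetical helper: print the PID from the last
# "Finished daemonizing... (New PID=N)" line of a Nagios log read from stdin.
last_daemon_pid() {
    sed -n 's/.*Finished daemonizing\.\.\. (New PID=\([0-9][0-9]*\)).*/\1/p' | tail -n 1
}
```

The result can then be compared with the output of `pidof nagios`. Note that pidof may print more than one PID while the daemon has forked children to run checks.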
Thanks!