Hello All,
I recently built my Nagios Core system and was about to go live, but it is behaving strangely. Nagios stops updating its checks: the CGI (the Nagios front-end panel) shows that the checks were last run 24 hours ago, and this persists until I restart the Nagios service. This keeps happening.
I have installed Nagios Core version 3.4.1 and integrated PNP4Nagios for graphing.
Below are the statistics of my current setup:
-------------------------------------------------------------------
/usr/local/nagios/bin/nagiostats -c /usr/local/nagios/etc/nagios.cfg
Nagios Stats 3.4.1
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 05-11-2012
License: GPL
CURRENT STATUS DATA
------------------------------------------------------
Status File: /usr/local/nagios/var/status.dat
Status File Age: 0d 0h 0m 13s
Status File Version: 3.4.1
Program Running Time: 0d 0h 27m 37s
Nagios PID: 17395
Used/High/Total Command Buffers: 0 / 166 / 4096
Total Services: 9364
Services Checked: 9364
Services Scheduled: 9364
Services Actively Checked: 9364
Services Passively Checked: 0
Total Service State Change: 0.000 / 29.930 / 0.041 %
Active Service Latency: 0.000 / 14.730 / 3.523 sec
Active Service Execution Time: 0.007 / 50.013 / 0.422 sec
Active Service State Change: 0.000 / 29.930 / 0.041 %
Active Services Last 1/5/15/60 min: 1600 / 8795 / 9364 / 9364
Passive Service Latency: 0.000 / 0.000 / 0.000 sec
Passive Service State Change: 0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit: 9071 / 116 / 7 / 170
Services Flapping: 2
Services In Downtime: 0
Total Hosts: 845
Hosts Checked: 845
Hosts Scheduled: 845
Hosts Actively Checked: 845
Host Passively Checked: 0
Total Host State Change: 0.000 / 6.250 / 0.014 %
Active Host Latency: 0.000 / 11.754 / 2.101 sec
Active Host Execution Time: 0.008 / 10.012 / 0.124 sec
Active Host State Change: 0.000 / 6.250 / 0.014 %
Active Hosts Last 1/5/15/60 min: 164 / 839 / 845 / 845
Passive Host Latency: 0.000 / 0.000 / 0.000 sec
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 842 / 3 / 0
Hosts Flapping: 0
Hosts In Downtime: 0
Active Host Checks Last 1/5/15 min: 367 / 1325 / 3900
Scheduled: 206 / 864 / 2625
On-demand: 161 / 461 / 1275
Parallel: 208 / 869 / 2636
Serial: 0 / 0 / 0
Cached: 159 / 456 / 1264
Passive Host Checks Last 1/5/15 min: 0 / 0 / 0
Active Service Checks Last 1/5/15 min: 2440 / 9693 / 28588
Scheduled: 2440 / 9693 / 28588
On-demand: 0 / 0 / 0
Cached: 0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0
External Commands Last 1/5/15 min: 0 / 1 / 2
-------------------------------------------------------------------
OS:
Linux 2.6.18-274.el5 #1 SMP Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
It would be a great help if you could assist me in solving this problem.
Workaround: I have written a script that checks the freshness of the status.dat file and restarts Nagios if the file is not being updated. This way I make sure that Nagios keeps running.
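A freshness check of this kind could be sketched roughly as follows. This is only an illustration, not the actual script: the 300-second threshold and the restart command are assumptions, and only the status.dat path is taken from the nagiostats output above.

```python
import os
import subprocess
import time

STATUS_FILE = "/usr/local/nagios/var/status.dat"  # path from the nagiostats output
MAX_AGE_SECONDS = 300                              # assumed threshold, tune as needed
RESTART_CMD = ["/etc/init.d/nagios", "restart"]   # assumed init script location


def status_age(path, now=None):
    """Return the number of seconds since the file was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stale(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the file is missing or older than max_age seconds."""
    try:
        return status_age(path, now) > max_age
    except OSError:
        # A missing or unreadable status file is treated as stale.
        return True


def main():
    if is_stale(STATUS_FILE):
        # Restart Nagios; this hides the symptom but does not fix the cause.
        subprocess.call(RESTART_CMD)


if __name__ == "__main__":
    main()
```

Something like this would typically be run from cron every few minutes. It only keeps Nagios running; it does not explain why the status file stops updating.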
Regards,
Babu Dhinakaran S
Nagios is forzen. status.dat file is in stale
-
babudhinakaran
- Posts: 7
- Joined: Tue Sep 04, 2012 9:18 am
Re: Nagios is forzen. status.dat file is in stale
Do you get any useful log output from running:
tail -f <pathto>/nagios.log
-
babudhinakaran
- Posts: 7
- Joined: Tue Sep 04, 2012 9:18 am
Re: Nagios is forzen. status.dat file is in stale
No, the log file stops progressing once Nagios is frozen.
I see one error, "[1346826710] Error: Unable to rename file '/usr/local/nagios/var/nagios.debug' to '/usr/local/nagios/var/nagios.debug.old': No such file or directory", while Nagios is running. This is because I enabled the debug feature in nagios.cfg.
Please let me know if you need any additional details. Earlier, I used Gearman and NagVis as event broker modules. The reason for going to Gearman is to support the size of our infrastructure: we have around 900+ servers and 9500+ services. As of now I have unloaded both broker modules to check the stability.
Regards,
Babu Dhinakaran S
Re: Nagios is forzen. status.dat file is in stale
Can you show an
ls -l
on the nagios/var directory that has the status and log file? I'm wondering if there's some sort of permissions issue where the Nagios process isn't able to write to the status file.
-
babudhinakaran
- Posts: 7
- Joined: Tue Sep 04, 2012 9:18 am
Re: Nagios is forzen. status.dat file is in stale
Thanks for the reply.
I don't think we have a permission issue, since it works for 30 minutes or so and then freezes.
Here is the ls -l output of the var folder:
----------------------------------------------------------
total 2415592
drwxrwxr-x 2 nagios nagios 4096 Sep 6 00:00 archives
-rw-r--r-- 1 root root 156639 Aug 1 11:07 Drive_Test.csv
-rw-rw-r-- 1 nagios nagios 258310159 Sep 6 06:51 host-perfdata
-rw-r--r-- 1 nagios nagios 63041555 Sep 4 15:01 livestatus.log
-rw-r--r-- 1 nagios nagios 332831 Sep 5 06:34 nagios.debug
-rw-r--r-- 1 nagios nagios 1000314 Sep 5 06:34 nagios.debug.old
-rw-r--r-- 1 nagios nagios 6 Sep 4 15:02 nagios.lock
-rw-rw-r-- 1 nagios nagios 2386192 Sep 6 06:51 nagios.log
-rw-r--r-- 1 nagios nagios 2163 Sep 4 15:00 nag_pid.txt
-rw-r--r-- 1 nagios nagios 9969400 Sep 5 15:28 objects.cache
-rw------- 1 nagios nagios 14453419 Sep 6 06:28 retention.dat
drwxrwsr-x 2 nagios nagcmd 4096 Sep 4 15:02 rw
-rw-rw-r-- 1 nagios nagios 2107031561 Sep 6 06:51 service-perfdata
drwxrwxr-x 3 nagios nagios 4096 Jun 20 14:34 spool
-rw-rw-r-- 1 nagios nagios 14371171 Sep 6 06:51 status.dat
------------------------------------------------------------------
I have unloaded the event brokers (NagVis and Gearman), and Nagios has now been working fine for 26 hours. If I load the event brokers, it hangs.
Please let me know if you need any additional details.
Regards,
Babu Dhinakaran S
- inventsekar
- Posts: 37
- Joined: Fri Jul 20, 2012 11:29 am
Re: Nagios is forzen. status.dat file is in stale
Well, I am not an expert in Nagios, but I have good knowledge of HPOV, and I can give you some suggestions...
How long have you been facing this issue? I mean, monitoring around ten thousand services is a huge task. Did you recently start monitoring a lot of services, or did the number grow gradually?
What are the CPU details of the Nagios server?
-
babudhinakaran
- Posts: 7
- Joined: Tue Sep 04, 2012 9:18 am
Re: Nagios is forzen. status.dat file is in stale
Hello,
Earlier we were monitoring with Nagios XI (the paid version of Nagios Core), and we started migrating to Nagios Core due to high load on the Nagios XI server. The migration took almost a month to complete, and Nagios Core worked fine for 20+ days after the migration was completed (we added and deleted monitors during the migration, but made no additions or deletions during those 20 days).
CPU info: it is a 16-core Xeon CPU and the load is: load average: 2.86, 5.85, 8.11
------------------------------------------------------------------------------------------------------------
processor : 15
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
stepping : 5
cpu MHz : 2800.189
cache size : 8192 KB
physical id : 1
siblings : 8
core id : 3
cpu cores : 4
apicid : 23
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5600.15
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
-------------------------------------------------------------------
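For reading the cpuinfo stanza above against the load average, here is a small worked sketch. It assumes the stanza shown is the last one in /proc/cpuinfo and that processor and physical ids are numbered from 0 without gaps; the helper names are made up for illustration.

```python
def parse_cpuinfo_stanza(text):
    """Turn 'key : value' lines from one /proc/cpuinfo stanza into a dict."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields


def topology(fields):
    """Return (logical CPUs, sockets, physical cores, threads per core)."""
    logical = int(fields["processor"]) + 1    # assumes this is the last stanza
    sockets = int(fields["physical id"]) + 1  # assumes ids run 0..N-1
    cores = sockets * int(fields["cpu cores"])
    threads_per_core = int(fields["siblings"]) // int(fields["cpu cores"])
    return logical, sockets, cores, threads_per_core


# Only the fields needed here, copied from the stanza above.
STANZA = """processor : 15
physical id : 1
siblings : 8
cpu cores : 4"""

logical, sockets, cores, threads = topology(parse_cpuinfo_stanza(STANZA))
print(logical, sockets, cores, threads)  # 16 2 8 2

load_1min = 2.86  # 1-minute load average quoted above
print(round(load_1min / logical, 3))  # load per logical CPU
```

Under those assumptions the box has 2 sockets with 4 cores each and hyper-threading, so 16 logical CPUs, and the 1-minute load per logical CPU is well below 1.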
Regards,
Babu Dhinakaran S