Nagios has become unstable and is exhibiting a number of issues:
1. Notifications are failing at random with an error:
[1347476154] Warning: Contact 'cgraham' service notification command '/usr/bin/printf "%b" "[email warning]" cgraham@address.com' timed out after 30 seconds
2. Getting fork errors:
Warning: The check of service '[service check]' could not be performed due to a fork() error: 'Resource temporarily unavailable'. The check will be rescheduled.
3. Getting pipe errors:
[1347475829] HOST ALERT: [host name];DOWN;SOFT;1;Could not open pipe: /bin/ping -n -U -w 30 -c 5 [ip address]
4. /tmp is full of check entries (over 365K)
5. Seeing a block of retries in the logs:
[1347476391] Warning: The check of host '[hostname]' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the host...
Any ideas where to start on this? My search of the various issues turned up nothing promising.
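As a starting point, the /tmp backlog and the fork() errors can both be put into numbers. The following is only a rough sketch: the helper names are made up for illustration, and the `check*` file pattern is an assumption about how the leftover entries in /tmp are named, so adjust it to whatever `ls /tmp` actually shows.

```shell
#!/bin/sh
# Count regular files directly under a directory whose names match a pattern.
# The "check*" pattern used below is an assumption, not taken from the logs.
count_files() {
    dir=$1
    pattern=$2
    find "$dir" -maxdepth 1 -type f -name "$pattern" 2>/dev/null | wc -l | tr -d ' '
}

# Count processes and threads owned by a user. The "max user processes"
# limit counts threads as well, so -L (one line per thread) is used.
count_tasks() {
    user=$1
    ps -L -u "$user" -o pid= 2>/dev/null | wc -l | tr -d ' '
}

echo "check files in /tmp: $(count_files /tmp 'check*')"
echo "tasks owned by nagios: $(count_tasks nagios)"
```

If the task count sits close to the "max user processes" value reported by `ulimit -a`, that would be consistent with the fork() errors.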
Nagios Not Functioning
Last edited by CGraham on Thu Sep 13, 2012 8:14 am, edited 2 times in total.
Re: Nagios Not Functioning
Installed Version: 2011R3.2
Nagios XI Installation Profile
System:
[hostname].AirWatch.Local 2.6.32-220.13.1.el6.i686 i686
CentOS release 6.2 (Final)
Gnome Installed
Apache Information
PHP Version: 5.3.3
Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1
Server Name: [hostname]
Server Address: 10.1.3.151
Server Port: 80
Date/Time
PHP Timezone: America/New_York
PHP Time: Wed, 12 Sep 2012 16:10:59 -0400
System Time: Wed, 12 Sep 2012 16:10:59 -0400
Nagios XI Data
nagios (pid 29397) is running...
NPCD running (pid 1973).
ndo2db (pid 2020) is running...
CPU Load 15: 1.42
Total Hosts: 616
Total Services: 6098
Function 'get_base_uri' returns: http://[hostname]/nagiosxi/
Function 'get_base_url' returns: http://[hostname]s/nagiosxi/
Function 'get_backend_url(internal_call=false)' returns: http://[hostname]/nagiosxi/includes/components/profile/profile.php
Function 'get_backend_url(internal_call=true)' returns: http://[hostname]/nagiosxi/backend/
Ping Test localhost
Running:
/bin/ping -c 3 localhost 2>&1
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.026 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.028 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.041 ms
--- localhost ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.026/0.031/0.041/0.009 ms
Test wget To locahost
WGET From URL: http://localhost/nagiosql/index.php
Running:
/usr/bin/wget http://localhost/nagiosql/index.php
--2012-09-12 16:11:01-- http://localhost/nagiosql/index.php
Resolving localhost... ::1, 127.0.0.1
Connecting to localhost|::1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5259 (5.1K) [text/html]
Saving to: `/tmp/nagiosql_index.tmp'
0K ..... 100% 431M=0s
2012-09-12 16:11:02 (431 MB/s) - `/tmp/nagiosql_index.tmp' saved [5259/5259]
Re: Nagios Not Functioning
I think you're hitting a system ulimit max somewhere. Take a look at this wiki and see if it applies to your issue.
http://support.nagios.com/wiki/index.ph ... g_Orphaned
Re: Nagios Not Functioning
I made the changes to /etc/security/limits.conf, then rebooted. Here's the result:
[root@ATL02PRDMGMT03 ~]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 31297
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
It doesn't look like the changes took effect.
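One thing worth keeping in mind: `ulimit -a` in a root shell reports the limits of that shell, while `@nagios` entries in limits.conf apply to members of the nagios group, so the output above may not reflect what the Nagios daemon itself runs with. A minimal sketch for reading the limits of a running process from /proc follows; the `show_limits` helper is a hypothetical name, and the process name `nagios` is taken from the profile output earlier in the thread.

```shell
#!/bin/sh
# Print the "open files" and "processes" rows of a limits table in the
# format used by /proc/<pid>/limits.
show_limits() {
    grep -E '^Max (open files|processes) ' "$1" 2>/dev/null \
        || echo "could not read $1"
}

# Limits of the current shell, for comparison with "ulimit -a".
show_limits "/proc/$$/limits"

# Limits of the oldest process named exactly "nagios", if one is running.
nagios_pid=$(pgrep -o -x nagios 2>/dev/null || true)
if [ -n "$nagios_pid" ]; then
    show_limits "/proc/$nagios_pid/limits"
fi
```

The first column of numbers is the soft limit and the second is the hard limit.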
Re: Nagios Not Functioning
Here is my /etc/security/limits.conf file:
[root@[hostname] ~]# cat /etc/security/limits.conf | grep -v "^#"
* hard nofile 200000
@nagios hard memlock 128 #locked memory
@nagios soft memlock 128
@nagios soft nofile 4096 #open files
@nagios hard nofile 4096
@nagios hard nproc 4096 #max user processes
@nagios soft nproc 4096
@nagios hard stack 20480 #stack size
@nagios soft stack 20480
Re: Nagios Not Functioning
Hmm, I hate to have you reboot one more time, but can you try updating the settings once more, only this time use the wildcard * instead of @nagios for the ulimit directives?
Also, check the /etc/security/limits.d/ directory and make sure there aren't files in there that are overriding the new settings.
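Files in /etc/security/limits.d/ are read after limits.conf, so a later entry for the same domain and item can override an earlier one. CentOS 6 installs commonly ship a 90-nproc.conf there that sets a soft nproc of 1024 for all users, which would match the "max user processes" value seen above. The sketch below lists every active entry for one item together with the file it came from; `active_limits` is a hypothetical helper, and its optional second argument exists only so it can be pointed at a different directory.

```shell
#!/bin/sh
# Print every active (non-comment, non-blank) entry for one limit item,
# prefixed with its file name: limits.conf first, then limits.d/*.conf
# in the shell's glob order.
active_limits() {
    item=$1
    base=${2:-/etc/security}
    for f in "$base/limits.conf" "$base"/limits.d/*.conf; do
        [ -f "$f" ] || continue
        awk -v item="$item" -v file="$f" '
            /^[ \t]*(#|$)/ { next }            # skip comments and blank lines
            $3 == item { print file ": " $0 }  # third field is the item name
        ' "$f"
    done
}

active_limits nproc
active_limits nofile
```

When the same domain and item appear more than once, the entry printed last is the one to look at first.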
Re: Nagios Not Functioning
The new limits.conf is below. I also commented out another limits file under limits.d.
[root@ATL02PRDMGMT03 ~]# cat /etc/security/limits.conf | grep -v "^#"
* hard memlock 128 #locked memory
* soft memlock 128
* soft nofile 4096 #open files
* hard nofile 4096
* hard nproc 4096 #max user processes
* soft nproc 4096
* hard stack 20480 #stack size
* soft stack 20480
Rebooted:
[root@ATL02PRDMGMT03 ~]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 31297
max locked memory (kbytes, -l) 128
max memory size (kbytes, -m) unlimited
open files (-n) 4096
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 20480
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
That looks better.
Re: Nagios Not Functioning
Good deal. Every time I've seen that issue, the ulimit increases fixed it, so I'm betting that will take care of it.
Re: Nagios Not Functioning
Now this error is back in the logs:
[1347486156] Warning: Contact '[contact]' service notification command '/usr/bin/printf "%b" "[service check]" | /bin/mail -s "[Subject]" cgraham@air-watch.com' timed out after 30 seconds
EDIT: I'm not seeing any of the other previous issues. Maybe this is a separate postfix issue. I attempted to make postfix changes and restart postfix, but when I ran "service postfix restart" it acted like postfix wasn't running. Am I confusing sendmail, /bin/mail and postfix?
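On the sendmail, /bin/mail and postfix question: /bin/mail does not deliver anything itself. It normally hands the message to /usr/sbin/sendmail, which on CentOS is usually a symlink managed by alternatives and can point at either Postfix or Sendmail. A rough sketch for seeing which one is in play and whether its daemon is running follows; both helper names are hypothetical.

```shell
#!/bin/sh
# Report where a sendmail-style path really points after following symlinks.
sendmail_target() {
    path=${1:-/usr/sbin/sendmail}
    if [ -e "$path" ]; then
        readlink -f "$path"
    else
        echo "missing"
    fi
}

# Report whether a process with the given exact name is running.
is_running() {
    if pgrep -x "$1" >/dev/null 2>&1; then
        echo "$1: running"
    else
        echo "$1: not running"
    fi
}

echo "sendmail resolves to: $(sendmail_target)"
is_running master      # Postfix's main daemon process is named "master"
is_running sendmail
```

If the symlink resolves to a Sendmail binary while the changes were made to Postfix (or the other way around), that could explain why "service postfix restart" behaved as if nothing was running.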
Re: Nagios Not Functioning
I'm going to move this to another thread since it's another issue.