Nagios Support Forum

Posted: **Thu Jun 05, 2014 5:00 pm**

Swap details

[root@qualngs ~]# swapon -s
Filename				Type		Size	Used	Priority
/dev/dm-0                               partition	262136	0	-1
/swapfile1                              file		1048568	0	-2
[root@qualngs ~]# free
             total       used       free     shared    buffers     cached
Mem:      16326396    2363528   13962868          0      90764     316828
-/+ buffers/cache:    1955936   14370460
Swap:      1310704          0    1310704
[root@qualngs ~]#

Posted: **Thu Jun 05, 2014 5:13 pm**

Adding more memory has had a very adverse effect. I am now getting the following:

nagios.log

[1402006143] wproc: Core Worker 21390: job 54 (pid=28236): Dormant child reaped
[1402006143] wproc: Core Worker 21393: job 54 (pid=28249) timed out. Killing it
[1402006143] wproc: GLOBAL SERVICE EVENTHANDLER job 54 from worker Core Worker 21393 timed out after 31.01s

error_log

[Thu Jun 05 22:55:53 2014] [error] [client 172.20.10.254] Script timed out before returning headers: extinfo.cgi, referer: https://rmon.xxxxx.co.uk/nagios/side.php
[Thu Jun 05 22:56:53 2014] [warn] [client 172.20.10.254] Timeout waiting for output from CGI script /usr/local/nagios/sbin/extinfo.cgi, referer: https://rmon.xxxxx.co.uk/nagios/side.php
[Thu Jun 05 23:02:14 2014] [warn] [client 172.20.10.254] Timeout waiting for output from CGI script /usr/local/nagios/sbin/statusjson.cgi, referer: https://rmon.xxxxx.co.uk/nagios/main.php
[Thu Jun 05 23:02:14 2014] [error] [client 172.20.10.254] Script timed out before returning headers: statusjson.cgi, referer: https://rmon.xxxxx.co.uk/nagios/main.php
[Thu Jun 05 23:02:24 2014] [warn] [client 172.20.10.254] Timeout waiting for output from CGI script /usr/local/nagios/sbin/tac.cgi, referer: https://rmon.xxxxx.co.uk/nagios/side.php
[Thu Jun 05 23:02:24 2014] [error] [client 172.20.10.254] Script timed out before returning headers: tac.cgi, referer: https://rmon.xxxxx.co.uk/nagios/side.php
[Thu Jun 05 23:03:14 2014] [warn] [client 172.20.10.254] Timeout waiting for output from CGI script /usr/local/nagios/sbin/statusjson.cgi, referer: https://rmon.xxxxx.co.uk/nagios/main.php
[Thu Jun 05 23:03:25 2014] [warn] [client 172.20.10.254] Timeout waiting for output from CGI script /usr/local/nagios/sbin/tac.cgi, referer: https://rmon.xxxxx.co.uk/nagios/side.php

I cannot even log in to http://localhost/nagios or http://localhost/nagiosxi at all.

I am going to put the memory back.

Thanks

Chris

Posted: **Thu Jun 05, 2014 6:01 pm**

Latest from /var/log/messages:

Jun  5 23:46:21 qualngs xinetd[1817]: EXIT: nrpe status=0 pid=56716 duration=1(sec)
Jun  5 23:51:20 qualngs xinetd[1817]: START: nrpe pid=11539 from=::ffff:172.20.10.126
Jun  5 23:51:20 qualngs nrpe[11539]: Error: Could not complete SSL handshake. 5
Jun  5 23:51:20 qualngs xinetd[1817]: EXIT: nrpe status=0 pid=11539 duration=0(sec)
Jun  5 23:56:20 qualngs xinetd[1817]: START: nrpe pid=31853 from=::ffff:172.20.10.126
Jun  5 23:56:20 qualngs nrpe[31853]: Error: Could not complete SSL handshake. 5
Jun  5 23:56:20 qualngs xinetd[1817]: EXIT: nrpe status=0 pid=31853 duration=0(sec)
Jun  6 00:00:00 qualngs ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 128000 of 256000 messages and 131072000 of 131072000 bytes in the queue. See README for kernel tuning options.
Jun  6 00:00:02 qualngs ndo2db: Message sent to queue.
Jun  6 00:00:02 qualngs ndo2db: Warning: queue send error, retrying...
You have mail in /var/spool/mail/root
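
The ndo2db warning above reports two separate limits: a message count and a byte count. As a minimal sketch (not part of the original post), the following Python pulls both figures out of that line to show which limit is actually exhausted. The function name `queue_usage` is hypothetical, and the sample text is copied from the log line above with the timestamp and host omitted.

```python
import re

# Sample text copied from the /var/log/messages excerpt above
# (timestamp and hostname omitted).
WARNING = (
    "ndo2db: Warning: Retrying message send. This can occur because you have "
    "too few messages allowed or too few total bytes allowed in message queues. "
    "You are currently using 128000 of 256000 messages and "
    "131072000 of 131072000 bytes in the queue."
)

# Matches the "N of M messages and N of M bytes" part of the warning.
PATTERN = re.compile(r"using (\d+) of (\d+) messages and (\d+) of (\d+) bytes")


def queue_usage(line):
    """Return (message_fraction, byte_fraction), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    msgs_used, msgs_max, bytes_used, bytes_max = map(int, match.groups())
    return msgs_used / msgs_max, bytes_used / bytes_max


usage = queue_usage(WARNING)
if usage is not None:
    msg_frac, byte_frac = usage
    print(f"messages: {msg_frac:.0%} of limit, bytes: {byte_frac:.0%} of limit")
```

For this line the message count is at half of its limit while the byte count is at its limit, which suggests the byte limit is the one being hit. On Linux the per-queue byte limit is normally governed by kernel.msgmnb, but whether raising it is the right fix here is not established by this log alone.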

Posted: **Thu Jun 05, 2014 6:13 pm**

I doubled the kernel settings and it is running again. I am now getting the following:

[root@qualngs ~]# tail /var/log/messages
Jun  6 00:10:27 qualngs rsyslogd-2177: imuxsock lost 342 messages from pid 2107 due to rate-limiting
Jun  6 00:10:28 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 53153 due to rate-limiting
Jun  6 00:10:28 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 2107 due to rate-limiting
Jun  6 00:10:29 qualngs rsyslogd-2177: imuxsock lost 27 messages from pid 53153 due to rate-limiting
Jun  6 00:10:33 qualngs rsyslogd-2177: imuxsock lost 544 messages from pid 2107 due to rate-limiting
Jun  6 00:10:36 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 2107 due to rate-limiting
Jun  6 00:10:39 qualngs rsyslogd-2177: imuxsock lost 36 messages from pid 2107 due to rate-limiting
Jun  6 00:12:00 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 2107 due to rate-limiting
Jun  6 00:12:02 qualngs rsyslogd-2177: imuxsock lost 80 messages from pid 2107 due to rate-limiting
Jun  6 00:12:07 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 2107 due to rate-limiting
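
The rate-limiting messages above name the process IDs whose log output rsyslog is dropping. As a rough sketch (not part of the original post), this Python totals the "lost" counts per pid to show which process is producing the flood. The function name `lost_per_pid` is hypothetical, and the sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

# Lines copied from the /var/log/messages excerpt above.
LOG = """\
Jun  6 00:10:27 qualngs rsyslogd-2177: imuxsock lost 342 messages from pid 2107 due to rate-limiting
Jun  6 00:10:28 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 53153 due to rate-limiting
Jun  6 00:10:28 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 2107 due to rate-limiting
Jun  6 00:10:29 qualngs rsyslogd-2177: imuxsock lost 27 messages from pid 53153 due to rate-limiting
Jun  6 00:10:33 qualngs rsyslogd-2177: imuxsock lost 544 messages from pid 2107 due to rate-limiting
Jun  6 00:10:36 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 2107 due to rate-limiting
Jun  6 00:10:39 qualngs rsyslogd-2177: imuxsock lost 36 messages from pid 2107 due to rate-limiting
Jun  6 00:12:00 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 2107 due to rate-limiting
Jun  6 00:12:02 qualngs rsyslogd-2177: imuxsock lost 80 messages from pid 2107 due to rate-limiting
Jun  6 00:12:07 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 2107 due to rate-limiting
"""

# Only the "lost N messages" lines carry a count; "begins to drop" lines are ignored.
LOST = re.compile(r"imuxsock lost (\d+) messages from pid (\d+)")


def lost_per_pid(text):
    """Sum the number of dropped messages reported for each pid."""
    totals = Counter()
    for count, pid in LOST.findall(text):
        totals[int(pid)] += int(count)
    return dict(totals)


for pid, lost in sorted(lost_per_pid(LOG).items()):
    print(f"pid {pid}: {lost} messages lost")
```

In this excerpt pid 2107 accounts for almost all of the dropped messages. Running `ps -p 2107 -o comm=` on the server would show which process that is.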

Posted: **Fri Jun 06, 2014 5:44 am**

More information:

The kernel tweaks have not made a difference. I guess I need to keep increasing these values, but I don't want that to affect other services.

Getting this in the postgresql logs:

ERROR:  relation "xi_notifications" does not exist
STATEMENT:  VACUUM ANALYZE xi_notifications;

/var/log/messages

Jun  6 11:38:08 qualngs ndo2db: Message sent to queue.
Jun  6 11:38:08 qualngs ndo2db: Warning: queue send error, retrying...
Jun  6 11:38:19 qualngs ndo2db: Message sent to queue.
Jun  6 11:38:19 qualngs ndo2db: Warning: queue send error, retrying...
Jun  6 11:38:30 qualngs ndo2db: Message sent to queue.
Jun  6 11:38:30 qualngs ndo2db: Warning: queue send error, retrying...
Jun  6 11:38:41 qualngs ndo2db: Message sent to queue.
Jun  6 11:38:41 qualngs ndo2db: Warning: queue send error, retrying...
Jun  6 11:38:52 qualngs ndo2db: Message sent to queue.
Jun  6 11:38:52 qualngs ndo2db: Warning: queue send error, retrying...

error_log

[Fri Jun 06 11:33:03 2014] [error] [client xxx.xxx.xxx.xxx] File does not exist: /var/www/html/admin

mysqld.log

140606 10:10:04 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140606 10:14:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140606 10:14:28  InnoDB: Initializing buffer pool, size = 8.0M
140606 10:14:28  InnoDB: Completed initialization of buffer pool
140606 10:14:28  InnoDB: Started; log sequence number 0 44233
140606 10:14:28 [Note] Event Scheduler: Loaded 0 events
140606 10:14:28 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution

sysctl -p

net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
error: "net.bridge.bridge-nf-call-ip6tables" is an unknown key
error: "net.bridge.bridge-nf-call-iptables" is an unknown key
error: "net.bridge.bridge-nf-call-arptables" is an unknown key
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 4294967295
kernel.msgmni = 256000

Nagios Core stopped and NagiosXI stopped

This all points to the kernel, but what should I change the settings to? There are so many articles on the best configuration for PostgreSQL and MySQL, all saying different things.

Thanks

Chris

Posted: **Fri Jun 06, 2014 1:51 pm**

> Nagios Core stopped and NagiosXI stopped

Do you mean you are not able to log in to the web UI, or that the services are not running? What is the output of the following commands?

service nagios status
service ndo2db status
service mysqld status
service postgresql status
service crond status
service httpd status

Posted: **Mon Jun 09, 2014 4:36 am**

Hi,

This is the state of play today, after spending all weekend tweaking the kernel settings:

1. Nagios Core is running and working without errors in the event logs.
2. Nagios XI: ndo2db is stopped. When it is started, it runs for about 20 minutes, then the kernel resources are exhausted and both XI and Core crash. I have to stop ndo2db and then run service nagios restart, after which Nagios Core starts working again.

I have left ndo2db stopped since 04:30 today and Nagios Core has not crashed. The only new thing is that the following errors are now appearing in /var/log/messages:

Jun  9 10:25:40 qualngs rrdcached[17046]: queue_thread_main: rrd_update_r (/usr/local/nagios/share/perfdata/invu.centerprise.co.uk/#Process_CPU_Consumption.rrd) failed with status -1. (/usr/local/nagios/share/perfdata/invu.centerprise.co.uk/#Process_CPU_Consumption.rrd: expected 58 data source readings (got 57) from 1402305008)
Jun  9 10:25:41 qualngs rrdcached[17046]: queue_thread_main: rrd_update_r (/usr/local/nagios/share/perfdata/qls-data.local.qualitas-it.net/#Process_CPU_Consumption.rrd) failed with status -1. (/usr/local/nagios/share/perfdata/qls-data.local.qualitas-it.net/#Process_CPU_Consumption.rrd: found extra data on update argument: 100.0)
Jun  9 10:26:21 qualngs rrdcached[17046]: queue_thread_main: rrd_update_r (/usr/local/nagios/share/perfdata/qls-sql1.local.qualitas-it.net/#Process_CPU_Consumption.rrd) failed with status -1. (/usr/local/nagios/share/perfdata/qls-sql1.local.qualitas-it.net/#Process_CPU_Consumption.rrd: expected 82 data source readings (got 80) from 1402305030)
Jun  9 10:26:22 qualngs rrdcached[17046]: queue_thread_main: rrd_update_r (/usr/local/nagios/share/perfdata/bcotdfs06.bcotac.local/#Process_CPU_Consumption.rrd) failed with status -1. (/usr/local/nagios/share/perfdata/bcotdfs06.bcotac.local/#Process_CPU_Consumption.rrd: expected 78 data source readings (got 71) from 1402305064)

This is the current output of sysctl -p:

net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
error: "net.bridge.bridge-nf-call-ip6tables" is an unknown key
error: "net.bridge.bridge-nf-call-iptables" is an unknown key
error: "net.bridge.bridge-nf-call-arptables" is an unknown key
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.msgmni = 512000

[root@qualngs ~]# service nagios status
nagios (pid 60730) is running...
[root@qualngs ~]# service ndo2db status
ndo2db (pid 25309) is running...
[root@qualngs ~]# service mysqld status
mysqld (pid  1989) is running...
[root@qualngs ~]# service postgresql status
postmaster (pid  2038) is running...
[root@qualngs ~]# service crond status
crond (pid  2165) is running...
[root@qualngs ~]# service httpd status
httpd (pid  2157) is running...

tail /var/lib/pgsql/data/pg_log/postgresql-Mon.log

ERROR:  relation "xi_notifications" does not exist
STATEMENT:  VACUUM ANALYZE xi_notifications;
ERROR:  relation "xi_notifications" does not exist
STATEMENT:  VACUUM ANALYZE xi_notifications;
ERROR:  relation "xi_notifications" does not exist
STATEMENT:  VACUUM ANALYZE xi_notifications;
ERROR:  relation "xi_notifications" does not exist
STATEMENT:  VACUUM ANALYZE xi_notifications;
ERROR:  relation "xi_notifications" does not exist
STATEMENT:  VACUUM ANALYZE xi_notifications;

Posted: **Mon Jun 09, 2014 5:46 am**

Additional

ndo2db.debug

[1402310564.435857] [002.0] [pid=22393] INSERT INTO nagios_statehistory SET instance_id='1', state_time=FROM_UNIXT$
[1402310564.436208] [002.0] [pid=22393] INSERT INTO nagios_eventhandlers SET instance_id='1', eventhandler_type='0$
[1402310564.436614] [002.0] [pid=22393] INSERT INTO nagios_eventhandlers SET instance_id='1', eventhandler_type='0$
[1402310564.436972] [002.0] [pid=22393] INSERT INTO nagios_eventhandlers SET instance_id='1', eventhandler_type='0$
[1402310564.437351] [002.0] [pid=22393] INSERT INTO nagios_eventhandlers SET instance_id='1', eventhandler_type='0$
[1402310564.437737] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='1$
[1402310564.438279] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='12$
[1402310564.438556] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='1$
[1402310564.439050] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='12$
[1402310564.439355] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='6$
[1402310564.439793] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='68$
[1402310564.440100] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='6$
[1402310564.440461] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='68$
[1402310564.440970] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='6$
[1402310564.441402] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='65$
[1402310564.441726] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='6$
[1402310564.443456] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='65$
[1402310564.443711] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='7$
[1402310564.444103] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='71$
[1402310564.444336] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='7$
[1402310564.444683] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='71$
[1402310564.444894] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='4$
[1402310564.445283] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='46$
[1402310564.445507] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='4$

Posted: **Mon Jun 09, 2014 5:05 pm**

I want to pause this here for a moment and ask you, what version of Core are you actually running? We never recommend updating core manually as there are a lot of hooks in XI that rely on the current version we push out alongside XI. XI is not built to be piecemeal upgraded.

Posted: **Tue Jun 10, 2014 4:46 am**

Hi, Core is version:

Nagios® Core™
Version 4.0.6
April 29, 2014
Check for updates

Thanks

Chris

Nagios XI keeps crashing post upgrade to XI 2014
