Re: Nagios XI keeps crashing post upgrade to XI 2014
Posted: Thu Jun 05, 2014 5:00 pm
by chriscamm
Swap details
Code: Select all
[root@qualngs ~]# swapon -s
Filename Type Size Used Priority
/dev/dm-0 partition 262136 0 -1
/swapfile1 file 1048568 0 -2
[root@qualngs ~]# free
total used free shared buffers cached
Mem: 16326396 2363528 13962868 0 90764 316828
-/+ buffers/cache: 1955936 14370460
Swap: 1310704 0 1310704
[root@qualngs ~]#
Re: Nagios XI keeps crashing post upgrade to XI 2014
Posted: Thu Jun 05, 2014 5:13 pm
by chriscamm
Adding more memory has had a very adverse effect. I am now getting the following errors.
nagios.log
Code: Select all
[1402006143] wproc: Core Worker 21390: job 54 (pid=28236): Dormant child reaped
[1402006143] wproc: Core Worker 21393: job 54 (pid=28249) timed out. Killing it
[1402006143] wproc: GLOBAL SERVICE EVENTHANDLER job 54 from worker Core Worker 21393 timed out after 31.01s
error_log
Code: Select all
[Thu Jun 05 22:55:53 2014] [error] [client 172.20.10.254] Script timed out before returning headers: extinfo.cgi, referer: https://rmon.xxxxx.co.uk/nagios/side.php
[Thu Jun 05 22:56:53 2014] [warn] [client 172.20.10.254] Timeout waiting for output from CGI script /usr/local/nagios/sbin/extinfo.cgi, referer: https://rmon.xxxxx.co.uk/nagios/side.php
[Thu Jun 05 23:02:14 2014] [warn] [client 172.20.10.254] Timeout waiting for output from CGI script /usr/local/nagios/sbin/statusjson.cgi, referer: https://rmon.xxxxx.co.uk/nagios/main.php
[Thu Jun 05 23:02:14 2014] [error] [client 172.20.10.254] Script timed out before returning headers: statusjson.cgi, referer: https://rmon.xxxxx.co.uk/nagios/main.php
[Thu Jun 05 23:02:24 2014] [warn] [client 172.20.10.254] Timeout waiting for output from CGI script /usr/local/nagios/sbin/tac.cgi, referer: https://rmon.xxxxx.co.uk/nagios/side.php
[Thu Jun 05 23:02:24 2014] [error] [client 172.20.10.254] Script timed out before returning headers: tac.cgi, referer: https://rmon.xxxxx.co.uk/nagios/side.php
[Thu Jun 05 23:03:14 2014] [warn] [client 172.20.10.254] Timeout waiting for output from CGI script /usr/local/nagios/sbin/statusjson.cgi, referer: https://rmon.xxxxx.co.uk/nagios/main.php
[Thu Jun 05 23:03:25 2014] [warn] [client 172.20.10.254] Timeout waiting for output from CGI script /usr/local/nagios/sbin/tac.cgi, referer: https://rmon.xxxxx.co.uk/nagios/side.php
I cannot even log in to http://localhost/nagios or http://localhost/nagiosxi at all.
I am going to put the memory back to what it was.
Thanks
Chris
Re: Nagios XI keeps crashing post upgrade to XI 2014
Posted: Thu Jun 05, 2014 6:01 pm
by chriscamm
Latest from /var/log/messages:
Code: Select all
Jun 5 23:46:21 qualngs xinetd[1817]: EXIT: nrpe status=0 pid=56716 duration=1(sec)
Jun 5 23:51:20 qualngs xinetd[1817]: START: nrpe pid=11539 from=::ffff:172.20.10.126
Jun 5 23:51:20 qualngs nrpe[11539]: Error: Could not complete SSL handshake. 5
Jun 5 23:51:20 qualngs xinetd[1817]: EXIT: nrpe status=0 pid=11539 duration=0(sec)
Jun 5 23:56:20 qualngs xinetd[1817]: START: nrpe pid=31853 from=::ffff:172.20.10.126
Jun 5 23:56:20 qualngs nrpe[31853]: Error: Could not complete SSL handshake. 5
Jun 5 23:56:20 qualngs xinetd[1817]: EXIT: nrpe status=0 pid=31853 duration=0(sec)
Jun 6 00:00:00 qualngs ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 128000 of 256000 messages and 131072000 of 131072000 bytes in the queue. See README for kernel tuning options.
Jun 6 00:00:02 qualngs ndo2db: Message sent to queue.
Jun 6 00:00:02 qualngs ndo2db: Warning: queue send error, retrying...
You have mail in /var/spool/mail/root
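As a reading aid for the ndo2db warning above: its figures show the message count at half of its limit while the byte count has reached its limit. A minimal Python sketch, assuming the warning keeps the wording quoted in this post, extracts both ratios. The names `USAGE_RE` and `queue_usage` are illustrative and not part of ndo2db.

```python
import re

# Matches the usage figures in the ndo2db warning quoted in this post.
USAGE_RE = re.compile(r"using (\d+) of (\d+) messages and (\d+) of (\d+) bytes")


def queue_usage(line):
    """Return (message_fraction, byte_fraction), or None if the line has no usage figures."""
    match = USAGE_RE.search(line)
    if match is None:
        return None
    msgs_used, msgs_max, bytes_used, bytes_max = map(int, match.groups())
    return msgs_used / msgs_max, bytes_used / bytes_max


# The relevant part of the warning from /var/log/messages.
line = (
    "Jun 6 00:00:00 qualngs ndo2db: Warning: Retrying message send. "
    "You are currently using 128000 of 256000 messages and "
    "131072000 of 131072000 bytes in the queue."
)

msg_fraction, byte_fraction = queue_usage(line)
print(f"messages: {msg_fraction:.0%}, bytes: {byte_fraction:.0%}")
```

For this line the byte limit, not the message count, is the one that is used up.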
Re: Nagios XI keeps crashing post upgrade to XI 2014
Posted: Thu Jun 05, 2014 6:13 pm
by chriscamm
I doubled the kernel settings and it's running again. I am now getting this:
Code: Select all
[root@qualngs ~]# tail /var/log/messages
Jun 6 00:10:27 qualngs rsyslogd-2177: imuxsock lost 342 messages from pid 2107 due to rate-limiting
Jun 6 00:10:28 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 53153 due to rate-limiting
Jun 6 00:10:28 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 2107 due to rate-limiting
Jun 6 00:10:29 qualngs rsyslogd-2177: imuxsock lost 27 messages from pid 53153 due to rate-limiting
Jun 6 00:10:33 qualngs rsyslogd-2177: imuxsock lost 544 messages from pid 2107 due to rate-limiting
Jun 6 00:10:36 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 2107 due to rate-limiting
Jun 6 00:10:39 qualngs rsyslogd-2177: imuxsock lost 36 messages from pid 2107 due to rate-limiting
Jun 6 00:12:00 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 2107 due to rate-limiting
Jun 6 00:12:02 qualngs rsyslogd-2177: imuxsock lost 80 messages from pid 2107 due to rate-limiting
Jun 6 00:12:07 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 2107 due to rate-limiting
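The rate-limiting lines above name the process IDs whose messages are being dropped, so adding up the "lost" counts per pid shows which process is flooding syslog. A minimal Python sketch, assuming the rsyslog wording stays as quoted in this post (`LOST_RE` and `lost_per_pid` are illustrative names):

```python
import re
from collections import Counter

# Matches only the "lost N messages from pid P" lines, not the "begins to drop" ones.
LOST_RE = re.compile(r"imuxsock lost (\d+) messages from pid (\d+)")


def lost_per_pid(lines):
    """Sum the number of dropped messages reported for each pid."""
    totals = Counter()
    for line in lines:
        match = LOST_RE.search(line)
        if match:
            totals[int(match.group(2))] += int(match.group(1))
    return totals


# Lines copied from the /var/log/messages excerpt in this post.
sample = """\
Jun 6 00:10:27 qualngs rsyslogd-2177: imuxsock lost 342 messages from pid 2107 due to rate-limiting
Jun 6 00:10:28 qualngs rsyslogd-2177: imuxsock begins to drop messages from pid 53153 due to rate-limiting
Jun 6 00:10:29 qualngs rsyslogd-2177: imuxsock lost 27 messages from pid 53153 due to rate-limiting
Jun 6 00:10:33 qualngs rsyslogd-2177: imuxsock lost 544 messages from pid 2107 due to rate-limiting
Jun 6 00:10:39 qualngs rsyslogd-2177: imuxsock lost 36 messages from pid 2107 due to rate-limiting
Jun 6 00:12:02 qualngs rsyslogd-2177: imuxsock lost 80 messages from pid 2107 due to rate-limiting
"""

for pid, lost in lost_per_pid(sample.splitlines()).most_common():
    print(pid, lost)
```

Almost all of the dropped messages come from pid 2107. Running `ps -p 2107 -o comm=` on the server should show which process that is.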
Re: Nagios XI keeps crashing post upgrade to XI 2014
Posted: Fri Jun 06, 2014 5:44 am
by chriscamm
More information:
The kernel tweaks have not made a difference. I guess I need to keep increasing these values, but I don't want that to affect other services.
Getting this in the postgresql logs:
Code: Select all
ERROR: relation "xi_notifications" does not exist
STATEMENT: VACUUM ANALYZE xi_notifications;
/var/log/messages
Code: Select all
Jun 6 11:38:08 qualngs ndo2db: Message sent to queue.
Jun 6 11:38:08 qualngs ndo2db: Warning: queue send error, retrying...
Jun 6 11:38:19 qualngs ndo2db: Message sent to queue.
Jun 6 11:38:19 qualngs ndo2db: Warning: queue send error, retrying...
Jun 6 11:38:30 qualngs ndo2db: Message sent to queue.
Jun 6 11:38:30 qualngs ndo2db: Warning: queue send error, retrying...
Jun 6 11:38:41 qualngs ndo2db: Message sent to queue.
Jun 6 11:38:41 qualngs ndo2db: Warning: queue send error, retrying...
Jun 6 11:38:52 qualngs ndo2db: Message sent to queue.
Jun 6 11:38:52 qualngs ndo2db: Warning: queue send error, retrying...
error_log
Code: Select all
[Fri Jun 06 11:33:03 2014] [error] [client xxx.xxx.xxx.xxx] File does not exist: /var/www/html/admin
mysqld.log
Code: Select all
140606 10:10:04 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140606 10:14:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140606 10:14:28 InnoDB: Initializing buffer pool, size = 8.0M
140606 10:14:28 InnoDB: Completed initialization of buffer pool
140606 10:14:28 InnoDB: Started; log sequence number 0 44233
140606 10:14:28 [Note] Event Scheduler: Loaded 0 events
140606 10:14:28 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73' socket: '/var/lib/mysql/mysql.sock' port: 3306 Source distribution
sysctl -p
Code: Select all
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
error: "net.bridge.bridge-nf-call-ip6tables" is an unknown key
error: "net.bridge.bridge-nf-call-iptables" is an unknown key
error: "net.bridge.bridge-nf-call-arptables" is an unknown key
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 4294967295
kernel.msgmni = 256000
Nagios Core stopped and NagiosXI stopped
This is all pointing to the kernel, but what should I change the settings to? There are so many different articles on the best config for postgresql and mysql, all saying different things.
Thanks
Chris
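On the question of what the settings mean: the keys in the sysctl output above do not share a unit. kernel.shmmax, kernel.msgmax and kernel.msgmnb are byte counts, kernel.shmall is counted in memory pages, and kernel.msgmni is a number of queues. A minimal Python sketch, assuming the common 4 KiB page size (check with `getconf PAGE_SIZE`), converts the posted values to readable sizes. The helper names are illustrative.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages; confirm with `getconf PAGE_SIZE`

# Values copied from the sysctl -p output in this post.
settings = {
    "kernel.msgmnb": 262144000,   # bytes: maximum size of one message queue
    "kernel.msgmax": 262144000,   # bytes: maximum size of one message
    "kernel.shmmax": 4294967295,  # bytes: maximum size of one shared memory segment
    "kernel.shmall": 4294967295,  # pages: total shared memory system-wide
}


def to_bytes(key, value, page_size=PAGE_SIZE):
    """Convert a setting to bytes; only kernel.shmall is expressed in pages."""
    return value * page_size if key == "kernel.shmall" else value


def human(n):
    """Format a byte count using binary units."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return f"{n:.1f} {unit}"
        n /= 1024


for key, value in settings.items():
    print(f"{key:15} {human(to_bytes(key, value))}")
```

With 4 KiB pages, kernel.shmall = 4294967295 works out to about 16 TiB, far more than the roughly 16 GB of RAM reported by free earlier in the thread, so at that value it is effectively not a limit.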
Re: Nagios XI keeps crashing post upgrade to XI 2014
Posted: Fri Jun 06, 2014 1:51 pm
by lmiltchev
Nagios Core stopped and NagiosXI stopped
Do you mean that you are not able to log in to the web UI, or that the services are not running? What is the output of the following commands?
Code: Select all
service nagios status
service ndo2db status
service mysqld status
service postgresql status
service crond status
service httpd status
Re: Nagios XI keeps crashing post upgrade to XI 2014
Posted: Mon Jun 09, 2014 4:36 am
by chriscamm
Hi,
This is the state of play today, after spending all weekend tweaking the kernel settings:
1. Nagios Core is running and working without errors in the event logs.
2. Nagios XI: ndo2db is stopped. When it starts, it runs for 20 minutes, then the kernel is exhausted and both XI and Core crash. I then have to stop ndo2db and run service nagios restart before Nagios Core starts working again.
I have left ndo2db stopped since 0430 today and Nagios Core has not crashed. The only new thing is that the following errors are now appearing in /var/log/messages:
Code: Select all
Jun 9 10:25:40 qualngs rrdcached[17046]: queue_thread_main: rrd_update_r (/usr/local/nagios/share/perfdata/invu.centerprise.co.uk/#Process_CPU_Consumption.rrd) failed with status -1. (/usr/local/nagios/share/perfdata/invu.centerprise.co.uk/#Process_CPU_Consumption.rrd: expected 58 data source readings (got 57) from 1402305008)
Jun 9 10:25:41 qualngs rrdcached[17046]: queue_thread_main: rrd_update_r (/usr/local/nagios/share/perfdata/qls-data.local.qualitas-it.net/#Process_CPU_Consumption.rrd) failed with status -1. (/usr/local/nagios/share/perfdata/qls-data.local.qualitas-it.net/#Process_CPU_Consumption.rrd: found extra data on update argument: 100.0)
Jun 9 10:26:21 qualngs rrdcached[17046]: queue_thread_main: rrd_update_r (/usr/local/nagios/share/perfdata/qls-sql1.local.qualitas-it.net/#Process_CPU_Consumption.rrd) failed with status -1. (/usr/local/nagios/share/perfdata/qls-sql1.local.qualitas-it.net/#Process_CPU_Consumption.rrd: expected 82 data source readings (got 80) from 1402305030)
Jun 9 10:26:22 qualngs rrdcached[17046]: queue_thread_main: rrd_update_r (/usr/local/nagios/share/perfdata/bcotdfs06.bcotac.local/#Process_CPU_Consumption.rrd) failed with status -1. (/usr/local/nagios/share/perfdata/bcotdfs06.bcotac.local/#Process_CPU_Consumption.rrd: expected 78 data source readings (got 71) from 1402305064)
This is the current output from sysctl -p:
Code: Select all
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
error: "net.bridge.bridge-nf-call-ip6tables" is an unknown key
error: "net.bridge.bridge-nf-call-iptables" is an unknown key
error: "net.bridge.bridge-nf-call-arptables" is an unknown key
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.msgmni = 512000
Code: Select all
[root@qualngs ~]# service nagios status
nagios (pid 60730) is running...
[root@qualngs ~]# service ndo2db status
ndo2db (pid 25309) is running...
[root@qualngs ~]# service mysqld status
mysqld (pid 1989) is running...
[root@qualngs ~]# service postgresql status
postmaster (pid 2038) is running...
[root@qualngs ~]# service crond status
crond (pid 2165) is running...
[root@qualngs ~]# service httpd status
httpd (pid 2157) is running...
tail /var/lib/pgsql/data/pg_log/postgresql-Mon.log
Code: Select all
ERROR: relation "xi_notifications" does not exist
STATEMENT: VACUUM ANALYZE xi_notifications;
ERROR: relation "xi_notifications" does not exist
STATEMENT: VACUUM ANALYZE xi_notifications;
ERROR: relation "xi_notifications" does not exist
STATEMENT: VACUUM ANALYZE xi_notifications;
ERROR: relation "xi_notifications" does not exist
STATEMENT: VACUUM ANALYZE xi_notifications;
ERROR: relation "xi_notifications" does not exist
STATEMENT: VACUUM ANALYZE xi_notifications;
Re: Nagios XI keeps crashing post upgrade to XI 2014
Posted: Mon Jun 09, 2014 5:46 am
by chriscamm
Additional
ndo2db.debug
Code: Select all
[1402310564.435857] [002.0] [pid=22393] INSERT INTO nagios_statehistory SET instance_id='1', state_time=FROM_UNIXT$
[1402310564.436208] [002.0] [pid=22393] INSERT INTO nagios_eventhandlers SET instance_id='1', eventhandler_type='0$
[1402310564.436614] [002.0] [pid=22393] INSERT INTO nagios_eventhandlers SET instance_id='1', eventhandler_type='0$
[1402310564.436972] [002.0] [pid=22393] INSERT INTO nagios_eventhandlers SET instance_id='1', eventhandler_type='0$
[1402310564.437351] [002.0] [pid=22393] INSERT INTO nagios_eventhandlers SET instance_id='1', eventhandler_type='0$
[1402310564.437737] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='1$
[1402310564.438279] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='12$
[1402310564.438556] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='1$
[1402310564.439050] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='12$
[1402310564.439355] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='6$
[1402310564.439793] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='68$
[1402310564.440100] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='6$
[1402310564.440461] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='68$
[1402310564.440970] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='6$
[1402310564.441402] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='65$
[1402310564.441726] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='6$
[1402310564.443456] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='65$
[1402310564.443711] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='7$
[1402310564.444103] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='71$
[1402310564.444336] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='7$
[1402310564.444683] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='71$
[1402310564.444894] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='4$
[1402310564.445283] [002.0] [pid=22393] INSERT INtO nagios_customvariablestatus SET instance_id='1', object_id='46$
[1402310564.445507] [002.0] [pid=22393] INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='4$
Re: Nagios XI keeps crashing post upgrade to XI 2014
Posted: Mon Jun 09, 2014 5:05 pm
by slansing
I want to pause here for a moment and ask: what version of Core are you actually running? We never recommend updating Core manually, as there are a lot of hooks in XI that rely on the current version we push out alongside XI. XI is not built to be upgraded piecemeal.
Re: Nagios XI keeps crashing post upgrade to XI 2014
Posted: Tue Jun 10, 2014 4:46 am
by chriscamm
Hi, Core is version:
Code: Select all
Nagios® Core™
Version 4.0.6
April 29, 2014
Check for updates
Thanks
Chris