NagiosXI had a seizure
Posted: Sun Dec 15, 2013 10:47 pm
My main NagiosXI server had a seizure this morning and caused 80,000+ alerts. I'll try and write this post so it can be easily understood, but doubt that'll last long, LOL.
Before I get into what happened, I have a question:
1.) 80k alerts means one heck of a lot of SMS messages, and it took quite some time for them all to be sent. I use the Multitech component. Where are these messages queued? If this happens again I'd like to know which files to remove, or are they put in a database table?
Now, onto trying to describe what happened and to ask a few other questions:
I execute an "apply configuration" every day at 8am by calling "sudo su -l nagios -c 'cd /usr/local/nagiosxi/scripts/ && ./reconfigure_nagios.sh'" so that my on-call changes are made. This morning around 8:15 I received a message that the load was 28/20/10, so yeah, a little high. Around 8:40 I received an alert that there were now 680 active processes, which is a few hundred more than normal. At 8:45 I received an email from my networking group's manager saying that nagios had gone crazy and was sending thousands of emails to his team. That email finally woke me from my slumber, and I headed downstairs to VPN in and check it out. Before I got connected and began my investigation, I received alerts that nagiosmobile wasn't responding and then that a max load of 66 had been reached, yes, 66!
Just about all host and service checks were timing out.
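For anyone wanting to catch a load spike like the 28/20/10 above right after the morning reconfigure, here is a minimal sketch. It assumes a Linux box where /proc/loadavg is readable; the function name load_exceeds and the threshold of 20 are made up for illustration and are not part of Nagios XI.

```shell
#!/bin/sh
# load_exceeds LOADAVG_LINE THRESHOLD
# Succeeds (exit status 0) when the 1-minute load average in a
# /proc/loadavg-style line is greater than THRESHOLD.
# The function name and arguments are illustrative only.
load_exceeds() {
    one_min=${1%% *}    # the first field is the 1-minute average
    # awk does the floating-point comparison that plain sh cannot
    awk -v l="$one_min" -v t="$2" 'BEGIN { exit !(l > t) }'
}

# Possible use a few minutes after the 8am apply configuration:
# if load_exceeds "$(cat /proc/loadavg)" 20; then
#     echo "load is above 20 after reconfigure" >&2
# fi
```

This only reports the condition; what to do about it (hold off notifications, restart, page someone) is a separate decision.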
When I first got connected, Active Host and Service Checks and Notifications were not showing green checks; instead they showed blue exclamation marks, as though they were disabled. This has actually happened a few times when I have manually hit "apply configuration", and normally I can go to the "process info" link and hit start and everything begins to work as desired. That did not fix it this time, and I would like to figure out what sometimes causes that issue (let's call that #2).
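One way to see from the command line whether those three items are really disabled is to read the programstatus block of Nagios' status file. This is only a sketch: it assumes the status file is at /usr/local/nagios/var/status.dat (the usual default, but check status_file in nagios.cfg), and program_flag is a made-up helper name.

```shell
#!/bin/sh
# program_flag STATUS_FILE FLAG
# Prints the value of FLAG from the programstatus block of a Nagios
# status.dat file. Prints nothing if the flag is not found.
# The helper name is illustrative, not something Nagios ships.
program_flag() {
    awk -F= -v key="$2" '
        /^programstatus[ \t]*[{]/ { inblock = 1; next }
        inblock && /^[ \t]*[}]/   { exit }   # end of the programstatus block
        inblock {
            name = $1
            gsub(/^[ \t]+/, "", name)        # strip leading indentation
            if (name == key) { print $2; exit }
        }
    ' "$1"
}

# Possible use (the path is an assumption, see above):
# for flag in enable_notifications active_service_checks_enabled active_host_checks_enabled; do
#     echo "$flag=$(program_flag /usr/local/nagios/var/status.dat "$flag")"
# done
```

A value of 0 for any of the three would match the blue exclamation marks in the UI; a value of 1 while the UI still shows them disabled would point at the UI or database side instead.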
I tried restarting nagios from the CLI and that didn't fix it. I tried writing the config files, verifying, and restarting that way. Write and Verify went well, but Restart came back red, with no errors shown. While I was doing all of this I wanted to disable notifications, but when I clicked the link to do that I received "An error occurred when processing your command" and could not do it. I looked in the logs and could not find any errors from doing it. Here is the ONLY information in my httpd error_log:
Code: Select all
[root@svwdcnagios02 ~]# tail -n 50 -f /var/log/httpd/error_log
[Sun Dec 15 03:21:01 2013] [notice] Digest: generating secret for digest authentication ...
[Sun Dec 15 03:21:01 2013] [notice] Digest: done
[Sun Dec 15 03:21:02 2013] [notice] Apache/2.2.15 (Unix) DAV/2 PHP/5.3.3 mod_ssl/2.2.15 OpenSSL/1.0.0-fips mod_wsgi/3.2 Python/2.6.6 configured -- resuming normal operations
[Sun Dec 15 09:29:35 2013] [error] [client 10.94.6.212] PHP Notice: Undefined index: limit in /usr/local/nagiosxi/html/includes/components/ccm/includes/ccm_log.inc.php on line 65, referer: http://svwdcnagios02/nagiosxi/includes/components/ccm/xi-index.php
[Sun Dec 15 09:29:35 2013] [error] [client 10.94.6.212] PHP Warning: Division by zero in /usr/local/nagiosxi/html/includes/components/ccm/page_templates/ccm_table.php on line 326, referer: http://svwdcnagios02/nagiosxi/includes/components/ccm/xi-index.php
[Sun Dec 15 09:29:35 2013] [error] [client 10.94.6.212] PHP Warning: Division by zero in /usr/local/nagiosxi/html/includes/components/ccm/page_templates/ccm_table.php on line 327, referer: http://svwdcnagios02/nagiosxi/includes/components/ccm/xi-index.php
[Sun Dec 15 09:29:35 2013] [error] [client 10.94.6.212] PHP Notice: Undefined variable: d in /usr/local/nagiosxi/html/includes/components/ccm/includes/ccm_log.inc.php on line 191, referer: http://svwdcnagios02/nagiosxi/includes/components/ccm/xi-index.php
[Sun Dec 15 09:29:35 2013] [error] [client 10.94.6.212] PHP Notice: Undefined index: limit in /usr/local/nagiosxi/html/includes/components/ccm/includes/ccm_log.inc.php on line 218, referer: http://svwdcnagios02/nagiosxi/includes/components/ccm/xi-index.php
I'd like to figure out those errors, let's call those #3!
Finally, I just shut down all the services and rebooted the server. When it came back up, everything seemed fine and I thought I was done fixing and just had to investigate. The number of bad services started falling drastically. My happiness was short-lived, however. I noticed that the 1000+ bad services were listed as handled, but they should have been unhandled. All nagios processes seemed to be up and running properly, and when I clicked host detail everything showed green, the last check date/time was within 5 minutes for all of them, and they were updating as I let it sit there, so the hosts were being checked. When I clicked service detail, though, hosts appeared gray, like they were pending, and the services were showing green, sort of what it looks like if you apply configuration and everything is restarting. If I clicked on a service or host, it would show checks and notifications disabled, last check as never, and next check as never, even though the times were being updated when looking at the list of hosts or services. Eventually, I just restarted the nagios process and bam, everything started working.
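Since disabling notifications through the web UI failed with "An error occurred when processing your command", a possible fallback is to write the DISABLE_NOTIFICATIONS external command straight to Nagios' command file. This is a rough sketch only: the command file path /usr/local/nagios/var/rw/nagios.cmd is an assumption (check command_file in nagios.cfg), and nagios_cmd is a made-up helper name.

```shell
#!/bin/sh
# nagios_cmd COMMAND_FILE COMMAND
# Appends one external command to COMMAND_FILE in the
# "[epoch-seconds] COMMAND" form that Nagios Core expects.
# The helper name is illustrative only.
nagios_cmd() {
    printf '[%s] %s\n' "$(date +%s)" "$2" >> "$1"
}

# Possible use, run as a user allowed to write to the command file
# (the path is an assumption, see above):
# nagios_cmd /usr/local/nagios/var/rw/nagios.cmd DISABLE_NOTIFICATIONS
# nagios_cmd /usr/local/nagios/var/rw/nagios.cmd ENABLE_NOTIFICATIONS
```

The command file is normally a named pipe, so if the nagios daemon is not running or not reading it, the write can block instead of returning. Whether this would have worked during the incident is unknown.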
EDIT: Also, somewhere in there my perfdata stopped processing and that started to fill my ramdrive. I figured I had a few hours before it filled, and that is what finally pushed me over the edge to reboot the server (and lose a couple hours of data).
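To keep an eye on how quickly unprocessed perfdata is filling a ramdrive like this, something along these lines could be run from cron or by hand. It is a sketch only: the mount point in the usage comment is a placeholder for wherever the ramdrive is mounted, and pct_used is a made-up helper name.

```shell
#!/bin/sh
# pct_used PATH
# Prints the used-space percentage (digits only) of the filesystem
# that holds PATH, taken from POSIX-format df output.
# The helper name is illustrative only.
pct_used() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Possible use (replace the placeholder with the real ramdrive mount point):
# if [ "$(pct_used /path/to/ramdrive)" -ge 80 ]; then
#     echo "ramdrive is at least 80% full" >&2
# fi
```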
EDIT #2: During this morning's recycle, I was still home and was watching. I saw the config snapshot get mad and the 3 checks turn to exclamation marks as usual. I waited 3 minutes and the only file that seemed to have been written so far was objects.cache. I went ahead and hit start on the process info page, and another 3 minutes later everything was working as expected.
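Instead of watching by hand for files to be rewritten after the reconfigure, a small polling helper could do the waiting. This is a sketch under assumptions: wait_for_update is a made-up name, and the paths in the usage comment (a marker file in /tmp and /usr/local/nagios/var/status.dat) are guesses at a typical layout, not confirmed for this server.

```shell
#!/bin/sh
# wait_for_update FILE MARKER TIMEOUT_SECONDS
# Polls once per second until FILE has a modification time newer than
# MARKER. Returns 0 as soon as that is true, 1 if TIMEOUT_SECONDS pass.
# The helper name is illustrative only.
wait_for_update() {
    waited=0
    while [ "$waited" -lt "$3" ]; do
        # find prints FILE only when it was modified after MARKER
        if [ -n "$(find "$1" -newer "$2" 2>/dev/null)" ]; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Possible use around the daily reconfigure (paths are assumptions):
# touch /tmp/reconfigure.marker
# (cd /usr/local/nagiosxi/scripts/ && ./reconfigure_nagios.sh)
# wait_for_update /usr/local/nagios/var/status.dat /tmp/reconfigure.marker 180 ||
#     echo "status.dat not rewritten within 3 minutes" >&2
```

The 180 seconds mirrors the 3 minutes waited above; it is not a recommended value.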