NagiosXI had a seizure

Post by **BanditBBS** » Sun Dec 15, 2013 10:47 pm

My main NagiosXI server had a seizure this morning and caused 80,000+ alerts. I'll try and write this post so it can be easily understood, but doubt that'll last long, LOL.

Before I get into what happened I have a question:
1.) 80k alerts means one heck of a lot of SMS messages. Took them quite some time to all be sent. I use the Multitech component. Where are these queued so I know if this happens again, that I can remove those files, or are they put in a database table?

Now, onto trying to describe what happened and to ask a few other questions:
I execute an "apply configuration" every day at 8am by calling "sudo su -l nagios -c 'cd /usr/local/nagiosxi/scripts/ && ./reconfigure_nagios.sh'" so that my on-call changes are made. This morning around 8:15 I received a message that the load was 28/20/10, so yeah, a little high. Around 8:40 I received an alert that there were now 680 active processes, which is a few hundred more than normal. At 8:45 I received an email from my networking group's manager that nagios had gone crazy and was sending thousands of emails to his team. I was finally woken up from my slumber with that email and headed downstairs to VPN in and check it out. Before I got connected and began my investigation, I received alerts that nagiosmobile wasn't responding and then that max load of 66 was reached, yes, 66!

Just about all host and service checks were timing out.

When I first got connected, Active Host and Service Checks and Notifications were not green checks, but instead were blue exclamations as though they were disabled. This has actually happened when I have manually hit the "apply configuration" a few times and normally I can go to "process info" link and hit start and everything begins to work as desired. That did not fix it this time and I would like to figure out what causes that issue sometimes (let's call that #2).

I tried restarting nagios from the cli and that didn't fix it. I tried writing config files, verifying and restarting that way. Write and Verify went well, but restart came back with red, but no errors. While I was doing all of this, I wanted to disable notifications, but when I clicked the link to do that I received "An error occurred when processing your command" and I could not do it. I looked in logs and could not find any errors when doing it. Here is the ONLY information in my httpd error_log:

Code:

[root@svwdcnagios02 ~]# tail -n 50 -f /var/log/httpd/error_log
[Sun Dec 15 03:21:01 2013] [notice] Digest: generating secret for digest authentication ...
[Sun Dec 15 03:21:01 2013] [notice] Digest: done
[Sun Dec 15 03:21:02 2013] [notice] Apache/2.2.15 (Unix) DAV/2 PHP/5.3.3 mod_ssl/2.2.15 OpenSSL/1.0.0-fips mod_wsgi/3.2 Python/2.6.6 configured -- resuming normal operations
[Sun Dec 15 09:29:35 2013] [error] [client 10.94.6.212] PHP Notice:  Undefined index: limit in /usr/local/nagiosxi/html/includes/components/ccm/includes/ccm_log.inc.php on line 65, referer: http://svwdcnagios02/nagiosxi/includes/components/ccm/xi-index.php
[Sun Dec 15 09:29:35 2013] [error] [client 10.94.6.212] PHP Warning:  Division by zero in /usr/local/nagiosxi/html/includes/components/ccm/page_templates/ccm_table.php on line 326, referer: http://svwdcnagios02/nagiosxi/includes/components/ccm/xi-index.php
[Sun Dec 15 09:29:35 2013] [error] [client 10.94.6.212] PHP Warning:  Division by zero in /usr/local/nagiosxi/html/includes/components/ccm/page_templates/ccm_table.php on line 327, referer: http://svwdcnagios02/nagiosxi/includes/components/ccm/xi-index.php
[Sun Dec 15 09:29:35 2013] [error] [client 10.94.6.212] PHP Notice:  Undefined variable: d in /usr/local/nagiosxi/html/includes/components/ccm/includes/ccm_log.inc.php on line 191, referer: http://svwdcnagios02/nagiosxi/includes/components/ccm/xi-index.php
[Sun Dec 15 09:29:35 2013] [error] [client 10.94.6.212] PHP Notice:  Undefined index: limit in /usr/local/nagiosxi/html/includes/components/ccm/includes/ccm_log.inc.php on line 218, referer: http://svwdcnagios02/nagiosxi/includes/components/ccm/xi-index.php

I'd like to figure out those errors, let's call those #3!

Finally, I just shut down all the services and rebooted the server. When it came back up, everything seemed fine and I thought I was done fixing and just had to investigate. The number of bad services started falling drastically. My happiness was short-lived, however. I noticed that the 1000+ bad services were listed as handled, but they should have been unhandled. All processes of nagios seemed to be up and running properly and when I clicked host detail, everything showed green and last check date/time was all within 5 minutes and they were updating as I let it sit there, so the hosts were being checked. When I clicked service detail though, hosts appeared gray, like they were pending and the services were showing green, sort of what it looks like if you apply configuration and everything is restarting. If I clicked on a service or host, it would show checks and notifications disabled and last check never and next check as never, even though when looking at the list of hosts or services, the times were being updated. Eventually, I just restarted nagios process and bam, everything started working.

EDIT: Also, somewhere in there my perfdata stopped processing and that started to fill my ramdrive. I figured I had a few hours before it filled and that is what finally pushed me over the edge to reboot the server (and lose a couple hours of data).

EDIT #2: During this morning's recycle, I was still home and was watching. I saw the config snapshot get mad and the 3 checks turn to exclamation marks as usual. I waited 3 minutes and the only file that seemed to have been written so far was objects.cache. I went ahead and hit the start on the process info page and another 3 minutes later everything was working as expected.

Post by **abrist** » Mon Dec 16, 2013 11:44 am

You may have an issue with a rare race condition between ndo2db and nagios/mysql. What is your disk I/O at currently?

Post by **BanditBBS** » Mon Dec 16, 2013 11:51 am

abrist wrote: You may have an issue with a rare race condition between ndo2db and nagios/mysql. What is your disk I/O at currently?

Andy...did you see my book of a post, and you respond with one sentence and question? LOL

Code:

Linux 2.6.32-358.18.1.el6.x86_64 (svwdcnagios02.aeo.ae.com)     12/16/2013      _x86_64_        (12 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.65    0.00    0.88    1.44    0.00   95.03

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             168.52       295.51      2062.34   26283587  183428113
sdb               0.00         0.04         0.00       3488          0
dm-0            246.11       274.60      1914.38   24423826  170268672
dm-1              0.08         0.04         0.58       3304      51592
dm-2              0.00         0.01         0.00        784          0
dm-3             17.14        20.53       135.46    1826346   12048032
dm-4              1.49         0.02        11.92       1658    1059768

Remember, I do use a ramdisk and an offloaded database.

Post by **BanditBBS** » Mon Dec 16, 2013 11:55 am

Code:

08:00:01 AM     all      2.26      0.00      0.88      1.29      0.00     95.57
08:10:01 AM     all      3.11      0.00      0.79      1.30      0.00     94.80
08:20:01 AM     all      1.97      0.00      0.82      1.44      0.00     95.77
08:30:01 AM     all      2.05      0.00      0.85      1.37      0.00     95.74
08:40:01 AM     all      2.07      0.00      0.82      1.40      0.00     95.71
08:50:01 AM     all      2.11      0.00      0.86      1.41      0.00     95.62
09:00:01 AM     all      2.19      0.00      0.87      1.31      0.00     95.63
09:10:01 AM     all      2.12      0.00      0.82      1.53      0.00     95.52

09:10:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:20:01 AM     all      2.21      0.00      0.86      1.38      0.00     95.55
09:30:01 AM     all      2.13      0.00      0.86      1.31      0.00     95.70
09:40:01 AM     all      2.10      0.00      0.84      1.45      0.00     95.61
09:50:01 AM     all      2.08      0.00      0.85      1.41      0.00     95.65
10:00:01 AM     all      2.12      0.00      0.87      1.26      0.00     95.74
10:10:01 AM     all      2.14      0.00      0.84      1.37      0.00     95.65
10:20:01 AM     all      2.16      0.00      0.89      1.35      0.00     95.60
10:30:01 AM     all      2.15      0.00      0.87      1.33      0.00     95.65
10:40:01 AM     all      2.39      0.00      0.88      1.39      0.00     95.35
10:50:01 AM     all      2.27      0.00      0.88      1.37      0.00     95.48
11:00:01 AM     all      2.26      0.00      0.86      1.21      0.00     95.66
11:10:01 AM     all      2.23      0.00      0.84      1.39      0.00     95.55
11:20:01 AM     all      2.86      0.00      0.89      1.43      0.00     94.83
11:30:01 AM     all      2.39      0.00      0.86      1.29      0.00     95.46
11:40:01 AM     all      2.34      0.00      0.86      1.44      0.00     95.37
11:50:01 AM     all      2.22      0.00      0.84      1.50      0.00     95.44
Average:        all      2.54      0.00      0.88      1.38      0.00     95.20

More information for ya

Post by **abrist** » Mon Dec 16, 2013 12:01 pm

BanditBBS wrote: Andy...did you see my book of a post, and you respond with one sentence and question? LOL

Concise brevity is a virtue. Too bad I am not virtuous.

BanditBBS wrote: Remember, I do use a ramdisk and an offloaded database.

Yep.
I have seen this exact behavior before; it usually happens when ndo2db tries to reconnect to the data sink before nagios is done restarting. It does not explain why npcd stopped processing perfdata though. Do you have an exact time the failure happened? Was it during the restart of nagios this morning for the new on-call rotation? Or did the issues start earlier in the weekend?

Next time you write configuration, watch the io wait on the XI server and the mysql server. Additionally, when load spikes, save a "ps -aef" or "ps -aux" to a text file to post here.
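
A minimal sketch of the "save a ps listing when load spikes" idea, assuming a Linux host with /proc/loadavg and awk available. The function names, the threshold, and the output directory are all illustrative and are not part of XI:

```shell
#!/bin/sh
# Hypothetical helpers: snapshot the process list when the 1-minute load
# average reaches a threshold. Nothing runs until a function is called.

# Return 0 if load ($1) is at or above threshold ($2); awk handles decimals.
load_exceeds() {
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load + 0 >= max + 0) }'
}

# Save "ps -aef" to a timestamped file in directory $1 and print the file name.
snapshot_ps() {
    out="$1/ps-$(date +%Y%m%d-%H%M%S).txt"
    ps -aef > "$out"
    printf '%s\n' "$out"
}

# Check the current load once; snapshot into directory $2 if the load is at
# or above threshold $1.
check_once() {
    read -r load1 rest < /proc/loadavg
    if load_exceeds "$load1" "$1"; then
        snapshot_ps "$2"
    fi
}
```

Calling something like `check_once 20 /tmp` from cron every minute would leave one timestamped file behind for each minute the load stays at or above 20, which can then be attached to a post.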

Is everything working now? Have you attempted to apply configuration or restart the nagios process since?

Post by **BanditBBS** » Mon Dec 16, 2013 12:21 pm

abrist wrote: I have seen this exact behavior before; it usually happens when ndo2db tries to reconnect to the data sink before nagios is done restarting. It does not explain why npcd stopped processing perfdata though. Do you have an exact time the failure happened? Was it during the restart of nagios this morning for the new on-call rotation? Or did the issues start earlier in the weekend?

It happened Sunday morning at the exact same time an Apply Configuration was processed. It also appeared to want to happen this morning, but I forced the processes to start and everything is fine.

abrist wrote: Next time you write configuration, watch the io wait on the XI server and the mysql server. Additionally, when load spikes, save a "ps -aef" or "ps -aux" to a text file to post here.

OK

abrist wrote: Is everything working now? Have you attempted to apply configuration or restart the nagios process since?

Yes and Yes. Seemed to want to happen again this morning. I will do it for our testing, I want to get this fixed.

Post by **abrist** » Mon Dec 16, 2013 12:30 pm

Are the hosts still "greyed" out?
Let me know how the restart goes.

Post by **sreinhardt** » Mon Dec 16, 2013 12:45 pm

BanditBBS wrote: It happened Sunday morning at the exact same time an Apply Configuration was processed. It also appeared to want to happen this morning, but I forced the processes to start and everything is fine.

That would be a symptom of the ndo/nagios SQL race condition that abrist was talking about. Are your configs stored locally to the nagios server on a ramdisk or hard disk? Generally this seems to be due to nagios marking everything inactive while it imports the configs, then marking them active again in the DB once everything has been loaded into memory and such. If NDO connects prior to the completed import and re-activation, it wreaks all sorts of havoc. Generally we see this with higher latency systems between nagios and mysql, or if the nagios configs are on a SAN/NAS that is acting poorly. Somehow I'm guessing this isn't normally the case for you, since it hasn't been mentioned before. This is usually noticed on an apply config or nagios service restart.

Resolutions for some customers up to this point have been:
Reduce latency to the mysql server from the nagios server.
Move the nagios configs to your ramdisk as well, ideally rsync them back to the actual hdd so that they are kept and do not require applying before they would be imported in the case of a server reboot.
Move the nagios configs local to the system if on a SAN/NAS, not likely your issue.
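
For the ramdisk option, something along these lines might work. It is only a sketch: /usr/local/nagios/etc is assumed to be the config directory, the on-disk backup path and the 256m tmpfs size are made up, and the function names are hypothetical. It should be tried on a test box first.

```shell
#!/bin/sh
# Sketch only: paths and sizes below are assumptions, not tested XI settings.
ETC_DIR=/usr/local/nagios/etc          # assumed live config directory
BACKUP_DIR=/usr/local/nagios/etc.disk  # hypothetical on-disk copy

# Make directory $2 an exact copy of directory $1, deleting files in $2
# that no longer exist in $1.
mirror_dir() {
    rsync -a --delete "$1/" "$2/"
}

# After each Apply Configuration (or from cron): persist the ramdisk contents.
# First-time setup: run this once while the configs are still on real disk,
# so that BACKUP_DIR is populated before anything is mounted over ETC_DIR.
save_to_disk() {
    mirror_dir "$ETC_DIR" "$BACKUP_DIR"
}

# At boot, before nagios starts: mount a tmpfs over the config directory and
# copy the on-disk backup into it. Requires root.
restore_to_ramdisk() {
    mount -t tmpfs -o size=256m tmpfs "$ETC_DIR" &&
    mirror_dir "$BACKUP_DIR" "$ETC_DIR"
}
```

The idea is that `restore_to_ramdisk` runs once at boot and `save_to_disk` runs after every config change, so a reboot does not lose anything that was applied while the configs lived in RAM.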

Other things that might impact it:
High load or disk I/O on the nagios server when applying config
Backups or other high traffic/disk activity on either mysql or nagios when apply config happens
Other abnormal network traffic while this may have happened

I just wanted to get this info to you and see if it might make sense in your case. It certainly sounds like this is what is happening, as it really only seems to affect large installs with particular optimizations in place.

Post by **BanditBBS** » Mon Dec 16, 2013 12:52 pm

Spenser, you are so much better at explaining than Andy is...but I won't rip on him anymore as he did me a great favor earlier today.

That really sounds like what is happening to me. My disks are local, as my primary XI server happens to be one of only 2 physical boxes in my XI environment (14 total servers now, between XI and gearman workers).

Want to help me out with the "Move the nagios configs to your ramdisk as well, ideally rsync them back to the actual hdd so that they are kept and do not require applying before they would be imported in the case of a server reboot." method of fixing this?

Also, one question that seemed to get lost in my opening post...where the heck is the queue for all the SMS messages? When this happened it caused so many it took all day for them to all send. I want to be able to empty the queue.

Post by **abrist** » Mon Dec 16, 2013 1:10 pm

BanditBBS wrote: Also, one question that seemed to get lost in my opening post...where the heck is the queue for all the SMS messages? When this happened it caused so many it took all day for them to all send. I want to be able to empty the queue.

SMS can be sent one of two ways. If you use the XI mailer, they are sent out immediately, no queue, and may be queued on the carrier's servers. If you send SMS with sendmail, then you can check the queue with "mailq".

BanditBBS wrote: Want to help me out with the "Move the nagios configs to your ramdisk as well, ideally rsync them back to the actual hdd so that they are kept and do not require applying before they would be imported in the case of a server reboot." method of fixing this?

My suspicion is load/io wait on the mysql server, io wait on the nagios server during the config write out process, or network latency.
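
To put a number on I/O wait while the config is being written out, a sketch like this could be run on both the XI and MySQL servers. It assumes the Linux /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal); the function names are made up:

```shell
#!/bin/sh
# Hypothetical helpers for measuring I/O wait between two points in time.

# Print the aggregate "cpu" line from /proc/stat (Linux only).
cpu_sample() {
    grep '^cpu ' /proc/stat
}

# Given two "cpu ..." lines ($1 earlier, $2 later), print the share of CPU
# time spent in iowait between them, as a percentage with two decimals.
# Only fields 2-9 (user through steal) are summed, because the guest fields
# that follow are already counted inside user and nice.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total[NR] = 0
            for (i = 2; i <= 9 && i <= NF; i++) total[NR] += $i
            iow[NR] = $6
        }
        END {
            dt = total[2] - total[1]
            if (dt <= 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * (iow[2] - iow[1]) / dt
        }'
}

# Example use around an Apply Configuration:
#   before=$(cpu_sample)
#   ... run the reconfigure ...
#   after=$(cpu_sample)
#   iowait_pct "$before" "$after"
```

Sampling right before and right after the reconfigure gives one figure for the whole write-out window, which is easier to compare between runs than eyeballing a live monitor.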
I think Spenser is suggesting creating a ramdisk for /usr/local/nagios/etc/ and then rsyncing it to somewhere for a backup.

EDIT: Are you using the Multitech component to send SMS? Or a custom script? If a custom script, what method are you using to send?
