Nagios stops checking!!!

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Post by BanditBBS »

XI 2014r2.0

Approximately 13,000 total checks. Twice in the past 24 hours, Nagios has suddenly stopped checking. Everything appears to be running as it should, but the load drops to basically nothing and no checks are being performed. Also, for the last few days, whenever checks are being performed the load has spiked hideously throughout the day, which is of course when the most people are using the web interface.

I just had to reboot the box, and even then it didn't work right and I ended up having to restart the services. I think it's working properly now, but who knows for how long.
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github

Post by BanditBBS »

The server was doing a total of about 2300 checks per minute. My workers were set to 24 and my service check timeout is set to 2 minutes. There is no way it was able to keep up with the checks, right? Could that be what has been making it wig out lately, and not the update to 2.0? I just doubled the workers to 48 and the load has been 2.x for the past 15 minutes with no signs of anything breaking... As I typed that it went to 3.x, but I'd be happy with anything under 8.
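
For a rough sanity check of those numbers, here is a back-of-the-envelope sketch using Little's law (checks in flight = arrival rate x time each check runs). The 2300 checks per minute and the 2 minute timeout come from the post; the 2 second typical check duration and the `checks_in_flight` helper are assumptions made up for illustration. It does not say how many checks a single worker can handle, which depends on the Nagios Core version in use.

```shell
#!/bin/sh
# Rough concurrency estimate:
#   checks in flight = checks per minute * seconds each check runs / 60
# 2300 checks/minute and the 120 s timeout are from the post above;
# the 2 s "typical" duration is a hypothetical value, not a measurement.
checks_in_flight() {
    # $1 = checks per minute, $2 = seconds each check stays running
    awk -v rate="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", rate * secs / 60 }'
}

echo "typical (assumed 2 s per check): $(checks_in_flight 2300 2) checks in flight"
echo "worst case (every check runs to the 120 s timeout): $(checks_in_flight 2300 120) checks in flight"
```

If most checks finish in a couple of seconds, only a few dozen are running at any moment; the picture only gets ugly when many checks hang until the timeout.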

Edit: Welp, I checked at ~9:30 and the issue had just started happening. I can see nothing out of the ordinary happening on the machine. I let it sit for a while, and the load had spiked to 100+. After about 5 minutes it calmed down and checks started running again. Watching it longer, the spikes keep happening and throwing the schedule off.

Edit: Noon check-in... Apparently it did it again around 10:22, judging by the perf data stopping. I just checked it again and there was nothing scheduled and the load was 0.12. Something is really going wrong here with the scheduling.

Edit: It's now 2:30 on Sunday. It has been working fine for the past 34 hours. I just turned off the auto rescheduling and the schedule hasn't stopped once yet. The CPU is still spiking pretty darn high throughout the daylight hours, but at least I don't have to restart.

Edit: Perhaps the number of users leaving dashboards up and running during the day is what is causing my load spike. Is there any setting I can change/check to make httpd less of a load killer?
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Post by lmiltchev »

BanditBBS wrote: I just turned off the auto rescheduling and the schedule hasn't stopped once yet.
I've already seen a few cases where disabling the auto rescheduling option in nagios.cfg fixed scheduling issues. I will talk to our developers to see if they can figure out what is causing the issue.
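
For anyone wanting to confirm how that option is currently set, here is a minimal sketch. It assumes the option being discussed is the `auto_reschedule_checks` directive (0 = disabled, 1 = enabled) and that nagios.cfg lives at `/usr/local/nagios/etc/nagios.cfg`; the `nagios_cfg_value` helper and the `NAGIOS_CFG` variable are hypothetical names, not part of Nagios.

```shell
#!/bin/sh
# Print the last value assigned to a directive in a key=value config file.
# nagios_cfg_value is a hypothetical helper name, not a Nagios tool.
nagios_cfg_value() {
    # $1 = config file, $2 = directive name
    awk -F= -v key="$2" '$1 == key { val = $2 } END { if (val != "") print val }' "$1"
}

# Assumed path; override with NAGIOS_CFG if your install differs.
CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"

if [ -r "$CFG" ]; then
    value=$(nagios_cfg_value "$CFG" auto_reschedule_checks)
    echo "auto_reschedule_checks=${value:-<not set>}"
fi
```

Commented-out lines are ignored because the directive name has to match the start of the line exactly. A change to nagios.cfg only takes effect after the nagios service is restarted.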
BanditBBS wrote: Perhaps the number of users leaving dashboards up and running during the day is what is causing my load spike. Is there any setting I can change/check to make httpd less of a load killer?
Are these spikes happening at any particular time of the day? You could try reducing the load from the httpd process by installing an opcode cache, which avoids recompiling the PHP scripts on every page load.

Code:

yum install php-pecl-apc -y
service httpd restart
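
After the restart, a quick way to see whether the extension is actually picked up is to look for it in the module list. This is only a sketch: `module_listed` is a hypothetical helper name, and `php -m` reports what the PHP command-line binary loads, which is assumed here to match what Apache's PHP loads.

```shell
#!/bin/sh
# Succeeds if the named module appears as a whole line (case-insensitive)
# in a "php -m" style listing read from stdin.
# module_listed is a hypothetical helper name.
module_listed() {
    grep -qix "$1"
}

# Only run the real check when the php CLI is available.
if command -v php >/dev/null 2>&1; then
    if php -m | module_listed apc; then
        echo "APC is listed by php -m"
    else
        echo "APC is not listed by php -m"
    fi
fi
```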
Be sure to check out our Knowledgebase for helpful articles and solutions!

Post by BanditBBS »

lmiltchev wrote: Any particular time of the day these spikes are happening? You could try reducing the load from the httpd process by installing an opcode cache, which avoids recompiling the PHP scripts on every page load.
All day during CST business hours. It's just crazy. Luckily the site still works and the schedule seems to continue working as well, just a tad slow. I just installed the cache and will restart Apache later.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Post by tmcdonald »

Both opcode caching and some sort of caching proxy would help. A caching proxy for static things that don't need to be regenerated, and opcode caching so the PHP code behind the database queries, array sorting, etc. isn't recompiled on every request.
Former Nagios employee

Post by BanditBBS »

tmcdonald wrote: Both opcode caching and some sort of caching proxy would help.
Want to recommend something simple to install? :)

Post by tmcdonald »

BanditBBS wrote: Want to recommend something simple to install? :)
Simple? Heh. Heh. No.

Squid is a nice reverse/caching proxy, but I cannot say it is simple. It has a ton of features and config options, and with them comes a bit of a learning curve.

http://wiki.squid-cache.org/SquidFaq/ConfiguringSquid

As for the PHP opcode caching, APC is the way to go as ludmil pointed out.
mrochelle
Posts: 238
Joined: Fri May 04, 2012 11:20 am
Location: Heart of America

Post by mrochelle »

I have not experienced the load spikes described, but I'm joining the conversation because I logged in to report the same "Nagios stops checking" problem: I've experienced 3 such incidents from this past weekend through this morning. As BanditBBS indicated, the load drops to a minimum and checks go down to zero. I can find no errors of any kind, and the logs appear normal. I'm attaching a screenshot from the last occurrence, this morning at 05:31 AM. A restart of Nagios gets everything back to normal.
Also, for the record, the ndo2db process is OK (under 30%) during these incidents.
NagiosSS1_12092014_0531am.PNG
Nagios 2014R2.0
CentOS release 6.3
Last edited by mrochelle on Tue Dec 09, 2014 12:28 pm, edited 1 time in total.

Post by BanditBBS »

To be honest, I think the spiking issue is related to system resources, and I am making changes to take care of that.

I haven't seen the checks-stopping issue since I turned off the check leveling. It is weird, though, that we both saw it start at the same time.

Post by mrochelle »

When you say you turned off the check leveling, do you mean you disabled the auto rescheduling option in nagios.cfg?