NDO2DB Issue out of the blue

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

NDO2DB Issue out of the blue

Post by BanditBBS »

So, I'm at a ball game Friday night and get a call at 8:30 from the helpdesk saying they haven't received any notifications in the past 2 hours. From my cell I logged into XI and saw all green checks in the upper right, but it looked like no checks had been done in the past 15 minutes. I did an apply configuration and everything started working again. I got another call at 00:30 on Saturday and ended up doing the same thing. From that point I started to "babysit" the server.

I was out of the office Saturday through Tuesday and busy all 4 days, so I didn't have any time to investigate; I just kept applying configs throughout Saturday whenever checks seemed to stop happening. Finally, on Saturday night or Sunday (can't remember), I spent a little time investigating. Checks were still being performed, but NDO2DB was using 100% CPU and not doing any of its work?!? So it seems the mysterious NDO2DB issue has come back to me, after not happening once since I offloaded it back in December. Throughout Sunday and Monday I just kept restarting NDO2DB.

On Monday the load on my Nagios server started going haywire. Normally I have ~500 total processes and ~2000 active service checks in the past minute. I noticed those numbers changing rather drastically: total processes were hovering around 1000, the last-minute checks started fluctuating between a couple hundred and 6,000, and the load on the Nagios server went from 2-5 to 100-200.

I finally came into the office on Tuesday and restarted both the XI server and the offloaded DB/NDO2DB server. Everything seemed fine from that point. The only difference from before Friday was that total processes were bouncing between 380 and 550, but overnight that seems to have calmed down to a smoother number.

Now, all of a sudden this morning, NDO2DB went to 100% again. I did an apply config and then had 2 NDO processes consuming 100%. I then restarted NDO2DB, and now I am back to the number of active checks fluctuating like crazy, but at least things are being processed. The load on my XI server is slowly climbing, though, and is currently at 7.5. Total processes are over 600 now too.

I just don't know where to start on this one!
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: NDO2DB Issue out of the blue

Post by tmcdonald »

  • Run the attached script, redirect output to /tmp/time.pl and send that back. You may need to edit the file to point to status.dat if it is not in the default location (line 40). Obfuscate output as needed, but it is just host and service names with runtimes.
  • Run ipcs -q and post back
  • Are there any relevant entries in the mysqld log?
Former Nagios employee

Re: NDO2DB Issue out of the blue

Post by BanditBBS »

Did you forget to attach?

Also, forgot to mention: I am seeing quite a few httpd processes using 90%+ CPU on the XI server, which is odd too.
The error I saw in the log last Friday when stuff broke: "Aug 14 18:22:29 iss-chi-nag05 nagios: Warning: A system time change of 930 seconds (0d 0h 15m 30s forwards in time) has been detected. Compensating.."

Code: Select all

[root@iss-chi-nag09 ~]# ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xe6070002 0          nagios     600        131072000    128000
0x41070002 131073     nagios     600        131072000    128000
0x66070002 262146     nagios     600        84102144     82131
0x14070002 294915     nagios     600        0            0
0xc2070002 360452     nagios     600        0            0

Re: NDO2DB Issue out of the blue

Post by tmcdonald »

I did forget to attach. I will do so now, but looking at your ipcs output I don't think that will reveal anything. Run it anyway and we'll see.

One thing Scott pointed out was this thread in which NTP was acting up and fighting the vmware clock:

https://support.nagios.com/forum/viewto ... 95#p110363

I wanna see if getting the clock back on track will help before we dig deeper.

Re: NDO2DB Issue out of the blue

Post by BanditBBS »

Attached is the output.

NTP all seems fine. I verified everything on the VM host and on both the XI server and the offloaded DB server; everything is perfectly in sync, and it was when I first noticed that error as well.

We've done a couple of apply configs this morning so far, and after the 2nd one the checks seem to have calmed down, as has the monitoring engine:
Capture.PNG
Edit: Still seeing httpd on XI taking lots of CPU and NDO at 100%, but everything seems to be working fine for the moment.

Re: NDO2DB Issue out of the blue

Post by tmcdonald »

What's the check interval like for PRODEBS - CRM CM JOBS? Looking at the output of the time.pl script, it takes an average of 630 seconds to run. If it is being run every 5 minutes this could compound. I would work from the bottom of that list up and try to reduce the check times as much as possible. That may not directly solve the current issue, but it's a step in the right direction.
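
For anyone following along without the attachment, here is a rough Python sketch of the same idea (it is not the attached script): it reads the `servicestatus` blocks in status.dat and ranks services by `check_execution_time`. The function names are made up for illustration, and the block layout is assumed to be the usual `name {` / `key=value` / `}` form.

```python
import re

# Matches the opening line of a status.dat block, e.g. "servicestatus {".
BLOCK_RE = re.compile(r"^(\w+)\s*\{\s*$")


def parse_status(text):
    """Yield (block_type, fields) for every block found in status.dat text."""
    block_type, fields = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if block_type is None:
            match = BLOCK_RE.match(line)
            if match:
                block_type, fields = match.group(1), {}
        elif line == "}":
            yield block_type, fields
            block_type = None
        elif "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value


def slowest_services(text, top=10):
    """Return (runtime, host, service) tuples, slowest first."""
    rows = []
    for block_type, fields in parse_status(text):
        if block_type != "servicestatus":
            continue  # skip hoststatus, programstatus, etc.
        try:
            runtime = float(fields.get("check_execution_time", ""))
        except ValueError:
            continue  # block without a usable runtime
        rows.append((runtime,
                     fields.get("host_name", "?"),
                     fields.get("service_description", "?")))
    rows.sort(reverse=True)
    return rows[:top]
```

Feed it the contents of status.dat (commonly /usr/local/nagios/var/status.dat, but check your own install). Note this sketch puts the slowest checks at the top rather than the bottom.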

I am not sure on the NDO issue. As I believe you know, we are battling a very weird almost edge case bug with it, but that's all in the hands of the devs now and all we can say from a support perspective is to restart when the queue gets full.
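
If you want to catch the "queue gets full" condition without eyeballing it, something like the sketch below could parse `ipcs -q` output and flag queues that are close to a byte limit. The limit is passed in as a parameter; on Linux the per-queue default comes from `kernel.msgmnb`, but the value to use on your box is an assumption you should verify. The function names and the 95% threshold are hypothetical.

```python
def parse_ipcs_queues(text):
    """Parse the data rows of `ipcs -q` output into a list of dicts."""
    queues = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have six columns and a hex key; skip banners and headers.
        if len(parts) != 6 or not parts[0].startswith("0x"):
            continue
        key, msqid, owner, _perms, used_bytes, messages = parts
        queues.append({
            "key": key,
            "msqid": int(msqid),
            "owner": owner,
            "used_bytes": int(used_bytes),
            "messages": int(messages),
        })
    return queues


def full_queues(queues, limit_bytes, fill_ratio=0.95):
    """Return the queues whose used bytes are at or above fill_ratio of the limit."""
    return [q for q in queues if q["used_bytes"] >= fill_ratio * limit_bytes]
```

Run against the output posted earlier with an assumed limit of 131072000 bytes, this would flag msqid 0 and 131073 and leave the 84102144-byte queue alone.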

As for HTTP, the usual questions apply:
  • How many people are accessing the interface?
  • Are a lot of reports being run?
  • Is anything hitting the backend API hard?

Re: NDO2DB Issue out of the blue

Post by BanditBBS »

tmcdonald wrote:What's the check interval like for PRODEBS - CRM CM JOBS? Looking at the output of the time.pl script, it takes an average of 630 seconds to run. If it is being run every 5 minutes this could compound. I would work from the bottom of that list up and try to reduce the check times as much as possible. That may not directly solve the current issue, but it's a step in the right direction.
service_check_timeout=630, so that's where that 630 is coming from. The check interval on that check is 20 minutes, and it's something we are trying to fix and make work within the 630. I think we have it working in 5 minutes now.
tmcdonald wrote:I am not sure on the NDO issue. As I believe you know, we are battling a very weird almost edge case bug with it, but that's all in the hands of the devs now and all we can say from a support perspective is to restart when the queue gets full.
Yeah, I just haven't had an issue with NDO in at least 9 months, so no clue why it all of a sudden started happening, especially when zero changes were made to Nagios the day it started to wig out.
tmcdonald wrote:As for HTTP, the usual questions apply:
  • How many people are accessing the interface?
  • Are a lot of reports being run?
  • Is anything hitting the backend API hard?
1.) Not sure, wish I knew how I could tell, but I saw the same thing over the weekend, when the answer should be "almost none"
2.) Possibly, but I saw the same thing over the weekend, when the answer should have been no
3.) Not a single item

Re: NDO2DB Issue out of the blue

Post by BanditBBS »

Here is another interesting fact I just recalled.....

Over the weekend I stopped Nagios a few times, but I could still see checks being performed. If I run "service nagios stop" and it stops, how can checks still be getting performed?

Re: NDO2DB Issue out of the blue

Post by tmcdonald »

!

ps -ef | grep bin/nagios

Sounds like you have multiples.
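
To make that check repeatable, here is a minimal Python sketch that takes `ps -ef` output and lists the Nagios master daemons. It assumes a master is a `bin/nagios -d` process whose parent PID is 1, and that the columns are the standard UID, PID, PPID, C, STIME, TTY, TIME, CMD. The function name is made up.

```python
def nagios_masters(ps_text):
    """Return (pid, start_time) for each `nagios -d` process parented by PID 1."""
    masters = []
    for line in ps_text.splitlines():
        # Split into the seven fixed columns plus the full command line.
        parts = line.split(None, 7)
        if len(parts) < 8:
            continue
        _uid, pid, ppid, _c, stime, _tty, _time, cmd = parts
        if not (pid.isdigit() and ppid.isdigit()):
            continue  # header row or unrelated text
        # Workers (--worker) and forked children have the master as parent,
        # so only daemons whose parent is PID 1 count as masters.
        if "bin/nagios -d" in cmd and int(ppid) == 1:
            masters.append((int(pid), stime))
    return masters
```

More than one entry in the result means more than one daemon is scheduling checks at the same time.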

Re: NDO2DB Issue out of the blue

Post by BanditBBS »

I think so.....

Code: Select all

[root@iss-chi-nag05 ~]# ps -ef | grep bin/nagios
nagios    7859     1 14 15:30 ?        00:03:12 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    7861  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7862  7859  0 15:30 ?        00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7863  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7864  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7865  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7866  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7867  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7868  7859  0 15:30 ?        00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7869  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7870  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7871  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7872  7859  0 15:30 ?        00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7873  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7874  7859  0 15:30 ?        00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7875  7859  0 15:30 ?        00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7876  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7877  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7878  7859  0 15:30 ?        00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7879  7859  0 15:30 ?        00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7880  7859  0 15:30 ?        00:00:05 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7881  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7882  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7883  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    7884  7859  0 15:30 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    8166  7859  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root      9756  8838  0 15:53 pts/2    00:00:00 grep bin/nagios
nagios   15856     1  4 14:34 ?        00:03:40 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   15858 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15859 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15860 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15861 15856  0 14:34 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15862 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15863 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15864 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15865 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15866 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15867 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15868 15856  0 14:34 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15869 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15870 15856  0 14:34 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15871 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15872 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15873 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15874 15856  0 14:34 ?        00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15875 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15876 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15877 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15878 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15879 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15880 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15881 15856  0 14:34 ?        00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15933 15856  0 14:34 ?        00:00:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   22989     1  4 08:39 ?        00:21:29 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   22991 22989  0 08:39 ?        00:00:23 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   22992 22989  0 08:39 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   22993 22989  0 08:39 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   22994 22989  0 08:39 ?        00:00:22 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   22995 22989  0 08:39 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   22996 22989  0 08:39 ?        00:00:23 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   22997 22989  0 08:39 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   22998 22989  0 08:39 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   22999 22989  0 08:39 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23000 22989  0 08:39 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23001 22989  0 08:39 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23002 22989  0 08:39 ?        00:00:22 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23003 22989  0 08:39 ?        00:00:22 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23004 22989  0 08:39 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23005 22989  0 08:39 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23006 22989  0 08:39 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23007 22989  0 08:39 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23008 22989  0 08:39 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23009 22989  0 08:39 ?        00:00:22 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23010 22989  0 08:39 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23011 22989  0 08:39 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23012 22989  0 08:39 ?        00:00:26 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23013 22989  0 08:39 ?        00:00:22 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23014 22989  0 08:39 ?        00:00:20 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   23119 22989  0 08:39 ?        00:00:04 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Now I am not sure why that is happening, or what band-aid I can apply when I see it.