NDO2DB Issue out of the blue
So, I'm at a ball game Friday night and get a call at 8:30 from the helpdesk saying they haven't received any notifications in the past two hours. From my cell I logged into XI and saw all green checks in the upper right, but it looked like no checks had run in the past 15 minutes. I did an apply configuration and everything started working again. I got another call at 00:30 on Saturday and ended up doing the same thing. From that point I started to "babysit" the server.
I was out of the office Saturday through Tuesday and busy all four days, so I had no time to investigate; I just continued to apply configs throughout Saturday whenever checks seemed to stop happening. Finally, Saturday night or Sunday (can't remember which), I spent a little time investigating. Checks were still being performed, but NDO2DB was using 100% CPU and not doing any of its work?!? So it seems the mysterious NDO2DB issue came back to me, after having gone away and never recurring since I offloaded it back in December. Throughout Sunday and Monday I just kept restarting NDO2DB.
On Monday the load on my Nagios server started going haywire. Normally I have ~500 total processes and ~2,000 active service checks in the past minute, and those numbers started changing drastically: total processes hovered around 1,000, the last-minute checks fluctuated between a couple hundred and 6,000, and the load on the Nagios server went from 2-5 to 100-200. I finally came into the office on Tuesday and restarted both the XI server and the offloaded DB/NDO2DB server. Everything seemed fine from that point. The only difference from before Friday was that total processes bounced between 380 and 550, but overnight it seems to have calmed down to a steadier number.
Now, all of a sudden this morning, NDO2DB went to 100% again. I did an apply config and then had two NDO processes consuming 100%. I then restarted NDO2DB and now I'm back to active check counts fluctuating like crazy, but at least things are being processed. The load on my XI server is slowly climbing, though, and is currently at 7.5. Total processes are over 600 now too.
I just don't know where to start on this one!
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
Re: NDO2DB Issue out of the blue
- Run the attached script, redirect output to /tmp/time.pl and send that back. You may need to edit the file to point to status.dat if it is not in the default location (line 40). Obfuscate output as needed, but it is just host and service names with runtimes.
- Run ipcs -q and post back
- Are there any relevant entries in the mysqld log?
Former Nagios employee
Re: NDO2DB Issue out of the blue
Did you forget to attach?
Also, forgot to mention: I'm seeing quite a few httpd processes using 90%+ CPU on the XI server, which is odd too.
The error I saw in the log last Friday when things broke: "Aug 14 18:22:29 iss-chi-nag05 nagios: Warning: A system time change of 930 seconds (0d 0h 15m 30s forwards in time) has been detected. Compensating.."
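A 930-second forward step like that usually means two time sources are fighting, here most likely VMware's periodic guest time sync versus ntpd. One common mitigation for VMware guests, sketched below and worth checking against VMware's own timekeeping guidance before applying, is to let ntpd own the clock:

```
# /etc/ntp.conf -- let ntpd correct large offsets instead of exiting
# (the default panic threshold is 1000s; 0 disables it)
tinker panic 0
```

Pair this with disabling periodic sync in the guest's VMware Tools (for example, `vmware-toolbox-cmd timesync disable`) so that only one source ever adjusts the clock.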
Code: Select all
[root@iss-chi-nag09 ~]# ipcs -q
------ Message Queues --------
key         msqid    owner    perms   used-bytes   messages
0xe6070002  0        nagios   600     131072000    128000
0x41070002  131073   nagios   600     131072000    128000
0x66070002  262146   nagios   600     84102144     82131
0x14070002  294915   nagios   600     0            0
0xc2070002  360452   nagios   600     0            0
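For what it's worth, the first two queues in that output are sitting at 131,072,000 used-bytes with 128,000 messages, i.e. completely full, which fits NDO2DB spinning without draining them. A quick way to flag saturated queues (a sketch; the function name and the 100 MB threshold are my own, adjust the threshold to sit just under your msgmnb):

```shell
# Flag SysV message queues that are at or near their byte limit.
# Feed it the output of `ipcs -q`; data lines start with a 0x key,
# column 5 is used-bytes and column 6 is the message count.
flag_full_queues() {
    limit=${1:-104857600}   # ~100 MB; assumption, tune to your msgmnb
    awk -v limit="$limit" '
        /^0x/ && $5+0 >= limit {
            printf "queue %s (msqid %s): %s bytes, %s messages\n", $1, $2, $5, $6
        }'
}

# Usage: ipcs -q | flag_full_queues
```

Dropping that into a cron job or a passive check would at least tell you the queue is backing up before the helpdesk does.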
Re: NDO2DB Issue out of the blue
I did forget to attach. I will do so now, but looking at your ipcs output I don't think that will reveal anything. Run it anyway and we'll see.
One thing Scott pointed out was this thread in which NTP was acting up and fighting the vmware clock:
https://support.nagios.com/forum/viewto ... 95#p110363
I wanna see if getting the clock back on track will help before we dig deeper.
Re: NDO2DB Issue out of the blue
Attached is the output.
NTP all seems fine. I verified everything on the VM host and on both the XI server and the offloaded DB server; everything is perfectly in sync, and it was when I first noticed that error as well.
We've done a couple of apply configs this morning so far, and after the second one the checks seem to have calmed down, as has the monitoring engine.
edit: Still seeing httpd on XI taking lots of CPU and ndo at 100%, but everything seems to be working fine for the moment.
Re: NDO2DB Issue out of the blue
What's the check interval like for PRODEBS - CRM CM JOBS? Looking at the output of the time.pl script, it takes an average of 630 seconds to run. If it is being run every 5 minutes this could compound. I would work from the bottom of that list up and try to reduce the check times as much as possible. That may not directly solve the current issue, but it's a step in the right direction.
I am not sure on the NDO issue. As I believe you know, we are battling a very weird, almost edge-case bug with it, but that's all in the hands of the devs now, and all we can say from a support perspective is to restart when the queue gets full.
As for HTTP, the usual questions apply:
- How many people are accessing the interface?
- Are a lot of reports being run?
- Is anything hitting the backend API hard?
Re: NDO2DB Issue out of the blue
tmcdonald wrote: What's the check interval like for PRODEBS - CRM CM JOBS? Looking at the output of the time.pl script, it takes an average of 630 seconds to run. If it is being run every 5 minutes this could compound. I would work from the bottom of that list up and try to reduce the check times as much as possible. That may not directly solve the current issue, but it's a step in the right direction.
The service_check_timeout=630, so that's where that 630 is coming from. The check interval on that check is 20 minutes, and it's something we are trying to fix to make it work within the 630. I think we have it working in 5 minutes now.
tmcdonald wrote: I am not sure on the NDO issue. As I believe you know, we are battling a very weird almost edge case bug with it, but that's all in the hands of the devs now and all we can say from a support perspective is to restart when the queue gets full.
Yeah, I just haven't had an issue with NDO in at least 9 months, so no clue why it all of a sudden started happening, especially when zero changes were made to Nagios the day it started to wig out.
tmcdonald wrote: As for HTTP, the usual questions apply:
- How many people are accessing the interface?
- Are a lot of reports being run?
- Is anything hitting the backend API hard?
1.) Not sure; I wish I knew how I could tell, but I saw the same over the weekend when the answer should have been "almost none."
2.) Possibly, but I saw the same thing over the weekend when the answer should have been no.
3.) Not a single item.
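For anyone following along, the interaction between the two settings looks like this in the config (illustrative only; the host name, retry interval, and max attempts below are made up, the rest come from the thread):

```
# nagios.cfg -- global cap on how long any single service check may run
service_check_timeout=630

# object config -- check_interval is in time units (minutes by default),
# so this check fires every 20 minutes and each run is allowed up to 630s
define service {
    host_name            prodebs-db1                ; hypothetical
    service_description  PRODEBS - CRM CM JOBS
    check_interval       20
    retry_interval       5                          ; hypothetical
    max_check_attempts   3                          ; hypothetical
}
```

A check whose runtime regularly approaches service_check_timeout ties up a worker for the whole interval, which is why working from the slowest checks upward is the suggested order.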
Re: NDO2DB Issue out of the blue
Here is another interesting fact I just recalled...
Over the weekend I stopped Nagios a few times but I could still see checks being performed. If I "service nagios stop" and it stops, how can checks still be getting performed?
Re: NDO2DB Issue out of the blue
ps -ef | grep bin/nagios
Sounds like you have multiples.
Re: NDO2DB Issue out of the blue
I think so...
Now I'm not sure why that is happening, or what band-aid I can apply when I see it.
Code: Select all
[root@iss-chi-nag05 ~]# ps -ef | grep bin/nagios
nagios 7859 1 14 15:30 ? 00:03:12 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 7861 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7862 7859 0 15:30 ? 00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7863 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7864 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7865 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7866 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7867 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7868 7859 0 15:30 ? 00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7869 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7870 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7871 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7872 7859 0 15:30 ? 00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7873 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7874 7859 0 15:30 ? 00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7875 7859 0 15:30 ? 00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7876 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7877 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7878 7859 0 15:30 ? 00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7879 7859 0 15:30 ? 00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7880 7859 0 15:30 ? 00:00:05 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7881 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7882 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7883 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7884 7859 0 15:30 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 8166 7859 0 15:31 ? 00:00:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 9756 8838 0 15:53 pts/2 00:00:00 grep bin/nagios
nagios 15856 1 4 14:34 ? 00:03:40 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 15858 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15859 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15860 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15861 15856 0 14:34 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15862 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15863 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15864 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15865 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15866 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15867 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15868 15856 0 14:34 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15869 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15870 15856 0 14:34 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15871 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15872 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15873 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15874 15856 0 14:34 ? 00:00:03 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15875 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15876 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15877 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15878 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15879 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15880 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15881 15856 0 14:34 ? 00:00:02 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15933 15856 0 14:34 ? 00:00:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 22989 1 4 08:39 ? 00:21:29 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 22991 22989 0 08:39 ? 00:00:23 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 22992 22989 0 08:39 ? 00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 22993 22989 0 08:39 ? 00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 22994 22989 0 08:39 ? 00:00:22 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 22995 22989 0 08:39 ? 00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 22996 22989 0 08:39 ? 00:00:23 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 22997 22989 0 08:39 ? 00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 22998 22989 0 08:39 ? 00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 22999 22989 0 08:39 ? 00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23000 22989 0 08:39 ? 00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23001 22989 0 08:39 ? 00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23002 22989 0 08:39 ? 00:00:22 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23003 22989 0 08:39 ? 00:00:22 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23004 22989 0 08:39 ? 00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23005 22989 0 08:39 ? 00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23006 22989 0 08:39 ? 00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23007 22989 0 08:39 ? 00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23008 22989 0 08:39 ? 00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23009 22989 0 08:39 ? 00:00:22 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23010 22989 0 08:39 ? 00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23011 22989 0 08:39 ? 00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23012 22989 0 08:39 ? 00:00:26 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23013 22989 0 08:39 ? 00:00:22 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23014 22989 0 08:39 ? 00:00:20 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 23119 22989 0 08:39 ? 00:00:04 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
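That listing shows three master daemons (PIDs 7859, 15856, and 22989, each a `-d` process with PPID 1), every one dragging along its own worker pool, which would explain both the runaway process count and checks continuing after a stop. A small sketch (function name is mine) to count masters from `ps -ef` output:

```shell
# Count Nagios master daemons in `ps -ef` output. A master runs
# "nagios -d <cfg>" with PPID 1 (column 3); workers run with --worker
# and are children of a master, so neither matches here.
count_masters() {
    awk '$3 == 1 && /bin\/nagios -d/ { n++; print "master PID " $2 }
         END { print n+0 " master(s)" }'
}

# Usage: ps -ef | count_masters
```

If it ever reports more than one, the usual band-aid is: service nagios stop, kill any leftover master PIDs it printed, then start a single fresh daemon.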