Re: NDO2DB Issue out of the blue

Posted: Wed Aug 19, 2015 4:03 pm
by tmcdonald

Code:

kill -9 7859
kill -9 15856
kill -9 22989
Then run the ps again to make sure nothing is running. After that, do a "service nagios start" and ps once more; you should just see a sandwich of "nagios -d, workers, nagios -d".

Re: NDO2DB Issue out of the blue

Posted: Wed Aug 19, 2015 4:14 pm
by BanditBBS
Yeah Trevor, that does what we want, but the question is, what causes it? I rebooted the server yesterday at 11am, so what between then and now caused 2 extra?

Re: NDO2DB Issue out of the blue

Posted: Wed Aug 19, 2015 4:24 pm
by tmcdonald
That's a lot harder to say. I suppose a large environment could cause a "service nagios restart" to fail, or the Apply Config to not properly kill off the old nagios process before starting a new one. Is there any pattern to the multiple processes? Do they happen after every Apply Config or randomly? Does it ever happen spontaneously, like without running Apply Config or otherwise resetting nagios?

Re: NDO2DB Issue out of the blue

Posted: Wed Aug 19, 2015 4:27 pm
by BanditBBS
tmcdonald wrote: That's a lot harder to say. I suppose a large environment could cause a "service nagios restart" to fail, or the Apply Config to not properly kill off the old nagios process before starting a new one. Is there any pattern to the multiple processes? Do they happen after every Apply Config or randomly? Does it ever happen spontaneously, like without running Apply Config or otherwise resetting nagios?

I just did an apply config and everything is still fine. I'll keep an eye on the number of nagios processes running and try to figure out what's causing it. Hopefully it behaves from now until tomorrow morning at least :) Keep this open please (but feel free to get off your dashboard :) )

Re: NDO2DB Issue out of the blue

Posted: Wed Aug 19, 2015 4:36 pm
by tmcdonald
Let us know. Realistically we could write a check that sees if multiples are running, and if so does a killall nagios. Overkill perhaps, and a band-aid to be sure, but just a thought.
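
For anyone wanting to try the check idea above, here is a rough sketch of the detection half only. The function names are made up for illustration, and it assumes that a healthy instance shows a parent "nagios -d" plus one child "nagios -d" (the sandwich described earlier in the thread), so it counts only those "nagios -d" processes whose parent is not another "nagios -d". The killall step is deliberately left out.

```shell
#!/bin/sh
# Sketch only: detect more than one independent "nagios -d" instance.

# count_root_daemons: read "PID PPID ARGS..." lines on stdin and print how
# many "nagios -d" processes have a parent that is NOT another "nagios -d".
# The [n] in the pattern keeps this awk process itself from matching.
count_root_daemons() {
    awk '
        /[n]agios -d/ { pid[$1] = 1; ppid[$1] = $2 }
        END {
            n = 0
            for (p in pid)
                if (!(ppid[p] in pid))
                    n++
            print n
        }'
}

# status_for_count: map an instance count to a Nagios plugin style message
# and return code (0 = OK, 2 = CRITICAL).
status_for_count() {
    if [ "$1" -eq 1 ]; then
        echo "OK - 1 nagios -d instance running"
        return 0
    elif [ "$1" -eq 0 ]; then
        echo "CRITICAL - no nagios -d daemon found"
        return 2
    fi
    echo "CRITICAL - $1 separate nagios -d instances running"
    return 2
}

# check_nagios_instances: run the check against the live process table.
check_nagios_instances() {
    status_for_count "$(ps -eo pid=,ppid=,args= | count_root_daemons)"
}
```

To use it as a plugin or from cron, call check_nagios_instances at the end of the script and exit with its return code. Matching on the command line is crude, so anything else with "nagios -d" in its arguments would be counted too.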

Re: NDO2DB Issue out of the blue

Posted: Wed Aug 19, 2015 9:47 pm
by BanditBBS
tmcdonald wrote: Let us know. Realistically we could write a check that sees if multiples are running, and if so does a killall nagios. Overkill perhaps, and a band-aid to be sure, but just a thought.

Well...things aren't well. I checked it once after I got home and noticed the active check numbers had started to fall and reached 0, so I restarted NDO2DB and all was well for a few hours. I even set the WiiU remote up to my XI page so I could keep an eye on it in the living room. Around 9:30pm I noticed it was again showing no active checks in the past 1, 5, 15 mins. Easy fix for me was to do an apply config from the WiiU Remote. That fixed it and all was well for just a few minutes (maybe 10-15). At that point I saw the first check mark was a red exclamation, indicating the monitoring engine was not running. So I tell it to start (using the GUI) and then go to my home office and do a ps. Well, now there are two instances of nagios running. I killed the first one, leaving the new one that was just started, and the red exclamation turned into a green check after a few moments.

This is becoming nuts. I can't sleep tonight because I may be needed to troubleshoot this more if NDO stops or whatever. I can't for the life of me figure out why this started out of the blue like this, but I desperately need some more help now; we can't continue like this.

EDIT: Just did an apply and now two processes running again
EDIT2: I'm really thinking about undoing the offload of ndo2db....hmmm

Re: NDO2DB Issue out of the blue

Posted: Wed Aug 19, 2015 10:02 pm
by Box293
We could try defining how many check workers are allowed to start and see if that helps; reducing the number may help.

https://assets.nagios.com/downloads/nag ... gmain.html
Check Workers

Format: check_workers=<#>
Example: check_workers=10

This setting specifies how many worker processes should be started when Nagios Core starts. Worker processes are used to perform host and service checks. If the number of workers is not specified, a default number of workers is determined based on the number of CPU cores on the system (1.5 workers per core). If not specified, there is always a minimum of 4 workers.
Maybe try 16.
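
As a rough illustration of applying that setting from the command line, here is a small sketch. The helper name set_check_workers is hypothetical, and the location of the main config file is an assumption that depends on the install, so the path is passed in as an argument instead of being hard-coded.

```shell
#!/bin/sh
# Sketch only: set check_workers=N in a Nagios main config file.
# Usage: set_check_workers /path/to/nagios.cfg 16
# (the config path is an assumption; use whatever your install has)
set_check_workers() {
    cfg=$1
    n=$2
    tmp=$(mktemp) || return 1
    # Replace the first active check_workers= line and drop any later ones;
    # if the file has none, append the setting at the end.
    if awk -v n="$n" '
        /^[ \t]*check_workers=/ {
            if (!done) { print "check_workers=" n; done = 1 }
            next
        }
        { print }
        END { if (!done) print "check_workers=" n }
    ' "$cfg" > "$tmp"; then
        # Write back through the original file so its owner and mode stay put.
        cat "$tmp" > "$cfg"
        rc=$?
    else
        rc=1
    fi
    rm -f "$tmp"
    return "$rc"
}
```

The new value only takes effect after Nagios is restarted, for example with the "service nagios restart" mentioned earlier in the thread. Back up the file before editing it this way.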

Re: NDO2DB Issue out of the blue

Posted: Wed Aug 19, 2015 10:17 pm
by BanditBBS
I did knock it down to 18, will try 16. I'm all for trying things, but I will keep mentioning, it was running fine for so long, with just threshold changes and other minor things, no major changes to anything.

Re: NDO2DB Issue out of the blue

Posted: Wed Aug 19, 2015 10:32 pm
by Box293
Have you started monitoring any new services or started using a new plugin?

Re: NDO2DB Issue out of the blue

Posted: Wed Aug 19, 2015 10:42 pm
by BanditBBS
No new plugins in the past couple weeks at least and this issue just started happening Friday.

Just saw this in the log on the offloaded DB/NDO server:

Code:

Aug 19 22:40:07 iss-chi-nag09 ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 128000 of 32768 messages and 131072000 of 131072000 bytes in the queue. See README for kernel tuning options.
Aug 19 22:40:09 iss-chi-nag09 ndo2db: Message sent to queue.
Aug 19 22:40:09 iss-chi-nag09 ndo2db: Warning: queue send error, retrying...
Aug 19 22:40:12 iss-chi-nag09 ndo2db: Message sent to queue.
Aug 19 22:40:12 iss-chi-nag09 ndo2db: Warning: queue send error, retrying...
Aug 19 22:40:14 iss-chi-nag09 ndo2db: Message sent to queue.
Aug 19 22:40:14 iss-chi-nag09 ndo2db: Warning: queue send error, retrying...
Aug 19 22:40:16 iss-chi-nag09 ndo2db: Message sent to queue.
Aug 19 22:40:16 iss-chi-nag09 ndo2db: Warning: queue send error, retrying...
Aug 19 22:40:19 iss-chi-nag09 ndo2db: Message sent to queue.
Aug 19 22:40:19 iss-chi-nag09 ndo2db: Warning: queue send error, retrying...
Aug 19 22:40:21 iss-chi-nag09 ndo2db: Message sent to queue.
Aug 19 22:40:21 iss-chi-nag09 ndo2db: Warning: queue send error, retrying...
Aug 19 22:40:23 iss-chi-nag09 ndo2db: Message sent to queue.
Aug 19 22:40:23 iss-chi-nag09 ndo2db: Warning: queue send error, retrying...
Aug 19 22:40:25 iss-chi-nag09 ndo2db: Message sent to queue.
Aug 19 22:40:25 iss-chi-nag09 ndo2db: Warning: queue send error, retrying...
Aug 19 22:40:27 iss-chi-nag09 ndo2db: Message sent to queue.
Aug 19 22:40:27 iss-chi-nag09 ndo2db: Warning: queue send error, retrying...
Aug 19 22:40:30 iss-chi-nag09 ndo2db: Message sent to queue.
Aug 19 22:40:30 iss-chi-nag09 ndo2db: Warning: queue send error, retrying...
Aug 19 22:40:32 iss-chi-nag09 ndo2db: Message sent to queue.
Aug 19 22:40:32 iss-chi-nag09 ndo2db: Warning: queue send error, retrying...
Aug 19 22:40:35 iss-chi-nag09 ndo2db: Message sent to queue.
Aug 19 22:40:35 iss-chi-nag09 ndo2db: Warning: queue send error, retrying...
Aug 19 22:40:37 iss-chi-nag09 ndo2db: Message sent to queue.
Aug 19 22:40:37 iss-chi-nag09 ndo2db: Warning: queue send error, retrying...
Aug 19 22:40:39 iss-chi-nag09 ndo2db: Message sent to queue.
Aug 19 22:40:39 iss-chi-nag09 ndo2db: Warning: queue send error, retrying...
Aug 19 22:40:42 iss-chi-nag09 ndo2db: Message sent to queue.
Aug 19 22:40:42 iss-chi-nag09 ndo2db: Warning: queue send error, retrying...
Restarted NDO and all is well again

This was in XI messages file:

Code:

Aug 19 22:40:50 iss-chi-nag05 nagios: ndomod: Error writing to data sink!  Some output may get lost...
Aug 19 22:40:50 iss-chi-nag05 nagios: ndomod: Please check remote ndo2db log, database connection or SSL Parameters
Aug 19 22:41:06 iss-chi-nag05 nagios: ndomod: Successfully reconnected to data sink!  2869 items lost, 5000 queued items to flush.
Aug 19 22:41:06 iss-chi-nag05 nagios: ndomod: Successfully flushed 5000 queued items to data sink.
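
The ndo2db warning further up points at the kernel message queue limits, so a quick way to see how full the queue was, and what the limits on the NDO server currently are, might look like the sketch below. The function names are made up for illustration. It assumes a Linux host, where the System V message queue limits are exposed under /proc/sys/kernel, and it only reads values; it does not suggest what they should be set to.

```shell
#!/bin/sh
# Sketch only: inspect message queue pressure on the ndo2db host.

# queue_bytes_pct: read ndo2db log lines on stdin and, for each warning that
# reports "<used> of <limit> bytes in the queue", print the percentage used.
queue_bytes_pct() {
    sed -n 's/.* and \([0-9][0-9]*\) of \([0-9][0-9]*\) bytes in the queue.*/\1 \2/p' |
    awk '$2 > 0 { printf "%d%%\n", ($1 * 100) / $2 }'
}

# show_queue_limits: print the kernel's current message queue limits.
#   msgmnb = max bytes in a single queue
#   msgmax = max size of a single message
#   msgmni = max number of queues
show_queue_limits() {
    for name in msgmnb msgmax msgmni; do
        printf 'kernel.%s = %s\n' "$name" "$(cat "/proc/sys/kernel/$name")"
    done
}
```

Feeding the 22:40:07 warning line through queue_bytes_pct shows the byte limit completely used up, which fits the "queue send error, retrying" messages that follow it. Running "ipcs -q" on the same box lists the live queues with their used bytes and message counts.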