Nagios daemon crashing frequently (extensive logs attached)

stephan
Posts: 17
Joined: Fri Sep 27, 2013 5:10 am

Re: Nagios daemon crashing frequently (extensive logs attach

Post by stephan »

I just found this topic through Google; it describes exactly the problem we are having. However, one of the first things we did was disable every plugin except Merlin and NConf, as we cannot operate without them. pnp4nagios (npcd) was already disabled months ago. Maybe this post gives someone an idea. The gdb part seems interesting to try...

http://www.monitoring-portal.org/wbb/in ... =82323&l=2
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios daemon crashing frequently (extensive logs attach

Post by abrist »

I cannot say what role Merlin might play, but since a number of *different* things can cause this issue, I would ask you to open a new thread for your problem. That also keeps the logs/outputs in this thread relevant to only one person.
Former Nagios employee
stephan
Posts: 17
Joined: Fri Sep 27, 2013 5:10 am

Re: Nagios daemon crashing frequently (extensive logs attach

Post by stephan »

Hi abrist. The problem is still exactly the same; the link just describes the issue we are having. I posted it to clarify a bit more.
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Nagios daemon crashing frequently (extensive logs attach

Post by slansing »

Are you still running without crashes after changing that configuration line? It is curious that this happens when trying to move a file as small as a check result. It would be nice to get a listing of how many files are in "/usr/local/nagios/var/spool/checkresults/" at the time of the crash.
stephan
Posts: 17
Joined: Fri Sep 27, 2013 5:10 am

Re: Nagios daemon crashing frequently (extensive logs attach

Post by stephan »

No, unfortunately we have still seen crashes since then. The malloc config change did not seem to have any effect; maybe it needs to be exported some other way, I'm not sure.

That's indeed curious. I was thinking the same thing, so we are now collecting the contents of the checkresults directory before restarting Nagios. We want to look at the file that was last moved before the crash, to see if there's a pattern of some kind.

We are still waiting for a new crash; it has been stable since Friday of last week. The frequency fluctuates a lot, and we have no clue how to reproduce or force the crash, which makes it so much more difficult to troubleshoot.
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Nagios daemon crashing frequently (extensive logs attach

Post by slansing »

Alright, let us know what your findings are. I'm interested to see whether a large number of check results are sitting in that directory at the time of the crash, not being picked up in time and somehow overloading Nagios. We're going to continue to brainstorm on our end; this is a curious issue. Have you tried mirroring your production system to see if the problem recurs on a different Core system?
stephan
Posts: 17
Joined: Fri Sep 27, 2013 5:10 am

Re: Nagios daemon crashing frequently (extensive logs attach

Post by stephan »

Another crash occurred just now; here's the saved checkresults directory:

[root@server crashes]# tar -tvf ./server_2014_02_12_12\:30\:01_checkresults.tar.gz
drwxrwxr-x nagios/nagios     0 2014-02-12 12:26 var/log/nagios/spool/checkresults/
-rw------- nagios/nagios   478 2014-02-12 12:25 var/log/nagios/spool/checkresults/cJGMWqH
-rw------- nagios/nagios     0 2014-02-12 12:26 var/log/nagios/spool/checkresults/ceNgA3B.ok
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cPqy0oQ.ok
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cOTUC5w.ok
-rw------- nagios/nagios   658 2014-02-12 12:26 var/log/nagios/spool/checkresults/ceNgA3B
-rw------- nagios/nagios     0 2014-02-12 12:26 var/log/nagios/spool/checkresults/c9V5smO.ok
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cszzhW1.ok
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/ck54KGi.ok
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cqV4IC4.ok
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/ch0DetO.ok
-rw------- nagios/nagios     0 2014-02-12 12:26 var/log/nagios/spool/checkresults/cBkybx5.ok
-rw------- nagios/nagios   489 2014-02-12 12:25 var/log/nagios/spool/checkresults/ccNvcZp
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cm6E9xg.ok
-rw------- nagios/nagios   426 2014-02-12 12:25 var/log/nagios/spool/checkresults/c42DQ7T
-rw------- nagios/nagios   473 2014-02-12 12:25 var/log/nagios/spool/checkresults/cspb0GE
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/c8bb38J.ok
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/c7lbsRs.ok
-rw------- nagios/nagios   476 2014-02-12 12:25 var/log/nagios/spool/checkresults/cEA4MC7
-rw------- nagios/nagios  1101 2014-02-12 12:26 var/log/nagios/spool/checkresults/cEuAaii
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cA5BUUP.ok
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cspb0GE.ok
-rw------- nagios/nagios  1102 2014-02-12 12:26 var/log/nagios/spool/checkresults/cMbOd5r
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/c42DQ7T.ok
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cb2zOY3.ok
-rw------- nagios/nagios   473 2014-02-12 12:25 var/log/nagios/spool/checkresults/ch0DetO
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cxDbDJT.ok
-rw------- nagios/nagios  1240 2014-02-12 12:26 var/log/nagios/spool/checkresults/c9V5smO
-rw------- nagios/nagios   483 2014-02-12 12:26 var/log/nagios/spool/checkresults/cMNeJU0
-rw------- nagios/nagios     0 2014-02-12 12:26 var/log/nagios/spool/checkresults/cMbOd5r.ok
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cnJayzg.ok
-rw------- nagios/nagios     0 2014-02-12 12:26 var/log/nagios/spool/checkresults/cMNeJU0.ok
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/co6lc0E.ok
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cEA4MC7.ok
-rw------- nagios/nagios   474 2014-02-12 12:25 var/log/nagios/spool/checkresults/co6lc0E
-rw------- nagios/nagios   486 2014-02-12 12:25 var/log/nagios/spool/checkresults/cm6E9xg
-rw------- nagios/nagios     0 2014-02-12 12:26 var/log/nagios/spool/checkresults/cEuAaii.ok
-rw------- nagios/nagios   484 2014-02-12 12:25 var/log/nagios/spool/checkresults/c8bb38J
-rw------- nagios/nagios  1102 2014-02-12 12:26 var/log/nagios/spool/checkresults/cBkybx5
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cS6SnL2.ok
-rw------- nagios/nagios   482 2014-02-12 12:25 var/log/nagios/spool/checkresults/c3aw8ZM
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/ccNvcZp.ok
-rw------- nagios/nagios   427 2014-02-12 12:25 var/log/nagios/spool/checkresults/cqV4IC4
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/c3aw8ZM.ok
-rw------- nagios/nagios   476 2014-02-12 12:25 var/log/nagios/spool/checkresults/cnJayzg
-rw------- nagios/nagios   929 2014-02-12 12:25 var/log/nagios/spool/checkresults/cszzhW1
-rw------- nagios/nagios   509 2014-02-12 12:25 var/log/nagios/spool/checkresults/cA5BUUP
-rw------- nagios/nagios   442 2014-02-12 12:25 var/log/nagios/spool/checkresults/cS6SnL2
-rw------- nagios/nagios   387 2014-02-12 12:25 var/log/nagios/spool/checkresults/ck54KGi
-rw------- nagios/nagios   391 2014-02-12 12:25 var/log/nagios/spool/checkresults/cOTUC5w
-rw------- nagios/nagios   422 2014-02-12 12:25 var/log/nagios/spool/checkresults/cPqy0oQ
-rw------- nagios/nagios   619 2014-02-12 12:25 var/log/nagios/spool/checkresults/cb2zOY3
-rw------- nagios/nagios   442 2014-02-12 12:25 var/log/nagios/spool/checkresults/c7lbsRs
-rw------- nagios/nagios   417 2014-02-12 12:25 var/log/nagios/spool/checkresults/cxDbDJT
-rw------- nagios/nagios     0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cJGMWqH.ok
[root@server crashes]# tar -tvf ./server_2014_02_12_12\:30\:01_checkresults.tar.gz | wc -l
55
55 doesn't seem like a lot. I've been watching this directory with an 'ls' every 0.5 seconds, and the number of files jumps from 20 to 100 to 50 to 180 to 0 to 10, etc. I assume this is normal behavior?

The last entries in the debug log are these:
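
For anyone who wants to check a saved listing like the one above automatically, here is a rough Python sketch. It assumes (my understanding, not something confirmed in this thread) that each result file is meant to be paired with an empty `.ok` marker of the same name, and it reports names that appear without their counterpart. The function name `unpaired_checkresults` is made up for this example.

```python
def unpaired_checkresults(listing: str):
    """Return (results_without_ok, ok_without_result) from `tar -tvf` output.

    Assumption: a result file 'cXXXXXX' is paired with a marker 'cXXXXXX.ok'.
    """
    results, markers = set(), set()
    for line in listing.splitlines():
        fields = line.split()
        # Keep only regular-file entries: perms, owner, size, date, time, path.
        # This skips the directory entry, shell prompts and the `wc -l` count.
        if len(fields) < 6 or not fields[0].startswith("-"):
            continue
        name = fields[-1].rsplit("/", 1)[-1]
        if name.endswith(".ok"):
            markers.add(name[:-3])
        else:
            results.add(name)
    return sorted(results - markers), sorted(markers - results)
```

Feeding it the text output of `tar -tvf` on a saved archive gives two lists; if both are empty, every result file in the snapshot had its marker.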

[Wed Feb 12 12:25:56 2014.147613] [016.1] [pid=6690] Processing check result file: '/var/log/nagios/spool/checkresults/cUknRdE'
[Wed Feb 12 12:25:56 2014.240223] [016.2] [pid=13311] Moving temp check result file '/var/log/nagios/spool/checkresults/checkh3LtZI' to queue file '/var/log/nagios/spool/checkresults/c42DQ7T'...
[Wed Feb 12 12:25:56 2014.708625] [016.2] [pid=13313] Moving temp check result file '/var/log/nagios/spool/checkresults/check9L6O7R' to queue file '/var/log/nagios/spool/checkresults/cqV4IC4'...
[Wed Feb 12 12:25:57 2014.325122] [016.2] [pid=13227] Moving temp check result file '/var/log/nagios/spool/checkresults/checkJF5Poq' to queue file '/var/log/nagios/spool/checkresults/cspb0GE'...
[Wed Feb 12 12:25:57 2014.361850] [016.2] [pid=13230] Moving temp check result file '/var/log/nagios/spool/checkresults/checkxqq91s' to queue file '/var/log/nagios/spool/checkresults/cJGMWqH'...
[Wed Feb 12 12:25:57 2014.365008] [016.2] [pid=13233] Moving temp check result file '/var/log/nagios/spool/checkresults/checkPyvGJv' to queue file '/var/log/nagios/spool/checkresults/c8bb38J'...
[Wed Feb 12 12:25:57 2014.394159] [016.2] [pid=13236] Moving temp check result file '/var/log/nagios/spool/checkresults/checkPrCvvy' to queue file '/var/log/nagios/spool/checkresults/c3aw8ZM'...
[Wed Feb 12 12:25:57 2014.644306] [016.2] [pid=13315] Moving temp check result file '/var/log/nagios/spool/checkresults/check9gWnk1' to queue file '/var/log/nagios/spool/checkresults/cm6E9xg'...
[Wed Feb 12 12:25:57 2014.700719] [016.2] [pid=13317] Moving temp check result file '/var/log/nagios/spool/checkresults/check38ufBa' to queue file '/var/log/nagios/spool/checkresults/ccNvcZp'...
[Wed Feb 12 12:25:59 2014.650375] [016.2] [pid=13319] Moving temp check result file '/var/log/nagios/spool/checkresults/checkvj7nWj' to queue file '/var/log/nagios/spool/checkresults/co6lc0E'...
[Wed Feb 12 12:25:59 2014.671502] [016.2] [pid=13322] Moving temp check result file '/var/log/nagios/spool/checkresults/checkpXEBlt' to queue file '/var/log/nagios/spool/checkresults/ch0DetO'...
[Wed Feb 12 12:25:59 2014.717137] [016.2] [pid=13328] Moving temp check result file '/var/log/nagios/spool/checkresults/check5zuVmM' to queue file '/var/log/nagios/spool/checkresults/cEA4MC7'...
[Wed Feb 12 12:26:00 2014.080652] [016.2] [pid=13331] Moving temp check result file '/var/log/nagios/spool/checkresults/checktfSHYV' to queue file '/var/log/nagios/spool/checkresults/cEuAaii'...
[Wed Feb 12 12:26:00 2014.121986] [016.2] [pid=13333] Moving temp check result file '/var/log/nagios/spool/checkresults/checkpcriE5' to queue file '/var/log/nagios/spool/checkresults/cMbOd5r'...
[Wed Feb 12 12:26:00 2014.202355] [016.2] [pid=13335] Moving temp check result file '/var/log/nagios/spool/checkresults/checkplz9nf' to queue file '/var/log/nagios/spool/checkresults/ceNgA3B'...
[Wed Feb 12 12:26:00 2014.690954] [016.2] [pid=13325] Moving temp check result file '/var/log/nagios/spool/checkresults/checkPMn7OC' to queue file '/var/log/nagios/spool/checkresults/cMNeJU0'...
[Wed Feb 12 12:26:01 2014.061647] [016.2] [pid=13346] Moving temp check result file '/var/log/nagios/spool/checkresults/checkxTDXbp' to queue file '/var/log/nagios/spool/checkresults/c9V5smO'...
[Wed Feb 12 12:26:03 2014.571541] [016.2] [pid=13348] Moving temp check result file '/var/log/nagios/spool/checkresults/checkb75H3y' to queue file '/var/log/nagios/spool/checkresults/cBkybx5'...
Now I will look into these files to see if there's anything odd in them, but my guess is they will all look normal.
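
To make it easier to spot which queue files were written last before a crash, the "Moving temp check result file" lines can be pulled out of the debug log with a small script. This is only a sketch based on the line format shown above; the regular expression and the name `last_moves` are my own, and other Nagios versions may format these lines differently.

```python
import re

# Matches debug-log lines of the form shown in this post:
# [timestamp] [level] [pid=N] Moving temp check result file '...' to queue file '...'
MOVE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[[^\]]+\] \[pid=(?P<pid>\d+)\] "
    r"Moving temp check result file '(?P<temp>[^']+)' "
    r"to queue file '(?P<queue>[^']+)'"
)

def last_moves(log_text: str, n: int = 5):
    """Return the last n (timestamp, pid, queue_file) tuples found in the log text."""
    moves = []
    for line in log_text.splitlines():
        match = MOVE_RE.match(line.strip())
        if match:
            moves.append((match["ts"], int(match["pid"]), match["queue"]))
    return moves[-n:]
```

Run against the tail of the debug log saved at crash time, the final tuple names the queue file that was moved most recently, which is the first one worth opening.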

To answer the mirroring question: the cluster consists of four different physical servers, all of which show this behavior, so that is already a mirrored scenario. We have another cluster of two servers for testing purposes, and there we do not see these crashes. However, those are VMs that don't perform many checks and don't use all the checks/scripts the cluster of four does, so that's not really a good test case.

The difficulty in mirroring this setup is that we cannot have another system (or systems) performing the same checks alongside the active cluster. The scripts would then be executed twice, locally or through NRPE on other servers, every x (5) minutes. For our production systems this is not an option. Any ideas on how to go about this?

Also, say the crash is caused by some check/script... how can we tell which one it is? The Nagios debug log is set to the maximum level and verbosity but does not mention this. How can we ever find out what the real cause is?

Use gdb... or strace in some advanced way? Does anyone have ideas for seeing where the process actually crashes? We might have to look at tools that hook into the process. It seems Nagios itself is not aware that it is crashing and does not report a cause.

Thanks for all the help, really appreciate it!
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Nagios daemon crashing frequently (extensive logs attach

Post by slansing »

Ah, well, it could be caused by one of these custom alterations, since this is not affecting your other two Nagios servers that are outside of the cluster. They would presumably be running the same version, and if this were a bug in Nagios they would be exhibiting the same behavior. Now, is it possible for you to pull the temporary check file Nagios tries to move before it crashes? Within that file you should be able to see which check/host/service it was destined for. Do you have any details on these custom scripts/checks that are in the cluster?
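
If it helps, reading the host and service out of a saved check result file could be scripted roughly as below. This is a sketch under the assumption that the file consists of `key=value` lines (with keys such as `host_name` and `service_description`) plus `#` comment lines; please verify that against one of your own files. The function name `parse_checkresult` is made up for this example.

```python
def parse_checkresult(text: str) -> dict:
    """Collect key=value pairs from one check result file.

    Assumed format: one 'key=value' per line, '#' lines are comments.
    If a key occurs more than once, the last value wins.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so plugin output containing '=' stays intact.
        key, sep, value = line.partition("=")
        if sep:
            fields[key] = value
    return fields
```

Printing `fields.get("host_name")` and `fields.get("service_description")` for the last file moved before each crash should show whether the same check keeps turning up.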