Re: Nagios daemon crashing frequently (extensive logs attach
Posted: Mon Feb 03, 2014 10:13 am
by stephan
I just found this topic through Google; it describes exactly the problem we are having. However, one of the first things we did was disable every plugin except Merlin and Nconf, as we cannot operate without these. pnp4nagios (npcd) was already disabled months ago. Maybe this post gives someone an idea. The gdb part seems interesting to try...
http://www.monitoring-portal.org/wbb/in ... =82323&l=2
Re: Nagios daemon crashing frequently (extensive logs attach
Posted: Mon Feb 03, 2014 1:48 pm
by abrist
I cannot say what role Merlin might play, but as a number of *different* things can cause this issue, I would ask you to open a new thread for your problem. That also keeps all the logs/outputs in this thread relevant to only one person.
Re: Nagios daemon crashing frequently (extensive logs attach
Posted: Mon Feb 03, 2014 2:20 pm
by stephan
Hi Abrist. The problem is still exactly the same; the link just describes the issue we are having. I posted it to clarify a bit more.
Re: Nagios daemon crashing frequently (extensive logs attach
Posted: Tue Feb 04, 2014 1:26 pm
by slansing
Are you still running without crashes after changing that configuration line? It is curious that this happens when trying to move a file as small as a check result. It would be nice to get a listing of how many files are in "/usr/local/nagios/var/spool/checkresults/" at the time of the crash.
Re: Nagios daemon crashing frequently (extensive logs attach
Posted: Wed Feb 05, 2014 5:30 am
by stephan
No, unfortunately we have seen crashes since then. The malloc config change did not seem to have any effect; maybe it needs to be exported some other way, I'm not sure.
That is indeed curious, and I was thinking the same thing, so we are now collecting the contents of the checkresults directory before restarting Nagios. We want to look at the file that was last moved before the crash, to see if there is a pattern of some kind.
We are still waiting for a new crash. It has been stable since Friday of last week. The frequency fluctuates a lot, and we have no clue how to reproduce or force the crash, which makes it much more difficult to troubleshoot.
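For anyone wanting to collect the same data, here is a minimal sketch of how the spool could be archived before a restart. The function name and both directory arguments are hypothetical; the spool path should match whatever check_result_path is set to on your system.

```shell
#!/bin/sh
# Hypothetical helper: archive the check result spool before restarting Nagios.
# $1 = spool directory (absolute path), $2 = directory to store archives in.
snapshot_checkresults() {
    spool_dir=$1
    out_dir=$2
    stamp=$(date +%Y_%m_%d_%H_%M_%S)
    archive="$out_dir/$(uname -n)_${stamp}_checkresults.tar.gz"
    mkdir -p "$out_dir" || return 1
    # -C / with a relative path keeps member names free of a leading slash
    tar -czf "$archive" -C / "${spool_dir#/}" || return 1
    # Print the archive name so a wrapper script can log it
    echo "$archive"
}

# Example call (paths are assumptions):
#   snapshot_checkresults /var/log/nagios/spool/checkresults /root/crashes
```

Calling this from the restart wrapper, before the daemon is started again, should leave the files exactly as the crashed process left them.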
Re: Nagios daemon crashing frequently (extensive logs attach
Posted: Wed Feb 05, 2014 11:19 am
by slansing
Alright, let us know what your findings are. I'm interested to see whether a large number of checkresults are sitting in that directory at the time of the crash, not being picked up in time and somehow overloading Nagios. We're going to continue to brainstorm on our end; this is a curious issue. Have you tried mirroring your production system to see if it recurs on a different core system?
Re: Nagios daemon crashing frequently (extensive logs attach
Posted: Wed Feb 12, 2014 6:51 am
by stephan
Another crash occurred just now; here's the saved checkresults directory:
Code:
[root@server crashes]# tar -tvf ./server_2014_02_12_12\:30\:01_checkresults.tar.gz
drwxrwxr-x nagios/nagios 0 2014-02-12 12:26 var/log/nagios/spool/checkresults/
-rw------- nagios/nagios 478 2014-02-12 12:25 var/log/nagios/spool/checkresults/cJGMWqH
-rw------- nagios/nagios 0 2014-02-12 12:26 var/log/nagios/spool/checkresults/ceNgA3B.ok
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cPqy0oQ.ok
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cOTUC5w.ok
-rw------- nagios/nagios 658 2014-02-12 12:26 var/log/nagios/spool/checkresults/ceNgA3B
-rw------- nagios/nagios 0 2014-02-12 12:26 var/log/nagios/spool/checkresults/c9V5smO.ok
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cszzhW1.ok
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/ck54KGi.ok
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cqV4IC4.ok
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/ch0DetO.ok
-rw------- nagios/nagios 0 2014-02-12 12:26 var/log/nagios/spool/checkresults/cBkybx5.ok
-rw------- nagios/nagios 489 2014-02-12 12:25 var/log/nagios/spool/checkresults/ccNvcZp
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cm6E9xg.ok
-rw------- nagios/nagios 426 2014-02-12 12:25 var/log/nagios/spool/checkresults/c42DQ7T
-rw------- nagios/nagios 473 2014-02-12 12:25 var/log/nagios/spool/checkresults/cspb0GE
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/c8bb38J.ok
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/c7lbsRs.ok
-rw------- nagios/nagios 476 2014-02-12 12:25 var/log/nagios/spool/checkresults/cEA4MC7
-rw------- nagios/nagios 1101 2014-02-12 12:26 var/log/nagios/spool/checkresults/cEuAaii
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cA5BUUP.ok
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cspb0GE.ok
-rw------- nagios/nagios 1102 2014-02-12 12:26 var/log/nagios/spool/checkresults/cMbOd5r
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/c42DQ7T.ok
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cb2zOY3.ok
-rw------- nagios/nagios 473 2014-02-12 12:25 var/log/nagios/spool/checkresults/ch0DetO
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cxDbDJT.ok
-rw------- nagios/nagios 1240 2014-02-12 12:26 var/log/nagios/spool/checkresults/c9V5smO
-rw------- nagios/nagios 483 2014-02-12 12:26 var/log/nagios/spool/checkresults/cMNeJU0
-rw------- nagios/nagios 0 2014-02-12 12:26 var/log/nagios/spool/checkresults/cMbOd5r.ok
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cnJayzg.ok
-rw------- nagios/nagios 0 2014-02-12 12:26 var/log/nagios/spool/checkresults/cMNeJU0.ok
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/co6lc0E.ok
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cEA4MC7.ok
-rw------- nagios/nagios 474 2014-02-12 12:25 var/log/nagios/spool/checkresults/co6lc0E
-rw------- nagios/nagios 486 2014-02-12 12:25 var/log/nagios/spool/checkresults/cm6E9xg
-rw------- nagios/nagios 0 2014-02-12 12:26 var/log/nagios/spool/checkresults/cEuAaii.ok
-rw------- nagios/nagios 484 2014-02-12 12:25 var/log/nagios/spool/checkresults/c8bb38J
-rw------- nagios/nagios 1102 2014-02-12 12:26 var/log/nagios/spool/checkresults/cBkybx5
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cS6SnL2.ok
-rw------- nagios/nagios 482 2014-02-12 12:25 var/log/nagios/spool/checkresults/c3aw8ZM
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/ccNvcZp.ok
-rw------- nagios/nagios 427 2014-02-12 12:25 var/log/nagios/spool/checkresults/cqV4IC4
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/c3aw8ZM.ok
-rw------- nagios/nagios 476 2014-02-12 12:25 var/log/nagios/spool/checkresults/cnJayzg
-rw------- nagios/nagios 929 2014-02-12 12:25 var/log/nagios/spool/checkresults/cszzhW1
-rw------- nagios/nagios 509 2014-02-12 12:25 var/log/nagios/spool/checkresults/cA5BUUP
-rw------- nagios/nagios 442 2014-02-12 12:25 var/log/nagios/spool/checkresults/cS6SnL2
-rw------- nagios/nagios 387 2014-02-12 12:25 var/log/nagios/spool/checkresults/ck54KGi
-rw------- nagios/nagios 391 2014-02-12 12:25 var/log/nagios/spool/checkresults/cOTUC5w
-rw------- nagios/nagios 422 2014-02-12 12:25 var/log/nagios/spool/checkresults/cPqy0oQ
-rw------- nagios/nagios 619 2014-02-12 12:25 var/log/nagios/spool/checkresults/cb2zOY3
-rw------- nagios/nagios 442 2014-02-12 12:25 var/log/nagios/spool/checkresults/c7lbsRs
-rw------- nagios/nagios 417 2014-02-12 12:25 var/log/nagios/spool/checkresults/cxDbDJT
-rw------- nagios/nagios 0 2014-02-12 12:25 var/log/nagios/spool/checkresults/cJGMWqH.ok
[root@server crashes]# tar -tvf ./server_2014_02_12_12\:30\:01_checkresults.tar.gz | wc -l
55
55 doesn't seem like a lot. I've been watching this directory with an 'ls' every 0.5 seconds. The number of files in this directory jumps from 20 to 100 to 50 to 180 to 0 to 10, etc. I assume this is normal behavior?
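To turn the manual 'ls' watching into something that can be compared against the crash time afterwards, a rough sketch like the following could log the count instead. The function name and the log file location are assumptions. It counts every regular file without the .ok suffix, so temp files that have not been moved yet are included.

```shell
#!/bin/sh
# Hypothetical helper: count check result files in the spool directory,
# skipping the zero-byte *.ok marker files.
count_pending() {
    find "$1" -maxdepth 1 -type f ! -name '*.ok' | wc -l | tr -d ' '
}

# Example loop (sub-second sleep needs a sleep that accepts fractions, e.g. GNU):
#   while sleep 0.5; do
#       printf '%s %s\n' "$(date +%T)" "$(count_pending /var/log/nagios/spool/checkresults)"
#   done >> /tmp/checkresults_count.log
```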
The last entries in the debug log are these:
Code:
[Wed Feb 12 12:25:56 2014.147613] [016.1] [pid=6690] Processing check result file: '/var/log/nagios/spool/checkresults/cUknRdE'
[Wed Feb 12 12:25:56 2014.240223] [016.2] [pid=13311] Moving temp check result file '/var/log/nagios/spool/checkresults/checkh3LtZI' to queue file '/var/log/nagios/spool/checkresults/c42DQ7T'...
[Wed Feb 12 12:25:56 2014.708625] [016.2] [pid=13313] Moving temp check result file '/var/log/nagios/spool/checkresults/check9L6O7R' to queue file '/var/log/nagios/spool/checkresults/cqV4IC4'...
[Wed Feb 12 12:25:57 2014.325122] [016.2] [pid=13227] Moving temp check result file '/var/log/nagios/spool/checkresults/checkJF5Poq' to queue file '/var/log/nagios/spool/checkresults/cspb0GE'...
[Wed Feb 12 12:25:57 2014.361850] [016.2] [pid=13230] Moving temp check result file '/var/log/nagios/spool/checkresults/checkxqq91s' to queue file '/var/log/nagios/spool/checkresults/cJGMWqH'...
[Wed Feb 12 12:25:57 2014.365008] [016.2] [pid=13233] Moving temp check result file '/var/log/nagios/spool/checkresults/checkPyvGJv' to queue file '/var/log/nagios/spool/checkresults/c8bb38J'...
[Wed Feb 12 12:25:57 2014.394159] [016.2] [pid=13236] Moving temp check result file '/var/log/nagios/spool/checkresults/checkPrCvvy' to queue file '/var/log/nagios/spool/checkresults/c3aw8ZM'...
[Wed Feb 12 12:25:57 2014.644306] [016.2] [pid=13315] Moving temp check result file '/var/log/nagios/spool/checkresults/check9gWnk1' to queue file '/var/log/nagios/spool/checkresults/cm6E9xg'...
[Wed Feb 12 12:25:57 2014.700719] [016.2] [pid=13317] Moving temp check result file '/var/log/nagios/spool/checkresults/check38ufBa' to queue file '/var/log/nagios/spool/checkresults/ccNvcZp'...
[Wed Feb 12 12:25:59 2014.650375] [016.2] [pid=13319] Moving temp check result file '/var/log/nagios/spool/checkresults/checkvj7nWj' to queue file '/var/log/nagios/spool/checkresults/co6lc0E'...
[Wed Feb 12 12:25:59 2014.671502] [016.2] [pid=13322] Moving temp check result file '/var/log/nagios/spool/checkresults/checkpXEBlt' to queue file '/var/log/nagios/spool/checkresults/ch0DetO'...
[Wed Feb 12 12:25:59 2014.717137] [016.2] [pid=13328] Moving temp check result file '/var/log/nagios/spool/checkresults/check5zuVmM' to queue file '/var/log/nagios/spool/checkresults/cEA4MC7'...
[Wed Feb 12 12:26:00 2014.080652] [016.2] [pid=13331] Moving temp check result file '/var/log/nagios/spool/checkresults/checktfSHYV' to queue file '/var/log/nagios/spool/checkresults/cEuAaii'...
[Wed Feb 12 12:26:00 2014.121986] [016.2] [pid=13333] Moving temp check result file '/var/log/nagios/spool/checkresults/checkpcriE5' to queue file '/var/log/nagios/spool/checkresults/cMbOd5r'...
[Wed Feb 12 12:26:00 2014.202355] [016.2] [pid=13335] Moving temp check result file '/var/log/nagios/spool/checkresults/checkplz9nf' to queue file '/var/log/nagios/spool/checkresults/ceNgA3B'...
[Wed Feb 12 12:26:00 2014.690954] [016.2] [pid=13325] Moving temp check result file '/var/log/nagios/spool/checkresults/checkPMn7OC' to queue file '/var/log/nagios/spool/checkresults/cMNeJU0'...
[Wed Feb 12 12:26:01 2014.061647] [016.2] [pid=13346] Moving temp check result file '/var/log/nagios/spool/checkresults/checkxTDXbp' to queue file '/var/log/nagios/spool/checkresults/c9V5smO'...
[Wed Feb 12 12:26:03 2014.571541] [016.2] [pid=13348] Moving temp check result file '/var/log/nagios/spool/checkresults/checkb75H3y' to queue file '/var/log/nagios/spool/checkresults/cBkybx5'...
Now I will look into these files to see if there's anything odd in them, but my guess is they will all look normal.
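To pick out which queue files were written last, the "Moving temp check result file" lines can be filtered from the debug log. This is only a sketch: the function name is hypothetical, and it assumes the log lines look exactly like the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: print the queue file paths from the last N
# "Moving temp check result file" entries of a Nagios debug log.
# $1 = debug log file, $2 = number of entries (default 20).
last_queue_files() {
    log=$1
    n=${2:-20}
    grep "Moving temp check result file" "$log" \
        | tail -n "$n" \
        | sed -n "s/.*to queue file '\([^']*\)'.*/\1/p"
}

# Example (log path is an assumption):
#   last_queue_files /var/log/nagios/nagios.debug 5
```

The very last path printed is the file that was moved most recently before the log stopped, which makes it the first candidate to inspect in the saved archive.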
To answer the mirroring question: the cluster consists of four different physical servers, which all show this behavior, so that is already a mirrored scenario. We have another cluster of two servers for testing purposes, where we do not see these crashes. However, those are VMs and don't perform many checks, nor do they use all the checks/scripts the cluster of four does, so that's not really a good test case.
The difficulty in mirroring this setup is that we cannot have another system or systems performing the same checks alongside the active cluster. That would result in the scripts being executed twice every x (5) min, locally or through NRPE on other servers. For our production systems this is not an option. Any ideas on how to go about this?
Also, say the crash is caused by some check/script... how can we tell which one it is? The Nagios debug log is set to maximum logging and maximum verbosity but does not mention this. How can we ever find out what the real cause is?
Use gdb, or strace in some advanced way? Does anyone have ideas for seeing where the process actually crashes? We might have to look at tools that really hook into the process or something. It seems Nagios itself is not aware that it is crashing and does not report a cause.
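One common way to approach this is to let the kernel write a core file and pull a backtrace from it with gdb, or to attach strace to the running daemon. The sketch below only shows the general shape; the helper name, the binary path and the core directory are all assumptions that depend on the local install and core_pattern settings.

```shell
#!/bin/sh
# Hypothetical helper: print the most recently modified core file
# (core, core.PID, ...) found in the given directory.
newest_core() {
    ls -td "$1"/core* 2>/dev/null | head -n 1
}

# Example session (binary path and core directory are assumptions):
#   ulimit -c unlimited     # in the shell or init script that starts Nagios
#   gdb -batch -ex 'bt full' /usr/local/nagios/bin/nagios "$(newest_core /var/crash)"
#
# Alternatively, attach to the oldest running nagios process and wait for
# the fatal signal to show up at the end of the trace:
#   strace -f -tt -o /tmp/nagios.strace -p "$(pgrep -o nagios)"
```

A backtrace is far more readable if the binary still has its debug symbols, so a stripped package build may need the matching debuginfo package first.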
Thanks for all the help, really appreciate it!
Re: Nagios daemon crashing frequently (extensive logs attach
Posted: Wed Feb 12, 2014 11:28 am
by slansing
Ah, well, it could be caused by one of these custom alterations, since this is not affecting your other two Nagios servers that are outside of the cluster. They would presumably be the same version, and if this were a bug in Nagios they would exhibit the same behavior. Now, is it possible for you to pull the temporary check file Nagios tries to move before it crashes? Within that file you should be able to see what check/host/service it was destined for. Do you have any details on these custom scripts/checks that are in the cluster?
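As a rough illustration of what to look for in such a file: check result files are plain key=value text, so the destination can be read with grep. The function name is hypothetical, and the key names (host_name, service_description, return_code) are assumed to match the check result file format of the Nagios version in use.

```shell
#!/bin/sh
# Hypothetical helper: show which host/service a check result file belongs to
# and the plugin return code recorded in it.
describe_checkresult() {
    grep -E '^(host_name|service_description|return_code)=' "$1"
}

# Example over an extracted snapshot (directory is an assumption):
#   for f in var/log/nagios/spool/checkresults/c*; do
#       case $f in *.ok) continue ;; esac
#       echo "== $f"; describe_checkresult "$f"
#   done
```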