nagios dies - sometimes

MalcolmPreen
Posts: 63
Joined: Wed Jan 25, 2012 9:21 am

nagios dies - sometimes

Post by MalcolmPreen »

OK - firstly, sorry for the vague title... it's basically true, but first of all you need a description of the setup we have.

We have a central server, which monitors itself and collects NSCA reports for all of the hosts and services which are monitored by off-site tablets.

All of the systems are running CentOS and, since last week, they are all running the latest nagios, 4.2.1.

The problem (the nagios process on the central server failing) has been happening occasionally (once or twice a month), and from what I can see it almost always fails at around 00:50. It has been happening since we first migrated onto nagios 4 (and possibly even on nagios 3, but that was quite a time ago).

I become aware of the problem, because the dashboard data for all of the hosts and services is out of date - but interestingly, if I refresh my browser it is managing to get the information from nagios (even though it appears to not be running) - perhaps something is cached or otherwise available in memory?

Additionally, there are many nsca processes running, and /var/log/messages starts to log messages such as:

date/time host-name xinetd[3613]: FAIL: nsca service_limit from=source

where date/time is a date/time stamp, host-name is the host I am viewing, and source is the IP address the nsca message is from.
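A minimal sketch, in Python, of how lines like the one above could be tallied per source address, to see which senders are being refused and how often. It assumes the message layout quoted above; the sample lines, host name, and the `count_service_limit_failures` name are invented for illustration.

```python
import re
from collections import Counter

# Matches the xinetd message layout quoted above; the syslog prefix is ignored.
FAIL_RE = re.compile(
    r"xinetd\[\d+\]: FAIL: (?P<service>\S+) service_limit from=(?P<source>\S+)"
)

def count_service_limit_failures(lines, service="nsca"):
    """Count 'service_limit' failures per source address for one service."""
    counts = Counter()
    for line in lines:
        match = FAIL_RE.search(line)
        if match and match.group("service") == service:
            counts[match.group("source")] += 1
    return counts

# Hypothetical sample lines (documentation-range addresses, invented timestamps).
sample = [
    "Sep  1 00:51:02 central xinetd[3613]: FAIL: nsca service_limit from=192.0.2.10",
    "Sep  1 00:51:03 central xinetd[3613]: FAIL: nsca service_limit from=192.0.2.11",
    "Sep  1 00:51:04 central xinetd[3613]: FAIL: nsca service_limit from=192.0.2.10",
    "Sep  1 00:51:05 central sshd[999]: Accepted publickey for someone",
]

if __name__ == "__main__":
    for source, hits in count_service_limit_failures(sample).most_common():
        print(source, hits)
```

In real use the `lines` argument would be an open handle on /var/log/messages rather than the sample list.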

There is nothing, that I have found to date, within /var/log/messages or /usr/local/nagios/var/nagios.log which indicates a failure .... except for a lack of messages.

To restore the system back to normality, I can "just" restart nagios as normal - although I also have to kill -1 the old nsca processes (as until I do that, no new processes can run... and no more data is received).

So.... is this a problem which is known ?

Assuming it isn't, what can I do to provide diagnostics ?

I have the logs I have at present... but I can turn on other debug "for next time" if that would help. I have not attached any logs as yet, as they are all huge.

Any advice greatly appreciated.

Thanks, Malcolm
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: nagios dies - sometimes

Post by rkennedy »

What version of NSCA are you running?

Also, can you attach your NSCA xinetd configuration file?
Former Nagios Employee
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia

Re: nagios dies - sometimes

Post by Box293 »

Here is how to enable debug logging for NSCA.

https://support.nagios.com/kb/article.p ... ategory=94

Because it's only happening occasionally, I suspect this is something that "builds up over time". It would be interesting to see if there is anything unusual after collecting a day's worth of logs.
MalcolmPreen
Posts: 63
Joined: Wed Jan 25, 2012 9:21 am

Re: nagios dies - sometimes

Post by MalcolmPreen »

Firstly, thanks for your replies;
rkennedy wrote: What version of NSCA are you running? Also, can you attach your NSCA xinetd configuration file?

I think NSCA 2.9.1 is the "latest":

NSCA - Nagios Service Check Acceptor
Copyright (c) 2009 Nagios Core Development Team and Community Contributors
Copyright (c) 2000-2009 Ethan Galstad
Version: 2.9.1
Last Modified: 01-27-2012
License: GPL v2
Encryption Routines: AVAILABLE

/etc/xinetd.d/nsca:

# default: on
# description: NSCA (Nagios Service Check Acceptor)
service nsca
{
        disable         = no
        flags           = REUSE
        socket_type     = stream
        wait            = no
        user            = nagios
        group           = nagios
        server          = /usr/local/nagios/bin/nsca
        server_args     = -c /usr/local/nagios/etc/nsca.cfg --inetd
        log_on_failure  += USERID
        per_source      = UNLIMITED
}

The nsca.cfg file contains mostly default settings and uses DES encryption.

I'm not sure that nsca is the problem here - as the nsca processes continue to run.... it is the nagios daemon which is dying - leaving many nsca processes (limited to 500 by configuration).

However, I'll continue to follow the questions, just in case I've overlooked something.

One of the reasons I don't think nsca is the problem is that I have a parallel machine which also receives nsca packets. The only difference between the two is that one receives the nsca traffic over a VPN link (this one does not fail), and the other receives it over the internet, with firewall rules to allow the packets through (this is the one that fails... sometimes).

In collecting data, I have found a difference between the two machines: the one which fails additionally runs nsca-ng in parallel (again, not sure if this matters, but it is a difference between the two).

I will look into the debugging of nsca as per Box293's post as well.

Is there anything I can do to get a better idea of why the nagios process has failed? The nagios debug log appears to wrap very quickly, as we have 719 hosts and 7377 services being monitored. Perhaps, though, if nagios dies the debug file would stop being updated, and I could therefore capture the last few moments. Certainly nagios.log wasn't updated between the "time of death" and the "time of restart".
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: nagios dies - sometimes

Post by tgriep »

Can you post your nsca.cfg file so we can view it?
Another possible cause: if the data from the remote site was lost, NSCA may not be closing down correctly and may then hold the nagios.cmd file locked, causing the Nagios process to stop.
Get the debugging going and post back what you find.
MalcolmPreen
Posts: 63
Joined: Wed Jan 25, 2012 9:21 am

Re: nagios dies - sometimes

Post by MalcolmPreen »

nsca.cfg (from /usr/local/nagios/etc/nsca.cfg) attached (password starred out)

Will continue to debug - and attempt to capture anything useful....

Whilst I'm here - are there any recommendations for debug_level setting for nagios ?
Logging ALL seems to be too much information.... but I don't want to miss anything....

My plan at present is to run with:
- nsca debugging on as described
- nagios debug_level=-1 (unless better advice as above)

Monitor for failure (using cron, as nagios won't be able to monitor its own death... as it's dead... unless anyone knows better)

When failure is detected, preserve /var/log/messages and /usr/local/nagios/var/*, and ideally restart nagios (as it's a production system)

Does the above make sense? Any other suggestions?
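A rough Python sketch of the cron-driven check described in the plan: decide whether the nagios daemon is still alive and, if not, copy the evidence aside before anything is restarted. The lock-file path and the /var/tmp destination are assumptions for a source install, and the function names are made up for illustration; this is not an official Nagios tool.

```python
import os
import shutil
import time
from pathlib import Path

def pid_is_alive(pid):
    """Return True if a process with this PID exists (signal 0 sends nothing)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True

def read_pid(lock_file):
    """Read the PID stored in a lock file; None if missing or unreadable."""
    try:
        return int(Path(lock_file).read_text().strip())
    except (OSError, ValueError):
        return None

def preserve(paths, dest_root):
    """Copy regular files into a new time-stamped directory and return it."""
    dest = Path(dest_root) / time.strftime("nagios-failure-%Y%m%d-%H%M%S")
    dest.mkdir(parents=True)
    for root in map(Path, paths):
        if root.is_file():
            shutil.copy2(root, dest / root.name)
        elif root.is_dir():
            for src in root.rglob("*"):
                if src.is_file():  # skips FIFOs and sockets such as nagios.cmd
                    target = dest / root.name / src.relative_to(root)
                    target.parent.mkdir(parents=True, exist_ok=True)
                    shutil.copy2(src, target)
    return dest

if __name__ == "__main__":
    # Assumed location; check the lock_file setting in nagios.cfg.
    LOCK_FILE = "/usr/local/nagios/var/nagios.lock"
    pid = read_pid(LOCK_FILE)
    if pid is None or not pid_is_alive(pid):
        saved = preserve(["/var/log/messages", "/usr/local/nagios/var"], "/var/tmp")
        print("nagios not running; evidence saved to", saved)
```

Run from cron every minute or so, the restart step would follow the `preserve` call, so that the logs are captured before the new process starts writing to them.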

Malcolm
Attachment: obfuscated_nsca.cfg (nsca.cfg file - only edit is to "star out" the password)
jotagera
Posts: 4
Joined: Mon Jun 06, 2016 9:43 am

Re: nagios dies - sometimes

Post by jotagera »

Hi

Yesterday I hit Nagios 4 behavior similar to yours. The problem began with the installation of a new plugin, developed internally as a shell script and used to monitor 30 new services.

During the event, I tried to send an NSCA passive check by hand. send_nsca could connect and send data, but there was no response from the server.

On the server itself I could see the number of nsca processes increasing over time.

After a long search, I found the problem: exhaustion of the number of open files permitted to the nagios user by sysctl.conf and limits.conf.

I increased the number of files from 1024 to 10000 and the problem was solved.
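One way to check for this condition on Linux, sketched in Python: compare the number of descriptors a process currently holds (entries under /proc/<pid>/fd) with its soft "Max open files" limit from /proc/<pid>/limits. The function names and the sample text are illustrative assumptions, and reading another user's /proc entries normally requires root.

```python
import os
import re

def _limit_value(text):
    """Convert a limits field to int, or None when it says 'unlimited'."""
    return None if text == "unlimited" else int(text)

def parse_open_files_limit(limits_text):
    """Return (soft, hard) 'Max open files' from /proc/<pid>/limits text."""
    match = re.search(r"^Max open files\s+(\S+)\s+(\S+)", limits_text, re.MULTILINE)
    if not match:
        raise ValueError("no 'Max open files' line found")
    return _limit_value(match.group(1)), _limit_value(match.group(2))

def fd_usage(pid):
    """Return (open_fds, soft_limit) for a PID; Linux only."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as handle:
        soft, _hard = parse_open_files_limit(handle.read())
    return open_fds, soft

# Abbreviated, hypothetical excerpt in the layout of /proc/<pid>/limits.
sample_limits = (
    "Limit                     Soft Limit           Hard Limit           Units     \n"
    "Max cpu time              unlimited            unlimited            seconds   \n"
    "Max open files            1024                 4096                 files     \n"
)

if __name__ == "__main__":
    print(parse_open_files_limit(sample_limits))
```

Calling `fd_usage` with the PID of the running nagios daemon and logging the result periodically would show whether the count creeps towards the limit before a failure.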

Regards

PS: Sorry for my poor English. :)
MalcolmPreen
Posts: 63
Joined: Wed Jan 25, 2012 9:21 am

Re: nagios dies - sometimes

Post by MalcolmPreen »

Thank you for your reply - I will investigate on Friday (vacation tomorrow) - however, I am not sure the problem you describe is the same.

On my server, I see nsca processes come and go... the only reason they remain appears to be that nagios is not communicating with them (i.e. it has died unexpectedly).

I can "reproduce" this situation with "service nagios stop" (i.e. a controlled stop). When I then issue a "service nagios start", it is noticeable that there can be a number of nsca processes which were active when nagios was stopped, and these are never picked up by the "new" nagios.

Because of this situation, I have a script which monitors the nsca processes and kills any that are older than "acceptable".

As such, I think the nsca processes hanging around are not the problem, but a consequence of the unexplained nagios abort (the same effect is also produced by a controlled nagios shutdown).

Does that help to explain?
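The actual monitoring script is not shown in the thread; the following Python sketch only illustrates the idea of selecting nsca processes older than an "acceptable" age. It assumes a procps `ps` that supports the `etimes` (elapsed seconds) field, the 300-second threshold is an arbitrary placeholder, and it only reports the PIDs rather than killing them.

```python
import subprocess

def stale_pids(ps_output, command="nsca", max_age_seconds=300):
    """Pick PIDs of `command` older than the threshold.

    Expects the output of `ps -eo pid,etimes,comm`.
    """
    stale = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3 or not fields[0].isdigit() or not fields[1].isdigit():
            continue  # header line or something unexpected
        pid, age, comm = int(fields[0]), int(fields[1]), fields[2].strip()
        if comm == command and age > max_age_seconds:
            stale.append(pid)
    return stale

# Hypothetical ps output used to illustrate the selection.
sample_ps = (
    "  PID ELAPSED COMMAND\n"
    " 1201      12 nsca\n"
    " 1187    4000 nsca\n"
    "  950   90000 nagios\n"
    " 1190     301 nsca\n"
)

if __name__ == "__main__":
    result = subprocess.run(
        ["ps", "-eo", "pid,etimes,comm"],
        capture_output=True, text=True, check=True,
    )
    print("stale nsca PIDs:", stale_pids(result.stdout))
```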
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: nagios dies - sometimes

Post by tgriep »

Here is a description of the debug_level option in the nagios.cfg file.
To set up debugging in Nagios, edit the nagios.cfg file and set the option as below. That should be a good starting level.

debug_level=5

This option determines what type of information Nagios should write to the debug_file. The value is a logical OR of the values below.

-1 = Log everything
0 = Log nothing (default)
1 = Function enter/exit information
2 = Config information
4 = Process information
8 = Scheduled event information
16 = Host/service check information
32 = Notification information
64 = Event broker information
Try that and see if you catch anything. If you do, post it here.
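To make the OR arithmetic concrete, here is a small Python sketch that decodes a debug_level value into the categories listed above. It covers only the values quoted in this post; the `DEBUG_FLAGS` table and `describe_debug_level` name are illustrative, not part of Nagios.

```python
# Flag values exactly as listed in the post above.
DEBUG_FLAGS = {
    1: "Function enter/exit information",
    2: "Config information",
    4: "Process information",
    8: "Scheduled event information",
    16: "Host/service check information",
    32: "Notification information",
    64: "Event broker information",
}

def describe_debug_level(level):
    """List the categories enabled by a debug_level value.

    -1 has every bit set, so it matches every flag; 0 matches none.
    """
    return [name for bit, name in DEBUG_FLAGS.items() if level & bit]

if __name__ == "__main__":
    # The suggested starting level: 5 = 1 | 4
    for name in describe_debug_level(5):
        print(name)
    # Building another value, e.g. process plus host/service check information
    print(4 | 16)
```

So debug_level=5 enables function enter/exit and process information, and a value such as 20 (4 | 16) would log process and host/service check information instead.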
MalcolmPreen
Posts: 63
Joined: Wed Jan 25, 2012 9:21 am

Re: nagios dies - sometimes

Post by MalcolmPreen »

Just to confirm... I've not forgotten this... the debug stuff is all in place... but the error hasn't recurred....

Malcolm