Downtime and acknowledgments stop working

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
WillemDH
Posts: 2320
Joined: Wed Mar 20, 2013 5:49 am
Location: Ghent

Post by WillemDH »

Hello,

Lately (possibly since updating Nagios XI to 5.4.13) we are experiencing more and more issues with downtimes and acknowledgments no longer working. The load on my Nagios server is low (3-4 on a server with 10 cores).

I have already gone through all the recommendations for tuning the message queue, such as https://support.nagios.com/kb/article/n ... d-139.html

The httpd logs didn't make me any wiser, but I did find several of the following errors in /usr/local/nagios/var/nagios.log:

"External command error: Malformed command"
"External command error: Command failed"

I didn't find any useful information about these errors. What could cause them and how can I get rid of them? And could they be related to my downtime and acknowledgment issues (which are solved for several days each time I restart the Nagios service)?
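
To get a feel for how often each of these errors is logged, the log can be summarized per message. The following is only a rough sketch: it assumes every line in nagios.log starts with a bracketed epoch timestamp, and the function name count_extcmd_errors is made up for illustration.

```shell
#!/bin/bash
# Summarize "External command error" entries per distinct message.
# Reads Nagios core log lines on stdin, assumed to look like "[epoch] message".
count_extcmd_errors() {
    grep 'External command error:' \
        | sed -e 's/^\[[0-9]*\] //' \
        | sort | uniq -c | sort -rn
}

# Usage (log path as mentioned in the post above):
# count_extcmd_errors < /usr/local/nagios/var/nagios.log
```

Stripping the timestamp first lets uniq group identical messages, so the most frequent error ends up on the first line.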

Grtz

Thanks.

Willem
Nagios XI 5.8.1
https://outsideit.net
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Post by cdienger »

What version was the system on previously? The error messages could be related, and I would try to correlate them with the messages in /usr/local/nagiosxi/var/cmdsubsys.log. The next time it is in this state, could you PM me a copy of this file as well as a profile?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Post by WillemDH »

Hello,

Previous version was 5.4.11.

I'm experiencing the same issues again today... :(
FYI, ipcs -q doesn't report long queues, so that shouldn't be the problem.
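
For reference, the queue check can be reduced to a single number. This is a minimal sketch that assumes the usual Linux (util-linux) ipcs -q layout, where each queue row starts with a 0x key and the sixth column holds the message count; sum_queued_messages is a hypothetical helper name.

```shell
#!/bin/bash
# Sum the "messages" column over all queues listed by `ipcs -q`.
# Assumes rows look like: key msqid owner perms used-bytes messages
sum_queued_messages() {
    awk '$1 ~ /^0x/ { total += $6 } END { print total + 0 }'
}

# Usage:
# ipcs -q | sum_queued_messages
```

A total that stays near zero is consistent with the queues not backing up.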

No useful logs in /var/log/httpd/ssl_error_log

I checked the Nagios core logs again and it seems very few external commands have failed in the last few hours. As I have this issue right now, this implies the external command errors have nothing to do with my issues. It seems the "Command failed" errors are related to a server trying to send passive checks over NSCA for a host that doesn't exist.

I'm sending the log file and the profile to you via PM.

Restarting the server again, hopefully it keeps working a little longer than 1 day. Please suggest the next step in troubleshooting this.

Grtz

Willem
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Post by rkennedy »

Hey Willem,

How are you submitting the downtime / acknowledgements, and through what method - automatically, or manually?

Are you restarting the entire server, or just nagios related services?

If you tail the cmdsubsys log, are you able to confirm the external commands coming in for each of the scheduled items?

It sounds clean, maybe related to the upgrade as you mentioned, but hopefully this will help to find some further options for debugging.

Cheers,

Clint
Former Nagios Employee

Post by WillemDH »

Hey Clint,

Downtimes and acknowledgments are done with a mix of actions done through the gui and actions done with the external command interface, both manual and automatic.

Yesterday I tried restarting the nagios service, which resolved the issue for roughly 20 hours. This morning I restarted the server completely; we'll see how long it takes now for the issue to reappear.

Grtz

Willem

Post by cdienger »

There are messages logged indicating a problem connecting to the postgres database. The next time this occurs, try just restarting the postgres database to see if it clears things up and helps narrow down a cause. I'd also like to get a copy of /usr/local/nagiosxi/var/dbmaint.log (and yesterday's, if available).

Post by WillemDH »

Clint,

I'll send you the dbmaint.log file with a pm. About the logfile from yesterday, I'm not sure there is one:

ls -la /usr/local/nagiosxi/var/dbmain*
-rw-r--r-- 1 nagios nagios 4101947 May 16 10:30 /usr/local/nagiosxi/var/dbmaint.log
-rw-r--r-- 1 nagios nagios 6696460 Aug  6  2017 /usr/local/nagiosxi/var/dbmaint.log-20170806
-rw-r--r-- 1 nagios nagios 6661612 Aug 13  2017 /usr/local/nagiosxi/var/dbmaint.log-20170813
-rw-r--r-- 1 nagios nagios 6658892 Aug 20  2017 /usr/local/nagiosxi/var/dbmaint.log-20170820
-rw-r--r-- 1 nagios nagios 6637984 Aug 27  2017 /usr/local/nagiosxi/var/dbmaint.log-20170827
-rw-r--r-- 1 nagios nagios  274354 May 12 03:13 /usr/local/nagiosxi/var/dbmaint.log-20180512.gz

Is there supposed to be a log file per day?

The last line of the log says: 'Repair Complete: Removing Lock File'

Is there a scheduled repair on the Postgres db? Or did it somehow repair itself after an issue? While writing this I can see another repair being completed. It seems you are correct that there is something wrong going on in the Postgres db.

Looking forward to your analysis of the log file.

Grtz

Willem

Post by cdienger »

The log actually shows pretty normal behavior, and the message just means that the usual maintenance is complete. Have there been any problems since doing the full reboot?

Post by WillemDH »

Good to hear that there are no issues in Postgres. The issue has not reappeared so far since I rebooted the server, but I suspect it will reappear within the next week. Is there anything else I should check the next time we experience this issue?

Post by cdienger »

I would be curious to see if acknowledgements sent directly to the cmd.cgi would work next time it is in that state:

https://old.nagios.org/developerinfo/ex ... mand_id=40
https://old.nagios.org/developerinfo/ex ... mand_id=39

Note that instead of "/bin/printf" you may need to use "/usr/bin/printf", and make sure to use the actual host and service names instead of host1 and service1.
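
As a rough illustration of the kind of test described above, the sketch below builds acknowledgement lines in the Nagios external command format and leaves the actual write as a commented usage line. The command file path is an assumption (the usual default on a Nagios XI install), the function names are made up, and the three flag values of 1 (sticky, notify, persistent) are just example settings.

```shell
#!/bin/bash
# Assumed default location of the Nagios external command file; adjust if yours differs.
CMD_FILE="${CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# Build a service acknowledgement line.
# Args: host service author comment
build_svc_ack() {
    local now
    now=$(date +%s)
    # Fields: host;service;sticky;notify;persistent;author;comment
    printf '[%s] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;1;1;1;%s;%s\n' \
        "$now" "$1" "$2" "$3" "$4"
}

# Build a host acknowledgement line.
# Args: host author comment
build_host_ack() {
    local now
    now=$(date +%s)
    # Fields: host;sticky;notify;persistent;author;comment
    printf '[%s] ACKNOWLEDGE_HOST_PROBLEM;%s;1;1;1;%s;%s\n' \
        "$now" "$1" "$2" "$3"
}

# Usage (replace host1/service1 with real names, run as a user allowed to write to the file):
# build_svc_ack "host1" "service1" "nagiosadmin" "Test acknowledgement" > "$CMD_FILE"
```

If an acknowledgement written this way shows up while one submitted through the interface does not, that would point away from Nagios core and toward the XI command subsystem.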