Hello,
Lately (possibly related to updating Nagios XI to 5.4.13) we are experiencing more and more issues with downtimes and acknowledgments no longer working. The load on my Nagios server is low (3-4 on a server with 10 cores).
I have already gone through all the recommendations to tune the message queue, such as https://support.nagios.com/kb/article/n ... d-139.html
The httpd logs didn't make me any wiser, but in /usr/local/nagios/var/nagios.log I did find several of the following errors:
"External command error: Malformed command"
"External command error: Command failed"
I didn't find any useful information about these errors. What could cause them, and how can I get rid of them? Could they be related to my downtime and acknowledgment issues (which are resolved for several days each time I restart the Nagios service)?
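To see how often each of these errors occurs, the log can be tallied by message text. This is only a minimal sketch: the helper name count_extcmd_errors is made up for illustration, and it assumes the error strings appear in the log exactly as quoted above.

```shell
#!/bin/sh
# Count "External command error" lines in a Nagios Core log, grouped by message.
# count_extcmd_errors is a hypothetical helper name, not part of Nagios.
count_extcmd_errors() {
    log_file="${1:-/usr/local/nagios/var/nagios.log}"
    # Strip everything before the error text (e.g. the timestamp), then tally.
    grep 'External command error:' "$log_file" \
        | sed 's/^.*\(External command error: .*\)$/\1/' \
        | sort | uniq -c | sort -rn
}

# Usage:
#   count_extcmd_errors /usr/local/nagios/var/nagios.log
```

Comparing the counts before and after a restart may show whether the errors track the downtime/acknowledgment problem at all.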
Grtz
Thanks.
Willem
Downtime and acknowledgments stop working
Nagios XI 5.8.1
https://outsideit.net
Re: Downtime and acknowledgments stop working
What version was the system on previously? The error messages could be related, and I would try to correlate them with the messages in /usr/local/nagiosxi/var/cmdsubsys.log. The next time it is in this state, could you PM me a copy of this file as well as a profile?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Re: Downtime and acknowledgments stop working
Hello,
Previous version was 5.4.11.
I'm experiencing the same issues again today...
FYI, ipcs -q doesn't report long queues, so that shouldn't be the problem.
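For anyone wanting to repeat that queue check, here is a rough sketch that filters ipcs -q output for queues holding more than a given number of messages. The helper name busy_queues and the threshold of 100 are assumptions for illustration, and the parsing assumes the usual Linux (util-linux) column layout, where the sixth column is the message count.

```shell
#!/bin/sh
# List System V message queues holding more than a given number of messages.
# busy_queues is a hypothetical helper; it reads `ipcs -q` output on stdin
# and prints "msqid messages" for each queue above the threshold.
busy_queues() {
    threshold="${1:-100}"
    # Data rows start with a hex key; column 6 is the message count.
    awk -v max="$threshold" '$1 ~ /^0x/ && $6 + 0 > max { print $2, $6 }'
}

# Usage on a live system (the threshold of 100 is an arbitrary assumption):
#   ipcs -q | busy_queues 100
```

No output means no queue is above the threshold, which matches what is described here.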
No useful logs in /var/log/httpd/ssl_error_log
I checked the Nagios Core logs again, and very few external commands have failed in the last few hours. As I have the issue right now, this suggests the external command errors have nothing to do with it. The "Command failed" errors seem to come from a server trying to send passive checks over NSCA for a host that doesn't exist.
I'm sending the logfile and the profile to you over a pm.
Restarting the server again; hopefully it keeps working a little longer than 1 day. Please suggest the next step in troubleshooting this.
Grtz
Willem
Re: Downtime and acknowledgments stop working
Hey Willem,
How are you submitting the downtime / acknowledgements, and through what method - automatically, or manually?
Are you restarting the entire server, or just nagios related services?
If you tail the cmdsubsys log, are you able to confirm the external commands coming in for each of the scheduled items?
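A possible way to do that check, sketched under assumptions: the helper name recent_cmdsubsys is hypothetical, and the keyword to search for is a guess, since the exact wording of cmdsubsys.log entries is not shown in this thread.

```shell
#!/bin/sh
# Show the most recent cmdsubsys.log lines that mention a keyword.
# recent_cmdsubsys is a hypothetical helper; pick the keyword after looking
# at what the log actually contains on your system.
recent_cmdsubsys() {
    keyword="$1"
    log_file="${2:-/usr/local/nagiosxi/var/cmdsubsys.log}"
    # Case-insensitive match, limited to the last 20 hits.
    grep -i -- "$keyword" "$log_file" | tail -n 20
}

# To watch the log live while submitting a downtime from the GUI:
#   tail -f /usr/local/nagiosxi/var/cmdsubsys.log
```

If nothing new shows up in the log when a downtime is submitted, the command is probably not reaching the command subsystem at all.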
It sounds clean, maybe related to the upgrade as you mentioned, but hopefully this will help to find some further options for debugging.
Cheers,
Clint
Former Nagios Employee
Re: Downtime and acknowledgments stop working
Hey Clint,
Downtimes and acknowledgments are submitted through a mix of the GUI and the external command interface, both manually and automatically.
Yesterday I tried restarting the nagios service, which resolved the issue for about 20 hours. This morning I restarted the server completely; we'll see how long it takes now for the issue to reappear.
Grtz
Willem
Re: Downtime and acknowledgments stop working
There are messages logged indicating a problem connecting to the Postgres database. The next time this occurs, try restarting only the Postgres database to see if that clears things up and helps narrow down a cause. I'd also like to get a copy of /usr/local/nagiosxi/var/dbmaint.log (and yesterday's, if available).
Re: Downtime and acknowledgments stop working
Clint,
I'll send you the dbmaint.log file via PM. About the logfile from yesterday, I'm not sure there is one:
ls -la /usr/local/nagiosxi/var/dbmain*
-rw-r--r-- 1 nagios nagios 4101947 May 16 10:30 /usr/local/nagiosxi/var/dbmaint.log
-rw-r--r-- 1 nagios nagios 6696460 Aug 6 2017 /usr/local/nagiosxi/var/dbmaint.log-20170806
-rw-r--r-- 1 nagios nagios 6661612 Aug 13 2017 /usr/local/nagiosxi/var/dbmaint.log-20170813
-rw-r--r-- 1 nagios nagios 6658892 Aug 20 2017 /usr/local/nagiosxi/var/dbmaint.log-20170820
-rw-r--r-- 1 nagios nagios 6637984 Aug 27 2017 /usr/local/nagiosxi/var/dbmaint.log-20170827
-rw-r--r-- 1 nagios nagios 274354 May 12 03:13 /usr/local/nagiosxi/var/dbmaint.log-20180512.gz
Is there supposed to be a logfile per day?
The last line of the log says: 'Repair Complete: Removing Lock File'
Is there a scheduled repair on the Postgres db? Or did it repair itself after an issue somehow? While writing this I can see another repair being completed. It seems you are correct that there is something wrong going on in the Postgres db.
Looking forward to your analysis of the Postgres logfile.
Grtz
Willem
Re: Downtime and acknowledgments stop working
The log actually shows pretty normal behavior, and the message just means that the usual maintenance is complete. Have there been any problems since doing the full reboot?
Re: Downtime and acknowledgments stop working
Good to hear that there are no issues in Postgres. The issue has not reappeared since I rebooted the server, but I suspect it will within the next week. Is there anything else I should check the next time we experience it?
Re: Downtime and acknowledgments stop working
I would be curious to see if acknowledgements sent directly to the cmd.cgi would work next time it is in that state:
https://old.nagios.org/developerinfo/ex ... mand_id=40
https://old.nagios.org/developerinfo/ex ... mand_id=39
Note that instead of "/bin/printf" you may need to use "/usr/bin/printf" and make sure to use the actual host and service name instead of host1 and service1
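As a rough illustration of what such a test could look like, the sketch below builds the two acknowledgement lines in the Nagios Core external command format. The helper names ack_svc_line and ack_host_line are made up, the sticky/notify/persistent values (2;1;1) are example choices, and the command file path in the usage comment is the usual source-install default, so check command_file in nagios.cfg before relying on it.

```shell
#!/bin/sh
# Build external command lines for acknowledging problems.
# ack_svc_line and ack_host_line are hypothetical helper names.
# Service field order: host;service;sticky;notify;persistent;author;comment
ack_svc_line() {
    # $1 host, $2 service, $3 author, $4 comment, $5 optional epoch timestamp
    now="${5:-$(date +%s)}"
    printf '[%s] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;2;1;1;%s;%s\n' \
        "$now" "$1" "$2" "$3" "$4"
}

# Host field order: host;sticky;notify;persistent;author;comment
ack_host_line() {
    # $1 host, $2 author, $3 comment, $4 optional epoch timestamp
    now="${4:-$(date +%s)}"
    printf '[%s] ACKNOWLEDGE_HOST_PROBLEM;%s;2;1;1;%s;%s\n' \
        "$now" "$1" "$2" "$3"
}

# Usage (replace host1/service1 with real names; path is an assumption):
#   ack_svc_line host1 service1 "Willem" "test ack" \
#       > /usr/local/nagios/var/rw/nagios.cmd
```

If an acknowledgement written this way is applied while the GUI ones are not, that would point at the XI command subsystem rather than at Nagios Core.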