Summary
There are many possible causes for this error. Most often it is caused by a connection issue with the backend historical database, crashed database tables, core scheduling or check execution issues, or a lack of resources (causing orphaned checks).
Editing Files
In many steps of this article you will be required to edit files. This documentation uses the vi text editor. When using the vi editor:
- To make changes press i on the keyboard first to enter insert mode
- Press Esc to exit insert mode
- When you have finished, save the changes in vi by typing :wq and press Enter
More Details
The typical workflow can be explained as follows:
Nagios Core schedules a check. Once the check has run and its output is returned, the ndomod NEB module pushes the check result to the ndo2db daemon by placing it in the kernel message queue. The ndo2db daemon then connects to the MySQL "nagios" (ndoutils) historical database and inserts the check result. The Nagios XI PHP scripts then query the "nagios" database for the status information to display on the frontend (in contrast to the Nagios Core CGIs, which query the status.dat file directly).
Thus, there are a number of things that can interfere with updating the "Last Check" time on the XI UI.
- The check is failing to be scheduled or executed.
- ndo2db is failing to insert the check result into the "nagios" mysql database or the Nagios XI frontend database query is failing.
Troubleshooting
The first troubleshooting step is to verify whether the checks are actually being scheduled and executed. If they are not, it is usually an issue with the Nagios Core engine. If they are, it is most likely a database issue.
The easiest way to verify this is to check the Nagios Core web frontend to see if the "Last Check" time is updating. Browse to:
http://<server_ip_or_hostname>/nagios/
Check any of the details for an object that is currently experiencing issues with "Last Check" times. If the Core interface displays accurate "Last Check" times, proceed to Step 2 below. If the Core interface is experiencing the same issues as the XI interface, follow Step 1 below.
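The same comparison can be made from the command line by reading the status.dat file that the Core CGIs use. The following is a minimal sketch, not an official Nagios tool: the function name stale_services is hypothetical, and the status.dat location shown in the usage comment is an assumption based on a default install, so adjust it to match your system.

```shell
# Sketch: list services whose last_check is older than a threshold.
# Usage: stale_services STATUS_FILE MAX_AGE_SECONDS [NOW_EPOCH]
# Example (path is an assumption for a default install):
#   stale_services /usr/local/nagios/var/status.dat 900
stale_services() {
    awk -v max_age="$2" -v now="${3:-$(date +%s)}" '
        # Start of a servicestatus block: reset the collected fields.
        /^[ \t]*servicestatus[ \t]*[{]/ {
            in_block = 1; host = ""; svc = ""; last = 0; next
        }
        # End of the block: report it if the last check is too old.
        in_block && /^[ \t]*[}]/ {
            if (now - last > max_age) print host ";" svc ";" last
            in_block = 0; next
        }
        # Inside the block: pick out the key=value pairs of interest.
        in_block {
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            if (eq == 0) next
            key = substr(line, 1, eq - 1)
            val = substr(line, eq + 1)
            if (key == "host_name") host = val
            else if (key == "service_description") svc = val
            else if (key == "last_check") last = val
        }
    ' "$1"
}
```

Each output line has the form host;service;last_check_epoch. If this prints nothing while the XI interface shows stale "Last Check" times, the checks are probably running and the problem is more likely on the database side.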
1. The check is failing to be scheduled or executed
Issues with the Nagios Core auto-rescheduler directives:
There were a few bugs with the introduction of the auto_rescheduling feature in Nagios Core 4.0.8 (released 08/12/2014) which is used in Nagios XI 2014R1.4 (released 08/14/2014). Those affected by this bug will notice the nagios.log file filled with errors pertaining to rescheduled checks. Originally, the new directives added to nagios.cfg could cause rescheduled checks to never execute, and instead be continuously rescheduled. The original /usr/local/nagios/etc/nagios.cfg directives were:
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=180
Reducing the auto_rescheduling_window to 45 should resolve this issue:
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=45
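One way to confirm which window value is currently in the file is a small check like the sketch below. The function name check_rescheduling_window is hypothetical, and the default limit of 45 simply mirrors the value recommended above.

```shell
# Sketch: report auto_rescheduling_window values above a limit.
# Usage: check_rescheduling_window CONFIG_FILE [LIMIT]
# Example:
#   check_rescheduling_window /usr/local/nagios/etc/nagios.cfg
# Exit status is 1 if any value exceeds the limit, otherwise 0.
check_rescheduling_window() {
    awk -F= -v limit="${2:-45}" '
        /^[ \t]*auto_rescheduling_window[ \t]*=/ {
            found = 1
            w = $2 + 0
            if (w > limit) {
                print "auto_rescheduling_window is " w ", above " limit
                bad = 1
            }
        }
        END {
            if (!found) print "auto_rescheduling_window not set"
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```

The function prints nothing and returns 0 when the directive is present and within the limit.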
Once the above changes are made to nagios.cfg, restart Nagios Core using the command below:
RHEL 7 | CentOS 7 | Oracle Linux 7 | Debian | Ubuntu 16/18
systemctl restart nagios.service
Resource Issues forcing the rescheduling of checks:
If the system ulimit settings are too restrictive, checks may be orphaned and forced to reschedule. Usually, this behaviour is identified by checking the nagios.log file for lines similar to:
[1331905537] Warning: The check of service 'SERVICE' on host 'NAMESERVER' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1331755699] Warning: The check of service 'SWAP' on host 'nameserver' could not be performed due to a fork() error: 'Resource temporarily unavailable'. The check will be rescheduled.
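To get a rough idea of how often these warnings occur, the log can be searched for the two message types. This is only a sketch: the function name count_orphan_warnings is hypothetical, the search patterns are assumptions based on the wording of the orphaned-check and fork() warnings, and the log path in the usage comment assumes a default install.

```shell
# Sketch: count orphaned-check and fork() error warnings in a log file.
# Usage: count_orphan_warnings LOG_FILE
# Example (path is an assumption for a default install):
#   count_orphan_warnings /usr/local/nagios/var/nagios.log
# Note: grep -c prints 0 and returns a non-zero status when nothing matches.
count_orphan_warnings() {
    grep -ciE 'looks like it was orphaned|fork\(\) error' "$1"
}
```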
If many of those lines exist in nagios.log, perform the following tasks to increase the kernel ulimits:
Edit the file /etc/security/limits.conf and define / update the following settings:
#locked memory
* hard memlock 128
* soft memlock 128
#open files
* soft nofile 10000
* hard nofile 10000
root hard nofile 10000
root soft nofile 10000
#max user processes
* hard nproc 4096
* soft nproc 4096
#stack size
* hard stack 20480
* soft stack 20480
If a setting does not exist, add the line. Once you have made the changes, save the file and restart the server.
After the server has rebooted, execute the following command to verify that the new settings are in place:
ulimit -a
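Note that ulimit -a reports the limits of the current shell. As a complementary check, the limits of an already running process can be read from its /proc/<pid>/limits file on Linux. The sketch below assumes that file layout; the function name proc_limit is hypothetical, and the pgrep call in the usage comment is only one possible way to find the PID.

```shell
# Sketch: print the soft and hard value of one limit from a limits file.
# Usage: proc_limit LIMITS_FILE "LIMIT NAME"
# Example (looks up the oldest process named exactly "nagios"):
#   proc_limit "/proc/$(pgrep -o -x nagios)/limits" "Max open files"
proc_limit() {
    awk -v name="$2" 'index($0, name) == 1 {
        # Everything after the limit name is: soft, hard, units.
        rest = substr($0, length(name) + 1)
        split(rest, f, " ")
        print f[1], f[2]
    }' "$1"
}
```

The output is two values, soft limit first and hard limit second, which can be compared against the values set in /etc/security/limits.conf.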
2. ndo2db is failing to insert the check result into the "nagios" mysql database.
There are crashed tables in the Nagios database:
Crashed tables can be identified by checking the mysql/mariadb logs located at:
/var/log/mysqld.log
or for mariadb:
/var/log/mariadb/
The relevant errors should resemble:
141127 10:40:24 [ERROR] /usr/libexec/mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
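On a busy log it can help to reduce these errors to a list of the affected tables. A minimal sketch follows, assuming the message format shown above; the function name crashed_tables is hypothetical.

```shell
# Sketch: list the unique tables reported as crashed in a MySQL/MariaDB log.
# Usage: crashed_tables LOG_FILE
# Example:
#   crashed_tables /var/log/mysqld.log
crashed_tables() {
    # Keep only the quoted table name from each matching line, then de-duplicate.
    sed -n "s/.*Table '\([^']*\)' is marked as crashed.*/\1/p" "$1" | sort -u
}
```

An empty result means no line in that log file matched the "marked as crashed" message.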
Repair the tables with the following command:
cd /usr/local/nagiosxi/scripts/
./repair_databases.sh
Check For Multiple Nagios Processes
After following the steps above, make sure that multiple nagios processes are not running.
Execute this command to check:
ps -ef | grep nagios.cfg | grep -v grep
The following output is healthy:
nagios 5713 1 0 08:40 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 5723 5713 0 08:40 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
The first line has a PID of 5713; this is the parent process.
The second line has a PID of 5723 but references the parent PID 5713. It is a child of the parent process, which is normal behavior. On heavily loaded systems you may see multiple child processes; this is also normal.
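The parent/child distinction described above can also be checked automatically. The sketch below counts nagios processes whose parent PID is not itself one of the listed nagios processes. It assumes the standard ps -ef column layout (PID in column 2, PPID in column 3, command in column 8); the function name count_nagios_parents is hypothetical.

```shell
# Sketch: count parent nagios processes in "ps -ef" output read from stdin.
# Usage:
#   ps -ef | count_nagios_parents
count_nagios_parents() {
    awk '
        # Keep only lines whose command is the nagios binary started with nagios.cfg.
        /nagios\.cfg/ && $8 ~ /\/nagios$/ { pid[$2] = 1; ppid[$2] = $3 }
        END {
            n = 0
            # A process is a parent if its own parent is not a nagios process.
            for (p in ppid) if (!(ppid[p] in pid)) n++
            print n
        }
    '
}
```

A result of 1 corresponds to the healthy output shown above; a higher number suggests more than one parent process is running.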
If your output has more than one parent process, execute the following commands:
RHEL 7 | CentOS 7 | Oracle Linux 7 | Debian | Ubuntu 16/18
systemctl stop nagios.service
killall -9 nagios
systemctl start nagios.service
Final Thoughts
For any support related questions please visit the Nagios Support Forums at: