
Difference between revisions of "Nagios XI:FAQs"

From Nagios Support Wiki

(Apply Configuration Page Stalls Out, Never Completes)
(Supported Distributions)
 
(95 intermediate revisions by 9 users not shown)
Line 15: Line 15:
 
* Debugging Configuration Change Problems (if configuration changes are not taking effect)<br />[http://go.nagios.com/forum/595/2435 Write configuration file tool.]
 
  
 +
==== Hardware Requirements ====
 +
Check out our general guidelines on the hardware requirements needed to run Nagios XI:
 +
 +
[http://assets.nagios.com/downloads/nagiosxi/docs/Nagios_XI_Hardware_Requirements.pdf Nagios XI - Hardware Requirements]
 +
 +
==== Licensing ====
 +
 +
Every Nagios XI license key is valid for three installs, each with its own specific purpose.  Each install is necessary to properly manage and maintain a fully functional monitoring implementation.  The installs are described below:
 +
# '''Production Install''' - The main monitoring install for a given license key. This is the install that system administrators use on their production servers and infrastructure to monitor their environment and receive notifications when systems are not working properly.
 +
# '''Test/Lab Environment''' - The second install is for use in a test environment. This ensures that when upgrades are necessary, or major configuration changes are implemented, there are not adverse effects to the main monitoring system. The test install allows teams to “preview” their changes without jeopardizing the main system.
 +
# '''Backup Install''' - The final installation use case for a given license key is for use as a backup/failover of the Nagios XI Production install. This allows for a high-availability system to be set up, or can be used as a complete backup of the production install.
 +
These use cases, when implemented correctly, provide organizations with an infrastructure monitoring system capable of handling any environment.  If you have any questions about licensing terms for Nagios XI, or any additional questions regarding Nagios solutions, contact us at [mailto:sales@nagios.com sales@nagios.com].
 +
 +
Note: Deviation from the above use cases is a violation of Nagios license terms and conditions.  For more information, contact [mailto:sales@nagios.com sales@nagios.com].
 +
  
 
==== Supported Distributions ====
 
 
Nagios XI is currently supported on the following Linux distributions, for both 32-bit and 64-bit installations:
*CentOS 5 (Recommended)
+
*CentOS 5/6/7
*CentOS 6
+
*RHEL 5/6/7
*RHEL 5 (Recommended)
+
 
*RHEL 6
+
==== Installation Prerequisites ====
 +
'''Important:''' Nagios Enterprises highly recommends and will only support installing Nagios XI on a newly installed, “clean” system (a
 +
bare minimal install with nothing else installed or configured).
 +
 
 +
Attempting to install Nagios XI on a pre-existing system with other applications already installed can cause the Nagios XI installation
 +
process to fail, critical system components and settings (e.g. database servers) to be modified in a way that negatively affects other
 +
applications, and previously installed applications to be automatically upgraded or removed. While installing XI on a system with other
 +
applications is possible, it is not recommended due to the possible interactions and complexity of multiple components that are required
 +
for Nagios XI to function. If you choose to ignore these warnings, you do so at your own risk.
 +
 
 +
Internet access is required for installation and upgrades!
  
 
==== Capabilities ====
 
 
===== Is Nagios XI capable of Distributed Monitoring? =====
 
  
Yes it is! There are multiple options for Distributed Monitoring with Nagios.
+
Yes it is! There are [http://assets.nagios.com/downloads/general/docs/Distributed_Monitoring_Solutions.pdf multiple options] for Distributed Monitoring with Nagios.
  
[http://www.nagios.com/products/nagiosfusion Nagios Fusion] *New*   
+
[http://www.nagios.com/products/nagiosfusion Nagios Fusion]
  
 
Nagios Core (the underlying monitoring engine) can be configured for distributed monitoring.  For more information, read the Nagios Core documentation on [http://nagios.sourceforge.net/docs/3_0/distributed.html distributed monitoring].
 
  
[http://library.nagios.com/library/specialtopics/distributedmonitoring/308-using-dnx-with-nagios Distributed Monitoring with DNX]
+
[http://assets.nagios.com/downloads/nagiosxi/docs/Integrating_Mod_Gearman_with_Nagios_XI.pdf Integrating mod_gearman with Nagios XI]
  
 +
[http://assets.nagios.com/downloads/nagiosxi/docs/Using_DNX.pdf Using DNX With Nagios]
  
 
===== Is it possible to use SMS alerts for a custom SMS gateway? =====
 
Yes!  Nagios XI sends SMS alerts by via email.  Although we currently don't have a solution that allows users to define custom SMS gateways, the best way to get around this is to define a contact with an email address that will send the SMS message.  Email address examples are as follows:
+
Yes!  Nagios XI sends SMS alerts via email.  As of XI 2012, custom SMS gateways can be configured through Admin --> Manage Mobile Carriers.
 +
 
 +
Pre-2012 users can define a contact with an email address that will send the SMS message instead.  Email address examples are as follows:
 
   <phonenumber>@smsgateway.domain
 
   <phonenumber>@smsgateway.domain
 
   1235551234@messaging.sprintpcs.com (send SMS via sprint)
 
Line 42: Line 70:
  
 
==== System Configuration Problems ====
 
 +
 +
===== Resetting The nagiosadmin Password =====
 +
To reset the nagiosadmin password, run the following from the command line:
 +
 +
  /usr/local/nagiosxi/scripts/reset_nagiosadmin_password.php --password=<newpassword>
 +
Note: If you would like to use special characters in your password, you should escape them with "\". For example, if you want to set your new password to be "$newpassword#", then you can run:
 +
  /usr/local/nagiosxi/scripts/reset_nagiosadmin_password.php --password=\$newpassword\#
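As an alternative to backslashes, most shells treat everything inside single quotes literally. The sketch below only illustrates the quoting itself: it does not call the reset script, and the variable names are made up for the example. It shows that both forms produce the same string:

```shell
# Backslash-escaped form, as described in the note above
escaped=\$newpassword\#

# Single-quoted form: the shell does not expand $ or treat # specially here
quoted='$newpassword#'

# Both lines print the same literal password
printf '%s\n' "$escaped" "$quoted"
```

Either form could then be supplied as the value of --password.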
 +
 
===== Problems Using Nagios XI With Proxies =====
 
 
We do not officially support Nagios XI when you install and use proxy software that restricts traffic to or from the Nagios XI server.  There are several reasons for this.  First, Nagios XI requires external access for package installation and updates.  Package installation and updates may not work when proxies are used.  Additionally, the Nagios XI code makes several internal HTTP calls to the local Nagios XI server to import configuration data, apply configuration changes, process AJAX requests, etc.  These functions may not work properly when you deploy a proxy, which would result in a non-functional Nagios XI installation.
 
Line 57: Line 93:
 
   http_proxy=<nowiki>http://myname:mypass@someproxyserver:port/</nowiki> <nowiki># A</nowiki>ll in one string this time
 
 
   no_proxy=localhost,127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 <nowiki>#</nowiki> Hosts to exclude from proxying
 
 +
 +
If you are using an https proxy:
 +
 +
  https_proxy=<nowiki>https://myname:mypass@someproxyserver:port/</nowiki>
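Because XI makes HTTP calls to itself, it can help to confirm that localhost really is in the exclusion list. The following is a minimal sketch, assuming a comma-separated no_proxy value like the one above; the function name no_proxy_has is hypothetical and not part of Nagios XI:

```shell
# Return 0 if the comma-separated list in $1 contains the exact entry $2.
no_proxy_has() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

list='localhost,127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16'
if no_proxy_has "$list" localhost; then
    echo "localhost is excluded from proxying"
else
    echo "localhost is NOT excluded; internal XI calls may go through the proxy"
fi
```

The match is on whole entries only, so a value such as localhost.example would not be mistaken for localhost.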
  
 
Quoting is not needed (or helpful) in any of these, but if you have special characters in passwords (especially : or @) and are having problems you probably need to escape them with backslashes.
 
Line 70: Line 110:
  
 
4. Unset system proxy before running E-importnagiosql script
 
 +
 +
 +
'''Update Check Behind a Proxy'''
 +
Update checks are known to fail for systems behind a proxy.  We created a proxy component that should allow the update check to work behind most proxies.  Install this component from the Admin->Manage Components page and then access the Admin->Proxy Configuration page to configure the proxy settings.
 +
[http://assets.nagios.com/downloads/nagiosxi/components/proxy.zip Proxy Component]
 +
 +
==== Resolving Issues with the XI 2014 Upgrade ====
 +
 +
===== "CONFIG ERROR!" During 2014 Upgrade =====
 +
The most common error experienced during the XI 2014 upgrade process is the following core config error:
 +
 +
  CONFIG ERROR! Restart aborted. Check your Nagios configuration.
 +
 +
XI 2014 introduced some new mechanisms to guard against and remove the dreaded "ghost config" errors, as well as some issues pertaining to escalations/dependencies.  Because of these changes, you may receive the above error during the upgrade, immediately after the installation of Nagios Core 4.
 +
 +
The most common resolution requires fixing the config errors in the CCM, writing and verifying the config, and then re-running the upgrade script. Enumerated steps are below:
 +
 +
# Run ./upgrade until the error occurs.  Do not roll back the VM or installation.  XI will now be half-upgraded and the config errors will have to be resolved before the upgrade can continue.
 +
# In XI, browse to the CCM:  Configure --> Core Config Manager --> Tools --> Write Config Files.
 +
# Click "Write" and then "Verify".  You should receive at least one error.  The text of the error should be fairly descriptive concerning which object is having issues and what those issues potentially are. If you do not see any descriptive errors, you may have issues with escalations or service/host dependencies.  You will most likely want to de-activate these definitions until the upgrade is complete.
 +
# Resolve the error in the Core Config Manager (CCM).
 +
# Once the detected errors are resolved, re-run the "Write" and "Verify" process from the "Write Config Files" tool. Resolve any further errors in the CCM, repeating the process above as many times as necessary until all config errors are resolved.
 +
# Only when the "Verify" process completes without an error should you proceed.
 +
# Click "Apply Configuration" - it should complete without error at this point.
 +
# Now, return to the shell and re-run ./upgrade.  The upgrade process should continue past the Core 4 upgrade and Nagios process restart.
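The verification in the steps above can also be sketched from the shell. This assumes the Nagios Core binary and main config file live at the paths below, which are common defaults but are not stated in this FAQ, so adjust them for your system. The function name is made up for the example. Note that this checks the files on disk, so the CCM "Write" step must have been run first:

```shell
# Assumed default locations; override via the environment if yours differ.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}

# Run the Nagios Core configuration check (-v) and report whether it is
# reasonable to re-run ./upgrade.
config_ready_for_upgrade() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null 2>&1; then
        echo "Config verifies cleanly: re-run ./upgrade"
    else
        echo "Config errors remain: fix them in the CCM first"
        return 1
    fi
}
```

Calling config_ready_for_upgrade before returning to ./upgrade gives a quick yes/no answer; the detailed error text is still best read from the CCM "Verify" output.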
 +
 +
===== ICMP and Ping Checks Stopped Graphing After XI 2014 Upgrade =====
 +
Due to issues with CentOS/RHEL 5/6, rrdtool, and the performance graphs, rrdtool may cease to record performance data to RRDs from check_icmp.  This is caused by the addition of new performance datasources returned from the check_icmp plugin in newer versions of nagios-plugins.  Usually, rrdtool will just drop those extra datasources, but this is currently not working on CentOS/RHEL 5/6 under certain circumstances.
 +
 +
We provide a script to search for, and subsequently add, the missing datasources to the RRDs in question. For those upgrading to 2014, this script will essentially double the size of all ping/icmp RRDs.  Please verify that your XI server has ample free space before running the script.
 +
 +
You should back up your XI server, either through a VM snapshot or a full XI Backup.  The script does provide a way to make backups of your RRDs, but it is better to perform the backup through one of the two actions mentioned above.
 +
 +
The script can be downloaded from:
 +
http://assets.nagios.com/downloads/nagiosxi/scripts/rrd_ds_fix.zip
 +
 +
The script requires the perl library RRD::Simple.
 +
 +
The full steps are below:
 +
 +
  yum install perl-RRD-Simple -y
 +
  cd /tmp
 +
  wget http://assets.nagios.com/downloads/nagiosxi/scripts/rrd_ds_fix.zip
 +
  unzip rrd_ds_fix.zip
 +
To run the script with RRD backups:
 +
  ./fix_ds_quantity.sh -d /usr/local/nagios/share/perfdata/
 +
To run the script without RRD backups (if you have performed one of the suggested backup options above):
 +
  ./fix_ds_quantity.sh -i -d /usr/local/nagios/share/perfdata/
 +
 +
This process may take a considerable amount of time depending on how many RRDs need to be updated.  The script logs to /tmp/fix_rrd_ds.log.  Once completed, it may take 5-10 minutes for the new datasources to appear in the performance graphs tab (longer if rrdcached is used).
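Since the script can roughly double the size of the ping/icmp RRDs, a quick size comparison before running it may be useful. The sketch below is an illustration only: the function names are hypothetical, and it measures all RRDs under the perfdata directory, which gives a conservative upper bound rather than the exact growth:

```shell
# Total size in KB of all .rrd files under a directory (0 if none are found).
rrd_kb() {
    find "$1" -type f -name '*.rrd' -exec du -k {} + 2>/dev/null |
        awk '{ sum += $1 } END { print sum + 0 }'
}

# Free space in KB on the filesystem that holds a path.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

dir=${PERFDATA_DIR:-/usr/local/nagios/share/perfdata}
if [ -d "$dir" ]; then
    echo "RRDs use $(rrd_kb "$dir") KB; $(free_kb "$dir") KB free"
fi
```

If the free space is not comfortably larger than the current RRD total, free up disk space before running fix_ds_quantity.sh.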
 +
 +
===== Performance Graphing Stops After Upgrade to XI 2014r1.0 =====
 +
This issue was caused by an extraneous newline "\n" returned at the end of performance data.  It was specific to Nagios Core 4.x and has been fixed in Core 4.0.6. XI users can fix this behavior by updating to XI 2014r1.1.
 +
 +
If you are running XI 2014r1.0, you can verify this behavior by checking the problematic performance data for the object in XI (Home --> Details --> Advanced Tab --> Performance Data Field) for an extraneous newline "\n" at the end of the performance data string.
 +
 +
===== Issues with mod_gearman and Performance Data Newlines: "\n" =====
 +
If you have been using Mod Gearman and have upgraded to Nagios XI 2014 / Core 4, or plan on using Mod Gearman on Nagios XI 2014 / Core 4 you will need to
 +
follow a different installation script than is currently posted in our Mod Gearman Integration documentation. To begin, you will want to follow all steps outlined
 +
at:
 +
 +
http://assets.nagios.com/downloads/nagiosxi/docs/Integrating_Mod_Gearman_with_Nagios_XI.pdf
 +
 +
But will want to download and use the following installation script when the time comes to do so:
 +
 +
http://assets.nagios.com/downloads/nagiosxi/scripts/ModGearmanFullinstallVersionCore4.sh
 +
 +
Keep in mind that the current iteration of Mod Gearman that works with XI 2014 / Core 4 does not work with 32-bit distributions; it will only work properly on a
 +
server running a 64-bit architecture.
 +
 +
You will also need to modify a couple of the commands that Nagios XI uses to process performance data returned from your plugins when they are run. This
 +
removes an extra newline character that gets appended to the check results, which otherwise results in no performance data being graphed in the XI interface.
 +
 +
You will need to change the process-host-perfdata-file-bulk and process-service-perfdata-file-bulk commands to:
 +
 +
  sed -i 's/\\n//g' /usr/local/nagios/var/host-perfdata &&
 +
  /bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/$TIMET$.perfdata.host
 +
 +
And:
 +
 +
  sed -i 's/\\n//g' /usr/local/nagios/var/service-perfdata &&
 +
  /bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/$TIMET$.perfdata.service
 +
 +
Save, and apply configuration. This work-around should be fairly temporary until we get a more permanent fix in place, but for the time being you will need to
 +
follow these steps to properly integrate Mod Gearman alongside XI 2014 / Core 4.
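To see what the sed expression in these commands does, it can be tried on a sample string. The perfdata line below is made up for the illustration; the point is that the trailing "\n" is a literal backslash followed by the letter n (two characters), not a real line break, and the expression removes exactly that:

```shell
# A made-up perfdata line ending in a literal backslash-n (two characters)
sample='rta=0.512ms;100;500;0 pl=0%;20;60;0\n'

# Same expression as in the commands above, applied to a pipe instead of
# editing a file in place with -i
cleaned=$(printf '%s\n' "$sample" | sed 's/\\n//g')

echo "$cleaned"
```

The cleaned value no longer carries the trailing two characters, which is the form the performance data processing expects.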
 +
 +
===== Core 4 Load Spikes on 1.75 and 7 Hour Intervals =====
 +
With the release of Nagios XI 2014, the Core version on the back end was updated to Core 4. This introduced an issue in certain environments where an extremely high system load can occur at intervals, most commonly between one hour and seven hours after the Nagios process starts. As a temporary solution, if you have been experiencing this problem, we recommend that you modify:
 +
 +
  /usr/local/nagiosxi/html/config.inc.php
 +
 +
By changing the following line:
 +
 +
"nom_checkpoint_interval" => 1440, // time (in minutes) between nom checkpoints
 +
 +
To:
 +
 +
"nom_checkpoint_interval" => 90, // time (in minutes) between nom checkpoints
 +
 +
You may want to alter the above interval based on when you are experiencing these problems. Ideally it should be set to occur as close to the high-load anomaly as possible, so as to minimize system downtime and stress while we work towards a more permanent solution.  This will force the creation of a snapshot, so you may want to archive any important config snapshots, as these changes will increase the number of daily snapshots (possibly pushing needed snapshots from the pool).
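The effect on the number of daily snapshots is simple arithmetic: a day has 1440 minutes, so the default interval yields one checkpoint per day and a 90-minute interval yields sixteen. A small sketch (the function name is hypothetical) for trying other intervals:

```shell
# Checkpoints per day = minutes in a day / interval in minutes
checkpoints_per_day() {
    echo $((1440 / $1))
}

for interval in 1440 90; do
    echo "interval=${interval}min -> $(checkpoints_per_day "$interval") checkpoints per day"
done
```

This can help estimate how quickly older snapshots may be pushed out of the pool for a given interval.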
  
 
==== Installation and Upgrade Problems ====
 
 +
 +
===== CentOS 6 Installation Problems =====
 +
Between the release of Nagios XI 2011R1.7 and 1.8, several changes were made to the CentOS 6 repo that created package conflicts, preventing the Nagios XI installation scripts from completing successfully.  This usually becomes apparent when the "fullinstall" script fails with one of the following two messages:
 +
 +
 +
  ERROR: Prerequisite program 'mysql' not found!
 +
  which: no mysqladmin in (/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:  /usr/sbin:/sbin:/home/admin/bin)
 +
  ERROR: Prerequisite program 'mysqladmin' not found!
 +
  7 prerequisite(s) missing - exiting.
 +
 +
OR
 +
 +
  ls: cannot access /usr/local/nagiosxi/nom/checkpoints/nagioscore/*.gz: No such file or directory
 +
  NO NOM SNAPSHOT FOUND!
 +
  ERROR: NagiosQL import appears to have failed - exiting.  (Reason: Import files are still present in /usr/local/nagios/etc/import)
 +
 +
This issue can be resolved with the following solution if you're attempting to install with the 1.7 tarball.  This problem will be resolved in the 2011R1.8 release of Nagios XI. 
 +
 +
'''Before''' attempting any more installations, run:
 +
  yum install centos-release-cr
 +
Then remove any previous /tmp/nagiosxi directory that is in place, and unpack a fresh tarball:
 +
  cd /tmp
 +
  rm -rf nagiosxi
 +
  tar zxf xi-2011r1.7.tar.gz
 +
  cd nagiosxi
 +
  ./fullinstall
 +
 +
If the installer still fails, contact XI support and attach the install.log file that's generated by the fullinstall script. 
 +
 +
 
===== SourceGuardian Errors =====
 
 
After upgrading to 2009R1.2C, some users started getting an error about SourceGuardian.  Add this line to your /etc/php.ini file:
 
Line 77: Line 249:
 
Once you make that change, restart Apache:
 
 
   service httpd restart
 +
 +
=====Resolving "Cannot connect to database" Error - Core Config Manager=====
 +
If you're able to access the Nagios XI interface, but can't seem to access the Core Configuration Manager, try the following two steps to see if they resolve the issue.
 +
 +
Access the Admin->Reset Security Credentials page and reset the subsystem credentials.
 +
 +
Run the following command from the shell:
 +
  touch /usr/local/nagiosxi/html/config.inc.php
  
 
=====Resolving "DB Connect Error [nagiosxi]: Database connection failed"=====
 
Line 124: Line 304:
 
   ERROR: PostgresQL not running - exiting.
 
 
   ERROR: Nagios XI database was not setup properly - exiting.
 
 +
 +
===== "ERROR: Please add the 'Optional' channel to your Red Hat systems subscriptions." =====
 +
Red Hat Subscription Manager:
 +
 +
You need to add the Optional software channel so that Nagios XI can install the necessary prerequisites. To do so:
 +
 +
  yum install yum-utils
 +
  yum-config-manager --enable rhel-6-server-optional-rpms
 +
 +
Red Hat Network Classic:
 +
 +
You need to add the Optional software channel so that Nagios XI can install the necessary prerequisites. To do so, first sign in to your Red Hat Network account at http://rhn.redhat.com/. Then click on the link corresponding to your system. Near the bottom-left corner of the page, click Alter Channel Subscriptions. Check the box labeled RHEL Server Optional and click Change Subscriptions. That's it! You should be able to run the installer again and complete your installation.
 +
 +
===== "Installation errors on customized corporate builds of CentOS or RHEL" =====
 +
We have seen cases where a company requires the use of its own "standard build" of either OS. Nagios XI will not install successfully on such a build if the umask on the machine has been modified.
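This FAQ does not say which umask the installer expects. As an assumption, the sketch below compares the current value against 0022, the usual default on CentOS and RHEL; the function name is hypothetical:

```shell
# Assumed default umask; not confirmed by this FAQ.
expected_umask=0022

# Return 0 if the current shell's umask matches the assumed default.
umask_is_default() {
    [ "$(umask)" = "$expected_umask" ]
}

if umask_is_default; then
    echo "umask is $expected_umask"
else
    echo "umask is $(umask), not $expected_umask; this may affect the install"
fi
```

Run it as the same user (normally root) and in the same shell that will run the installer, since the umask can differ per user and per session.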
 +
 +
===== "Upgrade errors - root.crontab.orig: cannot overwrite existing file" =====
 +
We have seen problems when upgrading if there are leftover files from previous upgrades.
 +
 +
This problem can be eliminated by running the following command:
 +
  cat /dev/null > /tmp/nagiosxi/uninstall-crontab-root
 +
 +
After this you can proceed to run the upgrade script again.
 +
 +
 +
===== Ajaxterm Installation Aborted =====
 +
Nagios XI 2012 and late versions of 2011 set up Apache to be able to utilize the Ajaxterm subcomponent. This component requires a modification to the /etc/httpd/conf.d/ssl.conf file, and adds some proxy information needed specifically for Ajaxterm. If Apache fails to restart after making these modifications, the previous configuration gets rolled back in order to keep Apache running and the system usable. The bad configuration that the fullinstall/upgrade script attempted to apply is saved to /etc/httpd/conf.d/ajaxterm.fail. This file can be debugged and, once fixed, can replace the existing ssl.conf file in order to utilize Ajaxterm. '''Note:''' Remove the /etc/httpd/conf.d/ajaxterm.fail file once the issue is resolved to avoid the error message in the UI. Please contact the Nagios support team with any questions.
  
 
====Configuration Problems====
 
Line 130: Line 338:
 
If you receive an error while attempting to Apply Configuration stating that the configuration verification has failed, there is a syntax error or configuration conflict in the configuration that's been defined.  You can isolate this issue by accessing the Core Config Manager->Configuration Snapshots page.  You should see the most recent snapshot highlighted in red.  View the text file from the snapshot to see which config file contained the error.  You can then find that file in the associated tar.gz file and search for the problem based on the error message.  The snapshot represents the information that is CURRENTLY in the CCM database, which Nagios attempted to save.  You'll need to correct the issue through the Core Config Manager, then attempt to Apply Configuration again.
  
The Write Config Tool in the CCM is a manual tool for writing the DB information to the configuration files (it manually Applies Configuration).  It's important to know that Nagios cannot start or restart with a bad configuration.  The config verification must pass in order for Nagios to be able to restart successfully with the new configuration.
+
The Write Config Tool in the CCM is a manual tool for writing the DB information to the configuration files (it manually Applies Configuration).  It's important to know that Nagios cannot start or restart with a bad configuration.  The config verification must pass in order for Nagios to be able to restart successfully with the new configuration.
 
+
  
 
=====Configuration Applies, but still get "Configuration File Is Out Of Date" Error =====
 
Line 141: Line 348:
 
After changing the setting, restart your apache server:
 
 
   service httpd restart
 
 +
 +
 +
=====Apply Configuration Fails, No Configuration Problems =====
 +
 +
As of 2011 R1.7, extra sanity checks were added to the Apply Configuration functionality of Nagios XI to prevent false positives and also to prevent that page from stalling out endlessly.  An example error that can show up is:
 +
"Backend login to the Core Config Manager failed"
 +
 +
There are a few different reasons an error like this can show up.  The most common one is the use of a proxy that prevents "wget" from being able to resolve to "localhost" correctly.  However, if you receive an error message when attempting to Apply Configuration '''other than''' "Configuration Error...," run the following commands and send the output file to the Nagios XI support team.
 +
 +
  cd /usr/local/nagiosxi/scripts
 +
  ./reconfigure_nagios.sh &> reconfig.txt
 +
 +
Then also run the following command to begin capturing log output:
 +
 +
  tail -f /usr/local/nagiosxi/var/cmdsubsys.log &> cmd.txt
 +
 +
Then attempt to Apply Configuration from the web interface.  After the browser has returned some output to the screen, press Ctrl+C to stop the log tail, and send XI support the '''cmd.txt''' file and the '''reconfig.txt''' file generated by the above instructions.
  
 
=====Apply Configuration Page Stalls Out, Never Completes =====
 
Line 149: Line 373:
 
and the configuration never applies, the page may be timing out.  If you've recently updated XI, try restarting the server first.  If you're currently running Nagios XI 2011R1.3, there is a known bug that can cause this issue.  You'll need to upgrade to the latest version to resolve it.  If that does not resolve the issue, try editing your PHP settings.  Open the /etc/php.ini file in a text editor and increase the following values.
  
Scroll all the way down to the bottom of the script, and remove a line that says:
 
  memory_limit = 64M      ; Increase the memory limit
 
 
Then scroll up to around line 300, and increase the numbers for the following configs. 
 
  
 
   ;;;;;;;;;;;;;;;;;;;
 
Line 170: Line 390:
  
 
If the issue persists after the above solutions, it could be caused by creating a local DNS entry for the Nagios XI server but failing to add that name entry to the Nagios XI server itself.  For example, if you're accessing the XI server from the following URL: <nowiki>http://nagiosserver/nagiosxi</nowiki>, you need to verify that the XI server can also resolve that DNS name correctly.  The local DNS entry for the XI server needs to be added to the /etc/hosts file.
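One way to check whether a name is already present in /etc/hosts is sketched below. The function name hosts_has is hypothetical, and nagiosserver is simply the example hostname from the URL above:

```shell
# Return 0 if hostname $1 appears as a name or alias in hosts file $2
# (defaults to /etc/hosts). Comment lines and trailing comments are ignored.
hosts_has() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                                        # strip comments
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }   # skip the IP field
        END { exit found ? 0 : 1 }
    ' "${2:-/etc/hosts}"
}
```

For example, running `hosts_has nagiosserver || echo "add nagiosserver to /etc/hosts"` on the XI server prints the reminder only when the entry is missing.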
 +
 +
You can observe similar issues if you run out of disk space.
  
 
=====Configuration Applies, No Changes Take Place =====
 
This is generally due to permissions issues with the configuration file.  Use the '''Write Config Tool''' in the Core Config Manager to see if you can manually write the DB information to the config files.  If the Write Config Tool returns error messages related to permissions you can run the following script to correct the permission settings:
+
This is generally due to permissions issues with the configuration file.  Use the [http://assets.nagios.com/downloads/nagiosxi/docs/Exporting_XI_Configuration_Database.pdf Write Config Tool] in the Core Config Manager to see if you can manually write the DB information to the config files.  If the Write Config Tool returns error messages related to permissions you can run the following script to correct the permission settings:
 
   /usr/local/nagiosxi/scripts/reset_config_perms
 
  
Line 212: Line 434:
 
Underneath the Parents box in the CCM, make sure the "standard" radio button is selected. If "null" is selected your parent host selection doesn't get written to disk. We're working on a method of fixing the CCM so this doesn't happen with several fields.
 
  
 +
===== Warning: Duplicate definition found for contact 'xi_default_contact' =====
 +
This usually happens if you import the "static" directory config files in Nagios XI. When you try to apply configuration, you see an error, similar to this one:
  
 +
  Warning: Duplicate definition found for contact 'xi_default_contact'
 +
  (config file '/usr/local/nagios/etc/contacts.cfg', starting on line 79)
 +
  Error: Could not add object property in file '/usr/local/nagios/etc/contacts.cfg' on line 80.
 +
  Error processing object config files!
 +
 +
You can resolve this by running the following command and then applying configuration:
 +
 +
  curl -s <nowiki>http://assets.nagios.com/downloads/nagiosxi/scripts/fix_static_import</nowiki>| mysql -pnagiosxi nagiosql
  
 
=====Core Config Manager Problems=====
 
Line 259: Line 491:
  
 
==== Performance Graph Problems ====
 
 +
 +
===== General Performance Graph Troubleshooting =====
 +
Performance graphs can experience a number of issues.  Below are solutions for the most common problems.  The log verbosity should be increased before troubleshooting, and should be returned to default settings once resolved.
 +
 +
'''Increase Performance Data Logging Verbosity'''. 
 +
 +
Edit the file:
 +
 +
  /usr/local/nagios/etc/pnp/process_perfdata.cfg
 +
 +
Change:
 +
 +
  LOG_LEVEL = 0
 +
 +
To:
 +
 +
  LOG_LEVEL = 2
 +
 +
The process_perfdata.pl script should now log all errors and debug information to:
 +
 +
  /usr/local/nagios/var/perfdata.log
 +
 +
Remember to return this value to its default setting when troubleshooting is completed.
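
The edit above can also be scripted. The following is only a minimal sketch, not an official Nagios XI tool: ''set_cfg_value'' is a hypothetical helper name, and it assumes the setting sits on its own line in the form KEY = value, as in the examples above. It keeps a .bak copy so the default can be restored afterwards.

```shell
#!/bin/sh
# set_cfg_value FILE KEY VALUE
# Hypothetical helper: rewrites an existing "KEY = value" line in FILE.
# The original file is preserved as FILE.bak.
# VALUE is assumed to be a plain number or word (no "/" or "&").
set_cfg_value() {
    file=$1 key=$2 value=$3
    cp -p "$file" "$file.bak" || return 1
    sed "s/^[[:space:]]*$key[[:space:]]*=.*/$key = $value/" "$file.bak" > "$file"
}

# Example (run as root on the XI server):
#   set_cfg_value /usr/local/nagios/etc/pnp/process_perfdata.cfg LOG_LEVEL 2
# and, once troubleshooting is finished:
#   set_cfg_value /usr/local/nagios/etc/pnp/process_perfdata.cfg LOG_LEVEL 0
```

Commented-out lines (starting with #) are left untouched, because the pattern only matches lines that begin with the key itself.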
 +
 +
'''Increase NPCD Logging Verbosity'''. 
 +
 +
Edit the file:
 +
 +
  /usr/local/nagios/etc/pnp/npcd.cfg
 +
 +
Change the Default value from:
 +
 +
  log_level = 0
 +
 +
To:
 +
 +
  log_level = -1
 +
 +
Save the file and restart NPCD:
 +
 +
  service npcd restart
 +
 +
NPCD should now log all errors and debug information to:
 +
 +
  /usr/local/nagios/var/npcd.log
 +
 +
Remember to return this value to its default setting when troubleshooting is completed.
 +
 +
===== Perfdata Timeout ===== 
 +
As many installations grow, the perfdata processing timeout value may need to be increased.  Check the perfdata log for any recent timeout errors:
 +
 +
  tail -50 /usr/local/nagios/var/perfdata.log | grep TIMEOUT
 +
 +
If the grep found any recent errors, change the TIMEOUT by editing the file:
 +
 +
  /usr/local/nagios/etc/pnp/process_perfdata.cfg
 +
 +
Change the default value from:
 +
 +
  TIMEOUT = 5
 +
 +
To:
 +
 +
  TIMEOUT = 20
 +
 +
As your installation grows further, this value may need to be increased even more.
 +
 +
===== NPCD Load Threshold =====
 +
Bulk NPCD processing has a load threshold setting that is intended to halt performance processing if the system is under heavy load.  Large installations will need this value increased and NPCD restarted.
 +
 +
Check the NPCD log for load warnings (if the log file does not exist, increase the log level, restart npcd, and wait 5 minutes before proceeding):
 +
 +
  tail -50 /usr/local/nagios/var/npcd.log | grep "MAX load reached"
 +
 +
If any recent errors are found, increase load threshold by editing the file:
 +
 +
  /usr/local/nagios/etc/pnp/npcd.cfg
 +
 +
Change:
 +
 +
  load_threshold = 10.0
 +
 +
To:
 +
 +
  load_threshold = 20.0
 +
 +
Save the file and restart NPCD:
 +
 +
  service npcd restart
 +
 +
For really large installations, or servers with minimal resources, you may need to increase the npcd load_threshold and perfdata TIMEOUT even more than is suggested above.
 +
 +
===== Unexpected Number of Datasources =====
 +
Nagios XI stores performance data in RRDs (Round Robin Databases).  These are binary files with a fixed number of "tracks" (datasources).  If a check is changed to return more datasources of performance data than the RRD was initially created for, those additional metrics will not be added to the RRD.
 +
 +
To verify if this is the case, check the perfdata.log file (you may have to increase logging verbosity):
 +
 +
  tail -50 /usr/local/nagios/var/perfdata.log | grep "ERROR" | grep "expected"
 +
 +
If the grep found any errors, the number of datasources returned for the particular check has changed since the RRD was created. 
 +
 +
The easiest resolution is to delete the RRD in question; a new one will be created correctly after the next few checks.  Be aware that deleting the RRD will result in the loss of historical performance data for the check.
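
Before deleting anything, it can help to see how many datasources an existing RRD actually holds. The sketch below assumes the standard ''rrdtool info'' output, which prints one ds[NAME].type line per datasource. ''count_ds'' is a hypothetical helper name, and the path in the usage comment is only an illustration.

```shell
#!/bin/sh
# count_ds: reads "rrdtool info" output on stdin and prints the number of
# datasources, counting one "ds[NAME].type" line per datasource.
count_ds() {
    grep -c '^ds\[[^]]*]\.type'
}

# Example (illustrative path; substitute the host and service in question):
#   rrdtool info /usr/local/nagios/share/perfdata/localhost/PING.rrd | count_ds
```

If the number printed is lower than the number of metrics the check now returns, the RRD was created for the older, smaller set of datasources.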
  
 
===== Performance Graphs Are Missing Or Not Displayed =====
 
Line 265: Line 599:
 
*'''Make sure you're using the latest version of Nagios XI'''.  Older releases may have issues that the solutions below will not resolve.  [http://library.nagios.com/library/products/nagiosxi/documentation/249-upgrading-nagios-xi Upgrading Nagios XI]
 
  
* '''We have a new patch available''' that should resolve most graph related issues.  It will be included in v2011R1.3 and later.  For earlier releases, download this [http://assets.nagios.com/downloads/nagiosxi/patches/perfdata.zip component], and install through the Admin->Manage Components page. 
 
  
* Reset Security Credentials <br />Select the ''Reset Security Credentials'' option in the Admin menu and click ''Update''
+
'''Verify That process_perfdata.pl has correct permissions''' Make sure that the file /usr/local/nagios/libexec/process_perfdata.pl has execute permissions and is owned by nagios:nagios. 
* Reset File Permissions<br />
+
 
 +
'''2011 R1.8 Fix''' Some XI installs of this release have incorrect permissions on the performance data directory.  This is a known bug and can be resolved by running the following command as the root user:
 +
  chmod -R +x /usr/local/nagios/share/perfdata/
 +
 
 +
 
 +
*'''1.6 and 1.7 RHEL/CentOS 6 Users'''. A problem with the repositories caused a component required for MRTG graphing not to be installed. The fix is simple: log into the CLI of your Nagios XI server as root, and type:
 +
    yum install bc
 +
That should fix the graphing issues. Note that this does not apply to versions of Nagios XI later than 1.8.
 +
 
 +
* '''Run the command manually'''. Try running the command that Nagios XI runs to check status of a device. For instance, when monitoring a router or switch, Nagios XI uses the check_rrdtraf plugin. Test running this plugin manually by navigating to your libexec directory and running a check, similar to the following:
 +
    ./check_rrdtraf -f '/var/lib/mrtg/192.168.6.1_1.rrd' -w 1 -c 2
 +
This should return something that looks like:
 +
    OK - Current BW in: 1.57Kbps Out: 365.41bps|in=1.573002Kb/s;1;2 out=365.413424b/s;1;2
 +
If the command returns errors, those errors are the cause of the problem. Resolve them, and Nagios XI can start graphing performance data.<br />
 +
 
 +
* '''Check perfdata directory permissions'''. Nagios XI needs to be able to write to its nagios/share/perfdata/ directory. Check the file permissions on that directory and its subdirectories. For example:
 +
    ll /usr/local/nagios/share/perfdata
 +
Should return something like this:
 +
    drwxrwxrwx 2 nagios nagios 4096 Oct 18 17:01 192.168.5.1
 +
    drwxrwxrwx 2 nagios nagios 4096 Oct 18 17:02 192.168.5.4
 +
    drwxrwxrwx 2 nagios nagios 4096 Oct 17 15:36 imap.fusemail.net
 +
    drwxrwxrwx 2 nagios nagios 4096 Oct 18 17:02 localhost
 +
 
 +
If those folders are not writable and readable by Nagios, then that is the problem and you should set write and read access for Nagios. Please note that all files contained in these folders also need to be writable and readable by nagios.<br />
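
For a large perfdata tree, reading the directory listing by eye is slow. As a rough sketch, the standard ''find'' test ''-user'' can list everything under the directory that is not owned by the expected account. ''list_not_owned_by'' is a hypothetical helper name, and this only checks ownership, not the permission bits.

```shell
#!/bin/sh
# list_not_owned_by DIR OWNER: print every file or directory under DIR
# that is not owned by OWNER (a user name or a numeric UID).
list_not_owned_by() {
    find "$1" ! -user "$2" -print
}

# Example:
#   list_not_owned_by /usr/local/nagios/share/perfdata nagios
```

No output means every entry is owned by the given user; any path that is printed is a candidate for the permission fixes described in this section.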
 +
 
 +
* '''Reset File Permissions'''<br />
 
Execute the following command to reset your configuration file permissions.
 
 
   /usr/local/nagiosxi/scripts/reset_config_perms
 
 
You can also check for permissions-related issues on the Admin->Check File Permissions page in the XI interface (v1.3g+).
 
 
* Make sure you have not removed or renamed the nagiosadmin user.  This user is the nagios equivalent to 'root user' and should never be removed.  <br />
 
* Some users reported that editing the following lines in their /usr/local/nagios/etc/nagios.cfg file fixed their graphing issues: <br />
+
 
  service_perfdata_file_processing_command=process-service-perfdata-file-bulk
+
  host_perfdata_file_processing_command=process-host-perfdata-file-bulk
+
Change To
+
  service_perfdata_file_processing_command=process-service-perfdata-file-pnp-bulk
+
  host_perfdata_file_processing_command=process-host-perfdata-file-pnp-bulk
+
  
 
*Make sure your password for Nagios XI only contains alphanumeric characters.  Some users have reported that special characters in the password create a permissions issue that makes graphs disappear.
 
 
*Performance graphs are pulled via an internal proxy, so users with their Nagios server behind their own proxy or using strict SSL settings may experience problems viewing graphs.  If you're using an environment with a proxy or SSL and are having issues viewing graphs, post the problem to our support forums and mention your use of a proxy or SSL right away.
 
+
* Having an internal DNS hostname that is not defined on the XI server can also cause problems with the internal proxy call.  If you've defined a custom DNS host entry for your XI server, make sure it's defined in your /etc/hosts file as well.  For further information on this, contact our support team at support.nagios.com/forum.
 +
 
 +
===== Network Performance Graphs Are Displayed But Have No Data =====
 +
 
 +
'''2011R3.2 and 3.3: graphs display but are empty'''. Try running the following commands to see whether an excessive number of performance data files has built up.
 +
 
 +
   cd /usr/local/nagios/var/spool/xidpe
 +
  ls -f | wc -l
 +
 
 +
If the file count is very large, run the following commands, which should restore regular performance graphing.
 +
 
 +
  cd /usr/local/nagios/var/spool
 +
  rm -rf xidpe
 +
  mkdir xidpe
 +
  chown nagios:nagios xidpe
 +
  chmod 755 xidpe
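
The count-then-decide step above can be wrapped in a small function. This is only a sketch: ''count_spool_files'' is a hypothetical helper name, and the threshold in the usage comment is an arbitrary assumption, since this FAQ does not define how many files count as "very large".

```shell
#!/bin/sh
# count_spool_files DIR: print the number of regular files directly inside DIR.
# Unlike "ls -f | wc -l", this does not count ".", ".." or subdirectories.
count_spool_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Example; the 10000 threshold is an assumption, not a documented limit:
#   n=$(count_spool_files /usr/local/nagios/var/spool/xidpe)
#   [ "$n" -gt 10000 ] && echo "xidpe backlog: $n files"
```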
 +
 
 +
'''Only Switch and Router Graphs display but have no data'''
 +
 
 +
This problem occurs when you are monitoring a switch or router and its bandwidth graphs are always zero even though you know they should have data. Be absolutely sure that the graphs should have data before proceeding.<br />
 +
 
 +
*'''Make sure the /var/lock/mrtg directory exists.''' This directory has been observed to disappear occasionally. Recreating it is trivial:
 +
    mkdir /var/lock/mrtg
 +
 
 +
*'''Make sure none of the mrtg.cfg entries are using SNMP v2c'''. Older versions of the Switch Wizard called mrtg with arguments for SNMPv2c, which MRTG does not use. Open up /etc/mrtg/mrtg.cfg and look for:
 +
    Target[www.hostaddress.com]: 1:SNMP_Community_String@www.hostaddress.com:::::1
 +
Notice the 1 after the run of colons; it represents the SNMP version MRTG will use to poll the device. If it is 2c instead, change it to 2 and save the file. This must be done for every entry that was created with 2c.
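
When many entries are affected, the change can be scripted. The sketch below assumes the version is the last field on each Target line, as in the example above. ''list_snmp_2c'' and ''fix_snmp_2c'' are hypothetical helper names. The original file is kept as a .bak copy.

```shell
#!/bin/sh
# list_snmp_2c FILE: print (with line numbers) the Target lines whose
# trailing SNMP version field is "2c".
list_snmp_2c() {
    grep -n '^Target\[.*:2c[[:space:]]*$' "$1"
}

# fix_snmp_2c FILE: rewrite the trailing "2c" to "2" on Target lines only,
# keeping the original as FILE.bak.
fix_snmp_2c() {
    cp -p "$1" "$1.bak" || return 1
    sed '/^Target\[/ s/:2c[[:space:]]*$/:2/' "$1.bak" > "$1"
}

# Example (run as root):
#   list_snmp_2c /etc/mrtg/mrtg.cfg
#   fix_snmp_2c /etc/mrtg/mrtg.cfg
```

Running the listing function first shows exactly which lines would change before the file is modified.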
 +
 
 
* Further reading<br />[http://go.nagios.com/forum/565 Forum Article.]
 
 +
 +
 +
===== Can I Migrate Performance Data From A Different Install? =====
 +
 +
RRD performance data files are architecture-dependent binary files, so a simple file transfer only works when the architecture matches on both machines.  If you want to migrate files from a 32-bit to a 64-bit machine, you'll have to dump the data to XML and restore it into RRDs on the new machine.  Forum user '''srrhd''' was kind enough to supply the commands used for a working migration:
 +
 +
On the old 32bit machine:
 +
 +
  cd /usr/local/nagios/share/perfdata/
 +
  for i in `find -name "*.rrd"`; do rrdtool dump $i > $i.xml; done
 +
  tar -cvzf perfdata.tar.gz */*.rrd.xml
 +
  for i in `find -name "*.rrd.xml"`; do rm -f $i; done
 +
 +
Then transfer the archive to the new server in the same directory.
 +
On the new x86_64 server:
 +
 +
  cd /usr/local/nagios/share/perfdata/
 +
  for i in `find -name "*.rrd"`; do rm -f $i; done
 +
  tar -xvzf perfdata.tar.gz
 +
  for i in `find -name "*.rrd.xml"`; do rrdtool restore $i `echo $i | sed 's/\.xml$//'`; done
 +
  for i in `find -name "*.rrd"`; do chown nagios:nagios $i; done
 +
  for i in `find -name "*.rrd.xml"`; do rm -f $i; done
  
 
==== Notification Problems ====
 
  
  
===== Nagios Admin Account Notifications Not Controlled Through XI =====
* The ''nagiosadmin'' user was set to use the ''generic_template'' contact template, which resulted in notifications not being controlled through the XI interface.<br />This can be corrected by changing the user's contact template to ''xi_generic_template'' in the Core Config Manager. This bug was corrected in 2009R1.2 and only affects systems that had/have previous versions installed.
+
===== Basic Troubleshooting Steps =====
+
'''1. Email Tests'''
 +
 
 +
Send a test email to see if the Nagios server can send email to an account by going to:
 +
 
 +
Configure->My Account Settings->Send Test Notifications.
 +
 
 +
Then, check to see if the test email arrives. If it doesn't arrive, the problem could be one of the following:
 +
 
 +
- The Nagios server cannot send mail outside of your network (if you are using Sendmail).
 +
 
 +
- Nagios may not be able to relay mail through your company server (if you are using SMTP).
 +
 
 +
- Outbound SMTP connections may be blocked by your border firewall.
 +
- Unauthenticated SMTP relaying may be denied somewhere downstream. Try switching email methods from Sendmail to SMTP in the admin section.
 +
 
 +
'''2. User's Notification Options'''
 +
 
 +
Check if Notifications are enabled globally - click on the "Monitoring Process" menu on the left from the Home page, and make sure you see a green dot next to the Notifications in the "Monitoring Engine Process" window. You can enable/disable Notifications by clicking on the "Action" button on the right hand side.
 +
 
 +
Check if Notifications are enabled for the user currently logged into Nagios XI - click on the username in the upper right corner next to "Logged in as: ...", then click on "Notification Preferences" under "Notification Options" from the left panel menu. Make sure that the "Enable Notifications" check-box is checked.
 +
 
 +
Review the selected Notification Types - the user will be notified only on host/service states that are selected.
 +
 
 +
From the same page, click on "Notification Methods" and make sure a Notification Method is selected.
 +
 
 +
'''3. Host/Service Notification Options'''
 +
 
 +
Check if Notifications are enabled for a particular host/service. If you are having issues with Notifications for a particular Host or Service, log into the Core Config Manager and click on "Hosts" or "Services" under "Monitoring" from the left panel menu. Find your Host or Service and click on the "Modify" Action button to the right. Click on "Alert Settings" tab and verify that the "on" radio button next to the "Notification Enabled" is selected.
 +
 
 +
Make sure that the Check Period under the "Check Settings" tab is equal to or larger than the Notification Period under the "Alert Settings" tab on the Host/Service Management page in the CCM. If Nagios is not checking a host or service during a specific time, it will not send notifications during that time.
 +
 
 +
Check the "Alert Settings" tab under the Host/Service Management page in the CCM for two things:
 +
 
 +
- Make sure "Notification enabled" is not set to "off".
 +
 
 +
- See which options are selected under "Notification options", because this will determine the states of hosts/services that you will be notified for.
 +
 
 +
''Note:'' If you are having issues with many hosts and services, you should check the templates you are using - "xiwizard_generic_host" and "xiwizard_generic_service" should be the first ones to be checked. Any changes you make in these templates will affect all hosts and services that reference them. You can override this by modifying the host or service configuration itself. If you need to know more on the topic, please read the full explanation of Nagios object inheritance here:
 +
[http://nagios.sourceforge.net/docs/3_0/objectinheritance.html http://nagios.sourceforge.net/docs/3_0/objectinheritance.html]
 +
 
 +
'''4. Contacts'''
 +
 
 +
The contact must be either directly associated with the host or service or be part of a contactgroup that is connected to the host or service.
 +
 
 +
Make sure users and contacts that were added within Nagios XI are set up with the proper notification handlers:
 +
 
 +
* If you are using Users, which are also Contacts (you've added a Contact to them):
 +
 
 +
''xi_host_notification_handler and xi_service_notification_handler''
 +
 
 +
* If you are using Contacts only:
 +
 
 +
''notify-host-by-email and notify-service-by-email''
 +
 
 +
Contacts and users are similar but not the same - read more about it here: http://assets.nagios.com/downloads/nagiosxi/docs/XI_Users_And_Contacts.pdf
 +
 
 +
If you are not receiving notifications, it is also possible that the nagiosadmin user was set to use the generic_template contact template, which resulted in notifications not being controlled through the XI interface.
 +
This can be corrected by changing the user's contact template to xi_generic_template in the Core Config Manager. This bug was corrected in 2009R1.2 and only affects systems that had/have previous versions installed.
 +
 
 +
'''5. Contact Timeperiods'''
 +
 
 +
Each contact has a timeperiod management option that determines when they get notifications. Closely review whether there are any time exclusions set within the contact's timeperiod. These are times when the user will not be sent notifications.
 +
 
 +
'''6. Acknowledgements and Scheduled Downtime'''
 +
 
 +
If the problem has been acknowledged or the host/service is in downtime, alerts won't be sent.
 +
 
 +
'''7. Testing From Host or Service (Sending Custom Notification)'''
 +
 
 +
If you proceed to the host or service in question on the Nagios server and then select the Advanced tab, you can send a test email (custom notification) from the specific host or service that you are testing.
 +
 
 +
'''8. Tracking Notifications'''
 +
 
 +
If you go to Home->Incident Management->Notifications you should see that Nagios is sending notifications based on the settings you have chosen and to the appropriate contacts. This tool helps you determine whether Nagios intends to notify the appropriate contact.
 +
 
 +
===== Test Emails Fail, "Invalid address" Error =====
 +
We identified a bug in 1.9 and some earlier versions where test emails to addresses like "root@localhost" or "user@xiserver" fail to send because they fail email address validation.  The email address needs to have a domain at the end of it to pass validation and send.  The browser may falsely display a success message for users testing from their "Send Test Notification" page, while it shows an error message if a user runs the test from the Admin->Manage Email Settings->Send A Test Email page.  This bug will be fixed in R1.10. As a workaround in the meantime, make sure the Nagios XI Sending Address on the Admin->Manage Email Settings page is set to an email address with an FQDN. The address listed below will also work:
 +
  Nagios XI <root@localhost.localdomain>
 +
 
 +
Make sure initial setup for the Admin->Manage Email Settings page has been done and that you've pressed '''Update''' on the email settings.  
  
+
This bug can be identified by a debug message at the top of the test email page that says "Invalid address:".
+
This bug is specific to installations using PHP 5.2 or later.

===== Email Notifications Are Not Going Out =====
This can happen for a variety of reasons:
* The ''nagiosadmin'' user is set to use the ''generic_template'' contact template.<br />This should be ''xi_generic_template'', and can be modified by using the Core Config Manager.  This bug was corrected in 2009R1.2 and only affects systems that had/have previous versions installed.
* Outbound SMTP connections may be blocked by your border firewall
* Unauthenticated SMTP relaying may be denied somewhere downstream - try switching email methods from ''Sendmail'' to ''SMTP'' in the admin section
  
 
==== XI Display Problems ====
 
Line 384: Line 862:
 
Additional Documentation:
 
 
[[http://library.nagios.com/library/products/nagiosxi/documentation/460-enabling-nrpe-with-nsclient Enabling NRPE with NSClient]]
 
 +
===== Windows Event Log Check =====
 +
 +
WMI can be used to gather information from the Windows Event Log.  Here are some example command definitions for use with check_wmi_plus.
 +
 +
Check Windows event log system for errors in the last 4 hours. Warn on 1 occurrence, critical if 6 or more.
 +
  check_xi_service_wmiplus!administrator!password!checkeventlog!-a system -o 2 -3 4 -w 1 -c 6
 +
 +
Check Windows event log application for errors in last 1 hour.  Warn on 3 occurrences, critical on 6 or more.
 +
  check_xi_service_wmiplus!administrator!password!checkeventlog!-a application -o 2 -3 1 -w 3 -c 6
 +
 +
===== Linux Cached Memory Not Added to Free Memory =====
 +
 +
It is normal for Linux to "borrow" unused memory for disk caching. This may however create false "Warning" or "Critical" alerts, even though you are NOT low on memory. In order to fix this, we have modified the "custom_check_mem" script, part of our Linux agent install script by adding an optional flag [-n|--nocache]. Basically, cached memory is added to the free memory when you use the "-n" flag.
 +
  Usage: custom_check_mem [-w|--warning]<percent free> [-c|--critical]<percent free> [-n|--nocache]
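
To illustrate what "adding cached memory to free memory" means numerically, the sketch below computes the percentage from /proc/meminfo-style input. It is not the actual ''custom_check_mem'' implementation, whose source is not shown here. ''pct_free_with_cache'' is a hypothetical name, and only the MemTotal, MemFree and Cached fields are assumed.

```shell
#!/bin/sh
# pct_free_with_cache: reads /proc/meminfo-style text on stdin and prints the
# percentage of memory that is free once cached memory is counted as free.
pct_free_with_cache() {
    awk '/^MemTotal:/ {total = $2}
         /^MemFree:/  {free = $2}
         /^Cached:/   {cached = $2}
         END { if (total > 0) printf "%d\n", (free + cached) * 100 / total }'
}

# Example:
#   pct_free_with_cache < /proc/meminfo
```

On a host with 1000 kB total, 100 kB free and 400 kB cached, plain free memory is 10%, which would trip a "-w 20" warning, while free plus cached is 50%, which would not.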
 +
If you are downloading a new copy of our Linux agent, the updated "custom_check_mem" will be included.
 +
If you already installed the Linux agent, you can just download the updated "custom_check_mem" from [http://assets.nagios.com/downloads/nagiosxi/scripts/custom_check_mem here].
 +
 +
Copy the new script over the old "custom_check_mem".
 +
 +
Go to the Core Config Manager->Monitoring->Services->Memory Usage->Modify and under the "Common Settings" tab modify the $ARG2$ field by adding a "-n" flag.
 +
 +
For example, if you had:
 +
  -a '-w 20 -c 10'
 +
change it to:
 +
  -a '-w 20 -c 10 -n'
 +
Click on "Save" and "Apply Configuration".
 +
 +
''Note: One gotcha - make sure the "custom_check_mem" has Unix EOL before you copy it over.''
  
 
==== Other Issues ====
 
 +
 +
===== Nagios did not exit in a timely manner =====
 +
Use this when Nagios doesn't appear to be exiting cleanly. If the run file, lock file, or temp check files are getting left behind, try making this modification around line 150 of /etc/init.d/nagios.  (The modification increases the for loop from 10 seconds to 30 seconds.) This gives the Nagios daemon more time to cleanly shut down all of its processes and clean up after itself.
 +
 +
 +
  # now we have to wait for nagios to exit and remove its
 +
  # own NagiosRunFile, otherwise a following "start" could
 +
  # happen, and then the exiting nagios will remove the
 +
  # new NagiosRunFile, allowing multiple nagios daemons
 +
  # to (sooner or later) run - John Sellens
 +
  #echo -n 'Waiting for nagios to exit .'
 +
  for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30; do
 +
      if status_nagios > /dev/null; then
 +
          echo -n '.'
 +
          sleep 1
 +
      else
 +
          break
 +
      fi
 +
  done
 +
 +
 +
 +
 +
===== Upgrade to 2011R3.x Issues =====
 +
If you're experiencing any of the following issues after an upgrade from Nagios XI 2011r2.x to 3.x:
 +
* Missing hosts or services or status data
 +
* Takes a VERY long time to Apply Configuration or restart the Nagios process
 +
* Unusually high CPU load
 +
* A flood of messages in the /var/log/messages related to ndo2db
 +
 +
Then you may need to manually set a few kernel settings on your system.  In Nagios XI 2011r3.x+ the Ndoutils subcomponent now uses asynchronous writes to log status information to the database, and these messages are sent to the Linux kernel's message queue.  Our upgrade scripts will tune the kernel settings automatically as of 2011r3.2, but in the event that you see the above symptoms on your system, we recommend applying the following settings to your system.
 +
 +
* Open '''/etc/sysctl.conf''' with a text editor. Edit the file to match the following values:
 +
 +
  # Controls the default maximum size of a message queue, in bytes
 +
  kernel.msgmnb = 131072000
 +
 
 +
  # Controls the maximum size of a message, in bytes
 +
  kernel.msgmax = 131072000
 +
 
 +
  # Controls the maximum shared segment size, in bytes
 +
  kernel.shmmax = 4294967295
 +
 
 +
  # Controls the total amount of shared memory, in pages
 +
  kernel.shmall = 268435456
 +
 +
  # Controls the maximum number of message queue identifiers (queues) system-wide
 +
  kernel.msgmni = 256000
 +
 +
 +
 +
''Note: If you don't have these entries in the "/etc/sysctl.conf" file, just add them to the end of the file.''
 +
 +
* After these settings are saved to the file, run:
 +
 +
  sysctl -p
 +
 +
This applies the new settings.  If the system still appears to be working improperly, reboot the machine.
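
To confirm that the file and the running kernel agree, the values can be compared side by side. The sketch below is an illustration only. ''conf_value'' is a hypothetical helper name, and it assumes each setting is written as key = value on its own line.

```shell
#!/bin/sh
# conf_value FILE KEY: print the last value assigned to KEY in a
# sysctl.conf-style file (the last assignment is the one that takes effect).
# The dots in KEY act as regex wildcards, which is harmless for these key names.
conf_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# Example: compare the file with the running kernel.
#   for k in kernel.msgmnb kernel.msgmax kernel.shmmax kernel.shmall kernel.msgmni; do
#       echo "$k file=$(conf_value /etc/sysctl.conf "$k") live=$(sysctl -n "$k")"
#   done
```

If the two columns differ after running sysctl -p, the file probably contains a typo in the key name or a later duplicate entry.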
 +
 
===== Login Screen Keeps Redirecting To Itself =====
 
 
The web browser keeps redirecting to the login screen even after entering login credentials.  This has been noticed in Internet Explorer.
 
  
 
Nagios XI uses cookies to save session state.  These cookies are set to expire after 30 minutes.  If the time on the Nagios XI server is incorrect, the cookies returned to the client's browser might appear to be expired due to the time difference between the client's computer and the Nagios XI server.  Solution: Fix the time on the Nagios XI server to ensure it is correct.
 
 
  
 
===== Check Services Being Orphaned =====
 
Line 404: Line 971:
  
 
Related forum post can be read [http://support.nagios.com/forum/viewtopic.php?f=16&t=907 here.]
 
 +
 +
 +
 +
If the issue continues to persist after reboots and restarts of the Nagios service, then the issue is most likely caused by either a memory leak in embedded perl, or system ulimit restrictions.  Symptoms can include the /tmp directory filling up quickly with check* files, and the following errors in the nagios log.
 +
 +
  [1331905537] Warning: The check of service 'SERVICE' on host 'NAMESERVER' looks like it was
 +
  orphaned (results never came back). I'm scheduling an immediate check of the service...
 +
  [1331755699] Warning: The check of service 'SWAP' on host 'nameserver' could not be performed
 +
  due to a fork() error: 'Resource temporarily unavailable'. The check will be rescheduled.
 +
 +
Try the following solutions:
 +
 +
Edit /etc/security/limits.conf
 +
 +
  * hard memlock 128    #locked memory
 +
  * soft memlock 128
 +
 +
  * soft nofile 4096      #open files
 +
  * hard nofile 4096
 +
 +
  * hard nproc 4096    #max user processes
 +
  * soft nproc 4096
 +
 +
  * hard stack 20480    #stack size
 +
  * soft stack 20480
 +
 +
and restart the server.  Run
 +
 +
  ulimit -a
 +
 +
to verify that the new settings are in place.
 +
 +
 +
Also update the settings in your nagios.cfg file to match the following:
 +
 +
  enable_embedded_perl=0
 +
  use_embedded_perl_implicitly=0
 +
 +
===== Postgresql: Postmaster CPU Is High or "Transaction wraparound limit" in log =====
 +
 +
Although Nagios XI performs routine database maintenance on the postgres data tables, if you notice either a high CPU usage for the postmaster process, or a repeated error message in the '''/var/lib/pgsql/data/pg_log''' file that says "transaction ID wrap limit is 2147484146", then you may need to perform a manual VACUUM of the postgres databases.  Run the following commands from the command line:
 +
 +
  psql nagiosxi nagiosxi
 +
  VACUUM;
 +
  VACUUM ANALYZE;
 +
  VACUUM FULL;
 +
  \q
 +
 +
You will see messages like the following when running the above commands:
 +
  WARNING:  skipping "pg_authid" --- only table or database owner can vacuum it
 +
This is normal.  You may need to run the above commands more than once if the CPU usage from postmaster is extremely high. 
 +
 +
Next, vacuum the tables as the postgres user.
 +
 +
  psql postgres postgres
 +
  VACUUM;
 +
  VACUUM ANALYZE;
 +
  VACUUM FULL;
 +
  \q
  
 
==== XI Component/Addon Problems ====
 
Line 460: Line 1,086:
 
Run our repair script for mysql tables.
  
   /usr/local/nagiosxi/scripts/repairmysql.sh nagios *
+
   /usr/local/nagiosxi/scripts/repairmysql.sh nagios
  
 
Unzip and copy the following [http://assets.nagios.com/downloads/nagiosxi/patches/dbmaint.zip dbmaint] file to /usr/local/nagiosxi/cron/.
Line 498: Line 1,124:
  
 
If problems continue to persist, contact our support team at our [http://support.nagios.com/forum support forums].
 
 +
 +
==== Bandwidth Usage for Offloaded MySQL  ====
 +
We don't have official benchmark documentation on bandwidth usage for a Nagios server, but the following figures were recorded and submitted by a user for network traffic between a Nagios XI server and an offloaded MySQL server.  Thanks Stephen Wallace for contributing this!
 +
 +
* 500 hosts, 10 services each at a 5-minute interval (5500 checks)
 +
* Breaks down to around 18 checks per second
 +
* Produces around 3MB of network traffic daily between Nagios and MySQL
 +
 +
==== "Still have questions?"  ====
 +
If you haven't found an answer to your question, you can check the Nagios XI Manuals:
 +
 +
[http://assets.nagios.com/downloads/nagiosxi/guides/user/ Nagios XI User Guide]
 +
 +
[http://assets.nagios.com/downloads/nagiosxi/guides/administrator/ Nagios XI Administrator Guide]

Latest revision as of 10:50, 24 March 2015

Back To Nagios XI Overview

Answers to Frequently Asked Questions (FAQs) regarding Nagios XI can be found here.



FAQs

What Are FAQs? Frequently Asked Questions, or "FAQs", are answers to questions that are frequently asked in some context.


Common Problems - Try These Solutions First

Follow these steps if you are encountering problems with Nagios XI. These actions solve many commonly asked questions.

  • Clear your browser's cache to get the newest XI javascript code.
    Instructions on how to do it.
  • How To Reset Security Credentials (if performance graphs aren't displayed)
    Select the Reset Security Credentials option in the Admin section and click Update.
  • How To Reset File Permissions (if configuration changes are not taking effect)
    Instructions on how to do it.
  • Debugging Configuration Change Problems (if configuration changes are not taking effect)
    Write configuration file tool.

Hardware Requirements

Check out our general guidelines on the hardware requirements needed to run Nagios XI:

Nagios XI - Hardware Requirements

Licensing

Every Nagios XI license key is valid for 3 installs, each with its own specific purpose. Each install is necessary to properly manage and maintain a fully functional monitoring implementation. The three installs are described below:

  1. Production Install - The main monitoring install for a given license key. This is the install that system administrators use on their production servers and infrastructure to monitor their environment and receive notifications when systems are not working properly.
  2. Test/Lab Environment - The second install is for use in a test environment. This ensures that when upgrades are necessary, or major configuration changes are implemented, there are no adverse effects on the main monitoring system. The test install allows teams to “preview” their changes without jeopardizing the main system.
  3. Backup Install - The final installation use case for a given license key is for use as a backup/failover of the Nagios XI Production install. This allows for a high-availability system to be set up, or can be used as a complete backup of the production install.

These use cases, when implemented correctly, provide organizations with an infrastructure monitoring system capable of handling any environment. If you have any questions about licensing terms for Nagios XI, or any additional questions regarding Nagios Solutions, contact us at sales@nagios.com.

Note: Deviation from the above use cases is a violation of Nagios license terms and conditions. For more information, contact sales@nagios.com.


Supported Distributions

Nagios XI is currently supported on the following Linux distributions for both 32-bit and 64-bit installations:

  • CentOS 5/6/7
  • RHEL 5/6/7

Installation Prerequisites

Important: Nagios Enterprises highly recommends and will only support installing Nagios XI on a newly installed, “clean” system (a bare minimal install with nothing else installed or configured).

Attempting to install Nagios XI on a pre-existing system with other applications already installed can cause the Nagios XI installation process to fail, critical system components and settings (e.g. database servers) to be modified in a way that negatively affects other applications, and previously installed applications to be automatically upgraded or removed. While installing XI on a system with other applications is possible, it is not recommended due to the possible interactions and complexity of multiple components that are required for Nagios XI to function. If you choose to ignore these warnings, you do so at your own risk.

Internet access is required for installation and upgrades!

Capabilities

Is Nagios XI capable of Distributed Monitoring?

Yes it is! There are multiple options for Distributed Monitoring with Nagios.

Nagios Fusion

Nagios Core (the underlying monitoring engine) can be configured for distributed monitoring. For more information, read the Nagios Core documentation on distributed monitoring.

Integrating mod_gearman with Nagios XI

Using DNX With Nagios

Is it possible to use SMS alerts for a custom SMS gateway?

Yes! Nagios XI sends SMS alerts via email. As of XI 2012, custom SMS gateways can be configured through Admin --> Manage Mobile Carriers.

Pre-2012 users can define a contact with an email address that will send the SMS message instead. Email address examples are as follows:

 <phonenumber>@smsgateway.domain
 1235551234@messaging.sprintpcs.com (send SMS via sprint)
 1235551234@tmomail.net (send SMS via t-mobile)
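
As a minimal sketch, the address format above can be assembled in the shell. The helper name sms_gateway_address is hypothetical and is not part of Nagios XI; it only strips non-digit characters from the phone number and appends the carrier's gateway domain.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Nagios XI): build an email-to-SMS
# address of the form <phonenumber>@smsgateway.domain.
sms_gateway_address() {
    local phone="$1" gateway="$2"
    # Keep digits only, so "123-555-1234" becomes "1235551234".
    phone=$(printf '%s' "$phone" | tr -cd '0-9')
    printf '%s@%s\n' "$phone" "$gateway"
}

sms_gateway_address "123-555-1234" "tmomail.net"
```

The resulting address is what you would enter as the contact's email address.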

System Configuration Problems

Resetting The nagiosadmin Password

To reset the nagiosadmin password, run the following from the command line:

 /usr/local/nagiosxi/scripts/reset_nagiosadmin_password.php --password=<newpassword>

Note: If you would like to use special characters in your password, escape them with "\". For example, to set your new password to "$newpassword#", run:

 /usr/local/nagiosxi/scripts/reset_nagiosadmin_password.php --password=\$newpassword\#
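
If escaping by hand is error-prone, one alternative is to let bash do it. The sketch below assumes a bash shell: single quotes keep the password literal, and the printf %q builtin prints an escaped form that bash reads back as the same string. The command is only printed, not executed, so it can be reviewed first.

```shell
#!/usr/bin/env bash
# Single quotes keep the shell from expanding "$" or treating "#" specially.
new_password='$newpassword#'

# printf %q (bash builtin) prints the string with shell escaping applied.
quoted_password=$(printf '%q' "$new_password")

# Print the command instead of running it, so it can be reviewed first.
reset_script=/usr/local/nagiosxi/scripts/reset_nagiosadmin_password.php
echo "$reset_script --password=$quoted_password"
```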

Problems Using Nagios XI With Proxies

We do not officially support Nagios XI when you install and use proxy software that restricts traffic to or from the Nagios XI server. There are several reasons for this. First, Nagios XI requires external access for package installation and updates. Package installation and updates may not work when proxies are used. Additionally, the Nagios XI code makes several internal HTTP calls to the local Nagios XI server to import configuration data, apply configuration changes, process AJAX requests, etc. These functions may not work properly when you deploy a proxy, which would result in a non-functional Nagios XI installation.

Two things must be configured for the XI installation to work with a proxy: the yum and wget configurations. Configure both before starting the installation process.

In /etc/yum.conf :

 proxy=http://someproxyserver:port/ # Shouldn't need to be quoted, remember the trailing slash
 proxy_username=myname  # The username you authenticate to your proxy with, if applicable
 proxy_password=mypass  # The password you provide to your proxy, if applicable

In /etc/wgetrc :

 http_proxy=http://myname:mypass@someproxyserver:port/ # All in one string this time
 no_proxy=localhost,127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 # Hosts to exclude from proxying

If you are using an https proxy:

 https_proxy=https://myname:mypass@someproxyserver:port/ 

Quoting is not needed (or helpful) in any of these, but if you have special characters in passwords (especially : or @) and are having problems you probably need to escape them with backslashes.
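
The following sketch prints the yum and wget lines described above so they can be reviewed before being added to /etc/yum.conf and /etc/wgetrc by hand. The function names and the proxy.example.com host, port, and credentials are placeholders, not values from a real installation.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: print the proxy settings for review.
# Nothing is written to /etc/yum.conf or /etc/wgetrc.
yum_proxy_lines() {
    local host="$1" port="$2" user="$3" pass="$4"
    printf 'proxy=http://%s:%s/\n' "$host" "$port"
    # Credentials are optional; emit them only when a username is given.
    if [ -n "$user" ]; then
        printf 'proxy_username=%s\n' "$user"
        printf 'proxy_password=%s\n' "$pass"
    fi
}

wget_proxy_lines() {
    local host="$1" port="$2" user="$3" pass="$4"
    local auth=""
    if [ -n "$user" ]; then
        auth="${user}:${pass}@"
    fi
    printf 'http_proxy=http://%s%s:%s/\n' "$auth" "$host" "$port"
    # Extend this list with the hosts you want excluded from proxying.
    printf 'no_proxy=localhost\n'
}

yum_proxy_lines proxy.example.com 3128 myname mypass
wget_proxy_lines proxy.example.com 3128 myname mypass
```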

Here is a proxy install solution reported by forum user TSCAdmin:

1. Before running any installation script install php-pear package manually

2. Set proxy for PHP Pear

 pear config-set http_proxy 'http://example.com:8080'

3. Run Nagios installation scripts sequentially

4. Unset system proxy before running E-importnagiosql script


Update Check Behind a Proxy

Update checks are known to fail for systems behind a proxy. We created a proxy component that should allow the update check to work behind most proxies. Install this component from the Admin->Manage Components page and then access the Admin->Proxy Configuration page to configure the proxy settings. [Proxy Component]

Resolving Issues with the XI 2014 Upgrade

"CONFIG ERROR!" During 2014 Upgrade

The most common error experienced during the XI 2014 upgrade process is the following core config error:

 CONFIG ERROR! Restart aborted. Check your Nagios configuration.

XI 2014 introduced some new mechanisms to guard against and remove the dreaded "ghost config" errors, as well as some issues pertaining to escalations/dependencies. Because of these changes, you may receive the above error during the upgrade, immediately after the installation of Nagios Core 4.

The most common resolution requires fixing the config errors in the CCM, writing and verifying the config, and then re-running the upgrade script. Enumerated steps are below:

  1. Run ./upgrade until the error occurs. Do not roll back the VM or installation. XI will now be half-upgraded and the config errors will have to be resolved before the upgrade can continue.
  2. In XI, browse to the CCM: Configure --> Core Config Manager --> Tools --> Write Config Files.
  3. Click "Write" and then "Verify". You should receive at least one error. The text of the error should be fairly descriptive concerning which object is having issues and what those issues potentially are. If you do not see any descriptive errors, you may have issues with escalations or service/host dependencies. You will most likely want to de-activate these definitions until the upgrade is complete.
  4. Resolve the error in the Core Config Manager (CCM).
  5. Once the detected errors are resolved, re-run the "Write" and "Verify" process from the "Write Config Files" tool. Resolve any further errors in the CCM, repeating the process above as many times as necessary until all config errors are resolved.
  6. Only when the "Verify" process completes without an error should you proceed.
  7. Click "Apply Configuration" - it should complete without error at this point.
  8. Now, return to the shell and re-run ./upgrade. The upgrade process should continue past the core 4 upgrade and nagios process restart.
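
Once the files have been written, the verification can also be repeated from the shell with Nagios Core's built-in check, nagios -v. The sketch below is a hypothetical wrapper; the binary and main config paths are assumed defaults and may differ on your system.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around Nagios Core's built-in config check (nagios -v).
# The default paths below are assumptions; adjust them if your layout differs.
verify_nagios_config() {
    local nagios_bin="${1:-/usr/local/nagios/bin/nagios}"
    local nagios_cfg="${2:-/usr/local/nagios/etc/nagios.cfg}"
    if [ ! -x "$nagios_bin" ]; then
        echo "nagios binary not found at $nagios_bin" >&2
        return 2
    fi
    # Exit status is non-zero when the configuration contains errors.
    "$nagios_bin" -v "$nagios_cfg"
}

# Example: verify_nagios_config && echo "config verified"
```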

ICMP and Ping Checks Stopped Graphing After XI 2014 Upgrade

Due to issues with CentOS/RHEL 5/6, rrdtool, and the performance graphs, rrdtool may cease to record performance data to RRDs from check_icmp. This is caused by the addition of new performance datasources returned from the check_icmp plugin in newer versions of nagios-plugins. Usually, rrdtool will just drop those extra datasources, but this is currently not working on CentOS/RHEL 5/6 under certain circumstances.

We provide a script to search for, and subsequently add, the missing datasources to the RRDs in question. For those upgrading to 2014, this script will essentially double the size of all ping/icmp RRDs. Please verify that your XI server has ample free space before running the script.

You should back up your XI server, either through a VM snapshot or a full XI Backup. The script does provide a way to make backups of your RRDs, but it is better to perform the backup through one of the two actions mentioned above.

The script can be downloaded from: http://assets.nagios.com/downloads/nagiosxi/scripts/rrd_ds_fix.zip

The script requires the perl library RRD::Simple.

The full steps are below:

 yum install perl-RRD-Simple -y
 cd /tmp
 wget http://assets.nagios.com/downloads/nagiosxi/scripts/rrd_ds_fix.zip
 unzip rrd_ds_fix.zip

To run the script with RRD backups:

 ./fix_ds_quantity.sh -d /usr/local/nagios/share/perfdata/

To run the script without RRD backups (if you have performed one of the suggested backup options above):

 ./fix_ds_quantity.sh -i -d /usr/local/nagios/share/perfdata/

This process may take a considerable amount of time, depending on how many RRDs need to be updated. The script logs to /tmp/fix_rrd_ds.log. Once completed, it may take 5-10 minutes for the new datasources to appear in the performance graphs tab (longer if rrdcached is used).
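
Because the fix can roughly double the size of the ping/icmp RRDs, a quick free-space check before running it may be worthwhile. The sketch below is a hypothetical pre-check, not part of the fix script. It measures the whole perfdata directory, which overestimates the space needed, since only the ping/icmp RRDs grow.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check: compare space used by the perfdata directory with
# the free space on its filesystem before running fix_ds_quantity.sh.
perfdata_dir="${1:-/usr/local/nagios/share/perfdata}"

# Succeeds when free space (KB) exceeds the space currently used (KB).
has_room_to_double() {
    local used_kb="$1" free_kb="$2"
    [ "$free_kb" -gt "$used_kb" ]
}

if [ -d "$perfdata_dir" ]; then
    used_kb=$(du -sk "$perfdata_dir" | awk '{print $1}')
    free_kb=$(df -Pk "$perfdata_dir" | awk 'NR==2 {print $4}')
    if has_room_to_double "$used_kb" "$free_kb"; then
        echo "OK: ${free_kb} KB free, ${used_kb} KB used by perfdata"
    else
        echo "WARNING: only ${free_kb} KB free for ${used_kb} KB of perfdata"
    fi
else
    echo "perfdata directory not found: $perfdata_dir"
fi
```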

Performance Graphing Stops After Upgrade to XI 2014r1.0

This issue was caused by an extraneous newline "\n" returned at the end of performance data. It was an issue specific to Nagios Core 4.x, and has been fixed in Core 4.0.6. XI users can fix this behavior by updating to XI 2014r1.1.

If you are running XI 2014r1.0, you can verify this behavior by checking the problematic performance data for the object in XI (Home --> Details --> Advanced Tab --> Performance Data Field) for an extraneous newline "\n" at the end of the performance data string.

Issues with mod_gearman and Performance Data Newlines: "\n"

If you have been using Mod Gearman and have upgraded to Nagios XI 2014 / Core 4, or plan on using Mod Gearman on Nagios XI 2014 / Core 4, you will need to use a different installation script than the one currently posted in our Mod Gearman Integration documentation. To begin, follow all steps outlined at:

http://assets.nagios.com/downloads/nagiosxi/docs/Integrating_Mod_Gearman_with_Nagios_XI.pdf

When the time comes to run the installation script, download and use the following one instead:

http://assets.nagios.com/downloads/nagiosxi/scripts/ModGearmanFullinstallVersionCore4.sh

Keep in mind that the current iteration of Mod Gearman that works with XI 2014 / Core 4 does not work with 32-bit distributions; it will only work properly on a server running a 64-bit architecture.

You will also need to modify two of the commands that Nagios XI uses to process performance data returned from your plugins when they are run. This removes an extra newline character that gets appended to the check results, which otherwise results in no performance data being graphed in the XI interface.

Change the process-host-perfdata-file-bulk and process-service-perfdata-file-bulk commands to the following, respectively:

 sed -i 's/\\n//g' /usr/local/nagios/var/host-perfdata &&
 /bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/$TIMET$.perfdata.host

And:

 sed -i 's/\\n//g' /usr/local/nagios/var/service-perfdata &&
 /bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/$TIMET$.perfdata.service
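
To see what the sed expression in these commands does when run by the shell, the sketch below applies it to a string instead of a file. It deletes every literal backslash-n pair (two characters) and leaves real line breaks alone. The function name and the sample perfdata string are made up for illustration.

```shell
#!/usr/bin/env bash
# Same sed expression as above, applied to a string instead of a file.
# It removes the two-character sequence backslash + n, not real line breaks.
strip_literal_newlines() {
    printf '%s\n' "$1" | sed 's/\\n//g'
}

# Hypothetical perfdata sample with a stray "\n" at the end.
strip_literal_newlines 'rta=0.512ms;100;500;0 pl=0%;20;60;0\n'
```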

Save, and apply configuration. This work-around should be fairly temporary until we get a more permanent fix in place, but for the time being you will need to follow these steps to properly integrate Mod Gearman alongside XI 2014 / Core 4.

Core 4 Load Spikes on 1.75 and 7 Hour Intervals

With the release of Nagios XI 2014, the Core version on the back end was updated to Core 4. This introduced an issue in certain environments where an extremely high system load can occur at intervals, most commonly between one and seven hours after the Nagios process starts. As a temporary solution, if you have been experiencing this problem, we recommend that you modify:

 /usr/local/nagiosxi/html/config.inc.php

By changing the following line:

"nom_checkpoint_interval" => 1440, // time (in minutes) between nom checkpoints

To:

"nom_checkpoint_interval" => 90, // time (in minutes) between nom checkpoints

You may want to alter the above interval based on when you are experiencing these problems. Ideally it should be set to occur as close to the high load anomaly as possible, so as to minimize system downtime and stress while we work towards a more permanent solution. Each checkpoint forces the creation of a snapshot, so you may want to archive any important config snapshots, as these changes will increase the number of daily snapshots (possibly pushing needed snapshots from the pool).
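
The edit can also be scripted. The sketch below uses a hypothetical helper that rewrites the interval with sed and keeps a .bak copy. It is demonstrated on a scratch file, and you would point it at /usr/local/nagiosxi/html/config.inc.php only after checking the result.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the nom_checkpoint_interval value in a file.
# A backup with a .bak suffix is kept next to the edited file.
set_nom_checkpoint_interval() {
    local file="$1" minutes="$2"
    sed -i.bak \
        "s/\(\"nom_checkpoint_interval\"[[:space:]]*=>[[:space:]]*\)[0-9][0-9]*/\1${minutes}/" \
        "$file"
}

# Demonstrate on a scratch file rather than the live configuration.
scratch=$(mktemp)
echo '"nom_checkpoint_interval" => 1440, // time (in minutes) between nom checkpoints' > "$scratch"
set_nom_checkpoint_interval "$scratch" 90
cat "$scratch"
rm -f "$scratch" "$scratch.bak"
```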

Installation and Upgrade Problems

CentOS 6 Installation Problems

Between the release of Nagios XI 2011R1.7 and 1.8, several changes were made to the CentOS 6 repo that created package conflicts, preventing the Nagios XI installation scripts from completing successfully. This usually becomes apparent by the "fullinstall" script failing with one of the following two messages:


 ERROR: Prerequisite program 'mysql' not found!
 which: no mysqladmin in (/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:   /usr/sbin:/sbin:/home/admin/bin)
 ERROR: Prerequisite program 'mysqladmin' not found!
 7 prerequisite(s) missing - exiting.

OR

 ls: cannot access /usr/local/nagiosxi/nom/checkpoints/nagioscore/*.gz: No such file or directory
 NO NOM SNAPSHOT FOUND!
 ERROR: NagiosQL import appears to have failed - exiting.  (Reason: Import files are still present in /usr/local/nagios/etc/import)

This issue can be resolved with the following solution if you're attempting to install with the 1.7 tarball. This problem will be resolved in the 2011R1.8 release of Nagios XI.

Before attempting any more installations, run:

 yum install centos-release-cr

Then remove any previous /tmp/nagiosxi directory that is in place, and unpack a fresh tarball:

 cd /tmp
 rm -rf nagiosxi
 tar zxf xi-2011r1.7.tar.gz
 cd nagiosxi
 ./fullinstall

If the installer still fails, contact XI support and attach the install.log file that's generated by the fullinstall script.


SourceGuardian Errors

After upgrading to 2009R1.2C, some users started getting an error about SourceGuardian. Add this line to your /etc/php.ini file:

 extension=ixed.5.1.lin

Once you make that change, restart Apache:

 service httpd restart

Resolving "Cannot connect to database" Error - Core Config Manager

If you're able to access the Nagios XI interface, but can't seem to access the Core Configuration Manager, try the following two steps to see if it resolves the issue.

Access the Admin->Reset Security Credentials page and reset the subsystem credentials.

Run the following command from the shell:

 touch /usr/local/nagiosxi/html/config.inc.php

Resolving "DB Connect Error [nagiosxi]: Database connection failed"

The problem we identified with GNOME is that the PATH used to find the "service" command gets changed under GNOME. The PATH needs to be set correctly so that the scripts starting with 3-dbservers will run correctly. You can test whether the path is set correctly by trying the following commands:

 service httpd restart
 service postgresql restart

The important thing is that the PATH includes the "sbin" directories. Normally it would look like this, although this isn't the only "correct" answer possible:

/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
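
A quick way to confirm that the "sbin" directories are present is to test the PATH string directly. The helper below is a hypothetical sketch that checks only for /sbin and /usr/sbin as complete entries.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if the given PATH string contains both
# /sbin and /usr/sbin as complete entries.
path_has_sbin() {
    local path_value=":$1:"
    case "$path_value" in
        *:/sbin:*) ;;
        *) return 1 ;;
    esac
    case "$path_value" in
        *:/usr/sbin:*) ;;
        *) return 1 ;;
    esac
}

if path_has_sbin "$PATH"; then
    echo "PATH includes the sbin directories"
else
    echo "PATH is missing /sbin or /usr/sbin: $PATH"
fi
```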

Resolving "NSP: Sorry Dave, I can't let you do that" Errors

Session protection was added in 2009R1.2C to prevent CSRF attacks. The code that does this caused some users to see this error. The problem was due to the user's browser caching older versions of the XI JavaScript code. To prevent this from happening, you need to clear your browser's cache. This is typically done (in Firefox) by holding down the Shift key and clicking Reload. See other well-documented procedures on clearing the browser cache.

The other possible cause of this is that the XI server's time is out of sync with the web browser. Try the following:

 yum install ntp
 ntpdate time.nist.gov


If that still doesn't fix the error, then you may have to specify your timezone in your /etc/php.ini file. Newer releases of PHP require this setting for your server to reflect the correct system time and timezone. To change this setting, edit the /etc/php.ini file with the following line:

 date.timezone = Etc/GMT-13

Change the timezone to match your location. These zones are listed at the following URL: PHP Timezones. After changing the setting, restart your Apache server:

 service httpd restart

"HTTP 500 Error"/"PHP Parse error - Unexpected $end"

For those doing manual installations, some of the tools embedded in Nagios XI use the PHP short tags feature, which is not necessarily enabled on all web servers by default. To fix this issue, locate your php.ini file (located at /etc/php.ini for CentOS installations), and verify that "short_open_tag" is set to "on." We intend to use full tags in future versions, but some components and addons may still use short tags, so we recommend leaving this setting "on."

"ERROR: PostgresQL not running - exiting."

This anomaly occurs rarely during a VM setup of Nagios XI. You may try restarting the server, but in some cases you will have to start the Nagios XI install from the beginning.

The following is an example of what it may look like:

 cp: cannot create regular file `/usr/local/nagiosxi/scripts': 
  Read-only file system
 cp: cannot create regular file `/usr/local/nagiosxi/scripts': 
  Read-only file system
 chown: cannot access `/usr/local/nagiosxi/scripts/reset_config_perms': No such file or directory
 chown: cannot access `/usr/local/nagiosxi/scripts/reset_config_perms.sh': No such file or directory
 chmod: cannot access `/usr/local/nagiosxi/scripts/reset_config_perms': No such file or directory
 chmod: cannot access `/usr/local/nagiosxi/scripts/reset_config_perms.sh': No such file or directory
 /tmp/nagiosxi
 Checking PostgresQL status...
 ERROR: PostgresQL not running - exiting.
 ERROR: Nagios XI database was not setup properly - exiting.

"ERROR: Please add the 'Optional' channel to your Red Hat systems subscriptions."

Red Hat Subscription Manager:

You need to add the Optional software channel so that Nagios XI can install the necessary prerequisites. To do so:

yum install yum-utils

yum-config-manager --enable rhel-6-server-optional-rpms

Red Hat Network Classic:

You need to add the Optional software channel so that Nagios XI can install the necessary prerequisites. To do so, first sign in to your Red Hat Network account at http://rhn.redhat.com/. Then click on the link corresponding to your system. Near the bottom-left corner of the page, click Alter Channel Subscriptions. Check the box labeled RHEL Server Optional and click Change Subscriptions. That's it! You should be able to run the installer again and complete your installation.

"Installation errors on customized corporate builds of CentOS or RHEL"

When companies require the use of their "standard build" of either OS, we have seen Nagios XI fail to install if the umask on the machine has been modified.

"Upgrade errors - root.crontab.orig: cannot overwrite existing file"

We have seen problems when upgrading if there are leftover files from previous upgrades.

This problem can be eliminated by running the following command:

 cat /dev/null > /tmp/nagiosxi/uninstall-crontab-root

After this you can proceed to run the upgrade script again.


Ajaxterm Installation Aborted

Nagios XI 2012 and late versions of 2011 set up Apache to be able to utilize the Ajaxterm subcomponent. This component requires a modification to the /etc/httpd/conf.d/ssl.conf file, and adds some proxy information needed specifically for Ajaxterm. If Apache fails to restart after these modifications are made, the previous configuration is rolled back in order to keep Apache running and the system usable. The bad configuration that the fullinstall/upgrade script attempted to apply is saved to /etc/httpd/conf.d/ajaxterm.fail. This file can be debugged and, once fixed, can replace the existing ssl.conf file in order to utilize Ajaxterm. Note: Remove the /etc/httpd/conf.d/ajaxterm.fail file once the issue is resolved to avoid the error message in the UI. Please contact the Nagios support team with any questions.

Configuration Problems

Apply Configuration Fails: General Troubleshooting

If you receive an error while attempting to Apply Configuration stating that the configuration verification has failed, there is a syntax error or configuration conflict in the configuration that's been defined. You can isolate this issue by accessing the Core Config Manager->Configuration Snapshots page. You should see the most recent snapshot highlighted in red. View the text file from the snapshot to see what config file contained the error. You can then find that file in the associated tar.gz file and search for the problem based on the error message. The snapshot represents the information that is CURRENTLY in the CCM database, which Nagios attempted to save. You'll need to correct the issue through the Core Config Manager, then attempt to Apply Configuration again.

The Write Config Tool in the CCM is a manual tool for writing the DB information to the configuration files (it manually Applies Configuration). It's important to know that Nagios cannot start or restart with a bad configuration. The config verification must pass in order for Nagios to be able to restart successfully with the new configuration.

Configuration Applies, but still get "Configuration File Is Out Of Date" Error

If your configuration is applying successfully and the changes are visible in the XI interface, but you're still seeing an error message in the CCM that says "Configuration File Is Out Of Date", then you may have to specify your timezone in your /etc/php.ini file. Newer releases of PHP require this setting for your server to reflect the correct system time and timezone. To change this setting, edit the /etc/php.ini file with the following line:

 date.timezone = Etc/GMT-13

Change the timezone to match your location. These zones are listed at the following URL: PHP Timezones. After changing the setting, restart your Apache server:

 service httpd restart
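
Before restarting, it can help to confirm which timezone php.ini actually sets. The sketch below is a hypothetical helper that prints the last uncommented date.timezone value from a php.ini file. It does not ask PHP itself, so it only reflects the file you point it at.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the active (uncommented) date.timezone value
# from a php.ini file, or nothing if it is unset.
php_ini_timezone() {
    local ini_file="${1:-/etc/php.ini}"
    sed -n 's/^[[:space:]]*date\.timezone[[:space:]]*=[[:space:]]*//p' "$ini_file" \
        | tr -d '"' | tail -n 1
}

if [ -r /etc/php.ini ]; then
    echo "date.timezone in php.ini: $(php_ini_timezone /etc/php.ini)"
    echo "system time: $(date)"
fi
```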


Apply Configuration Fails, No Configuration Problems

As of 2011 R1.7, extra sanity checks were added to the Apply Configuration functionality of Nagios XI to prevent false positives and also to prevent that page from stalling out endlessly. An example error that can show up is: "Backend login to the Core Config Manager failed"

There are a few different reasons an error like this can show up. The most common one is the use of a proxy that prevents "wget" from being able to resolve to "localhost" correctly. However, if you receive an error message when attempting to Apply Configuration other than "Configuration Error...," run the following commands and send the output file to the Nagios XI support team.

 cd /usr/local/nagiosxi/scripts
 ./reconfigure_nagios.sh &> reconfig.txt

Then also run the following command to begin capturing log output:

 tail -f /usr/local/nagiosxi/var/cmdsubsys.log &> cmd.txt

And attempt to Apply Configuration from the web interface. After the browser has returned some output to the screen, press Ctrl+C to stop the log tail, and send XI support the cmd.txt file and the reconfig.txt that was generated by the above instructions.

Apply Configuration Page Stalls Out, Never Completes

If you attempt to Apply Configuration and you're seeing the following output:

 * Configuration submitted for processing...
 * Waiting for configuration verification.................. 

and the configuration never applies, the page may be timing out. If you've recently updated XI, try restarting the server first. If you're currently running Nagios XI 2011R1.3, there is a known bug that can cause this issue, and you'll need to upgrade to the latest version to resolve it. If that does not resolve the issue, try editing your PHP settings. Open the /etc/php.ini file in a text editor and increase the following values.


 ;;;;;;;;;;;;;;;;;;;
 ; Resource Limits ;
 ;;;;;;;;;;;;;;;;;;;
 max_execution_time = 60     ; Maximum execution time of each script, in  seconds
 max_input_time = 60     ; Maximum amount of time each script may spend parsing request data
 memory_limit = 256M      ; Maximum amount of memory a script may consume 


After this, run:

 service httpd restart


Note: If you're running a large installation with several thousand hosts/services, you may need to increase these numbers more to allow enough time and memory for large configuration changes to take effect.

If the issue persists after the above solutions, it could be caused by creating a local DNS entry for the Nagios XI server, but failing to add that name entry to the Nagios XI server itself. For example, if you're accessing the XI server from the URL http://nagiosserver/nagiosxi, you need to verify that the XI server can also resolve that DNS name correctly. The local DNS entry for the XI server needs to be added to the /etc/hosts file.
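
To check name resolution from the XI server itself, a sketch like the following can be used. It relies on getent, which consults the sources configured on the system (typically /etc/hosts and DNS). The function name is hypothetical, and "nagiosserver" is the example name from the URL above.

```shell
#!/usr/bin/env bash
# Hypothetical check, run on the XI server itself: can this machine resolve
# the name used in the browser URL?
can_resolve() {
    getent hosts "$1" > /dev/null 2>&1
}

xi_name="${1:-nagiosserver}"   # name taken from the example URL above
if can_resolve "$xi_name"; then
    echo "$xi_name resolves on this server"
else
    echo "$xi_name does not resolve; add it to /etc/hosts"
fi
```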

You can observe similar issues if you run out of disk space.

Configuration Applies, No Changes Take Place

This is generally due to permissions issues with the configuration file. Use the Write Config Tool in the Core Config Manager to see if you can manually write the DB information to the config files. If the Write Config Tool returns error messages related to permissions you can run the following script to correct the permission settings:

 /usr/local/nagiosxi/scripts/reset_config_perms

There is a known bug in XI 1.3E and F where this script was not automatically running when configurations were applied. If you're running a Nagios XI version earlier than 1.3g, we recommend updating to correct this issue.

Modifying The Contents Of /usr/local/nagios/etc

  • You can keep custom configuration files in the /usr/local/nagios/etc/static directory
  • Don't modify config files directly in /usr/local/nagios/etc, as they will be overwritten by the Core Config Manager

Unable To Delete Hosts

Hosts can only be deleted after all of their dependent services and associated relationships have been deleted. Make sure to delete any associated services or other objects before deleting the host.

Host Still Visible After Deletion: (Ghost Hosts)

If you have successfully deleted a host and all of its services from the Core Config Manager, but you're still seeing it in the status tables, then you most likely have multiple instances of Nagios running on your machine. To make sure all instances are stopped, type the following at the command line.

 killall nagios
 service nagios start


Host Still Visible In XI After Deletion From the CCM

Go to the Core Config Manager->Write Config Tool, and use that tool to manually write out the configuration data to file. Verify your configuration. If it verifies, go ahead and restart Nagios.

If the host and all of its services are completely deleted in the Core Config Manager, and the actual host config file is still there after using the Write Config Tool, go ahead and delete the config file. The files will be located in the following directories.

/usr/local/nagios/etc/hosts
/usr/local/nagios/etc/services
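
To locate leftover files for a deleted host, a search like the sketch below can help. The function name is hypothetical. It matches the host name as a whole word, so files for similarly named hosts may also be listed, and the results should be reviewed before anything is deleted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list config files under the hosts/ and services/
# directories that still mention a given host name.
find_leftover_cfgs() {
    local host="$1" etc_dir="${2:-/usr/local/nagios/etc}"
    # -r: recurse, -l: print file names only, -w: whole-word match,
    # -F: treat the host name as a fixed string, not a regex.
    grep -rlwF -- "$host" "$etc_dir/hosts" "$etc_dir/services" 2>/dev/null
}

# Example: find_leftover_cfgs old-web-01
```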

On rare occasions the CCM will somehow lose a file. We haven't nailed down what causes it, but it is usually related to deleting the host.

Network status map parent/child relationship not updating (v1.3)

Underneath the Parents box in the CCM, make sure the "standard" radio button is selected. If "null" is selected your parent host selection doesn't get written to disk. We're working on a method of fixing the CCM so this doesn't happen with several fields.

Warning: Duplicate definition found for contact 'xi_default_contact'

This usually happens if you import the "static" directory config files in Nagios XI. When you try to apply configuration, you see an error, similar to this one:

 Warning: Duplicate definition found for contact 'xi_default_contact' 
 (config file '/usr/local/nagios/etc/contacts.cfg', starting on line 79)
 Error: Could not add object property in file '/usr/local/nagios/etc/contacts.cfg' on line 80.
 Error processing object config files!

You can resolve this by running the following command and then applying configuration:

 curl -s http://assets.nagios.com/downloads/nagiosxi/scripts/fix_static_import | mysql -pnagiosxi nagiosql

Core Config Manager Problems

GUI Issues

Most of these are related to IE's implementation of JavaScript. If possible, use a browser that more closely implements the ECMAScript Language Specification.

If the Core Config Manager is not visible or components are missing from the page, the cause is generally a proxy. The following thread covers how to address this issue:
Nagios Core Config Manager not showing up.

Configuration Changes

If you make changes to your configuration and they are not reflected in XI, it may be due to file permissions. Here are two options to try:

  • Reset File Permissions

Execute the following command to reset your configuration file permissions.

 /usr/local/nagiosxi/scripts/reset_config_perms

You can also view if you have any permissions related issues by accessing the Admin->Check File Permissions page in the XI interface (v1.3g+).

Restoring Default Configuration

If you've somehow messed up your configurations irreparably, or simply want to reset a test system, you can restore the configuration to the defaults as shipped with XI. To do so, download these two files and transfer them (via SCP) to your XI server:

  • restore_defaults.sh
  • nagiosql_defaults.sql

Then, log into the console of your XI server and, in whatever directory you put those two files, run these commands:

 chmod +x restore_defaults.sh
 ./restore_defaults.sh

This will delete all of your hosts and services and reload just the demo ones that were initially set up.

Making A Mass Change In The CCM

Changing The Field Entry For A Large Amount Of Objects

Occasionally admins need to change a specific setting for a large number of services or hosts, and the change can't be made from a template. Although we highly recommend templating whenever possible, sometimes the change simply can't be made there. Our unofficial solution is to write a SQL query that updates the relevant DB fields directly. NOTE: Test your queries on a single test host/service first, and try this solution at your own risk; we are not responsible if you break something with this! Here's an example, posted by a user, of a change made to the check_interval for all 'Disk Monitor' services.

 mysql> use nagiosql;
 mysql> update tbl_service set check_interval=60 where service_description='Disk Monitor';
 mysql> select config_name, service_description, check_interval from tbl_service where service_description='Disk Monitor';

If the change you wanted was successful, Apply Configuration to write the changes to the config files.
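
Before running a bulk update like the one above, it can help to dump the nagiosql database and to generate the statement from a small helper, so that quoting mistakes are less likely. The following is only a sketch: the function name build_interval_update is hypothetical, and the commented usage lines assume the default -pnagiosxi credentials used elsewhere on this page.

```shell
# Hypothetical helper: print an UPDATE statement that sets check_interval
# for every service with the given description.
build_interval_update() {
    desc=$1
    interval=$2
    # Reject anything that is not a whole number.
    case $interval in
        ''|*[!0-9]*) echo "interval must be a whole number" >&2; return 1 ;;
    esac
    # Double any single quotes so the description is a valid SQL string literal.
    esc=$(printf '%s' "$desc" | sed "s/'/''/g")
    printf "UPDATE tbl_service SET check_interval=%s WHERE service_description='%s';\n" \
        "$interval" "$esc"
}

# Example usage (commented out; assumes the default nagiosql credentials):
#   mysqldump -pnagiosxi nagiosql > /tmp/nagiosql_backup.sql
#   build_interval_update 'Disk Monitor' 60 | mysql -pnagiosxi nagiosql
```

As with the manual query, the change only reaches the config files after Apply Configuration.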

Using Scripts To Make Changes in the CCM

Some admins make use of internal scripts to update and maintain their monitoring environment. Although we're only able to offer limited support on a situation like this, a useful script to know about is:

 /usr/local/nagiosxi/scripts/reconfigure_nagios.sh  

This is the command-line version of "Apply Configuration" in the XI interface. It will write the CCM DB info to the config files and restart Nagios.

To automate importing configs using scripts, you can simply place config files in the /usr/local/nagios/etc/import directory, and then run the reconfigure_nagios.sh script. This will handle the import to the DB, writing the configs, verification, and then restarting Nagios.
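
The import workflow described above could be wrapped roughly as follows. This is a sketch under assumptions: stage_configs and the /root/new_configs staging directory are hypothetical names, and only the import directory and the reconfigure_nagios.sh path come from this page.

```shell
# Hypothetical helper: copy every .cfg file from a staging directory into the
# import directory and print how many files were staged.
stage_configs() {
    src=$1
    import_dir=${2:-/usr/local/nagios/etc/import}
    count=0
    for f in "$src"/*.cfg; do
        [ -f "$f" ] || continue      # the glob matched nothing
        cp "$f" "$import_dir"/ || return 1
        count=$((count + 1))
    done
    echo "$count"
}

# Example usage (commented out; run as root on the XI server):
#   staged=$(stage_configs /root/new_configs)
#   [ "$staged" -gt 0 ] && /usr/local/nagiosxi/scripts/reconfigure_nagios.sh
```

Running the reconfigure script only when at least one file was staged avoids restarting Nagios for an empty import.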

Currently there is not a streamlined way to remove hosts and services from the Core Config Manager using scripts. We hope to have features like this implemented in 2012.

Performance Graph Problems

General Performance Graph Troubleshooting

Performance graphs can experience a number of issues. Below are solutions for the most common problems. The log verbosity should be increased before troubleshooting, and should be returned to default settings once resolved.

Increase Performance Data Logging Verbosity.

Edit the file:

 /usr/local/nagios/etc/pnp/process_perfdata.cfg

Change:

 LOG_LEVEL = 0

To:

 LOG_LEVEL = 2

The process_perfdata.pl script should now log all errors and debug information to:

 /usr/local/nagios/var/perfdata.log

Remember to return this value to its default setting when troubleshooting is completed.

Increase NPCD Logging Verbosity.

Edit the file:

 /usr/local/nagios/etc/pnp/npcd.cfg

Change the Default value from:

 log_level = 0

To:

 log_level = -1

Save out and restart NPCD:

 service npcd restart

NPCD should now log all errors and debug information to:

 /usr/local/nagios/var/npcd.log

Remember to return this value to its default setting when troubleshooting is completed.
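
Because these values have to be changed and later changed back, a small helper can make the edits repeatable. The sketch below assumes the "KEY = VALUE" layout shown above; the function name set_cfg_value is hypothetical, and it keeps a .bak copy of the previous file.

```shell
# Hypothetical helper: set "KEY = VALUE" in a PNP-style config file,
# keeping a .bak copy of the previous contents.
set_cfg_value() {
    file=$1
    key=$2
    value=$3
    cp -p "$file" "$file.bak" || return 1
    # Rewrite only the line that assigns KEY; all other lines pass through.
    sed "s|^[[:space:]]*$key[[:space:]]*=.*|$key = $value|" "$file.bak" > "$file"
}

# Example usage (commented out):
#   set_cfg_value /usr/local/nagios/etc/pnp/process_perfdata.cfg LOG_LEVEL 2
#   set_cfg_value /usr/local/nagios/etc/pnp/npcd.cfg log_level -1
#   service npcd restart
```

Writing the result back through the original path, rather than moving a new file into place, leaves the file's owner and mode untouched.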

Perfdata Timeout

As many installations grow, the perfdata processing timeout value may need to be increased. Check the perfdata log for any recent timeout errors:

 tail -50 /usr/local/nagios/var/perfdata.log | grep TIMEOUT

If the grep found any recent errors, change the TIMEOUT by editing the file:

 /usr/local/nagios/etc/pnp/process_perfdata.cfg

Change the default value from:

 TIMEOUT = 5

To:

 TIMEOUT = 20

As your installation grows further, this value may need to be increased even more.
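
To make the timeout check scriptable, the tail and grep step above can be wrapped so it returns a count. This is a sketch; count_recent is a hypothetical name, and the default of 50 lines simply mirrors the command above.

```shell
# Hypothetical helper: count lines containing a fixed string within the
# last N lines of a log file (N defaults to 50).
count_recent() {
    log=$1
    needle=$2
    lines=${3:-50}
    if [ ! -f "$log" ]; then
        echo 0
        return 0
    fi
    # grep -c prints the number of matching lines (0 when there are none);
    # "|| true" hides grep's non-zero exit status in that case.
    tail -n "$lines" "$log" | grep -c -F -- "$needle" || true
}

# Example usage (commented out):
#   count_recent /usr/local/nagios/var/perfdata.log TIMEOUT
```

A non-zero count suggests the TIMEOUT value is worth raising.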

NPCD Load Threshold

Bulk NPCD processing has a load threshold setting that is intended to halt performance processing if the system is under heavy load. Large installations will need this value increased and NPCD restarted.

Check the NPCD log for load warnings (if the log file does not exist, increase the log level, restart npcd, and wait 5 minutes before proceeding):

 tail -50 /usr/local/nagios/var/npcd.log | grep "MAX load reached"

If any recent errors are found, increase load threshold by editing the file:

 /usr/local/nagios/etc/pnp/npcd.cfg

Change:

 load_threshold = 10.0

To:

 load_threshold = 20.0

Save out and restart NPCD:

 service npcd restart

For really large installations, or servers with minimal resources, you may need to increase the npcd load_threshold and perfdata TIMEOUT even more than is suggested above.

Unexpected Number of Datasources

Nagios XI stores performance data in RRDs (Round Robin Databases). These are binary files with a static number of "tracks" (datasources). If a check is changed to return more datasources of performance data than the RRD was initially created for, those additional metrics will not be added to the RRD.

To verify if this is the case, check the perfdata.log file (you may have to increase logging verbosity):

 tail -50 /usr/local/nagios/var/perfdata.log | grep "ERROR" | grep "expected"

If the grep found any errors, the number of datasources returned for the particular check has changed since the RRD was created.

The easiest resolution is to delete the RRD in question, as a new one will be created correctly after the next few checks. Be aware that deleting the RRD will result in the loss of historical performance data for the check.
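
As a rough illustration, the number of datasources a check currently returns can be counted from its output, and the old RRD can be moved aside rather than deleted outright. Both function names are hypothetical, and the sketch assumes the usual "text|label=value;warn;crit ..." plugin output with labels that contain no spaces.

```shell
# Hypothetical helper: count the datasources in one line of plugin output.
count_datasources() {
    case $1 in
        *'|'*) ;;                 # has a perfdata section
        *) echo 0; return 0 ;;
    esac
    perf=${1#*"|"}                # everything after the first "|"
    printf '%s\n' "$perf" | awk '{ print NF }'
}

# Hypothetical helper: move an RRD aside instead of deleting it.
retire_rrd() {
    rrd=$1
    [ -f "$rrd" ] || { echo "no such file: $rrd" >&2; return 1; }
    mv "$rrd" "$rrd.old"
}

# Example usage (commented out; the RRD path is an assumed layout):
#   count_datasources "$(/usr/local/nagios/libexec/check_icmp -H 192.168.5.254)"
#   retire_rrd /usr/local/nagios/share/perfdata/myhost/My_Service.rrd
```

Keeping the renamed file means the historical data can still be inspected with rrdtool until you are sure it is no longer needed.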

Performance Graphs Are Missing Or Not Displayed

This can happen for a variety of reasons, but there are several simple solutions that resolve this issue for most people:

  • Make sure you're using the latest version of Nagios XI. Old releases may have issues that the solutions below will not necessarily resolve. Upgrading Nagios XI


  • Verify that process_perfdata.pl has correct permissions. Make sure that the file /usr/local/nagios/libexec/process_perfdata.pl has execute permissions and is owned by nagios:nagios.

  • 2011 R1.8 fix. There is a known bug on some XI installs of this release that leaves the performance data directory with incorrect permissions. This can be resolved by running the following command as the root user:

 chmod -R +x /usr/local/nagios/share/perfdata/


  • 1.6 and 1.7 RHEL/CentOS 6 Users. There were some hiccups with the repos which cause a necessary component for MRTG graphing to not be installed. This is a very simple fix. Log into the CLI of your Nagios XI server as root, and type:
   yum install bc

That should fix the graphing issues. Note that this does not apply to versions of Nagios XI later than 1.8.

  • Run the command manually. Try running the command that Nagios XI runs to check status of a device. For instance, when monitoring a router or switch, Nagios XI uses the check_rrdtraf plugin. Test running this plugin manually by navigating to your libexec directory and running a check, similar to the following:
   ./check_rrdtraf -f '/var/lib/mrtg/192.168.6.1_1.rrd' -w 1 -c 2

This should return something that looks like:

   OK - Current BW in: 1.57Kbps Out: 365.41bps|in=1.573002Kb/s;1;2 out=365.413424b/s;1;2

If it gives errors, then that is the problem. Fix the issues the error gives and then Nagios XI can start graphing performance data.

  • Check perfdata directory permissions. Nagios XI needs to be able to write to its nagios/share/perfdata/ directory. Check the file permissions on that directory and its subdirectories. For example:
   ll /usr/local/nagios/share/perfdata

Should return something like this:

    drwxrwxrwx 2 nagios nagios 4096 Oct 18 17:01 192.168.5.1
    drwxrwxrwx 2 nagios nagios 4096 Oct 18 17:02 192.168.5.4
    drwxrwxrwx 2 nagios nagios 4096 Oct 17 15:36 imap.fusemail.net
    drwxrwxrwx 2 nagios nagios 4096 Oct 18 17:02 localhost

If those folders are not writable and readable by Nagios, then that is the problem, and you should set write and read access for Nagios. Please note that all files contained in these folders also need to be writable and readable by nagios.

  • Reset File Permissions

Execute the following command to reset your configuration file permissions.

 /usr/local/nagiosxi/scripts/reset_config_perms

You can also check for permission-related issues on the Admin->Check File Permissions page in the XI interface (v1.3g+).

  • Make sure you have not removed or renamed the nagiosadmin user. This user is the nagios equivalent to 'root user' and should never be removed.


  • Make sure your password for Nagios XI only contains alpha-numeric characters. Some users have reported graphs disappearing from using special characters, creating a permissions issue.
  • Performance graphs are pulled via an internal proxy, so users with their Nagios server behind their own proxy or using strict SSL settings may experience problems viewing graphs. If you're using an environment with a proxy or SSL and having issues viewing graphs post the problem to our support forums and specify your use of proxy or SSL right away.
  • Having an internal DNS hostname that is not defined on the XI server can also cause problems with the internal proxy call. If you've defined a custom DNS host entry for your XI server, make sure it's defined in your /etc/hosts file as well. For further information on this, contact our support team at support.nagios.com/forum.
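
The perfdata directory permission check from the list above can be approximated with find. This is a sketch only: list_owner_unwritable is a hypothetical name, and it looks at the owner's read and write bits, so ownership by the nagios user still needs to be confirmed separately (the commented command shows one way).

```shell
# Hypothetical helper: list entries under a directory whose owner lacks
# read or write permission. Prints nothing when everything looks fine.
list_owner_unwritable() {
    dir=${1:-/usr/local/nagios/share/perfdata}
    find "$dir" ! -perm -u=rw -print
}

# Example usage (commented out):
#   list_owner_unwritable
#   find /usr/local/nagios/share/perfdata ! -user nagios -print
```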

Network Performance Graphs Are Displayed But Have No Data

2011R3.2 and 3.3 issue: graphs display but are empty. Try running the following commands to see whether an excessive number of performance data files has built up.

 cd /usr/local/nagios/var/spool/xidpe
 ls -f | wc -l

If the file count is very large, run the following commands, which should restore regular performance graphing.

 cd /usr/local/nagios/var/spool
 rm -rf xidpe
 mkdir xidpe
 chown nagios:nagios xidpe
 chmod 755 xidpe
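
The file-count check above can also be scripted so it can run from cron or a custom check. In this sketch, spool_backlog is a hypothetical name and the default limit of 10000 files is an arbitrary assumption; this page only says the count should not be "very large".

```shell
# Hypothetical helper: report how many files sit in the spool directory and
# return non-zero when the count exceeds a limit.
spool_backlog() {
    dir=${1:-/usr/local/nagios/var/spool/xidpe}
    limit=${2:-10000}             # assumed threshold, adjust to taste
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    if [ "$count" -gt "$limit" ]; then
        echo "backlog: $count files"
        return 1
    fi
    echo "ok: $count files"
}

# Example usage (commented out):
#   spool_backlog || echo "consider clearing the xidpe spool as described above"
```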

Only Switch and Router Graphs display but have no data

A fuller description of this problem is when you are monitoring a switch or router, but its bandwidth graphs are always zero when you know for sure they should have data. Keep in mind, be absolutely sure that the graphs should have data.

  • Make sure the /var/lock/mrtg directory exists. This directory has been seen to disappear occasionally. Recreating it is trivial:
   mkdir /var/lock/mrtg
  • Make sure none of the mrtg.cfg entries are using SNMP v2c. Older versions of the Switch Wizard called mrtg with arguments for SNMPv2c, which MRTG does not use. Open up /etc/mrtg/mrtg.cfg and look for
   Target[www.hostaddress.com]: 1:SNMP_Community_String@www.hostaddress.com:::::1

Notice that after the multitude of colons, there is a 1, this represents the SNMP version MRTG will use to poll the device. If this is instead 2c, change it to 2 and save the file. This will need to be done to every metric that is affected by being created with 2c.
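
When many targets are affected, the edit can be applied to every Target line at once. The following is a sketch: fix_mrtg_snmp_version is a hypothetical name, and it assumes the version is the last field of each Target line, as in the example above. A .bak copy of the file is kept.

```shell
# Hypothetical helper: change a trailing SNMP version of "2c" to "2" on
# Target[...] lines only, keeping a backup of the original file.
fix_mrtg_snmp_version() {
    cfg=${1:-/etc/mrtg/mrtg.cfg}
    cp -p "$cfg" "$cfg.bak" || return 1
    # Lines that do not start with "Target[" are passed through unchanged.
    sed '/^Target\[/ s/:2c[[:space:]]*$/:2/' "$cfg.bak" > "$cfg"
}

# Example usage (commented out):
#   fix_mrtg_snmp_version /etc/mrtg/mrtg.cfg
#   diff /etc/mrtg/mrtg.cfg.bak /etc/mrtg/mrtg.cfg
```

Reviewing the diff afterwards confirms that only the intended lines changed.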


Can I Migrate Performance Data From A Different Install?

RRD performance data files are binary files whose format depends on the machine architecture, so a simple file transfer only works when the architecture matches on both machines. If you want to migrate files from a 32-bit to a 64-bit machine, you'll have to dump the data to XML and restore it into RRDs on the new machine. Forum user srrhd was kind enough to supply the commands used for a working migration:

On the old 32bit machine:

 cd /usr/local/nagios/share/perfdata/
 for i in `find -name "*.rrd"`; do rrdtool dump $i > $i.xml; done
 tar -cvzf perfdata.tar.gz */*.rrd.xml
 for i in `find -name "*.rrd.xml"`; do rm -f $i; done

Then transfer the archive to the same directory on the new server. On the new x86_64 server:

 cd /usr/local/nagios/share/perfdata/
 for i in `find -name "*.rrd"`; do rm -f $i; done
 tar -xvzf perfdata.tar.gz
 for i in `find -name "*.rrd.xml"`; do rrdtool restore $i `echo $i | sed 's/\.xml$//'`; done
 for i in `find -name "*.rrd"`; do chown nagios:nagios $i; done
 for i in `find -name "*.rrd.xml"`; do rm -f $i; done

Notification Problems

Basic Troubleshooting Steps

1. Email Tests

Send a test email to see if the Nagios server can send email to an account by going to:

Configure->My Account Settings->Send Test Notifications.

Then, check to see if the test email arrives. If it doesn't arrive, the problem could be one of the following:

- Nagios server cannot send mail outside of your network (if you are using Sendmail)

- Also Nagios may not be able to relay mail through your company server (if you are using SMTP)

Outbound SMTP connections may be blocked by your border firewall. Lastly, unauthenticated SMTP relaying may be denied somewhere downstream - try switching email methods from Sendmail to SMTP in the admin section.

2. User's Notification Options

Check if Notifications are enabled globally - click on the "Monitoring Process" menu on the left from the Home page, and make sure you see a green dot next to the Notifications in the "Monitoring Engine Process" window. You can enable/disable Notifications by clicking on the "Action" button on the right hand side.

Check if Notifications are enabled for the user currently logged into Nagios XI - click on the username in the upper right corner next to "Logged in as: ...", then click on "Notification Preferences" under "Notification Options" from the left panel menu. Make sure that the "Enable Notifications" check-box is checked.

Review the selected Notification Types - the user will be notified only on host/service states, that are selected.

From the same page, click on "Notification Methods" and make sure a Notification Method is selected.

3. Host/Service Notification Options

Check if Notifications are enabled for a particular host/service. If you are having issues with Notifications for a particular Host or Service, log into the Core Config Manager and click on "Hosts" or "Services" under "Monitoring" from the left panel menu. Find your Host or Service and click on the "Modify" Action button to the right. Click on "Alert Settings" tab and verify that the "on" radio button next to the "Notification Enabled" is selected.

Make sure that the Check Period under the "Check Settings" tab is equal or larger than the Notification Period under the "Alert Settings" tab on the Host/Service Management page in the CCM. If Nagios is not checking a host or service during a specific time, then it will certainly not send notification during that time.

Check the "Alert Settings" tab under the Host/Service Management page in the CCM for two things:

- Make sure "Notification enabled" is not set to "off".

- See which options are selected under "Notification options", because this will determine the states of hosts/services that you will be notified for.

Note: If you are having issues with many hosts and services, you should check the templates you are using - "xiwizard_generic_host" and "xiwizard_generic_service" should be the first ones to be checked. Any changes you make in these templates will affect all hosts and services that reference them. You can override this by modifying the host or service configuration itself. If you need to know more on the topic, please read the full explanation of Nagios object inheritance here: http://nagios.sourceforge.net/docs/3_0/objectinheritance.html

4. Contacts

The contact must be either directly associated with the host or service or be part of a contactgroup that is connected to the host or service.

Make sure users and contacts that were added within Nagios XI are set up with the proper notification handlers:

  • If you are using Users, which are also Contacts (you've added a Contact to them):

xi_host_notification_handler and xi_service_notification_handler

  • If you are using Contacts only:

notify-host-by-email and notify-service-by-email

Contacts and users are similar but not the same - read more about it here: http://assets.nagios.com/downloads/nagiosxi/docs/XI_Users_And_Contacts.pdf

If you are not receiving notifications, it is also possible that the nagiosadmin user was set to use the generic_template contact template, which results in notifications not being controlled through the XI interface. This can be corrected by changing the user's contact template to xi_generic_template in the Core Config Manager. This bug was corrected in 2009R1.2 and only affects systems that had/have previous versions installed.

5. Contact Timeperiods

Each contact has a timeperiod management option that determines when they get notifications. Closely review whether there are any time exclusions set within the contact's timeperiod. These are times when the user will not be sent notifications.

6. Acknowledgements and Scheduled Downtime

If the problem has been acknowledged or the host/service is in downtime, alerts won't be sent.

7. Testing From Host or Service (Sending Custom Notification)

If you proceed to the host or service in question on the Nagios server and then select the Advanced tab, you can send a test email (custom notification) from the specific host or service that you are testing.

8. Tracking Notifications

If you go to Home->Incident Management->Notifications, you should see that Nagios is sending notifications based on the settings you have chosen and to the appropriate contacts. This tool helps you track down whether Nagios intends to notify the appropriate contact.

Test Emails Fail, "Invalid address" Error

We identified a bug in 1.9 and some earlier versions where test emails to addresses like "root@localhost" or "user@xiserver" fail to send because they fail email address validation. The email address needs to have some sort of domain at the end of it to pass validation and send. The browser may falsely display a success message for users testing from their "Send Test Notification" page, while it will show an error message if a user runs the test from the Admin->Manage Email Settings->Send A Test Email page. This bug will be fixed in R1.10. As a workaround in the meantime, make sure the Nagios XI Sending Address on the Admin->Manage Email Settings page is set to an email address with an FQDN. The address listed below will also work:

 Nagios XI <root@localhost.localdomain>

Make sure initial setup for the Admin->Manage Email Settings page has been done and that you've pressed Update on the email settings.

This bug can be identified by a debug message showing up at the top of the test email page that says "Invalid address:".

This bug is specific to installations using PHP 5.2+.

XI Display Problems

Tables Displaying A Count, But No Results

A recent issue has been identified where characters outside of the ASCII table are being generated by some of the check plugins, which causes an issue with XI's XML generation. The result is a table with a returned count of services, but no actual table data. This issue can be verified by checking the following url:

 http://<serveraddress>/nagiosxi/backend/?cmd=getservicestatus

If this XML page returns an error, it should identify the line number of the issue which can be found in the page source. Below is a code patch that will be included in the next update of XI. Paste this code as a replacement to the xmlentities() function on line 30 of the /usr/local/nagiosxi/html/includes/utilsx.inc.php

 function xmlentities($string){
       $data=str_replace ( array ( '&', '"', "'", '<', '>' ), 
        array ( '&amp;' , '&quot;', '&apos;' , '&lt;' , '&gt;' ), $string );
       preg_match_all('/([\x09\x0a\x0d\x20-\x7e]'. // ASCII characters
       '|[\xc2-\xdf][\x80-\xbf]'. // 2-byte (except overly longs)
       '|\xe0[\xa0-\xbf][\x80-\xbf]'. // 3 byte (except overly longs)
       '|[\xe1-\xec\xee\xef][\x80-\xbf]{2}'. // 3 byte (except overly longs)
       '|\xed[\x80-\x9f][\x80-\xbf])+/', // 3 byte (except UTF-16 surrogates)
       $data, $clean_pieces );
       $clean_output = join('?', $clean_pieces[0] );
       return $clean_output;
       }


Problems with Check Commands

How To Test Check Commands From The Command-line

Okay, you'll need to go through a few steps to establish what exactly is being run. Grab some paper to note settings as you go. Start by going to the Core Config Manager (under "Configure"), under Services in the left sidebar, find the service in question, and click the crossed tools "Configure" icon. On the "Common Settings" tab, note what it says for "Command view", the values of the eight ARG variables, and anything listed under "Additional templates". Now, in the left sidebar again, click "Templates -> Service templates", and find any that were listed on the previous step. If any of the ARG variables that were blank on the first page are filled in here, write down the value on the template. Repeat this step if any of the templates in turn have templates listed on their definitions. Similarly, if the Check command and Command view were blank, fill them in from the template.

Now, starting with what you had for "Command view", replace $USER1$ with /usr/local/nagios/libexec , and replace $HOSTADDRESS$ with the IP address of the host this service is associated with.

As an example, I have a host called "Server Room", with an IP address of 192.168.5.254, and am running a simple ping check against it. For "Check command" and "Command view" they're blank, $ARG5$ = -p 5, and for templates it has "xiwizard_websensor_ping_service". The template for xiwizard_websensor_ping_service has a "Check command" of "check_xi_service_ping" and a "Command view" of '$USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$,$ARG2$ -c $ARG3$,$ARG4$ $ARG5$', with $ARG1$ = 3000.0, $ARG2$ = 80%, $ARG3$ = 5000.0, $ARG4$ = 100%, $ARG5$ = -p 8, and a template of "xiwizard_generic_service". The "xiwizard_generic_service" template has a check command of "check_xi_service_none" and a command view of '$USER1$/check_dummy 0 "Nothing to monitor"', with blank args and no additional template. Nothing gets filled in from this template because all of the values it defines are already defined in a higher-priority setting.

Here, the first step is to look at '$USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$,$ARG2$ -c $ARG3$,$ARG4$ $ARG5$'. Step two fills in $ARG5$ from the service definition, and we get '$USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$,$ARG2$ -c $ARG3$,$ARG4$ -p 5'. Step three gets args 1-4 from the xiwizard_websensor_ping_service template, giving '$USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5'. The $ARG5$ is left alone because it was already set. Step four does nothing - the last template doesn't have any new info. Step five is to fill in the macros, so you get '/usr/local/nagios/libexec/check_icmp -H 192.168.5.254 -w 3000.0,80% -c 5000.0,100% -p 5'. That's your full check command.
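
The substitution steps above can be reproduced mechanically. The sketch below uses a hypothetical helper, expand_macro, and fills in the values in the same order as the walkthrough (service definition first, then the template, then the $USER1$ and $HOSTADDRESS$ macros). It is only an illustration of the process, not how Nagios itself expands macros internally.

```shell
# Hypothetical helper: replace every literal $NAME$ in TEXT with VALUE.
# awk's index/substr are used so the value is inserted literally
# (characters such as % or & need no escaping). Values containing
# backslashes are not handled by this sketch.
expand_macro() {
    awk -v text="$1" -v macro="\$$2\$" -v value="$3" 'BEGIN {
        out = ""
        while ((i = index(text, macro)) > 0) {
            out = out substr(text, 1, i - 1) value
            text = substr(text, i + length(macro))
        }
        print out text
    }'
}

cmd='$USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$,$ARG2$ -c $ARG3$,$ARG4$ $ARG5$'
cmd=$(expand_macro "$cmd" ARG5 '-p 5')      # set on the service itself
cmd=$(expand_macro "$cmd" ARG1 '3000.0')    # remaining args from the template
cmd=$(expand_macro "$cmd" ARG2 '80%')
cmd=$(expand_macro "$cmd" ARG3 '5000.0')
cmd=$(expand_macro "$cmd" ARG4 '100%')
cmd=$(expand_macro "$cmd" USER1 '/usr/local/nagios/libexec')
cmd=$(expand_macro "$cmd" HOSTADDRESS '192.168.5.254')
echo "$cmd"
```

Because the service-level $ARG5$ is substituted first, the template's "-p 8" never gets a chance to apply, which mirrors the priority rule described above.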

Now, log into your Nagios XI server as root, either on a direct terminal or through SSH. Enclose your command in single quotes like I've been doing here, put su -c before it and nagios after it, and hit enter. It should look something like this:


 [root@demo ~]# su -c '/usr/local/nagios/libexec/check_icmp -H 192.168.5.254 -w 3000.0,80% -c 5000.0,100% -p 5' nagios
 OK - 192.168.5.254: rta 50.903ms, lost 0%|rta=50.903ms;3000.000;5000.000;0; pl=0%;80;100;;
 [root@demo ~]#


Obviously that will be filled in with different details based on the check you're trying to run, but hopefully that demonstrates the progression of how to build the line.


Problems with $ Signs in the Check Command

(Solution posted by Dietmar Lang)

In your service definition file, you may need to pass a $ symbol as an argument to a service check. For example, MS SQL Server instances have names like "MSSQL$INSTANCE". Your service definition would look like this:

 check_command check_nt!SERVICESTATE!-d SHOWALL -l MSSQL$INSTANCE

This will not work.

For Nagios 3, add two backslashes and a second dollar symbol (\\$$), like this:

 check_command check_nt!SERVICESTATE!-d SHOWALL -l MSSQL\\$$INSTANCE


Windows Memory Check Values Doubled

(contributed by Forum user GreatWolfResorts)

This is a result of how the check_nt plugin calculates memory values. The preferred solution for most users seems to be to use the check_nrpe plugin to distinguish the memory types.

Quoted from GreatWolfResorts: I essentially created the following custom command:

check_xi_service_nrpe:

 $USER1$/check_nrpe -H $HOSTADDRESS$ -p 5666 -c $ARG1$ -a MaxWarn=$ARG2$% MaxCrit=$ARG3$% $ARG4$ $ARG5$
 $ARG4$ = "type=physical"

Note: You will need to enable some NRPE commands in the nsc.ini file on the remote device. Specifically: allow_arguments=1

Alternatively, a full understanding of the check_nt MEMUSE command helps when reviewing the values returned. Windows refers to the sum of memory and swap files, that is, the entire available virtual memory. Windows regularly swaps program and data code from the main memory, even when it still has spare reserves. In this respect, the load of the entire virtual memory in Windows is a more important parameter to observe than physical memory or swap alone.

So in the end, the values returned weren't necessarily a bug in NagiosXI or nsclient++, but rather a view of the virtual memory of the machine.

Hope this helps!

Additional Documentation: [Enabling NRPE with NSClient]

Windows Event Log Check

WMI can be used to gather information from the Windows Event Log. Here are some example command definitions for use with check_wmi_plus.

Check the Windows system event log for errors in the last 4 hours. Warn on 1 occurrence, critical on 6 or more.

 check_xi_service_wmiplus!administrator!password!checkeventlog!-a system -o 2 -3 4 -w 1 -c 6

Check the Windows application event log for errors in the last 1 hour. Warn on 3 occurrences, critical on 6 or more.

 check_xi_service_wmiplus!administrator!password!checkeventlog!-a application -o 2 -3 1 -w 3 -c 6

Linux Cached Memory Not Added to Free Memory

It is normal for Linux to "borrow" unused memory for disk caching. This may however create false "Warning" or "Critical" alerts, even though you are NOT low on memory. In order to fix this, we have modified the "custom_check_mem" script, part of our Linux agent install script by adding an optional flag [-n|--nocache]. Basically, cached memory is added to the free memory when you use the "-n" flag.

 Usage: custom_check_mem [-w|--warning]<percent free> [-c|--critical]<percent free> [-n|--nocache]
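
To see why the flag matters, consider the arithmetic only. The sketch below is a hypothetical illustration, not the plugin's actual code: percent_free takes total, free and cached memory in the same unit, and adds the cached amount to free memory when "-n" is given.

```shell
# Hypothetical illustration of the arithmetic; not the plugin's code.
# Arguments: total, free and cached memory (same unit), optional -n flag.
percent_free() {
    total=$1
    free=$2
    cached=$3
    if [ "${4:-}" = "-n" ]; then
        free=$((free + cached))   # count cache as available memory
    fi
    echo $((free * 100 / total))
}

percent_free 8000 800 4000       # prints 10
percent_free 8000 800 4000 -n    # prints 60
```

With thresholds of "-w 20 -c 10", the first result would raise an alert even though most of the "used" memory is reclaimable cache; the second would not.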

If you are downloading a new copy of our Linux agent, the updated "custom_check_mem" will be included. If you already installed the Linux agent, you can just download the updated "custom_check_mem" from here.

Copy the new script over the old "custom_check_mem".

Go to the Core Config Manager->Monitoring->Services->Memory Usage->Modify and under the "Common Settings" tab modify the $ARG2$ field by adding a "-n" flag.

For example, if you had:

 -a '-w 20 -c 10'

change it to:

 -a '-w 20 -c 10 -n'

Click on "Save" and "Apply Configuration".

Note: One gotcha - make sure the "custom_check_mem" has Unix EOL before you copy it over.
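
If the script was edited or downloaded on Windows, its line endings can be converted before copying it into place. This is a minimal sketch with a hypothetical function name, to_unix_eol; tools such as dos2unix do the same job where they are installed.

```shell
# Hypothetical helper: convert a script from DOS (CRLF) to Unix (LF)
# line endings in place.
to_unix_eol() {
    file=$1
    tmp=$(mktemp) || return 1
    tr -d '\r' < "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back through the original path so its owner and mode are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example usage (commented out):
#   to_unix_eol custom_check_mem
```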

Other Issues

Nagios did not exit in a timely manner

For use when Nagios doesn't appear to be exiting cleanly. If the run file, lock file, or temp check files are getting left behind, try this modification around line 150 of /etc/init.d/nagios, which increases the for loop from 10 seconds to 30 seconds. This gives the Nagios daemon more time to cleanly shut down all of its processes and clean up after itself.


 # now we have to wait for nagios to exit and remove its
 # own NagiosRunFile, otherwise a following "start" could
 # happen, and then the exiting nagios will remove the
 # new NagiosRunFile, allowing multiple nagios daemons
 # to (sooner or later) run - John Sellens
 #echo -n 'Waiting for nagios to exit .'
 for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30; do
     if status_nagios > /dev/null; then
         echo -n '.'
         sleep 1
     else
         break
     fi
 done



Upgrade to 2011R3.x Issues

If you're experiencing any of the following issues after an upgrade from Nagios XI 2011r2.x to 3.x:

  • Missing hosts or services or status data
  • Takes a VERY long time to Apply Configuration or restart the Nagios process
  • Unusually high CPU load
  • A flood of messages in the /var/log/messages related to ndo2db

Then you may need to manually set a few kernel settings on your system. In Nagios XI 2011r3.x+ the Ndoutils subcomponent now uses asynchronous writes to log status information to the database, and these messages are sent to the Linux kernel's message queue. Our upgrade scripts will tune the kernel settings automatically as of 2011r3.2, but in the event that you see the above symptoms on your system, we recommend applying the following settings to your system.

  • Open /etc/sysctl.conf with a text editor. Edit the file to match the following values:
 # Controls the default maximum size of a message queue, in bytes
 kernel.msgmnb = 131072000
 
 # Controls the maximum size of a message, in bytes
 kernel.msgmax = 131072000
 
 # Controls the maximum shared segment size, in bytes
 kernel.shmmax = 4294967295
 
 # Controls the maximum total amount of shared memory, in pages
 kernel.shmall = 268435456
 
 # Controls the maximum number of message queues system-wide
 kernel.msgmni = 256000


Note: If you don't have these entries in the "/etc/sysctl.conf" file, just add them to the end of the file.

  • After these settings are saved to the file, apply them by running:
 sysctl -p

If the system still appears to be working improperly, reboot the machine.
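
To find out which of the entries above are absent from /etc/sysctl.conf before editing it, something like the following could be used. It is a sketch: missing_sysctl_keys is a hypothetical name, and commented-out lines are deliberately not counted as assignments.

```shell
# Hypothetical helper: print each given key that has no active assignment
# in a sysctl.conf-style file.
missing_sysctl_keys() {
    conf=$1
    shift
    for key in "$@"; do
        # Compare the text left of "=" (spaces and tabs removed) with the key.
        awk -F= -v key="$key" '
            { k = $1; gsub(/[ \t]/, "", k) }
            k == key { found = 1 }
            END { exit !found }
        ' "$conf" || echo "$key"
    done
}

# Example usage (commented out):
#   missing_sysctl_keys /etc/sysctl.conf kernel.msgmnb kernel.msgmax \
#       kernel.shmmax kernel.shmall kernel.msgmni
```

Any key it prints needs to be added to the end of the file; keys it does not print already exist and should be edited in place.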

Login Screen Keeps Redirecting To Itself

The web browser keeps redirecting to the login screen even after entering login credentials. This has been noticed in Internet Explorer.

Nagios XI uses cookies to save session state. These cookies are set to expire after 30 minutes. If the time on the Nagios XI server is incorrect, the cookies returned to the client's browser might appear to be expired due to the time difference between the client's computer and the Nagios XI server. Solution: Fix the time on the Nagios XI server to ensure it is correct.

Check Services Being Orphaned

Some users have encountered large numbers of warning messages that accumulate quickly that read as follows:

Warning: The check of service <Your Service> on host <Your Host> looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service..

This is most likely caused by multiple instances of Nagios running. To fix this, kill all instances of Nagios and then restart the process.

 killall -9 nagios

Then restart Nagios from the Admin menu of the web interface.

Related forum post can be read here.


If the issue continues to persist after reboots and restarts of the Nagios service, then the issue is most likely caused by either a memory leak in embedded perl, or system ulimit restrictions. Symptoms can include the /tmp directory filling up quickly with check* files, and the following errors in the nagios log.

 [1331905537] Warning: The check of service 'SERVICE' on host 'NAMESERVER' looks like it was 
 orphaned (results never came back). I'm scheduling an immediate check of the service...
 [1331755699] Warning: The check of service 'SWAP' on host 'nameserver' could not be performed 
 due to a fork() error: 'Resource temporarily unavailable'. The check will be rescheduled.

Try the following solutions:

Edit /etc/security/limits.conf

 * hard memlock 128     #locked memory
 * soft memlock 128
 * soft nofile 4096      #open files
 * hard nofile 4096
 * hard nproc 4096     #max user processes
 * soft nproc 4096
 * hard stack 20480     #stack size
 * soft stack 20480

and restart the server. Run

 ulimit -a 

to verify that the new settings are in place.


And also update the settings in your nagios.cfg file to match the following:

 enable_embedded_perl=0
 use_embedded_perl_implicitly=0

Postgresql: Postmaster CPU Is High or "Transaction wraparound limit" in log

Although Nagios XI performs routine database maintenance on the postgres data tables, if you notice either high CPU usage for the postmaster process or a repeated error message in the logs under /var/lib/pgsql/data/pg_log that says "transaction ID wrap limit is 2147484146", then you may need to perform a manual VACUUM of the postgres databases. Run the following commands from the command line:

 psql nagiosxi nagiosxi
 VACUUM;
 VACUUM ANALYZE;
 VACUUM FULL;
 \q

You will see messages like the following when running the above commands:

 WARNING:  skipping "pg_authid" --- only table or database owner can vacuum it

This is normal. You may need to run the above commands more than once if the CPU usage from postmaster is extremely high.

Next, vacuum the tables as the postgres user.

 psql postgres postgres
 VACUUM;
 VACUUM ANALYZE;
 VACUUM FULL;
 \q
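
The same statements can be run without an interactive session. The sketch below is one possible wrapper, not a Nagios XI script: the vacuum_db function and the PSQL override variable are hypothetical, while -d, -U and -c are standard psql options.

```shell
#!/bin/sh
# PSQL can be overridden (for example PSQL=echo) to preview the commands.
PSQL=${PSQL:-psql}

# vacuum_db DBNAME USERNAME: run the three statements one at a time.
# Each statement gets its own psql call because VACUUM cannot run inside
# a transaction block, which is how a multi-statement -c string is treated.
vacuum_db() {
    for stmt in "VACUUM;" "VACUUM ANALYZE;" "VACUUM FULL;"; do
        "$PSQL" -d "$1" -U "$2" -c "$stmt" || return 1
    done
}

# Same order as the manual steps above:
# vacuum_db nagiosxi nagiosxi && vacuum_db postgres postgres
```

Setting PSQL=echo first prints the psql arguments instead of running them, which is a cheap way to review what would be executed.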

XI Component/Addon Problems

Website Wizard Content Check Failure

Some users have reported website content checks being blocked by the "dotDefender" application. See the following forum thread for the solution. Website Wizard Content Check Failure

Plugin/Component/Wizard Installation Problems

Installing plugins, components, or wizards outside the proper menus causes problems in Nagios XI, such as all wizards being "wiped out" so that they cannot be viewed in the web interface, blank pages in the web browser, and other unexpected behavior.

One common mistake is installing a component in place of the wizard and vice versa.

The proper procedure is to download the plugin, component, or wizard you need, go to the "Admin" menu, and then select the appropriate sub-menu under "System Extensions" in the left panel:

for plugins -> "Manage Plugins" -> "Browse" (select your plugin installation file) -> "Open" -> "Upload Plugin"

for components -> "Manage Components" -> "Browse" (select your component installation file) -> "Open" -> "Upload Component"

for wizards -> "Manage Config Wizards" -> "Browse" (select your wizard installation file) -> "Open" -> "Upload Wizard"

Note: Don't unzip the installation file prior to selecting it through "Browse". Also, don't rename the installation files. This will cause the installation to fail. The name of the file should be: "somename".zip. If you had a previous copy of the file and you download it again, your new file will be named "somename"(1).zip, which will not work.
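
The file name rule in the note above can be checked before uploading. This is only a heuristic sketch with a hypothetical function name: it catches a missing .zip extension and browser-added duplicate suffixes such as (1), but it cannot tell whether the file was renamed in some other way.

```shell
#!/bin/sh
# upload_name_ok FILENAME: true when the name ends in ".zip" and has no
# duplicate-download suffix such as "(1)" or " (2)" before the extension.
upload_name_ok() {
    case $1 in
        *\([0-9]*\).zip) return 1 ;;   # e.g. somename(1).zip
        *.zip)           return 0 ;;
        *)               return 1 ;;   # not a .zip archive
    esac
}

# Typical use:
# upload_name_ok "somename.zip" && echo "name looks fine"
```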

If you already made a mistake and erroneously installed a component in place of the wizard or vice versa, here is what you should do:

Remove the problematic component/wizard by running the following in a terminal as root:

 # rm -rf /usr/local/nagiosxi/html/includes/components/"somecomponent"
 # rm -rf /usr/local/nagiosxi/html/includes/configwizards/"somewizard"

Try installing the component/wizard again.

If you have blank pages in the web browser, this usually means there is a PHP error. Run:

 # tail /var/log/httpd/error_log

right after loading that page to see what the errors are.
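
On a busy server the last lines of the log may be unrelated to PHP. The sketch below filters for the usual PHP message prefixes first; the function name is hypothetical, and the pattern assumes the standard "PHP Fatal error", "PHP Parse error", "PHP Warning" and "PHP Notice" wording.

```shell
#!/bin/sh
# recent_php_errors LOGFILE [COUNT]: last COUNT (default 10) lines of
# LOGFILE that contain a PHP error, warning, or notice.
recent_php_errors() {
    grep -E 'PHP (Fatal error|Parse error|Warning|Notice)' "$1" | tail -n "${2:-10}"
}

# Typical use, right after loading the blank page:
# recent_php_errors /var/log/httpd/error_log
```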

Sometimes, when you try to install a plugin you may receive an error message: "Plugin could not be installed - directory permissions may be incorrect". In order to check the permissions of your "libexec" directory, run in terminal:

 # ls -l /usr/local/nagios

The owner of "libexec" directory should be nagios:nagios and the permissions should be set to 775 (drwxrwxr-x). If this is not what you have, run in terminal:

 # chmod 775 /usr/local/nagios/libexec
 # chown nagios:nagios /usr/local/nagios/libexec
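
To confirm the result without reading ls output by eye, the owner and mode can be printed in one line. This sketch assumes GNU stat (the -c option), and the function names are hypothetical.

```shell
#!/bin/sh
# dir_owner_mode DIR: print "owner:group mode", e.g. "nagios:nagios 775".
dir_owner_mode() {
    stat -c '%U:%G %a' "$1"
}

# libexec_ok DIR: true when DIR has the ownership and mode given above.
libexec_ok() {
    [ "$(dir_owner_mode "$1")" = "nagios:nagios 775" ]
}

# Typical use:
# libexec_ok /usr/local/nagios/libexec || dir_owner_mode /usr/local/nagios/libexec
```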

"Event Data Is Stale"

Versions 2009R1.4B through 2011R1.1 have a known bug relating to event data. The bug has been patched, and the fix is included in later releases. If you are experiencing this error, or the nagios service is taking an excessively long time to start, you may have a corrupted MySQL table that needs repair. We suggest taking the following steps.

Stop the following services:

 service nagios stop
 service ndo2db stop
 service mysqld stop

Run our repair script for MySQL tables:

 /usr/local/nagiosxi/scripts/repairmysql.sh nagios

Unzip and copy the following dbmaint file to /usr/local/nagiosxi/cron/. This will overwrite the previous version.

 cd /tmp
 wget http://assets.nagios.com/downloads/nagiosxi/patches/dbmaint.zip
 unzip dbmaint.zip
 chmod +x dbmaint.php
 cp dbmaint.php /usr/local/nagiosxi/cron

Run the following commands:

 service mysqld start
 rm -f /usr/local/nagiosxi/var/dbmaint.lock
 /usr/local/nagiosxi/cron/dbmaint.php

After running this script, restart services.

 service ndo2db start
 service nagios start

However, if you see any error output from this script, similar to this one:

 SQL: DELETE FROM nagios_logentries WHERE logentry_time < FROM_UNIXTIME(1293570334)
    SQL:         SQL Error [ndoutils] :</b> Table './nagios/nagios_logentries' is marked 
    as crashed and last (automatic?) repair failedCLEANING ndoutils TABLE 'notifications'...

you may need to run a force repair on the tables:

 service mysqld stop
 cd /var/lib/mysql/nagios
 myisamchk -r -f nagios_<corrupted_table>
 
 service mysqld start
 rm -f /usr/local/nagiosxi/var/dbmaint.lock
 /usr/local/nagiosxi/cron/dbmaint.php  
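
To find the value for <corrupted_table>, the table name can be pulled out of the error text. The sketch below assumes the message format shown in the sample output above ("Table './nagios/NAME' is marked as crashed"); the function name is hypothetical.

```shell
#!/bin/sh
# crashed_tables: read dbmaint.php output on stdin and print the name of
# each table reported as crashed, one per line, without duplicates.
crashed_tables() {
    sed -n "s|.*Table '\./[^/']*/\([^']*\)' is marked.*|\1|p" | sort -u
}

# Typical use:
# /usr/local/nagiosxi/cron/dbmaint.php 2>&1 | crashed_tables
```

Each name printed can then be passed to the myisamchk command above while mysqld is stopped.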

If problems persist, contact our support team at our support forums.

Bandwidth Usage for Offloaded MySQL

We don't have official benchmark documentation for the bandwidth usage of a Nagios server, but the following figures were recorded and submitted by a user for network traffic between a Nagios XI server and an offloaded MySQL server. Thanks to Stephen Wallace for contributing this!

  • 500 hosts, 10 services each, at a 5-minute interval (5500 checks)
  • Breaks down to around 18 checks per second
  • Produces around 3MB of network traffic daily between Nagios and MySQL
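
The check counts above can be reproduced with simple arithmetic. The sketch assumes the 5500 figure is 5000 service checks plus one host check per host, which is an interpretation of the submitted numbers rather than something the contributor stated. The function names are hypothetical.

```shell
#!/bin/sh
# total_checks HOSTS SERVICES_PER_HOST: service checks plus one host
# check per host.
total_checks() {
    echo $(( $1 * $2 + $1 ))
}

# checks_per_second HOSTS SERVICES_PER_HOST INTERVAL_SECONDS: average
# check rate, rounded to one decimal place.
checks_per_second() {
    awk -v n="$(total_checks "$1" "$2")" -v s="$3" 'BEGIN { printf "%.1f\n", n / s }'
}

total_checks 500 10            # prints 5500
checks_per_second 500 10 300   # prints 18.3
```

Substituting your own host count, services per host, and check interval gives a comparable rate for your environment.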

"Still have questions?"

If you haven't found an answer to your question, you can check the Nagios XI Manuals:

Nagios XI User Guide

Nagios XI Administrator Guide

