Re: Hosts and services temporarily unavailable

Posted: Fri Aug 02, 2019 11:36 am
by drug
The MySQL log is empty.

Here is the output from the other commands. I removed the "Duplicate Definition found for service..." errors because they reveal internal host and service names; I assume they're not related to the issue?

Code: Select all

# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

Nagios Core 4.4.3
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 2019-01-15
License: GPL

Website: https://www.nagios.org
Reading configuration data...
   Read main config file okay...

   Read object config files okay...

Running pre-flight check on configuration data...

Checking objects...
	Checked 7562 services.
	Checked 1136 hosts.
	Checked 118 host groups.
	Checked 7 service groups.
	Checked 104 contacts.
	Checked 16 contact groups.
	Checked 295 commands.
	Checked 142 time periods.
	Checked 0 host escalations.
	Checked 0 service escalations.
Checking for circular paths...
	Checked 1136 hosts
	Checked 23 service dependencies
	Checked 1 host dependencies
	Checked 142 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check
# su nagios
bash-4.2$ time /usr/local/nagiosxi/scripts/reconfigure_nagios.sh

--- reset_config_perms.sh ------------
> Setting script permissions
> Setting CCM script permissions
> Setting special script permissions
> Setting special component script permissions
> Setting configuration file/directory permissions
> Setting perfdata directory and RRD permissions
> Setting libexec directory permissions
> Setting Nagios XI config permissions
> Setting NOM checkpoint user:group permissions
> + Setting Nagios Core corelog.newobjects user:group permissions
> + Setting CCM configuration file user:group permissions
> + Setting Recurring Downtime file user:group permissions
> + Setting BPI configuration file user:group permissions
--------------------------------------

--- ccm_import.php -------------------
> Setting import directory: /usr/local/nagios/etc/import/
> Importing config files into the CCM
  No files to import
--------------------------------------

--- ccm_export.php -------------------
> Writing CCM configuration to Nagios files
  Finished writing out configuraton
--------------------------------------

--------------------------------------
> Verifying configuration with Nagios Core
> Output: 
Nagios Core 4.4.3
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 2019-01-15
License: GPL

Website: https://www.nagios.org
Reading configuration data...
   Read main config file okay...

   Read object config files okay...

Running pre-flight check on configuration data...

Checking objects...
	Checked 7562 services.
	Checked 1136 hosts.
	Checked 118 host groups.
	Checked 7 service groups.
	Checked 104 contacts.
	Checked 16 contact groups.
	Checked 295 commands.
	Checked 142 time periods.
	Checked 0 host escalations.
	Checked 0 service escalations.
Checking for circular paths...
	Checked 1136 hosts
	Checked 23 service dependencies
	Checked 1 host dependencies
	Checked 142 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check
> Return Code: 0
--------------------------------------

real	0m7.507s
user	0m1.403s
sys	0m1.428s
bash-4.2$ ps -ef | grep [n]agios
nagios    4331  4329  0 16:32 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php >> /usr/local/nagiosxi/var/cmdsubsys.log 2>&1
nagios    4332  4331  0 16:32 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php
nagios    4333  4328  0 16:32 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php >> /usr/local/nagiosxi/var/eventman.log 2>&1
nagios    4334  4333  0 16:32 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php
nagios    4336  4330  0 16:32 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php >> /usr/local/nagiosxi/var/sysstat.log 2>&1
nagios    4339  4326  0 16:32 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php >> /usr/local/nagiosxi/var/feedproc.log 2>&1
nagios    4340  4327  0 16:32 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/event_handler.php >> /usr/local/nagiosxi/var/event_handler.log 2>&1
nagios    4346  4325  0 16:32 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php >> /usr/local/nagiosxi/var/perfdataproc.log 2>&1
nagios    4350  4336  0 16:32 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
nagios    4351  4339  0 16:32 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php
nagios    4352  4346  0 16:32 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php
nagios    4354  4340  0 16:32 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/event_handler.php
root      4471  2484  0 16:32 pts/0    00:00:00 su nagios
nagios    4472  4471  0 16:32 pts/0    00:00:00 bash
nagios    4694     1  0 16:32 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    4695  4694  0 16:32 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    4696  4694  0 16:32 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    4697  4694  0 16:32 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    4698  4694  0 16:32 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    4699  4694  0 16:32 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    4700  4694  0 16:32 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    4701  4694  0 16:32 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    4702  4694  0 16:32 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    4703  4694  0 16:32 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    4704  4694  0 16:32 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    4705  4694  0 16:32 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    4706  4694  0 16:32 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    4707     1  0 16:32 pts/0    00:00:00 /bin/bash /usr/local/nagiosxi/scripts/nom_create_nagioscore_checkpoint.sh
nagios    4710  4472  0 16:32 pts/0    00:00:00 ps -ef
nagios    4712  4472  0 16:32 pts/0    00:00:00 grep [n]agios
nagios    4714 18754  0 16:32 ?        00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg -f
nagios    4715  4714  0 16:32 ?        00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg -f
nagios   18754     1  0 Jul30 ?        00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg -f
nagios   18822     1  0 Jul30 ?        00:00:22 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
bash-4.2$ grep 'max_execution_time\|max_input_time\|memory_limit\|max_input_vars' /etc/php.ini | grep -v ';'
max_execution_time = 60
max_input_time = 120
memory_limit = 1G
bash-4.2$ 


Re: Hosts and services temporarily unavailable

Posted: Fri Aug 02, 2019 3:37 pm
by cdienger
There are multiple queues piling up so something isn't keeping up. Try changing the reaper settings per https://assets.nagios.com/downloads/nag ... ios-XI.pdf and then clear out/increase the queue settings per https://support.nagios.com/kb/article/n ... d-139.html.
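
A minimal sketch for checking how full the queues are, assuming the queues in question are the System V kernel message queues (an assumption; the post above doesn't say which queues are meant). The helper name sum_queued_messages is hypothetical. It totals the used-bytes and messages columns from `ipcs -q` output read on stdin.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios XI): reads `ipcs -q` output on stdin
# and totals the "used-bytes" (column 5) and "messages" (column 6) fields.
# Only rows whose first field looks like a queue key (0x...) are counted,
# so the banner and header lines are skipped.
sum_queued_messages() {
    awk '$1 ~ /^0x/ { bytes += $5; msgs += $6 }
         END { printf "%d bytes, %d messages\n", bytes, msgs }'
}

# Usage on a live system (left commented out so sourcing this has no side effects):
#   ipcs -q | sum_queued_messages
#   sysctl kernel.msgmnb kernel.msgmax kernel.msgmni
```

If the totals keep growing between runs, something is writing to the queues faster than they are being drained. The sysctl line prints the current kernel limits for comparison.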

Re: Hosts and services temporarily unavailable

Posted: Mon Aug 05, 2019 4:47 pm
by drug
According to the documentation here (https://assets.nagios.com/downloads/nag ... tsnew.html) check_result_reaper_frequency and max_check_result_reaper_time have been deprecated. Are there other reaper settings that we should be looking at?

Our kernel settings were already set appropriately.

Our host and service latencies consistently land around 0.5 seconds, so we haven't seen a need for any additional tuning. Note that we use gearmand to distribute the checks to worker nodes.
cdienger wrote:There are multiple queues piling up so something isn't keeping up. Try changing the reaper settings per https://assets.nagios.com/downloads/nag ... ios-XI.pdf and then clear out/increase the queue settings per https://support.nagios.com/kb/article/n ... d-139.html.

Re: Hosts and services temporarily unavailable

Posted: Tue Aug 06, 2019 11:14 am
by cdienger
The docs will be updated to mention that the reaper settings still apply to passive checks. That said, since gearman is running on the system the checks are probably mostly active, but it wouldn't hurt to update the reaper settings.

To reduce some of the load on ndo2db, edit /usr/local/nagios/etc/ndo2db.cfg and set both the debug_level and debug_verbosity options to 0:

Code: Select all

debug_level=0
debug_verbosity=0
Then restart Nagios and the ndo2db daemon:

Code: Select all

service nagios stop
service ndo2db stop
service ndo2db start
service nagios start
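
A quick way to confirm what ndo2db will actually read after the edit, as a minimal sketch. The helper name show_ndo2db_debug is hypothetical (not part of Nagios XI). It simply prints the uncommented debug_level and debug_verbosity lines from whatever config path you pass it.

```shell
#!/bin/sh
# Hypothetical helper: prints the active (uncommented) debug_level and
# debug_verbosity lines from the config file given as the first argument.
# Commented-out lines start with '#' and therefore do not match.
show_ndo2db_debug() {
    grep -E '^[[:space:]]*debug_(level|verbosity)[[:space:]]*=' "$1"
}

# Usage (path taken from the post above; left commented out):
#   show_ndo2db_debug /usr/local/nagios/etc/ndo2db.cfg
```

Running it before the restart catches a typo or a duplicate setting further down the file.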

Re: Hosts and services temporarily unavailable

Posted: Mon Aug 12, 2019 4:40 pm
by drug
The reaper settings were increased per the documentation, and the ndo2db settings were modified as well, but we're not seeing any noticeable improvement. Is there anything else we might be able to do?

Re: Hosts and services temporarily unavailable

Posted: Tue Aug 13, 2019 12:24 pm
by ssax
Please PM me a FRESH copy of your profile.

Additionally, please send the output of these commands (as root):
- NOTE: You may need to adjust the -h 127.0.0.1, the -uroot, and -pnagiosxi in the first command if your DB is offloaded to another server and/or you've changed the root mysql password

Code: Select all

echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --table
Then run this command:

Code: Select all

grep mysql /usr/local/nagiosxi/html/config.inc.php | wc -l
If it outputs the number 2, run the command below as well and include the output. If it outputs anything other than 2, don't run the command; it's safe to run but will fail with an error. (Some XI systems use both MySQL and PostgreSQL if they were installed prior to XI 5.0 and then upgraded from there.)

Code: Select all

echo "SELECT relname as Table, pg_size_pretty(pg_total_relation_size(relid)) As Size, pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as ExternalSize FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;" | psql nagiosxi nagiosxi
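
The "only run it if the count is 2" rule above can be sketched as a small wrapper. This is a minimal sketch, not an official XI script: the function names should_run_psql_query and report_pg_table_sizes are hypothetical, and `grep -c` is used as an equivalent of `grep ... | wc -l`.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only when the given file contains exactly
# two lines matching "mysql". A missing or unreadable file counts as 0.
should_run_psql_query() {
    count=$(grep -c mysql "$1" 2>/dev/null) || count=0
    [ "$count" -eq 2 ]
}

# Hypothetical wrapper: runs the PostgreSQL size query from the post above
# only when the check passes, otherwise says why it was skipped.
report_pg_table_sizes() {
    if should_run_psql_query "$1"; then
        echo "SELECT relname as Table, pg_size_pretty(pg_total_relation_size(relid)) As Size, pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as ExternalSize FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;" | psql nagiosxi nagiosxi
    else
        echo "Skipping PostgreSQL size query: mysql line count is not 2"
    fi
}

# Usage (left commented out):
#   report_pg_table_sizes /usr/local/nagiosxi/html/config.inc.php
```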

Re: Hosts and services temporarily unavailable

Posted: Tue Aug 13, 2019 1:23 pm
by drug
PM sent with details.

Re: Hosts and services temporarily unavailable

Posted: Tue Aug 13, 2019 1:46 pm
by ssax
Did you upgrade your gearman once you upgraded XI?

Code: Select all

https://assets.nagios.com/downloads/nagiosxi/docs/Integrating_Mod_Gearman_with_Nagios_XI.pdf
I see you're running modsecurity. Please send me your /var/log/httpd/modsec* logs so that I can see if it's impacting it.

What is the output of these commands?

Code: Select all

chage -l nagios
chage -l apache
grep nag /etc/group
grep "User \|Group " /etc/httpd/conf/httpd.conf
Try doing this:

Code: Select all

tail -Fn0 /var/log/httpd/*
Additionally, what is the "top" command output on your offloaded DB server? And were the commands you PMed run against the local mysql server or the offloaded DB? Make sure it was the offloaded DB.


Replicate the issue and send any output from that tail command so that we can debug further.

Re: Hosts and services temporarily unavailable

Posted: Thu Aug 15, 2019 9:06 am
by drug
ssax wrote:Did you upgrade your gearman once you upgraded XI?

Code: Select all

https://assets.nagios.com/downloads/nagiosxi/docs/Integrating_Mod_Gearman_with_Nagios_XI.pdf
We upgraded some time ago to version 3 (presently gearmand is 0.33-7 and libgearman is 1.1.2); mod_gearman is 3.0.7 (we don't run any workers on this instance, they're distributed only).
ssax wrote: What is the output of these commands?

Code: Select all

chage -l nagios
chage -l apache
grep nag /etc/group
grep "User \|Group " /etc/httpd/conf/httpd.conf

Code: Select all

nagios:x:996:nagios,apache
nagcmd:x:59004:nagios,apache
User apache
Group apache
ssax wrote: Additionally, what is the "top" command output on your offloaded DB server? And where the commands you PMed run against the local mysql server or the offloaded DB? Make sure it was the offloaded DB.
The query was run on the offloaded DB. Top from that server:

Code: Select all

top - 13:51:51 up 124 days,  1:04,  1 user,  load average: 0.13, 0.18, 0.15
Tasks: 105 total,   1 running, 104 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.4 us,  2.9 sy,  0.0 ni, 91.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  16471800 total,  7558304 used,  8913496 free,   307316 buffers
KiB Swap:  1949692 total,        0 used,  1949692 free.  5420404 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                
 1155 mysql     20   0  9.955g 1.426g  11364 S  17.2  9.1  25300:05 mysqld                                                 
 6375 root      20   0       0      0      0 S   0.7  0.0   0:31.41 kworker/0:2                                            
    1 root      20   0   28720   4948   3104 S   0.0  0.0   2:03.30 systemd                                                
    2 root      20   0       0      0      0 S   0.0  0.0   0:01.34 kthreadd                                               
    3 root      20   0       0      0      0 S   0.0  0.0  88:07.54 ksoftirqd/0                                            
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H                                           
    7 root      20   0       0      0      0 S   0.0  0.0  53:23.05 rcu_sched                                              
    8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh                                                 
    9 root      rt   0       0      0      0 S   0.0  0.0   0:04.14 migration/0                                            
   10 root      rt   0       0      0      0 S   0.0  0.0   0:39.28 watchdog/0                                             
   11 root      rt   0       0      0      0 S   0.0  0.0   0:32.60 watchdog/1                                             
   12 root      rt   0       0      0      0 S   0.0  0.0   0:03.88 migration/1                                            
   13 root      20   0       0      0      0 S   0.0  0.0 117:11.40 ksoftirqd/1                                            
   15 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/1:0H                                           
   16 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 khelper                                                
   17 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kdevtmpfs                                              
   18 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 netns                                                  
   19 root      20   0       0      0      0 S   0.0  0.0   0:05.98 khungtaskd                                             
   20 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 writeback                                              
   21 root      25   5       0      0      0 S   0.0  0.0   0:00.00 ksmd                                                   
   22 root      39  19       0      0      0 S   0.0  0.0   0:00.00 khugepaged                                             
   23 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 crypto                                                 
   24 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kintegrityd                                            
   25 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 bioset                                                 
   26 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kblockd                                                
   29 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kswapd0      
I have PM'd the requested logs.

Re: Hosts and services temporarily unavailable

Posted: Thu Aug 15, 2019 5:06 pm
by ssax
If you are running Core 4.4.3 (which you are), you are REQUIRED to upgrade the gearman server on the XI server and the gearman workers.

https://assets.nagios.com/downloads/nag ... ios_XI.pdf