nagios_logentries causing problems

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
chicjo01
Posts: 194
Joined: Tue Jul 28, 2015 2:52 pm

nagios_logentries causing problems

Post by chicjo01 »

Nagios Support,
Our DBA reported over the weekend that the nagios_logentries table was causing locking alerts and a database slowdown because Nagios is attempting to do a select * on the table, which has over 13 million rows. The DBA has truncated the table to correct the issue for now. Do you have any recommendations to help prevent this from continuing to be a problem? We are still in our migration and are still increasing the number of checks we will be performing.


I ran a test to see how quickly it is growing.
Time - Count
13:25:23 - 3432214
13:38:57 - 3478374

That is an increase of about 46,000 rows in under 14 minutes.
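
The growth rate can be worked out directly from the two samples above. The following is a minimal sketch, not part of the original post: it only does arithmetic on the posted counts and timestamps, and the helper name `to_seconds` is made up for illustration. The 7-day projection assumes the rate stays constant, which may not hold.

```shell
#!/usr/bin/env bash
# Row counts and timestamps copied from the two samples in the post.
count1=3432214; t1="13:25:23"
count2=3478374; t2="13:38:57"

# Convert HH:MM:SS to seconds since midnight.
# The 10# prefix forces base 10 so values such as "08" are not read as octal.
to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

rows=$(( count2 - count1 ))
elapsed=$(( $(to_seconds "$t2") - $(to_seconds "$t1") ))
per_min=$(( rows * 60 / elapsed ))      # integer division, rounded down
per_day=$(( per_min * 60 * 24 ))
per_week=$(( per_day * 7 ))             # matches the 7-day Max Log Entries Age

echo "new rows: $rows in ${elapsed}s"
echo "approx rows/min:    $per_min"
echo "approx rows/day:    $per_day"
echo "approx rows/7 days: $per_week"
```

Comparing the 7-day figure with the row count at which the DBA saw locking gives a rough idea of whether the retention setting alone can keep the table manageable.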

Nagios XI Version: 5.2.9
Max Log Entries Age is set to 7 Days

Code: Select all

mysql -V
mysql  Ver 14.14 Distrib 5.6.25-73.1, for Linux (x86_64) using  6.2

Code: Select all

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

Nagios Core 4.1.1
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 08-19-2015
License: GPL

Website: https://www.nagios.org
Reading configuration data...
   Read main config file okay...
   Read object config files okay...

Running pre-flight check on configuration data...

Checking objects...
        Checked 13559 services.
        Checked 2192 hosts.
        Checked 5359 host groups.
        Checked 3525 service groups.
        Checked 174 contacts.
        Checked 5327 contact groups.
        Checked 141 commands.
        Checked 9 time periods.
        Checked 0 host escalations.
        Checked 0 service escalations.
Checking for circular paths...
        Checked 2192 hosts
        Checked 0 service dependencies
        Checked 0 host dependencies
        Checked 9 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: nagios_logentries causing problems

Post by mcapra »

Is there any way for you to grab some of those incoming nagios_logentries rows so we can check the content of them? The obvious answer is "truncate more", but if something is spitting out a bunch of erroneous events, shutting off the source would be a better option.
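
One way to grab a sample of the incoming rows is to group recent entries by message text. This is only a sketch under stated assumptions: the column names `logentry_time` and `logentry_data` are taken from the stock NDOUtils schema and should be checked against the actual table, and the function name `recent_logentries_sql` is hypothetical. The time window keeps the query from scanning the whole table.

```shell
#!/usr/bin/env bash
# Build a query that groups recent log entries by message text.
# Assumptions: columns logentry_time / logentry_data as in the stock
# NDOUtils schema; adjust the names if your schema differs.
recent_logentries_sql() {
  local minutes="${1:-10}" limit="${2:-20}"
  cat <<SQL
SELECT logentry_data, COUNT(*) AS entries
FROM nagios_logentries
WHERE logentry_time >= NOW() - INTERVAL ${minutes} MINUTE
GROUP BY logentry_data
ORDER BY entries DESC
LIMIT ${limit};
SQL
}

# Print the query for review before running it anywhere.
recent_logentries_sql 10 20
```

The printed statement can then be piped into the mysql client against whichever database holds the NDOUtils tables; the database name and credentials depend on the installation.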
Former Nagios employee
https://www.mcapra.com/
chicjo01
Posts: 194
Joined: Tue Jul 28, 2015 2:52 pm

Re: nagios_logentries causing problems

Post by chicjo01 »

When I asked about this on the forums before (https://support.nagios.com/forum/viewto ... 16&t=40089), the solution was to apply configuration. When I watched the logs after applying the configuration, Nagios was processing the checks, but not the ones filling up the log entries table. It appears that something is keeping stale information, but I do not know what.

When I apply the configuration, the nagios_hosts, nagios_services, nagios_hoststatus, and nagios_servicestatus tables are all emptied and then repopulated. I do not know what is populating them; my guess is the retention.dat file.

Would removing the retention.dat file and then applying the configuration, to force a new check on all monitors, correct the issue?

The table is getting filled up with messages like the ones below.

Code: Select all

Runtime Error	2016-09-06 17:04:56	Unable to send check for host 'Remote Server' to worker (ret=-2)
Runtime Error	2016-09-06 17:04:50	Unable to run check for service 'Unix Zenoss ZenRender' on host 'Remote Server'
Runtime Error	2016-09-06 17:04:50	Unable to run check for service 'Unix Zenoss ZenEventlog' on host 'Remote Server'
Runtime Error	2016-09-06 17:04:48	Unable to run check for service 'Unix Db_osrc Disk /ap' on host 'Remote Server'
Runtime Error	2016-09-06 17:04:48	Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
Runtime Error	2016-09-06 17:04:48	Unable to run check for service 'Unix Msd Ping' on host 'Remote Server'
Runtime Error	2016-09-06 17:04:46	Unable to run check for service 'Unix Zenoss ZenSyslog' on host 'Remote Server'
Below is what I found for the remote server in nagios.log. The last time it was processed was 8 am.

Code: Select all

grep 'Remote Server' nagios.log | grep 'Disk Usage for /tmp' | perl -pe 's/(\d+)/localtime($1)/e'
[Tue Sep  6 08:07:33 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
[Tue Sep  6 08:13:28 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
[Tue Sep  6 08:18:56 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
[Tue Sep  6 08:26:56 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
[Tue Sep  6 08:32:12 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
[Tue Sep  6 08:37:09 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
[Tue Sep  6 08:42:05 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
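
To see which hosts account for most of these errors, the log lines can be counted per host. The sketch below is an illustration rather than anything from the thread: the function name `count_failed_checks_by_host` is made up, and the sample input is simply three of the lines posted above. On a real system the function would read nagios.log on stdin instead.

```shell
#!/usr/bin/env bash
# Count "Unable to run check" lines per host from a nagios.log stream on stdin.
# Output format is that of `uniq -c`: a count followed by the host name,
# sorted with the noisiest host first.
count_failed_checks_by_host() {
  grep "Unable to run check for service" \
    | sed -E "s/.* on host '([^']*)'.*/\1/" \
    | sort | uniq -c | sort -rn
}

# Sample input: lines copied from the output posted above.
count_failed_checks_by_host <<'LOG'
[Tue Sep  6 08:07:33 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
[Tue Sep  6 08:13:28 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
[Tue Sep  6 08:18:56 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
LOG
```

If one or two hosts dominate the counts, that would support the idea of shutting off the source rather than truncating the table more often.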
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: nagios_logentries causing problems

Post by tmcdonald »

A few things:

1.) How many checks per 5 minutes are you running? You can see under Admin -> Monitoring Engine Status, under the Monitoring Engine Check Statistics dashlet. Total up the active and passive host and service checks for the 5-minute stat and post that here.

2.) What are your DB retention settings configured for? Admin -> Performance Settings -> Databases tab -> Screenshot and post

3.) Do you perhaps have multiple Core processes running? ps -ef | grep bin/nagios
Former Nagios employee
chicjo01
Posts: 194
Joined: Tue Jul 28, 2015 2:52 pm

Re: nagios_logentries causing problems

Post by chicjo01 »

1.) How many checks per 5 minutes are you running? You can see under Admin -> Monitoring Engine Status, under the Monitoring Engine Check Statistics dashlet. Total up the active and passive host and service checks for the 5-minute stat and post that here.

Code: Select all

Monitoring Engine Check Statistics
Metric                   1-min   5-min   15-min
Active Host Checks       0       0       543
Passive Host Checks      0       0       0
Active Service Checks    0       0       5,749
Passive Service Checks   0       0       0

Code: Select all

/usr/local/nagios/bin/nagiostats

Nagios Stats 4.1.1
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 08-19-2015
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /usr/local/nagios/var/status.dat
Status File Age:                        0d 0h 1m 3s
Status File Version:                    4.1.1

Program Running Time:                   0d 1h 13m 49s
Nagios PID:                             81748

Total Services:                         14930
Services Checked:                       14930
Services Scheduled:                     14930
Services Actively Checked:              14930
Services Passively Checked:             0
Total Service State Change:             0.000 / 33.880 / 0.016 %
Active Service Latency:                 6.157 / 138.273 / 70.542 sec
Active Service Execution Time:          0.002 / 180.038 / 1.563 sec
Active Service State Change:            0.000 / 33.880 / 0.016 %
Active Services Last 1/5/15/60 min:     0 / 7254 / 14617 / 14930
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              14569 / 45 / 68 / 248
Services Flapping:                      0
Services In Downtime:                   0

Total Hosts:                            2183
Hosts Checked:                          2183
Hosts Scheduled:                        2183
Hosts Actively Checked:                 2183
Host Passively Checked:                 0
Total Host State Change:                0.000 / 6.250 / 1.448 %
Active Host Latency:                    9.307 / 150.977 / 83.680 sec
Active Host Execution Time:             0.001 / 0.051 / 0.005 sec
Active Host State Change:               0.000 / 6.250 / 1.448 %
Active Hosts Last 1/5/15/60 min:        0 / 1675 / 2183 / 2183
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  2183 / 0 / 0
Hosts Flapping:                         0
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     19 / 2091 / 6030
   Scheduled:                           19 / 2072 / 5951
   On-demand:                           0 / 19 / 79
   Parallel:                            19 / 2072 / 5951
   Serial:                              0 / 0 / 0
   Cached:                              0 / 19 / 79
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  1030 / 8139 / 25937
   Scheduled:                           1030 / 8139 / 25937
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0
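
The 5-minute total asked for in question 1 can also be read from the nagiostats output above. The following sketch sums the 5-minute column of the four "Checks Last 1/5/15 min" lines; the function name `total_5min_checks` is hypothetical, and the sample input is copied from the output above. On a live system the function would read the output of nagiostats on stdin.

```shell
#!/usr/bin/env bash
# Sum the 5-minute value of the active/passive host/service check lines
# from nagiostats output on stdin.
total_5min_checks() {
  awk -F: '
    /^(Active|Passive) (Host|Service) Checks Last 1\/5\/15 min/ {
      split($2, v, "/")   # v[1]=1-min, v[2]=5-min, v[3]=15-min
      total += v[2]
    }
    END { print total + 0 }'
}

# Sample input: the four summary lines from the nagiostats output above.
total_5min_checks <<'STATS'
Active Host Checks Last 1/5/15 min:     19 / 2091 / 6030
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  1030 / 8139 / 25937
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0
STATS
```

Indented sub-lines such as Scheduled and On-demand are deliberately not matched, so they are not double-counted.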

2.) What are your DB retention settings configured for? Admin -> Performance Settings -> Databases tab -> Screenshot and post
Capture.PNG
3.) Do you perhaps have multiple Core processes running? ps -ef | grep bin/nagios

Code: Select all

 ps -ef | grep bin/nagios
nagios    81748      1  1 12:36 ?        00:01:06 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    81750  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81751  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81752  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81753  81748  0 12:36 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81754  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81755  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81756  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81757  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81758  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81759  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81760  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81761  81748  0 12:36 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81762  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81763  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81764  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81765  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81766  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81767  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81768  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81769  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81771  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81772  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81774  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81775  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81776  81748  0 12:36 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81777  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81778  81748  0 12:36 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81779  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81780  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81781  81748  0 12:36 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81782  81748  0 12:36 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81783  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81784  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81785  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81786  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81788  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81789  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81790  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81792  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81793  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81794  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81796  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81797  81748  0 12:36 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81798  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81799  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81800  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81801  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81802  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81972  81748  0 12:36 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

Code: Select all

ps -ef f | grep bin/nagios | grep cfg
nagios    81748      1  1 12:36 ?        Ss     1:12 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    81972  81748  0 12:36 ?        S      0:00  \_ /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
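
The tree view above shows one parent daemon (PPID 1) with a child of the same name. A rough way to check for duplicate Core instances is to count only the daemon processes whose parent is PID 1. This is a sketch, not an official procedure: the function name `count_core_daemons` is made up, and it assumes the `ps -ef` column layout shown above, where the third column is the PPID.

```shell
#!/usr/bin/env bash
# Count top-level Nagios Core daemons (parent PID 1) in `ps -ef` output on stdin.
# Workers and forked children have the main daemon as their parent, so they
# are not counted.
count_core_daemons() {
  awk '$3 == 1 && /bin\/nagios -d/ { n++ } END { print n + 0 }'
}

# Sample input: three lines copied from the ps -ef output above.
count_core_daemons <<'PS'
nagios    81748      1  1 12:36 ?        00:01:06 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    81750  81748  0 12:36 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    81972  81748  0 12:36 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
PS
```

A result greater than 1 would suggest that more than one Core instance was started independently; on a live system the input would come from `ps -ef`.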

Code: Select all

ipcs
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x63000002 3375104    nagios     600        109716480    107145
0xfe000002 3407873    nagios     600        131072000    128000
0x82000002 3440642    nagios     600        16040960     15665
0x0f000002 3473411    nagios     600        27145216     26509
0x1e000002 3506180    nagios     600        62213120     60755
0xdc000002 3571717    nagios     600        104013824    101576
0x06000002 3604486    nagios     600        51188736     49989
0x37000002 3637255    nagios     600        62806016     61334
0x70000002 3670024    nagios     600        3269632      3193
0xac000002 3899401    nagios     600        131050496    127979

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x01131cac 163840     root       600        1000       0
0x01136199 28278785   root       600        1000       11

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 187301888  apache     600        1
0x00000000 187334657  apache     600        1
0x00000000 187269122  apache     600        1
0x00000000 187367427  apache     600        1
0x00000000 187400196  apache     600        1
0x00000000 187432965  apache     600        1
0x00000000 187465734  apache     600        1
0x00000000 187498503  apache     600        1
0x00000000 196182024  apache     600        1
0x00000000 196214793  apache     600        1
0x00000000 196149258  apache     600        1
0x00000000 196247563  apache     600        1
0x00000000 196280332  apache     600        1
0x00000000 196313101  apache     600        1
0x00000000 196345870  apache     600        1
0x00000000 196378639  apache     600        1
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: nagios_logentries causing problems

Post by ssax »

It looks like you have too many message queues. Please run these commands and see if that alleviates the issue:

Code: Select all

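service nagios stop
killall -9 nagios    # force-kill any nagios processes that are still running
service ndo2db stop
service mysqld restart
rm -rf /usr/local/nagios/var/rw/nagios.cmd    # remove the stale external command file
# Remove every message queue listed for nagios
for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q "$i"; done
service ndo2db start
service nagios start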
chicjo01
Posts: 194
Joined: Tue Jul 28, 2015 2:52 pm

Re: nagios_logentries causing problems

Post by chicjo01 »

I have had to do that a number of times. Why does the message queue keep expanding? Can you provide more insight into why this causes a problem and how often this task needs to be performed?

I will let you know if it improves the problem.
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia
Contact:

Re: nagios_logentries causing problems

Post by Box293 »

How many total objects does your XI server have (hosts + services) ?
chicjo01
Posts: 194
Joined: Tue Jul 28, 2015 2:52 pm

Re: nagios_logentries causing problems

Post by chicjo01 »

Currently we have:

Hosts: 2192 (Windows + Linux)
Services: 19435 (Linux)

We still need to add Windows services, but that is a different problem than this one. We also still need to add processes and custom scripts.

So my guess, after all is said and done with both Windows and Linux, would be:

Hosts: 2192
Services: 40000
Total: 42192 ballpark
chicjo01
Posts: 194
Joined: Tue Jul 28, 2015 2:52 pm

Re: nagios_logentries causing problems

Post by chicjo01 »

I truncated the nagios_logentries table before I left for the night, around 6 pm Eastern. I just checked and the table is up to 2.6 million rows. I also checked the event log via the web interface, and the problem is still happening.

I also checked ipcs again this morning after performing the task you requested. There is still more than one queue. Do you have any recommendations to get this fixed?

Code: Select all

ipcs

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x24000002 4030464    nagios     600        130468864    127411
0x81000002 3964929    nagios     600        131052544    127981
0x85000002 4063234    nagios     600        131040256    127969

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x01131cac 163840     root       600        1000       0
0x01136199 28278785   root       600        1000       11

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 187301888  apache     600        1
0x00000000 187334657  apache     600        1
0x00000000 187269122  apache     600        1
0x00000000 187367427  apache     600        1
0x00000000 187400196  apache     600        1
0x00000000 187432965  apache     600        1
0x00000000 187465734  apache     600        1
0x00000000 187498503  apache     600        1
0x00000000 196182024  apache     600        1
0x00000000 196214793  apache     600        1
0x00000000 196149258  apache     600        1
0x00000000 196247563  apache     600        1
0x00000000 196280332  apache     600        1
0x00000000 196313101  apache     600        1
0x00000000 196345870  apache     600        1
0x00000000 196378639  apache     600        1
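
To track the queues over time rather than eyeballing ipcs, the message queue section can be summarized per owner. The sketch below is illustrative only: the function name `summarize_queues` is made up, and the 131072000-byte limit passed to it is an assumption inferred from the largest used-bytes value in the earlier ipcs output; the real per-queue limit is whatever `sysctl kernel.msgmnb` reports on the server. The sample input is the message queue section posted above.

```shell
#!/usr/bin/env bash
# Summarize message queues for one owner from `ipcs -q` output on stdin.
# $1 = owner name, $2 = per-queue byte limit (assumed; check kernel.msgmnb).
# A queue is reported as "near_full" when it uses at least 95% of the limit.
summarize_queues() {
  local owner="$1" limit="$2"
  awk -v owner="$owner" -v limit="$limit" '
    $3 == owner {
      n++; bytes += $5; msgs += $6
      if ($5 >= limit * 0.95) full++
    }
    END {
      printf "queues=%d messages=%d bytes=%d near_full=%d\n", n, msgs, bytes, full
    }'
}

# Sample input: the message queue section from the ipcs output above.
summarize_queues nagios 131072000 <<'IPCS'
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x24000002 4030464    nagios     600        130468864    127411
0x81000002 3964929    nagios     600        131052544    127981
0x85000002 4063234    nagios     600        131040256    127969
IPCS
```

Queues that sit at or near the limit suggest that whatever consumes them is not draining them as fast as they fill. On a live system the input would come from `ipcs -q`.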
