nagios_logentries causing problems
Posted: Tue Sep 06, 2016 12:46 pm
by chicjo01
Nagios Support,
Over the weekend our DBA reported that the nagios_logentries table was causing locking alerts and a database slowdown, because Nagios is attempting a select * on a table with over 13 million rows. The DBA has truncated the table to correct the issue for now. Do you have any
recommendations to help prevent this from continuing to be a problem? We are still in our migration and are still increasing the number of checks we will be performing.
I ran a test to see how quickly it is growing.
Time - Count
13:25:23 - 3432214
13:38:57 - 3478374
That is an increase of about 46,000 rows (46,160) in under 14 minutes.
Nagios XI Version: 5.2.9
Max Log Entries Age is set to 7 Days
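As a rough cross-check of the growth figures above, the two samples can be turned into a rate and projected over the 7-day retention window. This is only a sketch: it assumes the row rate stays constant over the whole week, and the function names are illustrative rather than part of any Nagios tool.

```python
from datetime import datetime

def growth_rate(t1, c1, t2, c2):
    """Return (rows added, elapsed seconds, rows per second) between two samples."""
    fmt = "%H:%M:%S"
    elapsed = (datetime.strptime(t2, fmt) - datetime.strptime(t1, fmt)).total_seconds()
    added = c2 - c1
    return added, elapsed, added / elapsed

def projected_rows(rows_per_second, days):
    """Rows accumulated over `days`, assuming the rate never changes."""
    return rows_per_second * 86400 * days

# The two samples posted above (time of day, row count).
added, elapsed, rate = growth_rate("13:25:23", 3432214, "13:38:57", 3478374)
print(added, elapsed, round(rate, 1))
print(round(projected_rows(rate, 7)))
```

At that rate a 7-day retention window would hold roughly 34 million rows, which suggests that retention alone will not keep the table small unless the source of the entries is stopped.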
Code:
mysql -V
mysql Ver 14.14 Distrib 5.6.25-73.1, for Linux (x86_64) using 6.2
Code:
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Nagios Core 4.1.1
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 08-19-2015
License: GPL
Website: https://www.nagios.org
Reading configuration data...
Read main config file okay...
Read object config files okay...
Running pre-flight check on configuration data...
Checking objects...
Checked 13559 services.
Checked 2192 hosts.
Checked 5359 host groups.
Checked 3525 service groups.
Checked 174 contacts.
Checked 5327 contact groups.
Checked 141 commands.
Checked 9 time periods.
Checked 0 host escalations.
Checked 0 service escalations.
Checking for circular paths...
Checked 2192 hosts
Checked 0 service dependencies
Checked 0 host dependencies
Checked 9 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...
Total Warnings: 0
Total Errors: 0
Things look okay - No serious problems were detected during the pre-flight check
Re: nagios_logentries causing problems
Posted: Tue Sep 06, 2016 4:04 pm
by mcapra
Is there any way for you to grab some of those incoming nagios_logentries rows so we can check their content? The obvious answer is "truncate more", but if something is spitting out a bunch of erroneous events, shutting off the source would be a better option.
Re: nagios_logentries causing problems
Posted: Tue Sep 06, 2016 4:40 pm
by chicjo01
When I asked about this on the forums before (https://support.nagios.com/forum/viewto ... 16&t=40089), the solution was to run Apply Configuration. When I watched the logs after applying the configuration, Nagios was processing the checks, but not the ones that are filling up the log entries table. It appears that something is keeping stale information, but I do not know what.
When I run Apply Configuration, the nagios_hosts, nagios_services, nagios_hoststatus, and nagios_servicestatus tables are all emptied and then repopulated. I do not know what is repopulating them; my guess is the retention.dat file.
Would removing the retention.dat file and then running Apply Configuration, to force a new check on all monitors, correct the issue?
The table is filling up with messages like the ones below.
Code:
Runtime Error 2016-09-06 17:04:56 Unable to send check for host 'Remote Server' to worker (ret=-2)
Runtime Error 2016-09-06 17:04:50 Unable to run check for service 'Unix Zenoss ZenRender' on host 'Remote Server'
Runtime Error 2016-09-06 17:04:50 Unable to run check for service 'Unix Zenoss ZenEventlog' on host 'Remote Server'
Runtime Error 2016-09-06 17:04:48 Unable to run check for service 'Unix Db_osrc Disk /ap' on host 'Remote Server'
Runtime Error 2016-09-06 17:04:48 Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
Runtime Error 2016-09-06 17:04:48 Unable to run check for service 'Unix Msd Ping' on host 'Remote Server'
Runtime Error 2016-09-06 17:04:46 Unable to run check for service 'Unix Zenoss ZenSyslog' on host 'Remote Server'
Below is what I found for the remote server in nagios.log. The last time it was processed was 8 am.
Code:
grep 'Remote Server' nagios.log | grep 'Disk Usage for /tmp' | perl -pe 's/(\d+)/localtime($1)/e'
[Tue Sep 6 08:07:33 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
[Tue Sep 6 08:13:28 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
[Tue Sep 6 08:18:56 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
[Tue Sep 6 08:26:56 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
[Tue Sep 6 08:32:12 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
[Tue Sep 6 08:37:09 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
[Tue Sep 6 08:42:05 2016] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'
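To find out which hosts and services are producing most of these entries, the same log lines can be counted per host and service. The sketch below assumes the raw nagios.log format of "[epoch seconds] message" and the exact message wording quoted in this thread; the sample timestamps are made up for illustration, and the log path in the final comment is an assumption that may differ on your system.

```python
import re
from collections import Counter

# Matches raw nagios.log lines of the form quoted in this thread.
LINE_RE = re.compile(
    r"^\[(\d+)\] Unable to run check for service '(.+)' on host '(.+)'$"
)

def count_failed_checks(lines):
    """Count 'Unable to run check' messages per (host, service) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            _, service, host = match.groups()
            counts[(host, service)] += 1
    return counts

# Illustrative lines only; the epoch values are invented for this example.
sample = [
    "[1473163653] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'",
    "[1473164008] Unable to run check for service 'Disk Usage for /tmp' on host 'Remote Server'",
    "[1473164010] Unable to run check for service 'Unix Msd Ping' on host 'Remote Server'",
    "[1473164011] SERVICE ALERT: some unrelated line",
]
counts = count_failed_checks(sample)
for (host, service), n in counts.most_common():
    print(n, host, service)

# Against a real log (path is an assumption):
# with open("/usr/local/nagios/var/nagios.log") as fh:
#     print(count_failed_checks(fh).most_common(20))
```

Sorting by count puts the noisiest host and service pairs at the top, which is the list to investigate first.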
Re: nagios_logentries causing problems
Posted: Wed Sep 07, 2016 12:37 pm
by tmcdonald
A few things:
1.) How many checks per 5 minutes are you running? You can see under Admin -> Monitoring Engine Status, under the Monitoring Engine Check Statistics dashlet. Total up the active and passive host and service checks for the 5-minute stat and post that here.
2.) What are your DB retention settings configured for? Admin -> Performance Settings -> Databases tab -> Screenshot and post
3.) Do you perhaps have multiple Core processes running? ps -ef | grep bin/nagios
Re: nagios_logentries causing problems
Posted: Wed Sep 07, 2016 12:51 pm
by chicjo01
1.) How many checks per 5 minutes are you running? You can see under Admin -> Monitoring Engine Status, under the Monitoring Engine Check Statistics dashlet. Total up the active and passive host and service checks for the 5-minute stat and post that here.
Code:
Monitoring Engine Check Statistics
Metric                    1-min   5-min   15-min
Active Host Checks        0       0       543
Passive Host Checks       0       0       0
Active Service Checks     0       0       5,749
Passive Service Checks    0       0       0
Code:
/usr/local/nagios/bin/nagiostats
Nagios Stats 4.1.1
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 08-19-2015
License: GPL
CURRENT STATUS DATA
------------------------------------------------------
Status File: /usr/local/nagios/var/status.dat
Status File Age: 0d 0h 1m 3s
Status File Version: 4.1.1
Program Running Time: 0d 1h 13m 49s
Nagios PID: 81748
Total Services: 14930
Services Checked: 14930
Services Scheduled: 14930
Services Actively Checked: 14930
Services Passively Checked: 0
Total Service State Change: 0.000 / 33.880 / 0.016 %
Active Service Latency: 6.157 / 138.273 / 70.542 sec
Active Service Execution Time: 0.002 / 180.038 / 1.563 sec
Active Service State Change: 0.000 / 33.880 / 0.016 %
Active Services Last 1/5/15/60 min: 0 / 7254 / 14617 / 14930
Passive Service Latency: 0.000 / 0.000 / 0.000 sec
Passive Service State Change: 0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit: 14569 / 45 / 68 / 248
Services Flapping: 0
Services In Downtime: 0
Total Hosts: 2183
Hosts Checked: 2183
Hosts Scheduled: 2183
Hosts Actively Checked: 2183
Host Passively Checked: 0
Total Host State Change: 0.000 / 6.250 / 1.448 %
Active Host Latency: 9.307 / 150.977 / 83.680 sec
Active Host Execution Time: 0.001 / 0.051 / 0.005 sec
Active Host State Change: 0.000 / 6.250 / 1.448 %
Active Hosts Last 1/5/15/60 min: 0 / 1675 / 2183 / 2183
Passive Host Latency: 0.000 / 0.000 / 0.000 sec
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 2183 / 0 / 0
Hosts Flapping: 0
Hosts In Downtime: 0
Active Host Checks Last 1/5/15 min: 19 / 2091 / 6030
Scheduled: 19 / 2072 / 5951
On-demand: 0 / 19 / 79
Parallel: 19 / 2072 / 5951
Serial: 0 / 0 / 0
Cached: 0 / 19 / 79
Passive Host Checks Last 1/5/15 min: 0 / 0 / 0
Active Service Checks Last 1/5/15 min: 1030 / 8139 / 25937
Scheduled: 1030 / 8139 / 25937
On-demand: 0 / 0 / 0
Cached: 0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0
External Commands Last 1/5/15 min: 0 / 0 / 0
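Since the dashlet showed 0 for the 5-minute column, the requested 5-minute total can instead be taken from the nagiostats output above. The following is a minimal sketch that sums the middle value of the four "Checks Last 1/5/15 min" lines; the function name is illustrative, and the parsing assumes the line layout shown in this post.

```python
import re

# Matches e.g. "Active Host Checks Last 1/5/15 min: 19 / 2091 / 6030"
STAT_RE = re.compile(
    r"^(Active|Passive) (Host|Service) Checks Last 1/5/15 min:\s*(\d+) / (\d+) / (\d+)$"
)

def five_minute_total(nagiostats_output):
    """Sum the 5-minute figure over active/passive host and service checks."""
    total = 0
    for line in nagiostats_output.splitlines():
        match = STAT_RE.match(line.strip())
        if match:
            total += int(match.group(4))  # group 4 is the 5-minute value
    return total

# The four relevant lines copied from the nagiostats output above.
sample = """\
Active Host Checks Last 1/5/15 min: 19 / 2091 / 6030
Passive Host Checks Last 1/5/15 min: 0 / 0 / 0
Active Service Checks Last 1/5/15 min: 1030 / 8139 / 25937
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0
"""
total = five_minute_total(sample)
print(total)
```

For the figures posted here this gives 2091 + 8139 = 10230 checks in the last 5 minutes.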
2.) What are your DB retention settings configured for? Admin -> Performance Settings -> Databases tab -> Screenshot and post
[Screenshot attached: Capture.PNG]
3.) Do you perhaps have multiple Core processes running? ps -ef | grep bin/nagios
Code:
ps -ef | grep bin/nagios
nagios 81748 1 1 12:36 ? 00:01:06 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 81750 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81751 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81752 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81753 81748 0 12:36 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81754 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81755 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81756 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81757 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81758 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81759 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81760 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81761 81748 0 12:36 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81762 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81763 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81764 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81765 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81766 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81767 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81768 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81769 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81771 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81772 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81774 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81775 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81776 81748 0 12:36 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81777 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81778 81748 0 12:36 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81779 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81780 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81781 81748 0 12:36 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81782 81748 0 12:36 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81783 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81784 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81785 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81786 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81788 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81789 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81790 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81792 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81793 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81794 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81796 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81797 81748 0 12:36 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81798 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81799 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81800 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81801 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81802 81748 0 12:36 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 81972 81748 0 12:36 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Code:
ps -ef f | grep bin/nagios | grep cfg
nagios 81748 1 1 12:36 ? Ss 1:12 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 81972 81748 0 12:36 ? S 0:00 \_ /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Code:
ipcs
------ Message Queues --------
key msqid owner perms used-bytes messages
0x63000002 3375104 nagios 600 109716480 107145
0xfe000002 3407873 nagios 600 131072000 128000
0x82000002 3440642 nagios 600 16040960 15665
0x0f000002 3473411 nagios 600 27145216 26509
0x1e000002 3506180 nagios 600 62213120 60755
0xdc000002 3571717 nagios 600 104013824 101576
0x06000002 3604486 nagios 600 51188736 49989
0x37000002 3637255 nagios 600 62806016 61334
0x70000002 3670024 nagios 600 3269632 3193
0xac000002 3899401 nagios 600 131050496 127979
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x01131cac 163840 root 600 1000 0
0x01136199 28278785 root 600 1000 11
------ Semaphore Arrays --------
key semid owner perms nsems
0x00000000 187301888 apache 600 1
0x00000000 187334657 apache 600 1
0x00000000 187269122 apache 600 1
0x00000000 187367427 apache 600 1
0x00000000 187400196 apache 600 1
0x00000000 187432965 apache 600 1
0x00000000 187465734 apache 600 1
0x00000000 187498503 apache 600 1
0x00000000 196182024 apache 600 1
0x00000000 196214793 apache 600 1
0x00000000 196149258 apache 600 1
0x00000000 196247563 apache 600 1
0x00000000 196280332 apache 600 1
0x00000000 196313101 apache 600 1
0x00000000 196345870 apache 600 1
0x00000000 196378639 apache 600 1
Re: nagios_logentries causing problems
Posted: Wed Sep 07, 2016 3:16 pm
by ssax
It looks like you have too many message queues. Please run these commands and see whether that alleviates the issue:
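Code:
# Stop the monitoring engine and kill any nagios processes that remain
service nagios stop
killall -9 nagios
# Stop the ndo2db daemon and restart MySQL
service ndo2db stop
service mysqld restart
# Remove the external command file
rm -rf /usr/local/nagios/var/rw/nagios.cmd
# Remove every message queue owned by nagios
for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q "$i"; done
# Start the services again
service ndo2db start
service nagios start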
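To compare the queue state before and after running the commands above, the output of ipcs -q can be summarized per owner. This is a sketch only: the function name is illustrative, it assumes the six-column layout shown in this thread (key, msqid, owner, perms, used-bytes, messages), and it should be fed the message queue section only.

```python
def summarize_queues(ipcs_output, owner="nagios"):
    """Return (queue count, total queued messages) for queues owned by `owner`.

    Expects the message queue section of ipcs output (as printed by `ipcs -q`).
    """
    queues = 0
    messages = 0
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows have six columns and a hexadecimal key in the first one.
        if len(fields) == 6 and fields[0].startswith("0x") and fields[2] == owner:
            queues += 1
            messages += int(fields[5])
    return queues, messages

# Message queue section copied from the ipcs output posted earlier in this thread.
sample = """\
------ Message Queues --------
key msqid owner perms used-bytes messages
0x63000002 3375104 nagios 600 109716480 107145
0xfe000002 3407873 nagios 600 131072000 128000
0x82000002 3440642 nagios 600 16040960 15665
0x0f000002 3473411 nagios 600 27145216 26509
0x1e000002 3506180 nagios 600 62213120 60755
0xdc000002 3571717 nagios 600 104013824 101576
0x06000002 3604486 nagios 600 51188736 49989
0x37000002 3637255 nagios 600 62806016 61334
0x70000002 3670024 nagios 600 3269632 3193
0xac000002 3899401 nagios 600 131050496 127979
"""
result = summarize_queues(sample)
print(result)

# On a live system the text could come from, for example:
# import subprocess
# text = subprocess.run(["ipcs", "-q"], capture_output=True, text=True).stdout
```

Running it once before the cleanup and again some hours later shows whether the number of queues and the backlog of queued messages are growing again.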
Re: nagios_logentries causing problems
Posted: Wed Sep 07, 2016 4:10 pm
by chicjo01
I have had to do that a number of times. Why does the message queue keep expanding? Can you provide more insight into why this causes a problem and how often this task needs to be performed?
I will let you know if it improves the problem.
Re: nagios_logentries causing problems
Posted: Wed Sep 07, 2016 4:39 pm
by Box293
How many total objects does your XI server have (hosts + services)?
Re: nagios_logentries causing problems
Posted: Wed Sep 07, 2016 4:59 pm
by chicjo01
Currently we have:
Hosts: 2192 (Windows + Linux)
Services: 19435 (Linux)
We still need to add the Windows services, but that is a different problem than this one. We also still need to add process checks and custom scripts.
So my guess for the totals, after all is said and done with both Windows and Linux, would be:
Hosts: 2192
Services: 40000
Total: 42192 ballpark
Re: nagios_logentries causing problems
Posted: Thu Sep 08, 2016 7:32 am
by chicjo01
I truncated the nagios_logentries table before I left for the night, around 6 pm Eastern. I just checked and the table is up to 2.6 million rows. I also checked the event log via the web interface, and the problem is still happening.
I also checked ipcs again this morning, after performing the task you requested. There is still more than one queue. Do you have any recommendations to get this fixed?
Code:
ipcs
------ Message Queues --------
key msqid owner perms used-bytes messages
0x24000002 4030464 nagios 600 130468864 127411
0x81000002 3964929 nagios 600 131052544 127981
0x85000002 4063234 nagios 600 131040256 127969
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x01131cac 163840 root 600 1000 0
0x01136199 28278785 root 600 1000 11
------ Semaphore Arrays --------
key semid owner perms nsems
0x00000000 187301888 apache 600 1
0x00000000 187334657 apache 600 1
0x00000000 187269122 apache 600 1
0x00000000 187367427 apache 600 1
0x00000000 187400196 apache 600 1
0x00000000 187432965 apache 600 1
0x00000000 187465734 apache 600 1
0x00000000 187498503 apache 600 1
0x00000000 196182024 apache 600 1
0x00000000 196214793 apache 600 1
0x00000000 196149258 apache 600 1
0x00000000 196247563 apache 600 1
0x00000000 196280332 apache 600 1
0x00000000 196313101 apache 600 1
0x00000000 196345870 apache 600 1
0x00000000 196378639 apache 600 1