next check

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Claro
Posts: 8
Joined: Wed Jul 11, 2012 1:34 pm

next check

Post by Claro »

Hello Everybody,

I’m back with some questions about this article. We currently monitor 615 hosts and 5490 services, all with a 5-minute check interval. We are noticing that the Last Check and Next Check times are not consistent with that 5-minute interval: some services take up to 2 hours between checks.

For the case you mention, is the following calculation correct?

Total checks = 615 hosts + 5490 services = 6105
Check interval = 5 minutes
6105 / (5 × 60 seconds) = 20.35 checks per second

Given that I know how many checks have to run per second, I don’t understand how the delay is being calculated. Is there a table that shows what the values should be for a given number of checks per second?
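To make the arithmetic above concrete, here is a small shell sketch. It assumes every host and service runs on the same 5-minute interval with the default interval_length=60, and it applies the formulas the Nagios Core service check scheduling documentation gives for the smart ("s") method, as we read them: inter-check delay = average check interval / number of services, and interleave factor = services / hosts, rounded up. The function names are hypothetical. As far as we understand the docs, these values control how checks are spread out when Nagios starts; they are not a throughput limit, so hours-long gaps between checks fit better with the latency explanation discussed later in this thread.

```shell
#!/usr/bin/env bash
# Sketch: estimate check throughput and Nagios "smart" scheduling values.
# Assumes every host and service uses the same check interval, expressed in
# minutes (interval_length=60).

# Average checks per second for N objects checked every M minutes.
checks_per_second() {
  awk -v n="$1" -v m="$2" 'BEGIN { printf "%.2f\n", n / (m * 60) }'
}

# Smart inter-check delay in seconds: average check interval / number of services.
inter_check_delay() {
  awk -v s="$1" -v m="$2" 'BEGIN { printf "%.4f\n", (m * 60) / s }'
}

# Smart interleave factor: services / hosts, rounded up (integer arithmetic).
interleave_factor() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

hosts=615
services=5490
interval=5
echo "checks per second: $(checks_per_second $((hosts + services)) "$interval")"
echo "inter-check delay: $(inter_check_delay "$services" "$interval") s"
echo "interleave factor: $(interleave_factor "$services" "$hosts")"
```

With the host and service counts from this post (615 + 5490 = 6105 objects), it prints 20.35 checks per second, an inter-check delay of 0.0546 s and an interleave factor of 9.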

In nagios.cfg, the variables are set as follows:

use_large_installation_tweaks=1
service_inter_check_delay_method=s
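A quick way to confirm which scheduling directives are actually set is to grep nagios.cfg for them. This is a sketch: the default path /usr/local/nagios/etc/nagios.cfg is taken from the process list later in this thread, and the function name and the choice of directives are ours. A directive that does not appear in the output is not set explicitly, so Nagios falls back to its built-in default for it.

```shell
#!/usr/bin/env bash
# Sketch: list scheduling-related directives that are set in a nagios.cfg file.
# Commented-out lines are ignored; both "key=value" and "key = value" match.
show_scheduling_settings() {
  local cfg=${1:-/usr/local/nagios/etc/nagios.cfg}
  grep -E '^[[:space:]]*(use_large_installation_tweaks|service_inter_check_delay_method|service_interleave_factor|max_service_check_spread|max_concurrent_checks)[[:space:]]*=' "$cfg"
}

# Example: show_scheduling_settings /usr/local/nagios/etc/nagios.cfg
```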

Thanks a lot.
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: next check

Post by mguthrie »

My guess is that you are running into check latency because either the CPU or the disk is overtaxed. Can you post a screenshot of the Admin->Monitoring Engine Status page in XI? What is the 15-minute CPU load average on your system? And what output do you get from running:

Code:

iostat
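Both numbers asked for here can be collected in one go with a sketch like the following. It assumes a Linux host (the 15-minute load average is the third field of /proc/loadavg) and that the sysstat package, which provides iostat, is installed; the function names are hypothetical. A 15-minute load that stays well above the number of CPUs, or a device with a high %util in iostat -x, would support the CPU or disk theory.

```shell
#!/usr/bin/env bash
# Print the 15-minute load average: third field of /proc/loadavg (Linux).
load15() {
  awk '{ print $3 }' "${1:-/proc/loadavg}"
}

# Gather load and disk statistics. Not run automatically, because iostat
# samples for about 10 seconds (3 reports, 5 seconds apart).
collect_stats() {
  echo "15-minute load average: $(load15) on $(getconf _NPROCESSORS_ONLN) CPUs"
  # Extended per-device statistics; watch %util and the await columns.
  iostat -x 5 3
}
```

The first iostat report covers the time since boot, so the second and third samples are the ones that reflect the current load.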
Claro
Posts: 8
Joined: Wed Jul 11, 2012 1:34 pm

Re: next check

Post by Claro »

Hello,

The server is a VMware virtual machine with 8 processors at 2.4 GHz and 16 GB of memory, configured so that it can take additional resources from the resource pool if it ever needs them. The MySQL database is hosted on a separate, high-performance virtual server, and storage is a 500 GB datastore on a SAN.

We implemented the RAM disk. We also have 2 DNX machines with almost the same specifications as the master, but we had to disable them because they degraded the master server's performance even further during polling.
The nagios.cfg file has the following configuration:

use_large_installation_tweaks = 1
service_inter_check_delay_method = s

We have already done all of the above, so I would welcome any idea that could help us get out of this problem.

The requested screenshots are attached.

[Image: screenshot 1]

[Image: screenshot 2]

Let me know if you need any more information or tests.

Note:
Sorry for my English; I don't speak it very well.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: next check

Post by scottwilkerson »

What version of Nagios XI are you running? If it is 3.x, there was a bug in Core that was introduced when Core was upgraded as part of the XI upgrade.

Run the following on Nagios XI servers that are experiencing the problem to apply a patch to Nagios core.

Code:

cd /tmp
wget http://assets.nagios.com/downloads/nagiosxi/patches/nagioscore.tar.gz
tar xzf nagioscore.tar.gz
cd nagioscore
./install
service nagios restart
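Because this is a production system (a concern raised later in the thread), a more cautious version of the last step is to record the Core version and verify the configuration before restarting. This sketch assumes the Core binary lives at /usr/local/nagios/bin/nagios and that the first line of nagios -V looks like "Nagios Core 3.4.1"; both are assumptions, and the function names are hypothetical. nagios -v with a config file path is Core's own configuration check.

```shell
#!/usr/bin/env bash
NAGIOS_BIN=/usr/local/nagios/bin/nagios   # assumed install path
NAGIOS_CFG=/usr/local/nagios/etc/nagios.cfg

# Extract the first dotted version number (e.g. 3.4.1) from text on stdin.
parse_core_version() {
  grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# Log the Core version, then restart only if the configuration verifies cleanly.
safe_restart() {
  echo "Core version: $("$NAGIOS_BIN" -V | parse_core_version)"
  if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null; then
    service nagios restart
  else
    echo "Configuration check failed; not restarting." >&2
    return 1
  fi
}
```

Running "$NAGIOS_BIN" -V | parse_core_version before the patch commands and again after ./install shows what the installed binary reports, and safe_restart can stand in for the final service nagios restart line.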
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart
Claro
Posts: 8
Joined: Wed Jul 11, 2012 1:34 pm

Re: next check

Post by Claro »

Hello,

We are running the following version. Since it is not version 3, should we apply the patch anyway?

(Nagios XI 2011R2.1 Copyright © 2008-2011 Nagios Enterprises, LLC.)

We are currently handling 5% of the residential and business platform for Claro Colombia with Nagios, which is why we do not dare to apply an update to the monitoring system (Nagios) in full production until we know the update or patch is stable enough.

Do you need us to send any other settings?


Note:
Were you able to see the screenshots?

I look forward to your reply.
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: next check

Post by mguthrie »

Can you tell us a little bit more as to what kind of checks you're running? Switches? Linux Servers? SNMP?

Do you have any checks, event handlers, or notification commands that take a long time to execute (10 or more seconds)?

If MySQL is offloaded, can you run top and show us which processes are using the most CPU time?

You can also install iotop to track what the biggest disk I/O users are:

Code:

yum install -y iotop
iotop -aP
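To answer the question about slow checks, one option is to run a few of the SNMP check commands by hand and time them. This is a sketch with hypothetical helper names; the 10-second threshold comes from the question above, and the exit-code meanings follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/usr/bin/env bash
# Run one check command, discard its output, and print "<seconds> <exit code>".
# Resolution is whole seconds, which is enough to spot 10+ second checks.
time_check() {
  local start end rc
  start=$(date +%s)
  "$@" >/dev/null 2>&1 && rc=0 || rc=$?
  end=$(date +%s)
  echo "$((end - start)) $rc"
}

# Succeeds (exit 0) when a duration meets or exceeds the threshold (default 10 s).
is_slow() {
  [ "$1" -ge "${2:-10}" ]
}

# Example (plugin path, host address and community are assumptions):
#   time_check /usr/local/nagios/libexec/check_snmp -H 192.0.2.10 -C public -o sysUpTime.0
```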
Claro
Posts: 8
Joined: Wed Jul 11, 2012 1:34 pm

Re: next check

Post by Claro »

Hello,

1. We are monitoring a data center infrastructure: Linux servers, Windows servers, F5, IronPort, Fortinet, ASAs, Allot, DNS servers, FTP, mail, the phone devices themselves, and switches.

2. All monitoring is done via SNMP; we do not install agents, for implementation reasons.

3. All metrics are checked every 5 minutes.

4. We have not stopped the database because the platform is in full production; doing so would require a scheduled RFC.

5. Below is the output of the command you indicated.

We look forward to your reply.


Total DISK READ: 0.00 B/s | Total DISK WRITE: 555.19 K/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
1 be/4 root 0.00 B 0.00 B -133607291819.71 % 99.99 % init [3]
4 rt/3 root 0.00 B 0.00 B -133611590489.53 % 15.55 % [migration/1]
956 be/3 root 0.00 B 568.00 K 0.00 % 2.45 % [kjournald]
6378 be/4 postgres 0.00 B 32.00 K 0.08 % 2.19 % postgres: nagiosxi nagiosxi 127.0.0.1(59738) idle
2808 be/3 root 0.00 B 0.00 B 0.00 % 1.89 % [kjournald]
2814 be/3 root 0.00 B 168.00 K 0.04 % 0.51 % [kjournald]
6392 be/4 postgres 0.00 B 1408.00 K 2.19 % 0.46 % postgres: nagiosxi nagiosxi 127.0.0.1(59744) SELECT
6367 be/4 nagios 0.00 B 0.00 B 0.01 % 0.29 % sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/sy~tat.php > /usr/local/nagiosxi/var/sysstat.log 2>&1
6374 be/4 postgres 0.00 B 112.00 K 0.00 % 0.28 % postgres: nagiosxi nagiosxi 127.0.0.1(59727) idle
5203 be/4 nagios 0.00 B 8.00 K 0.00 % 0.11 % nagios -d /usr/local/nagios/etc/nagios.cfg
4699 be/4 postgres 0.00 B 116.00 K 0.00 % 0.10 % postgres: writer process
4740 be/4 smmsp 0.00 B 0.00 B 0.07 % 0.09 % sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
21614 be/4 apache 0.00 B 68.00 K 0.00 % 0.09 % httpd
6385 be/4 postgres 0.00 B 96.00 K 0.05 % 0.09 % postgres: nagiosxi nagiosxi 127.0.0.1(59741) idle
6375 be/4 postgres 0.00 B 40.00 K 0.00 % 0.08 % postgres: nagiosxi nagiosxi 127.0.0.1(59728) idle
23991 be/4 apache 0.00 B 40.00 K 0.06 % 0.07 % httpd
4701 be/4 postgres 0.00 B 0.00 B 0.10 % 0.07 % postgres: stats collector process
21765 be/4 apache 0.00 B 8.00 K 0.09 % 0.06 % httpd
29696 be/4 postgres 0.00 B 0.00 B 0.00 % 0.06 % postgres: nagiosxi nagiosxi 127.0.0.1(42053) idle
5285 be/4 root 0.00 B 0.00 B 0.00 % 0.05 % httpd
24286 be/4 apache 0.00 B 0.00 B 0.07 % 0.05 % httpd
18631 be/4 root 0.00 B 0.00 B 0.02 % 0.05 % sshd: root@pts/4
26939 be/4 nagios 0.00 B 0.00 B 0.03 % 0.05 % ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
3305 be/3 root 0.00 B 0.00 B 0.00 % 0.05 % [vmmemctl]
6376 be/4 postgres 0.00 B 8.00 K 0.28 % 0.05 % postgres: nagiosxi nagiosxi 127.0.0.1(59730) idle
6366 be/4 nagios 0.00 B 0.00 B 0.00 % 0.04 % sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cm~s.php > /usr/local/nagiosxi/var/cmdsubsys.log 2>&1
14571 be/4 apache 0.00 B 4.00 K 0.01 % 0.04 % httpd
2803 be/3 root 0.00 B 0.00 B 0.00 % 0.04 % [kjournald]
21639 be/4 postgres 0.00 B 0.00 B 0.00 % 0.04 % postgres: nagiosxi nagiosxi 127.0.0.1(37310) idle
15869 be/4 postgres 0.00 B 0.00 B 0.00 % 0.04 % postgres: nagiosxi nagiosxi 127.0.0.1(36521) idle
4479 be/4 root 0.00 B 0.00 B 0.00 % 0.04 % ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
4730 be/4 root 0.00 B 0.00 B 0.01 % 0.03 % sendmail: accepting connections
6347 be/4 nagios 0.00 B 0.00 B 0.21 % 0.03 % crond
26405 be/4 apache 0.00 B 0.00 B 0.01 % 0.03 % httpd
2820 be/3 root 0.00 B 0.00 B 0.51 % 0.03 % [kjournald]
14173 be/4 root 0.00 B 0.00 B 0.00 % 0.03 % sshd: root@pts/3
2827 be/3 root 0.00 B 12.00 K 0.00 % 0.03 % [kjournald]
6733 be/4 root 0.00 B 0.00 B 0.09 % 0.02 % python /usr/bin/iotop -aP
18681 be/4 root 0.00 B 0.00 B 0.04 % 0.02 % -bash
3328 be/4 apache 0.00 B 8.00 K 0.00 % 0.02 % httpd
6349 be/4 nagios 0.00 B 0.00 B 0.05 % 0.02 % crond
15778 be/4 apache 0.00 B 0.00 B 0.00 % 0.02 % httpd
3570 ?dif root 0.00 B 0.00 B 0.02 % 0.02 % multipathd
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: next check

Post by scottwilkerson »

These entries look way off to us:

Code:

PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
1 be/4 root 0.00 B 0.00 B -133607291819.71 % 99.99 % init [3]
4 rt/3 root 0.00 B 0.00 B -133611590489.53 % 15.55 % [migration/1]
956 be/3 root 0.00 B 568.00 K 0.00 % 2.45 % [kjournald]
2808 be/3 root 0.00 B 0.00 B 0.00 % 1.89 % [kjournald]
2814 be/3 root 0.00 B 168.00 K 0.04 % 0.51 % [kjournald]
We would suggest scheduling a maintenance window to take the machine offline and reboot the server.
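Two of the rows singled out above (init and migration/1) show a negative SWAPIN figure, which cannot be a real percentage. To check whether later iotop -aP captures still show this after the reboot, the output can be filtered with a small awk sketch. The field positions are an assumption based on the output pasted in this thread (each size and percentage is two tokens, a number and a unit), and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Print iotop -aP rows whose SWAPIN value is negative.
# Assumed field layout: $1 PID, $2 PRIO, $3 USER, $4-$5 DISK READ,
# $6-$7 DISK WRITE, $8-$9 SWAPIN, $10-$11 IO>, then COMMAND.
# Header and "Total DISK READ" lines are skipped because $1 is not numeric.
flag_negative_swapin() {
  awk '$1 ~ /^[0-9]+$/ && $8 ~ /^-/' "$@"
}
```

For example, iotop -b -aP -n 3 -d 5 | flag_negative_swapin takes three batch-mode samples 5 seconds apart and prints only the suspicious rows.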
Claro
Posts: 8
Joined: Wed Jul 11, 2012 1:34 pm

Re: next check

Post by Claro »

Hello,

We performed this procedure on 20/07/2012 and had very high CPU load and latency. Also, since this is a virtual machine, here are its specifications:

Server: Virtualized (VMware 5.0 for datacenter)
Operating system: Red Hat Enterprise Linux 5 (64-bit)
Hardware: HP ProLiant BL680c G5
CPU: 8 x 2.40 GHz
Virtual disk: 200 GB
Memory: 16 GB
Network: 2 interfaces (services and management) with a single MAC address.
Datastore: 500 GB (SAN) (stores all settings and disk writes)
MySQL runs on another virtual server with almost identical specifications.

Note:
Do you have a company representative in Colombia with the experience to help us resolve this issue as quickly as possible? Or do you have another support line that could validate the settings on this platform?
This monitoring system is deployed at Claro Colombia (http://www.claro.com.co/portal/co/pc3/personas/), which provides Tier 4 data center solutions.

As always, we look forward to your reply.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: next check

Post by scottwilkerson »

We do not have anyone in Colombia. Have you restarted the server as I suggested?
Locked