Scheduling very unstable

Post by **cdienger** » Fri Apr 06, 2018 1:51 pm

Try increasing the memory limit to 1024M or 2048M in php.ini and restarting the httpd service again. Digging a bit deeper in the code, I don't see where much else could go wrong like this unless the nagios database is also having issues and the full queue can't be pulled. Based on the profile though I would say the php memory problem is more likely.
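Before and after editing, it can help to confirm which value is actually active. A minimal sketch, assuming the file in use is /etc/php.ini (`php --ini` shows the real location); the helper name `php_memory_limit` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: prints the active memory_limit value from a
# php.ini-style file (path given as $1). Commented-out lines are ignored;
# if the directive appears more than once, the last one is printed.
php_memory_limit() {
    awk -F= '/^[ \t]*memory_limit[ \t]*=/ {
                 v = $2
                 sub(/;.*/, "", v)        # drop a trailing "; comment"
                 gsub(/[ \t\r]/, "", v)   # drop surrounding whitespace
                 val = v
             }
             END { print val }' "$1"
}

# Usage (the path is an assumption):
#   php_memory_limit /etc/php.ini
```

If this still prints the old value after the edit, the change went into a different php.ini than the one being read.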

The table that is queried for this chart is nagios_timedeventqueue. To check it, run the following (note the rows returned):

echo "select * from nagios_timedeventqueue" | mysql -uroot -pnagiosxi -Dnagios

https://assets.nagios.com/downloads/nag ... tabase.pdf covers repairing the database and although I don't think at this time it would be the problem, it wouldn't hurt to run the repair script anyway.

rajasegar · Post by **rajasegar** » Sun Apr 08, 2018 7:57 pm

echo "select * from nagios_timedeventqueue" | mysql -uroot -pnagiosxi -Dnagios --host=10.17.19.237
This returns Empty set (0.00 sec). Scheduling seems fine now. Can you please double check?

scottwilkerson · Post by **scottwilkerson** » Mon Apr 09, 2018 12:52 pm

rajasegar wrote:Scheduling seems fine now. Can you please double check?

There is not a lot to check if it all seems stable.

I have noted this behavior a few times in the past on systems with mod_gearman if they somehow spawn 2 nagios parent processes, but if you restart several things before we get any logs, it's hard to tell. If you notice it again, it would be great if you could create a profile BEFORE trying to take any corrective action, and open a ticket with that for analysis.
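A quick way to look for a duplicate parent before restarting anything is to count the nagios processes whose parent is PID 1. This is only a sketch: it assumes the daemon's command name is `nagios` and that the top-level process is owned by init/systemd (PID 1), and the function name is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: reads "pid ppid comm" lines on stdin and counts
# processes named "nagios" whose parent is PID 1 (assumed here to be the
# top-level daemon processes). A result above 1 suggests a duplicate.
count_nagios_parents() {
    awk '$3 == "nagios" && $2 == 1 { n++ } END { print n + 0 }'
}

# Usage (the trailing "=" suppresses the ps header line):
#   ps -eo pid=,ppid=,comm= | count_nagios_parents
```

Worker processes normally have the main nagios process as their parent, so they are not counted here.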

rajasegar · Post by **rajasegar** » Mon Apr 09, 2018 6:49 pm

Ok noted. Will do that if it happens again.

As I mentioned previously, I disabled mod_gearman in nagios.cfg:

Code: Select all

#broker_module=/usr/lib64/mod_gearman2/mod_gearman2.o config=/etc/mod_gearman2/module.conf eventhandler=no

rajasegar · Post by **rajasegar** » Tue Apr 10, 2018 4:32 am

This time another server is having the issue. I took the system profile before restarting.

18 cores, 20GB. DB not offloaded.

Capture_XI3.JPG

profile_xi3.zip

scottwilkerson · Post by **scottwilkerson** » Tue Apr 10, 2018 8:28 am

Your system log is showing these errors:

Code: Select all

ndo2db: Warning: queue send error, retrying...

Please follow this document and perform all the changes
https://support.nagios.com/kb/article/n ... d-139.html
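To see how full the kernel message queues are at the moment the warning appears, the output of `ipcs -q` can be totalled and compared with the configured limits. A rough sketch, assuming the usual Linux `ipcs -q` column layout (key, msqid, owner, perms, used-bytes, messages); the helper name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: sums the used-bytes and messages columns of
# `ipcs -q` output read on stdin. Data rows are recognised by a key
# field starting with "0x"; header and blank lines are skipped.
sum_msg_queues() {
    awk '$1 ~ /^0x/ { bytes += $5; msgs += $6 }
         END { printf "used_bytes=%.0f messages=%.0f\n", bytes, msgs }'
}

# Usage:
#   ipcs -q | sum_msg_queues
#   sysctl kernel.msgmnb kernel.msgmax kernel.msgmni
```

If used_bytes stays close to kernel.msgmnb while the warning is being logged, the queue is filling faster than ndo2db drains it.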

rajasegar · Post by **rajasegar** » Tue Apr 10, 2018 6:39 pm

I have already done this before. Here is the current config.
I can keep increasing the values, but the problem will still persist, as it did on the other server.

Code: Select all

# Controls the default maximum size of a message queue
#kernel.msgmnb = 431072000
kernel.msgmnb = 352144000

# Controls the maximum size of a message, in bytes
#kernel.msgmax = 431072000
kernel.msgmax = 352144000

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 4294967295

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 268435456

# The maximum number of messages allowed in any one message queue
#kernel.msgmni = 356000
kernel.msgmni = 512000

scottwilkerson · Post by **scottwilkerson** » Wed Apr 11, 2018 10:06 am

We may want to try increasing

Code: Select all

kernel.msgmni = 512000

to

Code: Select all

kernel.msgmni = 768000

The server is getting to the point where it could really benefit from offloading the MySQL databases, to reduce some of the load on the local machine and move the simultaneous processing to a second server.

https://assets.nagios.com/downloads/nag ... Server.pdf

rajasegar · Post by **rajasegar** » Mon Apr 30, 2018 4:03 am

The problem came back. This has been driving me nuts all day.

Capture.JPG

profile.zip

Repaired the databases, restarted services, restarted the App and DB servers.
Still the same problem.
Please advise what else to check.

# Controls the default maximum size of a message queue
kernel.msgmnb = 352144000

# Controls the maximum size of a message, in bytes
kernel.msgmax = 352144000

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 4294967295

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 268435456

# The maximum number of messages allowed in any one message queue
kernel.msgmni = 768000

Post by **tgriep** » Mon Apr 30, 2018 1:38 pm

Can you run the following commands on the Nagios server and post the output to the ticket? If your username, password, database name, or MySQL host differ from the examples below, please change them.

Code: Select all

/usr/local/nagios/bin/nagiostats

echo 'SELECT COUNT(*) AS total FROM nagios_hoststatus  WHERE TRUE AND `active_checks_enabled`=1 AND (TIMESTAMPDIFF(SECOND,nagios_hoststatus.last_check,NOW()) < 60);' |mysql -u nagios -pnagios --databases nagios -h nagiosproddb1
echo 'SELECT COUNT(*) AS total FROM nagios_hoststatus  WHERE TRUE AND `active_checks_enabled`=1 AND (TIMESTAMPDIFF(SECOND,nagios_hoststatus.last_check,NOW()) < 300);' |mysql -u nagios -pnagios --databases nagios -h nagiosproddb1
echo 'SELECT COUNT(*) AS total FROM nagios_hoststatus  WHERE TRUE AND `active_checks_enabled`=1 AND (TIMESTAMPDIFF(SECOND,nagios_hoststatus.last_check,NOW()) < 900);' |mysql -u nagios -pnagios --databases nagios -h nagiosproddb1

echo 'SELECT COUNT(*) AS total FROM nagios_servicestatus  WHERE TRUE AND `active_checks_enabled`=1 AND (TIMESTAMPDIFF(SECOND,nagios_servicestatus.last_check,NOW()) < 60);' |mysql -u nagios -pnagios --databases nagios -h nagiosproddb1
echo 'SELECT COUNT(*) AS total FROM nagios_servicestatus  WHERE TRUE AND `active_checks_enabled`=1 AND (TIMESTAMPDIFF(SECOND,nagios_servicestatus.last_check,NOW()) < 300);' |mysql -u nagios -pnagios --databases nagios -h nagiosproddb1
echo 'SELECT COUNT(*) AS total FROM nagios_servicestatus  WHERE TRUE AND `active_checks_enabled`=1 AND (TIMESTAMPDIFF(SECOND,nagios_servicestatus.last_check,NOW()) < 900);' |mysql -u nagios -pnagios --databases nagios -h nagiosproddb1

What this will do is print out what Nagios Core has checked and what is in the MySQL database, so we can compare the information.
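If the comparison needs to be repeated, the six queries can be generated from one template instead of being pasted each time. A minimal sketch: the function name is hypothetical, and the credentials and host in the usage comment are the same example values as above, so adjust them to your environment:

```shell
#!/bin/sh
# Hypothetical helper: prints the COUNT query for a given status table
# suffix (hoststatus or servicestatus) and a time window in seconds.
recent_checks_sql() {
    table="nagios_$1"
    printf 'SELECT COUNT(*) AS total FROM %s WHERE active_checks_enabled=1 AND TIMESTAMPDIFF(SECOND, %s.last_check, NOW()) < %d;\n' \
        "$table" "$table" "$2"
}

# Usage (example credentials and host, taken from the commands above):
#   for t in hoststatus servicestatus; do
#       for w in 60 300 900; do
#           recent_checks_sql "$t" "$w" | mysql -u nagios -pnagios -h nagiosproddb1 nagios
#       done
#   done
```

The counts for the 60, 300 and 900 second windows can then be set against the active check figures that nagiostats reports.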

Nagios Support Forum