Scheduling very unstable

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Scheduling very unstable

Post by cdienger »

Try increasing the memory limit to 1024M or 2048M in php.ini and restarting the httpd service again. Digging a bit deeper in the code, I don't see where much else could go wrong like this unless the nagios database is also having issues and the full queue can't be pulled. Based on the profile though I would say the php memory problem is more likely.

The table that is queried for this chart is nagios_timedeventqueue. To check it, run the following (note the rows returned):

echo "select * from nagios_timedeventqueue" | mysql -uroot -pnagiosxi -Dnagios

https://assets.nagios.com/downloads/nag ... tabase.pdf covers repairing the database and although I don't think at this time it would be the problem, it wouldn't hurt to run the repair script anyway.
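
If only the size of the queue matters, a row count is easier to compare between runs than the full table dump. The following is a minimal sketch, not part of the original advice: the function name `timedeventqueue_count` is made up for illustration, and the credentials are simply the ones used in the command above, so adjust them to your install.

```shell
#!/bin/sh
# Print the number of rows in nagios_timedeventqueue.
# -N suppresses the column header and -B selects batch output,
# so only the bare number is printed.
# Credentials are the ones from the post (assumption: yours may differ).
# Extra arguments such as --host=... are passed through to mysql.
timedeventqueue_count() {
    echo 'SELECT COUNT(*) FROM nagios_timedeventqueue;' |
        mysql -N -B -uroot -pnagiosxi -Dnagios "$@"
}
```

For a remote database, something like `timedeventqueue_count --host=<db-host>` should work, since the extra arguments are handed to the mysql client unchanged.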
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: Scheduling very unstable

Post by rajasegar »

echo "select * from nagios_timedeventqueue" | mysql -uroot -pnagiosxi -Dnagios --host=10.17.19.237
This returns Empty set (0.00 sec). Scheduling seems fine now. Can you please double check?
5 x Nagios 5.6.9 Enterprise Edition
RHEL 6 & 7
rrdcached & ramdisk optimisation
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Scheduling very unstable

Post by scottwilkerson »

rajasegar wrote: Scheduling seems fine now. Can you please double check?
There is not a lot to check if it all seems stable.

I have noted this behavior a few times in the past on systems with mod_gearman if they somehow spawn 2 nagios parent processes, but if you restart several things before we get any logs, it's hard to tell. If you notice it again, it would be great if you could create a profile BEFORE trying to take any corrective action, and open a ticket with that for analysis.
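
One way to look for the duplicate-parent condition before restarting anything is to count nagios processes whose parent is PID 1. This is only a sketch under assumptions: it assumes the daemon's process name is `nagios` and that a daemonized parent is reparented to PID 1, which may not hold on every init setup. The function name `count_nagios_parents` is hypothetical.

```shell
#!/bin/sh
# Count processes named "nagios" whose parent PID is 1.
# Expects "PID PPID COMMAND" lines on stdin, so it can be fed from ps.
# Assumption: the top-level nagios daemon is a direct child of PID 1.
count_nagios_parents() {
    awk '$2 == 1 && $3 == "nagios" { n++ } END { print n + 0 }'
}
```

It can be fed with `ps -eo pid=,ppid=,comm= | count_nagios_parents`. A result greater than 1 would be worth noting in the ticket alongside the profile.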
Former Nagios employee
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: Scheduling very unstable

Post by rajasegar »

OK, noted. I will do that if it happens again.

As I mentioned previously, I disabled mod_gearman in nagios.cfg:

Code:

#broker_module=/usr/lib64/mod_gearman2/mod_gearman2.o config=/etc/mod_gearman2/module.conf eventhandler=no
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: Scheduling very unstable

Post by rajasegar »

This time another server is showing the issue. I took the system profile before restarting.

18 cores, 20GB. DB not offloaded.
Capture_XI3.JPG

profile_xi3.zip
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Scheduling very unstable

Post by scottwilkerson »

Your system log is showing these errors:

Code:

ndo2db: Warning: queue send error, retrying...
Please follow this document and perform all the changes:
https://support.nagios.com/kb/article/n ... d-139.html
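
To see whether the warning is still accumulating after a change, it can help to count its occurrences in the system log. A minimal sketch follows; the function name `count_queue_send_errors` is made up for illustration, and the log path is left as an argument because its location depends on the distribution.

```shell
#!/bin/sh
# Count ndo2db "queue send error" warnings in a log file.
# $1 = path to the log file (location varies by distribution).
# grep -c prints the number of matching lines, including 0.
count_queue_send_errors() {
    grep -c 'ndo2db: Warning: queue send error' "$1"
}
```

On RHEL the system log is commonly `/var/log/messages`, so `count_queue_send_errors /var/log/messages` is a plausible invocation, though that path is an assumption here.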
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: Scheduling very unstable

Post by rajasegar »

I have already done this before. Here is the current config.
I can keep increasing the values, but the problem will still persist, as it did on the other server.

Code:

# Controls the default maximum size of a message queue, in bytes
#kernel.msgmnb = 431072000
kernel.msgmnb = 352144000

# Controls the maximum size of a message, in bytes
#kernel.msgmax = 431072000
kernel.msgmax = 352144000

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 4294967295

# Controls the total amount of shared memory, in pages
kernel.shmall = 268435456

# The maximum number of message queue identifiers system-wide
#kernel.msgmni = 356000
kernel.msgmni = 512000
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Scheduling very unstable

Post by scottwilkerson »

We may want to try increasing

Code:

kernel.msgmni = 512000
to

Code:

kernel.msgmni = 768000
The server is getting to the point that it could really benefit from offloading the MySQL databases to reduce some of the load on the local machine and move the simultaneous processing to a second server:

https://assets.nagios.com/downloads/nag ... Server.pdf
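
When several old values are left commented out in the file, it is easy to lose track of which setting is actually in effect. The sketch below reads a sysctl.conf-style file and prints the last uncommented value for a key. It is an illustration only: the function name `sysctl_conf_value` is hypothetical, and it handles only plain `key = value` lines with `#` comments.

```shell
#!/bin/sh
# Print the value a sysctl.conf-style file assigns to a key.
# $1 = key (for example kernel.msgmni), $2 = path to the file.
# Lines starting with # are ignored; if the key is assigned more
# than once, the last assignment is printed. Prints nothing if unset.
sysctl_conf_value() {
    awk -F= -v key="$1" '
        /^[ \t]*#/ { next }
        {
            k = $1; v = $2
            gsub(/[ \t]/, "", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) val = v
        }
        END { if (val != "") print val }
    ' "$2"
}
```

The result can be compared with the running kernel's value from `sysctl -n kernel.msgmni`. If the two differ, the file has probably not been reloaded with `sysctl -p` since it was edited. `ipcs -q` lists the message queues currently in use.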
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: Scheduling very unstable

Post by rajasegar »

The problem came back. This has been driving me nuts the whole day.
Capture.JPG
profile.zip
I repaired the databases, restarted services, and restarted the App and DB servers.
Still the same problem.
Please advise what else to check.
# Controls the default maximum size of a message queue, in bytes
kernel.msgmnb = 352144000

# Controls the maximum size of a message, in bytes
kernel.msgmax = 352144000

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 4294967295

# Controls the total amount of shared memory, in pages
kernel.shmall = 268435456

# The maximum number of message queue identifiers system-wide
kernel.msgmni = 768000
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Scheduling very unstable

Post by tgriep »

Can you run the following commands on the Nagios server and post the output to the ticket? If your username, password, database name, or MySQL host differ from the examples below, please change them.

Code:

/usr/local/nagios/bin/nagiostats

echo 'SELECT COUNT(*) AS total FROM nagios_hoststatus WHERE TRUE AND `active_checks_enabled`=1 AND (TIMESTAMPDIFF(SECOND,nagios_hoststatus.last_check,NOW()) < 60);' | mysql -u nagios -pnagios -D nagios -h nagiosproddb1
echo 'SELECT COUNT(*) AS total FROM nagios_hoststatus WHERE TRUE AND `active_checks_enabled`=1 AND (TIMESTAMPDIFF(SECOND,nagios_hoststatus.last_check,NOW()) < 300);' | mysql -u nagios -pnagios -D nagios -h nagiosproddb1
echo 'SELECT COUNT(*) AS total FROM nagios_hoststatus WHERE TRUE AND `active_checks_enabled`=1 AND (TIMESTAMPDIFF(SECOND,nagios_hoststatus.last_check,NOW()) < 900);' | mysql -u nagios -pnagios -D nagios -h nagiosproddb1

echo 'SELECT COUNT(*) AS total FROM nagios_servicestatus WHERE TRUE AND `active_checks_enabled`=1 AND (TIMESTAMPDIFF(SECOND,nagios_servicestatus.last_check,NOW()) < 60);' | mysql -u nagios -pnagios -D nagios -h nagiosproddb1
echo 'SELECT COUNT(*) AS total FROM nagios_servicestatus WHERE TRUE AND `active_checks_enabled`=1 AND (TIMESTAMPDIFF(SECOND,nagios_servicestatus.last_check,NOW()) < 300);' | mysql -u nagios -pnagios -D nagios -h nagiosproddb1
echo 'SELECT COUNT(*) AS total FROM nagios_servicestatus WHERE TRUE AND `active_checks_enabled`=1 AND (TIMESTAMPDIFF(SECOND,nagios_servicestatus.last_check,NOW()) < 900);' | mysql -u nagios -pnagios -D nagios -h nagiosproddb1
This will print what Nagios Core has checked and what is in the MySQL database, so we can compare the two.
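
The six queries differ only in the table name and the time window, so they can also be generated in a loop. The following is a minimal sketch of that idea, not a replacement for the commands above. The function names are made up for illustration, and the table and column names are taken from the queries in this post.

```shell
#!/bin/sh
# Build the "actively checked within N seconds" count query.
# $1 = status table (nagios_hoststatus or nagios_servicestatus)
# $2 = time window in seconds
recent_check_query() {
    printf 'SELECT COUNT(*) AS total FROM %s WHERE active_checks_enabled=1 AND TIMESTAMPDIFF(SECOND, %s.last_check, NOW()) < %s;\n' "$1" "$1" "$2"
}

# Print one query per line for both tables and the 60/300/900 second windows.
all_recent_check_queries() {
    for table in nagios_hoststatus nagios_servicestatus; do
        for secs in 60 300 900; do
            recent_check_query "$table" "$secs"
        done
    done
}
```

The generated statements can be piped to the client in one go, for example `all_recent_check_queries | mysql -u nagios -pnagios -D nagios -h nagiosproddb1`, using the same example credentials and host as above.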