Scheduling very unstable

Post by **rajasegar** » Tue Apr 03, 2018 2:48 am

We have been having a scheduling problem with one of our XI instances.
Sometimes scheduling drops to 20 and stays there until it is restarted.

Then, all of a sudden, everything is OK until the cycle starts again some other day.
A server restart does not solve the problem.
The DB logs show no issues (the database is offloaded, by the way).

This instance has been active since XI version 1.5 and updated regularly.
We have another instance like this with more load but it is fine.

Capture2.JPG

Capture.JPG

Server resources are all OK.

Capture3.JPG

The only errors I see in /var/log/messages are these:

Apr  3 15:31:26 nagiosprodxi1 nagios: job 1326 (pid=8518): read() returned error 11
Apr  3 15:31:26 nagiosprodxi1 nagios: job 1326 (pid=8517): read() returned error 11
Apr  3 15:31:44 nagiosprodxi1 ndo2db: Trimming timedevents.
Apr  3 15:31:44 nagiosprodxi1 ndo2db: Trimming systemcommands.
Apr  3 15:31:44 nagiosprodxi1 ndo2db: Trimming servicechecks.
Apr  3 15:31:44 nagiosprodxi1 ndo2db: Trimming hostchecks.
Apr  3 15:31:44 nagiosprodxi1 ndo2db: Trimming eventhandlers.
Apr  3 15:32:45 nagiosprodxi1 ndo2db: Trimming timedevents.
Apr  3 15:32:45 nagiosprodxi1 ndo2db: Trimming systemcommands.
Apr  3 15:32:45 nagiosprodxi1 ndo2db: Trimming servicechecks.
Apr  3 15:32:45 nagiosprodxi1 ndo2db: Trimming hostchecks.
Apr  3 15:32:45 nagiosprodxi1 ndo2db: Trimming eventhandlers.
Apr  3 15:33:33 nagiosprodxi1 nagios: job 1469 (pid=15485): read() returned error 11
Apr  3 15:33:33 nagiosprodxi1 nagios: job 1469 (pid=15486): read() returned error 11
Apr  3 15:33:46 nagiosprodxi1 ndo2db: Trimming timedevents.
Apr  3 15:33:46 nagiosprodxi1 ndo2db: Trimming systemcommands.
Apr  3 15:33:46 nagiosprodxi1 ndo2db: Trimming servicechecks.
Apr  3 15:33:46 nagiosprodxi1 ndo2db: Trimming hostchecks.
Apr  3 15:33:46 nagiosprodxi1 ndo2db: Trimming eventhandlers.
Apr  3 15:34:06 nagiosprodxi1 nagios: job 1509 (pid=21484): read() returned error 11
Apr  3 15:34:45 nagiosprodxi1 rrdcached[24507]: queue_thread_main: rrd_update_r (/usr/local/nagios/share/perfdata/MYUCBRIFRSAPP02/Disk__All_Partitions.rrd) failed with status -1. (/usr/local/nagios/share/perfdata/MYUCBRIFRSAPP02/Disk__All_Partitions.rrd: expected 19 data source readings (got 18) from 1522739981)
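Editor's note: on Linux, errno 11 is EAGAIN ("Resource temporarily unavailable"), so the `read() returned error 11` lines mean a non-blocking read found no data ready. To see how often this happens and for which jobs, the messages can be tallied with a short script. This is a minimal sketch that assumes the syslog line format shown above; the function name `summarize_read_errors` is hypothetical and not part of Nagios.

```python
import re
from collections import Counter

# Matches lines such as:
#   Apr  3 15:31:26 nagiosprodxi1 nagios: job 1326 (pid=8518): read() returned error 11
READ_ERROR = re.compile(
    r"nagios: job (?P<job>\d+) \(pid=(?P<pid>\d+)\): "
    r"read\(\) returned error (?P<err>\d+)"
)


def summarize_read_errors(lines):
    """Count read() errors per (job id, errno) pair; other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            counts[(int(match["job"]), int(match["err"]))] += 1
    return counts


# A few lines copied from the log excerpt above.
sample_lines = [
    "Apr  3 15:31:26 nagiosprodxi1 nagios: job 1326 (pid=8518): read() returned error 11",
    "Apr  3 15:31:26 nagiosprodxi1 nagios: job 1326 (pid=8517): read() returned error 11",
    "Apr  3 15:31:44 nagiosprodxi1 ndo2db: Trimming timedevents.",
    "Apr  3 15:33:33 nagiosprodxi1 nagios: job 1469 (pid=15485): read() returned error 11",
    "Apr  3 15:33:33 nagiosprodxi1 nagios: job 1469 (pid=15486): read() returned error 11",
    "Apr  3 15:34:06 nagiosprodxi1 nagios: job 1509 (pid=21484): read() returned error 11",
]

counts = summarize_read_errors(sample_lines)
for (job, err), n in sorted(counts.items()):
    print(f"job {job}: errno {err} seen {n} time(s)")
```

To run it against the real log, pass an open file instead of the sample list, for example `summarize_read_errors(open("/var/log/messages"))`.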

I hope someone can remote in and help solve this, as we are having endless issues with it.
Please note that we are in the GMT+8 timezone.

Post by **npolovenko** » Tue Apr 03, 2018 10:18 am

Hello, @rajasegar. Does the scheduling stop for only one service or for all services at the same time? Next time this happens please open the Nagios Core Interface and see if the checks are scheduled there:

http://nagios_xi_ip/nagios/

Also, please run the database repair script.

rajasegar wrote: DB logs no issues (it is off loaded by the way)

Have you looked at the logs on the remote server itself?

Could you send in your Nagios XI System Profile so I can review it?
To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
Save the profile.zip file, upload it to a cloud storage service of your choice, and share a link with me via private message.
After that, please post something in this thread to bring it back up in the support queue.

Post by **cdienger** » Tue Apr 03, 2018 10:20 am

Please PM me a profile (Admin > System Config > System Profile > Download System Profile) as well as a copy of /etc/sudoers. Please also verify on the command line that SELinux is disabled:

sestatus
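
Editor's note: when this check needs to be repeated across several servers, the `sestatus` output can be interpreted by a small helper. This is a minimal sketch that assumes the usual `SELinux status:` and `Current mode:` fields in the output; the function name `selinux_state` is hypothetical.

```python
def selinux_state(sestatus_output):
    """Return 'disabled', or the current mode when SELinux is enabled."""
    fields = {}
    for line in sestatus_output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    status = fields.get("SELinux status", "unknown")
    if status == "enabled":
        # 'Current mode' is 'enforcing' or 'permissive' on an enabled system.
        return fields.get("Current mode", "enabled")
    return status


disabled_sample = "SELinux status:                 disabled\n"
enabled_sample = (
    "SELinux status:                 enabled\n"
    "SELinuxfs mount:                /sys/fs/selinux\n"
    "Current mode:                   enforcing\n"
)

print(selinux_state(disabled_sample))
print(selinux_state(enabled_sample))
```

On a live system the text could come from `subprocess.run(["sestatus"], capture_output=True, text=True).stdout`.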

Did you notice this behavior start after a certain update or change in the environment or config?

Post by **rajasegar** » Tue Apr 03, 2018 6:42 pm

It affects scheduling for everything, all services and hosts. It grinds to a halt.

The remote DB server is fine as well. I looked at the mysqld.log on that server.
I have already run the repair many times.

Sometimes the scheduling problems last for 2 days and then, out of the blue, everything is OK again.

I will post the profile shortly.

Post by **rajasegar** » Tue Apr 03, 2018 6:46 pm

This happened a few times a few years back when the server was heavily loaded.
It has since been offloaded to 3 instances. So this instance is not that loaded now.

SELinux is disabled.

/etc/sudoers

[nagios@nagiosprodxi1 ~]$ sudo cat /etc/sudoers
## Sudoers allows particular users to run various commands as
## the root user, without needing the root password.
##
## Examples are provided at the bottom of the file for collections
## of related commands, which can then be delegated out to particular
## users or groups.
##
## This file must be edited with the 'visudo' command.

## Host Aliases
## Groups of machines. You may prefer to use hostnames (perhaps using
## wildcards for entire domains) or IP addresses instead.
# Host_Alias     FILESERVERS = fs1, fs2
# Host_Alias     MAILSERVERS = smtp, smtp2

## User Aliases
## These aren't often necessary, as you can use regular groups
## (ie, from files, LDAP, NIS, etc) in this file - just use %groupname
## rather than USERALIAS
# User_Alias ADMINS = jsmith, mikem

## Command Aliases
## These are groups of related commands...

## Networking
# Cmnd_Alias NETWORKING = /sbin/route, /sbin/ifconfig, /bin/ping, /sbin/dhclient, /usr/bin/net, /sbin/iptables, /usr/bin/rfcomm, /usr/bin/wvdial, /sbin/iwconfig, /sbin/mii-tool

## Installation and management of software
# Cmnd_Alias SOFTWARE = /bin/rpm, /usr/bin/up2date, /usr/bin/yum

## Services
# Cmnd_Alias SERVICES = /sbin/service, /sbin/chkconfig

## Updating the locate database
# Cmnd_Alias LOCATE = /usr/bin/updatedb

## Storage
# Cmnd_Alias STORAGE = /sbin/fdisk, /sbin/sfdisk, /sbin/parted, /sbin/partprobe, /bin/mount, /bin/umount

## Delegating permissions
# Cmnd_Alias DELEGATING = /usr/sbin/visudo, /bin/chown, /bin/chmod, /bin/chgrp

## Processes
# Cmnd_Alias PROCESSES = /bin/nice, /bin/kill, /usr/bin/kill, /usr/bin/killall

## Drivers
# Cmnd_Alias DRIVERS = /sbin/modprobe

# Defaults specification

#
# Disable "ssh hostname sudo <cmd>", because it will show the password in clear.
#         You have to run "ssh -t hostname sudo <cmd>".
#
#####Defaults    requiretty

#
# Refuse to run if unable to disable echo on the tty. This setting should also be
# changed in order to be able to use sudo without a tty. See requiretty above.
#
Defaults   !visiblepw

#
# Preserving HOME has security implications since many programs
# use it when searching for configuration files. Note that HOME
# is already set when the the env_reset option is enabled, so
# this option is only effective for configurations where either
# env_reset is disabled or HOME is present in the env_keep list.
my02390 ALL=(ALL)       NOPASSWD: ALL
nagios  ALL=(ALL)       NOPASSWD: ALL
backup_admin    ALL=(ALL)       NOPASSWD: ALL
#
Defaults    always_set_home

Defaults    env_reset
Defaults    env_keep =  "COLORS DISPLAY HOSTNAME HISTSIZE INPUTRC KDEDIR LS_COLORS"
Defaults    env_keep += "MAIL PS1 PS2 QTDIR USERNAME LANG LC_ADDRESS LC_CTYPE"
Defaults    env_keep += "LC_COLLATE LC_IDENTIFICATION LC_MEASUREMENT LC_MESSAGES"
Defaults    env_keep += "LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE"
Defaults    env_keep += "LC_TIME LC_ALL LANGUAGE LINGUAS _XKB_CHARSET XAUTHORITY"
Defaults    env_keep += "TZ"

#
# Adding HOME to env_keep may enable a user to run unrestricted
# commands via sudo.
#
# Defaults   env_keep += "HOME"

Defaults    secure_path = /sbin:/bin:/usr/sbin:/usr/bin

## Next comes the main part: which users can run what software on
## which machines (the sudoers file can be shared between multiple
## systems).
## Syntax:
##
##      user    MACHINE=COMMANDS
##
## The COMMANDS section may have other options added to it.
##
## Allow root to run any commands anywhere
root    ALL=(ALL)       ALL

## Allows members of the 'sys' group to run networking, software,
## service management apps and more.
# %sys ALL = NETWORKING, SOFTWARE, SERVICES, STORAGE, DELEGATING, PROCESSES, LOCATE, DRIVERS

## Allows people in group wheel to run all commands
# %wheel        ALL=(ALL)       ALL

## Same thing without a password
# %wheel        ALL=(ALL)       NOPASSWD: ALL

## Allows members of the users group to mount and unmount the
## cdrom as root
# %users  ALL=/sbin/mount /mnt/cdrom, /sbin/umount /mnt/cdrom

## Allows members of the users group to shutdown this system
# %users  localhost=/sbin/shutdown -h now

## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

User_Alias      NAGIOSXI=nagios
User_Alias              NAGIOSXIWEB=apache
NAGIOSXI ALL = NOPASSWD:/etc/init.d/nagios start
NAGIOSXI ALL = NOPASSWD:/etc/init.d/nagios stop
NAGIOSXI ALL = NOPASSWD:/etc/init.d/nagios restart
NAGIOSXI ALL = NOPASSWD:/etc/init.d/nagios reload
NAGIOSXI ALL = NOPASSWD:/etc/init.d/nagios status
NAGIOSXI ALL = NOPASSWD:/etc/init.d/nagios checkconfig
NAGIOSXI ALL = NOPASSWD:/etc/init.d/ndo2db start
NAGIOSXI ALL = NOPASSWD:/etc/init.d/ndo2db stop
NAGIOSXI ALL = NOPASSWD:/etc/init.d/ndo2db restart
NAGIOSXI ALL = NOPASSWD:/etc/init.d/ndo2db reload
NAGIOSXI ALL = NOPASSWD:/etc/init.d/ndo2db status
NAGIOSXI ALL = NOPASSWD:/etc/init.d/npcd start
NAGIOSXI ALL = NOPASSWD:/etc/init.d/npcd stop
NAGIOSXI ALL = NOPASSWD:/etc/init.d/npcd restart
NAGIOSXI ALL = NOPASSWD:/etc/init.d/npcd reload
NAGIOSXI ALL = NOPASSWD:/etc/init.d/npcd status
NAGIOSXI ALL = NOPASSWD:/usr/bin/php /usr/local/nagiosxi/html/includes/components/autodiscovery/scripts/autodiscover_new.php *
NAGIOSXI ALL = NOPASSWD:/usr/local/nagiosxi/html/includes/components/profile/getprofile.sh
NAGIOSXI ALL = NOPASSWD:/usr/local/nagiosxi/scripts/upgrade_to_latest.sh
NAGIOSXI ALL = NOPASSWD:/usr/local/nagiosxi/scripts/change_timezone.sh
NAGIOSXI ALL = NOPASSWD:/usr/local/nagiosxi/scripts/manage_services.sh *
NAGIOSXI ALL = NOPASSWD:/usr/local/nagiosxi/scripts/reset_config_perms.sh
NAGIOSXI ALL = NOPASSWD:/usr/local/nagiosxi/scripts/backup_xi.sh *
NAGIOSXIWEB ALL = NOPASSWD:/usr/bin/tail -100 /var/log/messages
NAGIOSXIWEB ALL = NOPASSWD:/usr/bin/tail -100 /var/log/httpd/error_log
NAGIOSXIWEB ALL = NOPASSWD:/usr/bin/tail -100 /var/log/mysqld.log
NAGIOSXIWEB ALL = NOPASSWD:/usr/bin/php /usr/local/nagiosxi/html/includes/components/autodiscovery/scripts/autodiscover_new.php *
NAGIOSXIWEB ALL = NOPASSWD:/usr/local/nagiosxi/html/includes/components/profile/getprofile.sh
NAGIOSXIWEB ALL = NOPASSWD:/etc/init.d/snmptt restart
NAGIOSXIWEB ALL = NOPASSWD:/usr/local/nagiosxi/scripts/repair_databases.sh
NAGIOSXIWEB ALL = NOPASSWD:/usr/local/nagiosxi/scripts/manage_services.sh *

Post by **rajasegar** » Tue Apr 03, 2018 6:54 pm

Profile sent to cdienger and npolovenko.

Please raise the PM attachment size limit. The profile is larger than 1 MB and was rejected,
so I had to split it into 2 files.

Thanks.

Post by **cdienger** » Wed Apr 04, 2018 11:50 am

How many CPUs are on the system? Check /proc/cpuinfo to get a count. The load on the machine at the time of the profile was more than 5, which can be high if the number of CPUs is low. It could also be spiking higher at the times of the drops in the graph. If the load on the XI server isn't already being monitored, use the Linux Server Wizard to set up monitoring so we can see whether the load may be causing this.

The load at the time of the profile appears to be due to a couple of Java processes, which look to be started by checks like:

/usr/bin/java -cp /usr/local/nagios/libexec ..

These are running on the XI server. Can/should they be offloaded to a gearman worker?
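
Editor's note: whether a load of 5 is high depends on the number of logical CPUs, so it helps to look at load per CPU. The sketch below counts the `processor` stanzas in /proc/cpuinfo text and divides a load average by that count. The function names and the 4-CPU versus 20-CPU comparison are illustrative assumptions, not values taken from the profile.

```python
def count_processors(cpuinfo_text):
    """Count logical CPUs: /proc/cpuinfo has one 'processor : N' line per CPU."""
    return sum(
        1
        for line in cpuinfo_text.splitlines()
        if line.split(":")[0].strip() == "processor"
    )


def load_per_cpu(load_average, cpu_count):
    """Normalize a load average by the number of logical CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


# Tiny illustrative cpuinfo excerpt with two logical CPUs.
cpuinfo_sample = (
    "processor\t: 0\n"
    "model name\t: Example CPU\n"
    "\n"
    "processor\t: 1\n"
    "model name\t: Example CPU\n"
)

print(count_processors(cpuinfo_sample))
# The same load of 5 is heavy on 4 CPUs but light on 20.
print(load_per_cpu(5.0, 4))
print(load_per_cpu(5.0, 20))
```

On the server itself, `count_processors(open("/proc/cpuinfo").read())` together with `os.getloadavg()` would supply the real inputs. A value well below 1.0 per CPU suggests the CPUs are not saturated.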

Post by **rajasegar** » Wed Apr 04, 2018 6:55 pm

These Java checks have been running 24x7 for ages without any issues.
That also does not explain why the problem is intermittent.
FYI, the server is back to normal now.

Capture.JPG

20 Cores

processor       : 19
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
stepping        : 7
microcode       : 1808
cpu MHz         : 2893.028
cache size      : 20480 KB
physical id     : 0
siblings        : 20
core id         : 19
cpu cores       : 20
apicid          : 19
initial apicid  : 19
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt aes xsave avx hypervisor lahf_lm ida arat epb xsaveopt pln pts dts
bogomips        : 5786.05
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

I need to fix this problem for good, as it keeps recurring on this server.

Post by **cdienger** » Thu Apr 05, 2018 10:11 am

The code that controls this dashlet is PHP (/usr/local/nagiosxi/html/includes/components/xicore/dashlets-monitoringengine.inc.php), which can have problems displaying reports or charts if the PHP settings (particularly memory) are too low. This could be the case here, given that there appears to be a spike in the number of scheduled jobs it would need to process and that restarting the services clears it up. Follow https://support.nagios.com/kb/article/n ... e-611.html to increase the values and let us know if this helps.

Post by **rajasegar** » Thu Apr 05, 2018 10:26 pm

No difference. There are no errors in the error_log file.
BTW, mod_gearman is disabled. I commented out its line in nagios.cfg.
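
Editor's note: to confirm which event broker modules are actually active after commenting a line out, the uncommented `broker_module` directives in nagios.cfg can be listed. This is a minimal sketch; the helper name `active_broker_modules` is hypothetical, and the module paths in the sample are illustrative assumptions rather than lines from this server's config.

```python
def active_broker_modules(cfg_text):
    """Return the values of broker_module lines that are not commented out."""
    modules = []
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "broker_module":
            modules.append(value.strip())
    return modules


# Illustrative config fragment (paths are assumptions, not from the thread).
cfg_sample = """
log_file=/usr/local/nagios/var/nagios.log
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
#broker_module=/usr/lib64/mod_gearman/mod_gearman.o config=/etc/mod_gearman/mod_gearman_neb.conf
"""

for module in active_broker_modules(cfg_sample):
    print(module)
```

On an XI server the input would typically be read from the main config file, commonly /usr/local/nagios/etc/nagios.cfg, though that path is also an assumption here.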

After a few restarts, it started acting up again.

Capture.JPG

Capture.1.JPG

profile_XI1.zip

Nagios Support Forum
