service already deleted in CCM but Nagios still monitoring?!

Post by **tgriep** » Tue Feb 04, 2020 3:18 pm

Do you know which process or processes were using the most CPU when the issue was happening?

Do you know if anyone was doing any maintenance on the server?
Was anyone creating new checks and applying the configuration?

When you have to log in to the server to do something manually, what exactly are you doing?

Post by **xpertech** » Fri Feb 14, 2020 1:50 am

tgriep wrote: Do you know which process or processes were using the most CPU when the issue was happening?
Please refer to the attachments (top_0204.txt, check20200204.log).

Do you know if anyone was doing any maintenance on the server?
That Nagios XI host is a backup server, so no one connects to it.

Was anyone creating new checks and applying the configuration?
No one connects to that server until a high-CPU alert is received.

When you have to log in to the server to do something manually, what exactly are you doing?
No one connects to that server until a high-CPU alert is received. Then we log in to the server and manually follow the steps you recommended.

Could Nagios XI itself be doing something that causes this?
On 04 Feb the CPU was high again (around 12:50 to 13:30). We had to perform the same steps manually, after which the CPU usage went down.

Also on 04 Feb, these errors were logged:
Feb 04 12:54:40 twtpelnag02p nagios[16961]: job 220618 (pid=12444): read() returned error 11
Feb 04 12:54:40 twtpelnag02p nagios[16969]: job 220900 (pid=21121): read() returned error 11
Feb 04 12:54:40 twtpelnag02p nagios[16965]: job 220840 (pid=16515): read() returned error 11
Feb 04 12:54:40 twtpelnag02p nagios[16965]: job 220904 (pid=16832): read() returned error 11
Feb 04 12:54:40 twtpelnag02p nagios[16965]: job 220904 (pid=16832): read() returned error 11
Feb 04 12:54:40 twtpelnag02p nagios[16961]: job 220622 (pid=12448): read() returned error 11
Feb 04 12:54:40 twtpelnag02p nagios[16961]: job 220622 (pid=12448): read() returned error 11
Feb 04 12:54:40 twtpelnag02p nagios[16961]: job 220622 (pid=12448): read() returned error 11
Feb 04 12:54:40 twtpelnag02p nagios[16961]: job 220622 (pid=12448): read() returned error 11
Feb 04 12:54:41 twtpelnag02p nagios[16961]: job 220622 (pid=12448): read() returned error 11

We found these error messages almost every day (refer to the attachment "returned error11"):
Feb 2 03:25:17 twtpelnag02p nagios: job 345752 (pid=20093): read() returned error 11
Feb 2 03:25:17 twtpelnag02p nagios: job 345752 (pid=20093): read() returned error 11
Feb 2 03:25:17 twtpelnag02p nagios: job 345752 (pid=20098): read() returned error 11
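
For context, error number 11 on Linux is EAGAIN ("Resource temporarily unavailable"). To see how often these lines appear per day, something like the following sketch could be used. The function name `count_eagain_per_day` is made up for illustration, and it assumes the log uses the syslog layout shown above (month, day, time at the start of each line).

```shell
#!/bin/sh
# Count "read() returned error 11" lines per day in a syslog-style file.
# awk splits on runs of whitespace, so $1 is the month and $2 is the day.
# Note: "Feb 04" and "Feb 4" are counted as different keys, because the
# day field is used exactly as it is written in the log.
count_eagain_per_day() {
    # $1: path to the log file, for example /var/log/messages
    awk '/read\(\) returned error 11/ { n[$1 " " $2]++ }
         END { for (d in n) print d, n[d] }' "$1" | sort
}
```

Usage would be `count_eagain_per_day /var/log/messages`, which prints one line per day in the form "month day count".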

I have sent the profile to you by PM.

Post by **tgriep** » Fri Feb 14, 2020 10:16 am

Thanks for the profile. I could not find anything conclusive about the cause, except for the following.
I see an error writing to a MySQL table, so that needs to be investigated.

Open a root shell on the Nagios server and run the following commands. Then get the /tmp/info.txt file and upload it to the forum.

Code:

mysql -u root -pnagiosxi -e "show global status like '%used_connections%'; show variables like 'max_connections';" >/tmp/info.txt
echo "SELECT table_schema as 'Database', table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES ORDER BY (data_length + index_length) DESC;" |mysql -t -u root -pnagiosxi >>/tmp/info.txt
echo 'select * from nagios_conninfo;' |mysql -t -u root -pnagiosxi nagios >>/tmp/info.txt
echo 'desc nagios_conninfo;' |mysql -t -u root -pnagiosxi nagios >>/tmp/info.txt

The other thing is that when the system first comes under load, I think VMware throttles the CPUs available to the VM, which makes things worse because the system cannot get back to running at full speed. When you stop the processes, the load drops, the VMware limit is lifted, and the system runs normally.
See if you can increase the limit imposed by the VMware settings.

If it happens again, capture the profile while the issue is occurring, if possible, and we may see more information on the cause.
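
In addition to the profile, a quick snapshot of the load and the busiest processes can be saved while the CPU is high. The following is only a rough sketch: the function name `snapshot_load` and the default directory `/tmp/nagios_load_snapshots` are made up for illustration, and it assumes the standard Linux `ps` and `top` from procps are installed.

```shell
#!/bin/sh
# Save a point-in-time snapshot of system load and the busiest processes.
# The default output directory is a hypothetical location; pass another
# directory as the first argument to override it.
snapshot_load() {
    out_dir=${1:-/tmp/nagios_load_snapshots}
    mkdir -p "$out_dir" || return 1
    out_file="$out_dir/snapshot_$(date +%Y%m%d_%H%M%S).txt"
    {
        echo "=== date ==="
        date
        echo "=== uptime ==="
        uptime
        echo "=== top 15 processes by CPU ==="
        # Header line plus 15 processes, sorted by CPU usage (highest first)
        ps -eo pid,ppid,pcpu,pmem,etime,args --sort=-pcpu | head -n 16
        echo "=== top (batch mode, one iteration) ==="
        top -b -n 1 | head -n 30
    } > "$out_file" 2>&1
    # Print the path so the caller knows which file to upload
    echo "$out_file"
}
```

Running `snapshot_load` as root a few times during the incident, before stopping any processes, gives files that can be attached to the forum post together with the profile.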

Post by **xpertech** » Thu Feb 20, 2020 8:50 am

tgriep wrote: Thanks for the profile. I could not find anything conclusive about the cause, except for the following.
I see an error writing to a MySQL table, so that needs to be investigated.

Open a root shell on the Nagios server and run the following commands. Then get the /tmp/info.txt file and upload it to the forum.
Code:
mysql -u root -pnagiosxi -e "show global status like '%used_connections%'; show variables like 'max_connections';" >/tmp/info.txt
echo "SELECT table_schema as 'Database', table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES ORDER BY (data_length + index_length) DESC;" |mysql -t -u root -pnagiosxi >>/tmp/info.txt
echo 'select * from nagios_conninfo;' |mysql -t -u root -pnagiosxi nagios >>/tmp/info.txt
echo 'desc nagios_conninfo;' |mysql -t -u root -pnagiosxi nagios >>/tmp/info.txt
The other thing is that when the system first comes under load, I think VMware throttles the CPUs available to the VM, which makes things worse because the system cannot get back to running at full speed. When you stop the processes, the load drops, the VMware limit is lifted, and the system runs normally.
See if you can increase the limit imposed by the VMware settings.

If it happens again, capture the profile while the issue is occurring, if possible, and we may see more information on the cause.

Regarding your point that VMware may be affecting Nagios XI: could you provide more details, or examples from other cases, so that we can explain it to the maintainer of the VMware system?

When it happens again, what steps can we take to collect information for troubleshooting?

Post by **tgriep** » Thu Feb 20, 2020 11:09 am

These are the messages to look for in the /var/log/messages file to check whether VMware is throttling the VM.

Feb 4 13:07:26 twtpelnag02p kernel: NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [kworker/6:8:24996]
Feb 4 13:07:26 twtpelnag02p kernel: Modules linked in: binfmt_misc ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vmw_vsock_vmci_transport vsock coretemp iosf_mbi ppdev crc32_pclmul ghash_clmulni_intel vmw_balloon aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg joydev pcspkr parport_pc parport i2c_piix4 vmw_vmci ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi vmwgfx drm_kms_helper syscopyarea
Feb 4 13:07:26 twtpelnag02p kernel: sysfillrect sysimgblt fb_sys_fops ttm drm mptspi scsi_transport_spi ata_piix crct10dif_pclmul crct10dif_common crc32c_intel mptscsih libata e1000 serio_raw mptbase drm_panel_orientation_quirks floppy dm_mirror dm_region_hash dm_log dm_mod
Feb 4 13:07:26 twtpelnag02p kernel: CPU: 6 PID: 24996 Comm: kworker/6:8 Kdump: loaded Tainted: G W L ------------ 3.10.0-1062.4.1.el7.x86_64 #1
Feb 4 13:07:26 twtpelnag02p kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015

Your screen capture of the search does not match what you are looking for so nothing comes up.
Just use GREGEMAIL in the search field.

The database information looks good so thanks for providing that.

Post by **xpertech** » Tue Mar 03, 2020 11:15 am

On 14 Feb you mentioned: "Next time if it happens, get the profile while the issue is happening if possible and we may see more information on the cause." Apart from getting the profile, what else can we do for troubleshooting?

Post by **tgriep** » Tue Mar 03, 2020 1:29 pm

When the system comes under load, get the system profile and check the /var/log/messages file for the messages that I posted on February 20th.
Look for kernel messages similar to the following.

kernel: NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [kworker/6:8:24996]
kernel: CPU: 6 PID: 24996 Comm: kworker/6:8 Kdump: loaded Tainted: G W L ------------ 3.10.0-1062.4.1.el7.x86_64 #1
kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
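
A search for these kernel messages could look roughly like the sketch below. The function names `find_soft_lockups` and `count_soft_lockups_per_cpu` are made up for illustration, and the pattern assumes the message wording shown above.

```shell
#!/bin/sh
# Print every kernel soft lockup line from a syslog-style file.
find_soft_lockups() {
    # $1: path to the log file, for example /var/log/messages
    grep -E 'kernel: .*soft lockup - CPU#[0-9]+ stuck for [0-9]+s' "$1"
}

# Summarize how many soft lockup messages were logged per CPU.
# Output format: "CPU#<n> <count>", one line per CPU.
count_soft_lockups_per_cpu() {
    grep -oE 'soft lockup - CPU#[0-9]+' "$1" | sort | uniq -c |
        awk '{ print $NF, $1 }'
}
```

For example, `find_soft_lockups /var/log/messages` lists the matching lines with their timestamps, which can then be compared with the times of the high-CPU alerts.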

Nagios Support Forum
