false critical alerts on all hosts

rnjie · Post by **rnjie** » Mon Aug 24, 2020 1:48 pm

We have been having this issue for a while now. A /var alert pops up on every host in a group called network, and all of them show critical. None of these hosts were ever set up to monitor /var: they are all network switches, so there is no file system monitoring for them.
I applied a temporary fix that cleared the alerts: deleting the config, writing it again, and restarting Nagios from the UI. A few hours later the alerts came back, and we now have over 500 /var critical alerts. They cannot be deleted in the config file manager because they cannot be seen there, yet we still see the alerts. It is getting frustrating, and I hope to get a solution as soon as possible, because these alerts also cause the load average on the server to increase, and I am not sure why.

scottwilkerson · Post by **scottwilkerson** » Mon Aug 24, 2020 2:48 pm

Can you PM me your system profile (Admin -> System Profile), along with the name of one of these hosts and the specific service name?

Thanks

rnjie · Post by **rnjie** » Mon Aug 24, 2020 4:45 pm

I managed to clear the alerts again, but I know they will come back. Please see the attached profile; I will also send you a screenshot I took of some of the hosts that have the issue.

profile.zip

rnjie · Post by **rnjie** » Mon Aug 24, 2020 4:46 pm

see screenshot

scottwilkerson · Post by **scottwilkerson** » Mon Aug 24, 2020 4:59 pm

If you go to CCM -> Services,
select the TPCDAL-BIGIP-01.transplace.com config from the drop-down.

If you look at the /var Disk Usage service, under hostgroups you will see Network has been added.

When you add a hostgroup to a service, ALL members of that hostgroup get the service added.

Looking at the config for this host, it appears that the Network hostgroup was added to many/all of them along with another hostgroup

These should be removed, or you will keep getting the behavior you are seeing.

Once they are removed, Apply Configuration.

This should resolve the issue
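The hostgroup behavior described above can be illustrated with a rough Python sketch. This is a simplified model, not Nagios code: `expand_service` is a hypothetical helper, the `switch-0x` names are made-up placeholders, and only the `host_name` and `hostgroup_name` directives of a service definition are modeled.

```python
# Simplified, hypothetical model of how a service definition fans out to hosts.
# Not Nagios code: it only mimics the host_name / hostgroup_name directives.

def expand_service(service, hostgroups):
    """Return the sorted list of hosts that would receive this service."""
    hosts = set(service.get("host_name", []))
    for group in service.get("hostgroup_name", []):
        # Every member of an attached hostgroup gets the service too.
        hosts.update(hostgroups.get(group, []))
    return sorted(hosts)

# Placeholder members; the real Network hostgroup holds the switches.
hostgroups = {"Network": ["switch-01", "switch-02", "switch-03"]}

var_disk = {
    "service_description": "/var Disk Usage",
    "host_name": ["TPCDAL-BIGIP-01.transplace.com"],
    "hostgroup_name": ["Network"],  # the accidental addition
}

# With the hostgroup attached, every switch inherits the /var check.
print(len(expand_service(var_disk, hostgroups)))

# With the hostgroup removed, only the intended host keeps it.
fixed = dict(var_disk, hostgroup_name=[])
print(expand_service(fixed, hostgroups))
```

In this model, one stray hostgroup on a single service is enough to create the check on every member, which would match the hundreds of /var alerts on switches.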

rnjie · Post by **rnjie** » Tue Aug 25, 2020 8:48 am

I noticed this host before when I checked the system for errors, and I decided to delete the host and its services completely. That fixed the issue, but it came back hours later. I have deleted TPCDAL-BIGIP-01.transplace.com about 4 times, and hours later it comes back. I have now tried your method and removed the hostgroup Network from the service, and that cleared the alerts. I will observe and see if they come back.

scottwilkerson · Post by **scottwilkerson** » Tue Aug 25, 2020 9:23 am

rnjie wrote: I noticed this host before when I checked the system for errors, and I decided to delete the host and its services completely. That fixed the issue, but it came back hours later. I have deleted TPCDAL-BIGIP-01.transplace.com about 4 times, and hours later it comes back. I have now tried your method and removed the hostgroup Network from the service, and that cleared the alerts. I will observe and see if they come back.

It would be really weird for a deleted host configuration to come back. Is there someone that could be restoring a config from an earlier time?

rnjie · Post by **rnjie** » Fri Aug 28, 2020 2:40 pm

I thought it was weird too. Your method worked and I haven't gotten the alerts again, but the load average on the VM is still very high. Is there a way to figure out what resources I would need to support the load, based on the profile I sent you before?
It currently has 18 CPU cores and 10 GB of memory. All of these resources are below 30% used, but we are still getting high load averages on the system and backed-up processes. I am just wondering if this has anything to do with Nagios, because the top processes are mysqld and httpd.

scottwilkerson · Post by **scottwilkerson** » Fri Aug 28, 2020 2:54 pm

The profile you sent before showed this in the top.txt

load average: 3.18, 4.21, 4.83

On a system with 16 CPU cores this isn't a high load average; there is no waiting taking place at all.
If you were regularly sustaining a load above 15, that would be concerning to me.
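As a rough way to sanity-check this, the load average can be divided by the number of cores. The sketch below is only illustrative: `per_core` and `current_load_per_core` are hypothetical helper names, the figures are the ones quoted from top.txt above, and the 16-core count is taken from this reply.

```python
import os

def per_core(load_averages, cores):
    """Divide each load average (1, 5, 15 min) by the core count."""
    return [round(load / cores, 2) for load in load_averages]

def current_load_per_core():
    """Same ratio for the live system (Unix only; not called here)."""
    return per_core(os.getloadavg(), os.cpu_count())

# Values from the top.txt excerpt, on the 16 cores mentioned above.
ratios = per_core((3.18, 4.21, 4.83), 16)
print(ratios)

# A sustained load of 15 on 16 cores sits close to one runnable process per core.
print(per_core((15.0,), 16))
```

Ratios well below 1.0 suggest that processes are not queuing for CPU; values approaching or exceeding 1.0 over a sustained period are where waiting starts to show up.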

scottwilkerson · Post by **scottwilkerson** » Fri Aug 28, 2020 2:55 pm

One thing worth pointing out: you said this is a VM. If you have over-provisioned all of your VMs with a large number of CPUs, this can make the hypervisor spend a considerable amount of CPU resources just determining which processor to run each operation on.

Nagios Support Forum
