false critical alerts on all hosts

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
rnjie
Posts: 157
Joined: Wed Mar 20, 2019 4:59 pm

Post by rnjie »

We have been having this issue for a while now. A /var alert pops up on every host in a group called network, and they all show critical. These hosts were never set up to monitor /var: they are all network switches, so there is no file system monitoring on them.
I applied a temporary fix that cleared the alerts by deleting the config, writing it, and restarting Nagios from the UI. The alerts cleared, but a few hours later they came back, and we now have over 500 /var critical alerts. They cannot be deleted in the config file manager because they do not show up there, yet we still see the alerts. It is getting frustrating, and I hope to get a solution as soon as possible, because these alerts also seem to increase the load average on the server. I am not sure why.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: false critical alerts on all hosts

Post by scottwilkerson »

Can you PM me your system profile (Admin -> System profile), along with the name of one of these hosts and the specific service name?

Thanks
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart
rnjie
Posts: 157
Joined: Wed Mar 20, 2019 4:59 pm

Re: false critical alerts on all hosts

Post by rnjie »

I managed to clear the alerts again, but I know they will come back. Please see the attached profile. I will also send you a screenshot I took of some of the hosts that have the issue.
profile.zip
rnjie
Posts: 157
Joined: Wed Mar 20, 2019 4:59 pm

Re: false critical alerts on all hosts

Post by rnjie »

See screenshot.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: false critical alerts on all hosts

Post by scottwilkerson »

Go to CCM -> Services and
select the TPCDAL-BIGIP-01.transplace.com config from the drop-down.

If you look at the /var Disk Usage service, under hostgroups you will see that Network has been added.

When you add a hostgroup to a service, ALL members of the hostgroup get the service added.

Looking at the config for this host, it appears that the Network hostgroup was added to many or all of them, along with another hostgroup.

These should be removed, or you will keep seeing this behavior.

Once they are removed, Apply Configuration.

This should resolve the issue.
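
To illustrate why one hostgroup on one service produces hundreds of alerts, here is a minimal Python sketch of the expansion described above. It is not Nagios code: `expand_services` and the `switch-01` / `switch-02` host names are hypothetical, and the dictionaries only imitate the shape of a service definition.

```python
from collections import defaultdict


def expand_services(services, hostgroups):
    """Return {host: sorted service descriptions} after hostgroup expansion.

    A service applies to every host it names directly plus every member
    of every hostgroup attached to it.
    """
    assigned = defaultdict(set)
    for svc in services:
        targets = set(svc.get("hosts", []))
        for group in svc.get("hostgroups", []):
            targets.update(hostgroups.get(group, []))
        for host in targets:
            assigned[host].add(svc["description"])
    return {host: sorted(descs) for host, descs in assigned.items()}


# Hypothetical hostgroup: two switches plus the host from this thread.
hostgroups = {
    "Network": ["switch-01", "switch-02", "TPCDAL-BIGIP-01.transplace.com"],
}

# The service was meant for one host, but the Network hostgroup is attached too.
with_group = [{
    "description": "/var Disk Usage",
    "hosts": ["TPCDAL-BIGIP-01.transplace.com"],
    "hostgroups": ["Network"],
}]

# The same service after removing the hostgroup from it.
without_group = [{
    "description": "/var Disk Usage",
    "hosts": ["TPCDAL-BIGIP-01.transplace.com"],
    "hostgroups": [],
}]

before = expand_services(with_group, hostgroups)
after = expand_services(without_group, hostgroups)
print(sorted(before))  # every member of Network now carries the service
print(sorted(after))   # only the host that was named directly
```

With the hostgroup attached, every switch in Network inherits the /var check even though no such service is visible under the switch itself, which would match the symptom of alerts that cannot be found on the affected hosts.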
rnjie
Posts: 157
Joined: Wed Mar 20, 2019 4:59 pm

Re: false critical alerts on all hosts

Post by rnjie »

I noticed this host before when I checked the system for errors, and I decided to delete the host and its services completely. That fixed the issue, but it came back hours later. I have deleted TPCDAL-BIGIP-01.transplace.com about 4 times, and each time it comes back hours later. I have now tried your method and removed the hostgroup Network from the service, and it cleared the alerts. I will observe and see if they come back.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: false critical alerts on all hosts

Post by scottwilkerson »

rnjie wrote: I noticed this host before when I checked the system for errors, and I decided to delete the host and its services completely. That fixed the issue, but it came back hours later. I have deleted TPCDAL-BIGIP-01.transplace.com about 4 times, and each time it comes back hours later. I have now tried your method and removed the hostgroup Network from the service, and it cleared the alerts. I will observe and see if they come back.
It would be really weird for a deleted host configuration to come back. Is there someone who could be restoring a config from an earlier time?
rnjie
Posts: 157
Joined: Wed Mar 20, 2019 4:59 pm

Re: false critical alerts on all hosts

Post by rnjie »

I thought it was weird too. Your method worked and I haven't gotten the alerts again, but the load average on the VM is still very high. Is there a way to figure out what resources I would need to support the load, based on the profile I sent you before?
It currently has 18 CPU cores and 10 GB of memory. All of these resources are below 30% used, but we are still getting high load averages on the system and backed-up processes. I am wondering whether this has anything to do with Nagios, because the top processes are mysqld and httpd.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: false critical alerts on all hosts

Post by scottwilkerson »

The profile you sent before showed this in top.txt:

load average: 3.18, 4.21, 4.83
On a system with 16 CPU cores this isn't a high load average; there is no waiting taking place at all.
If you were regularly sustaining a load above 15, that would be concerning to me.
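
The reasoning above can be checked with a little arithmetic: divide each load average by the core count, and treat values approaching 1.0 per core as the point where work starts to queue. The sketch below is a rough rule of thumb rather than a Nagios tool; `load_per_core` is a hypothetical helper, and it uses the 16 cores quoted in this reply (the earlier post mentioned 18).

```python
def load_per_core(load_averages, cpu_cores):
    """Divide each load average (1, 5, 15 minute) by the number of cores."""
    if cpu_cores <= 0:
        raise ValueError("cpu_cores must be positive")
    return tuple(round(load / cpu_cores, 2) for load in load_averages)


# Values from top.txt in the profile, with the core count used in the reply.
reported = (3.18, 4.21, 4.83)
per_core = load_per_core(reported, 16)
print(per_core)

# The level described as concerning: a sustained load above 15 on 16 cores.
concerning = load_per_core((15.0, 15.0, 15.0), 16)
print(concerning)
```

On a live Linux host the same inputs are available from the standard library through `os.getloadavg()` and `os.cpu_count()`.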
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: false critical alerts on all hosts

Post by scottwilkerson »

One thing worth pointing out: you said this is a VM. If you have over-provisioned all of your VMs with a large number of CPUs, the hypervisor can end up spending a considerable amount of CPU resources just determining which processor to run each operation on.