false critical alerts on all hosts
We have been having this issue for a while now. A /var alert keeps popping up as critical on every host in a group called network. None of these hosts should be monitoring /var: they are all network switches, so there is no file system monitoring for them.
I did a temporary fix that cleared the alerts by deleting the config, writing it, and restarting Nagios from the UI. That cleared them, but after a few hours they came back, and we now have over 500 /var critical alerts. They cannot be deleted in the config file manager because the services are not visible there, yet we still see the alerts. It's getting frustrating, and I hope I can get a solution as soon as possible, because these alerts also seem to increase the load average on the server (not sure why).
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: false critical alerts on all hosts
Can you PM me your system profile (Admin -> System Profile) along with the name of one of these hosts and the specific service name?
Thanks
Re: false critical alerts on all hosts
I managed to clear the alerts again, but I know they will come back. Please see the attached profile; I will also send you a screenshot I took of some of the hosts that have issues.
Re: false critical alerts on all hosts
see screenshot
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: false critical alerts on all hosts
If you go to CCM -> Services and select the TPCDAL-BIGIP-01.transplace.com config from the drop-down, then look at the /var Disk Usage service, you will see under hostgroups that Network has been added.
When you add a hostgroup to a service, ALL members of that hostgroup get the service added.
Looking at the config for this host, it appears that the Network hostgroup was added to many/all of its services, along with another hostgroup.
These should be removed, or you will keep seeing this behavior.
Once they are removed, Apply Configuration.
This should resolve the issue.
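For anyone who wants to check this outside the CCM, here is a minimal Python sketch that scans Nagios object configuration text for service definitions that list a given hostgroup. The directives `hostgroup_name`, `host_name` and `service_description` are standard Nagios object directives, but the `SAMPLE` text, the function names and the exact layout of this host's config are assumptions made for illustration; the real definitions on this system may look different.

```python
import re

# Matches one "define service { ... }" block in Nagios object configuration.
SERVICE_BLOCK = re.compile(r"define\s+service\s*\{(.*?)\}", re.DOTALL)


def parse_directives(body):
    """Turn the body of a define block into a {directive: value} dict."""
    directives = {}
    for line in body.splitlines():
        line = line.split(";", 1)[0].strip()  # drop inline ';' comments
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            directives[parts[0]] = parts[1].strip()
    return directives


def services_using_hostgroup(config_text, hostgroup):
    """Return descriptions of services whose hostgroup_name lists `hostgroup`."""
    found = []
    for match in SERVICE_BLOCK.finditer(config_text):
        directives = parse_directives(match.group(1))
        groups = [g.strip() for g in directives.get("hostgroup_name", "").split(",")]
        if hostgroup in groups:
            found.append(directives.get("service_description", "(no description)"))
    return found


# Hypothetical config text, written only to illustrate the situation described above.
SAMPLE = """
define service {
    host_name            TPCDAL-BIGIP-01.transplace.com
    hostgroup_name       Network
    service_description  /var Disk Usage
}
define service {
    host_name            TPCDAL-BIGIP-01.transplace.com
    service_description  Ping
}
"""

print(services_using_hostgroup(SAMPLE, "Network"))
```

Any service this reports is applied to every member of the hostgroup, not only to the host whose config file it sits in, which is why the switches picked up a /var check.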
Re: false critical alerts on all hosts
I noticed this host before when I checked the system for errors, and I decided to delete the host and its services completely. That fixed the issue, but it came back hours later. I have deleted TPCDAL-BIGIP-01.transplace.com about 4 times, and hours later it comes back. I have now tried your method and removed the Network hostgroup from the service, and that cleared the alerts. I will observe and see if they come back.
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: false critical alerts on all hosts
rnjie wrote: i noticed this host before when i checked for errors on the system and i decided to delete the host and its services completely which fixed the issue but then it came back hours later, i have deleted TPCDAL-BIGIP-01.transplace.com about 4 times and hours later it comes back, i have tried your own method and removed the host group network from the service and it cleared the alerts, i will observe and see if it comes back
It would be really weird for a deleted host configuration to come back. Is there someone who could be restoring a config from an earlier time?
Re: false critical alerts on all hosts
I thought it was weird too. Your method worked and I haven't gotten the alerts again, but the load average on the VM is still very high. Is there a way to figure out what resources I would need to support the load, based on the profile I sent you before?
It currently has 18 CPU cores and 10 GB of memory. Usage of both is below 30%, but we are still seeing high load averages on the system and backed-up processes. I am just wondering whether this has anything to do with Nagios, because the top processes are mysqld and httpd.
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: false critical alerts on all hosts
The profile you sent before showed this in the top.txt:
Code:
load average: 3.18, 4.21, 4.83
On a system with 16 CPU cores this isn't a high load average; there is no waiting taking place at all.
If you were regularly sustaining a load above 15, that would be concerning to me.
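To make that comparison concrete, here is a small Python sketch that divides a load average by the core count. The three figures and the 16-core count are the ones quoted in this post; the helper name `load_per_core` is just illustrative, and a per-core value well under 1.0 is a common rule of thumb rather than a hard limit.

```python
import os


def load_per_core(load_average, cpu_cores):
    """Load average divided by the number of CPU cores."""
    return load_average / cpu_cores


# Figures quoted from top.txt in this thread, on a 16-core system.
for load in (3.18, 4.21, 4.83):
    print(f"{load:.2f} / 16 cores = {load_per_core(load, 16):.2f} per core")

# The same calculation for the machine this script runs on (Unix-like systems only).
try:
    one_minute, five_minutes, fifteen_minutes = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"local 15-minute load per core: {load_per_core(fifteen_minutes, cores):.2f}")
except (AttributeError, OSError):
    print("load average is not available on this platform")
```

With the numbers above, each core has well under one runnable process on average, which matches the point that nothing is waiting for CPU.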
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: false critical alerts on all hosts
One thing worth pointing out: you said this is a VM. If you have over-provisioned all of your VMs with a large number of CPUs, the hypervisor can end up spending a considerable amount of CPU resources just determining which processor to run each operation on.
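One rough way to look for this from inside a Linux guest is the "steal" counter in /proc/stat, which is the time the guest was ready to run but the hypervisor did not give it a physical CPU. The sketch below is only an assumption about how you might check it: not every hypervisor reports steal time to the guest, the counters are cumulative since boot, and the sample line is made up for illustration. The hypervisor's own scheduling metrics are the more reliable source.

```python
def steal_fraction(cpu_line):
    """Share of CPU time reported as 'steal' in the aggregate 'cpu' line of /proc/stat."""
    fields = cpu_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    values = [int(v) for v in fields[1:]]
    # Column order: user nice system idle iowait irq softirq steal [guest guest_nice]
    if len(values) < 8:
        return 0.0  # older kernels do not report steal at all
    total = sum(values[:8])  # guest time is already counted inside user/nice
    return values[7] / total if total else 0.0


# Made-up sample line, for illustration only.
SAMPLE_LINE = "cpu  4000 0 1000 14000 500 0 0 500 0 0"
print(f"sample steal share: {steal_fraction(SAMPLE_LINE):.1%}")

# Live reading on a Linux guest; skipped quietly elsewhere.
try:
    with open("/proc/stat") as stat_file:
        print(f"steal share since boot: {steal_fraction(stat_file.readline()):.1%}")
except OSError:
    print("/proc/stat is not available on this system")
```

A steal share that stays near zero does not rule out scheduling contention on the host, so treat this as a first look only.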