Gaps in alerting data and failed alerts
Posted: Mon Sep 21, 2015 12:49 pm
Good morning to some and afternoon to others. I've got yet another quandary.
I have a few alerts set up to check that a specific web request is coming in regularly. This verifies that my web backend is working and also lets me graph the frequency of requests.
So I've got three alerts set up.
Each first checks for the correct type (apache_error OR apache_access)
Then checks for the appropriate host (host:hostname.tld)
Then checks the request field for the specific request (/mailbox in one and /refill.php in the other)
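As a rough sketch, the three filters above could be combined into a single Lucene-style query string. The field names type, host, and request come from the description above; the helper name build_query and the exact syntax are assumptions for illustration and may not match what the alert actually stores.

```python
def build_query(types, host, request_path):
    """Combine the type, host, and request filters into one query string.

    Hypothetical helper: the real alert query may be structured differently.
    """
    type_clause = " OR ".join(types)
    # The request path is quoted because "/" is a special character
    # in Lucene query syntax.
    return f'type:({type_clause}) AND host:{host} AND request:"{request_path}"'


print(build_query(["apache_error", "apache_access"], "hostname.tld", "/refill.php"))
```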
Each is set to check every 5 minutes, with both the critical and warning thresholds set to 1:.
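For reference, a threshold of 1: in the standard Nagios plugin range format means "alert when the value is below 1", so a search returning 0 records goes critical. The sketch below is a simplified reimplementation of that range logic for illustration only; is_alert is a hypothetical name, not the code the log server runs.

```python
def is_alert(value, range_spec):
    """Return True if value should alert for a Nagios-style range.

    Simplified sketch of the plugin range format:
      "10"     alert if value < 0 or value > 10
      "10:"    alert if value < 10
      "~:10"   alert if value > 10
      "10:20"  alert if value is outside 10..20
      "@10:20" alert if value is inside 10..20
    """
    inside = range_spec.startswith("@")
    spec = range_spec[1:] if inside else range_spec
    if ":" in spec:
        start_s, end_s = spec.split(":", 1)
    else:
        start_s, end_s = "0", spec
    start = float("-inf") if start_s == "~" else float(start_s or 0)
    end = float("inf") if end_s == "" else float(end_s)
    in_range = start <= value <= end
    return in_range if inside else not in_range


print(is_alert(0, "1:"))  # 0 matching records
print(is_alert(3, "1:"))  # 3 matching records
```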
The /refill.php check has a lookback period of 12 hours, to make sure there has been at least one successful POST request to it in the last 12 hours (there always should be).
The problem I have is that starting at 8pm, and continuing sometimes until 8-10am the following day, these checks fail, saying 0 records found. If I click to view the alert on a dashboard, I can clearly see that there are requests coming in. This morning, on a whim, I changed the lookback period from 12h to 1h and re-ran the alert. It immediately came back with the results I expected and was no longer critical. When I changed the lookback period back to 12h, it failed again, returning 0 results. When I changed it to 10 hours, it actually had results. It almost seems like something goes wrong with the search when the index rolls over: if the search crosses an index boundary, it fails. The failures start at 8pm (GMT -04:00), which is midnight GMT. Based on what I've read, I'd expect the indexes to roll over then, because ES works on GMT.
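To illustrate the rollover timing, here is a minimal sketch that lists which daily indices a lookback window would touch. It assumes daily indices named logstash-YYYY.MM.DD cut at midnight UTC; that naming, the prefix parameter, and the function name indices_for_lookback are assumptions for illustration, not something confirmed for this setup.

```python
from datetime import datetime, timedelta, timezone


def indices_for_lookback(now_utc, lookback, prefix="logstash-"):
    """List the daily (UTC) index names a lookback window would span.

    Assumes one index per UTC day, named prefix + YYYY.MM.DD.
    """
    start = now_utc - lookback
    day = start.date()
    names = []
    while day <= now_utc.date():
        names.append(f"{prefix}{day:%Y.%m.%d}")
        day += timedelta(days=1)
    return names


# 9pm at GMT -04:00 is 01:00 UTC the next day, one hour after the rollover.
local_tz = timezone(timedelta(hours=-4))
now_utc = datetime(2015, 9, 20, 21, 0, tzinfo=local_tz).astimezone(timezone.utc)

print(indices_for_lookback(now_utc, timedelta(hours=12)))
print(indices_for_lookback(now_utc, timedelta(hours=1)))
```

In this example the 12h window reaches back into the previous day's index, while the 1h window stays entirely inside the current one.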
Here are some examples of my graphs in Nagios XI, which are created from the NRDP data sent from the log server. Any ideas?
--
Wayne