
WMI checks with unknown status

Posted: Thu Aug 27, 2015 10:20 am
by CFT6Server
We are using WMI for our Windows service checks, and there are a number of hosts that show odd behavior for CPU checks. This seems to happen only with CPU; other WMI checks are fine. I am getting an UNKNOWN result with the message "Collecting first WMI sample because the previously stored state data has expired. Results will be shown the next time the plugin runs." The odd thing is that the checks worked at some point, as shown in the graph.
cpu gaps.png
Is this due to the mod gearman setup? We have one XI server with 3 mod gearman workers.
cpu checks.JPG

Re: WMI checks with unknown status

Posted: Thu Aug 27, 2015 10:24 am
by BanditBBS
Have you changed anything in your setup recently (e.g. added some gearman workers)?

I remember having this issue back when I was using WMI. It happened because the checks were being split between multiple workers, so I eventually just had all the CPU checks go to one specific worker.

Re: WMI checks with unknown status

Posted: Thu Aug 27, 2015 10:30 am
by CFT6Server
Hmm, it could be challenging to consolidate specific checks onto a particular worker. We have thousands of checks, and the reason for the workers is to offload the work. If we put all CPU checks on one particular box, wouldn't that defeat the purpose of having the workload distributed?

Re: WMI checks with unknown status

Posted: Thu Aug 27, 2015 10:36 am
by BanditBBS
CFT6Server wrote: Hmm, it could be challenging to consolidate specific checks onto a particular worker. We have thousands of checks, and the reason for the workers is to offload the work. If we put all CPU checks on one particular box, wouldn't that defeat the purpose of having the workload distributed?
Yes, it would defeat the purpose. Luckily I wasn't monitoring all that many Windows servers when I chose to take the easy way out. I can't recall exactly, as it's been a while since I messed with WMI, but I think you can specify an expire value for the CPU check. This still isn't a perfect solution though. The first check may be on worker1, the second on worker2 and the third on worker3. All of those would give you the unknown result. Then check 4 may go to worker2, and check 5 maybe the same. Then check 6 goes to worker1, and the check may work if you adjust the expire value, but it still isn't going to be great data, because the time between checks on that worker may be 30 minutes, so it wouldn't give you the true usage over the past 5 minutes like you want if you're checking every 5 minutes.

So basically, there are pros and cons to both ways, unfortunately. Maybe someone else can offer a better hint or idea.
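
The worker scenario described above can be sketched as a small simulation. This is only an illustrative model, not the plugin's code: it assumes each worker keeps its own state file locally, that checks run every 5 minutes, and that stored state is treated as expired once it is older than the expiry value. The function name, the worker order, and the two expiry values (450 and 3600 seconds) are hypothetical and chosen purely for illustration.

```python
def simulate(worker_sequence, interval_s, expiry_s):
    """Model per-worker state files for a check that needs a previous sample.

    Returns one (status, sample_period_s) tuple per check run.
    Simplified assumption: state older than expiry_s is discarded.
    """
    last_sample = {}  # worker name -> time (s) of the sample stored on that worker
    results = []
    for i, worker in enumerate(worker_sequence):
        now = i * interval_s
        previous = last_sample.get(worker)
        if previous is None or now - previous > expiry_s:
            # No usable state on this worker: only a first sample is collected.
            results.append(("UNKNOWN", None))
        else:
            # State found: the result covers the time since that worker's last run.
            results.append(("OK", now - previous))
        last_sample[worker] = now
    return results


# Order of workers taken from the scenario in the post above.
sequence = ["worker1", "worker2", "worker3", "worker2", "worker2", "worker1"]

short_expiry = simulate(sequence, interval_s=300, expiry_s=450)
long_expiry = simulate(sequence, interval_s=300, expiry_s=3600)

for label, results in (("expiry 450s", short_expiry), ("expiry 3600s", long_expiry)):
    print(label, results)
```

In this model a short expiry leaves most runs UNKNOWN, while a long expiry makes check 6 return a result that covers 1500 seconds (25 minutes) rather than the intended 5 minutes, which is the trade-off described in the post.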

Re: WMI checks with unknown status

Posted: Thu Aug 27, 2015 11:15 am
by jdalrymple
Out of our control.

check_wmi_plus uses a state file. IMO the best solution is to ditch WMI for nsclient, which doesn't use a state file. I understand this isn't likely feasible. I'll add the state file to my list of reasons agent-based is better when selling ideas to our customers.

Alternative option - I think check_wmi_plus can be configured to use a specific path for the state file, so you could use shared storage (I have never actually tried it).
CFT6Server wrote:wouldn't that defeat the purpose of having the workload distributed
Not really - it just complicates it. There is no express need to have your workers randomly grab the checks - as a matter of fact there are great benefits to fixing servicegroups and/or hostgroups to a specific worker (geographic and/or network locality).

Re: WMI checks with unknown status

Posted: Thu Aug 27, 2015 11:45 am
by CFT6Server
I will explore the possibility of having the state files sit on shared storage. In the meantime, I will configure the CPU check to run on only one worker node. Question: we are using service templates, so can I just add the service template to the service group, or will I have to update all the checks individually? If so, is there a simple way to do this in bulk?

Thanks.

Re: WMI checks with unknown status

Posted: Thu Aug 27, 2015 2:50 pm
by tgriep
There is a way to disable keepstate for the WMI check:
KEEPING STATE
This only applies to checks that need to perform 2 WMI queries to get a complete result eg checkcpu, checkio, checknetwork etc.

Checks like this take 2 samples of WMI values (with a DELAY in between) and then calculate a result by differencing the values. By default, the plugin will "keepstate", which means that 1 WMI sample will be collected each time the plugin runs and it will calculate the difference using the values from the previous plugin run. For something like checkcpu, this means that the plugin gives you an average CPU utilisation between runs of the plugin. Even if you are 0% utilisation each time the plugin runs but 100% in between, checkcpu will still report a very high utilisation percentage. This makes for a very accurate plugin result. Another benefit of keeping state is that it results in fewer WMI queries and the plugin runs faster. If you disable keeping state, then, for these types of checks, the plugin reverts to taking 2 WMI samples with a DELAY each time it runs and works out the result then and there. This means that, for example, any CPU activity that happens between plugin runs is never detected.

There are some specific state keeping options:
--nokeepstate,
--keepexpiry KEXPIRY,
--keepid KID,

The files used to keep state are stored in /tmp. The DELAY setting is not applicable when keeping state is used. If you are "keeping state" then one of the first things shown in the plugin output is the sample period.

-y DELAY Specify the delay between 2 consecutive WMI queries that are run in a single call to the plugin. Defaults are set for certain checks. Only valid if --nokeepstate used.
You can specify --nokeepstate and set -y 1 for the delay.

The downside of this is that the plugin has to query WMI twice on every run to calculate the results.
Are you interested in doing this?
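
The difference between the two modes in the quoted help text can be sketched as follows. This is a simplified model and not the plugin's actual calculation: it assumes a cumulative "busy seconds" counter and differences two (timestamp, counter) samples. The host behaviour (idle except for a burst between runs) and all function names are hypothetical.

```python
def utilisation(sample_a, sample_b):
    """Average CPU utilisation (%) between two (timestamp_s, busy_s) samples."""
    (t1, busy1), (t2, busy2) = sample_a, sample_b
    if t2 <= t1:
        raise ValueError("samples must be in increasing time order")
    return 100.0 * (busy2 - busy1) / (t2 - t1)


def busy_counter(t):
    """Hypothetical host: idle, except 100% busy from t=60s to t=240s."""
    return float(min(max(t - 60, 0), 180))


# keepstate: one sample per plugin run, differenced against the previous run.
# Here the runs are 300 seconds apart.
keepstate = utilisation((0, busy_counter(0)), (300, busy_counter(300)))

# --nokeepstate with -y 1: two samples one second apart inside a single run.
nokeepstate = utilisation((300, busy_counter(300)), (301, busy_counter(301)))

print(f"keepstate average over the interval: {keepstate:.1f}%")
print(f"nokeepstate 1-second snapshot: {nokeepstate:.1f}%")
```

In this model the keepstate result reflects the burst that happened between runs, while the 1-second snapshot reports an idle host because the burst had already ended when the plugin ran.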

Re: WMI checks with unknown status

Posted: Thu Aug 27, 2015 4:21 pm
by CFT6Server
I am definitely interested in setting this up. I will test this to see if it will work better for us.
I am not sure which will work better for us. We have different check intervals, some of them over 15 minutes, so the average becomes less of an issue, if I understand that correctly.

Re: WMI checks with unknown status

Posted: Thu Aug 27, 2015 4:36 pm
by tgriep
The -y 1 means that each time the check runs, the plugin takes two WMI samples with a 1 second delay between them and then returns the performance data.
Does that make sense?