Hi Girls/Guys,
I've set up Nagios for our render farm. To reduce costs we suspend compute "nodes" when they are not in use, so I want to check those machines only while they are ON. I've tried several things. First I created a ping service and made all other services depend on it, but then the ping service flaps and all the other services return ERROR. So I did it another way: I wrote a wrapper script for NRPE. The script tests the state of the machine (OFF/ON) using IPMI. If the machine is OFF it immediately returns OK. If the machine is ON it passes the call on to NRPE. At first the result looked brilliant, but after a while I started thinking. If one of the services is CRITICAL, I get notified. Then the machine goes into suspend and I receive status OK. The machine wakes up and it is CRITICAL again. So the state is flapping when it shouldn't.
So my question: is it possible to define a condition for when a service is checked? If the machine is OFF, don't do anything and simply keep showing the last state. That way, if a service is CRITICAL and the machine goes into suspend, I still see the CRITICAL state.
Sorry for the long post. I want to be clear, and my English is not perfect.
nagios for suspeded machine
-
npolovenko
- Support Tech
- Posts: 3457
- Joined: Mon May 15, 2017 5:00 pm
Re: nagios for suspeded machine
@bartek_zgo, check out https://support.nagios.com/kb/article.php?id=505 under Disable Service Checks. Basically, there's a setting you can enable in the nagios.cfg file that stops Nagios from checking any services once the host is down. One thing to keep in mind is that this setting is global, meaning that it applies to all hosts. Also, this feature will become available in XI 5.5.0, which has not been released yet but should be out shortly.
-
bartek_zgo
- Posts: 5
- Joined: Thu Mar 16, 2017 4:41 am
Re: nagios for suspeded machine
Thanks for the reply, but this doesn't sound like a solution. First, it is global. Second, you have to restart Nagios for it to take effect. We have about 10 main servers that are always ON and supervised by Nagios, plus nodes that go into suspend automatically, so I cannot restart Nagios each time a node wakes up. It works like this: a task arrives at the server, and the server decides to wake up 10 nodes. They work for 2 hours and go to sleep. While one task is running, another task may arrive that takes only 10 minutes, for example. So this must be automatic. The ideal would be to define a script for a service or host that is executed before the check_command. Something like host_down_disable_service_checks, but defined per host.
Re: nagios for suspeded machine
bartek_zgo wrote: So I can not restart nagios each time one node wakes up ... So this must be automatic.
In terms of something that is "native to Nagios", the external commands file is your best bet. Here's a list of all the available commands:
https://old.nagios.org/developerinfo/ex ... ndlist.php
DISABLE_HOST_CHECK / ENABLE_HOST_CHECK and DISABLE_HOST_SVC_CHECKS / ENABLE_HOST_SVC_CHECKS might be of particular interest. This doesn't exactly "remove" the host and its services from the Nagios GUI though. If you wanted to do that, some heavy lifting would need to be done with the REST API included with Nagios XI (see the "Help" section of the Nagios XI GUI for usage/documentation).
In my mind, as part of the node's "shutdown" process, you could include calls to the Nagios XI API to remove the node from Nagios XI. Similarly, as part of the node's "startup" process, add it back in to Nagios XI. This would require a full restart of the Nagios daemon, though. That is less than ideal by your description, which is why I offered the external commands file as an alternative option; it doesn't require a Nagios daemon restart to enable/disable active checks for a particular object.
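For illustration, here is a minimal Python sketch of writing one of those external commands. The line format (`[epoch] COMMAND;host_name`) is the one the external commands documentation describes. The command file path below is an assumption (the real location is whatever `command_file` points to in nagios.cfg), and `format_command` / `send_command` are hypothetical helper names, not part of Nagios.

```python
import time

# Assumed default location; check the command_file setting in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_command(name, host, now=None):
    """Build one external command line: [epoch] NAME;host_name"""
    ts = int(time.time() if now is None else now)
    return f"[{ts}] {name};{host}\n"


def send_command(name, host, command_file=COMMAND_FILE):
    """Write a single external command to the Nagios command file."""
    # The command file is a named pipe read by the Nagios daemon, so
    # opening it for writing blocks if Nagios is not running.
    with open(command_file, "w") as pipe:
        pipe.write(format_command(name, host))
```

Calling `send_command("DISABLE_HOST_SVC_CHECKS", "node01")` when a node (here the made-up name `node01`) goes to sleep, and `send_command("ENABLE_HOST_SVC_CHECKS", "node01")` when it wakes up, should toggle active service checks for that host without a daemon restart.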
And if you're not totally comfortable with part of your node startup / shutdown process executing shell commands on the Nagios XI box, NRDP can be used to submit them via HTTP using an access key.
Former Nagios employee
https://www.mcapra.com/
-
bartek_zgo
- Posts: 5
- Joined: Thu Mar 16, 2017 4:41 am
Re: nagios for suspeded machine
Thanks! This is an interesting solution. I will have to read up on how to use REST in Nagios and how to give the nodes permission to call it. It is not an ideal solution because it depends on the node: if an error happens on the node during wake-up (the REST call is lost), the host will remain down. Maybe I will write a script that scans through the nodes and, if a node's state changes, writes a command to the command file. Thanks for showing me this option. I didn't know about the command file!
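A rough Python sketch of such a scanning script, under several assumptions: the power state is read with `ipmitool ... chassis power status` (which prints "Chassis Power is on" or "Chassis Power is off"), the node names, BMC addresses and credentials are placeholders, and all function names are made up for this example. Only nodes whose state changed since the previous scan get a command written.

```python
import subprocess
import time


def power_is_on(bmc_host, user, password):
    """Ask the node's BMC for the chassis power state via ipmitool."""
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user,
         "-P", password, "chassis", "power", "status"],
        capture_output=True, text=True, timeout=10,
    )
    return "is on" in result.stdout.lower()


def commands_for_changes(previous, current, now):
    """Return external command lines for nodes whose state changed."""
    lines = []
    for host in sorted(current):
        is_on = current[host]
        if previous.get(host) == is_on:
            continue  # unchanged since the last scan, nothing to send
        name = "ENABLE_HOST_SVC_CHECKS" if is_on else "DISABLE_HOST_SVC_CHECKS"
        lines.append(f"[{int(now)}] {name};{host}\n")
    return lines


def scan_once(nodes, previous, command_file, probe=power_is_on):
    """Probe every node once and write commands for the changed ones.

    nodes maps a Nagios host_name to (bmc_host, user, password).
    Returns the new state map to pass in as `previous` next time.
    """
    current = {host: probe(*bmc) for host, bmc in nodes.items()}
    lines = commands_for_changes(previous, current, time.time())
    if lines:
        with open(command_file, "w") as pipe:
            pipe.writelines(lines)
    return current
```

Run from cron or a small loop, this keeps the decision on the Nagios side instead of relying on the node to report its own wake-up. With checks disabled, a service should keep showing its last state (for example CRITICAL) while the node is suspended.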
-
dwhitfield
- Former Nagios Staff
- Posts: 4583
- Joined: Wed Sep 21, 2016 10:29 am
- Location: NoLo, Minneapolis, MN
- Contact:
Re: nagios for suspeded machine
I think a lot of the issues around this are going to be resolved with Core 5. Unfortunately, we are a long way off from a Core 5 release date. It seems like you have a tentative solution for now.
Did you have any other questions we can address now?