onDemand Test for WoL Renderfarm, Adaptive Monitoring?

Posted: Mon Aug 06, 2012 10:57 am
by xoanon
Hi there,

I have a nice little renderfarm. Since the nodes are not used 24/7, they are turned off and on by management software. When a node is supposed to be online, I need to monitor it.
Is the node really online? Service online? NFS mount OK? etc. ...

The managing software is not great, but I do know when a node is supposed to be online.

I tried to disable the host checks via external commands, but this does not help: the hosts are still shown as critical. That is not very helpful, because when 37 hosts are offline, that's great! It saves a lot of power, and the air conditioning also runs at 50%. But host number 38 was supposed to be online! ...

I really need them green. I can turn off all alerts. But that's not enough.

Right now my idea is to use Adaptive Monitoring and exchange all check commands with fake ones that always produce "OK". Is there a better approach?

I have similar problems with a full hard disk (yeeees, it's full... I know. I told the staff. Thank you.) ... Is there a plugin or something which removes hosts/services from the "critical-red-we-are-going-to-die" list when I tell Nagios to stop obsessing over or checking this host/service? I like the color blue, or even pink with lilac dots ... Just not red. It always looks like I am not doing my job.


Have a nice day,

Timo

Re: onDemand Test for WoL Renderfarm, Adaptive Monitoring?

Posted: Mon Aug 06, 2012 11:50 am
by yancy
Timo,

A solution that comes to mind is to use a passive agent. That way, the host is responsible for sending check results back to Nagios, which will only happen when the host is online. That way, you'll be alerted when there is a problem with a service like NFS.

Link for NRDS passive agent
http://exchange.nagios.org/directory/Do ... DS/details
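
For illustration, here is a minimal sketch of what a passive result looks like when it reaches Nagios Core. It uses the `PROCESS_SERVICE_CHECK_RESULT` external command; as far as I know NRDS itself delivers results over HTTP to NRDP rather than writing this line directly, so treat this as a picture of the idea, not of NRDS internals. The command file path, the service name "NFS mount" and the helper names are assumptions.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT line for the command file.

    return_code follows the plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    """
    timestamp = int(time.time() if now is None else now)
    return "[{0}] PROCESS_SERVICE_CHECK_RESULT;{1};{2};{3};{4}\n".format(
        timestamp, host, service, return_code, output
    )


def submit(line, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write the line to the external command file.

    The path is an assumption (typical for a source install); check the
    command_file setting in nagios.cfg on your system.
    """
    with open(command_file, "w") as pipe:
        pipe.write(line)


if __name__ == "__main__":
    # Hypothetical host and service names, printed instead of submitted.
    print(passive_service_result("knecht01", "NFS mount", 0, "OK - mounted"), end="")
```

A node that is switched off simply never sends such a line, so Nagios never receives a fresh CRITICAL for it.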

As for a plugin that will change the alert color to pink with lilac dots, I haven't heard of one, but maybe someone else on the forum has :D .

-Yancy

Re: onDemand Test for WoL Renderfarm, Adaptive Monitoring?

Posted: Thu Aug 09, 2012 4:19 am
by xoanon
Hm ... but won't passive checks time out too and turn red?

Perhaps we have a little misunderstanding:

The core of my problem is the Nagios status page. Right now I have a status page with 24 render nodes, knecht01 to knecht24, showing RED "Critical" although they are fine (offline), together with their 96 services. All in all Nagios is reporting 122 problems: 24 offline render nodes and their 96 services, a full disk, and one(!) hard disk which seems to be starting to fail. All render nodes are set to Active Checks: Disabled, Obsessing: Disabled, Notifications: Disabled and (just to be sure) have a scheduled downtime ... and they are still RED.

The only thing that needs attention is the one hard disk which seems to be starting to fail.

Right now nobody except me can use the status page because everything is red.

I have always had the problem with Nagios that certain alerts can't be fixed. I cannot erase data from a full disk. (I can ... but ... you know ...) So I changed the critical levels to get rid of the reds. From one viewpoint this is perhaps the right thing to do: when it's not critical that the disk is full, don't raise an alert. And for that approach your suggested solution would help.

But the nodes are offline. I can't change their config.

Right now I will go for Adaptive Monitoring: writing a host check which, when the host should be down, exchanges all of its service checks with a check that always reports "OK". I admit it could be argued that this is the way to go, because things that are OK should report "OK". But it's complicated.
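
A minimal sketch of that swap, assuming the `CHANGE_SVC_CHECK_COMMAND` external command is used and that a command named `always_ok` (hypothetical, e.g. a wrapper around the `check_dummy` plugin) is already defined in the Nagios config. The service names and the `check_nrpe!...` commands are made up for illustration, and the dummy only produces results if active checks stay enabled for those services.

```python
def adaptive_commands(host, real_commands, should_be_online, now, dummy="always_ok"):
    """Build one CHANGE_SVC_CHECK_COMMAND line per service of a host.

    real_commands maps service_description -> the normal check command.
    When the node should be offline, every service gets the dummy command;
    when it should be online, the normal commands are restored.
    """
    lines = []
    for service, real in sorted(real_commands.items()):
        command = real if should_be_online else dummy
        lines.append(
            "[{0}] CHANGE_SVC_CHECK_COMMAND;{1};{2};{3}".format(
                int(now), host, service, command
            )
        )
    return lines


if __name__ == "__main__":
    # Hypothetical services of one render node.
    services = {"NFS": "check_nrpe!check_nfs", "Load": "check_nrpe!check_load"}
    for line in adaptive_commands("knecht01", services, False, 1344500000):
        print(line)
```

Each line would then be written to the Nagios external command file. The host check itself could be swapped the same way with `CHANGE_HOST_CHECK_COMMAND`.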

Is there a way via external commands to list all the service checks for a host? Or to get at the config from inside a check (Perl)?

Re: onDemand Test for WoL Renderfarm, Adaptive Monitoring?

Posted: Thu Aug 09, 2012 9:25 am
by yancy
--snip--
Hm ... but won't passive checks time out too and turn red?
--snip--

Only if you have freshness checking enabled in Nagios. By default, the last result from the passive agent will be used. So if the last result was "OK" and the machine was turned off, it would stay OK until a new status update arrives.

I'm not very familiar with how this would be solved with adaptive monitoring, but it sounds like it would accomplish the same goal.
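
To make the freshness rule concrete, here is a simplified model of the decision, not Nagios's actual code. The parameter names mirror the `check_freshness` and `freshness_threshold` directives; the sketch assumes an explicit threshold in seconds and ignores the case where Nagios works out a threshold on its own.

```python
def is_stale(last_update, now, check_freshness=False, freshness_threshold=3600):
    """Return True if a passive result would be treated as stale.

    With freshness checking off, the last result is kept indefinitely,
    no matter how old it is.
    """
    if not check_freshness:
        return False
    return (now - last_update) > freshness_threshold


if __name__ == "__main__":
    one_day = 24 * 3600
    # Node switched off a day ago, freshness checking disabled.
    print(is_stale(last_update=0, now=one_day))
    # Same node, freshness checking enabled with a one-hour threshold.
    print(is_stale(last_update=0, now=one_day, check_freshness=True))
```

Only in the stale case does Nagios step in and run the service's check command itself, which is where a red status could come from.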

--snip--
Is there a way via external commands to list all the service checks for a host? Getting the config inside a check (perl) ?
--snip--

The NRDS passive agent allows you to configure and distribute checks from Nagios.
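
If you want the list from inside a check without NRDS, one option might be to read the Nagios status file rather than use external commands, which only send instructions to Nagios. The sketch below assumes the usual `servicestatus { key=value ... }` block layout of `status.dat`; the sample data, host names and function name are made up. It is in Python, but the same parsing would carry over to Perl.

```python
def services_for_host(status_text, host):
    """Return (service_description, check_command) pairs for one host."""
    services = []
    current = None  # key/value pairs while inside a servicestatus block
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("servicestatus") and line.endswith("{"):
            current = {}
        elif line == "}":
            if current is not None and current.get("host_name") == host:
                services.append(
                    (current.get("service_description"), current.get("check_command"))
                )
            current = None
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key] = value
    return services


# Made-up excerpt in the assumed status.dat layout.
SAMPLE = """
hoststatus {
    host_name=knecht01
    check_command=check-host-alive
    }

servicestatus {
    host_name=knecht01
    service_description=NFS mount
    check_command=check_nrpe!check_nfs
    }

servicestatus {
    host_name=knecht02
    service_description=Load
    check_command=check_nrpe!check_load
    }
"""

if __name__ == "__main__":
    for description, command in services_for_host(SAMPLE, "knecht01"):
        print(description, command)
```

On a real system the text would come from the file named by `status_file` in nagios.cfg.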

Hope that helps!

-Yancy