Hello all! I have two machines in a cluster, and I have installed Nagios on both of them. One machine acts as the monitoring server, and I installed NRPE there as a plugin. The other machine (the remote host) is monitored by the server, and I installed NRPE there as a daemon. Everything works fine most of the time, but occasionally the number of processes on the remote machine climbs, the load increases, and the system stops responding to users on the network. I have installed a few plugins on the remote machine to monitor the services I need, and checking the GPU status is one of them. The remote machine has two Tesla M2050 GPUs attached, and I often monitor their status from the server.
I installed check_gpu_sensor for GPU monitoring. The remote machine on which I installed the check_gpu_sensor plugin hung this morning under heavy load. I was not using the machine at the time, but other people on the network said that 'check_gpu_sensor' and 'ps' were the two processes being executed continuously under nagios, and that this caused the problem. It seems the 'check_gpu_sensor' process entered the D state (uninterruptible sleep, mostly) and could not be killed.
I stopped the Nagios process on both the server and the remote machine, but that did not help. They have now shut down the connections for maintenance.
What might be the actual problem, and how can I fix it? Is it related to the plugin, the GPU device, the network, or NRPE? Please help!
Plugin getting into uninterruptable sleep state
slansing
Re: Plugin getting into uninterruptable sleep state
How often were you having checks run against that machine? Do you have freshness checks enabled in the Nagios server's service configuration for this plugin? Also, you do not need Nagios on multiple systems to monitor them; you only need Nagios on one system, and then set up checks against the others.
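Once the machine is reachable again, it may also help to confirm which processes are really sitting in the D state. Below is a minimal Python sketch for that, assuming a Linux host where /proc is readable. The function names `parse_stat` and `find_d_state` are made up for illustration and are not part of Nagios, NRPE, or check_gpu_sensor.

```python
import os


def parse_stat(stat_line):
    """Return (pid, command, state) parsed from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left].strip())
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root="/proc"):
    """List (pid, command) for every process currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the line was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

Note that the kernel truncates the command name in /proc/<pid>/stat to 15 characters, so the plugin would show up as `check_gpu_senso`. If the same name appears many times, checks are probably piling up behind a stuck one.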