Plugin getting into uninterruptible sleep state
Posted: Thu Mar 28, 2013 11:42 pm
Hello all! I have 2 machines in a cluster, and I have installed Nagios on both of them. One machine acts as the monitoring server, and I installed NRPE as a plugin there. The other machine (the remote host) is monitored by the server, and I installed NRPE as a daemon there. Everything works fine most of the time, but occasionally the number of processes on the remote machine climbs, the load increases, and the system stops responding to users on the network. I have installed a few plugins on the remote machine to monitor the services I need, and checking the GPU status is one of them. The remote machine has two Tesla M2050 GPUs attached, and I regularly monitor their status from the server.
I have installed check_gpu_sensor for GPU monitoring. The remote machine on which I installed the check_gpu_sensor plugin hung this morning under heavy load. I was not using the machine at the time, but other people on the network said that 'check_gpu_sensor' and 'ps' were the two processes being executed continuously under the nagios user, and that this caused the problem. It seems the 'check_gpu_sensor' process reached D state (uninterruptible sleep, most likely) and could not be killed.
I stopped the nagios process on both the server and the remote machine, but that did not help. They have now shut down the connections for maintenance.
What might be the actual problem, and how can I fix it? Is the problem related to the plugin, the GPU device, the network, or NRPE? Please help!
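In case it helps with diagnosis, here is a minimal sketch (Python, Linux only) of how the processes currently in D state could be listed by reading /proc/<pid>/stat. The function names parse_stat_line and find_d_state are my own and are not part of Nagios, NRPE, or check_gpu_sensor; this only reports which processes are stuck, it does not explain why.

```python
import os


def parse_stat_line(line):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST closing parenthesis.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root="/proc"):
    """List (pid, command) for every process currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or entry unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

Run on the remote machine while the problem is occurring, this would print one line per stuck process, which should show whether check_gpu_sensor is the only one affected.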