Hello,
Every few days we get gaps in our performance data. The gaps usually resolve on their own after 1-3 days, but over the past year we have never gone longer than 5-6 days without the issue recurring. During these gaps I noticed that the nagios process consumes nearly all available memory on CentOS.
I even added more virtual memory during one of these periods, and nagios slowly consumed a few extra gigabytes, reaching around 7 GB of the 8 GB available, which leads me to believe this may be some sort of memory leak. Here are some things I have tried so far:
1. Changed the timeout value in /usr/local/nagios/etc/npcd.cfg. The current value is 35, but I have also tried 15 and 20.
2. Modified the threshold value in /usr/local/nagios/etc/pnp/process_perfdata.cfg to 80.0%.
"I thought that with a higher threshold, perfdata might still be processed during high memory usage"
Here is the error in nagios.log when perfdata stops being processed:
------------------------
Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1545804007.perfdata.host" - errno: Cannot allocate memory
Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1545804007.perfdata.service" - errno: Cannot allocate memory
-----------------------
Nagios core version 4.2.4
"We rely heavily on mod_gearman so could not update nagios core"
Another related error is:
"could not write to destination directory /usr/local/nagios/var/spool/xidpe"
While this error was filling the logs, I could not see any spool files in that directory.
Sorry for posting so much information, hopefully it is helpful.
Thank you
Random performance data missing
-
- Support Tech
- Posts: 3457
- Joined: Mon May 15, 2017 5:00 pm
Re: Random performance data missing
@fortyx40, Have you run the "top" command to see which processes are consuming the memory? How many host and service checks are you monitoring? How much memory does your server have in total?
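As a complement to watching top interactively, a minimal sketch like the following can total the resident memory of every process with a given name, which makes it easier to log the figure over time. The helper name `sum_rss_for` is hypothetical, and the `ps` invocation in the comment assumes the Linux procps version of `ps`.

```shell
# Sum resident memory (RSS, in KB) for every process whose command name
# matches exactly. Reads "RSS COMMAND" pairs on stdin, one per line.
sum_rss_for() {
    awk -v name="$1" '$2 == name { total += $1 } END { print total + 0 }'
}

# Example (assumes Linux procps ps): total RSS of all processes named "nagios".
# ps -eo rss=,comm= | sum_rss_for nagios
```

Running the commented command from cron every few minutes and appending the result to a file would show whether the total climbs steadily, which is the pattern a leak would produce.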
Decreasing the timeout value is not a good idea, because if the NPCD delay exceeds the timeout value, NPCD will skip a portion of the spooled perfdata and not process it.
Load threshold needs to be increased in the /usr/local/nagios/etc/pnp/npcd.cfg file.
Timeout needs to be increased in the /usr/local/nagios/etc/pnp/process_perfdata.cfg.
npcd service needs to be restarted in order for changes to get applied:
service npcd restart
Also, please run:
chmod u+w /usr/local/nagios/var/spool/xidpe/
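Before restarting, it can help to confirm which values are actually active in the two files named above. The sketch below is only an illustration: `show_setting` is a hypothetical helper, and the key names `load_threshold` and `TIMEOUT` are assumed from stock PNP4Nagios configuration files, so check them against your own copies.

```shell
# Print the active (non-commented) line for a setting in a "key = value" file.
show_setting() {
    grep -E "^[[:space:]]*$1[[:space:]]*=" "$2"
}

# Example calls (key names are assumptions, paths are the ones given above):
# show_setting load_threshold /usr/local/nagios/etc/pnp/npcd.cfg
# show_setting TIMEOUT /usr/local/nagios/etc/pnp/process_perfdata.cfg
```

If a call prints nothing, the setting is either commented out or missing from that file.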
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Re: Random performance data missing
I did run top, which is how I concluded that nagios was using all the memory. I increased the timeout and the load threshold; I did not decrease them.
I also gave full permissions to /usr/local/nagios/var/spool/xidpe/, though that error only appears when nagios starts eating all the memory. I restart the npcd service after every change. Basically, I believe the nagios process itself is causing all the issues. We have around 1,100 hosts and 2,500 services.
Everything runs smoothly until nagios decides to consume all the resources.
I attached an example of our perfdata.
Thank you!
-
- Support Tech
- Posts: 3457
- Joined: Mon May 15, 2017 5:00 pm
Re: Random performance data missing
@fortyx40, Your problem could be related to this reported issue:
https://github.com/NagiosEnterprises/na ... issues/455
The good news is that the fix that will allow mod_gearman integration with the latest Core is already in the QA stage. Once we have tested it, it will be made available to users.
Re: Random performance data missing
That's great news! Will this fix be integrated into an official release of Nagios Core, or will it be a separate component or patch?
Thank you
-
- Support Tech
- Posts: 3457
- Joined: Mon May 15, 2017 5:00 pm
Re: Random performance data missing
@fortyx40, We'll publish a separate script for mod gearman and updated mod gearman packages to our repo. No changes required on the Core side.
Re: Random performance data missing
Is there any way to sign up for notifications about this script's release? Will it be available in the https://github.com/NagiosEnterprises/nagioscore master branch once it's released?
Thank you
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
Re: Random performance data missing
If you click the star at the top of the project page you will be notified when releases are made
Re: Random performance data missing
I'm having trouble finding this script. I have looked all through the repo to no avail. Has it been published? I see that a new Core has been released. Do I still need this separate script, or has mod_gearman compatibility been added natively to the new Core?
Thank you
Re: Random performance data missing
I believe I found the correct article: https://support.nagios.com/kb/article/n ... e-839.html
It looks like we just need to update to mod_gearman3 to use the newer Core versions.