System resources high utilization

sandeepatil
Posts: 211
Joined: Tue Dec 27, 2016 3:12 am

System resources high utilization

Post by sandeepatil »

We are using Nagios Core 4.3.4 with pnp4nagios, configured for passive monitoring. The system is a VM running RHEL 7.6 with 32 GB RAM and an 8-core 2.40 GHz processor.

We monitor 1000 hosts and about 80000 services with Nagios Core.

Whenever there is an alert flood, or bad monitors generate a burst of alerts, the checkresults directory exhausts its inodes and CPU utilization climbs as a result.
Inode usage on the /var and /tmp filesystems also reaches 100%.

We need your help to fix this issue, or to increase Nagios Core's capacity so it can handle alert floods without exhausting system resources.
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: System resources high utilization

Post by tgriep »

Do you know what is using up all of the space and inodes on the system?

Here are some example commands you can run to find out.

Find the 10 largest directories by size:

Code: Select all

find / -type d -print0 | xargs -0 du | sort -n | tail -10 | cut -f2 | xargs -I{} du -sh {}
Find the 10 largest files by size:

Code: Select all

find / -type f -print0 | xargs -0 du | sort -n | tail -10 | cut -f2 | xargs -I{} du -sh {}
Find the top-level directories with the highest inode count:

Code: Select all

for i in /*; do echo "$i"; find "$i" | wc -l; done
Can you run the following commands to display the processes running on the system and post the output?

Code: Select all

top -n 1
ps -ef --cols=300
Have you checked the Nagios and pnp4nagios log files, and the log files in the /var/log folder, for any errors or messages that might explain the issue?
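
As a rough sketch of the inode-count loop above, the helper below prints one sortable line per immediate subdirectory, largest first. The function name `inode_counts_by_dir` is made up for illustration. It counts directory entries, which only approximates inode usage (hard links are counted once per name).

```shell
#!/bin/sh
# Print "<entry count> <directory>" for each immediate subdirectory of a
# root directory, largest first. Hypothetical helper, not a Nagios tool.
inode_counts_by_dir() {
    root=${1:-/}
    root=${root%/}            # "/" becomes "", so the glob below is "/*/"
    for d in "$root"/*/; do
        [ -d "$d" ] || continue
        # -xdev keeps find on the filesystem that holds this directory
        printf '%s %s\n' "$(find "$d" -xdev | wc -l | tr -d ' ')" "${d%/}"
    done | sort -rn
}
```

For example, `inode_counts_by_dir /var | head -5` would list the five busiest directories under /var, and `df -i` shows how close each filesystem is to its inode limit.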
Be sure to check out our Knowledgebase for helpful articles and solutions!
sandeepatil
Posts: 211
Joined: Tue Dec 27, 2016 3:12 am

Re: System resources high utilization

Post by sandeepatil »

Yes, I have checked. The checkresults directory fills its inodes first. Once it reaches 100%, alert data starts filling /tmp, and the resulting errors fill /var/log/https/error_logs.

The nagios process also uses most of the system resources.

To resolve the issue, we follow these steps:
1) Stop the nagios and httpd processes.
2) Delete the checkresults directory and clear unwanted data from /tmp and /var.
3) Recreate the checkresults directory and restart the nagios and httpd services.

This only fixes the problem until the next alert flood.
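
The manual steps above could be scripted roughly as follows. This is only a sketch: the function names are invented, the service names `nagios` and `httpd` and the use of `systemctl` are assumptions for a RHEL 7 install, and the spool path is the one shown later in this thread. It empties the directory in place rather than deleting and recreating it, so ownership and permissions stay intact.

```shell
#!/bin/sh
# Empty a spool directory without removing the directory itself.
purge_spool() {
    dir=$1
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    # -mindepth 1 protects the directory itself; -delete avoids the
    # "argument list too long" error that "rm *" hits with many files.
    find "$dir" -mindepth 1 -delete
}

# Hypothetical wrapper; must run as root. Service names are assumptions.
reset_checkresults() {
    systemctl stop nagios httpd &&
    purge_spool /opt/app/nagios/var/spool/checkresults &&
    systemctl start nagios httpd
}
```

Any check results still in the spool when it is purged are lost, so this remains a recovery step, not a fix.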
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: System resources high utilization

Post by tgriep »

The checkresults folder should be cleaned out after the check's data has been processed.

In the nagios.cfg file, make sure the max_check_result_file_age option is set.

Code: Select all

max_check_result_file_age=3600
This option determines the maximum age, in seconds, at which Nagios will consider check result files found in the check_result_path directory to be valid. Check result files that are older than this threshold will be deleted by Nagios, and the check results they contain will not be processed. With a value of zero (0), Nagios will process all check result files.

Can you post what sort of files are in the various folders and any error messages?
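
To see what is currently configured, something like the sketch below could be used. The function name is made up; the four option names are standard nagios.cfg settings related to check result handling, but whether all of them are set in a given file is an assumption.

```shell
#!/bin/sh
# Print the check-result related settings from a Nagios main config file,
# skipping commented-out lines.
show_check_result_settings() {
    cfg=$1
    grep -E '^[[:space:]]*(check_result_path|max_check_result_file_age|check_result_reaper_frequency|max_check_result_reaper_time)[[:space:]]*=' "$cfg"
}
```

Usage, assuming the config lives under the install prefix seen in this thread: `show_check_result_settings /opt/app/nagios/etc/nagios.cfg`.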
sandeepatil
Posts: 211
Joined: Tue Dec 27, 2016 3:12 am

Re: System resources high utilization

Post by sandeepatil »

I did not understand this point from the details you shared:
"By using a value of zero (0) with this option, Nagios will process all check result files."

The error messages come from all kinds of bad monitoring, e.g. monitoring a log file that does not exist, and there are about 1000 log files like that.
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: System resources high utilization

Post by tgriep »

Zero means Nagios processes everything in the checkresults folder, no matter how old it is.
With the option set to 3600, Nagios processes the files newer than 3600 seconds and deletes all unprocessed files older than 3600 seconds.
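
A quick way to see how many files would fall on the "too old" side of that threshold is sketched below. The function name is invented, and it assumes file modification time is a reasonable proxy for the age Nagios evaluates.

```shell
#!/bin/sh
# Count regular files directly inside a directory that were last modified
# more than the given number of minutes ago.
count_older_than() {
    dir=$1
    minutes=$2
    find "$dir" -maxdepth 1 -type f -mmin +"$minutes" | wc -l | tr -d ' '
}
```

For a 3600-second threshold: `count_older_than /opt/app/nagios/var/spool/checkresults 60`. A persistently non-zero count would suggest results are arriving faster than they are being reaped.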
sandeepatil
Posts: 211
Joined: Tue Dec 27, 2016 3:12 am

Re: System resources high utilization

Post by sandeepatil »

Got it, I understand what zero means now.

The 3600-second setting you shared has been in place for the last year, but we still hit the inode-fill issue during alert floods. This may be caused by bad monitoring or network issues.

We can work around this issue, but we want a solution or tips for configuring Nagios Core to handle any kind of alert flood without stopping the process or filling the inodes.

We need to tune Nagios Core for maximum processing capacity.
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: System resources high utilization

Post by tgriep »

Without seeing any error messages when the issue starts, it is hard to help out.

What agent are you using on the remote systems to send the checks to the Nagios server?
What time interval are the passive checks set up with on the remote servers?

On the Nagios server, what did you set up to receive the passive checks from the remote hosts?

One thing to think about: with that many host and service checks, setting the max_check_result_file_age option to 3600 may be too high. Depending on the check interval, the folder could fill up fairly quickly and cause the other issues, so consider decreasing the value.

Also, you may want to raise the default limit on the number of open files to a large value in case the server receives a lot of checks.
https://unix.stackexchange.com/question ... -processes
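
Before raising anything, it may help to check what limit the running process actually has. A minimal sketch, assuming a Linux system with /proc mounted (the function name is made up):

```shell
#!/bin/sh
# Print the soft and hard "Max open files" limits of a running process,
# as reported in /proc/<pid>/limits (Linux only).
open_files_limit() {
    awk '/^Max open files/ { print $4, $5 }' "/proc/$1/limits"
}
```

Usage, assuming the daemon's process name is `nagios`: `open_files_limit "$(pgrep -o -x nagios)"`. If Nagios is started by systemd, the limit is usually raised with `LimitNOFILE=` in a unit drop-in rather than with `ulimit`, but how the service is started on this install is not stated in the thread.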
sandeepatil
Posts: 211
Joined: Tue Dec 27, 2016 3:12 am

Re: System resources high utilization

Post by sandeepatil »

What agent are you using on the remote systems to send the checks to the Nagios server?
"An in-house developed agent that works as a passive agent."

What time interval are the passive checks set up with on the remote servers?
"The interval varies from 1s to once a day. We have 1000 hosts and approx. 80000 services."

On the Nagios server, what did you set up to receive the passive checks from the remote hosts?
"We enabled the setting to accept passive checks."
[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
1347
[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
2734
[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
6806
[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
3512
[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
4530
[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
5546
[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
5092
[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
4155
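
The repeated manual sampling above could be automated with a small loop like this sketch (the function name is invented). Note that `ll | wc -l` also counts the leading "total" line and any subdirectories, so the numbers here may differ slightly.

```shell
#!/bin/sh
# Print "<HH:MM:SS> <file count>" for a directory, a fixed number of times,
# sleeping the given number of seconds between samples.
sample_file_count() {
    dir=$1
    interval=$2
    samples=$3
    i=0
    while [ "$i" -lt "$samples" ]; do
        count=$(find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' ')
        printf '%s %s\n' "$(date +%H:%M:%S)" "$count"
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `sample_file_count /opt/app/nagios/var/spool/checkresults 10 30` takes 30 samples, 10 seconds apart, which makes it easier to see whether the backlog is growing or just fluctuating.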
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: System resources high utilization

Post by scottwilkerson »

sandeepatil wrote: What agent are you using on the remote systems to send the checks to the Nagios server?
"An in-house developed agent that works as a passive agent."
Because of this, there isn't much we are going to be able to do...
sandeepatil wrote: What time interval are the passive checks set up with on the remote servers?
"The interval varies from 1s to once a day. We have 1000 hosts and approx. 80000 services."
This explains why there are so many files building up in the check results path.

The only thing I can suggest that could drastically improve performance is moving the checkresults path to a RAM disk. We have a guide for XI from which you could glean the required information:
https://assets.nagios.com/downloads/nag ... giosXI.pdf
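
As a rough illustration of the RAM disk idea (not the procedure from the XI guide), the sketch below builds an /etc/fstab line for a tmpfs mount. The function name and every value passed to it are assumptions. `size`, `nr_inodes`, `mode`, `uid` and `gid` are standard tmpfs mount options; `nr_inodes` is the one that caps inode usage.

```shell
#!/bin/sh
# Print an /etc/fstab line that mounts a tmpfs (RAM-backed) filesystem.
# It only prints the line; nothing is mounted or written.
tmpfs_fstab_line() {
    mount_point=$1
    size=$2         # e.g. 512m
    nr_inodes=$3    # maximum number of inodes, e.g. 200000
    owner_uid=$4    # numeric UID that should own the mount
    owner_gid=$5    # numeric GID that should own the mount
    printf 'tmpfs %s tmpfs size=%s,nr_inodes=%s,mode=0775,uid=%s,gid=%s 0 0\n' \
        "$mount_point" "$size" "$nr_inodes" "$owner_uid" "$owner_gid"
}
```

Usage, with illustrative sizes: `tmpfs_fstab_line /opt/app/nagios/var/spool/checkresults 512m 200000 "$(id -u nagios)" "$(id -g nagios)"`. Anything in a tmpfs is lost on reboot, and the space it uses comes out of RAM.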
Former Nagios employee