System resources high utilization

Post by **sandeepatil** » Mon Jul 15, 2019 6:17 am

We are using Nagios Core 4.3.4 with pnp4nagios, configured for passive monitoring. The system is a VM running RHEL 7.6 with 32 GB RAM and an 8-core 2.40 GHz processor.

We are monitoring 1000 hosts and 80000 services with Nagios Core.

Whenever there is an alert flood, or a burst of alerts caused by bad monitors, the checkresults directory uses up the available inodes and CPU utilization becomes high as a result.
Inode usage on the /var and /tmp filesystems also reaches 100%.

We need your help fixing this issue, or increasing Nagios Core's capacity so that it can handle alert floods without this level of resource utilization.

Post by **tgriep** » Mon Jul 15, 2019 3:17 pm

Do you know what is using up all of the space and inodes on the system?

Some example commands you can run to find that out.

Command to find the 10 largest directories by size:

find / -type d -print0 | xargs -0 du | sort -n | tail -10 | cut -f2 | xargs -I{} du -sh {}

Command to find the 10 largest files by size:

find / -type f -print0 | xargs -0 du | sort -n | tail -10 | cut -f2 | xargs -I{} du -sh {}

Find the highest inode count.

for i in /*; do echo "$i"; find "$i" | wc -l; done
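
The loop above counts everything under each top-level directory. To narrow down which individual directory holds the most entries, a sketch along these lines may help. `top_inode_dirs` is a hypothetical helper name, and GNU `find` with `-printf` is assumed (as shipped on RHEL 7).

```shell
# Hypothetical helper: list the directories under $1 that directly contain
# the most entries (each entry uses one inode), largest first.
# $1 = starting directory (default /), $2 = number of rows to show (default 10)
top_inode_dirs() {
    # %h prints the parent directory of every entry found;
    # -xdev keeps the search on a single filesystem.
    find "${1:-/}" -xdev -mindepth 1 -printf '%h\n' 2>/dev/null \
        | sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

For example, `top_inode_dirs /var 10` prints the ten directories under /var with the most entries, one per line with the count first.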

Can you run the following commands to display the processes running on the system and post the output?

top -n 1
ps -ef --cols=300

Have you checked the Nagios, pnp4nagios log files and the log files in the /var/log folder for any errors or messages that would cause the issue?

Post by **sandeepatil** » Tue Jul 16, 2019 12:33 am

Yes, I have checked. The checkresults directory fills the inodes first. Once it reaches 100%, alert data fills /tmp, and this generates errors that fill /var/log/https/error_logs.

The nagios process is also using most of the resources.

To resolve the issue, we do the following steps:
1) Stop the nagios and httpd processes.
2) Delete the checkresults directory and clear unwanted data from /tmp and /var.
3) Create the checkresults directory again and restart the nagios and httpd services.

This resolves the issue until the next alert flood.

Post by **tgriep** » Tue Jul 16, 2019 8:33 am

The checkresults folder should be cleaned out after the check's data has been processed.

In the nagios.cfg file, make sure the max_check_result_file_age option is set.

max_check_result_file_age=3600

This option determines the maximum age in seconds that Nagios will consider check result files found in the check_result_path directory to be valid. Check result files that are older than this threshold will be deleted by Nagios and the check results they contain will not be processed. By using a value of zero (0) with this option, Nagios will process all check result files.

Can you post what sort of files are in the various folders and any error messages?

Post by **sandeepatil** » Tue Jul 16, 2019 9:40 am

I did not understand the following point from the details you shared:

By using a value of zero (0) with this option, Nagios will process all check result files.

The error messages come from all kinds of bad monitoring, e.g. monitoring a log file that does not exist, for something like 1000 log files.

Post by **tgriep** » Tue Jul 16, 2019 4:18 pm

Zero means Nagios will process everything in the checkresults folder, no matter how old it is.
With it set to 3600, Nagios processes the files newer than 3600 seconds and deletes all of the unprocessed files older than 3600 seconds.
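
To get a feel for how many files would fall outside a given threshold, something like the following could be run against the spool directory. This is only a sketch: `stale_results` is a hypothetical helper, it assumes GNU `find`, and it assumes file modification time is a reasonable proxy for the age Nagios evaluates.

```shell
# Hypothetical helper: count files in a directory whose modification time
# is more than max_age seconds in the past.
# $1 = directory to inspect, $2 = max age in seconds (default 3600)
stale_results() {
    dir="$1"
    max_age="${2:-3600}"
    now=$(date +%s)
    # %T@ prints each file's mtime as seconds since the epoch
    find "$dir" -maxdepth 1 -type f -printf '%T@\n' \
        | awk -v now="$now" -v age="$max_age" '(now - $1) > age { n++ } END { print n + 0 }'
}
```

For example, `stale_results /opt/app/nagios/var/spool/checkresults 600` would show how many files a 10-minute threshold would treat as expired.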

Post by **sandeepatil** » Thu Jul 18, 2019 12:25 am

Got it, I understand what the zero value means now.

The 3600-second setting you shared has been in place for the last year, but we still face the inode fill issue during alert floods. This may be because of bad monitoring or a network issue.

We can resolve the issue when it happens, but we want a solution or tips for configuring Nagios Core to handle any kind of alert flood without stopping the process or filling the inodes.

We need to set Nagios Core to its maximum processing capacity.

Post by **tgriep** » Thu Jul 18, 2019 12:59 pm

Without seeing any error messages when the issue starts, it is hard to help out.

What agent are you using on the remote systems to send the checks to the Nagios server?
What time interval are the passive checks set up with on the remote servers?

On the Nagios server, what did you set up to receive the passive checks from the remote hosts?

One thing to think about: with that many host and service checks, setting the max_check_result_file_age option to 3600 may be too large. Depending on the check interval, the folder could fill up fairly quickly, which causes the other issues, so the value should probably be decreased.

You may also want to increase the default limit on the number of open files to a large value, in case a lot of checks are received by the server.
https://unix.stackexchange.com/question ... -processes
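
Before raising the limit it is worth checking what the current values are. A minimal sketch, using a hypothetical helper name `show_nofile_limits`:

```shell
# Hypothetical helper: print the soft and hard open-file limits
# that apply to the current shell and the processes it starts.
show_nofile_limits() {
    echo "soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
}

show_nofile_limits
```

This only reflects the shell it is run from. The limits applied to an already running process can be read on Linux from the `Max open files` row of `/proc/<pid>/limits`.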

Post by **sandeepatil** » Tue Jul 23, 2019 8:09 am

What agent are you using in the Remote systems to send the checks to the Nagios server?
"In-house developed working as per passive agent mechanism."

What is the time interval that the passive checks are setup on the remote servers?
"Time intervals vary from 1s to once a day; we have 1000 hosts and approx. 80000 services"

On the Nagios server, what did you setup to receive the passive checks from the remote hosts?
"Enable passive check receive setting"

[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
1347
[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
2734
[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
6806
[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
3512
[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
4530
[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
5546
[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
5092
[nagios@server ~]$ ll /opt/app/nagios/var/spool/checkresults/ | wc -l
4155
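
Note that `ll | wc -l` also counts the `total` line that `ls -l` prints, so each figure above is one higher than the real number of entries. A sketch for sampling the file count at a fixed interval instead; `count_results` and `watch_results` are hypothetical helper names and GNU `find` is assumed:

```shell
# Hypothetical helper: count regular files directly inside a directory.
count_results() {
    find "$1" -maxdepth 1 -type f | wc -l
}

# Hypothetical helper: print a timestamped count every few seconds.
# $1 = directory, $2 = seconds between samples (default 5), $3 = number of samples (default 10)
watch_results() {
    i=0
    while [ "$i" -lt "${3:-10}" ]; do
        echo "$(date '+%H:%M:%S') $(count_results "$1")"
        i=$((i + 1))
        sleep "${2:-5}"
    done
}
```

For example, `watch_results /opt/app/nagios/var/spool/checkresults 5 12` takes a sample every 5 seconds for one minute, which makes it easier to see whether the backlog is growing or draining.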

Post by **scottwilkerson** » Tue Jul 23, 2019 9:38 am

sandeepatil wrote: What agent are you using in the Remote systems to send the checks to the Nagios server?
"In-house developed working as per passive agent mechanism."

Because of this there isn't much we are going to be able to do...

sandeepatil wrote: What is the time interval that the passive checks are setup on the remote servers?
"Time intervals vary from 1s to once a day; we have 1000 hosts and approx. 80000 services"

This explains why there are so many files building up in the check results path.

The only thing I can suggest that can drastically improve performance is to move the checkresults path to a RAM disk. We have a guide for XI that you could glean the required information from:
https://assets.nagios.com/downloads/nag ... giosXI.pdf
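
As a rough illustration of the RAM disk idea (not taken from the guide above), the sketch below prints the mount command and a matching /etc/fstab entry for a tmpfs-backed spool directory. It prints them for review instead of running them, since mounting needs root and Nagios should be stopped first. `ramdisk_plan` is a hypothetical helper, and the size, mode and numeric owner IDs are assumed values to adapt.

```shell
# Hypothetical helper: print (not run) the commands for a tmpfs-backed spool.
# $1 = mount point, $2 = tmpfs size, $3 = numeric uid, $4 = numeric gid
# All defaults other than the path are assumptions, not values from the thread.
ramdisk_plan() {
    path="${1:-/opt/app/nagios/var/spool/checkresults}"
    opts="size=${2:-512m},mode=0775,uid=${3:-0},gid=${4:-0}"
    # One-off mount; contents are lost on unmount or reboot
    echo "mount -t tmpfs -o ${opts} tmpfs ${path}"
    # Matching /etc/fstab entry so the mount is recreated at boot
    echo "tmpfs ${path} tmpfs ${opts} 0 0"
}
```

For example, `ramdisk_plan /opt/app/nagios/var/spool/checkresults 512m "$(id -u nagios)" "$(id -g nagios)"` fills in the IDs of a `nagios` user, assuming that is the account the daemon runs as. tmpfs also accepts an `nr_inodes=` option if the inode ceiling needs to be raised.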

Nagios Support Forum
