Nagios not cleaning up old switch configs from mrtg?

Post by **aclauss** » Mon Jan 31, 2022 10:24 am

This weekend, our Nagios VM (based on the VMware image provided by Nagios) started sending emails to the "root" account every 5 minutes. They appear to come from the mrtg cron job; the subject line of the emails is:
Cron <root@nagios> LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lib/mrtg/mrtg.lock --confcache-file /var/lib/mrtg/mrtg.ok --user=nagios --group=nagios

The email is quite long, and is basically a whole bunch of SNMP failures (example below):

SNMPGET Problem for ifInOctets.10102 ifOutOctets.10102 on readonly@IP-address-snipped::::1:v4only: No response from remote host "IP-address-snipped" at /usr/bin/../lib/mrtg2/Net_SNMP_util.pm line 594.
Net_SNMP_util::snmpget('readonly@IP-address-snipped:161::::1:v4only', 'HASH(0x235dfb8)', 'ifInOctets.10102', 'ifOutOctets.10102') called at /usr/bin/mrtg line 2331
main::getsnmparg('HASH(0x1546de8)', 'HASH(0x2361b38)', 'HASH(0x1434fa8)', 'HASH(0x18ea380)') called at /usr/bin/mrtg line 2511
main::readtargets('HASH(0x1546de8)', 'ARRAY(0x15b12a8)', 'HASH(0x1434fa8)') called at /usr/bin/mrtg line 404
main::main called at /usr/bin/mrtg line 144

Most, if not all, of the IPs in this email are for network switches that were previously configured to be monitored by Nagios but have since been removed (some removed months ago).
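To see exactly which addresses are failing, the error text can be reduced to a per-host tally. The following is a minimal sketch, assuming the message wording shown in the example above; the function name `unreachable_hosts` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list each host that mrtg reported as unreachable, with a count.
# Assumes the wording 'No response from remote host "<addr>"' seen in the email.
unreachable_hosts() {
    # Reads mrtg output (or a saved mail body) on stdin.
    # Prints "<count> <host>" per unique host, most frequent first.
    grep -o 'No response from remote host "[^"]*"' \
        | sed 's/.*"\([^"]*\)"$/\1/' \
        | sort | uniq -c | sort -rn \
        | awk '{print $1, $2}'
}
```

With the mail body saved to a file (`mail.txt` is a placeholder name), `unreachable_hosts < mail.txt` gives a list that can be compared against the hosts still configured in Nagios.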

This brings two questions to mind:
1) Why is the underlying operating system sending emails every 5 minutes for these failures? These aren't actual Nagios application-level notifications. Historically, the VM has been sending emails typically once a day, related to the "automysqlbackup" cron (these don't seem to be errors, but informational with the output of the backup job). We've just ignored those since they are low-volume, but now these mrtg ones are flooding.

2) Why are these older addresses still being queried? Looking on the file system, the /etc/mrtg/conf.d/ folder contains cfg files for some 55 IP addresses. But Nagios itself only has 4 switches configured in it at this time. It seems like Nagios XI is not deleting these files when the corresponding hosts/services are deleted?
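One way to confirm which of those cfg files are stale is to compare them against the addresses Nagios still knows about. This is only a sketch under two assumptions that the thread does not confirm: that each file in conf.d is named `<address>.cfg`, and that the Nagios host definitions live in a directory of cfg files containing `address <value>` lines. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print mrtg cfg files whose address is not in any Nagios host definition.
# Assumptions: files are named <address>.cfg; hosts use "address <value>" lines.
orphaned_mrtg_cfgs() {
    local mrtg_dir=$1 hosts_dir=$2 known cfg addr
    # Collect every configured address once.
    known=$(grep -rhE '^[[:space:]]*address[[:space:]]+' "$hosts_dir" 2>/dev/null \
        | awk '{print $2}')
    for cfg in "$mrtg_dir"/*.cfg; do
        [ -e "$cfg" ] || continue            # the glob matched nothing
        addr=$(basename "$cfg" .cfg)
        # -x -F: exact literal match, so 10.0.0.1 does not match 10.0.0.10
        if ! printf '%s\n' "$known" | grep -qxF -- "$addr"; then
            printf '%s\n' "$cfg"
        fi
    done
}

# Example (the hosts path is an assumption about this install):
#   orphaned_mrtg_cfgs /etc/mrtg/conf.d /usr/local/nagios/etc/hosts
```

The function only prints candidates and deletes nothing, so the list can be reviewed before any file is touched.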

As for the timing of this starting to happen this weekend (while the switches were deleted a long time ago), two changes took place:
1) Nagios was updated from 5.8.6 to 5.8.7.
2) The underlying OS packages were updated (yum check-update / yum update).

Post by **pbroste** » Mon Jan 31, 2022 4:35 pm

Hello @aclauss

Thanks for reaching out. Please check the notification settings for the 'nagiosadmin' user account in the web console and adjust them if necessary.

admin_users.png
notification_perf.png

Please let us know if you need anything further,
Perry

Post by **aclauss** » Mon Jan 31, 2022 4:49 pm

Unfortunately, it looks like the nagiosadmin account already has a different, valid email address (one of mine).

nagiosadmin.PNG

Post by **pbroste** » Mon Jan 31, 2022 4:49 pm

Following up: please let us know if you want us to take a look at any other errors as well.

To send us your system profile:

Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
Save the profile.zip file and share it.

Thanks,
Perry

Post by **aclauss** » Mon Jan 31, 2022 4:56 pm

pbroste wrote:following up please let us know if you want us to take a look at any other errors as well.

To send us your system profile.

Login to the Nagios XI GUI using a web browser.

Click the "Admin" > "System Profile" Menu

Click the "Download Profile" button

Save the profile.zip file and share

Thanks,
Perry

And yes, I would like to chase down why Nagios is still trying to talk to things no longer configured. System profile attached.

Post by **pbroste** » Tue Feb 01, 2022 12:11 pm

Hello @aclauss

Thanks for following up and sending over the System Profile and the details.

First, regarding issue number one: we see that 'root@localhost' is listed in nagios.cfg. I have a feeling that there is a ghost cfg hanging around, but please confirm by running this to see whether there is another cfg:

Code: Select all

find /usr/local/nagios/ -type f -exec grep -HEri "root@localhost" {} \;

Please review the results, and make changes if necessary.

For the second issue, we are wondering whether there are 'Unconfigured Objects' sticking around, and we also want to run through a re-index.

Please head over to 'Unconfigured Objects' and clear these out.
unconfiguredobjects.png


If you want, you can go ahead and remove the '/etc/mrtg/conf.d/xxhost..xx' files for the hosts that are still hanging around.
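A more cautious variant of that cleanup is to move the stale files aside instead of deleting them, so they can be restored if a switch comes back. This is a minimal sketch; the function name and the backup location are hypothetical. The backup directory should sit outside conf.d, since mrtg.cfg on this system pulls in `conf.d/*.cfg`.

```shell
#!/usr/bin/env bash
# Sketch: move the given mrtg cfg files into a backup directory instead of deleting them.
archive_mrtg_cfg() {
    local backup_dir=$1; shift
    mkdir -p -- "$backup_dir" || return 1
    local cfg
    for cfg in "$@"; do
        mv -v -- "$cfg" "$backup_dir"/ || return 1
    done
}

# Example (both paths are placeholders):
#   archive_mrtg_cfg /root/mrtg-removed /etc/mrtg/conf.d/<stale-file>.cfg
```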

Let's also have you increase the NPCD timeouts, because you're getting things like:

NPCD: WARN: MAX load reached: load 12.800000

in your /usr/local/nagios/var/npcd.log...

Increase the following values in these files:

/usr/local/nagios/etc/pnp/process_perfdata.cfg:
TIMEOUT = 15

/usr/local/nagios/etc/pnp/npcd.cfg:
sleep_time = 10

Try a timeout of 30 or greater and a sleep time of 15 (or greater) and see if that improves things.

Then do:

Code: Select all

systemctl restart npcd
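The two edits above can also be scripted. The helper below is a sketch, not part of Nagios XI: `set_cfg_value` is a hypothetical name, and it assumes the `KEY = value` layout shown above and GNU sed. It keeps a `.bak` copy of each file it changes.

```shell
#!/usr/bin/env bash
# Sketch: set "KEY = value" in a config file, keeping a .bak copy of the original.
# Assumes simple key names and values (no regex metacharacters, no "|").
set_cfg_value() {
    local file=$1 key=$2 value=$3
    # Only lines that start with the key are rewritten; commented lines are left alone.
    sed -i.bak -E "s|^([[:space:]]*${key}[[:space:]]*=[[:space:]]*).*|\1${value}|" "$file"
}

# Example, using the values suggested above:
#   set_cfg_value /usr/local/nagios/etc/pnp/process_perfdata.cfg TIMEOUT 30
#   set_cfg_value /usr/local/nagios/etc/pnp/npcd.cfg sleep_time 15
#   systemctl restart npcd
```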

Let's go ahead and reindex the Core Configuration Manager (CCM) configs:

1: List all running /bin/nagios processes:

```
ps -aux | grep -E '/bin/nagios'
```

2: Kill them with killall -9 nagios (or pkill nagios):

```
pkill -f /bin/nagios
```

3: Clear out the import directory:

```
rm -rf /usr/local/nagios/etc/import/*
```

4: Restart nagios.service from the terminal:

```
systemctl restart nagios
```

5: Head over to the Nagios XI web console
==> Core Configuration Manager (CCM)
==> Config File Management ==> [Delete Files]
==> [Write Files]
==> [Verify Files]

6: Core Configuration Manager (CCM)
==> Under Quick Tools
==> "Apply Configuration"

7: Restart nagios.service from the terminal:

```
systemctl restart nagios
```

Verify that the host and services look good and verify that there are no errors in core by:

Code: Select all

/usr/local/nagios/bin/nagios -vvv /usr/local/nagios/etc/nagios.cfg
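The verification output can be long, so a quick count of problem lines helps. The filter below is a sketch that assumes problem lines begin with `Warning:` or `Error:`, as the duplicate-service warning quoted later in this thread does. The function name is made up.

```shell
#!/usr/bin/env bash
# Sketch: count warning and error lines in Nagios config-check output.
# Assumes each problem line starts with "Warning:" or "Error:".
summarize_verify() {
    # Reads the check output on stdin; prints "warnings=N errors=M".
    awk '/^Warning:/ {w++} /^Error:/ {e++}
         END {printf "warnings=%d errors=%d\n", w, e}'
}

# Example:
#   /usr/local/nagios/bin/nagios -vvv /usr/local/nagios/etc/nagios.cfg | summarize_verify
```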

Next, run the following as root to set the permissions on the mrtg files that the wizard / plugin use to get the bandwidth information:

Code: Select all

chown apache:nagios /etc/mrtg -R
chmod 775 /etc/mrtg -R
chown apache:nagios /var/lib/mrtg -R
chmod 775 /var/lib/mrtg -R

Code: Select all

systemctl restart npcd

Give it an hour or so, and if the above does not work, can you run the following commands as root and post the /tmp/mrtg.txt file here?

Code: Select all

LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg -debug=cfg,base,log &> /tmp/mrtg.txt
LANG=C LC_ALL=C /usr/bin/mrtg &>> /tmp/mrtg.txt
LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lib/mrtg/mrtg.lock --confcache-file /var/lib/mrtg/mrtg.ok &>> /tmp/mrtg.txt
{ time LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg 2>&1 ; } 2>> /tmp/mrtg.txt

Thanks,
Perry

Post by **aclauss** » Tue Feb 01, 2022 5:40 pm

pbroste wrote: First, look at the outline issue on number one we see that 'root@localhost' is listed in the nagios.cfg. Have a feeling that there is a ghost cfg that is hanging around, but want to have you confirm by running this to see if there is another cfg:
Code: Select all
find /usr/local/nagios/ -type f -exec grep -HEri "root@localhost" {} \;
Please review the results, and make changes if necessary.

It looks like there is only a single nagios.cfg.

Code: Select all

[root@nagios ~]# find /usr/local/nagios/ -type f -exec grep -HEri "root@localhost" {} \;
/usr/local/nagios/etc/nagios.cfg:admin_email=root@localhost
/usr/local/nagios/etc/nagios.cfg:admin_pager=root@localhost

As we deployed using the VMware image, I'm not familiar with this file specifically or with what changes should be made to it (as opposed to via the web UI). Should I be manually editing this config file so it no longer references root@localhost? Or were you worried there were MULTIPLE files, and since there only seems to be one, we are OK?

pbroste wrote: The second issue that we are looking at, wondering if there are 'Unconfigured Objects' sticking around but also want run through a re-index as well.

Please head over to the 'Unconfigured Objects' and clear these out.

Unfortunately - "No unconfigured passive objects found." Attempting to clear it anyway gave an error re-confirming that there were not any.

nagios_unconfigured_objects.PNG

pbroste wrote: If you want you can go ahead and remove the '/etc/mrtg/conf.d/xxhost..xx' host that are hanging around.

Done - I left the ones pertaining to the switches still configured in Nagios.

pbroste wrote: Increase the following values in these files:

/usr/local/nagios/etc/pnp/process_perfdata.cfg:
TIMEOUT = 15

/usr/local/nagios/etc/pnp/npcd.cfg:
sleep_time = 10

Try a timeout of 30 or greater and a sleep time of 15 (or greater) and see if that improves things.

Timeout was 5, I increased it to 15. Sleep time was set to 15, I set it to 20. Npcd restarted.

pbroste wrote: Let's go ahead and reindex the Core Configuration Manager (CCM) configs by:
...
5: Head over to the Nagios XI web console
==> Core Configuration Manager (CCM)
==> Config File Management ==> [Delete Files]
==> [Write Files]
==> [Verify Files]

This did indicate one warning:

Warning: Duplicate definition found for service 'Ping' on host 's68-psr-10GigE-b' (config file '/usr/local/nagios/etc/services/s68-psr-10g-b.cfg', starting on line 80)

Opening the file in question, I did not find a duplicate service. Did the Verify step also remove this duplicate? I'm continuing on the assumption "yes".

pbroste wrote: Give it an hour or so, and if the above does not work, can you run the following commands as root and post the /tmp/mrtg.txt file here?

Steps performed, I'll update in a bit. Thanks for all the information.

Post by **pbroste** » Wed Feb 02, 2022 4:17 pm

Hello @aclauss

Replying to the post for an 'SLA' update; thanks also for the details. Question: are you still seeing system emails from root@localhost?

Let us know how things are looking,
Perry

Post by **aclauss** » Tue Feb 08, 2022 11:11 am

The emails had stopped. However, it looks like a new case of the same just popped up. One of the remaining switches was 'moved' onto a new management network (another group in our organization is taking over management of it). The switch is no longer reachable at its IP. That immediately started up the emails for it, and they continued after the switch was removed from Nagios.

I can certainly go clear out the manual data again, but it does kind of feel like there is a bug here and removing the switch is not cleaning up the mrtg files it creates?

Post by **pbroste** » Wed Feb 09, 2022 11:21 am

Hello @aclauss

Thanks for following up. We want to verify that the cron job is running without issue. We see that the following cron entry runs every 5 minutes:

find /etc/cron* -type f -exec grep -Eri "mrtg" -A 2 -B 2 {} \;
*/5 * * * * root LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lib/mrtg/mrtg.lock --confcache-file /var/lib/mrtg/mrtg.ok --user=nagios --group=nagios
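On the first question in this thread (why the OS sends mail every 5 minutes): cron mails whatever a job prints to the job's owner unless MAILTO or a redirect says otherwise. Every run in which mrtg prints SNMP errors therefore produces one email to root. One possible way to keep that output without the mail is to send it to a log file. The wrapper below is a sketch only; `run_logged` and the log path are hypothetical, and silencing the mail also hides genuine failures unless the log is watched.

```shell
#!/usr/bin/env bash
# Sketch: run a command, append its stdout and stderr to a log file with a
# timestamp header, and hand back the command's own exit status.
run_logged() {
    local log=$1; shift
    {
        printf '== %s: %s\n' "$(date '+%F %T')" "$*"
        "$@"
    } >>"$log" 2>&1
}

# Example with a placeholder log path:
#   run_logged /var/log/mrtg-cron.log /usr/bin/mrtg /etc/mrtg/mrtg.cfg
# In the cron entry itself, the same effect comes from appending
#   >> /var/log/mrtg-cron.log 2>&1
# to the end of the command.
```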

Want to verify the cron status:

Code: Select all

systemctl status crond

Should see similar:

systemctl status crond
● crond.service - Command Scheduler
Loaded: loaded (/usr/lib/systemd/system/crond.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2022-02-08 16:18:10 CST; 17h ago
Main PID: 1327 (crond)
Tasks: 1 (limit: 11404)
Memory: 14.2M
CGroup: /system.slice/crond.service
└─1327 /usr/sbin/crond -n

Feb 09 09:30:01 localhost.localdomain CROND[236300]: (root) CMD (LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.>
Feb 09 09:35:01 localhost.localdomain CROND[238591]: (root) CMD (LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.>
Feb 09 09:40:01 localhost.localdomain CROND[240904]: (root) CMD (LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.>
Feb 09 09:45:01 localhost.localdomain CROND[243226]: (root) CMD (LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.>
Feb 09 09:50:01 localhost.localdomain CROND[245573]: (root) CMD (LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.

Manual run through to verify:

Code: Select all

LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lib/mrtg/mrtg.lock --confcache-file /var/lib/mrtg/mrtg.ok --user=nagios --group=nagios --debug="cfg,snpo" --check

Should look similar to example:

LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lib/mrtg/mrtg.lock --confcache-file /var/lib/mrtg/mrtg.ok --user=nagios --group=nagios --debug="cfg,snpo" --check
--cfg: /etc/mrtg/mrtg.cfg[1]: ######################################################################
--cfg: /etc/mrtg/mrtg.cfg[2]: # Multi Router Traffic Grapher -- Example Configuration File
--cfg: /etc/mrtg/mrtg.cfg[3]: ######################################################################
--cfg: /etc/mrtg/mrtg.cfg[4]: # This file is for use with mrtg-2.0
--cfg: /etc/mrtg/mrtg.cfg[5]: #
--cfg: /etc/mrtg/mrtg.cfg[6]: # Note:
--cfg: /etc/mrtg/mrtg.cfg[7]: #
--cfg: /etc/mrtg/mrtg.cfg[8]: # * Keywords must start at the begin of a line.
--cfg: /etc/mrtg/mrtg.cfg[9]: #
--cfg: /etc/mrtg/mrtg.cfg[10]: # * Lines which follow a keyword line which do start
--cfg: /etc/mrtg/mrtg.cfg[11]: # with a blank are appended to the keyword line
--cfg: /etc/mrtg/mrtg.cfg[12]: #
--cfg: /etc/mrtg/mrtg.cfg[13]: # * Empty Lines are ignored
--cfg: /etc/mrtg/mrtg.cfg[14]: #
--cfg: /etc/mrtg/mrtg.cfg[15]: # * Lines starting with a # sign are comments.
--cfg: /etc/mrtg/mrtg.cfg[16]:
--cfg: /etc/mrtg/mrtg.cfg[17]: # Where should the logfiles, and webpages be created?
--cfg: /etc/mrtg/mrtg.cfg[18]:
--cfg: /etc/mrtg/mrtg.cfg[19]: # Minimal mrtg.cfg
--cfg: /etc/mrtg/mrtg.cfg[20]: #--------------------
--cfg: /etc/mrtg/mrtg.cfg[21]:
--cfg: /etc/mrtg/mrtg.cfg[22]: HtmlDir: /var/www/mrtg
--cfg: /etc/mrtg/mrtg.cfg[23]: ImageDir: /var/www/mrtg
--cfg: /etc/mrtg/mrtg.cfg[24]: LogFormat: rrdtool
--cfg: /etc/mrtg/mrtg.cfg[25]: LogDir: /var/lib/mrtg
--cfg: /etc/mrtg/mrtg.cfg[26]: ThreshDir: /var/lib/mrtg
--cfg: /etc/mrtg/mrtg.cfg[27]: WorkDir: /var/lib/mrtg
--cfg: /etc/mrtg/mrtg.cfg[28]: Forks: 4
--cfg: /etc/mrtg/mrtg.cfg[29]: EnableSnmpV3: yes
--cfg: /etc/mrtg/mrtg.cfg[30]:
--cfg: conf.d/*.cfg[32]:
--cfg: conf.d/*.cfg[33]: EnableSNMPv3: yes

Take a quick look at the output for any interesting messages:

Code: Select all

LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lib/mrtg/mrtg.lock --confcache-file /var/lib/mrtg/mrtg.ok --user=nagios --group=nagios --debug="cfg,snpo" --check

Send over the copy of your mrtg.cfg:

Code: Select all

/etc/mrtg/mrtg.cfg

Let us know how things look.

Thanks,
Perry

Nagios Support Forum
