Re: Production server wproc errors returned
Posted: Tue Jun 11, 2019 5:21 pm
by pkarr
Ok here are NOM log files. There were 2.
nom-listing.PNG
When Greg said that NOM was acting up, he meant the NOM entry in the Nagios XI Jobs service check. NOM was stale,
first at a warning level, then critical, before clearing after I ran nom.php from the command line.
This time none of the other Nagios XI Jobs reported any issues.
thanks,
Penny
Re: Production server wproc errors returned
Posted: Wed Jun 12, 2019 9:28 am
by tgriep
Thanks for the log files.
I see some database connection issues with the Postgres database that the NOM script uses to determine whether it needs to run. Can you get the following file from the Nagios server and upload it to the post?
Code:
/var/lib/pgsql/data/pg_log/postgresql-Mon.log
This seems to cause the Apply Config to fail, and that could have caused the issue, so I would like to check a few more things.
Can you run the following commands as root and upload the /tmp/info.txt file to the post?
Code:
echo "SELECT relname AS objectname, relkind AS objecttype, reltuples, pg_size_pretty(relpages::bigint*8*1024) AS size FROM pg_class WHERE relpages >= 8 ORDER BY relpages DESC;" | psql nagiosxi nagiosxi >/tmp/info.txt
echo "select * from xi_meta;" | psql nagiosxi nagiosxi |grep last_nom_nagioscore_checkpoint >>/tmp/info.txt
ls -lR /usr/local/nagiosxi/nom/ >>/tmp/info.txt
ls -lR /usr/local/nagios/share/perfdata/ >>/tmp/info.txt
Thanks.
Re: Production server wproc errors returned
Posted: Wed Jun 12, 2019 9:48 am
by pkarr
Hi Tom,
Here is info.txt
I've included the postgresql log file from Monday as well.
thanks,
Penny
Re: Production server wproc errors returned
Posted: Wed Jun 12, 2019 1:24 pm
by tgriep
The Postgres log was full of these errors.
FATAL: connection limit exceeded for non-superusers
That probably caused the issue with the NOM script, as it could not connect to the Postgres database to update its information while it was running.
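Postgres raises this error when the number of open connections reaches the limit available to ordinary users (max_connections minus superuser_reserved_connections). As a rough sketch for gauging how often the limit was hit and how close the server currently is to it, something like the following could be used. The function names are made up for illustration, the default log path is simply the file requested earlier in this thread, and the psql arguments are copied from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: count connection-limit errors in a Postgres log file.
# The default path is the log file requested earlier in this thread.
count_conn_limit_errors() {
    logfile="${1:-/var/lib/pgsql/data/pg_log/postgresql-Mon.log}"
    # -c prints the number of matching lines, -F matches the text literally
    grep -cF 'connection limit exceeded for non-superusers' "$logfile"
}

# Hypothetical helper: show the configured limit next to the number of
# connections currently open. Assumes the same database and user
# (nagiosxi nagiosxi) as the psql commands used above.
show_connection_usage() {
    echo "SHOW max_connections;" | psql nagiosxi nagiosxi
    echo "SELECT count(*) AS open_connections FROM pg_stat_activity;" | psql nagiosxi nagiosxi
}
```

Running count_conn_limit_errors with no argument checks Monday's log; pass another log file as the first argument to check a different day.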
Edit this file on the Nagios server
Code:
/var/lib/pgsql/data/postgresql.conf
change this from
to
Save the file and restart the Nagios processes by running the following as root:
Code:
service npcd stop
service nagios stop
service ndo2db stop
service crond stop
service postgresql restart
rm -f /usr/local/nagios/var/rw/nagios.cmd
rm -f /usr/local/nagios/var/nagios.lock
rm -f /var/run/nagios.lock
rm -f /usr/local/nagios/var/ndo.sock
rm -f /usr/local/nagios/var/ndo2db.lock
rm -f /var/lib/mrtg/mrtg_l
rm -f /usr/local/nagiosxi/var/*.lock
for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
pkill -9 -u nagios
pkill python
service httpd restart
service ndo2db start
service nagios start
service npcd start
service crond start
Re: Production server wproc errors returned
Posted: Fri Jun 14, 2019 11:21 am
by gregwhite
Since making those changes we haven't had the issue with NOM. However, the event log did record a time change on Monday.
2019-06-10 15:54:14 Warning: A system time change of -1 seconds (0d 0h 0m 1s backwards in time) has been detected. Compensating.
If you remember, when we first had an issue with the production server on May 16th, you initially thought it could be attributed to the time change message. We had actually received four of those messages during that weekend. Then on the following Friday you raised the parameters on the MySQL database and things seemed to run smoothly.
Seeing this again has me nervous. I understand that it is at the system level, but we have moved to an NTP server that is more reliable. I had seen a post that said it could be a bad battery on the motherboard. I wanted to mention it in case you had any thoughts or suggestions.
Thanks,
Greg
Re: Production server wproc errors returned
Posted: Fri Jun 14, 2019 11:56 am
by tgriep
If I remember right, the earlier time changes were each many minutes long. This change is only -1 second, which I would not worry about.
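To tell one-second adjustments like this apart from larger jumps, a small sketch along these lines could pull the offsets out of a saved copy of the log. The function names and the 60-second default threshold are assumptions chosen for illustration, and the script only relies on the message wording quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print the offset, in seconds, of every
# "system time change" message found in the given log file.
time_change_offsets() {
    sed -n 's/.*system time change of \(-\{0,1\}[0-9][0-9]*\) seconds.*/\1/p' "$1"
}

# Hypothetical helper: print only the offsets whose absolute value exceeds
# a threshold. The 60-second default is an arbitrary assumption.
large_time_changes() {
    threshold="${2:-60}"
    time_change_offsets "$1" |
        awk -v t="$threshold" '{ a = ($1 < 0) ? -$1 : $1; if (a > t) print $1 }'
}
```

For example, large_time_changes /path/to/saved.log 300 would list only jumps of more than five minutes; the path here is a placeholder.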