Upgraded to Nagios XI 5.8.7, now running on CentOS 7.9 (used the OVA file to move to a new virtual machine).
An employee left, so I went into Nagios to remove him. Most checks went to "pending" status after I did Apply Configuration.
We are finding that we get paged as these pending checks flip back to their current status, so my coworker and I have just received dozens of pages.
We are worried that every time we do Apply Configuration we'll get paged hundreds or thousands of times.
Is there a setting to globally turn this "feature" off?
Help! Getting too many pages....
Re: Help! Getting too many pages....
Actually, we're not sure what is going on... we are still getting pages.
We're just guessing; we have no clue.
Could these pages be old ones that were "tanked up" on the old system (didn't go out for whatever reason) and are only now being sent from the new server, so the info is out of date? We get paged, check the system that paged us, and there is no issue. It's like this is "old information".
Re: Help! Getting too many pages....
Another clue?
The pages have no time stamp on them.
I see the exact same message in /var/log/messages from yesterday, but nothing today (same site, same message).
Does something read through /var/log/messages and email it out? If so, what is it and where is it controlled?
(Again, we are just "grasping at straws".)
Re: Help! Getting too many pages....
OK... found more logs in /usr/local/nagiosxi/var.
We are getting several "CHECK_NRPE: Error - Could not connect to X.X.X.X: Connection reset by peer" errors.
The new server has the same IP as the old server, so we should not have to update allowed_hosts on each remote site (in nrpe.cfg).
Where else should we look when NRPE stops working after a new server replaces an old one?
Re: Help! Getting too many pages....
I've run /usr/local/nagios/libexec/check_nrpe against the hosts that complain (the hosts we get paged on with the NRPE issue), and the same command run manually works OK.
Re: Help! Getting too many pages....
So, similar info as before: we get paged about an NRPE issue at a remote location. I can run the exact same command from the command line on the new Nagios server, and I can run it manually from the web GUI. Not sure why we get paged, since the issue obviously doesn't exist (in this case it was a memory test on the remote server) and the NRPE connection works when tried manually.
Re: Help! Getting too many pages....
OK, let me explain:
On Thursday:
1) loaded the new OVA template onto a new virtual machine and gave it a temporary IP
2) backed up the config on our old Nagios
3) restored the config onto the new Nagios
4) spent some time doing other config on the new one
5) eventually disconnected the old one and re-IPed the new one (gave it the old IP)
(so for a while, maybe 2 or possibly 3 hours, the new one was running on a fake IP)
After cutover, all the people in our on-call rotation got "flooded" with pages for a bit (maybe 10-20 minutes, if that).
Yesterday (Friday), since an employee had left the company, I tried removing him but couldn't; there is some "circular logic" somewhere that we need to find before we can remove him.
After this attempted apply, my coworker and I (and only the two of us; we are the sysadmins) got paged for hours and hours (stopping just before 6 PM our time). Most of the messages were of the type "check_nrpe: error - could not connect to x.x.x.x: connection reset by peer".
Now... while the new server had the fake IP, the NRPE checks wouldn't work because that IP was not in allowed_hosts on our 200 client servers. Could all these "check_nrpe" errors be from then, "tanked up" and released once this server became the real Nagios server? But why did this start Friday and not Thursday? The console is "green", and when we run a check_nrpe test against the same server it is fine, so it appears these are "old messages".
If these "check_nrpe" messages are old, how can we prove that? Where would I look on the Nagios server to see whether a message was created "now" versus 2 days ago?
Also, as a test, we took a host down and did NOT get paged for it. We wonder if Verizon shut off the paging since they were "flooded".
Re: Help! Getting too many pages....
And... this morning one of our stores was down. I was not paged for it, but someone else was (which is good; they should have been). Again, I wonder if Verizon cut off my paging, since pages to me had gone out all day Friday.
In any case, we need to figure out why we got hundreds of possibly old pages (or how to prove they are old).
Related question: if the Nagios server is offline for whatever reason (network outage), do these NRPE attempts "tank up", so that we'd get flooded again with old messages once the network is back online?
Re: Help! Getting too many pages....
Hi,
Thanks for contacting the Nagios Support Team and providing a detailed description of the issue here. It does look like there are multiple issues present.
Are you still having the issue with checks going to pending status after applying configuration? This may be related to a database issue; please send the system profile so we can take a look.
To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
To determine whether the pages were old, the best way is to review the nagios.log files from the time this happened. Can you attach the corresponding logs from the archive directory? They are rotated every 24 hours, so if a log is dated 12-13, it contains the information from 12-12.
/usr/local/nagios/var/archives
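As a rough aid for reading those logs: each nagios.log line begins with a Unix epoch timestamp in square brackets, so the time a notification was actually generated can be recovered even though the page itself carries no time stamp. The Python sketch below is only an illustration, not an official Nagios tool. The function names are made up for this example, and the sample lines (contact "admin", host "store01", the notification command name) are hypothetical, not taken from this system.

```python
import re
from datetime import datetime, timezone

# Matches lines such as:
# [1607731200] SERVICE NOTIFICATION: contact;host;service;STATE;command;output
NOTIFY_RE = re.compile(r"^\[(\d+)\] (SERVICE|HOST) NOTIFICATION: (.*)$")


def parse_notification(line):
    """Return a dict for a notification log line, or None for any other line."""
    match = NOTIFY_RE.match(line.strip())
    if match is None:
        return None
    epoch, kind, rest = match.groups()
    fields = rest.split(";")
    return {
        # Convert the epoch seconds to a readable UTC time.
        "when": datetime.fromtimestamp(int(epoch), tz=timezone.utc),
        "kind": kind,
        "contact": fields[0],
        "host": fields[1] if len(fields) > 1 else "",
    }


def notifications_per_hour(lines):
    """Count notification lines per UTC hour to show when a flood happened."""
    counts = {}
    for line in lines:
        record = parse_notification(line)
        if record is None:
            continue
        hour = record["when"].strftime("%Y-%m-%d %H:00")
        counts[hour] = counts.get(hour, 0) + 1
    return counts


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    "[1607731200] SERVICE NOTIFICATION: admin;store01;Memory;CRITICAL;notify-by-page;CHECK_NRPE: Error",
    "[1607731260] SERVICE NOTIFICATION: admin;store01;Memory;CRITICAL;notify-by-page;CHECK_NRPE: Error",
    "[1607734800] HOST NOTIFICATION: admin;store01;DOWN;notify-by-page;CRITICAL",
    "[1607734800] SERVICE ALERT: store01;Memory;CRITICAL;HARD;3;CHECK_NRPE: Error",
]

if __name__ == "__main__":
    for hour, count in sorted(notifications_per_hour(SAMPLE).items()):
        print(hour, count)
```

To use it on real data, pass the lines of one of the archived log files to notifications_per_hour (for example, open the file and hand the file object to the function). If the hours it reports match the time the pages arrived, the notifications were generated then rather than days earlier.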
Regarding the test where you took a host down and did not get paged: please enable phpmailer logging so we can verify the notifications were sent as expected, and let me know the name of the host so I can review the configurations.
Thanks,
Benjamin
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Be sure to check out our Knowledgebase for helpful articles and solutions!