Help! Getting too many pages....

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
capmarvel
Posts: 14
Joined: Tue Mar 03, 2015 9:50 am

Help! Getting too many pages....

Post by capmarvel »

Upgraded to Nagios 5.8.7 and we are now also on CentOS 7.9 (used the OVA file to move to a new virtual machine).

An employee left, so I went into Nagios to remove him. Most checks went to "pending" status after I did Apply Configuration.

We are finding that we get paged when these pending checks flip to their current status, so my coworker and I have just received dozens of pages.

We are worried now that every time we do Apply Configuration we'll get paged hundreds/thousands of times.

Is there a setting to globally turn this "feature" off?
capmarvel
Posts: 14
Joined: Tue Mar 03, 2015 9:50 am

Re: Help! Getting too many pages....

Post by capmarvel »

Actually, we are not sure what is going on... we are still getting pages.

We are just guessing; we have no clue.

Could these pages be old ones that were "tanked up" on the old system (they didn't go out for whatever reason), but now on the new server they are going out and the info is out of date? We get paged, we check the system that paged us, and there is no issue. It's like this is "old information".
capmarvel
Posts: 14
Joined: Tue Mar 03, 2015 9:50 am

Re: Help! Getting too many pages....

Post by capmarvel »

Another clue?

Pages have no timestamp on them.

I see the exact same message in /var/log/messages from yesterday, but nothing today (same site, same message)

Does something read thru /var/log/messages and email out? If so what is this and where is it controlled?

(Again, we are just "grasping at straws".)
capmarvel
Posts: 14
Joined: Tue Mar 03, 2015 9:50 am

Re: Help! Getting too many pages....

Post by capmarvel »

OK ...found more logs in /usr/local/nagiosxi/var.

We are getting several "CHECK_NRPE: Error - Could not connect to X.X.X.X: Connection reset by peer"

The new server has same IP as old server...so we should not have to update the allowed_hosts on each remote site (in nrpe.cfg).

Where else would we need to look for this not working when a new server replaces an old server?
capmarvel
Posts: 14
Joined: Tue Mar 03, 2015 9:50 am

Re: Help! Getting too many pages....

Post by capmarvel »

I've run /usr/local/nagios/libexec/check_nrpe against the hosts that complain (hosts we get paged on with the NRPE issue), and the same command run manually works OK.
capmarvel
Posts: 14
Joined: Tue Mar 03, 2015 9:50 am

Re: Help! Getting too many pages....

Post by capmarvel »

So, similar info as before: we get paged about an NRPE issue at a remote location. I can run the exact same command from the command line on the new Nagios server, and I can run the command manually from the web GUI. Not sure why we get paged, since the issue obviously doesn't exist (in this case it was a memory test on the remote server) and the NRPE connection works if I try it manually.
capmarvel
Posts: 14
Joined: Tue Mar 03, 2015 9:50 am

Re: Help! Getting too many pages....

Post by capmarvel »

OK let me explain:

On Thursday:
1) loaded new OVA template to new virtual machine - gave this a temporary IP
2) backed up config on our old Nagios
3) restored config onto new Nagios
4) spent some time doing other config on new
5) eventually disconnected old, and re-IP new (gave it the old IP)
(so for awhile maybe 2 or possibly 3 hours(?) the new one was running on a fake IP)

After cutover, all the people in our oncall rotation got "flooded" with pages for a bit (maybe 10-20 minutes if that).

Yesterday (Friday), since an employee had left the company, I tried removing him but couldn't. There is some "circular logic" somewhere that we need to find before we can remove him.

After this attempted apply, my coworker and I (and only the 2 of us; we are the sysadmins) got paged for hours and hours (stopping just before 6PM our time). Most of the messages were of the type "check_nrpe: error - could not connect to x.x.x.x: connection reset by peer"

Now... when the new server had the fake IP, the NRPE checks wouldn't work because the "allowed_hosts" IP was wrong on our 200 client servers. Could all these "check_nrpe" errors be from then, "tanked up" and then released when this server became the real Nagios server? But why did this start Friday and not Thursday? The console is "green", and when we run a check_nrpe test against the same server it is fine, so it appears these are "old messages".

If these "check_nrpe" errors are old, how can we prove that? Where would I look on the Nagios server to see whether a message was created "now" versus 2 days ago?

Also, as a test, we took a host down and did NOT get paged on it. We wonder if Verizon shut off the paging since they were "flooded".
capmarvel
Posts: 14
Joined: Tue Mar 03, 2015 9:50 am

Re: Help! Getting too many pages....

Post by capmarvel »

And this morning one of our stores was down. I was not paged on it, but someone else was (which is good, they should have been). Again, I wonder if Verizon cut off my paging, as pages to me had gone out all day Friday.

In any case, we need to figure out why we got hundreds of possibly old pages (or how to prove they are old).

Related question: if the Nagios server is offline for whatever reason (network outage), do these NRPE attempts "tank up", and would we get flooded again with old messages once the network is back online?
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: Help! Getting too many pages....

Post by benjaminsmith »

Hi,

Thanks for contacting the Nagios Support Team and providing a detailed description of the issue here. It does look like there are multiple issues present.
capmarvel wrote: The employee left, so went into Nagios to remove him. Most checks went to "pending status" after did Apply Configuration
Are you still having the issue with pending status after applying configuration? This may be related to a database issue; please send the system profile so we can take a look.

To send us your system profile:
1) Log in to the Nagios XI GUI using a web browser.
2) Click the "Admin" > "System Profile" menu.
3) Click the "Download Profile" button.
capmarvel wrote: In any case, we need to figure out why we got hundreds of possible old pages (or how to prove they old).
The best way is to review the nagios.log files from the time this happened. Can you attach the corresponding logs from the archive directory? They are rotated every 24 hours, so a log dated 12-13 would contain the information from 12-12.

/usr/local/nagios/var/archives
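If you want a quick look yourself before attaching the files, here is a minimal Python sketch that pulls the notification entries out of a log and converts their timestamps. It assumes the standard Nagios Core log layout, where each line starts with a Unix epoch in square brackets followed by "HOST NOTIFICATION:" or "SERVICE NOTIFICATION:" and semicolon-separated fields beginning with the contact name. The function names are made up for illustration and are not part of Nagios.

```python
import re
from datetime import datetime, timezone

# Assumed line layout (standard Nagios Core logging):
#   [epoch] SERVICE NOTIFICATION: contact;host;service;state;command;output
#   [epoch] HOST NOTIFICATION: contact;host;state;command;output
LINE_RE = re.compile(r"^\[(\d+)\] (HOST|SERVICE) NOTIFICATION: (.*)$")


def parse_notifications(lines):
    """Yield (utc_datetime, kind, contact, rest) for each notification line."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a notification entry (alerts, check results, etc.)
        epoch, kind, payload = match.groups()
        contact, _, rest = payload.partition(";")
        when = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
        yield when, kind, contact, rest


def count_by_hour(lines):
    """Return {"YYYY-MM-DD HH:00 UTC": count} of notifications logged per hour."""
    counts = {}
    for when, _kind, _contact, _rest in parse_notifications(lines):
        key = when.strftime("%Y-%m-%d %H:00 UTC")
        counts[key] = counts.get(key, 0) + 1
    return counts
```

To use it, open one of the archived logs and pass the file object to count_by_hour(), then print the result. The timestamp on each entry is when Nagios logged the notification, so a burst of entries dated Friday would indicate the pages were generated on Friday rather than held over from Thursday.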
capmarvel wrote: Also..as test...took host down and we did NOT get paged on it....we wonder if Verizon shut off the paging since they were "flooded".
[Attachment: phpmailer-logging.png]
Please enable phpmailer logging so we can verify the notifications were sent as expected, and let me know the name of the host so I can review the configurations.
Thanks,
Benjamin