Server Moving out from maintenance

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
raamardhani7
Posts: 459
Joined: Tue Jun 02, 2015 12:36 am

Server Moving out from maintenance

Post by raamardhani7 »

Hi Team,

We are facing a problem with maintenance mode (MM) in Nagios. Yesterday, when I set all the devices in the NA region to maintenance mode, it appears to have overridden the existing downtime of all hosts/services that had been under MM for a month or longer (Windows and AIX servers that are to be decommissioned).

We are receiving incidents from Nagios for those hosts, which had been under maintenance for a month or more but were removed from their old downtime by last night's downtime activity. Because of this, availability is being affected and the issue has been escalated badly.

Could you please help us resolve this issue on priority?
raamardhani7
Posts: 459
Joined: Tue Jun 02, 2015 12:36 am

Re: Server Moving out from maintenance

Post by raamardhani7 »

There are also a few servers for which the downtime ends prematurely and a host down alert is sent.

Nagios version-Nagios XI 5.4.4
tacolover101
Posts: 432
Joined: Mon Apr 10, 2017 11:55 am

Re: Server Moving out from maintenance

Post by tacolover101 »

Did you use a regular scheduled downtime or a recurring downtime?

How were your Windows / AIX servers scheduled for downtime?

My guess is that support will need to reproduce this.
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Server Moving out from maintenance

Post by lmiltchev »

raamardhani7, can you describe in detail all of the steps you took to place your devices in maintenance mode before you encountered the issue? Screenshots of the previously scheduled and/or recurring downtimes would be helpful, along with relevant logs. We will try to recreate the issue in house.
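
As a text alternative to screenshots, the downtimes Nagios currently knows about could be pulled from the status file. The following is only a sketch: it assumes a Nagios Core style status.dat with `hostdowntime { ... }` blocks containing `host_name=`, `start_time=` and `end_time=` lines, and the function name `list_host_downtimes` is made up for illustration. The status file location varies by install (for example /usr/local/nagios/var/status.dat), so check the `status_file` setting in nagios.cfg first.

```shell
# Hypothetical helper: print "host start_time end_time" (epoch seconds)
# for every hostdowntime block found in the given status file.
list_host_downtimes() {
    awk '
        # Start of a host downtime block: reset the fields we collect.
        /^hostdowntime [{]/ { inblock = 1; host = ""; start = ""; stop = ""; next }
        # End of the block: emit one line per downtime entry.
        inblock && /^[ \t]*[}]/ { print host, start, stop; inblock = 0; next }
        # Inside the block: split each "key=value" line at the first "=".
        inblock {
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            if (eq == 0) next
            key = substr(line, 1, eq - 1)
            val = substr(line, eq + 1)
            if (key == "host_name") host = val
            else if (key == "start_time") start = val
            else if (key == "end_time") stop = val
        }
    ' "$1"
}

# Example (path is an assumption, adjust to your status_file setting):
# list_host_downtimes /usr/local/nagios/var/status.dat | sort
```

Running this before and after scheduling the short downtime, and comparing the two listings, would show whether the long-running entries for a host actually disappear.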
Be sure to check out our Knowledgebase for helpful articles and solutions!
raamardhani7
Posts: 459
Joined: Tue Jun 02, 2015 12:36 am

Re: Server Moving out from maintenance

Post by raamardhani7 »

Hi,

If a device has been in maintenance mode for a month, and an activity then requires multiple devices to be put in maintenance mode for 2 hours, the earlier one-month maintenance period gets canceled/overridden.

We have been facing this issue since June. Earlier, the host would take the longest of the scheduled windows.
Example: if a host was put in maintenance mode today for a week, and 2 days later someone put the same server in maintenance mode for 15 days, the host stayed in maintenance mode for 17 days (2 + 15).

We have 3 Nagios XI servers, but only this one has the issue. It mostly happens whenever there is network activity and all the servers configured in that Nagios XI are put in maintenance mode.
We also found that during such activity the disk space on our Nagios XI server reaches 100% and the database crashes. The event handler is filling rapidly which causes the disk space to reach 100%. We are not sure whether this has something to do with servers coming out of scheduled downtime in Nagios.

Currently we have 1128 servers and 180501 services configured on that NagiosXI
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Server Moving out from maintenance

Post by npolovenko »

Hello, @raamardhani7.

"Currently we have 1128 servers and 180501 services configured on that NagiosXI"
"The event handler is filling rapidly which causes the disk space to reach 100%."

Have you considered splitting the XI load between two servers? I don't know your hardware configuration, but 180501 services seems like a lot for a single XI server to handle.

"Nagios version-Nagios XI 5.4.4"

I'd start by upgrading your Nagios XI to the latest version. That might fix the issue on its own.

Also, could you upload the timeperiods.cfg file from /usr/local/nagios/etc/?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
raamardhani7
Posts: 459
Joined: Tue Jun 02, 2015 12:36 am

Re: Server Moving out from maintenance

Post by raamardhani7 »

Please find the timeperiod.cfg file attached.
We are still facing the same issue.
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Server Moving out from maintenance

Post by npolovenko »

@raamardhani7, I think upgrading Nagios XI to the latest version may fix this issue. There have been a few bug fixes related to scheduled downtime since version 5.4.4. However, before you upgrade, I highly recommend doing some optimizations on your system. Can you post the output of df -h? Chances are your system needs more memory to be able to function normally. Also, you said that this XI server is responsible for 180501 service checks, which seems like a lot! Did you mean to say 18501 by chance?
PS: For a faster resolution, you may also create a support ticket: https://support.nagios.com/tickets
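
Since the disk reportedly fills to 100% during these maintenance windows, a small check like the one below could be run alongside df -h to watch the affected filesystem. This is only a sketch: the function name `disk_usage_pct` and the example path are assumptions, not something shipped with Nagios XI.

```shell
# Hypothetical helper (not part of Nagios XI): print the used-space percentage
# of the filesystem that holds the given path, as a bare integer.
disk_usage_pct() {
    # -P forces the POSIX single-line format, so column 5 is the capacity column.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example: warn when the filesystem holding /usr/local/nagios
# (assumed install path) is more than 90% used.
# [ "$(disk_usage_pct /usr/local/nagios)" -gt 90 ] && echo "disk nearly full"
```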
raamardhani7
Posts: 459
Joined: Tue Jun 02, 2015 12:36 am

Re: Server Moving out from maintenance

Post by raamardhani7 »

I have attached the df -h output.
Yes, there are only 18501 services.
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Server Moving out from maintenance

Post by lmiltchev »

Let's check a few things. Run the following commands and show the output:

# These two commands will show us the ramdisk entries in the /etc/init.d/nagios file, and the entire /etc/sysconfig/nagios file

Code:

grep -i ramdisk /etc/init.d/nagios
cat /etc/sysconfig/nagios

# These commands will show us the nagiosramdisk entries in various config files

Code:

grep nagiosramdisk /usr/local/nagios/etc/nagios.cfg
grep nagiosramdisk /usr/local/nagiosmobile/include.inc.php
grep nagiosramdisk /usr/local/nrdp/server/config.inc.php
grep nagiosramdisk /usr/local/nagiosxi/html/config.inc.php
grep nagiosramdisk /usr/local/nagios/etc/pnp/npcd.cfg
grep nagiosramdisk /usr/local/nagios/etc/commands.cfg

# These commands will show the permissions on the checkresults directory, and how many files are in it

Code:

ls -lad /var/nagiosramdisk/spool/checkresults
ls /var/nagiosramdisk/spool/checkresults/ | wc -l

# Let's see if the perfdataproc cron job is running

Code:

ps -ef | grep '[p]erfdataproc'
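
If it is easier to attach one file than to paste each output separately, the checks could be wrapped in a small collector. This is only a sketch: the function name `collect_diag` and the report path are made up for illustration, and only a subset of the commands above is listed; add the remaining grep lines as needed.

```shell
# Hypothetical helper: run each diagnostic command and append its output
# (including any error messages) to a single report file.
collect_diag() {
    report=$1
    : > "$report"                      # start with an empty report
    while IFS= read -r cmd; do
        [ -n "$cmd" ] || continue      # skip blank lines
        printf '### %s\n' "$cmd" >> "$report"
        # stdin is redirected so a command cannot swallow the command list.
        sh -c "$cmd" >> "$report" 2>&1 </dev/null
    done <<'EOF'
grep -i ramdisk /etc/init.d/nagios
cat /etc/sysconfig/nagios
grep nagiosramdisk /usr/local/nagios/etc/nagios.cfg
ls -lad /var/nagiosramdisk/spool/checkresults
ls /var/nagiosramdisk/spool/checkresults/ | wc -l
ps -ef | grep '[p]erfdataproc'
EOF
    return 0                           # individual command failures are recorded, not fatal
}

# Example (output path is an assumption):
# collect_diag /tmp/nagios_diag.txt
```

Each section of the report starts with a `###` line naming the command, so missing files or permission errors show up next to the command that produced them.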