Server Moving out from maintenance
Posted: Mon Jul 31, 2017 1:24 am
by raamardhani7
Hi Team,
We are facing a problem with maintenance mode in Nagios. Yesterday I set all the devices in the NA region to maintenance mode, and it seems to have overridden the existing downtime on all the hosts/services that had been under MM for a month or longer (Windows and AIX servers that were due to be decommissioned).
We are now receiving incidents from Nagios for those hosts. They had been under maintenance for a month or more, but were removed from their old downtime by last night's downtime activity. Because of this, availability is being affected and we are facing serious escalations.
Could you please help us resolve this issue on priority?
Re: Server Moving out from maintenance
Posted: Mon Jul 31, 2017 1:41 am
by raamardhani7
Also, there are a few servers for which downtime ends prematurely and a host down alert is sent.
Nagios version-Nagios XI 5.4.4
Re: Server Moving out from maintenance
Posted: Mon Jul 31, 2017 9:09 am
by tacolover101
did you use a regular scheduled downtime, or a recurring downtime?
how were your windows / aix servers scheduled in downtime?
my guess is they'll need to reproduce this.
Re: Server Moving out from maintenance
Posted: Mon Jul 31, 2017 10:06 am
by lmiltchev
raamardhani7, can you describe in detail all of the steps you took to place your devices in maintenance mode before you encountered the issue? Screenshots of the previous scheduled and/or recurring downtimes would be helpful, along with relevant logs. We will try to recreate the issue in-house.
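One way to capture the currently scheduled host downtimes as text, in addition to screenshots, is to read them from the Nagios status file. The sketch below is only an illustration, not an official tool. The function name `list_host_downtimes` is hypothetical, and it assumes the status file stores downtimes as `hostdowntime { ... }` blocks with `host_name=`, `start_time=` and `end_time=` keys. The path in the usage comment is also an assumption; the real location is whatever `status_file` points to in nagios.cfg.

```shell
#!/bin/sh
# Print "host start_epoch end_epoch" for every hostdowntime block in a
# Nagios status file. The block layout is an assumption (see lead-in).
list_host_downtimes() {
    awk '
        /^hostdowntime [{]/ { in_block = 1; host = ""; t_start = ""; t_end = ""; next }
        in_block && /^[ \t]*[}]/ { print host, t_start, t_end; in_block = 0; next }
        in_block {
            line = $0
            sub(/^[ \t]+/, "", line)          # drop leading indentation
            split(line, kv, "=")              # key=value pairs
            if (kv[1] == "host_name")  host    = kv[2]
            if (kv[1] == "start_time") t_start = kv[2]
            if (kv[1] == "end_time")   t_end   = kv[2]
        }
    ' "$1"
}

# Usage (path is an assumption):
#   list_host_downtimes /usr/local/nagios/var/status.dat | sort
```

Running this before and after a bulk maintenance action, and comparing the two listings, would show whether the older long-running entries are still present.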
Re: Server Moving out from maintenance
Posted: Fri Nov 03, 2017 8:55 am
by raamardhani7
Hi,
If a device has been in maintenance mode for a month and there is an activity for which multiple devices have to be put in maintenance mode for 2 hours, the earlier one-month maintenance period gets canceled/overridden.
We have been facing this issue since June. Earlier, a host stayed in maintenance for the longest of its scheduled windows.
Example: if a host is put in maintenance mode today for a week, and 2 days later someone puts the same server in maintenance mode for 15 days, the host stays in maintenance mode for 17 days (2 + 15).
We have 3 Nagios XI servers, but only this one has the issue. It mostly happens whenever there is network activity and all the servers configured in that Nagios XI are put in maintenance mode.
One more finding: during such activity, disk space on our Nagios XI server reaches 100% and the database crashes. The event handler is filling rapidly which causes the disk space to reach 100%. We are not sure whether this has something to do with servers coming out of scheduled downtime in Nagios.
Currently we have 1128 servers and 180501 services configured on that NagiosXI
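For reference, the overlap scenario described above (a short window scheduled on top of a long one) can be reproduced from the command line with the Nagios `SCHEDULE_HOST_DOWNTIME` external command. The sketch below only builds the command line for a fixed, untriggered window. The function name `build_host_downtime`, the host name, the author and the command file path in the usage comment are assumptions for illustration, not values taken from this thread.

```shell
#!/bin/sh
# Build a SCHEDULE_HOST_DOWNTIME line in the external command format:
# [time] SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;trigger_id;duration;author;comment
# fixed=1 and trigger_id=0 are hard-coded here (fixed, untriggered downtime).
build_host_downtime() {
    host=$1; start=$2; end=$3; author=$4; comment=$5
    now=$(date +%s)
    duration=$((end - start))
    printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;%s;%s;%s\n' \
        "$now" "$host" "$start" "$end" "$duration" "$author" "$comment"
}

# Usage (host, author and command file path are assumptions):
#   start=$(date +%s); end=$((start + 7200))
#   build_host_downtime web01 "$start" "$end" nagiosadmin "2h network activity" \
#       >> /usr/local/nagios/var/rw/nagios.cmd
```

Scheduling a 2-hour window this way on a test host that already has a long downtime, then checking whether the long entry survives, is one way to narrow down whether the override happens in the core or in the interface used to schedule it.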
Re: Server Moving out from maintenance
Posted: Fri Nov 03, 2017 12:59 pm
by npolovenko
Hello @raamardhani7,
"Currently we have 1128 servers and 180501 services configured on that NagiosXI"
"The event handler is filling rapidly which causes the disk space to reach 100%."
Have you considered splitting the XI load between two servers? I don't know your hardware configuration, but 180501 services seems like a lot for a single XI server to handle.
"Nagios version-Nagios XI 5.4.4"
I'd start by upgrading your Nagios XI to the latest version. That might fix the issue on its own.
Also, could you upload the timeperiods.cfg file from /usr/local/nagios/etc/?
Re: Server Moving out from maintenance
Posted: Mon Nov 20, 2017 8:33 am
by raamardhani7
Please find the timeperiod.cfg file attached.
We are still facing the same issue.
Re: Server Moving out from maintenance
Posted: Mon Nov 20, 2017 12:06 pm
by npolovenko
@raamardhani7, I think upgrading Nagios XI to the latest version may fix this issue. There have been a few bug fixes related to scheduled downtime since version 5.4.4. However, before you upgrade, I highly recommend doing some optimizations on your system. Can you post the output of df -h? Chances are your system needs more memory to be able to function normally. Also, you said that this XI is responsible for 180501 service checks, which seems like a lot! Did you mean to say 18501 by chance?
PS: For a faster resolution, you may also create a support ticket:
https://support.nagios.com/tickets
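Since the disk reportedly reaches 100% during these events, the `df` output requested above can be narrowed to the filesystems that are nearly full. This is a minimal sketch: the function name `full_filesystems` and the 90% default threshold are arbitrary choices, and it assumes `df -P` (POSIX output format) so that usage is in column 5 and the mount point in column 6.

```shell
#!/bin/sh
# Read `df -P` output on stdin and print "mountpoint usage" for every
# filesystem at or above the given usage threshold (default 90).
full_filesystems() {
    threshold=${1:-90}
    awk -v limit="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)                     # "100%" -> "100"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage:
#   df -P | full_filesystems 90
```

Mount points containing spaces are not handled by this sketch.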
Re: Server Moving out from maintenance
Posted: Tue Nov 21, 2017 10:09 am
by raamardhani7
I have attached the df -h output.
Yes, there are only 18501 services.
Re: Server Moving out from maintenance
Posted: Tue Nov 21, 2017 11:42 am
by lmiltchev
Let's check a few things. Run the following commands and show the output:
# These two commands will show us the ramdisk entries in the /etc/init.d/nagios file, and the entire /etc/sysconfig/nagios file
grep -i ramdisk /etc/init.d/nagios
cat /etc/sysconfig/nagios
# These commands will show us the nagiosramdisk entries in various config files
grep nagiosramdisk /usr/local/nagios/etc/nagios.cfg
grep nagiosramdisk /usr/local/nagiosmobile/include.inc.php
grep nagiosramdisk /usr/local/nrdp/server/config.inc.php
grep nagiosramdisk /usr/local/nagiosxi/html/config.inc.php
grep nagiosramdisk /usr/local/nagios/etc/pnp/npcd.cfg
grep nagiosramdisk /usr/local/nagios/etc/commands.cfg
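The six grep commands above can also be run as one loop that distinguishes a missing file from a file with no match, which plain `grep` output does not make obvious. This is only a convenience sketch; the function name `check_ramdisk_refs` is hypothetical, and the file list in the usage comment is simply the one from the commands above.

```shell
#!/bin/sh
# For each file: print matching lines as file:line:text, or report that the
# file is missing or contains no match for the pattern.
check_ramdisk_refs() {
    pattern=$1; shift
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "MISSING $f"
        elif ! grep -Hn -e "$pattern" "$f"; then
            echo "NO MATCH $f"
        fi
    done
}

# Usage:
#   check_ramdisk_refs nagiosramdisk \
#       /usr/local/nagios/etc/nagios.cfg \
#       /usr/local/nagiosmobile/include.inc.php \
#       /usr/local/nrdp/server/config.inc.php \
#       /usr/local/nagiosxi/html/config.inc.php \
#       /usr/local/nagios/etc/pnp/npcd.cfg \
#       /usr/local/nagios/etc/commands.cfg
```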
# These commands will show the permissions on the checkresults directory, and how many perfdata files are in it
ls -lad /var/nagiosramdisk/spool/checkresults
ls /var/nagiosramdisk/spool/checkresults/ | wc -l
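A raw file count does not show whether files are piling up or just passing through. As a rough complement, the sketch below counts only files older than a given number of minutes. The function name `count_stale_files` and the 5-minute default are arbitrary assumptions, and it relies on the `-maxdepth` and `-mmin` options found in GNU and BSD `find`.

```shell
#!/bin/sh
# Count regular files directly inside a directory that were last modified
# more than N minutes ago (default 5).
count_stale_files() {
    dir=$1; minutes=${2:-5}
    find "$dir" -maxdepth 1 -type f -mmin +"$minutes" | wc -l | tr -d ' '
}

# Usage:
#   count_stale_files /var/nagiosramdisk/spool/checkresults 5
```

A count that keeps growing between runs would suggest that results are not being processed as fast as they arrive.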
# Let's see if the perfdataproc cron job is running