Nagios Support Forum

Posted: **Tue Feb 26, 2019 11:53 pm**

Hi Support,

I need help understanding the queries below.

Issue: No alarm for CPU on server PG938

Current Setup:
Server: PG938
Component: CPU
CPU_Total_Linux (individual CPU service) = 90% threshold (-a '-c 90')
GMS_Linux_Default_CPU (generic service rule for CPU) = 95% threshold (-a '-w 90 -c 95')

Scenario:
1) Server PG938 is monitored through Nagios XI. Recently there was an incident where the CPU on PG938 was at 100%, but no alarm was triggered from Nagios XI.
2) Upon investigation, we found that the CPU monitoring rule existed twice in Nagios XI: once where CPU monitoring for PG938 was individually configured with a threshold of 90%, and once in the default CPU service with a threshold of 95%.

Queries:
1) In this scenario, why was Nagios XI unable to trigger an alarm from either CPU service, given that the threshold values were set differently in the default service and the individual configuration? Instead it generated a warning event, "Duplicate definition found".
2) Is it the default behavior of Nagios XI not to generate an alarm when a duplicate definition is found? If not, can this be modified somewhere in Nagios XI so that at least one alarm is generated in such a situation?
3) What other exceptions or rejections currently exist in Nagios XI, so that we can avoid such cases in future?

Posted: **Wed Feb 27, 2019 2:20 pm**

Hi @Rohan77,

Some good questions. Nagios will let you apply the configuration (with a warning) when there is a duplicate service or host definition. However, this does not mean that it will simply ignore checks for the service. It will use the last definition in the configuration file for the host or service.

You can look at the definition for the host or service, taking template inheritance into account, by searching for it in the objects.cache file located at /usr/local/nagios/var/objects.cache.
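As a rough illustration of that lookup, the Python sketch below pulls every service definition for one host out of objects.cache-style text and prints a few directives. The function names and the embedded sample are hypothetical; on a real system you would read the file at the path mentioned above instead of using the sample string.

```python
import re


def parse_services(text):
    """Return one dict per 'define service { ... }' block in objects.cache-style text."""
    services, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if re.match(r"define\s+service\s*\{", line):
            current = {}
        elif line == "}":
            if current is not None:
                services.append(current)
            current = None
        elif current is not None and line:
            # Directive name and value are separated by whitespace (a tab in objects.cache).
            parts = line.split(None, 1)
            current[parts[0]] = parts[1] if len(parts) > 1 else ""
    return services


def services_for_host(text, host):
    """Keep only the service definitions whose host_name matches."""
    return [s for s in parse_services(text) if s.get("host_name") == host]


# Hypothetical, shortened sample in the same layout as objects.cache.
SAMPLE = """
define service {
\thost_name\tPG938
\tservice_description\tCPU_Performance_Stats
\tnotifications_enabled\t0
\t}

define service {
\thost_name\tPG938
\tservice_description\tCPU_Total_Linux
\tmax_check_attempts\t3
\tnotifications_enabled\t1
\t}
"""

for svc in services_for_host(SAMPLE, "PG938"):
    print(svc["service_description"], "notifications_enabled =", svc["notifications_enabled"])
```

Because objects.cache holds the definitions Nagios actually loaded, this shows the effective settings after templates are applied rather than what any single config file says.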

If configured correctly, Nagios should have generated an alert and/or notification for the service in question. Please review the settings or send me your system profile and I can take a look.

Lastly, we did have some notification bugs in Nagios Core 4.4.2, so there may be more than one issue here. What version are you running? If possible, I would recommend upgrading to 5.5.10. Before upgrading, we always recommend making a backup or taking a VM snapshot.

Backing Up and Restoring Nagios XI

Posted: **Thu Feb 28, 2019 5:36 am**

Hi Benjamin Smith,

Thanks for your update. I have gone through the objects.cache file from the Nagios server, and the configuration for PG938 seems correct. However, no CRITICAL/WARNING alarm was sent from Nagios on the day of the incident, so this needs further expert review.

As I am not yet authorized to send PMs, I have sent the objects.cache file and Nagios profile through my colleague farid rossle (who is on the same team as me). I hope you have received them for further review. Please update me. Thanks.

Posted: **Thu Feb 28, 2019 12:40 pm**

Hi Rohan,

Looking at the objects.cache data and the CPU checks for PG938, one of those checks has notifications enabled and the other does not (notifications_enabled 0). I've copied the definitions below. You'll want to go to Configure > CCM > Monitoring > Services > Alert Settings and enable notifications on that service, and then Apply Configuration.

define service {
	host_name	PG938
	service_description	CPU_Performance_Stats
	check_period	24x7
	check_command	check_nrpe!check_cpu_stats!-a '-w 90,90,90 -c 100,100,100'!!!!!!
	contacts	linuxserver_host_services_contact_msend,linuxserver_host_services_contact
	notification_period	24x7
	initial_state	o
	importance	0
	check_interval	5.000000
	retry_interval	5.000000
	max_check_attempts	2
	is_volatile	0
	parallelize_check	1
	active_checks_enabled	1
	passive_checks_enabled	0
	obsess	1
	event_handler_enabled	1
	low_flap_threshold	0.000000
	high_flap_threshold	0.000000
	flap_detection_enabled	1
	flap_detection_options	a
	freshness_threshold	0
	check_freshness	0
	notification_options	a
	notifications_enabled	0
	notification_interval	4320.000000
	first_notification_delay	5.000000
	stalking_options	n
	process_perf_data	1
	retain_status_information	1
	retain_nonstatus_information	1
	}


define service {
	host_name	PG938
	service_description	CPU_Total_Linux
	check_period	24x7
	check_command	scb_cpu_total_linux!-a '-c 90'!!!!!!!
	contacts	linuxserver_host_services_contact_msend,linuxserver_host_services_contact
	notification_period	24x7
	initial_state	o
	importance	0
	check_interval	5.000000
	retry_interval	5.000000
	max_check_attempts	3
	is_volatile	0
	parallelize_check	1
	active_checks_enabled	1
	passive_checks_enabled	0
	obsess	1
	event_handler_enabled	1
	low_flap_threshold	0.000000
	high_flap_threshold	0.000000
	flap_detection_enabled	1
	flap_detection_options	a
	freshness_threshold	0
	check_freshness	0
	notification_options	r,w,c
	notifications_enabled	1
	notification_interval	5256000.000000
	first_notification_delay	5.000000
	stalking_options	n
	process_perf_data	1
	retain_status_information	1
	retain_nonstatus_information	1
	}

The State History and Notification Reports are helpful in determining what notifications have been sent and when a host or service entered a hard down or critical state.

Posted: **Fri Mar 01, 2019 12:51 am**

Hi Benjamin Smith,

The service "CPU_Performance_Stats" only collects the performance/utilization report for PG938. Notifications for it are disabled because this service is not required to send notifications, only to collect performance statistics for PG938.

The CPU monitoring threshold is configured in the service "CPU_Total_Linux", with the CRITICAL threshold set to 90%, and notifications for it are already enabled. Can you help me understand why a notification from "CPU_Total_Linux" was not triggered? Do let me know if you need further information. Thanks.

Posted: **Fri Mar 01, 2019 12:45 pm**

Hi @Rohan77,

Looking at this further, I'm not seeing an alert generated in the Nagios log for CPU_Total_Linux, only an alert generated for system load:

[1551322700] SERVICE ALERT: PG938;Load;CRITICAL;HARD;2;CRITICAL - load average: 23.48, 12.56, 10.39
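For anyone scanning nagios.log for entries like the one above, here is a minimal Python sketch that splits a SERVICE ALERT line into its fields. The function name and dictionary keys are my own labels; the field order (host, service, state, state type, attempt, plugin output) is taken from the line shown.

```python
from datetime import datetime, timezone


def parse_service_alert(line):
    """Split a 'SERVICE ALERT' log line into named fields, or return None if it does not match."""
    stamp, _, rest = line.partition("] ")
    kind, sep, payload = rest.partition(": ")
    if not stamp.startswith("[") or kind != "SERVICE ALERT" or not sep:
        return None
    parts = payload.split(";", 5)  # plugin output may itself contain ';'
    if len(parts) != 6:
        return None
    host, service, state, state_type, attempt, output = parts
    return {
        # The bracketed number is a Unix epoch timestamp.
        "time": datetime.fromtimestamp(int(stamp[1:]), tz=timezone.utc),
        "host": host,
        "service": service,
        "state": state,
        "state_type": state_type,
        "attempt": int(attempt),
        "output": output,
    }


LINE = "[1551322700] SERVICE ALERT: PG938;Load;CRITICAL;HARD;2;CRITICAL - load average: 23.48, 12.56, 10.39"
alert = parse_service_alert(LINE)
print(alert["time"].isoformat(), alert["service"], alert["state"], alert["state_type"])
```

Filtering the parsed entries on `host == "PG938"` and `service == "CPU_Total_Linux"` would show whether that service ever logged a SOFT or HARD state change in the period.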

Can you post the state history report (select type: services, state types: both) and the notification report for the time period in question for this host?

Thank you.

Posted: **Mon Mar 04, 2019 6:44 am**

Hi Benjamin,

I have attached the requested state history and notification files for host PG938 in a private message for your review. Thanks.

Posted: **Mon Mar 04, 2019 1:03 pm**

Hi @Rohan77,

Thank you for sending the state history report. There are two related issues here.

1. CPU_Total_Linux did not send a notification because it did not enter a hard warning or critical state. To test notifications, you can either adjust the check command parameters to force it into a critical state or send passive checks until the state type changes to hard critical. To send passive checks, go to Home > Service Status > CPU_Total_Linux > Advanced > Submit Passive Check Result.

2. The other issue is that your plugin is sometimes timing out. Please follow the guide below to increase the timeout settings. You can test the check from the command line to make sure it's working properly.
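To make the soft/hard distinction in point 1 concrete, here is a deliberately simplified Python model. It assumes a problem becomes HARD only once `max_check_attempts` consecutive non-OK results have been seen, and that any OK result resets the counter; it ignores details such as soft recoveries and changes between non-OK states, so treat it as an illustration rather than the actual Nagios Core logic.

```python
def classify(results, max_check_attempts):
    """Label each check result as (state, state_type, attempt) under a simplified model."""
    labelled = []
    attempt = 0
    for state in results:
        if state == "OK":
            attempt = 0  # any OK result resets the retry counter
            labelled.append((state, "HARD", 1))
        else:
            attempt = min(attempt + 1, max_check_attempts)
            state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
            labelled.append((state, state_type, attempt))
    return labelled


def reaches_hard_problem(results, max_check_attempts):
    """True if any non-OK result in the sequence is classified as HARD."""
    return any(
        state != "OK" and state_type == "HARD"
        for state, state_type, _ in classify(results, max_check_attempts)
    )


# CPU_Total_Linux is defined with max_check_attempts 3.
flapping = ["CRITICAL", "CRITICAL", "OK", "CRITICAL", "CRITICAL", "OK"]
sustained = ["CRITICAL", "CRITICAL", "CRITICAL"]
print(reaches_hard_problem(flapping, 3))
print(reaches_hard_problem(sustained, 3))
```

Under this model, a service that keeps returning to OK before its third consecutive failure never reaches a hard problem state, and so never becomes eligible to notify.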

References
https://support.nagios.com/kb/article/n ... s-617.html
https://support.nagios.com/kb/article.php?id=167

Posted: **Tue Mar 05, 2019 7:15 am**

Hi Benjamin Smith,

Thanks for your review.

I have reviewed the CPU performance report for host PG938 and observed that CPU idle was below 10% throughout the day of the incident, 19th Feb 2019. This should have triggered an alert on that day. I have sent you the CPU performance report over PM for review. Regarding hard warning or critical states, may I confirm: will only events that reach a hard warning or critical state trigger an alert?

Lastly, for passive checks and the plugin timeout, I will confirm with the tooling team before making any global configuration changes in Nagios. Thanks.

Posted: **Tue Mar 05, 2019 12:52 pm**

Hi @Rohan77,

It would be helpful to see the complete state history report for PG938 for this time period, as the previous spreadsheet does not include all the logs for CPU_Total_Linux. Go to Report > State History, host = PG938, type = services, state type = both and state = any.

As for the performance graphs, there may not be a direct comparison between these two checks. The check command for CPU_Performance_Stats uses the iostat command to gather system information. CpuUser is the usage by applications, while CpuSystem is the usage by the Linux kernel. The idle statistic is the percentage of time the CPU was idle with no outstanding I/O request, and is what remains of 100% after adding up user, system, I/O wait, nice and steal.
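The relationship between idle and the other iostat percentages can be sketched as below. The function name and the sample figures are made up for illustration, and the threshold comparison is only an assumption about how a "total CPU" check might use the number; the actual scb_cpu_total_linux plugin may compute it differently.

```python
def idle_percent(user, nice, system, iowait, steal):
    """Idle is what remains of 100% after the other iostat CPU categories are summed."""
    busy = user + nice + system + iowait + steal
    return max(0.0, 100.0 - busy)


# Hypothetical sample: a heavily loaded CPU.
idle = idle_percent(user=70.0, nice=0.0, system=20.0, iowait=5.0, steal=0.0)
total_used = 100.0 - idle

# Assumed comparison against the 90% critical threshold from the service definition.
print(f"idle={idle:.1f}% used={total_used:.1f}% critical={total_used >= 90.0}")
```

Note that a check which excludes I/O wait from "used" would report a lower figure than 100 minus idle, which is one reason the graph and the alerting check may not line up.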

Duplicate definition found for service