Hello,
I am currently trying to set up external health checks for Nagios XI. Nagios XI has become an important component of our environment's monitoring stack, so we need to ensure that it is up and running correctly.
Most of the products we use expose an API endpoint that can be queried for health status, but I could not find a suitable endpoint in Nagios. Are there any recommendations for validating that Nagios is healthy? Ideally we would query its health over HTTP, but I am open to other suggestions as well.
Best regards,
Mark Jackson
Programmatically Checking Nagios Health
Re: Programmatically Checking Nagios Health
You could spin up another XI server with the free XI license, then use the Linux Server wizard and the Nagios XI wizard to monitor the host metrics and XI services.
Specifically through the API:
See these URLs on your XI server for working examples of the API.
You would call this API endpoint first to make sure is_currently_running is set, which means the API is queryable:
https://YOURXISERVER/nagiosxi/help/api-system-reference.php#system-status
Then check the daemons, etc. through this endpoint:
https://YOURXISERVER/nagiosxi/help/api-system-reference.php#system-status-detail
Re: Programmatically Checking Nagios Health
Hello ssax,
I had already found those endpoints while reading the API documentation; maybe checking whether 'is_currently_running' from system/status is not 1 is enough. I was hoping for an endpoint that returns a JSON response with the information normally shown in the System Component Status table (https://nagios.example/nagiosxi/admin/sysstat.php), with something like 'status: healthy/unhealthy' for each component.
System Component Status:
- Monitoring Engine
- Performance Grapher
- Database Maintenance
- Command Subsystem
- Event Manager
- Feed Processor
- Report Engine
- Cleaner
- Nonstop Operations Manager
- System Statistics
GET system/status returns an object that looks like this:
{
"instance_id": "1",
"instance_name": "localhost",
"status_update_time": "2015-09-21 01:48:14",
"program_start_time": "2015-09-20 12:21:20",
"program_run_time": "48419",
"program_end_time": "0000-00-00 00:00:00",
"is_currently_running": "1",
"process_id": "105075",
"daemon_mode": "1",
"last_command_check": "1969-12-31 18:00:00",
"last_log_rotation": "2015-09-21 00:00:00",
"notifications_enabled": "1",
"active_service_checks_enabled": "1",
"passive_service_checks_enabled": "1",
"active_host_checks_enabled": "1",
"passive_host_checks_enabled": "1",
"event_handlers_enabled": "1",
"flap_detection_enabled": "1",
"process_performance_data": "1",
"obsess_over_hosts": "0",
"obsess_over_services": "0",
"modified_host_attributes": "0",
"modified_service_attributes": "0",
"global_host_event_handler": "xi_host_event_handler",
"global_service_event_handler": "xi_service_event_handler"
}
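A minimal sketch of the `is_currently_running` check, using only the Python standard library. It assumes the REST API answers at `<base>/api/v1/system/status?apikey=...` (confirm the exact path in the API reference on your XI server) and that the field arrives as the string "1", as in the sample object. The base URL `https://nagios.example/nagiosxi` and `YOUR_API_KEY` are placeholders.

```python
import http.client
import json
import urllib.parse
import urllib.request


def status_url(base_url: str, apikey: str) -> str:
    """Build the system/status URL (assumed path: <base>/api/v1/system/status)."""
    query = urllib.parse.urlencode({"apikey": apikey})
    return f"{base_url.rstrip('/')}/api/v1/system/status?{query}"


def fetch_status(base_url: str, apikey: str, timeout: float = 10.0) -> dict:
    """GET system/status and decode the JSON body."""
    with urllib.request.urlopen(status_url(base_url, apikey), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def engine_is_running(status: dict) -> bool:
    """True only if is_currently_running is "1" (the API sends it as a string)."""
    return str(status.get("is_currently_running", "0")) == "1"


def main(base_url: str, apikey: str) -> int:
    """Print a one-line result and return a Nagios-style exit code (0 OK, 2 CRITICAL)."""
    try:
        status = fetch_status(base_url, apikey)
    except (OSError, ValueError, http.client.HTTPException) as exc:
        # Connection/HTTP errors, an invalid URL, or a non-JSON body all end up here.
        print(f"CRITICAL: could not query system/status: {exc}")
        return 2
    if engine_is_running(status):
        print("OK: monitoring engine is running")
        return 0
    print("CRITICAL: is_currently_running is not 1")
    return 2
```

From an external host you could run `sys.exit(main("https://nagios.example/nagiosxi", "YOUR_API_KEY"))`; the 0/2 return values follow the usual Nagios plugin convention for OK/CRITICAL, so the script can also be wrapped as a check on another monitoring server.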
GET system/statusdetail might provide some of the information I mentioned above, but most of its objects only return "last_check": "<time stamp>". Do you know what these objects return if a component fails to check in, or what time format they use? I could potentially check whether the last check-in was more than some amount of time ago.
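One way to do the "older than some amount of time" check, sketched in Python. It assumes the `last_check` values use the same `YYYY-MM-DD HH:MM:SS` format as the system/status sample above and are in the XI server's local time; neither is confirmed in this thread. A missing or unparseable value is treated as unhealthy. Note that the sample's `last_command_check` of `1969-12-31 18:00:00` looks like Unix time 0 rendered in a UTC-6 timezone, which suggests a component that never checked in might show an epoch timestamp rather than an empty value; the function flags that case as stale either way.

```python
from datetime import datetime, timedelta

# Assumed format, matching the timestamps in the system/status sample.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def is_stale(last_check: str, max_age: timedelta, now: datetime) -> bool:
    """Return True if last_check is missing, unparseable, or older than max_age.

    Timestamps are treated as naive local times, so `now` must be in the
    same timezone as the XI server.
    """
    try:
        checked = datetime.strptime(last_check, TIMESTAMP_FORMAT)
    except (TypeError, ValueError):
        return True  # no usable timestamp (e.g. None or "0000-00-00 00:00:00")
    return now - checked > max_age
```

Typical use would be `is_stale(entry["last_check"], timedelta(minutes=10), datetime.now())`, where `entry` is a hypothetical object from system/statusdetail and the threshold is arbitrary; it should be longer than that component's normal run interval.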
For some background: I was looking at querying the API over HTTP because in the past we have monitored assets where Apache was up and the host was up, but the web page/app wasn't being served properly. We have also had instances where an application was up but specific components of it were broken. If I have to, I will probably just set up a basic HTTP check that requests the page and looks for some expected content. We already use Nagios to monitor itself for things like system metrics on the Nagios host, so the only thing I really need to confirm externally is that it is available. As long as it is available, Nagios itself will alert us about other issues such as high load.
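The basic HTTP content check could look like this in Python. The URL (for example `https://nagios.example/nagiosxi/login.php`) and the expected string (`"Nagios XI"`) are assumptions; pick a page and text that only render when the application is actually working, not just when Apache is answering. Any connection error, HTTP error status, or missing text counts as a failure.

```python
import http.client
import urllib.request


def content_ok(status_code: int, body: str, expected: str) -> bool:
    """Healthy only if the page returned HTTP 200 and contains the expected text."""
    return status_code == 200 and expected in body


def check_page(url: str, expected: str, timeout: float = 10.0) -> bool:
    """Fetch url and apply content_ok; any error while fetching counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return content_ok(resp.status, body, expected)
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError/HTTPError (4xx/5xx) and timeouts;
        # ValueError covers malformed URLs.
        return False
```

If you would rather stay inside Nagios, the stock `check_http` plugin's `-s` option performs the same expected-string match.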
Thanks,
Mark Jackson
Re: Programmatically Checking Nagios Health
Since the sysstat data is gathered by a cron job, it can be out of date, so is_currently_running is the field you want to check.
There isn't a single check that covers the total health of the system. You could create your own custom API endpoint, but none currently exists that checks everything at once. system/status and system/statusdetail are currently the only API endpoints that contain this data.
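If you do build your own aggregate check, here is a minimal Python sketch of the per-component "healthy/unhealthy" summary asked about earlier in the thread. `status` is the decoded system/status object; `last_checks` is a hypothetical mapping from component name to `last_check` timestamp that you would have to build from system/statusdetail yourself, since its exact structure is not shown here. The timestamp format and the staleness threshold are assumptions.

```python
from datetime import datetime, timedelta

# Assumed timestamp format, matching the system/status sample in this thread.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def component_health(status: dict, last_checks: dict,
                     max_age: timedelta, now: datetime) -> dict:
    """Summarise XI health as {"status": ..., "components": {name: state}}.

    status      -- decoded system/status object
    last_checks -- hypothetical {component name: last_check string} mapping
                   built by hand from system/statusdetail
    """
    components = {}
    for name, stamp in last_checks.items():
        try:
            age = now - datetime.strptime(stamp, TIMESTAMP_FORMAT)
            healthy = age <= max_age
        except (TypeError, ValueError):
            healthy = False  # missing or unparseable timestamp
        components[name] = "healthy" if healthy else "unhealthy"

    # The engine must report is_currently_running == "1" and, if it also has
    # a timestamp in last_checks, that timestamp must be fresh.
    running = str(status.get("is_currently_running", "0")) == "1"
    engine_fresh = components.get("Monitoring Engine", "healthy") == "healthy"
    components["Monitoring Engine"] = "healthy" if running and engine_fresh else "unhealthy"

    overall = all(state == "healthy" for state in components.values())
    return {"status": "healthy" if overall else "unhealthy", "components": components}
```

Passing the result to `json.dumps(..., indent=2)` gives a JSON document with one state per component, which an external HTTP check could then evaluate.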
You could also check the daemons and jobs the way the XI wizard does; the wizard sets them up as individual services that use specific plugins. Either way, you should still check is_currently_running first, as that is the best way to make sure XI is up and ready to be queried:
/usr/bin/php /usr/local/nagios/libexec/check_nagiosxiserver.php --address=X.X.X.X --url='https://X.X.X.X/nagiosxi/' --apikey='XXXXXXXXXXX' --mode=daemons
/usr/bin/php /usr/local/nagios/libexec/check_nagiosxiserver.php --address=X.X.X.X --url='https://X.X.X.X/nagiosxi/' --apikey='XXXXXXXXXXX' --mode=jobs
Re: Programmatically Checking Nagios Health
OK, thank you for the feedback. I will give your suggestions a try.
Re: Programmatically Checking Nagios Health
Great, let us know if you have further questions or when we're okay to lock this up and mark it as resolved.
Thank you!
Re: Programmatically Checking Nagios Health
Hey ssax, you can mark this as resolved when you are ready. I don't think any more feedback is required. Thanks again.