Hi,
We've been running NCPA on Solaris for about a week now. It's been going quite well, but today at 9:48 a bunch of different servers all started failing the service checks. We run it in passive mode, but I enabled the listener to have a look at the API and try to figure out what was happening. When I look at the services endpoint, this is all I see:
{
"services": {
"9:49:24": "running",
"9:49:23": "running",
"9:51:37": "running",
"9:49:04": "running",
"9:49:16": "running",
"9:49:17": "running",
"9:49:14": "running",
"9:49:15": "running",
"9:49:12": "running",
"9:49:13": "running",
"9:49:10": "running",
"9:49:11": "running",
"9:52:41": "running",
"9:49:18": "running",
"9:49:19": "running",
"site|ncpa_passive:default": "running",
"9:49:31": "running",
"9:49:32": "running",
"site|ncpa_listener:default": "stopped",
"9:50:05": "running",
"network|ssh:default": "running",
"9:49:29": "running",
"9:49:01": "running",
"9:49:00": "running",
"9:49:03": "running",
"9:49:02": "running",
"9:49:05": "running",
"9:49:22": "running",
"9:49:21": "running",
"9:49:06": "running",
"9:49:59": "running",
"9:49:26": "running",
"9:48:59": "running",
"9:48:58": "running",
"9:49:25": "running"
}
}
It seems to be showing what it knew at the time of the failure, but it isn't really updating and isn't showing process names.
Any tips to get started with troubleshooting?
Lyndon
NCPA on Solaris Service Issue
Re: NCPA on Solaris Service Issue
A reboot does solve the issue, but a service restart doesn't help. I have one node repaired, and another still in this error state.
Re: NCPA on Solaris Service Issue
Hi! Welcome to the forum!
Can you crank up the log level to debug, restart the ncpa_listener service, and then hit up Services in the API? I'd like to see if it logs anything.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: NCPA on Solaris Service Issue
Done!
Here's the output of services this morning. A few more have crept in, but not all...
{
"services": {
"9:49:24": "running",
"9:49:23": "running",
"9:51:37": "running",
"9:49:04": "running",
"9:49:16": "running",
"9:49:17": "running",
"9:49:14": "running",
"9:49:15": "running",
"9:49:12": "running",
"9:49:13": "running",
"9:49:10": "running",
"9:49:11": "running",
"9:52:41": "running",
"9:49:18": "running",
"9:49:19": "running",
"site|ncpa_passive:default": "running",
"9:49:31": "running",
"9:49:32": "running",
"system|manifest-import:default": "running",
"system|console-login:default": "running",
"site|ncpa_listener:default": "running",
"system|vxpbx:default": "running",
"network|ldap|client:default": "running",
"9:50:05": "running",
"network|ssh:default": "running",
"9:49:29": "running",
"9:49:01": "running",
"9:49:00": "running",
"9:49:03": "running",
"9:49:02": "running",
"9:49:05": "running",
"9:49:22": "running",
"9:49:21": "running",
"9:49:06": "running",
"9:49:59": "running",
"9:49:26": "running",
"9:48:59": "running",
"9:48:58": "running",
"9:49:25": "running"
}
}
Here are the debug logs; it doesn't look like any errors are reported. Running the standard svcs utility shows all of the services, so I'm not sure what NCPA is hooking into.
--
2020-01-24 08:19:33,548 12260 DEBUG Initializing WebSocket
2020-01-24 08:19:33,548 12260 DEBUG Validating WebSocket request
2020-01-24 08:19:33,814 12260 INFO ::ffff:<redacted> - - [2020-01-24 08:19:33] "GET /api/services HTTP/1.1" 200 1397 0.266562
Re: NCPA on Solaris Service Issue
Looks like NCPA also uses svcs, specifically svcs -a. There may be something in the way we filter the services, though. Could you run svcs -a on your system, copy the output to a text file, and PM it to me? If you could also capture NCPA's services list at roughly the same time and send that along, that would be helpful. Both should list the services in the same order. Maybe I can find something by comparing the code, the svcs -a output, and NCPA's output.
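For anyone wanting to do the same comparison themselves, here is a rough Python sketch. It is not part of NCPA: the function names are made up for illustration, it assumes the FMRI is the last whitespace-separated field on each svcs line, and it assumes NCPA names a service by dropping the svc:/ prefix and replacing / with | (which matches the names seen in the output earlier in this thread).

```python
import json


def fmri_names(svcs_text):
    """Collect NCPA-style names from saved svcs -a output (assumed layout)."""
    names = set()
    for line in svcs_text.splitlines():
        fields = line.split()
        # Assumption: the FMRI is the last field; the header line has no ":/".
        if fields and ":/" in fields[-1]:
            names.add(fields[-1].replace("svc:/", "", 1).replace("/", "|"))
    return names


def compare(svcs_text, ncpa_json):
    """Return (names missing from NCPA, NCPA keys that match no FMRI)."""
    expected = fmri_names(svcs_text)
    reported = set(json.loads(ncpa_json)["services"])
    return sorted(expected - reported), sorted(reported - expected)
```

Feeding it the saved svcs -a text and the JSON from the services endpoint gives two lists: services NCPA never reported, and keys NCPA reported that do not correspond to any FMRI.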
Re: NCPA on Solaris Service Issue
Hi - I PM'd last week, any updates?
Lyndon
Re: NCPA on Solaris Service Issue
Yes, my apologies that this took so long. I think I've found what's happening. Check out this chunk of output from your text file,
It's hard to see outside of an editor, but all of the services that show up have the date and time butted up against one another (date,time). The services that aren't showing up have a space after the comma (date, time) because they started earlier in the morning, when the hour has only one digit. I can see in the code where this is confusing NCPA.
Do you have any Solaris machines where NCPA is still working properly? If so, could you run svcs -a against one of those machines to see if there are any services that started in the morning? I'm wondering if there is some setting that tells Solaris to output svcs with leading zeroes in the time.
According to the OpenSolaris man page for svcs, STIME should only display the time if the service started within the last 24 hours, and it should have an underscore in place of blanks. So _9:50:05 instead of 09:50:05. But that's OpenSolaris. I'm not sure how well that translates to actual Solaris.
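To make the failure mode concrete, here is a minimal Python sketch. It is not NCPA's actual code: the two sample lines are invented to mimic the symptom (DATE is a placeholder, not real svcs output), and parse_naive and parse_robust are hypothetical names. The assumption is that the parser splits each line on whitespace and takes the third field as the service name.

```python
def fmri_to_name(fmri):
    """Turn svc:/network/ssh:default into network|ssh:default."""
    return fmri.replace("svc:/", "", 1).replace("/", "|")


def parse_naive(line):
    """Assume exactly three fields: STATE, STIME, FMRI."""
    fields = line.split()
    return fmri_to_name(fields[2]), fields[0]


def parse_robust(line):
    """Take the first field as the state and the last field as the FMRI."""
    fields = line.split()
    return fmri_to_name(fields[-1]), fields[0]


# Made-up lines: the second has a space inside STIME, as described above.
SAMPLE_LINES = [
    "online DATE,10:15:02 svc:/network/ssh:default",
    "online DATE, 9:49:24 svc:/site/ncpa_passive:default",
]

if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(parse_naive(line), parse_robust(line))
```

With the extra space, the naive version picks up 9:49:24 as the service name, which is the kind of key seen in the API output earlier in the thread. Indexing from the end of the line sidesteps the problem as long as the FMRI itself contains no whitespace.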
Re: NCPA on Solaris Service Issue
No worries on the delay, I appreciate the effort!
It looks like everything is showing fine at the moment, so tough to troubleshoot just now, but I believe you're on the money.
It looks like Solaris 11.3 only showed the date for services that had been running for longer than a day. Solaris 11.4 now shows the full date and time, and when a service started in the morning it looks like an extra space character is thrown in.
I assume NCPA is filtering by field at the moment (grab everything in f3, cut by space).
Is it possible to change how we're pulling? If NCPA is only utilizing column 1 & 3, a command could look like:
Code:
~# svcs -a -o STATE,FMRI | grep ncpa
disabled svc:/site/ncpa_listener:default
online svc:/site/ncpa_passive:default
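A rough Python sketch of what parsing that two-column output could look like follows. This is an illustration only, not NCPA's implementation: parse_state_fmri and list_services are hypothetical names, and the state mapping (online to running, disabled to stopped) is an assumption based on the values seen in this thread. It uses subprocess.run with capture_output, so it needs Python 3.7 or later.

```python
import subprocess

# Assumed mapping from SMF states to the values NCPA reports.
STATE_MAP = {"online": "running", "disabled": "stopped"}


def parse_state_fmri(text):
    """Parse 'STATE FMRI' lines into a {name: status} dict."""
    services = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything unexpected.
        if len(fields) != 2 or fields[0] == "STATE":
            continue
        state, fmri = fields
        name = fmri.replace("svc:/", "", 1).replace("/", "|")
        # Unmapped states are passed through unchanged.
        services[name] = STATE_MAP.get(state, state)
    return services


def list_services():
    """Run svcs with only the STATE and FMRI columns (Solaris only)."""
    result = subprocess.run(
        ["svcs", "-a", "-o", "STATE,FMRI"],
        capture_output=True, text=True, check=True,
    )
    return parse_state_fmri(result.stdout)
```

Because STIME is never requested, the extra space in the time column cannot shift the fields. The trade-off is that start times are no longer available to anything that relies on them.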
Re: NCPA on Solaris Service Issue
Oh, could you please also remove the services with our internal zone names from the list in your earlier post?
Re: NCPA on Solaris Service Issue
Whoops! My apologies, I did not realize there was sensitive data there. It is removed.
I would be hesitant to just remove the STIME reporting entirely, as I'm sure someone probably uses it to monitor the uptime of a service. However, we can probably add logic to handle AM start times. I will talk with the devs and see if this qualifies as a bug, otherwise it will have to be a feature request.
If you're handy with Python, you can see what's happening here, starting at line 275.
https://github.com/NagiosEnterprises/nc ... ervices.py