NCPA on Solaris Service Issue

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
vtlyndon
Posts: 7
Joined: Mon Nov 05, 2018 3:07 pm

NCPA on Solaris Service Issue

Post by vtlyndon »

Hi,

We've been running NCPA on Solaris for about a week now. It's been going quite well, but today at 9:48 a bunch of different servers all started failing the service checks. We run it in passive mode, but I enabled the listener to have a look at the API and try to figure out what was happening. When I look at the services endpoint, this is all I see:

{
"services": {
"9:49:24": "running",
"9:49:23": "running",
"9:51:37": "running",
"9:49:04": "running",
"9:49:16": "running",
"9:49:17": "running",
"9:49:14": "running",
"9:49:15": "running",
"9:49:12": "running",
"9:49:13": "running",
"9:49:10": "running",
"9:49:11": "running",
"9:52:41": "running",
"9:49:18": "running",
"9:49:19": "running",
"site|ncpa_passive:default": "running",
"9:49:31": "running",
"9:49:32": "running",
"site|ncpa_listener:default": "stopped",
"9:50:05": "running",
"network|ssh:default": "running",
"9:49:29": "running",
"9:49:01": "running",
"9:49:00": "running",
"9:49:03": "running",
"9:49:02": "running",
"9:49:05": "running",
"9:49:22": "running",
"9:49:21": "running",
"9:49:06": "running",
"9:49:59": "running",
"9:49:26": "running",
"9:48:59": "running",
"9:48:58": "running",
"9:49:25": "running"
}
}

It seems to be showing what it knew at the time of the failure, but it isn't really updating and isn't showing process names.

Any tips to get started with troubleshooting?
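To make the symptom easier to spot across several hosts, here is a minimal sketch that separates keys that look like clock times from keys that look like real service names. It works on an already-parsed services response; the function name `split_service_keys` and the regular expression are illustrative assumptions, not part of NCPA, and the sample is a trimmed copy of the output above.

```python
import re

# Assumption: a bogus key is a bare clock time such as "9:49:24".
TIME_KEY = re.compile(r"^\d{1,2}:\d{2}:\d{2}$")


def split_service_keys(payload):
    """Return (real, bogus) key lists from a parsed services response."""
    services = payload.get("services", {})
    real = sorted(k for k in services if not TIME_KEY.match(k))
    bogus = sorted(k for k in services if TIME_KEY.match(k))
    return real, bogus


# Trimmed copy of the response shown in this post.
sample = {
    "services": {
        "9:49:24": "running",
        "site|ncpa_passive:default": "running",
        "site|ncpa_listener:default": "stopped",
        "network|ssh:default": "running",
        "9:48:58": "running",
    }
}

real, bogus = split_service_keys(sample)
print("real service names:", real)
print("time-like keys:", bogus)
```

Any host that reports time-like keys is probably in the same error state.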

Lyndon
vtlyndon
Posts: 7
Joined: Mon Nov 05, 2018 3:07 pm

Re: NCPA on Solaris Service Issue

Post by vtlyndon »

A reboot does solve the issue, but a service restart doesn't help. I have one node repaired, and another still in this error state.
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: NCPA on Solaris Service Issue

Post by mbellerue »

Hi! Welcome to the forum!

Can you crank up the log level to debug, restart the ncpa_listener service, and then hit up Services in the API? I'd like to see if it logs anything.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Be sure to check out our Knowledgebase for helpful articles and solutions!
vtlyndon
Posts: 7
Joined: Mon Nov 05, 2018 3:07 pm

Re: NCPA on Solaris Service Issue

Post by vtlyndon »

Done!

Here's the output of services this morning. A few more have crept in, but not all...
{
"services": {
"9:49:24": "running",
"9:49:23": "running",
"9:51:37": "running",
"9:49:04": "running",
"9:49:16": "running",
"9:49:17": "running",
"9:49:14": "running",
"9:49:15": "running",
"9:49:12": "running",
"9:49:13": "running",
"9:49:10": "running",
"9:49:11": "running",
"9:52:41": "running",
"9:49:18": "running",
"9:49:19": "running",
"site|ncpa_passive:default": "running",
"9:49:31": "running",
"9:49:32": "running",
"system|manifest-import:default": "running",
"system|console-login:default": "running",
"site|ncpa_listener:default": "running",
"system|vxpbx:default": "running",
"network|ldap|client:default": "running",
"9:50:05": "running",
"network|ssh:default": "running",
"9:49:29": "running",
"9:49:01": "running",
"9:49:00": "running",
"9:49:03": "running",
"9:49:02": "running",
"9:49:05": "running",
"9:49:22": "running",
"9:49:21": "running",
"9:49:06": "running",
"9:49:59": "running",
"9:49:26": "running",
"9:48:59": "running",
"9:48:58": "running",
"9:49:25": "running"
}
}

Here are the debug logs; it doesn't look like any errors are presented. Running the standard svcs utility shows all of the services, but I'm not sure what NCPA is hooking into.
--
2020-01-24 08:19:33,548 12260 DEBUG Initializing WebSocket
2020-01-24 08:19:33,548 12260 DEBUG Validating WebSocket request
2020-01-24 08:19:33,814 12260 INFO ::ffff:<redacted> - - [2020-01-24 08:19:33] "GET /api/services HTTP/1.1" 200 1397 0.266562
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: NCPA on Solaris Service Issue

Post by mbellerue »

Looks like NCPA also uses svcs, specifically svcs -a. There may be something in the way we filter the services, though. Could you run svcs -a on your system, copy the output to a text file, and PM it to me? Also, at roughly the same time, if you could get the output of NCPA's services list and send that along, that would be helpful. They should be listing the services in the same order. Maybe I can find something between the code, the svcs -a output, and NCPA's output combined.
vtlyndon
Posts: 7
Joined: Mon Nov 05, 2018 3:07 pm

Re: NCPA on Solaris Service Issue

Post by vtlyndon »

Hi - I PM'd last week, any updates?

Lyndon
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: NCPA on Solaris Service Issue

Post by mbellerue »

Yes, my apologies that this took so long. I think I've found what's happening. Check out this chunk of output from your text file,


It's hard to see outside of an editor, but all of the services that show up have the date and time butted up against one another (date,time). The services that aren't showing up have date, time (that is, date, comma, space, time) because they started earlier in the morning. I can see in the code where this is confusing NCPA.

Do you have any Solaris machines where NCPA is still working properly? If so, could you run svcs -a against one of those machines to see if there are any services that started in the morning? I'm wondering if there is some setting that tells Solaris to output svcs with leading zeroes in the time.

According to the OpenSolaris man page for svcs, STIME should only display the time if the service started within the last 24 hours, and it should have an underscore in place of blanks. So _9:50:05 instead of 09:50:05. But that's OpenSolaris. I'm not sure how well that translates to actual Solaris.
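As a rough illustration of why an extra space matters, the sketch below compares picking the FMRI by a fixed field index with picking the last whitespace-separated field. It is not NCPA's actual code; both function names are hypothetical, and the two input lines use `<date>` and `<time>` placeholders because the real svcs -a output is not reproduced here.

```python
def parse_fixed_index(line):
    """Assume exactly three fields: STATE, STIME, FMRI."""
    fields = line.split()
    return fields[0], fields[2]


def parse_last_field(line):
    """Take STATE from the first field and the FMRI from the last one."""
    fields = line.split()
    return fields[0], fields[-1]


# Hypothetical lines: placeholders stand in for the real STIME text.
joined = "online <date>,<time> svc:/network/ssh:default"
spaced = "online <date>, 9:49:24 svc:/network/ssh:default"

for line in (joined, spaced):
    print(parse_fixed_index(line), parse_last_field(line))
```

With the spaced line, the fixed index returns the time where the service name should be, which matches the time-like keys in the API output. Taking the last field relies on the FMRI containing no whitespace.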
vtlyndon
Posts: 7
Joined: Mon Nov 05, 2018 3:07 pm

Re: NCPA on Solaris Service Issue

Post by vtlyndon »

No worries on the delay, I appreciate the effort!

It looks like everything is showing fine at the moment, so tough to troubleshoot just now, but I believe you're on the money.

It looks like Solaris 11.3 only showed the date on services that had been running for longer than a day. With Solaris 11.4, dates are shown with the full day and time, and when a service started in the morning it looks like an extra space character is thrown in.

I assume NCPA is filtering by field at the moment (grabbing the third space-delimited field).

Is it possible to change how we're pulling? If NCPA is only utilizing columns 1 and 3, a command could look like:

~# svcs -a -o STATE,FMRI | grep ncpa
disabled       svc:/site/ncpa_listener:default
online         svc:/site/ncpa_passive:default
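
A minimal sketch of how that two-column output could be turned into the same shape the API returns. This is an illustration, not NCPA's implementation: the function name `parse_state_fmri` is hypothetical, and mapping every state other than "online" to "stopped" is an assumption. The name format (strip `svc:/`, replace `/` with `|`) is inferred from the API output earlier in the thread.

```python
def parse_state_fmri(output):
    """Map `svcs -a -o STATE,FMRI` text to {service_name: status}."""
    services = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and anything that is not an svc:/ FMRI.
        if len(fields) != 2 or not fields[1].startswith("svc:/"):
            continue
        state, fmri = fields
        name = fmri[len("svc:/"):].replace("/", "|")
        # Assumption: only "online" counts as running.
        services[name] = "running" if state == "online" else "stopped"
    return services


# The two lines from the command output above, plus a header line.
sample = """\
STATE          FMRI
disabled       svc:/site/ncpa_listener:default
online         svc:/site/ncpa_passive:default
"""

print(parse_state_fmri(sample))
```

Since STIME is never requested, the start-time format cannot shift the columns.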
vtlyndon
Posts: 7
Joined: Mon Nov 05, 2018 3:07 pm

Re: NCPA on Solaris Service Issue

Post by vtlyndon »

Oh, could you please also remove the services with our internal zone names from the list in your earlier post?
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: NCPA on Solaris Service Issue

Post by mbellerue »

Whoops! My apologies, I did not realize there was sensitive data there. It is removed.

I would be hesitant to just remove the STIME reporting entirely, as I'm sure someone probably uses it to monitor the uptime of a service. However, we can probably add logic to handle AM start times. I will talk with the devs and see if this qualifies as a bug, otherwise it will have to be a feature request.

If you're handy with Python, you can see what's happening here, starting at line 275.
https://github.com/NagiosEnterprises/nc ... ervices.py