what does this mean?

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
eloyd
Cool Title Here
Posts: 2190
Joined: Thu Sep 27, 2012 9:14 am
Location: Rochester, NY

Re: what does this mean?

Post by eloyd »

Am I missing something? Doesn't NLS have alerting built in?
alerts.png
This is a real question, not me being snarky: How hard would it be to trigger an alert if the logstash/elasticsearch checks produced negative results?

This topic is actually a core part of my 2015 Nagios World Conference presentation..... :-)
Eric Loyd • http://everwatch.global • 844.240.EVER • @EricLoyd
I'm a Nagios Fanatic! • Join our public Nagios Discord Server!
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Post by tmcdonald »

Well, NLS does have alerting built-in but it alerts on log messages, not check results. In a really perverse way you could have a plugin run by cron, and have it log a message on failure then alert on that.
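
A minimal sketch of that cron-driven workaround, in Python: it runs a status command and writes a log line on failure, which an NLS alert could then match. The `service logstash status` command is the one mentioned elsewhere in this thread; the function names, the message text, and the use of local syslog as the route into NLS are all assumptions for illustration, not anything NLS ships.

```python
import subprocess
import syslog

# Hypothetical message text; an NLS alert query would have to match it.
FAILURE_MESSAGE = "logstash is not running!"


def service_is_running(command=("service", "logstash", "status"), run=subprocess.run):
    """Return True if the status command exits with code 0."""
    result = run(list(command), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def check_and_log(is_running=service_is_running, log=syslog.syslog):
    """Log a failure message when the service is down; return the service state."""
    running = is_running()
    if not running:
        # Assumes local syslog is forwarded to NLS so an alert can match this line.
        log(syslog.LOG_ERR, FAILURE_MESSAGE)
    return running
```

A cron entry would call `check_and_log()` every few minutes. Note that this only helps while the logging pipeline itself is still accepting messages.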
Former Nagios employee

Post by eloyd »

See, here's the thing...

Of course I can do it many other ways. Nagios, cron, all sorts of stuff. But I just figure, since NLS has built-in alerting capabilities, why not allow the two system checks to generate alerts just like anything else within NLS can? I mean, I'm assuming there's a piece of internal API that says "function sendAlert() {}" that's used by the threshold alerting system. Can't the same function be called by the system daemon check system?

Post by tmcdonald »

I think that would just be a matter of a feature request. The irony is that if you are using LS to check if ES is running (by checking logs and alerting if "ES is not running!" is found in them), then by definition, once ES goes down, you can't search for that message the way you can for other things :)

So this would need to be hard-coded, and it shouldn't be too terribly difficult. Shall I file a feature request for it?

Post by eloyd »

Wait. I'm mobile right now but we may be talking about different things. I'll be in an office in a couple hours and will write more.

Post by eloyd »

Okay, so. Here is what I thought we were talking about.

NLS does some sort of check using sudo to check the output of a "service logstash status" command. This is what it uses to make the red/green light on the dashboard. Why not make it so that if it fails, in addition to making the light red, it can also trigger an alert? Say, a built-in query (which you would have to program for us) called Logstash Failure or something like that. So if that failure condition arises, we could use the built-in alerting capabilities to alert us that it failed.

Does that make sense?

I mean, yes, I could use NRPE to make sure it's running and even restart it if it's not (which is what we do), but that's not the point. :-)
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Post by jolson »

eloyd,

That definitely makes sense.

The following API calls are run to check on the status of the processes:

Code:

http://192.168.x.x/nagioslogserver/index.php/api/system/status?subsystem=elasticsearch
{"status":"running","pid":"23957","message":"Search engine (elasticsearch) is running."}

http://192.168.x.x/nagioslogserver/index.php/api/system/status?subsystem=logstash
{"status":"running","pid":"24026","message":"Log collector (logstash) is running."}
Those API calls return a "stopped" status if logstash/elasticsearch is down:

Code:

http://192.168.x.x/nagioslogserver/index.php/api/system/status?subsystem=elasticsearch
{"status":"stopped","message":"Search engine (elasticsearch) is stopped."}

http://192.168.x.x/nagioslogserver/index.php/api/system/status?subsystem=logstash
{"status":"stopped","message":"Log collector (logstash) is stopped."}
The above is secured using an authorization token supplied by the user logged into the system.
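
For illustration, here is a hedged Python sketch of how an external script might interpret these responses. It only parses the JSON bodies shown above. How the authorization token is passed in the request is not shown in this thread, so fetching is left out. The function name `subsystem_is_down` is hypothetical, and treating an unparseable body as "down" is an assumption.

```python
import json


def subsystem_is_down(response_body):
    """Return True unless the status API response reports "running"."""
    try:
        payload = json.loads(response_body)
    except ValueError:
        # Assumption: a body that is not valid JSON counts as a failure.
        return True
    if not isinstance(payload, dict):
        return True
    return payload.get("status") != "running"


running = '{"status":"running","pid":"23957","message":"Search engine (elasticsearch) is running."}'
stopped = '{"status":"stopped","message":"Log collector (logstash) is stopped."}'

print(subsystem_is_down(running))  # False
print(subsystem_is_down(stopped))  # True
```

Anything that returns True here is the condition an alert would hang off.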

Using the above as a reference, I understand what you mean - when a service is detected as down, why can't we send alerts based on that behavior?

The answer is "We likely can, but what would we do, exactly?" I suppose that's the discussion you're trying to open here. What do you think?
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.

Post by eloyd »

Thanks for that info. That makes life a lot easier, actually.

I guess the question is, if NLS is being sold as a standalone product (and I believe it is positioned as such, currently) then it should be able to do something when it detects that it has failed. I mean, if it has detected a failure state, it should be able to do something with that information (and it currently does - it changes the dashboard green/red light).

If NLS is being used by a Nagios suite customer, then it's a non-issue. So my question to Nagios Enterprises is, how much do you want to present NLS as a standalone product, capable of notifying people when it's broken, or even maybe proactively correcting itself when it is?

Matters not to me, but we'll need to know what to tell our potential customers when we sell them NLS in the future. ;-)

Post by tmcdonald »

Wouldn't be hard at all, really. Should definitely be configurable whether it alerts/self-fixes though.

Post by eloyd »

Of course, if it self-fixes, then you need to teach it how many times to try before it gives up, or whether it should keep trying to self-fix forever. Like, maybe it's out of disk space, and that's why it's failing. :-)
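
That retry limit could look something like the following Python sketch. The function and parameter names are hypothetical, and the health check and restart action are passed in as callables rather than tied to any real NLS mechanism.

```python
import time


def attempt_recovery(is_healthy, restart, max_attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Try to restart a failed service a bounded number of times.

    Returns the number of restarts performed if the service recovered,
    or None if it was still unhealthy after max_attempts restarts.
    """
    if is_healthy():
        return 0
    for attempt in range(1, max_attempts + 1):
        restart()
        sleep(delay_seconds)
        if is_healthy():
            return attempt
    # Give up: restarting will not help with causes such as a full disk,
    # so this is the point to alert a human instead.
    return None
```

A `None` result is the "stop trying and tell someone" case, which is why alerting and self-fixing would need to be configured together.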

Sounds like a job for Nagios!!