NagiosXI+Remote-Workers-(Distributed Monitoring)

maartin.pii · Post by **maartin.pii** » Tue Jan 17, 2017 2:47 pm

Upload worker config

EDIT: Removed the gearman file, as there may be sensitive information in it.

bheden · Post by **bheden** » Tue Jan 17, 2017 3:02 pm

After reviewing the configuration files you posted, a few things stick out:

You do not have all of your hostgroups listed in the NEB config in the worker configs. Some are missing.
Those will be orphaned.

You have some hostgroups specified in both the hostgroups directive and the localhostgroups. I've actually never seen someone do that, but I assume it could end up orphaned as well.

Remember that whatever hostgroup and servicegroup you define to split into a queue in the NEB config HAS TO HAVE A CORRESPONDING DEFINITION on one of the workers. (Or it will be orphaned).
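
A quick way to spot groups that are split into a queue in the NEB config but not listed on any worker is to compare the two lists. This is only a sketch under assumptions: the helper names `list_groups` and `unserved_groups` are made up for illustration, and it assumes the groups are declared as comma-separated `hostgroups=` / `servicegroups=` lines, possibly repeated across several lines.

```shell
#!/bin/sh
# list_groups DIRECTIVE FILE...
# Print every group named on "DIRECTIVE=a,b,c" lines, one per line, sorted.
list_groups() {
    directive=$1
    shift
    # Anchored at line start, so "hostgroups" does not match
    # "localhostgroups" and commented-out lines are skipped.
    sed -n "s/^[[:space:]]*${directive}[[:space:]]*=//p" "$@" |
        tr ',' '\n' |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
        grep -v '^$' |
        sort -u
}

# unserved_groups DIRECTIVE NEB_CONF WORKER_CONF...
# Print groups the NEB config lists that none of the worker configs list.
unserved_groups() {
    directive=$1
    neb_conf=$2
    shift 2
    neb_list=$(mktemp)
    worker_list=$(mktemp)
    list_groups "$directive" "$neb_conf" > "$neb_list"
    list_groups "$directive" "$@" > "$worker_list"
    # comm -23 keeps lines that appear only in the first (NEB) list.
    comm -23 "$neb_list" "$worker_list"
    rm -f "$neb_list" "$worker_list"
}
```

Called as `unserved_groups hostgroups neb.conf worker1.conf worker2.conf` (placeholder file names, copy the real configs to one machine first), it prints one group per line; an empty result means every listed group appears in at least one worker config. The same call with `servicegroups` covers the servicegroup queues.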

maartin.pii · Post by **maartin.pii** » Wed Jan 18, 2017 10:45 pm

bheden wrote:
> After reviewing the configuration files you posted, a few things stick out:
>
> You do not have all of your hostgroups listed in the NEB config in the worker configs. Some are missing.
> Those will be orphaned.
>
> You have some hostgroups specified in both the hostgroups directive and the localhostgroups. I've actually never seen someone do that, but I assume it could end up orphaned as well.
>
> Remember that whatever hostgroup and servicegroup you define to split into a queue in the NEB config HAS TO HAVE A CORRESPONDING DEFINITION on one of the workers. (Or it will be orphaned).

Hi @bheden, first of all, thanks for answering.

> You do not have all of your hostgroups listed in the NEB config in the worker configs. Some are missing.
> Those will be orphaned.

Yep, you are right. However, all of the hostgroups that are listed in the NEB config but not in the worker configs are hostgroups that are no longer in use; I only left them there for company legacy reasons. I can clear them from the NEB config, no problem.

> You have some hostgroups specified in both the hostgroups directive and the localhostgroups. I've actually never seen someone do that, but I assume it could end up orphaned as well.

Right, I have to correct that. However, the servers defined in those hostgroups are not the ones causing me problems. But I will correct it.

-----------------------------

I think I know what is happening here. If you read the entire thread, I opened this post because I had a difficult scenario and wanted to be sure that all my configs follow Nagios best practices. To summarize, this is my entire scenario:

1 - Nagios XI (1vm)
|
|- Remote Worker 1 // This works with Windows Hostgroups (check everything)
|- Remote Worker 2 // This works with Linux ServiceGroups (check Linux/Unix OS Services - Ex: Load, Disk, etc)
|- Remote Worker 3 // This works with Linux ServiceGroups (check Linux/Unix Applications Services - Ex: Tomcat, Httpd, etc )
|- Remote Worker 4 // This works with DB ServiceGroups (Check Oracle/MSSQL/MySQL Services)
|- Remote Worker 5 // This works with Networking Hostgroups (check everything)
|- Remote Worker 6 // This works with DMZ Hostgroups (check everything)

Note: On Remote Worker 5 I have specified 'services=yes' and 'hosts=yes' to catch any checks that would otherwise be orphaned.

What I am seeing is that everything works OK (I have no orphaned services) except for the 'check host alive' check of the Linux/Unix hosts. Because I created servicegroups for all the service checks but nothing for the host checks, the host checks end up orphaned.

If I force them to run, they return an OK state. However, a few minutes later they start to flap again between OK and orphaned.

Does that make sense? Is there any way to resolve this?

I thought that having a worker with 'hosts=yes' and 'services=yes' would prevent this problem. Otherwise, could I force those host checks to run locally?

Regards

bheden · Post by **bheden** » Thu Jan 19, 2017 1:11 pm

Let's redefine the problem for clarity:

Some hostchecks are being sent to a gearman worker where they are failing.

This raises the question: do all of the gearman workers have the proper routes to all of the hosts?

If not, maybe just having a generic catchall hosts=yes isn't going to work for you. You may have to put all of your hosts into a hostgroup so that they all end up on a gearman worker that has the necessary routes.

maartin.pii · Post by **maartin.pii** » Fri Jan 20, 2017 1:07 pm

bheden wrote:
> Let's redefine the problem for clarity:
>
> Some hostchecks are being sent to a gearman worker where they are failing.
>
> This raises the question: do all of the gearman workers have the proper routes to all of the hosts?
>
> If not, maybe just having a generic catchall hosts=yes isn't going to work for you. You may have to put all of your hosts into a hostgroup so that they all end up on a gearman worker that has the necessary routes.

Hi @bheden - Thanks for your answer.

All of the gearman workers that have the directive 'hosts=yes' have the necessary routes to the hosts.

The issue here is that, for example, a host goes into the orphaned state and a few moments later it is OK again. All of my Unix/Linux host checks flap between the orphaned and OK states.

Do you have any ideas about this?

bheden · Post by **bheden** » Fri Jan 20, 2017 2:48 pm

> All of the gearman workers that have the directive 'hosts=yes' have the necessary routes to the hosts.

Your first step is going to be logging in to each of the 6 gearman workers and manually running the host check command in its entirety. It will likely be something like this:

/usr/local/nagios/libexec/check_icmp -H 127.0.0.1 -w 3000.0,80% -c 5000.0,100% -p 5

You'll have to log in as the user the gearman_worker service is running as (likely user 'nagios', but it could be something else depending on where you installed the packages from).
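
Since 127.0.0.1 only proves that the plugin itself runs, it is worth repeating the check against the addresses of the hosts that actually get orphaned. A minimal sketch, assuming the same thresholds as the command above; `run_host_checks` is a hypothetical helper name, and the plugin path is passed in as an argument rather than hard-coded:

```shell
#!/bin/sh
# run_host_checks PLUGIN HOST...
# Run the host check plugin against each address and print "HOST rc=N",
# where N is the plugin exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
# Returns non-zero if any check did not exit 0.
run_host_checks() {
    plugin=$1
    shift
    failed=0
    for host in "$@"; do
        # Plugin output is discarded; only the exit code is reported.
        "$plugin" -H "$host" -w 3000.0,80% -c 5000.0,100% -p 5 >/dev/null 2>&1
        rc=$?
        echo "$host rc=$rc"
        [ "$rc" -eq 0 ] || failed=1
    done
    return "$failed"
}
```

Run it on each worker, in a shell opened as the worker's user, for example `run_host_checks /usr/local/nagios/libexec/check_icmp <address1> <address2>`, with the addresses of a few affected Linux/Unix hosts in place of the placeholders. A worker that prints a non-zero `rc` for hosts that other workers reach is the one to look at.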

maartin.pii · Post by **maartin.pii** » Tue Jan 24, 2017 9:35 am

bheden wrote:
> > All of the gearman workers that have the directive 'hosts=yes' have the necessary routes to the hosts.
>
> Your first step is going to be logging in to each of the 6 gearman workers and manually running the host check command in its entirety. It will likely be something like this:
>
> /usr/local/nagios/libexec/check_icmp -H 127.0.0.1 -w 3000.0,80% -c 5000.0,100% -p 5
>
> You'll have to log in as the user the gearman_worker service is running as (likely user 'nagios', but it could be something else depending on where you installed the packages from).

Hi @bheden, I've already done what you asked, and the command ran successfully on each worker as the 'nagios' user.

I don't think that this could be a permissions issue.

rkennedy · Post by **rkennedy** » Tue Jan 24, 2017 5:09 pm

maartin.pii wrote:
> bheden wrote:
> > > All of the gearman workers that have the directive 'hosts=yes' have the necessary routes to the hosts.
> >
> > Your first step is going to be logging in to each of the 6 gearman workers and manually running the host check command in its entirety. It will likely be something like this:
> >
> > /usr/local/nagios/libexec/check_icmp -H 127.0.0.1 -w 3000.0,80% -c 5000.0,100% -p 5
> >
> > You'll have to log in as the user the gearman_worker service is running as (likely user 'nagios', but it could be something else depending on where you installed the packages from).
>
> Hi @bheden, I've already done what you asked, and the command ran successfully on each worker as the 'nagios' user.
>
> I don't think that this could be a permissions issue.

I don't believe it's permissions, but possibly routing. When you ran the check, did you run check_icmp against 127.0.0.1, or did you specify the IP of a host whose check is failing? If it's routing related, 10.0.0.0/8 may not have a route to 192.168.5.0/24. One of the workers may not be able to reach 192.168.5.0/24, which would explain the issue.

maartin.pii · Post by **maartin.pii** » Wed Jan 25, 2017 1:12 pm

rkennedy wrote:
> I don't believe it's permissions, but possibly routing. When you ran the check, did you run check_icmp against 127.0.0.1, or did you specify the IP of a host whose check is failing? If it's routing related, 10.0.0.0/8 may not have a route to 192.168.5.0/24. One of the workers may not be able to reach 192.168.5.0/24, which would explain the issue.

I've tried both. All my firewall routes are OK, which makes sense because none of my service checks have errors. What is more, if it were a routing issue, the error would be something like 'host unreachable' or 'host down', not 'orphan check (is host queue running)', wouldn't it?

Regards,

rkennedy · Post by **rkennedy** » Wed Jan 25, 2017 6:02 pm

Can you show us the full output of gearman_top2 from the gearman machine?
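
If it is easier to paste, the same queue information can be reduced to just the suspicious queues. The sketch below does not parse gearman_top2's table; it assumes the plain `gearadmin --status` layout from gearmand (one line per queue: name, total jobs, running jobs, available workers, with a final `.` line). The function name `starved_queues` is made up for illustration.

```shell
#!/bin/sh
# starved_queues
# Read `gearadmin --status` style lines on stdin:
#   <queue> <jobs total> <jobs running> <available workers>
# and print the names of queues that hold jobs but have no worker registered.
starved_queues() {
    # NF == 4 skips the terminating "." line and anything malformed.
    awk 'NF == 4 && $2 > 0 && $4 == 0 { print $1 }'
}
```

Running `gearadmin --status | starved_queues` on the gearman machine a few times while the hosts are flapping should show whether a host-related queue is sitting there with jobs and zero workers at the moment the checks go orphaned.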

Nagios Support Forum
