NagiosXI+Remote-Workers-(Distributed Monitoring)
-
maartin.pii
- Posts: 84
- Joined: Wed May 18, 2016 1:39 pm
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
Upload worker config
EDIT: Removed the gearman file, as there may be sensitive information in it.
-
bheden
- Product Development Manager
- Posts: 179
- Joined: Thu Feb 13, 2014 9:50 am
- Location: Nagios Enterprises
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
After reviewing the configuration files you posted, a few things stick out:
You do not have all of your hostgroups listed in the NEB config in the worker configs. Some are missing.
Those will be orphaned.
You have some hostgroups specified in both the hostgroups directive and the localhostgroups. I've actually never seen someone do that, but I assume it could end up orphaned as well.
Remember that whatever hostgroup and servicegroup you define to split into a queue in the NEB config HAS TO HAVE A CORRESPONDING DEFINITION on one of the workers. (Or it will be orphaned).
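One way to spot hostgroups that the NEB config splits into a queue but that no worker picks up is to compare the `hostgroups=` lines across the files. The following is only a rough sketch: the function names are hypothetical, and it assumes each directive is written as `hostgroups=a,b,c` at the start of a line with no spaces around the `=`. The same idea could be applied to `servicegroups=` by changing the pattern.

```shell
#!/bin/sh
# Hypothetical helpers; file names passed in are up to you.

# Print every group named on a "hostgroups=" line in the given files,
# one per line, sorted and de-duplicated. The pattern is anchored, so
# "localhostgroups=" lines are not picked up.
list_groups() {
    grep -h '^hostgroups=' "$@" 2>/dev/null \
        | sed 's/^hostgroups=//' \
        | tr ',' '\n' \
        | sed 's/^[ \t]*//; s/[ \t]*$//' \
        | grep -v '^$' \
        | sort -u
}

# Usage: missing_hostgroups NEB_CONF WORKER_CONF...
# Prints the hostgroups listed in the NEB config that none of the
# worker configs list.
missing_hostgroups() {
    neb_conf=$1
    shift
    neb_list=$(mktemp)
    worker_list=$(mktemp)
    list_groups "$neb_conf" > "$neb_list"
    list_groups "$@" > "$worker_list"
    # comm -23 keeps lines that appear only in the first (NEB) list.
    comm -23 "$neb_list" "$worker_list"
    rm -f "$neb_list" "$worker_list"
}
```

Any name this prints is a candidate for an orphaned queue: either add it to a worker or drop it from the NEB config.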
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Nagios Enterprises
Senior Developer
-
maartin.pii
- Posts: 84
- Joined: Wed May 18, 2016 1:39 pm
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
bheden wrote: After reviewing the configuration files you posted, a few things stick out:
You do not have all of your hostgroups listed in the NEB config in the worker configs. Some are missing.
Those will be orphaned.
You have some hostgroups specified in both the hostgroups directive and the localhostgroups. I've actually never seen someone do that, but I assume it could end up orphaned as well.
Remember that whatever hostgroup and servicegroup you define to split into a queue in the NEB config HAS TO HAVE A CORRESPONDING DEFINITION on one of the workers. (Or it will be orphaned).
Hi @bheden - First of all, thanks for answering.
- You do not have all of your hostgroups listed in the NEB config in the worker configs. Some are missing.
Those will be orphaned.
Yep, you are right. However, all of the hostgroups that are listed in the NEB config but not in the worker configs are hostgroups that are no longer in use; I only kept them there for company legacy reasons. I can remove them from the NEB config, no problem.
You have some hostgroups specified in both the hostgroups directive and the localhostgroups. I've actually never seen someone do that, but I assume it could end up orphaned as well.
Right, I have to correct that. However, the servers defined in those hostgroups are not the ones causing me problems. But I will correct it.
-----------------------------
I think that I know what is happening here. If you read the entire thread, I opened this post because I had a difficult scenario and I wanted to be sure that all my configs follow Nagios best practices. To summarize, this is my entire scenario:
1 - Nagios XI (1 VM)
|
|- Remote Worker 1 // This works with Windows Hostgroups (check everything)
|- Remote Worker 2 // This works with Linux ServiceGroups (check Linux/Unix OS Services - Ex: Load, Disk, etc.)
|- Remote Worker 3 // This works with Linux ServiceGroups (check Linux/Unix Application Services - Ex: Tomcat, Httpd, etc.)
|- Remote Worker 4 // This works with DB ServiceGroups (check Oracle/MSSQL/MySQL Services)
|- Remote Worker 5 // This works with Networking Hostgroups (check everything)
|- Remote Worker 6 // This works with DMZ Hostgroups (check everything)
Note: On Remote Worker 5 I have specified 'services=yes' and 'hosts=yes' to catch any checks that would otherwise be orphaned.
What I am seeing is that everything works OK (I have NO orphaned services) except for the 'check host alive' check of the Linux/Unix hosts. Because I created servicegroups for all the service checks but no hostgroup for the host checks, the host checks end up orphaned.
If I force them to run, they return an OK state. However, a few minutes later they start flapping again between OK and orphaned.
Am I explaining myself clearly? Is there any way to resolve this?
I thought that having a worker with hosts=yes and services=yes would prevent this problem. Otherwise, could I force those host checks to run locally?
Regards
-
bheden
- Product Development Manager
- Posts: 179
- Joined: Thu Feb 13, 2014 9:50 am
- Location: Nagios Enterprises
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
Let's redefine the problem for clarity:
Some hostchecks are being sent to a gearman worker where they are failing.
This begs the question: do all of the gearman workers have the proper routes to all of the hosts?
If not, maybe just having a generic catch-all hosts=yes isn't going to work for you. You may have to put all of your hosts into a hostgroup so that they all end up on a gearman worker that has the necessary routes.
-
maartin.pii
- Posts: 84
- Joined: Wed May 18, 2016 1:39 pm
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
bheden wrote: Let's redefine the problem for clarity:
Some hostchecks are being sent to a gearman worker where they are failing.
This begs the question: do all of the gearman workers have the proper routes to all of the hosts?
If not, maybe just having a generic catch-all hosts=yes isn't going to work for you. You may have to put all of your hosts into a hostgroup so that they all end up on a gearman worker that has the necessary routes.
Hi @bheden - Thanks for your answer.
All of the gearman workers that have the directive 'hosts=yes' have the necessary routes to the hosts.
The issue here is that, for example, a host goes into an orphaned state and a few moments later returns to OK. All of my Unix/Linux host checks flap between orphaned and OK.
Do you have any ideas about it?
-
bheden
- Product Development Manager
- Posts: 179
- Joined: Thu Feb 13, 2014 9:50 am
- Location: Nagios Enterprises
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
maartin.pii wrote: All of the gearman workers that have the directive 'hosts=yes' have the necessary routes to the hosts.
Your first step is going to be logging in to each of the 6 gearman workers and manually running the host check command in its entirety. It will likely be something like this:
Code:
/usr/local/nagios/libexec/check_icmp -H 127.0.0.1 -w 3000.0,80% -c 5000.0,100% -p 5
You'll have to log in as the user the gearman_worker service is running as (likely user 'nagios', but it could be something else depending on where you installed the packages from).
-
maartin.pii
- Posts: 84
- Joined: Wed May 18, 2016 1:39 pm
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
bheden wrote: Your first step is going to be logging in to each of the 6 gearman workers and manually running the host check command in its entirety. It will likely be something like this:
Code:
/usr/local/nagios/libexec/check_icmp -H 127.0.0.1 -w 3000.0,80% -c 5000.0,100% -p 5
You'll have to log in as the user the gearman_worker service is running as (likely user 'nagios', but it could be something else depending on where you installed the packages from).
Hi @bheden - I've already done what you asked, and the command ran successfully on each worker as the 'nagios' user.
I don't think that this could be a permissions issue.
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
maartin.pii wrote: Hi @bheden - I've already done what you asked, and the command ran successfully on each worker as the 'nagios' user.
I don't think that this could be a permissions issue.
I don't believe it's permissions, but possibly routing. When you ran the check, did you run check_icmp against 127.0.0.1, or did you specify the IP of a host whose check is failing? If it's routing related, 10.0.0.0/8 may, for example, not have a route to 192.168.5.0/24. One of the workers may not be able to reach 192.168.5.0/24, which would explain the issue.
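To test routing rather than the plugin itself, the same check can be pointed at the real addresses of the flapping hosts instead of 127.0.0.1. The following is a minimal sketch of that idea: `probe_hosts` and the `CHECK_ICMP` variable are hypothetical names, the plugin path and thresholds are the ones from the command quoted in this thread, and any address you pass is your own. Run it on each worker as the user the worker service runs as.

```shell
#!/bin/sh
# Path taken from the command quoted in this thread; override CHECK_ICMP
# if the plugin lives elsewhere on your workers.
CHECK_ICMP=${CHECK_ICMP:-/usr/local/nagios/libexec/check_icmp}

# Usage: probe_hosts ADDRESS...
# Runs the host check against each address and prints one line per host
# with the plugin's exit code (0 = OK, 1 = WARNING, 2 = CRITICAL,
# 3 = UNKNOWN).
probe_hosts() {
    failures=0
    for addr in "$@"; do
        "$CHECK_ICMP" -H "$addr" -w 3000.0,80% -c 5000.0,100% -p 5 >/dev/null 2>&1
        rc=$?
        echo "$addr exit=$rc"
        [ "$rc" -eq 0 ] || failures=$((failures + 1))
    done
    # Return non-zero if any host check did not come back OK.
    [ "$failures" -eq 0 ]
}
```

If one worker reports a non-zero exit code for an address that the other workers reach without trouble, that worker is the likely place to look for a missing route.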
Former Nagios Employee
-
maartin.pii
- Posts: 84
- Joined: Wed May 18, 2016 1:39 pm
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
bheden wrote: I don't believe it's permission, but possibly routing. When you ran the check - did you run check_icmp against 127.0.0.1 or did you specify the IP of a check that is failing? If it's routing related, 10.0.0.0/8 may not have a route to 192.168.5.0/24. One of the workers may not be able to reach 192.168.5.0/24 which would explain the issue.
I've tried both. All of my firewall routes are OK. And that makes sense, because none of my service checks have errors. What is more, if it were a routing issue, the error should be something like 'host unreachable' or 'host down', not 'orphan check (is host queue running)', wouldn't it?
Regards,
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
Can you show us the full output of gearman_top2 from the gearman machine?
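When reading that output, the rows to look for are queues with no worker attached, since jobs sent there have nobody to run them. The following is a rough sketch of a filter for that. It assumes the queue table is laid out as `name | workers available | jobs waiting | jobs running`, one queue per row; that layout and the function name `unserved_queues` are assumptions, so adjust the field numbers if your output differs.

```shell
#!/bin/sh
# Reads queue status text on stdin and prints the name of every queue
# that has zero workers available. Header and separator rows are skipped
# because their second column is not a plain number.
unserved_queues() {
    awk -F'|' '
        NF == 4 && $2 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            name = $1
            # Trim leading and trailing blanks from the queue name.
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            if ($2 + 0 == 0) print name
        }
    '
}
```

The filter reads from a pipe or a file, so the gearman_top2 output can be saved first and fed in afterwards, for example `unserved_queues < saved_output.txt` (the file name is a placeholder).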
Former Nagios Employee