Automatically Remove Services DEADPOOL - Part II
Posted: Fri Oct 08, 2021 9:12 am
Hello Nagios Support:
This is a follow-up to the previous post:
https://support.nagios.com/forum/viewto ... 16&t=61019
We have done some further testing. Our findings are below, and we would like your guidance on a better way to achieve what we are after.
We have enabled the Deadpool processor.
At the moment we are only interested in services. For these we have set stage 1 to 30 days and stage 2 to 300 days, with the action set to "Deactivate".
We want to deadpool only a very limited set of services, namely those that follow the naming convention "SBC-XXX-XXX-XXX".
We built an exclusion list from the first three characters of every service name that begins with "S" (command below). You can see "SBC" and "sbc" in the list; we will remove those from the exclusion filter.
[root@nagios-p services]# cat *.cfg | grep service_description | sed 's/service_description//g' | sed 's/ //g' | cut -c1-3 | grep ^[S,s] | sort -u
SAE
SAJ
sbc <------
SBC <------
SCH
SDF
SDH
SDI
SDM
SDR
Se0
Se1
sen
Sen
SEP
Ser
SFE
SFR
SGO
SGP
SHU
Sig
SIM
SIP
SIT
SKA
SKC
SKP
SKY
SLE
Smp
SNM
SON
Spa
SPV
SRT
SS-
SSB
SSH
Sta
STA
STG
sto
STS
Sub
Swa
SWG
Swi
SWZ
sys
Sys
We also want to add ALL OTHER SERVICES to this exclusion filter (A-R*, T-Z*, etc.). After some initial testing, it appears that Deadpool *does* work retroactively, because several services were deadpooled that we believe should not have been. Our exclusion filter looked like this:
A*
B*
C*
D*
E*
F*
G*
H*
I*
J*
K*
L*
M*
N*
O*
P*
Q*
R*
T*
U*
V*
W*
X*
Y*
Z*
/*
SAE*
SAJ*
SCH*
SDF*
SDH*
SDI*
SDM
SDR*
Se0*
Se1*
sen*
Sen*
SEP*
Ser*
SFE*
SFR*
SGO*
SGP*
SHU*
Sig*
SIM*
SIP*
SIT*
SKA*
SKC*
SKP*
SKY*
SLE*
Smp*
SNM*
SON*
Spa*
SPV*
SRT*
SS-*
SSB*
SSH*
Sta*
STA*
STG*
STS*
Sub*
Swa*
SWG*
Swi*
SWZ*
sys*
Sys*
When we enabled Deadpool, we immediately received ~20 emails reporting that dozens of services had been deadpooled. This is why we believe Deadpool does work retroactively, contrary to the documentation.
From https://assets.nagios.com/downloads/nag ... ios-XI.pdf:
"Deadpool does not work retroactively. For example, if a service has already been down for 4 days and then Deadpool is activated with its default setting to delete after 3 days, the service will not be deleted."
One email listed the following services (trimmed) as having been added to the Deadpool service group:
...
NOAS-CSG-MBI-IO6 / HP Switch CPU Usage
NOAS-CSG-MBI-IO6 / HP Switch Memory Usage
NOAS-CSG-MBI-IO6 / Ping
NOAS-CSG-MBI-IO5 / HP Switch CPU Usage
NOAS-CSG-MBI-IO5 / HP Switch Memory Usage
NOAS-CSG-MBI-IO5 / Ping
NOAS-CSG-MBI-IO2 / HP Switch CPU Usage
NOAS-CSG-MBI-IO2 / HP Switch Memory Usage
NOAS-CSG-MBI-IO2 / Ping
NOAS-CSG-MBI-IO1 / HP Switch CPU Usage
NOAS-CSG-MBI-IO1 / HP Switch Memory Usage
NOAS-CSG-MBI-IO1 / Ping
NOAS-CSG-MBI-APP04-iLO / Overall Temperature Condition
NOAS-CSG-MBI-APP04-iLO / Overall Health Condition
NOAS-CSG-MBI-APP03-iLO / Overall Health Condition
NOAS-CSG-MBI-APP03-iLO / Overall Temperature Condition
NOAS-CSG-MBI-APP02-iLO / Overall Health Condition
NOAS-CSG-MBI-APP02-iLO / Overall Temperature Condition
...
We then looked into the regular expression rules and concluded that the entries in our exclusion filter may need to be wrapped in regex delimiters, like this:
/A*/
/B*/
/C*/
/D*/
/E*/
/F*/
/G*/
/H*/
/I*/
/J*/
/K*/
/L*/
/M*/
/N*/
/O*/
/P*/
/Q*/
/R*/
/T*/
/U*/
/V*/
/W*/
/X*/
/Y*/
/Z*/
/a*/
/b*/
/c*/
/d*/
/e*/
/f*/
/g*/
/h*/
/i*/
/j*/
/k*/
/l*/
/m*/
/n*/
/o*/
/p*/
/q*/
/r*/
/s*/
/t*/
/u*/
/v*/
/w*/
/x*/
/y*/
/z*/
/SAE*/
/SAJ*/
/SCH*/
/SDF*/
/SDH*/
/SDI*/
/SDM/
/SDR*/
/Se0*/
/Se1*/
/sen*/
/Sen*/
/SEP*/
/Ser*/
/SFE*/
/SFR*/
/SGO*/
/SGP*/
/SHU*/
/Sig*/
/SIM*/
/SIP*/
/SIT*/
/SKA*/
/SKC*/
/SKP*/
/SKY*/
/SLE*/
/Smp*/
/SNM*/
/SON*/
/Spa*/
/SPV*/
/SRT*/
/SS-*/
/SSB*/
/SSH*/
/Sta*/
/STA*/
/STG*/
/STS*/
/Sub*/
/Swa*/
/SWG*/
/Swi*/
/SWZ*/
/sys*/
/Sys*/
/\/*/
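A minimal sketch of why the delimited patterns above may match every service name, assuming Deadpool evaluates each filter as an unanchored PCRE-style regular expression (we have not confirmed this in the Deadpool source). Python's `re` module stands in for the real engine, and `strip_delimiters` and `filter_matches` are hypothetical helpers. In regex syntax `A*` means "zero or more A", which any string satisfies, including one that contains no "A" at all.

```python
import re

def strip_delimiters(filter_text: str) -> str:
    """Remove the leading and trailing '/' from a '/pattern/' style filter."""
    if len(filter_text) >= 2 and filter_text.startswith("/") and filter_text.endswith("/"):
        return filter_text[1:-1]
    return filter_text

def filter_matches(filter_text: str, service_name: str) -> bool:
    """Apply the pattern as an unanchored search (assumed behaviour)."""
    return re.search(strip_delimiters(filter_text), service_name) is not None

# Service names taken from the log excerpts in this post.
services = ["LDAP Server", "Ping", "SBC1-NUT-AM:EP-TK-K0000005256-0-TCP"]

for name in services:
    # '/A*/' accepts an empty run of 'A', so it matches any name.
    # '/^A/' only matches names that actually start with 'A'.
    print(name, filter_matches("/A*/", name), filter_matches("/^A/", name))
```

If this assumption holds, every entry of the form `/X*/` excludes all services, which would also cover the SBC services we do want deadpooled.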
QUESTION 1: Is this the correct way to build an exclusion filter?
QUESTION 2: Would the following be more efficient, or does it not matter?
/[A-R]\*/
/[T-Z]\*/
...
followed by the remaining S* entries.
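To illustrate what these shorter patterns would do, here is a hedged sketch under the assumption that filters are treated as unanchored regular expressions, again with Python's `re` standing in for the real engine. In `/[A-R]\*/` the backslash makes the asterisk literal, so the pattern only matches a letter A-R followed by an actual `*` character. An anchored class such as `^[A-RT-Z]` expresses "starts with one of these letters". The single negative-lookahead pattern is hypothetical: we have not verified that Deadpool accepts lookaheads or case-insensitive matching. The lowercase name in the sample list is made up for illustration.

```python
import re

# Letter A-R followed by a literal '*' character.
escaped_star = re.compile(r"[A-R]\*")
# Name starts with an uppercase letter other than 'S'.
anchored = re.compile(r"^[A-RT-Z]")
# Name does NOT start with 'SBC-' (case-insensitive); hypothetical single filter.
not_sbc = re.compile(r"^(?!SBC-)", re.IGNORECASE)

def excluded(pattern: re.Pattern, service_name: str) -> bool:
    """True if the exclusion pattern matches, i.e. the service would be skipped."""
    return pattern.search(service_name) is not None

names = ["Ping", "HP Switch CPU Usage", "SSH", "SBC-XXX-XXX-XXX", "sbc-xxx-xxx-xxx"]

for name in names:
    print(
        f"{name!r}: escaped_star={excluded(escaped_star, name)} "
        f"anchored={excluded(anchored, name)} not_sbc={excluded(not_sbc, name)}"
    )
```

Under these assumptions the lookahead form would replace the whole list with one entry, since it excludes everything except names beginning with "SBC-".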
Once Deadpool was enabled, we found that the CPU load of our mysqld process was higher than normal.
When we look at the Deadpool log we see the following:
Matched service filter '/u*/' -> skipping service
Matched service filter '/v*/' -> skipping service
Matched service filter '/w*/' -> skipping service
Matched service filter '/x*/' -> skipping service
Matched service filter '/y*/' -> skipping service
Matched service filter '/z*/' -> skipping service
Matched service filter '/\/*/' -> skipping service
HOST/SVC: GMV-OSS-RC-17B-TPOSSINF01/LDAP Server = 1600208922 (33492739 seconds ago) [1]
Matched service filter '/A*/' -> skipping service
Matched service filter '/B*/' -> skipping service
Matched service filter '/C*/' -> skipping service
Matched service filter '/D*/' -> skipping service
Matched service filter '/E*/' -> skipping service
Here are some relevant lines:
HOST/SVC: NOAS-CS-G-GENERIC/CSg FRK Service Performance = 1631546426 (2155296 seconds ago) [0]
HOST/SVC: NOAS-CS-G-GENERIC/CSg HKG Service Performance = 1631546441 (2155281 seconds ago) [0]
HOST/SVC: GSIP-ZRH-SBC4/GSIP SBC Ping = 1629771474 (3930248 seconds ago) [1]
HOST/SVC: GSIP-ZRH-SBC3/GSIP SBC Ping = 1629773498 (3928224 seconds ago) [1]
HOST/SVC: GSIP-RSM-CHC/ASR for GSIP-CHC-SBC9:ZAYO = 1632781038 (920684 seconds ago) [0]
HOST/SVC: GSIP-RSM-AMH/ASR for GSIP-FFT-SBC5:PUREIP = 1632750318 (951404 seconds ago) [0]
HOST/SVC: GSIP-NUT-SBC1/SBC1-NUT-AM:EP-TK-K0000005256-0-TCP = 1633700199 (1523 seconds ago) [0]
HOST/SVC: GSIP-NUT-SBC1/SBC1-NUT-AM:EP-TK-K0000008512-0-TCP = 1633700199 (1523 seconds ago) [0]
HOST/SVC: GSIP-NUT-SBC1/SBC1-NUT-AM:EP-TK-IDAL-TLS00000804-0-TCP = 1633700199 (1523 seconds ago) [0]
HOST/SVC: GSIP-CE1-AMH-EU/Interfaces ERR DISC = 1610642464 (23059258 seconds ago) [1]
HOST/SVC: GSIP-CE1-AMH-EU/Ping = 1611896721 (21805001 seconds ago) [1]
HOST/SVC: GSIP-CE1-AMH-EU/CPU Usage = 1610642339 (23059383 seconds ago) [1]
HOST/SVC: GSIP-CE1-AMH-EU/Hardware Health - Card = 1610642499 (23059223 seconds ago) [1]
HOST/SVC: GSIP-CE1-AMH-EU/Hardware Health - Fan and PSU = 1610642499 (23059223 seconds ago) [1]
HOST/SVC: GSIP-CE1-AMH-EU/Memory Usage = 1610642499 (23059223 seconds ago) [1]
HOST/SVC: GMV_WIN_LDAP_TUF/Logon Errors = 1632812181 (889541 seconds ago) [0]
HOST/SVC: GMV_WIN_LDAP_LGS/Logon Errors = 1607973862 (25727860 seconds ago) [1]
HOST/SVC: GMV-OSS-RC-17B-TPOSSNAS02-ILO/Overall Temperature Condition = 1601108976 (32592746 seconds ago) [1]
HOST/SVC: GMV-OSS-RC-17B-TPOSSINF01/LDAP Server = 1600208922 (33492800 seconds ago) [1]
...
...
Matched service filter '/SIT*/' -> skipping service
Matched service filter '/\/*/' -> skipping service
HOST/SVC: GSIP-RSM-CHC/ASR for GSIP-CHC-SBC9:ZAYO = 1632781038 (920924 seconds ago) [0]
Passive only -> skipping service
HOST/SVC: GSIP-RSM-AMH/ASR for GSIP-FFT-SBC5:PUREIP = 1632750318 (951644 seconds ago) [0]
Passive only -> skipping service
HOST/SVC: GSIP-NUT-SBC1/SBC1-NUT-AM:EP-TK-K0000005256-0-TCP = 1633700199 (1763 seconds ago) [0]
Passive only -> skipping service
HOST/SVC: GSIP-NUT-SBC1/SBC1-NUT-AM:EP-TK-K0000008512-0-TCP = 1633700199 (1763 seconds ago) [0]
Passive only -> skipping service
HOST/SVC: GSIP-NUT-SBC1/SBC1-NUT-AM:EP-TK-IDAL-TLS00000804-0-TCP = 1633700199 (1763 seconds ago) [0]
Passive only -> skipping service
HOST/SVC: GSIP-CE1-AMH-EU/Interfaces ERR DISC = 1610642464 (23059498 seconds ago) [1]
Matched service filter '/A*/' -> skipping service
Matched service filter '/B*/' -> skipping service
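To help read log lines like the ones above, here is a small sketch that converts the "seconds ago" value into days and compares it with our stage 1 (30 days) and stage 2 (300 days) settings. The regular expression reflects only the line layout visible in this excerpt. The meaning of the trailing `[0]`/`[1]` value is not known to us, so it is carried through as an uninterpreted `flag`. Whether Deadpool compares this age directly against the stage thresholds is also an assumption, and the function and field names are our own, not part of Deadpool.

```python
import re
from typing import NamedTuple, Optional

SECONDS_PER_DAY = 86400
STAGE1_DAYS = 30    # our stage 1 setting
STAGE2_DAYS = 300   # our stage 2 setting

# Layout inferred from the log excerpt: HOST/SVC: host/service = epoch (N seconds ago) [flag]
LINE_RE = re.compile(
    r"^HOST/SVC: (?P<host>[^/]+)/(?P<service>.+) = (?P<epoch>\d+) "
    r"\((?P<age>\d+) seconds ago\) \[(?P<flag>\d+)\]$"
)

class Entry(NamedTuple):
    host: str
    service: str
    age_days: float
    flag: int  # meaning unknown; kept as logged

def parse_line(line: str) -> Optional[Entry]:
    """Parse one HOST/SVC log line; return None for any other line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return Entry(m["host"], m["service"], int(m["age"]) / SECONDS_PER_DAY, int(m["flag"]))

def stage_reached(entry: Entry) -> int:
    """Return 0, 1 or 2 depending on which age threshold has been passed."""
    if entry.age_days >= STAGE2_DAYS:
        return 2
    if entry.age_days >= STAGE1_DAYS:
        return 1
    return 0

sample = [
    "HOST/SVC: GMV-OSS-RC-17B-TPOSSINF01/LDAP Server = 1600208922 (33492739 seconds ago) [1]",
    "HOST/SVC: GSIP-ZRH-SBC4/GSIP SBC Ping = 1629771474 (3930248 seconds ago) [1]",
    "Passive only -> skipping service",
]

for line in sample:
    entry = parse_line(line)
    if entry is not None:
        print(f"{entry.host} / {entry.service}: {entry.age_days:.1f} days, stage {stage_reached(entry)}")
```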
QUESTION 3: What does "Passive only" mean? Will passive services be skipped regardless of their Deadpool status?
Our exclusion filter appears to be working, but we are concerned that it may be causing excessive CPU load on MySQL.
Thank you in advance for your support.
Regards,
JLu