Didn't receive correction alerts

ITOMB_IMT
Posts: 181
Joined: Wed Oct 17, 2018 12:55 pm

Didn't receive correction alerts

Post by ITOMB_IMT »

Hi,

We are currently running Nagios XI 5.6.3. Today at 9 am we received alerts saying the host was DOWN. After a while the host came back UP, but we didn't receive a correction (recovery) email from XI.

I checked the State History report and saw this:
2019-10-07 09:23:23 UP HARD 1 of 2 OK - 10.61.46.208: rta 0.256ms, lost 40%
2019-10-07 09:03:19 DOWN HARD 2 of 2 CRITICAL - 10.61.46.208: rta nan, lost 100%

In the Notifications report:
2019-10-07 09:13:26 - Host Problem No DOWN nagiosadmin Nagios XI CRITICAL - 10.61.46.208: rta nan, lost 100%
2019-10-07 09:03:19 - Host Problem No DOWN nagiosadmin Nagios XI CRITICAL - 10.61.46.208: rta nan, lost 100%


The host.cfg is as follows,


define host {
    host_name               host_name
    use                     xiwizard_linuxserver_host
    address                 ip_address
    initial_state           o
    max_check_attempts      2
    check_interval          5
    retry_interval          1
    active_checks_enabled   1
    check_period            xi_timeperiod_24x7
    process_perf_data       1
    contacts                nagiosadmin
    contact_groups          Hadoop_admins
    notification_interval   10
    notification_period     xi_timeperiod_24x7
    notification_options    d,u,r,f,
    notifications_enabled   1
    icon_image              redhat.png
    statusmap_image         redhat.png
    _xiwizard               linux-server
    register                1
}

We didn't receive any correction email for that host. Any support is much appreciated.
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: Didn't receive correction alerts

Post by benjaminsmith »

Hi @ITOMB_IMT,

Thanks for posting the state history and notification reports. I'd like to take a look at the mail logs as well.

Can you PM the system profile along with the exact name of this host?

To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
Save the profile.zip file, share it in a private message, and then reply to this post to bring it up in the queue.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Be sure to check out our Knowledgebase for helpful articles and solutions!
ITOMB_IMT
Posts: 181
Joined: Wed Oct 17, 2018 12:55 pm

Re: Didn't receive correction alerts

Post by ITOMB_IMT »

I am receiving an error while downloading the system profile:

PROFILE BUILD FAILED
Array
(
)
CODE: 1
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: Didn't receive correction alerts

Post by benjaminsmith »

Hello,

That error message is typically caused by an incorrect sudoers file. Please follow the steps in the article below to resolve it, and let us know if you need any assistance.

Nagios XI - Profile Build Failed
ITOMB_IMT
Posts: 181
Joined: Wed Oct 17, 2018 12:55 pm

Re: Didn't receive correction alerts

Post by ITOMB_IMT »

I have upgraded Nagios XI to 5.6.7, and even now I am receiving only recovery alerts. When I check the reports I see that both critical and recovery emails were sent to the user, but they weren't delivered to the user.

When I check the logs I see the following errors around that time:

Oct 27 04:02:01 APP_Server nagios: SERVICE NOTIFICATION: user2;client_machine;Ping;UNKNOWN;xi_service_notification_handler;(No output on stdout) stderr:
Oct 27 04:02:01 APP_Server nagios: wproc: 'Core Worker user1' seems to be choked. ret = -1; bufsize = 770: written = 0; errno = 32 (Broken pipe)
Oct 27 04:02:01 APP_Server nagios: SERVICE NOTIFICATION: 518592;client_machine;Ping;UNKNOWN;xi_service_notification_handler;(No output on stdout) stderr:
Oct 27 04:02:01 APP_Server nagios: Caught SIGTERM, shutting down...
Oct 27 04:02:01 APP_Server nagios: SERVICE NOTIFICATION: user2;client_machine;Ping;UNKNOWN;xi_service_notification_handler;(No output on stdout) stderr:
Oct 27 04:02:01 APP_Server nagios: wproc: 'Core Worker user1' seems to be choked. ret = -1; bufsize = 770: written = 0; errno = 32 (Broken pipe)
Oct 27 04:02:01 APP_Server nagios: SERVICE NOTIFICATION: 518592;client_machine;Ping;UNKNOWN;xi_service_notification_handler;(No output on stdout) stderr:
Oct 27 04:02:01 APP_Server nagios: wproc: 'Core Worker 31206' seems to be choked. ret = -1; bufsize = 773: written = 0; errno = 32 (Broken pipe)
Oct 27 04:02:01 APP_Server nagios: SERVICE NOTIFICATION: 557291;client_machine;Ping;UNKNOWN;xi_service_notification_handler;(No output on stdout) stderr:
Oct 27 04:02:01 APP_Server nagios: wproc: 'Core Worker 31207' seems to be choked. ret = -1; bufsize = 766: written = 0; errno = 32 (Broken pipe)
Oct 27 04:02:01 APP_Server nagios: SERVICE NOTIFICATION: DMA_UNIX_pager;client_machine;Ping;UNKNOWN;xi_service_notification_handler;(No output on stdout) stderr:
Oct 27 04:02:01 APP_Server nagios: wproc: 'Core Worker 31208' seems to be choked. ret = -1; bufsize = 776: written = 0; errno = 32 (Broken pipe)
Oct 27 04:02:01 APP_Server nagios: SERVICE ALERT: client_machine;Ping;UNKNOWN;HARD;1;(No output on stdout) stderr:
Oct 27 04:02:01 APP_Server nagios: wproc: 'Core Worker 31209' seems to be choked. ret = -1; bufsize = 566: written = 0; errno = 32 (Broken pipe)
Oct 27 04:02:01 APP_Server nagios: wproc: Socket to worker Core Worker user1 broken, removing
Oct 27 04:02:01 APP_Server nagios: wproc: Socket to worker Core Worker 31206 broken, removing
Oct 27 04:02:01 APP_Server nagios: wproc: Socket to worker Core Worker 31207 broken, removing
Oct 27 04:02:01 APP_Server nagios: wproc: 'Core Worker 31206' seems to be choked. ret = -1; bufsize = 773: written = 0; errno = 32 (Broken pipe)
Oct 27 04:02:01 APP_Server nagios: SERVICE NOTIFICATION: 557291;client_machine;Ping;UNKNOWN;xi_service_notification_handler;(No output on stdout) stderr:
Oct 27 04:02:01 APP_Server nagios: wproc: 'Core Worker 31207' seems to be choked. ret = -1; bufsize = 766: written = 0; errno = 32 (Broken pipe)
Oct 27 04:02:01 APP_Server nagios: SERVICE NOTIFICATION: DMA_UNIX_pager;client_machine;Ping;UNKNOWN;xi_service_notification_handler;(No output on stdout) stderr:
Oct 27 04:02:01 APP_Server nagios: wproc: 'Core Worker 31208' seems to be choked. ret = -1; bufsize = 776: written = 0; errno = 32 (Broken pipe)
Oct 27 04:02:01 APP_Server nagios: SERVICE ALERT: client_machine;Ping;UNKNOWN;HARD;1;(No output on stdout) stderr:
Oct 27 04:02:01 APP_Server nagios: wproc: 'Core Worker 31209' seems to be choked. ret = -1; bufsize = 566: written = 0; errno = 32 (Broken pipe)

Please help me in rectifying the issue.

Thanks,
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: Didn't receive correction alerts

Post by benjaminsmith »

Hello @ITOMB_IMT,

Thanks for posting the log file. The logs show errors with the Nagios Core workers, which are likely related to system resources or high CPU usage. Are you able to download a profile now? If we can get a current system profile, we can review some of the key log files.
wproc: 'Core Worker 31208' seems to be choked. ret = -1; bufsize = 776: written = 0; errno = 32 (Broken pipe)
Make sure you have adequate disk space.

Code: Select all

df -h
Log in as the nagios user and run the following command to show the current resource limits.

Code: Select all

ulimit -a
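If the df output is long, a small filter can make nearly full filesystems easier to spot. This is only a rough sketch: the function name `flag_full_filesystems` and the 90% threshold are assumptions for illustration, not something Nagios XI provides. It reads `df -P` style output on stdin, so it can also be run against saved output.

```shell
#!/bin/sh
# Print every filesystem whose Use% is at or above a threshold.
# Reads `df -P`-style output on stdin (header line first).
# The function name and default threshold are hypothetical.
flag_full_filesystems() {
    threshold="${1:-90}"   # percent; assumed default
    awk -v t="$threshold" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)              # strip the percent sign
            if (use + 0 >= t) print $6 " is at " $5
        }'
}

# Typical use on a live system (not run here):
#   df -P | flag_full_filesystems 90
```

No output means nothing is at or above the threshold. Mount points containing spaces are not handled by this sketch.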
To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
Save the profile.zip file, share it in a private message, and then reply to this post to bring it up in the queue.
ITOMB_IMT
Posts: 181
Joined: Wed Oct 17, 2018 12:55 pm

Re: Didn't receive correction alerts

Post by ITOMB_IMT »

I cannot download the system profile. I followed the steps you provided, but no luck.

# df -h
Filesystem                    Size  Used  Avail  Use%  Mounted on
/dev/mapper/sysvg-lv_root     25G   121M  25G    1%    /
devtmpfs                      7.8G  0     7.8G   0%    /dev
tmpfs                         7.8G  0     7.8G   0%    /dev/shm
tmpfs                         7.8G  836K  7.8G   1%    /run
tmpfs                         7.8G  0     7.8G   0%    /sys/fs/cgroup
/dev/mapper/sysvg-lv_usr      50G   7.6G  43G    16%   /usr
/dev/vda1                     509M  233M  276M   46%   /boot
/dev/mapper/sysvg-lv_home     15G   141M  15G    1%    /home
/dev/mapper/sysvg-lv_opt      25G   2.1G  23G    9%    /opt
/dev/mapper/sysvg-lv_var      20G   2.5G  18G    13%   /var
/dev/mapper/sysvg-lv_store    40G   2.5G  38G    7%    /store
/dev/mapper/sysvg-lv_tmp      15G   387M  15G    3%    /tmp
/dev/mapper/sysvg-lv_log      15G   593M  15G    4%    /var/log
/dev/mapper/sysvg-lv_journal  10G   1.2G  8.9G   12%   /var/log/journal
/dev/mapper/sysvg-lv_audit    8.0G  512M  7.5G   7%    /var/log/audit

# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 63423
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 10240
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 63423
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: Didn't receive correction alerts

Post by benjaminsmith »

Hello,

There are two separate issues here. One is that you are not receiving notification emails. If you are getting other notifications from Nagios, this is most likely due to configuration settings.

The other is a system issue: you're not able to download a profile, and there are possible resource problems.

If you cannot get a profile, can you post the following for us to review?

1. Sudoers file

Code: Select all

cat /etc/sudoers
2. Permissions on the profile script

Code: Select all

ls -l /usr/local/nagiosxi/html/includes/components/profile/getprofile.sh
3. Post the output of the Apache error log.

Code: Select all

tail -n 50 /var/log/httpd/*error_log
4. The database log

Code: Select all

tail -n 50 /var/log/mariadb/mariadb.log
ITOMB_IMT
Posts: 181
Joined: Wed Oct 17, 2018 12:55 pm

Re: Didn't receive correction alerts

Post by ITOMB_IMT »

Please see the sent PM.

Thanks,
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Didn't receive correction alerts

Post by ssax »

If you're unable to generate the profile through the web interface, please try generating it from the command line by running these commands (as root):

Code: Select all

rm -rf /usr/local/nagiosxi/var/components/profile*
/usr/local/nagiosxi/scripts/components/getprofile.sh SUPPORT
Then send me the resulting /usr/local/nagiosxi/var/components/profile.zip file.

If the profile script fails, please include the ENTIRE output.

Additionally, please send the output of these commands (as root):
- NOTE: You may need to adjust the -h 127.0.0.1, the -uroot, and -pnagiosxi in the first command if your DB is offloaded to another server and/or you've changed the root mysql password

Code: Select all

echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --table
Then run this command:

Code: Select all

grep mysql /usr/local/nagiosxi/html/config.inc.php | wc -l
If it outputs the number 2, run the command below as well and include the output. If it outputs anything other than 2, don't run the command. (Some XI systems use both MySQL and PostgreSQL if they were installed prior to XI 5.0 and then upgraded from there.)

Code: Select all

echo "SELECT relname as Table, pg_size_pretty(pg_total_relation_size(relid)) As Size, pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as ExternalSize FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;" | psql nagiosxi nagiosxi
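As a rough sketch of the check described above (run the PostgreSQL query only when the grep count is exactly 2), the decision can be wrapped in a small shell function. The function name `needs_pgsql_query` is an assumption for illustration; `grep -c` counts matching lines, the same number that `grep ... | wc -l` reports.

```shell
#!/bin/sh
# Return success (0) only when the config file contains exactly two lines
# mentioning "mysql", which is the case where the psql query should be run.
# The function name is hypothetical; the default path is the one from this post.
needs_pgsql_query() {
    config="${1:-/usr/local/nagiosxi/html/config.inc.php}"
    [ -r "$config" ] || return 1          # unreadable config: do not run the query
    count=$(grep -c mysql "$config")      # same count as: grep mysql FILE | wc -l
    [ "$count" -eq 2 ]
}

# Typical use (not run here):
#   if needs_pgsql_query; then
#       echo "Count is 2: run the psql size query as well"
#   else
#       echo "Count is not 2: skip the psql size query"
#   fi
```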
Attach your /usr/local/nagiosxi/html/config.inc.php as well.

Include the output of these commands:

Code: Select all

ls -lh /usr/local/nagiosxi/var
ls -lh /usr/local/nagios/var
That should give us what we need.


Here's my email troubleshooting process:

Please go to Admin > Manage Email Settings:
- Make sure Logging is checked
- Click the Update Settings button
- Take a screenshot of the settings here and send it to me

Then run this tail command (and leave it running):

Code: Select all

tail -Fn0 /var/log/maillog /usr/local/nagiosxi/tmp/phpmailer.log /usr/local/nagiosxi/var/eventman.log
Then force a notification to be sent, and send me the full output of the tail command above along with any errors you see on the screen (please test exactly as described in the instructions below).

After you've done the steps above attach a FRESH copy of your profile.

Additionally, please include the full output of these commands:

Code: Select all

alternatives --display mta

Please test like this when doing your testing so the notification takes the actual path it will in the backend:

How to submit passive results for testing:

For Hosts
------------

Go to Home > Host Status:
- Find the Host and click on it
- Click the + tab
- Click the "Submit passive check result" link
- Select the Check Result and type in some text for the Check Output
- Click the Submit button

NOTE: By default, passive_host_checks_are_soft=0 is set in your /usr/local/nagios/etc/nagios.cfg. This differs from services: when you submit a passive host check result, the host immediately goes into a HARD state (and should send a notification if configured to) instead of going into a SOFT state like services do. (Notifications are only sent on HARD states.)
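To confirm which behavior applies on a given system, the setting can be read from nagios.cfg. This is a minimal sketch; the function name `passive_host_state_type` is an assumption for illustration, and a missing setting is treated as the default of 0.

```shell
#!/bin/sh
# Report whether passive host check results are treated as HARD or SOFT,
# based on passive_host_checks_are_soft in nagios.cfg.
# The function name is hypothetical; the default path is the one from this post.
passive_host_state_type() {
    cfg="${1:-/usr/local/nagios/etc/nagios.cfg}"
    # Take the last uncommented assignment and strip stray whitespace.
    value=$(sed -n 's/^[[:space:]]*passive_host_checks_are_soft[[:space:]]*=[[:space:]]*//p' "$cfg" \
        | tail -n 1 | tr -d ' \t\r')
    if [ "${value:-0}" = "1" ]; then
        echo SOFT
    else
        echo HARD
    fi
}

# Typical use (not run here):
#   passive_host_state_type
```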

For Services
----------------

Go to Home > Service Status:
- Find the Service and click on it
- Click the + tab
- Note these two rows:

State Type: Hard
Current Check: 1 of 4

Those rows show the current State Type and the Current Check number. To generate a notification for a service you will need to submit MULTIPLE problem check results. The number you need to submit is determined by the last number in the Current Check row, which is the max_check_attempts setting. For services, each passive check result that you submit will be a SOFT state until you submit enough to hit the Max Check Attempts setting defined on the service. Only then will the service enter a HARD problem state, which will generate the notification (just remember, notifications are only sent on HARD states).

- Click the "Submit passive check result" link
- Select the Check Result and type in some text for the Check Output
- Click the Submit button
- Submit as many as you need, right after another, until the service enters the HARD state so that a notification will be sent

NOTE: When coming from a HARD problem state (whether we are talking about hosts or services) if you submit an OK passive result it should fire off a recovery notification after a single passive result has been submitted.
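For reference, the service steps above can also be scripted through the Nagios Core external command PROCESS_SERVICE_CHECK_RESULT instead of clicking through the GUI. This is a hedged sketch, not the XI-recommended procedure: the function names are hypothetical, and the command file path in the usage comment is the usual default and may differ on your system.

```shell
#!/bin/sh
# Format one PROCESS_SERVICE_CHECK_RESULT external command line.
# Arguments: host name, service description,
#            return code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN), plugin output.
passive_service_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4"
}

# Append as many CRITICAL results as max_check_attempts requires, so the
# service moves from SOFT to HARD. Both function names are hypothetical.
# Arguments: command file, host name, service description, max_check_attempts.
submit_until_hard() {
    cmd_file="$1"; host="$2"; service="$3"; attempts="$4"
    i=1
    while [ "$i" -le "$attempts" ]; do
        passive_service_result "$host" "$service" 2 \
            "Test result $i of $attempts" >> "$cmd_file"
        i=$((i + 1))
    done
}

# Typical use (not run here; path and names are assumptions):
#   submit_until_hard /usr/local/nagios/var/rw/nagios.cmd client_machine Ping 4
```

A single result with return code 0 afterwards should then trigger the recovery notification, as described in the note above.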

Including these docs for your reference as well:

Code: Select all

https://assets.nagios.com/downloads/nagiosxi/docs/Configuring-Email-And-Text-Notifications-in-Nagios-XI.pdf

https://assets.nagios.com/downloads/nagiosxi/docs/Understanding-Email-Sending-In-Nagios-XI.pdf

https://assets.nagios.com/downloads/nagiosxi/docs/Understanding-Nagios-XI-Users-And-Contacts.pdf

https://assets.nagios.com/downloads/nagiosxi/docs/Configuring-Core-Contacts-to-Use-Xi's-PHP-Mailer.pdf