Scheduling very unstable - Part 2

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Post by rajasegar »

Previous Thread
https://support.nagios.com/forum/viewto ... 9&start=20
Capture.JPG
Same problem again. Last time we solved a similar issue on another instance by no longer offloading the DB.
We can't do that on this instance, as its DB is not offloaded.

The system profile is attached:
profile.zip

Code:

[nagios@nagiosprodxi3 ~]$ /usr/local/nagios/bin/nagiostats

Nagios Stats 4.2.4
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 12-07-2016
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /var/nagiosramdisk/status.dat
Status File Age:                        0d 0h 0m 10s
Status File Version:                    4.2.4

Program Running Time:                   0d 2h 0m 2s
Nagios PID:                             15031

Total Services:                         16140
Services Checked:                       16140
Services Scheduled:                     16140
Services Actively Checked:              16140
Services Passively Checked:             0
Total Service State Change:             0.000 / 71.840 / 0.353 %
Active Service Latency:                 0.000 / 1.090 / 0.014 sec
Active Service Execution Time:          0.009 / 60.026 / 11.391 sec
Active Service State Change:            0.000 / 71.840 / 0.353 %
Active Services Last 1/5/15/60 min:     1456 / 9702 / 14507 / 16110
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              13269 / 7 / 2564 / 300
Services Flapping:                      99
Services In Downtime:                   0

Total Hosts:                            2998
Hosts Checked:                          2998
Hosts Scheduled:                        2998
Hosts Actively Checked:                 2998
Host Passively Checked:                 0
Total Host State Change:                0.000 / 13.750 / 0.082 %
Active Host Latency:                    0.000 / 1.113 / 0.013 sec
Active Host Execution Time:             4.032 / 28.313 / 6.700 sec
Active Host State Change:               0.000 / 13.750 / 0.082 %
Active Hosts Last 1/5/15/60 min:        487 / 2956 / 2998 / 2998
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  2508 / 490 / 0
Hosts Flapping:                         3
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     738 / 4045 / 12363
   Scheduled:                           735 / 4024 / 12292
   On-demand:                           3 / 21 / 71
   Parallel:                            735 / 4024 / 12292
   Serial:                              0 / 0 / 0
   Cached:                              3 / 21 / 71
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  2037 / 10459 / 33266
   Scheduled:                           2037 / 10459 / 33266
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0
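
One figure in the output above is easy to track over time: the share of services in the UNKNOWN state, taken from the "Services Ok/Warn/Unk/Crit" line. The following is a minimal sketch, assuming the human-readable nagiostats layout shown above; the function name `unknown_pct` is made up for illustration and is not part of Nagios.

```shell
# Read nagiostats output on stdin and print the percentage of services
# that are in the UNKNOWN state (third value on the Ok/Warn/Unk/Crit line).
# Assumes the line format shown in the output above.
unknown_pct() {
  awk -F':' '/^Services Ok\/Warn\/Unk\/Crit:/ {
    split($2, v, "/")                      # v[1]=Ok v[2]=Warn v[3]=Unk v[4]=Crit
    total = v[1] + v[2] + v[3] + v[4]
    if (total > 0) printf "%.1f\n", 100 * v[3] / total
  }'
}
```

It can be fed directly, for example `/usr/local/nagios/bin/nagiostats | unknown_pct`. For the figures above (2564 UNKNOWN out of 16140 services) it prints 15.9.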

5 x Nagios 5.6.9 Enterprise Edition
RHEL 6 & 7
rrdcached & ramdisk optimisation
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Post by scottwilkerson »

Let's run the following:

Code:

echo "vacuum;vacuum analyse;vacuum full;"|psql nagiosxi postgres
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart

Post by rajasegar »

scottwilkerson wrote: Let's run the following:

Code:

echo "vacuum;vacuum analyse;vacuum full;"|psql nagiosxi postgres

Finished very fast, in about 2 seconds.

Code:

[nagios@nagiosprodxi3 ~]$ echo "vacuum;vacuum analyse;vacuum full;"|psql nagiosxi postgres
VACUUM
VACUUM
VACUUM
The queue rate was high and healthy right after restarting the services, but then it gradually went down.

After 6 minutes:
Capture1.JPG
After 36 minutes:
Capture2.JPG
After about 1 hour it is back to blank:
Capture3.JPG
Any other suggestions?

Post by scottwilkerson »

Can you send a current profile?

Also, let's run the following:

Code:

echo "select count(*) from xi_events where status_code !=0"|psql nagiosxi nagiosxi

Post by rajasegar »

scottwilkerson wrote: Can you send a current profile?
profile (1).zip
Also, let's run the following:

Code:

echo "select count(*) from xi_events where status_code !=0"|psql nagiosxi nagiosxi

Code:

[nagios@nagiosprodxi3 ~]$ echo "select count(*) from xi_events where status_code !=0"|psql nagiosxi nagiosxi
 count
-------
  6959
(1 row)
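
A single count is only a snapshot. To see whether the number grows as the scheduling degrades, the same query can be sampled at intervals. The following is a rough sketch, assuming the psql output layout shown above; the helper names `extract_count` and `sample_backlog` and the 300-second default interval are arbitrary choices for illustration.

```shell
# Pull the bare number out of psql's tabular output
# (header line, rule, value line, "(1 row)" footer).
extract_count() {
  awk '/^[[:space:]]*[0-9]+[[:space:]]*$/ { print $1; exit }'
}

# Print a timestamped sample every INTERVAL seconds (default 300)
# until interrupted with Ctrl-C. Uses the same query as above.
sample_backlog() {
  interval="${1:-300}"
  while true; do
    n=$(echo "select count(*) from xi_events where status_code !=0" \
          | psql nagiosxi nagiosxi | extract_count)
    printf '%s %s\n' "$(date '+%F %T')" "$n"
    sleep "$interval"
  done
}
```

Running `sample_backlog 60 >> /tmp/xi_events_count.log` (the log path is only an example) right after a restart would show how the count moves while the queue rate drops.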

Post by scottwilkerson »

Could you create a new system profile at the time this happens again, before restarting any services?

Post by rajasegar »

scottwilkerson wrote: Could you create a new system profile at the time this happens again, before restarting any services?
The profile I posted earlier was taken before the restart.

Anyway, this morning it looks back to normal. This is very frustrating, as it keeps happening.
Capture.JPG

Code:

Last login: Thu Jul 26 11:23:46 2018 from 172.29.2.75
[nagios@nagiosprodxi3 ~]$  echo "select count(*) from xi_events where status_code !=0"|psql nagiosxi nagiosxi
 count
-------
  4346
(1 row)

Last edited by rajasegar on Sun Jul 29, 2018 7:06 pm, edited 1 time in total.

Post by scottwilkerson »

Looking at the profile again, it appears ndo2db could not write a critical piece of information because the database connection was lost before the write. I'm not sure whether this happened during a system shutdown or for some other reason, but it could have been part of the cause.

Code:

Jul 24 12:51:31 nagiosprodxi3 ndo2db: Error: mysql_query() failed for 'UPDATE nagios_conninfo SET disconnect_time=NOW(), last_checkin_time=NOW(), data_end_time=FROM_UNIXTIME(0), bytes_processed='0', lines_processed='0', entries_processed='0' WHERE conninfo_id='0''
Jul 24 12:51:31 nagiosprodxi3 ndo2db: mysql_error: 'MySQL server has gone away'
Jul 24 12:51:31 nagiosprodxi3 ndo2db: Error: Connection to MySQL database has been lost!
Jul 24 12:51:31 nagiosprodxi3 rrdcached[1289]: caught SIGTERM
Jul 24 12:51:31 nagiosprodxi3 rrdcached[1289]: starting shutdown
Jul 24 12:51:33 nagiosprodxi3 rrdcached[1289]: clean shutdown; all RRDs flushed

Post by rajasegar »

scottwilkerson wrote: Looking at the profile again, it appears ndo2db could not write a critical piece of information because the database connection was lost before the write. I'm not sure whether this happened during a system shutdown or for some other reason, but it could have been part of the cause.

That looks like a normal shutdown, as we included the mysqld service shutdown in the script.

Post by rajasegar »

It is dead again. Please see the attached profile, taken before the restart.
profile.zip
Locked