Scheduling very unstable - Part 2

Post by **rajasegar** » Tue Jul 24, 2018 1:48 am

Previous Thread
https://support.nagios.com/forum/viewto ... 9&start=20

Capture.JPG

Same problem again. Last time we solved a similar issue on another instance by no longer offloading the DB.
We can't do that on this instance, as its DB is not offloaded.

Attached is the system profile.

profile.zip

Code: Select all

[nagios@nagiosprodxi3 ~]$ /usr/local/nagios/bin/nagiostats

Nagios Stats 4.2.4
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 12-07-2016
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /var/nagiosramdisk/status.dat
Status File Age:                        0d 0h 0m 10s
Status File Version:                    4.2.4

Program Running Time:                   0d 2h 0m 2s
Nagios PID:                             15031

Total Services:                         16140
Services Checked:                       16140
Services Scheduled:                     16140
Services Actively Checked:              16140
Services Passively Checked:             0
Total Service State Change:             0.000 / 71.840 / 0.353 %
Active Service Latency:                 0.000 / 1.090 / 0.014 sec
Active Service Execution Time:          0.009 / 60.026 / 11.391 sec
Active Service State Change:            0.000 / 71.840 / 0.353 %
Active Services Last 1/5/15/60 min:     1456 / 9702 / 14507 / 16110
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              13269 / 7 / 2564 / 300
Services Flapping:                      99
Services In Downtime:                   0

Total Hosts:                            2998
Hosts Checked:                          2998
Hosts Scheduled:                        2998
Hosts Actively Checked:                 2998
Host Passively Checked:                 0
Total Host State Change:                0.000 / 13.750 / 0.082 %
Active Host Latency:                    0.000 / 1.113 / 0.013 sec
Active Host Execution Time:             4.032 / 28.313 / 6.700 sec
Active Host State Change:               0.000 / 13.750 / 0.082 %
Active Hosts Last 1/5/15/60 min:        487 / 2956 / 2998 / 2998
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  2508 / 490 / 0
Hosts Flapping:                         3
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     738 / 4045 / 12363
   Scheduled:                           735 / 4024 / 12292
   On-demand:                           3 / 21 / 71
   Parallel:                            735 / 4024 / 12292
   Serial:                              0 / 0 / 0
   Cached:                              3 / 21 / 71
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  2037 / 10459 / 33266
   Scheduled:                           2037 / 10459 / 33266
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0
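
As a rough way to read the figures above, the average number of service checks in flight can be estimated as (checks started per second) × (average execution time), which is Little's law. The sketch below is not from the thread: `estimate_concurrency` is a hypothetical helper that parses the human-readable nagiostats output, and it assumes the line labels match the 4.2.4 output shown above.

```shell
# estimate_concurrency: read nagiostats output on stdin and print the
# estimated average number of service checks running at once.
# Hypothetical helper; label text is assumed to match Nagios 4.2.4.
estimate_concurrency() {
  awk -F': +' '
    /^Active Service Execution Time:/ {
      split($2, t, " / ")        # min / max / avg
      sub(/ sec$/, "", t[3])     # drop the unit from the average
      avg = t[3]
    }
    /^Active Service Checks Last 1\/5\/15 min:/ {
      split($2, c, " / ")        # 1 min / 5 min / 15 min
      per_min = c[1]
    }
    END { printf "%.1f\n", per_min / 60 * avg }
  '
}

# Sample input: the two relevant lines copied from the output above.
estimate_concurrency <<'EOF'
Active Service Execution Time:          0.009 / 60.026 / 11.391 sec
Active Service Checks Last 1/5/15 min:  2037 / 10459 / 33266
EOF
# prints 386.7
```

With the numbers in this post the estimate is roughly 387 concurrent service checks, so the 11 s average execution time, rather than the latency, looks like the figure worth investigating. On a live system the same function could be fed with `/usr/local/nagios/bin/nagiostats | estimate_concurrency`.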

Post by **scottwilkerson** » Tue Jul 24, 2018 7:28 am

Let's run the following:

Code: Select all

echo "vacuum;vacuum analyse;vacuum full;"|psql nagiosxi postgres

Post by **rajasegar** » Tue Jul 24, 2018 7:46 pm

It finished very fast, in about 2 seconds.

Code: Select all

[nagios@nagiosprodxi3 ~]$ echo "vacuum;vacuum analyse;vacuum full;"|psql nagiosxi postgres
VACUUM
VACUUM
VACUUM

The queue rate was very high (which is good) right after restarting the services, but it then dropped gradually.

After 6 minutes

Capture1.JPG

After 36 minutes

Capture2.JPG

After about 1 hour it is back to blank.

Capture3.JPG

Any other suggestions?

Post by **scottwilkerson** » Wed Jul 25, 2018 8:31 am

Can you send a current profile?

Also, let's run the following:

Code: Select all

echo "select count(*) from xi_events where status_code !=0"|psql nagiosxi nagiosxi

Post by **rajasegar** » Wed Jul 25, 2018 6:31 pm

profile (1).zip

Code: Select all

[nagios@nagiosprodxi3 ~]$ echo "select count(*) from xi_events where status_code !=0"|psql nagiosxi nagiosxi
 count
-------
  6959
(1 row)

Post by **scottwilkerson** » Thu Jul 26, 2018 7:29 am

Could you create a new system profile at the time this happens again, before restarting any services?

Post by **rajasegar** » Thu Jul 26, 2018 6:35 pm

The profile I posted earlier was before the restart.

Anyway, this morning it looks back to normal. This is very frustrating, as it keeps happening.

Capture.JPG

Code: Select all

Last login: Thu Jul 26 11:23:46 2018 from 172.29.2.75
[nagios@nagiosprodxi3 ~]$  echo "select count(*) from xi_events where status_code !=0"|psql nagiosxi nagiosxi
 count
-------
  4346
(1 row)
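
The two counts above (6959, then 4346) are single snapshots. To see whether this number climbs as the scheduling degrades, it could be sampled at a fixed interval. The following is only a sketch: `sample` is a hypothetical helper, the log path is an assumption, and treating this count as a backlog indicator is inferred from the thread rather than confirmed.

```shell
# sample: run the given command (expected to print one number) and emit
# "<epoch seconds> <count>" so successive samples can be compared.
# Hypothetical helper, not part of Nagios XI.
sample() {
  count=$("$@" | tr -d '[:space:]')
  printf '%s %s\n' "$(date +%s)" "$count"
}

# Stand-in command so the sketch runs without a database:
sample echo 4346

# Assumed use on the XI server, one sample per minute
# (psql -A -t prints just the bare value):
# while sleep 60; do
#   sample psql -At -c "select count(*) from xi_events where status_code != 0" nagiosxi nagiosxi
# done >> /tmp/xi_events_backlog.log
```

A steadily rising count in the log during the hour after a restart would point at event processing falling behind; a flat count would suggest looking elsewhere.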

Post by **scottwilkerson** » Fri Jul 27, 2018 7:28 am

Looking at the profile again, it looks like ndo2db wasn't able to write a critical piece of information because the database connection was lost before the write. I'm not sure whether this happened during a system shutdown or for some other reason, but it could have been part of the cause.

Code: Select all

Jul 24 12:51:31 nagiosprodxi3 ndo2db: Error: mysql_query() failed for 'UPDATE nagios_conninfo SET disconnect_time=NOW(), last_checkin_time=NOW(), data_end_time=FROM_UNIXTIME(0), bytes_processed='0', lines_processed='0', entries_processed='0' WHERE conninfo_id='0''
Jul 24 12:51:31 nagiosprodxi3 ndo2db: mysql_error: 'MySQL server has gone away'
Jul 24 12:51:31 nagiosprodxi3 ndo2db: Error: Connection to MySQL database has been lost!
Jul 24 12:51:31 nagiosprodxi3 rrdcached[1289]: caught SIGTERM
Jul 24 12:51:31 nagiosprodxi3 rrdcached[1289]: starting shutdown
Jul 24 12:51:33 nagiosprodxi3 rrdcached[1289]: clean shutdown; all RRDs flushed

Post by **rajasegar** » Sun Jul 29, 2018 7:07 pm

That looks like a normal shutdown, as we included stopping the mysqld service in the script.
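
One way to check that reading is to list ndo2db database errors next to shutdown markers and see whether every disconnect shares a timestamp with a SIGTERM. This is a minimal sketch: `ndo2db_disconnects` is a hypothetical filter, and it assumes syslog lines in the same format as the excerpt quoted in this thread.

```shell
# ndo2db_disconnects: read syslog lines on stdin and print
# "<month> <day> <time> <process> <label>" for ndo2db errors and SIGTERMs.
# Hypothetical helper; the patterns are taken from the log lines in the thread.
ndo2db_disconnects() {
  grep -E 'ndo2db: (Error:|mysql_error:)|caught SIGTERM' \
    | awk '{ print $1, $2, $3, $5, ($0 ~ /SIGTERM/ ? "shutdown" : "db-error") }'
}

# Sample input: three of the lines from the profile.
ndo2db_disconnects <<'EOF'
Jul 24 12:51:31 nagiosprodxi3 ndo2db: mysql_error: 'MySQL server has gone away'
Jul 24 12:51:31 nagiosprodxi3 ndo2db: Error: Connection to MySQL database has been lost!
Jul 24 12:51:31 nagiosprodxi3 rrdcached[1289]: caught SIGTERM
EOF
```

Any `db-error` line without a `shutdown` line at the same second would be a disconnect that the restart script does not explain. On the server the input would come from the system log, whose location depends on the distribution.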

Post by **rajasegar** » Mon Jul 30, 2018 1:27 am

It is dead again. Please see the profile before the restart.

profile.zip
