Page 4 of 5

Re: Multiple (40+) poller cron jobs running

Posted: Mon May 11, 2015 7:28 am
by GhostRider2110
Here is from httpd/access_log and httpd/error_log

Code: Select all

==> error_log <==
[Sun May 10 03:48:02 2015] [notice] Digest: generating secret for digest authentication ...
[Sun May 10 03:48:02 2015] [notice] Digest: done
[Sun May 10 03:48:02 2015] [notice] Apache/2.2.15 (Unix) DAV/2 PHP/5.3.3 configured -- resuming normal operations

==> access_log <==
10.100.52.154 - - [11/May/2015:07:36:28 -0400] "POST /nagioslogserver/index.php/api/system/status HTTP/1.1" 200 82 "http://iganagioslog.iga.local/nagioslogserver/index.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36"
10.100.52.154 - - [11/May/2015:07:37:28 -0400] "POST /nagioslogserver/index.php/api/system/status HTTP/1.1" 200 82 "http://iganagioslog.iga.local/nagioslogserver/index.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36"
10.100.52.154 - - [11/May/2015:07:44:44 -0400] "POST /nagioslogserver/index.php/api/system/status HTTP/1.1" 200 82 "http://iganagioslog.iga.local/nagioslogserver/index.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36"
10.100.30.8 - - [11/May/2015:07:38:31 -0400] "POST /nagioslogserver/index.php/api/check/query HTTP/1.1" 200 79 "-" "BinGet/1.00.A (http://www.bin-co.com/php/scripts/load/)"
10.100.30.8 - - [11/May/2015:07:44:28 -0400] "POST /nagioslogserver/index.php/api/check/query HTTP/1.1" 200 95 "-" "BinGet/1.00.A (http://www.bin-co.com/php/scripts/load/)"
10.100.52.154 - - [11/May/2015:07:43:11 -0400] "POST /nagioslogserver/index.php/api/system/status HTTP/1.1" 200 87 "http://iganagioslog.iga.local/nagioslogserver/index.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36"
10.100.52.154 - - [11/May/2015:07:53:03 -0400] "GET /nagioslogserver/index.php/dashboard HTTP/1.1" 302 - "http://iganagioslog.iga.local/nagioslogserver/index.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36"
10.100.52.154 - - [11/May/2015:07:42:21 -0400] "POST /nagioslogserver/index.php/api/system/status HTTP/1.1" 200 82 "http://iganagioslog.iga.local/nagioslogserver/index.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36"
10.100.30.8 - - [11/May/2015:08:14:19 -0400] "POST /nagioslogserver/index.php/api/check/query HTTP/1.1" 200 73 "-" "BinGet/1.00.A (http://www.bin-co.com/php/scripts/load/)"
10.100.52.154 - - [11/May/2015:08:04:45 -0400] "POST /nagioslogserver/index.php/api/system/status HTTP/1.1" 200 87 "http://iganagioslog.iga.local/nagioslogserver/index.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36"
10.100.30.8 - - [11/May/2015:07:39:29 -0400] "POST /nagioslogserver/index.php/api/check/query HTTP/1.1" 200 95 "-" "BinGet/1.00.A (http://www.bin-co.com/php/scripts/load/)"
::1 - - [11/May/2015:08:17:57 -0400] "OPTIONS * HTTP/1.0" 200 - "-" "Apache/2.2.15 (CentOS) (internal dummy connection)"
10.100.30.8 - - [11/May/2015:07:49:26 -0400] "POST /nagioslogserver/index.php/api/check/query HTTP/1.1" 200 95 "-" "BinGet/1.00.A (http://www.bin-co.com/php/scripts/load/)"
10.100.30.8 - - [11/May/2015:07:59:23 -0400] "POST /nagioslogserver/index.php/api/check/query HTTP/1.1" 200 95 "-" "BinGet/1.00.A (http://www.bin-co.com/php/scripts/load/)"
10.100.30.8 - - [11/May/2015:08:09:20 -0400] "POST /nagioslogserver/index.php/api/check/query HTTP/1.1" 200 95 "-" "BinGet/1.00.A (http://www.bin-co.com/php/scripts/load/)"
10.100.30.8 - - [11/May/2015:08:04:21 -0400] "POST /nagioslogserver/index.php/api/check/query HTTP/1.1" 200 95 "-" "BinGet/1.00.A (http://www.bin-co.com/php/scripts/load/)"
10.100.30.8 - - [11/May/2015:07:54:25 -0400] "POST /nagioslogserver/index.php/api/check/query HTTP/1.1" 200 95 "-" "BinGet/1.00.A (http://www.bin-co.com/php/scripts/load/)"
::1 - - [11/May/2015:08:18:02 -0400] "OPTIONS * HTTP/1.0" 200 - "-" "Apache/2.2.15 (CentOS) (internal dummy connection)"
::1 - - [11/May/2015:08:18:03 -0400] "OPTIONS * HTTP/1.0" 200 - "-" "Apache/2.2.15 (CentOS) (internal dummy connection)"
::1 - - [11/May/2015:08:18:04 -0400] "OPTIONS * HTTP/1.0" 200 - "-" "Apache/2.2.15 (CentOS) (internal dummy connection)"
::1 - - [11/May/2015:08:18:05 -0400] "OPTIONS * HTTP/1.0" 200 - "-" "Apache/2.2.15 (CentOS) (internal dummy connection)"
::1 - - [11/May/2015:08:18:06 -0400] "OPTIONS * HTTP/1.0" 200 - "-" "Apache/2.2.15 (CentOS) (internal dummy connection)"
10.100.30.8 - - [11/May/2015:08:19:15 -0400] "POST /nagioslogserver/index.php/api/check/query HTTP/1.1" 200 95 "-" "BinGet/1.00.A (http://www.bin-co.com/php/scripts/load/)"
One note: we have now noticed that whenever we try to bring up the Dashboard installed from here:

http://exchange.nagios.org/directory/Ad ... ng/details

The hung poller/job process problem starts. I ran into it on Friday, but was able to kill off the poller jobs and things seemed to catch up.
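For reference, this is roughly how I hunted down and cleared the stuck jobs. The 'poller' pattern below is a guess at how the processes show up in ps; verify it on your own box before killing anything:

```shell
# Show the Log Server poller jobs with their elapsed run time.
# The bracketed [p] keeps grep from matching its own command line.
ps -eo pid,etime,cmd | grep '[p]oller'

# Once the pattern is confirmed, kill the long-runners, e.g.:
# pkill -f 'nagioslogserver.*poller'
```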

See-ya
Mitch

Re: Multiple (40+) poller cron jobs running

Posted: Mon May 11, 2015 8:26 am
by GhostRider2110
I was finally able to get a response back from the web interface, log in, and restart elasticsearch from the System Status page. That cleared up the old poller/job processes. --Mitch

Re: Multiple (40+) poller cron jobs running

Posted: Mon May 11, 2015 12:44 pm
by jolson
crond dead but subsys locked
This indicates a crash of crond; the question is why. Is there anything notable in the cron logs?

Code: Select all

cat /var/log/cron
grep -i cron /var/log/messages
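If crond keeps dying, it can also help to confirm its current state and whether a stale lock file was left behind. A quick sketch, assuming the CentOS 6 sysvinit paths:

```shell
# Is crond running, or did it die and leave its subsys lock behind?
service crond status 2>/dev/null || echo "crond not reporting as running"
ls -l /var/lock/subsys/crond 2>/dev/null

# Any crond-related messages in syslog?
grep -i 'crond' /var/log/messages 2>/dev/null | tail -n 20
```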
One note: we have now noticed that whenever we try to bring up the Dashboard installed from here:

http://exchange.nagios.org/directory/Ad ... ng/details

The hung poller/job process problem starts. I ran into it on Friday, but was able to kill off the poller jobs and things seemed to catch up.
Is this reproducible? Fingers crossed that it is.

My guess at this point is that your crond process crashed when there were too many poller jobs happening at once.

One last thing I haven't had you check is your sudoers file - I expect that it's fine, but just to be sure:

Code: Select all

cat /etc/sudoers

Re: Multiple (40+) poller cron jobs running

Posted: Tue May 12, 2015 7:22 am
by GhostRider2110
There is nothing in the cron or messages logs. You can see things running, and where I restarted cron.

It seems to be related to the searches we are doing. We have a situation where we are trying to find who was logged into the VPN at a certain time. But as we change the filters and/or queries, things slow down to a crawl and we see the poller processes start delaying on exit. In addition, we are seeing inconsistent results. We know a message from the ASA has a username in it, and one search finds that, but when we try to refine the search, that message no longer shows the username. Much of this is probably our lack of understanding of the query process and language. Part of the problem may just be the huge number of entries.
ClusterStats-051215.png
It's very hard to explain; it would be easy to show you. :)

Here is the sudoers:

Code: Select all

[root@IGAnagioslog var]# cat /etc/sudoers
## Sudoers allows particular users to run various commands as
## the root user, without needing the root password.
##
## Examples are provided at the bottom of the file for collections
## of related commands, which can then be delegated out to particular
## users or groups.
## 
## This file must be edited with the 'visudo' command.

## Host Aliases
## Groups of machines. You may prefer to use hostnames (perhaps using 
## wildcards for entire domains) or IP addresses instead.
# Host_Alias     FILESERVERS = fs1, fs2
# Host_Alias     MAILSERVERS = smtp, smtp2

## User Aliases
## These aren't often necessary, as you can use regular groups
## (ie, from files, LDAP, NIS, etc) in this file - just use %groupname 
## rather than USERALIAS
# User_Alias ADMINS = jsmith, mikem


## Command Aliases
## These are groups of related commands...

## Networking
# Cmnd_Alias NETWORKING = /sbin/route, /sbin/ifconfig, /bin/ping, /sbin/dhclient, /usr/bin/net, /sbin/iptables, /usr/bin/rfcomm, /usr/bin/wvdial, /sbin/iwconfig, /sbin/mii-tool

## Installation and management of software
# Cmnd_Alias SOFTWARE = /bin/rpm, /usr/bin/up2date, /usr/bin/yum

## Services
# Cmnd_Alias SERVICES = /sbin/service, /sbin/chkconfig

## Updating the locate database
# Cmnd_Alias LOCATE = /usr/bin/updatedb

## Storage
# Cmnd_Alias STORAGE = /sbin/fdisk, /sbin/sfdisk, /sbin/parted, /sbin/partprobe, /bin/mount, /bin/umount

## Delegating permissions
# Cmnd_Alias DELEGATING = /usr/sbin/visudo, /bin/chown, /bin/chmod, /bin/chgrp 

## Processes
# Cmnd_Alias PROCESSES = /bin/nice, /bin/kill, /usr/bin/kill, /usr/bin/killall

## Drivers
# Cmnd_Alias DRIVERS = /sbin/modprobe

# Defaults specification

#
# Disable "ssh hostname sudo <cmd>", because it will show the password in clear. 
#         You have to run "ssh -t hostname sudo <cmd>".
#
####Defaults    requiretty

#
# Refuse to run if unable to disable echo on the tty. This setting should also be
# changed in order to be able to use sudo without a tty. See requiretty above.
#
Defaults   !visiblepw

#
# Preserving HOME has security implications since many programs
# use it when searching for configuration files. Note that HOME
# is already set when the the env_reset option is enabled, so
# this option is only effective for configurations where either
# env_reset is disabled or HOME is present in the env_keep list.
#
Defaults    always_set_home

Defaults    env_reset
Defaults    env_keep =  "COLORS DISPLAY HOSTNAME HISTSIZE INPUTRC KDEDIR LS_COLORS"
Defaults    env_keep += "MAIL PS1 PS2 QTDIR USERNAME LANG LC_ADDRESS LC_CTYPE"
Defaults    env_keep += "LC_COLLATE LC_IDENTIFICATION LC_MEASUREMENT LC_MESSAGES"
Defaults    env_keep += "LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE"
Defaults    env_keep += "LC_TIME LC_ALL LANGUAGE LINGUAS _XKB_CHARSET XAUTHORITY"

#
# Adding HOME to env_keep may enable a user to run unrestricted
# commands via sudo.
#
# Defaults   env_keep += "HOME"

Defaults    secure_path = /sbin:/bin:/usr/sbin:/usr/bin

## Next comes the main part: which users can run what software on 
## which machines (the sudoers file can be shared between multiple
## systems).
## Syntax:
##
## 	user	MACHINE=COMMANDS
##
## The COMMANDS section may have other options added to it.
##
## Allow root to run any commands anywhere 
root	ALL=(ALL) 	ALL

## Allows members of the 'sys' group to run networking, software, 
## service management apps and more.
# %sys ALL = NETWORKING, SOFTWARE, SERVICES, STORAGE, DELEGATING, PROCESSES, LOCATE, DRIVERS

## Allows people in group wheel to run all commands
# %wheel	ALL=(ALL)	ALL

## Same thing without a password
# %wheel	ALL=(ALL)	NOPASSWD: ALL

## Allows members of the users group to mount and unmount the 
## cdrom as root
# %users  ALL=/sbin/mount /mnt/cdrom, /sbin/umount /mnt/cdrom

## Allows members of the users group to shutdown this system
# %users  localhost=/sbin/shutdown -h now

## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

# NEEDED TO ALLOW NAGIOS TO CHECK SERVICE STATUS
Defaults:nagios !requiretty
nagios ALL=NOPASSWD: /usr/local/nagios/libexec/check_init_service

# ASTERISK-SPECIFIC CHECKS
# NOTE: You can uncomment the following line if you are monitoring Asterisk locally
#nagios ALL=NOPASSWD: /usr/local/nagios/libexec/check_asterisk_sip_peers.sh, /usr/local/nagios/libexec/nagisk.pl, /usr/sbin/asterisk

User_Alias      NAGIOSLOGSERVER=nagios
User_Alias      NAGIOSLOGSERVERWEB=apache
NAGIOSLOGSERVER ALL = NOPASSWD:/etc/init.d/logstash start
NAGIOSLOGSERVER ALL = NOPASSWD:/etc/init.d/logstash stop
NAGIOSLOGSERVER ALL = NOPASSWD:/etc/init.d/logstash restart
NAGIOSLOGSERVER ALL = NOPASSWD:/etc/init.d/logstash reload
NAGIOSLOGSERVER ALL = NOPASSWD:/etc/init.d/logstash status
NAGIOSLOGSERVER ALL = NOPASSWD:/etc/init.d/elasticsearch start
NAGIOSLOGSERVER ALL = NOPASSWD:/etc/init.d/elasticsearch stop
NAGIOSLOGSERVER ALL = NOPASSWD:/etc/init.d/elasticsearch restart
NAGIOSLOGSERVER ALL = NOPASSWD:/etc/init.d/elasticsearch reload
NAGIOSLOGSERVER ALL = NOPASSWD:/etc/init.d/elasticsearch status
NAGIOSLOGSERVER ALL = NOPASSWD:/usr/local/nagioslogserver/scripts/change_timezone.sh
NAGIOSLOGSERVERWEB ALL = NOPASSWD:/etc/init.d/logstash start
NAGIOSLOGSERVERWEB ALL = NOPASSWD:/etc/init.d/logstash stop
NAGIOSLOGSERVERWEB ALL = NOPASSWD:/etc/init.d/logstash restart
NAGIOSLOGSERVERWEB ALL = NOPASSWD:/etc/init.d/logstash reload
NAGIOSLOGSERVERWEB ALL = NOPASSWD:/etc/init.d/logstash status
NAGIOSLOGSERVERWEB ALL = NOPASSWD:/etc/init.d/elasticsearch start
NAGIOSLOGSERVERWEB ALL = NOPASSWD:/etc/init.d/elasticsearch stop
NAGIOSLOGSERVERWEB ALL = NOPASSWD:/etc/init.d/elasticsearch restart
NAGIOSLOGSERVERWEB ALL = NOPASSWD:/etc/init.d/elasticsearch reload
NAGIOSLOGSERVERWEB ALL = NOPASSWD:/etc/init.d/elasticsearch status
NAGIOSLOGSERVERWEB ALL = NOPASSWD:/usr/local/nagioslogserver/scripts/get_logstash_ports.sh
See-ya
Mitch

Re: Multiple (40+) poller cron jobs running

Posted: Tue May 12, 2015 7:40 am
by GhostRider2110
Currently we have about 5 poller jobs showing, but it seems like some finish and others start, since the times span about a 6-minute interval. As long as we don't run a query, it stays at 6-7.
Screenshot1.png
Screenshot2.png

Re: Multiple (40+) poller cron jobs running

Posted: Tue May 12, 2015 9:39 am
by jolson
Mitch,

How about we move this to a ticket and get a remote session going? If I could have you email [email protected] with a reference to this thread, we could get a session setup for sometime this week. Thanks!

Re: Multiple (40+) poller cron jobs running

Posted: Tue May 12, 2015 9:44 am
by GhostRider2110
Here is an example of the strangeness we are seeing. I do believe some of this is related to not understanding how the filtering works in the global configuration.

Here are some screenshots of searches we were doing, searching for a username during a short time period in the ASA logs. It seems that if we use a really short time window for a search, things keep flowing.

Search for users name for VPN session
nls-gensearch1.png
Expanding, just to make sure we are seeing the complete message.
nls-gensearch1-message.png
Searching now for the IP address given.
nls-gensearch2.png
(Continued in next post, to get around 3 image limit)

Re: Multiple (40+) poller cron jobs running

Posted: Tue May 12, 2015 9:50 am
by GhostRider2110
Expanding to see full message text.
nls-gensearch2-message.png
Notice the message text is different. When we searched for the username, we got the Group User returned from the ASA-4-722051 log message. Yet when I run the query using the IPv4 address given in the previous query, the Group User field is empty. The timestamps are the same: 2015-05-12T09:30:20.074-04:00.

See-ya
Mitch

Re: Multiple (40+) poller cron jobs running

Posted: Tue May 12, 2015 10:44 am
by tmcdonald
Ticket received, so I will be locking this thread for the time-being.

Re: Multiple (40+) poller cron jobs running

Posted: Wed May 13, 2015 3:30 pm
by GhostRider2110
We had a very productive online session to help fix the issues noted.

First, the issue with fields not showing up in queries. From the tech:

The missing 'User' and 'Group' values were present in the 'raw' data; however, Kibana was stripping them. This is likely because Kibana thought they were part of the page's HTML - the developer I spoke with said that this is being worked on. For now, the workaround is to click 'raw'. Queries for specific names/groups seem to pull up the missing information.
Second, the original issue: poller jobs getting stuck.

We added the processors: N config parameter to the /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml file, starting at 4 (we have 6 processors allocated) and ending at 6. We noticed a more balanced use of the CPUs. The ES_HEAP_SIZE variable in /etc/sysconfig/elasticsearch was set to 2GB, while the amount of installed RAM in the system is 8GB. We adjusted this value to the recommended setting (50% of available RAM, i.e. 4GB).

After these two changes we saw a dramatic improvement in query speed. The poller processes would also finish up faster and go away as they should. We will keep monitoring the system, but the initial tests that used to bring the system to a halt didn't slow it down much at all. Thanks for all the help; I would call this closed.
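For anyone else tuning a similar box, the two changes amount to roughly the following config fragments. The values here fit our 6-CPU / 8GB VM; scale them to your own hardware:

```
## /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml
## Cap Elasticsearch's thread-pool sizing at the CPUs actually allocated:
processors: 6

## /etc/sysconfig/elasticsearch
## Heap at ~50% of installed RAM (8GB here, so 4GB):
ES_HEAP_SIZE=4g
```

Restart elasticsearch after editing either file so the new settings take effect.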

The tech also mentioned that with the amount of data we have, splitting into a 2-3 node cluster would be recommended. We currently have licensing for two NLS systems and do plan on getting a second one in place.

Again, thanks for all the help!

--Mitch