AWS boto3 cannot see credentials or $USER macros
Posted: Thu Oct 15, 2015 3:46 am
I have written a Python boto3 script to get some metric statistics from the AWS hosts in our production account.
The script uses AWS API calls to see which hosts are up and then asks each one for its "StatusCheckFailed" stats. These represent health checks that AWS carries out on the VMs; if one of them has failed there is usually a problem, sometimes with hardware.
These stats are only accessible through the AWS API. In order to use the AWS API for monitoring, we have set up an IAM user with read-only access to most of the resources in our production account. Obviously the credentials for this user are sensitive because the permissions are quite broad.
The script normally picks up the AWS credentials from a ~/.aws directory, but the Nagios XI process seems unable to pick this up. To try to fix this I have tried:
- Setting the HOME environment variable in the script with os.environ['HOME']='/nagioshome'. This does work with Python / boto3: I can run the script in a shell from an account with no credentials and it finds the nagios home ones. It does not make the script work from the Nagios XI process, however.
- Following the advice in https://exchange.nagios.org/directory/D ... os/details for adding the AWS API keys as $USERn$ macros. The $USERn$ values did not reach the script; I modified the error messages to display them, but they did not seem to be set. I used /etc/init.d/nagios reload to SIGHUP the nagios process.
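A related approach to the HOME override in the first bullet, sketched here as an untested assumption rather than a confirmed fix: botocore reads the AWS_SHARED_CREDENTIALS_FILE and AWS_CONFIG_FILE environment variables, so the script could name the files directly instead of relying on "~" being expanded. The helper name use_credentials_dir and the path /nagioshome/.aws are hypothetical (the path is only inferred from the HOME value above).

```python
import os


def use_credentials_dir(aws_dir):
    """Point the AWS SDK at the credentials/config files inside aws_dir.

    Call this before any boto3 client or session is created.
    """
    credentials = os.path.join(aws_dir, "credentials")
    config = os.path.join(aws_dir, "config")
    # botocore consults these two variables when it builds a session.
    os.environ["AWS_SHARED_CREDENTIALS_FILE"] = credentials
    os.environ["AWS_CONFIG_FILE"] = config
    return credentials, config


if __name__ == "__main__":
    # Assumed location, based on os.environ['HOME']='/nagioshome' above.
    creds, conf = use_credentials_dir("/nagioshome/.aws")
    print("credentials file: %s (readable: %s)" % (creds, os.access(creds, os.R_OK)))
```

The readability check in the last line matters because the file has to be readable by whichever user the Nagios XI process actually runs as, which may differ from the shell account used for testing.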
Here is the output from the script version that takes $USERn$ variables (this is not the version attached below):
Code: Select all
COMMAND: /usr/local/nagios/libexec/check_ec2_statuscheck.py -u \$USER9\$ -a \$USER10\$
OUTPUT: system error type: value: An error occurred (AuthFailure) when calling the DescribeInstances operation: AWS was not able to validate the provided access credentials : -u $USER9$ -a $USER10$ Unknown
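The output above shows the literal text $USER9$ and $USER10$ arriving at the script, so AWS was handed the macro names rather than real keys. A minimal sketch of a guard that reports this case explicitly instead of letting it surface as AuthFailure; the function name unexpanded_macros is hypothetical and is not part of the script version that produced the output.

```python
import re
import sys

# A Nagios user macro that was passed through literally, e.g. "$USER9$".
UNEXPANDED_MACRO = re.compile(r"^\$USER\d+\$$")


def unexpanded_macros(args):
    """Return the arguments that still look like unexpanded $USERn$ macros."""
    return [a for a in args if UNEXPANDED_MACRO.match(a)]


if __name__ == "__main__":
    bad = unexpanded_macros(sys.argv[1:])
    if bad:
        # Exit code 3 is the Nagios UNKNOWN state.
        print("UNKNOWN: macros not expanded by Nagios: %s" % " ".join(bad))
        sys.exit(3)
```

Running the guard before any AWS call separates "Nagios did not substitute the macro" from "the keys themselves are wrong".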
My specific questions are:
- What configuration of the Nagios XI system should I check to ensure that the Nagios XI process does have access to the nagios user .aws config directory?
- What configuration of the Nagios XI system should I check to ensure that the $USERn$ macros are set correctly?
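To help answer the first question, a small diagnostic can be run as a temporary check command so that it reports what the Nagios XI process itself sees. This is only a sketch: the function name runtime_identity is hypothetical, and it assumes a Unix host (it uses the pwd module).

```python
import os
import pwd


def runtime_identity():
    """Collect the user, HOME and credentials-file visibility of this process."""
    uid = os.geteuid()
    try:
        user = pwd.getpwuid(uid).pw_name
    except KeyError:
        # No passwd entry for this uid; fall back to the number.
        user = str(uid)
    expanded = os.path.expanduser("~")
    creds = os.path.join(expanded, ".aws", "credentials")
    return {
        "user": user,
        "HOME": os.environ.get("HOME"),
        "expanded_home": expanded,
        "credentials_path": creds,
        "credentials_readable": os.access(creds, os.R_OK),
    }


if __name__ == "__main__":
    info = runtime_identity()
    # Single-line output so it shows up in the Nagios status text.
    print(" ".join("%s=%s" % (k, info[k]) for k in sorted(info)))
```

Comparing this output from the Nagios XI interface with the output from an interactive shell should show whether the two differ in user, HOME, or file permissions.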
Thanks
Code: Select all
Nagios XI Installation Profile
System:
Nagios XI Version : 2012R2.7
nagios.ioppublishing.com 2.6.32-358.23.2.el6.x86_64 x86_64
Red Hat Enterprise Linux Server release 6.4 (Santiago)
Gnome Installed
Apache Information
PHP Version: 5.3.3
Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
Server Name: nagios.ioppublishing.com
Server Address: 10.20.1.83
Server Port: 80
Date/Time
PHP Timezone: Europe/London
PHP Time: Thu, 15 Oct 2015 08:47:55 +0100
System Time: Thu, 15 Oct 2015 08:47:55 +0100
Nagios XI Data
License ends in: OVQTPV
nagios (pid 28940) is running...
NPCD running (pid 2338).
ndo2db (pid 2973) is running...
CPU Load 15: 2.11
Total Hosts: 381
Total Services: 2689
Function 'get_base_uri' returns: http://nagios.ioppublishing.com/nagiosxi/
Function 'get_base_url' returns: http://nagios.ioppublishing.com/nagiosxi/
Function 'get_backend_url(internal_call=false)' returns: http://nagios.ioppublishing.com/nagiosxi/includes/components/profile/profile.php
Function 'get_backend_url(internal_call=true)' returns: http://localhost/nagiosxi/backend/
Ping Test localhost
Running:
/bin/ping -c 3 localhost 2>&1
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.028 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.026 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.025 ms
--- localhost ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.025/0.026/0.028/0.004 ms
Test wget To locahost
WGET From URL: http://localhost/nagiosql/index.php
Running:
/usr/bin/wget http://localhost/nagiosql/index.php
--2015-10-15 08:47:57-- http://localhost/nagiosql/index.php
Resolving localhost... ::1, 127.0.0.1
Connecting to localhost|::1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5259 (5.1K) [text/html]
Saving to: "/tmp/nagiosql_index.tmp"
0K ..... 100% 537M=0s
2015-10-15 08:47:58 (537 MB/s) - "/tmp/nagiosql_index.tmp" saved [5259/5259]
Code: Select all
import sys
from datetime import datetime, timedelta

import boto3

# set the time period to get stats for
# (naive datetimes are sent to AWS as UTC, so use utcnow() rather than local time)
lookback = timedelta(hours=1)
finish = datetime.utcnow()
start = finish - lookback

try:
    # get details of instances in account
    e = boto3.client('ec2').describe_instances()
    myinstances = []
    myfail = []
    # filter the instances to only the running ones
    # (a reservation can contain more than one instance)
    for r in e['Reservations']:
        for i in r['Instances']:
            if i['State']['Name'] == 'running':
                myinstances.append({'Name': 'InstanceId', 'Value': i['InstanceId']})
    # attach to cloudwatch
    c = boto3.client('cloudwatch')
    # get the status checks
    for instance in myinstances:
        x = c.get_metric_statistics(
            Namespace='AWS/EC2',
            MetricName='StatusCheckFailed_System',
            Dimensions=[instance],
            StartTime=start,
            EndTime=finish,
            Period=300,
            Statistics=['Sum'],
            Unit='Count'
        )
        # go through all returned results looking for problems
        for result in x['Datapoints']:
            if result['Sum'] > 0:
                myfail.append(instance)
                break  # one failed datapoint is enough; avoid duplicate entries
except Exception:
    # something went wrong
    print("system error %s %s: Unknown" % (sys.exc_info()[0], sys.exc_info()[1]))
    sys.exit(3)

if len(myfail) == 0:
    print("%d instances checked :OK " % len(myinstances))
    sys.exit(0)
else:
    print("%d instances checked %s :Critical" % (len(myinstances), repr(myfail)))
    sys.exit(2)
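Since the credential problem blocks every AWS call, it may help to exercise the datapoint check separately with canned data. The sketch below pulls that logic into two pure functions; the names failed_instances and nagios_result and the instance IDs are hypothetical, and the input shape is assumed to mirror the 'Datapoints' lists returned by get_metric_statistics.

```python
def failed_instances(datapoints_by_instance):
    """Return sorted instance IDs that have at least one datapoint with Sum > 0."""
    failed = []
    for instance_id, datapoints in datapoints_by_instance.items():
        if any(d.get("Sum", 0) > 0 for d in datapoints):
            failed.append(instance_id)
    return sorted(failed)


def nagios_result(checked, failed):
    """Map the check outcome to a (Nagios exit code, status line) pair."""
    if failed:
        return 2, "%d instances checked %s :Critical" % (checked, failed)
    return 0, "%d instances checked :OK" % checked


if __name__ == "__main__":
    # Made-up sample data: one healthy instance, one with a failed check.
    sample = {
        "i-aaaa": [{"Sum": 0.0}, {"Sum": 0.0}],
        "i-bbbb": [{"Sum": 0.0}, {"Sum": 1.0}],
    }
    code, line = nagios_result(len(sample), failed_instances(sample))
    print(line)
```

With the logic confirmed offline, any remaining failure under Nagios XI can be attributed to credentials or macro expansion rather than to the check itself.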