Many corporate networks process massive amounts of internet traffic, which poses a monitoring and security challenge:
- With so much activity, how do we know what should be investigated?
- Better yet... how can we proactively identify internet traffic that is worth investigating before there's a security incident?
- And most importantly... can we automate this?
We answer all of these questions for Cisco Umbrella users in this article.
Not an Umbrella customer? Check out our always-on sandbox.
Table of Contents
- What is Cisco Umbrella?
- Generating an Umbrella Admin API Key
- Planning our script
- Filtering and formatting files for comparison
- Finding uncommon domains
- Removing old files
- Running the script
- Cisco DevNet sample code
What is Cisco Umbrella?
If you're unfamiliar with Umbrella, we colloquially refer to it as "internet security" or DNS security. While all Umbrella packages let you create DNS policies and access a variety of reports on your network's internet activity, higher-tier packages add features ranging from web policies to data loss prevention (DLP) policies to the Investigate API (proactive threat research).
The image above shows a small snippet of the Umbrella interface, where I've navigated to the Activity Search. (I had just configured Umbrella, so my dashboard for the past 24 hours was empty.)
Generating an Umbrella Admin API Key
In the left-side menu, navigate to Admin > API Keys.
In the top right corner of the screen, you should then see a circular Add button. Click that, then fill out the information for creating a new Admin API Key.
The image below provides an example of how to fill this out. What's important is that you choose the correct scope (Reports > Aggregations: Read-Only) and an expiration date that isn't today.
When you've filled out the form above, click the Create Key button.
You'll now see an API Key and Key Secret, as shown above. Copy both of these -- they will only be displayed once.
Securely using the API Key and Key Secret
We'll need the API Key and Key Secret we just generated in order to communicate with Umbrella. But if we're working in a version-controlled repository, hardcoding those credentials into the script and pushing them would expose our sensitive credentials to anyone with access to the repository.
While you can secure credentials in multiple ways, we'll create a .env file to store them in.
API_KEY=apikeygoeshere
KEY_SECRET=keysecretgoeshere
Next, to ensure this .env file is not pushed to our repository, we'll create a .gitignore file using this command:
touch .gitignore
Once the .gitignore file is created, we'll add the name of any files we want to be ignored inside the file -- in this case, only .env.
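Our .gitignore now contains a single line:
.env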
Finally, our script will need to access these credentials despite the fact that they aren't hardcoded into the script.
In Python, we'll include the following import statements to accomplish this:
import os
from dotenv import load_dotenv
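If the dotenv import isn't available in your environment, install the python-dotenv package first:
pip install python-dotenv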
Then, we'll load the credentials into the script:
load_dotenv()
# Environmental variables should contain your org's values in .env file.
client_key = os.environ['API_KEY']
client_secret = os.environ['KEY_SECRET']
Planning our script
Okay, so now we have an API Key and Key Secret with which we can retrieve DNS traffic from Umbrella. We now have three questions to answer:
- What's required to authenticate with the Umbrella API?
- Which API call should we be making to Umbrella?
- How can we sift through the DNS traffic to determine what is "uncommon"?
Authenticating with Umbrella API
Authentication with any Umbrella API requires not just an API Key and Key Secret (which we generated earlier), but also an access token, which expires after 1 hour.
In Python, we'll first need to import the requests library to make an API call:
import requests
Then, we'll write a function that generates an access token using the correct API endpoint. Because we'll run this script on a weekly basis but the access token only lasts an hour, we'll make sure this function is called first each time the script runs.
# Relevant v2 Umbrella API endpoints
base_url = "https://api.umbrella.com"
access_token_endpoint = f"{base_url}/auth/v2/token"
# Generate new access token as these expire after 1 hour. Requires a valid and unexpired Umbrella API Key and Key Secret.
def generate_access_token():
    response = requests.post(url=access_token_endpoint, auth=(client_key, client_secret))
    access_token = response.json()['access_token']
    return access_token
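As a quick sanity check -- assuming the credentials in your .env file are valid -- you can call the function and print just a prefix of the token (tokens are sensitive, so avoid logging them in full):
access_token = generate_access_token()
print(access_token[:10] + "...")  # print only a prefix; never log the full token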
Umbrella Reports API
By reviewing the Umbrella API documentation, we see that the Reports API retrieves information about the traffic coming through the Umbrella network; specifically, we'll use the getTopDestinations endpoint.
We'll first create a variable for the Top Destinations endpoint:
# Relevant v2 Umbrella API endpoints
base_url = "https://api.umbrella.com"
access_token_endpoint = f"{base_url}/auth/v2/token"
top_destinations_endpoint = f"{base_url}/reports/v2/top-destinations"
Next, we define the headers and parameters (as defined in the API documentation) before making the Top Destinations API call:
# Get the Top Destinations visited from 7 days ago until now. Top 1000 domains are returned.
def get_top_destinations(access_token):
headers = {
"Authorization": "Bearer " + access_token,
"Content-Type": "application/json",
"Accept": "application/json"
}
params = {
"from": "-7days",
"to": "now",
"offset": "0",
"limit": 1000
}
    top_destinations_request = requests.get(top_destinations_endpoint, headers=headers, params=params)
top_destinations = top_destinations_request.json()
return top_destinations
You'll notice that we've set parameters to pull Top Destinations from 7 days ago until now. (Specifying now is a supported option, per the documentation.)
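Before moving on, here's a minimal sketch for peeking at the response shape; the rest of the script relies only on the data list and each entry's domain field, and other fields in the response may vary:
# Print the first few destinations returned (assumes the functions defined above)
top_destinations = get_top_destinations(generate_access_token())
for entry in top_destinations['data'][:5]:
    print(entry['domain'])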
Leveraging Umbrella's global usage data
The best way to determine what is abnormal or uncommon? Find a way to establish a baseline or "normal."
Fortunately, Umbrella posts a Popularity List daily. According to Cisco Umbrella, this list "contains our most queried domains based on passive DNS usage across our Umbrella global network of more than 100 Billion requests per day with 65 million unique active users, in more than 165 countries."
One million of the most commonly queried domains should give us a sufficient baseline of "normal."
We'll make a GET API call to retrieve the Umbrella Top 1-Million. (Yes, I'm realizing now I should have stored the URL in a variable to be consistent.)
# Download the Umbrella top 1 million destinations, unzip file, format file.
def get_top_million():
# API call to get Umbrella Top 1 Million as a zip file
get_top_1million_zip = requests.get("http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip")
This function isn't yet complete. Let's discuss what else needs to be considered before we finish this function.
Filtering and formatting files for comparison
We've now made two API calls: one to retrieve the top 1000 destinations seen by our Umbrella network over the past week, and one to retrieve the Top 1-Million domains seen by the Umbrella network globally.
We'll want to clean up these files so that they're easier to compare and the resulting file is meaningful.
Cleaning up our top_destinations.csv file
The Top Destinations call returns our network's most-visited destinations, but not all of those destinations are domains -- we'll also see IP addresses.
While those IP addresses may be worth investigating, they cannot be compared to Umbrella's Top 1-Million, which is a list of domains only. For this reason, we'll want to filter out IP addresses.
First we'll import Python libraries that will help us check for IP addresses and handle CSV files:
from IPy import IP
import csv
Next, we'll add logic that checks if something is an IP address.
# Returns True if the given string is an IP address, False otherwise.
def isIP(value):
    try:
        IP(value)
    except ValueError:
        return False
    return True
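For example:
print(isIP("8.8.8.8"))       # True
print(isIP("example.com"))   # False -- IPy's IP() raises ValueError here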
We'll then incorporate our logic into a new function that takes the Top Destinations response as a parameter. For each destination in the response, if it isn't an IP address, we'll add it to a list called destinations_list.
After that, we'll write that "domains only" list to a new csv.
# If a destination in Top Destinations is a domain, write it as a new line in a CSV called top_destinations.csv.
def top_destinations_to_csv(top_destinations):
    destinations_list = []
    for destination in top_destinations['data']:
        if not isIP(destination['domain']):
            destinations_list.append(destination['domain'])
    with open('top_destinations.csv', 'w', newline='') as top_destinations_csvfile:
        filewriter = csv.writer(top_destinations_csvfile, delimiter=',',
                                quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for destination in destinations_list:
            filewriter.writerow([destination])
    # The with block closes the file for us; return the filename for reference.
    return 'top_destinations.csv'
Cleaning up the Top 1-Million CSV file
Our top destinations file is now just rows upon rows of domains, but how about our Top 1-Million file?
When we made our GET API call, we received a zip file in return. We'll first import a library to work with that zip file:
import zipfile
We then need to write that zip file to disk and unzip it; we'll clean up the extracted CSV in the steps that follow.
# Download the Umbrella top 1 million destinations, unzip the file, and format it.
def get_top_million():
    # API call to get the Umbrella Top 1 Million as a zip file
    get_top_1million_zip = requests.get("http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip")
    # Write the zip file to disk
    with open('top-1m.csv.zip', 'wb') as zip_on_disk:
        zip_on_disk.write(get_top_1million_zip.content)
    # Unzip the file, extracting top-1m.csv to the current directory
    with zipfile.ZipFile('top-1m.csv.zip', 'r') as zip_ref:
        zip_ref.extractall('.')
The unzipped top-1m.csv is a two-column file of rank and domain. The rows below are illustrative only (the actual domains and their order change daily), but the file looks something like this:
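1,google.com
2,microsoft.com
3,netflix.com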
We need to remove that rank-order column so that we can compare domains alone. To do this, we import a Python library to help us reformat the CSV:
import pandas
Finally, we drop that rank order column:
# Removing rank order in first column so that we can compare domains to Top Destinations.
top_1million_csv = pandas.read_csv('top-1m.csv')
first_column = top_1million_csv.columns[0]
top_1million_csv = top_1million_csv.drop([first_column], axis=1)
top_1million_csv.to_csv('top_1million_csv', index=False)
return top_1million_csv
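As a quick spot-check -- using the filenames created above -- you can confirm the cleaned file now holds one domain per line:
with open('top_1million.csv') as cleaned:
    for _ in range(3):
        print(cleaned.readline().strip())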
Finding uncommon domains
We're finally ready to find those uncommon domains! To do this, we read both CSVs, loading the Top 1-Million into a set for fast membership checks. For each domain in our network's top destinations that does not appear in the Umbrella Top 1-Million, we write that domain as a line in our final CSV, named uncommon_domains.csv.
# Compares each domain in top_destinations.csv to top_1million.csv (Umbrella's Top 1 Million) and writes any domains that are not in the Top 1 Million to uncommon_domains.csv.
def find_uncommon_domains():
    top_destinations_file_path = "./top_destinations.csv"
    top_1million_file_path = "./top_1million.csv"
    uncommon_domains_file_path = "./uncommon_domains.csv"
    with open(top_destinations_file_path) as top_destinations_csv:
        top_destinations = [line.strip() for line in top_destinations_csv]
    # Stripping newlines avoids mismatches between files with different line endings,
    # and a set makes each membership check O(1) instead of scanning a million lines.
    with open(top_1million_file_path) as top_1million_csv:
        top_1million = {line.strip() for line in top_1million_csv}
    with open(uncommon_domains_file_path, 'w') as uncommon_domains_csv:
        for domain in top_destinations:
            if domain and domain not in top_1million:
                uncommon_domains_csv.write(domain + '\n')
    print("Uncommon domains have been written to uncommon_domains.csv in your current directory.")
    return uncommon_domains_file_path
Removing old files
This part is a nicety, but let's remove all of the CSV files besides the resulting uncommon_domains.csv to avoid confusion:
# Clean up files used to determine uncommon domains.
def clean_up_files():
os.remove('top-1m.csv.zip')
os.remove('top-1m.csv')
os.remove('top_destinations.csv')
    os.remove('top_1million.csv')
Running the script
We'll write a main function that runs when the script runs, calling the relevant functions in order:
# Main function
def main():
access_token = generate_access_token()
top_destinations = get_top_destinations(access_token)
top_destinations_csvfile = top_destinations_to_csv(top_destinations)
cleaned_top_1million_csv = get_top_million()
uncommon_domains_csv = find_uncommon_domains()
clean_up_files()
return uncommon_domains_csv
if __name__ == "__main__":
main()
You can find an example of the output in the resulting uncommon_domains.csv below.
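Because the plan is to run this weekly (and the access token is regenerated on every run), the script can simply be scheduled. As one example, a cron entry like the following would run it every Monday at 06:00 -- the interpreter path and script name are placeholders to adjust for your environment:
0 6 * * 1 /usr/bin/python3 /path/to/find_uncommon_domains.py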
Cisco DevNet sample code
This is an official submission in Cisco Code Exchange, including a suggested use case, available here. You can also access the sample code directly on GitHub.