
Security intelligence datalake


I’ve had a load of data gathering scripts that I’ve run manually or via cron jobs for many years. Most of the scripts are aimed at creating a snapshot of something at a moment in time for comparisons against other snapshots. These differential views can be really useful for determining areas of interest from a security point of view.

The primary scripts are:

  • DNS_Mon for recording DNS responses for certain domains
  • HTTP_Mon for recording the HTML and curl verbose logs of certain websites (and screenshots, although they don’t work so well…)

Until now I’ve been storing the output of these scripts on my laptop, which means that I don’t get data when the laptop is offline, and there are an awful lot of files being stored.

A new file is created for each scan of each asset - so every time a scan is performed against an individual website, a new file is created. This was done on purpose: the filename holds the scan number, the timestamp, and the asset that was scanned. For example, deliveroo.co.uk/66-1568646678-deliveroo.co.uk-A means that the DNS_Mon script ran its 66th scan at Unix time 1568646678 (Mon Sep 16 15:11:18 2019 UTC) for the A record of deliveroo.co.uk. Using this naming convention makes it easy to find whatever is being looked for. Additionally, the directory that the scan is stored in is named after the domain, so all the scans relating to that domain are kept together.
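
The original DNS_Mon code isn’t reproduced here, but a minimal sketch of the idea in Python, assuming dig is on the PATH and that the scan number is tracked by the caller (record_scan is a hypothetical helper, not the real script):

# Sketch only - not the actual DNS_Mon code. Assumes dig is on the PATH.
import subprocess, time
from pathlib import Path

def record_scan(domain, record_type, scan_number):
    timestamp = int(time.time())
    output = subprocess.run(
        ["dig", record_type, domain],
        capture_output=True, text=True, check=True,
    ).stdout

    # Directory per domain; filename: <scan number>-<unix time>-<domain>-<record type>
    out_dir = Path(domain)
    out_dir.mkdir(exist_ok=True)
    out_file = out_dir / f"{scan_number}-{timestamp}-{domain}-{record_type}"
    out_file.write_text(output)
    return out_file

# e.g. record_scan("deliveroo.co.uk", "A", 66)
# writes deliveroo.co.uk/66-<unix time>-deliveroo.co.uk-A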

DNS_Mon on its own has created more than 84,000 files, coming in at 50MB. The current directory structure cannot be easily navigated with GUI tools, which become very slow, so CLI tools are a much better fit.

HTTP_Mon has 36,000+ files for a total of 3.5GB, mainly due to the screenshots it takes.

Collecting data #

Collecting DNS and HTTP information might not sound very interesting but it can show changes over time that relate to infrastructure, configuration, or applications.

Taking DNS information as an example:

Looking at uber.com/66-1568646677-uber.com-TXT


; <<>> DiG 9.11.10-RedHat-9.11.10-1.fc30 <<>> TXT uber.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11388
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;uber.com.			IN	TXT

;; ANSWER SECTION:
uber.com.		300	IN	TXT	"v=spf1 include:uber.com._nspf.vali.email include:%{i}._ip.%{h}._ehlo.%{d}._spf.vali.email ~all"
uber.com.		300	IN	TXT	"facebook-domain-verification=fgnbsxqefhg2pzugzl4vcw82ylgagg"
uber.com.		300	IN	TXT	"MS=607A6B094E5395250B2F88D76D42FFB6DC2C18A4"
uber.com.		300	IN	TXT	"google-site-verification=yHvJ7x6qUkjrzRfaPzSO5Iu42eP70uSS0Q88xPFBbSU"
uber.com.		300	IN	TXT	"docusign=635f0402-4f58-42de-8e07-e1da6d8a971a"

;; Query time: 21 msec
;; SERVER: 192.168.2.253#53(192.168.2.253)
;; WHEN: Mon Sep 16 16:11:18 BST 2019
;; MSG SIZE  rcvd: 411

Looking at uber.com/83-1588531054-uber.com-TXT


; <<>> DiG 9.11.18-RedHat-9.11.18-1.fc32 <<>> TXT uber.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4298
;; flags: qr rd ra; QUERY: 1, ANSWER: 7, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1452
;; QUESTION SECTION:
;uber.com.			IN	TXT

;; ANSWER SECTION:
uber.com.		37	IN	TXT	"v=spf1 include:uber.com._nspf.vali.email include:%{i}._ip.%{h}._ehlo.%{d}._spf.vali.email ~all"
uber.com.		37	IN	TXT	"facebook-domain-verification=fgnbsxqefhg2pzugzl4vcw82ylgagg"
uber.com.		37	IN	TXT	"MS=607A6B094E5395250B2F88D76D42FFB6DC2C18A4"
uber.com.		37	IN	TXT	"google-site-verification=yHvJ7x6qUkjrzRfaPzSO5Iu42eP70uSS0Q88xPFBbSU"
uber.com.		37	IN	TXT	"mixpanel-domain-verify=a35ee3f7-3848-4a0b-822e-d429b507c0c6"
uber.com.		37	IN	TXT	"AD5-G1R-7NJ"
uber.com.		37	IN	TXT	"docusign=635f0402-4f58-42de-8e07-e1da6d8a971a"

;; Query time: 20 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Sun May 03 19:37:34 BST 2020
;; MSG SIZE  rcvd: 515

From these two scans it’s possible to see that there are two new records:

uber.com.		37	IN	TXT	"mixpanel-domain-verify=a35ee3f7-3848-4a0b-822e-d429b507c0c6"
uber.com.		37	IN	TXT	"AD5-G1R-7NJ"

These records must mean something. The first looks fairly obvious - some kind of domain name based verification for Mixpanel has been created.

The second record isn’t so obvious.
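
Spotting new records by eye works for a couple of domains, but the comparison itself is mechanical. A rough sketch of how the ANSWER sections of two saved dig outputs could be diffed (the paths are the two example scans above; dropping the TTL field avoids flagging records whose TTL is the only thing that changed):

# Sketch: extract the ANSWER section from two saved dig outputs and diff them.
from pathlib import Path

def answer_records(path):
    records = set()
    in_answer = False
    for line in Path(path).read_text().splitlines():
        if line.startswith(";; ANSWER SECTION:"):
            in_answer = True
            continue
        if in_answer:
            if not line.strip():                     # blank line ends the section
                break
            records.add(line.split(None, 4)[4])      # keep only the record data, not the TTL
    return records

old = answer_records("uber.com/66-1568646677-uber.com-TXT")
new = answer_records("uber.com/83-1588531054-uber.com-TXT")
print("added:  ", new - old)
print("removed:", old - new)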

Looking at just-eat.co.uk/66-1568646677-just-eat.co.uk-TXT


; <<>> DiG 9.11.10-RedHat-9.11.10-1.fc30 <<>> TXT just-eat.co.uk
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8589
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;just-eat.co.uk.			IN	TXT

;; ANSWER SECTION:
just-eat.co.uk.		60	IN	TXT	"MS=ms41881746"
just-eat.co.uk.		60	IN	TXT	"MS=ms77361615"
just-eat.co.uk.		60	IN	TXT	"globalsign-domain-verification=EZqi8N2yc05TuywBj4bOn9omA3LX0hzCuWnPfrKB0V"
just-eat.co.uk.		60	IN	TXT	"google-site-verification=Ls66zi52pub37Vbv1XBzh4imv4ZgVBGeKd8S1ceP6jY"
just-eat.co.uk.		60	IN	TXT	"v=spf1 include:_spf.google.com include:spf1.just-eat.co.uk include:spf2.just-eat.co.uk include:_spf.salesforce.com include:mail.zendesk.com ~all"
just-eat.co.uk.		60	IN	TXT	"workplace-domain-verification=NBnS3v9heko06faaUWmZa5uy6aMDwX"

;; Query time: 19 msec
;; SERVER: 192.168.2.253#53(192.168.2.253)
;; WHEN: Mon Sep 16 16:11:18 BST 2019
;; MSG SIZE  rcvd: 492

Looking at just-eat.co.uk/83-1588531054-just-eat.co.uk-TXT


; <<>> DiG 9.11.18-RedHat-9.11.18-1.fc32 <<>> TXT just-eat.co.uk
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62109
;; flags: qr rd ra; QUERY: 1, ANSWER: 7, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1452
;; QUESTION SECTION:
;just-eat.co.uk.			IN	TXT

;; ANSWER SECTION:
just-eat.co.uk.		7	IN	TXT	"MS=ms41881746"
just-eat.co.uk.		7	IN	TXT	"MS=ms77361615"
just-eat.co.uk.		7	IN	TXT	"atlassian-domain-verification=xC9uu3r2iqKv0OgDnfNyJ5lNw4p33+LKzekrEn5RJ+1hXvIUc+1Ajd2Ee2XLMsqp"
just-eat.co.uk.		7	IN	TXT	"globalsign-domain-verification=EZqi8N2yc05TuywBj4bOn9omA3LX0hzCuWnPfrKB0V"
just-eat.co.uk.		7	IN	TXT	"google-site-verification=Ls66zi52pub37Vbv1XBzh4imv4ZgVBGeKd8S1ceP6jY"
just-eat.co.uk.		7	IN	TXT	"v=spf1 include:_spf.google.com include:spf1.just-eat.co.uk include:spf2.just-eat.co.uk include:_spf.salesforce.com include:mail.zendesk.com ~all"
just-eat.co.uk.		7	IN	TXT	"workplace-domain-verification=NBnS3v9heko06faaUWmZa5uy6aMDwX"

;; Query time: 20 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Sun May 03 19:37:34 BST 2020
;; MSG SIZE  rcvd: 613

Comparing the two scans above shows that the following record is new:

just-eat.co.uk.		7	IN	TXT	"atlassian-domain-verification=xC9uu3r2iqKv0OgDnfNyJ5lNw4p33+LKzekrEn5RJ+1hXvIUc+1Ajd2Ee2XLMsqp"

So, it looks like Just Eat started to use Atlassian products at some time between scan 66 (Mon Sep 16 15:11:17 2019 UTC) and scan 83 (Sun May 03 18:37:34 2020 UTC). Looking at the scans that occurred between scan 66 and scan 83 shows that the last scan before the new record was detected was scan 81 (Sun Nov 03 15:07:48 2019 UTC) and the first scan that detected it was scan 82 (Sun May 03 18:36:40 2020 UTC). Unfortunately, because my scans weren’t very frequent, I cannot be more accurate than that.
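
Narrowing the window like this is just a matter of walking the saved scans for the domain in chronological order and finding the first one that contains the new record. A rough sketch against the directory layout described earlier (first_scan_containing is a hypothetical helper):

# Sketch: find the first saved TXT scan for a domain that contains a given string,
# to bound when a record first appeared. Filenames: <scan>-<unix time>-<domain>-TXT
from pathlib import Path

def first_scan_containing(domain, needle):
    scans = sorted(
        Path(domain).glob(f"*-{domain}-TXT"),
        key=lambda p: int(p.name.split("-")[1]),   # sort by the unix timestamp field
    )
    for scan in scans:
        if needle in scan.read_text():
            return scan.name
    return None

print(first_scan_containing("just-eat.co.uk", "atlassian-domain-verification"))
# -> the name of the first scan containing the record (scan 82 in the example above)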

The value in this data is that creating snapshots over time allows for detecting roughly when a change has happened. This might be useful for a number of reasons, especially when it comes to HTTP_Mon, which tracks website source code.

Tracking website source code changes might make it possible to identify the time period when malicious code was added to a website, to see changes in comments in the source code, and more.

Saving the data #

Saving the data in its rawest format means that I don’t risk cutting any data out or incorrectly processing it and losing it. Instead of relying on my laptop, I will be using AWS.

AWS allows for scheduled jobs that kick off Lambda functions. I have created a scheduled job for each of the scans that I want to run, and a Lambda function that runs my original code and then saves the scan results to S3. The main change is that each scheduled job runs for just one domain rather than having a single job try to scan multiple domains. This change means that the scan number is no longer useful, so it has been removed, leaving the filename format looking like just-eat.co.uk/1557522600-just-eat.co.uk-TXT.
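
The Lambda code itself isn’t shown here, but a minimal sketch of the idea, assuming the dig binary is packaged with the function (for example via a layer), the bucket name is a placeholder, and the scheduled event supplies the domain and record type:

# Sketch of a Lambda handler that runs one scan and writes the raw output to S3.
# Assumes an event like {"domain": "just-eat.co.uk", "type": "TXT"} and a
# hypothetical bucket name.
import subprocess, time
import boto3

BUCKET = "my-security-datalake"   # placeholder bucket name
s3 = boto3.client("s3")

def handler(event, context):
    domain = event["domain"]
    record_type = event["type"]
    timestamp = int(time.time())

    output = subprocess.run(
        ["dig", record_type, domain],
        capture_output=True, text=True, check=True,
    ).stdout

    # Key format: <domain>/<unix time>-<domain>-<record type> (no scan number any more)
    key = f"{domain}/{timestamp}-{domain}-{record_type}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=output.encode())
    return {"saved": key}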

After a couple of days, this setup is still running smoothly and I’m reliably capturing data at regular intervals.

Processing the data #

Currently I am manually reviewing and comparing the data out of interest. At some point the idea is to automate comparisons between snapshots and make the results available for easy viewing.
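
A rough sketch of what that automation might look like, reusing the same hypothetical bucket and key format as above: list the snapshots for an asset, sort them by timestamp, and diff each consecutive pair.

# Sketch: compare consecutive snapshots for one asset stored in S3 and emit diffs.
import difflib
import boto3

BUCKET = "my-security-datalake"   # placeholder bucket name
s3 = boto3.client("s3")

def consecutive_diffs(prefix):
    keys = [
        obj["Key"]
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix)
        for obj in page.get("Contents", [])
    ]
    keys.sort(key=lambda k: int(k.split("/")[-1].split("-")[0]))  # sort by unix timestamp

    for older, newer in zip(keys, keys[1:]):
        old_body = s3.get_object(Bucket=BUCKET, Key=older)["Body"].read().decode()
        new_body = s3.get_object(Bucket=BUCKET, Key=newer)["Body"].read().decode()
        diff = list(difflib.unified_diff(
            old_body.splitlines(), new_body.splitlines(),
            fromfile=older, tofile=newer, lineterm="",
        ))
        if diff:
            yield older, newer, diff

for older, newer, diff in consecutive_diffs("just-eat.co.uk/"):
    print("\n".join(diff))

For raw dig output a diff like this will always contain noise such as the query time, server, and timestamp lines, so in practice the comparison would be restricted to the interesting parts - the ANSWER section for DNS_Mon, the HTML body for HTTP_Mon.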

Once comparisons are easy, the next step would be alerting on certain conditions, such as malicious-looking JavaScript being added to a website that’s being monitored.

The data in the datalake will only ever be source data. It will not be amended in the datalake, so that it can be reprocessed later with different tools, ensuring no data is lost.