Simple data sanitisation with Python

Table of Contents

Following on from the simple data generator that I created a few weeks ago, I decided to create a script that takes CSV data and performs various different techniques to transform the data in order to sanitise the contents.

There are already a lot of tools for sanitising data but most are difficult to setup and use. I wanted to create something that could be easily used within a few minutes - with the main focus being on determining what should be done with the data rather than how.

For a useful level of functionality I determine that, for each field, there would need to be the option to:

Remove the field entirely
Replace the contents of the field with custom input
Blank the field
Replace the contents with the hashed version

Basic steps:

Define the action to take on each field
Read in each line of the file
For each field in record, apply the corresponding rule

Mock data #

For testing my script I used the simple data generator to create the following data:

999991,Rosana,Venezia,1938-02-12,82,Rosana.Venezia@ghmail.co.uk
999992,Reta,Leanna,1925-05-02,95,Reta.Leanna@hatmail.co
999993,Gusti,Moran,1972-03-23,48,Gusti.Moran@madeupdomain.us
999994,Ulrike,Brenden,1955-07-28,65,Ulrike.Brenden@fam-domain.uk
999995,Angel,Crotty,1934-08-05,86,Angel.Crotty@reallylonghardtorememberdomain.com
999996,Carine,Grassi,1961-11-25,58,Carine.Grassi@ghmail.co.uk
999997,Dulcea,Ranit,1921-03-01,99,Dulcea.Ranit@ghmail.com
999998,Aurelea,Mas,1952-07-20,68,Aurelea.Mas@totallymadeupdomain.ie
999999,Yvette,Eada,1965-05-12,55,Yvette.Eada@fam-domain.uk
1000000,Elbertine,Forlini,1973-09-24,46,Elbertine.Forlini@orangesandlemons.cc

Creating options #

Providing the control for how to treat each field that needs to be sanitisied could be done in a number of ways. Requiring a configuration file would work but not needing an external file is preferable.

After trialing a few different ways of passing settings into the script it was clear that lots of individual flags and arguments would over complicate things. For setting the sanitisation options the -o argument with a space separated string containing key value pairs is best. Defining the field and the action to perform on the field using the format f<fieldnumber>:<action> where action can be:

MASK=REMOVE to entirely remove the field, so by the end of the document the entire column is gone
MASK=REPLACE(<stringtoreplacewith>) to replace the current field value with a custom value
MASK=BLANK to replace the current field value with an empty value
MASK=HASH(<hashingalgo>[,seedvalue]) to take the field contents and hash it, with the option to provide a salt

Options examples #

f1:MASK=REPLACE(helloworld) replaces the first field with “helloworld”
f2:MASK=REPLACE(hellokitty) replaces the second field with “hellokitty”
f1:MASK=BLANK f2:MASK=REMOVE replaces the first field with "" and completely removes the second field
f3:MASK=HASH(MD5) replace the third field value with its own MD5 hash
f4:MASK=HASH(MD5,evergreen) replace the fourth field value with its own MD5 hash that’s using “evergreen” as a salt

Testing sanitisation #

To test the new sanitisation script I ran it with the test data from earlier in the “dirty_data.txt” file:

./data-sanitiser dirty_data.txt -o "f1:MASK=HASH(SHA256,hi) f2:MASK=HASH(SHA256,hi) f5:MASK=HASH(SHA256)" > clean_data.txt

This options passed do the following:

hash the first field with SHA256 and a salt of “hi”
hash the second field with SHA256 and a salt of “hi”
hash the fifth field with SHA256 but without a salt.

The output from the command:

6d67a797eed3d9bb791b8baba6644f443199050579add051738163cf93d1e780,59640c396b7bbafb21383a5361a3f0bbdd6d87ddf1aaaf9be3bbf5cc9793ec17,Venezia,1938-02-12,a46e37632fa6ca51a13fe39a567b3c23b28c2f47d8af6be9bd63e030e214ba38,Rosana.Venezia@ghmail.co.uk
daddeb454e81f638507140e89f005cec5f27eab27a44cc6cb90c219af6a06bc0,ccb16a2e7a0941b7781d8508fa2d70de0caa60791a897cf7aa74b8038a60f8aa,Leanna,1925-05-02,ad48ff99415b2f007dc35b7eb553fd1eb35ebfa2f2f308acd9488eeb86f71fa8,Reta.Leanna@hatmail.co
e368e40c9550299ff2da99a066d37e5180b2338b55bcc5012ca6504382d5a4dd,87444d01f154bd2fc8268b9a37bd545b393a2b994f1fb8fc4c6ca0d6baefbf85,Moran,1972-03-23,98010bd9270f9b100b6214a21754fd33bdc8d41b2bc9f9dd16ff54d3c34ffd71,Gusti.Moran@madeupdomain.us
a62f740a5656700c79ca680d7899367927b164f37ab49a883399f4a3e749fae4,880ad36e6a70283fa018e9fc01970d5b8474c18e29a5b1f3082ad86df4fc8e8d,Brenden,1955-07-28,108c995b953c8a35561103e2014cf828eb654a99e310f87fab94c2f4b7d2a04f,Ulrike.Brenden@fam-domain.uk
5185d1b0f922f1b294db34de37a59a8b7f4b4d13b3d20cf4655ef0e180fbdcb8,19b014b8f51ba9725d5d65f9fe2e2bf040ac0b9fa1d1766d070f23d68062cd79,Crotty,1934-08-05,434c9b5ae514646bbd91b50032ca579efec8f22bf0b4aac12e65997c418e0dd6,Angel.Crotty@reallylonghardtorememberdomain.com
1fa0bbeb8dfcbe444957522406c7269c809b7b1b3ecb2f1cea490349047dc55d,9ac9805fa063a50e15cd568d56459e4ed4f73b5197e8fad9a3f726430b5dff5c,Grassi,1961-11-25,6208ef0f7750c111548cf90b6ea1d0d0a66f6bff40dbef07cb45ec436263c7d6,Carine.Grassi@ghmail.co.uk
77d6f9762c5dd9a96368d41e523dfb19d85075701c6b5919573272dbd2c59ad3,925474e595bf4ed71e644123054d4251b45740619ea8397ab862bf8863dfca25,Ranit,1921-03-01,8c1f1046219ddd216a023f792356ddf127fce372a72ec9b4cdac989ee5b0b455,Dulcea.Ranit@ghmail.com
7ba02c7f85ea036bf33abdfa874bf70ff756e346fd96512c3bf6be54ee6e53eb,2f41abe47677bf990783a2cae0ce8611e9d97130befb9b612adbcf8315192d14,Mas,1952-07-20,a21855da08cb102d1d217c53dc5824a3a795c1c1a44e971bf01ab9da3a2acbbf,Aurelea.Mas@totallymadeupdomain.ie
7b803a75c9f24dae7c0d61cfa3f7cb2c78ed8659fca4d7d08cb72c66339638b3,bac5025393d5bfd199ffd0eb0c38bc2454828e1d58989610d7e90f859e9fb8f0,Eada,1965-05-12,02d20bbd7e394ad5999a4cebabac9619732c343a4cac99470c03e23ba2bdc2bc,Yvette.Eada@fam-domain.uk
68c5aaa497ab572e8b83d3f40e9f4f3ae587cc8c201a77648bcb94c59728e4d8,f3979281eb08a76ba702463f75d758926c77b06cca07ac4d09559a5bcc47df18,Forlini,1973-09-24,25fc0e7096fc653718202dc30b0c580b8ab87eac11a700cba03a7c021bc35b0c,Elbertine.Forlini@orangesandlemons.cc

To do #

There are still a number of items on the to do list including:

Proper error handling
Make the options rules more efficient
Writing out to files rather than stdout
Writing errors to stderr
Containerise

Code #

The current implementation only allows for a single rule per field. If multiple rules are defined the first is the one that will be used. If there are errors in the options syntax then the output may be compromised.

The code is under the MIT license.

At some point I may make the GitHub repository public, but until then the code is below:

#!/bin/python3

import sys
import random
import argparse
import hashlib


# data-sanitiser dirty_data.txt -o "f1:MASK=HASH(MD5,hi) f2:MASK=HASH(MD5,hi) f5:MASK=HASH(MD5,hi)" > clean_data.txt

def sanitise_file(input_file, DELIMITER, OPTIONS): #output_file
    with open(input_file) as in_file:
        for line in in_file:
            in_file_fields = line.strip().split(DELIMITER)

            cleaned_record = ""

            for field_num,field in enumerate(in_file_fields):
                
                action_performed = False
                opts = OPTIONS.split(" ")
                for opt in opts:
                    if action_performed == True:
                        break
                    if ("f" + str(field_num + 1) + ":") in opt:
                        
                        # The first rule that matches for the filed is used, the rest are ignored - so only create 1 rule per field

                        if "MASK=REMOVE" in opt:
                            #print("Removed")
                            action_performed = True
                            break
                        elif "MASK=REPLACE" in opt:
                            replacement_value = opt[opt.find("(")+1:-1]
                            cleaned_record += replacement_value + DELIMITER
                            action_performed = True
                            break
                        elif "MASK=BLANK" in opt:
                            cleaned_record += "" + DELIMITER
                            action_performed = True
                            break
                        elif "MASK=HASH" in opt:
                            replacement_value = opt[opt.find("(")+1:-1]
                            hashalgo = ""
                            hash = ""
                            seed = ""

                            if "," in replacement_value:
                                hashalgo_seed = replacement_value.split(",")
                                hashalgo = hashalgo_seed[0]
                                seed = hashalgo_seed[1]
                            else:
                                hashalgo = replacement_value

                            if hashalgo == "SHA256":
                                h = hashlib.sha256()
                                h.update((str(seed) + str(field)).strip().encode())
                                hash = h.hexdigest()
                            elif hashalgo == "MD5":
                                h = hashlib.md5()
                                h.update((str(seed) + str(field)).strip().encode())
                                hash = h.hexdigest()
                            else:
                                print("Error.  Hash algo is not supported.")
                                sys.exit()

                            cleaned_record += hash + DELIMITER
                            action_performed = True
                            break

                if action_performed == False:
                    cleaned_record += str(field) + DELIMITER


            print(cleaned_record[:-1]) #Remove the last delimiter
       

if __name__ == "__main__":
    
    parser = argparse.ArgumentParser()
    parser.add_argument('input_file_loc', nargs='?')
    parser.add_argument("-d","--delimiter",type=str, default=",", help="Set the delimiter character.")
    parser.add_argument("-f","--fields",type=str, default="", help="Set the fields to sanitise.")
    parser.add_argument("-o","--options",type=str,help="Other options")
    parser.add_argument("-v","--verbose", action='store_true', help="Set the verbosity level to help with debugging.  0 is off, 1 is on.")
    args = parser.parse_args()

    sanitise_file(args.input_file_loc, args.delimiter, args.options) #args.output_file_loc