Simple data sanitisation with Python

Following on from the simple data generator that I created a few weeks ago, I decided to create a script that takes CSV data and applies a set of transformations to sanitise its contents.

There are already a lot of tools for sanitising data, but most are difficult to set up and use. I wanted to create something that could be up and running within a few minutes, with the main focus on deciding what should be done with the data rather than how.

For a useful level of functionality I determined that, for each field, there would need to be the option to:

  • Remove the field entirely
  • Replace the contents of the field with custom input
  • Blank the field
  • Replace the contents with the hashed version

Basic steps:

  1. Define the action to take on each field
  2. Read in each line of the file
  3. For each field in the record, apply the corresponding rule
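
The steps above can be sketched roughly as follows. This is only an illustration of the flow, not the script itself: the function name `apply_rules`, the rule dictionary, and the sample lines are all made up for the example, and each rule is assumed to be a callable that returns the new value (or `None` to drop the field).

```python
def apply_rules(record, rules, delimiter=","):
    """Apply at most one rule per field (1-based); fields without a rule pass through."""
    cleaned = []
    for field_num, field in enumerate(record.split(delimiter), start=1):
        rule = rules.get(field_num)
        if rule is None:
            cleaned.append(field)  # no rule defined: keep the field as-is
            continue
        result = rule(field)
        if result is not None:     # None is used here to mean "remove the field"
            cleaned.append(result)
    return delimiter.join(cleaned)


# Step 1: define the action to take on each field (hypothetical rules).
rules = {
    1: lambda field: "",          # blank the field
    2: lambda field: None,        # remove the field entirely
    3: lambda field: "redacted",  # replace with custom input
}

# Steps 2 and 3: read each line and apply the corresponding rule to each field.
lines = ["alice,smith,london,42", "bob,jones,paris,37"]  # made-up sample records
for line in lines:
    print(apply_rules(line, rules))
```

Keeping the rules in a dictionary keyed by field number means each field is looked up once, rather than scanning every rule for every field.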

Mock data #

For testing my script I used the simple data generator to create the following data:


Creating options #

Control over how each field is sanitised could be provided in a number of ways. Requiring a configuration file would work, but not needing an external file is preferable.

After trialling a few different ways of passing settings into the script, it was clear that lots of individual flags and arguments would overcomplicate things. Instead, the sanitisation options are passed through the -o argument as a single space-separated string of key-value pairs. Each pair names a field and the action to perform on it, using the format f<fieldnumber>:<action>, where action can be:

  • MASK=REMOVE to remove the field entirely, so the whole column is absent from the output
  • MASK=REPLACE(<stringtoreplacewith>) to replace the current field value with a custom value
  • MASK=BLANK to replace the current field value with an empty value
  • MASK=HASH(<hashingalgo>[,seedvalue]) to replace the field contents with its hash (MD5 or SHA256), optionally using seedvalue as a salt
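
As a rough sketch of how an options string in this format could be turned into a lookup table, the snippet below parses each rule with a regular expression. The names `OPTION_PATTERN` and `parse_options` are hypothetical and this is not how the script further down does it; it is one possible approach, assuming action names are upper-case letters and that values contain no spaces.

```python
import re

# Hypothetical pattern for "f<fieldnumber>:MASK=<ACTION>" with optional "(<argument>)".
OPTION_PATTERN = re.compile(r"^f(\d+):MASK=([A-Z]+)(?:\((.*)\))?$")


def parse_options(options):
    """Return {field_number: (action, argument)}, keeping the first rule per field."""
    rules = {}
    for token in options.split():
        match = OPTION_PATTERN.match(token)
        if match is None:
            raise ValueError(f"Unrecognised option: {token!r}")
        field_num = int(match.group(1))
        # argument is None when the action has no parentheses (REMOVE, BLANK)
        rules.setdefault(field_num, (match.group(2), match.group(3)))
    return rules


print(parse_options("f1:MASK=BLANK f2:MASK=REMOVE f4:MASK=HASH(MD5,evergreen)"))
# {1: ('BLANK', None), 2: ('REMOVE', None), 4: ('HASH', 'MD5,evergreen')}
```

Because the string is split on whitespace, a replacement value containing a space would be cut in two, which is a limitation of the space-separated format rather than of this sketch.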

Options examples #

  • f1:MASK=REPLACE(helloworld) replaces the first field with “helloworld”
  • f2:MASK=REPLACE(hellokitty) replaces the second field with “hellokitty”
  • f1:MASK=BLANK f2:MASK=REMOVE replaces the first field with "" and completely removes the second field
  • f3:MASK=HASH(MD5) replaces the third field value with its MD5 hash
  • f4:MASK=HASH(MD5,evergreen) replaces the fourth field value with its MD5 hash, using “evergreen” as a salt
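
To make the effect of the salt concrete, here is a small standalone sketch. It assumes the salt is simply prepended to the field value before hashing, which matches the code at the end of this post; the helper name `hash_field` and the value "hello" are only for illustration.

```python
import hashlib


def hash_field(value, algo, salt=""):
    """Hash salt + value with the named algorithm and return the hex digest."""
    supported = {"MD5": hashlib.md5, "SHA256": hashlib.sha256}
    if algo not in supported:
        raise ValueError(f"Hash algo is not supported: {algo}")
    return supported[algo]((salt + value).encode()).hexdigest()


unsalted = hash_field("hello", "MD5")
salted = hash_field("hello", "MD5", "evergreen")

print(unsalted)            # 5d41402abc4b2a76b9719d911017c592
print(salted != unsalted)  # True: the salt changes the resulting hash
```

The same input with the same salt always produces the same hash, so hashed fields can still be matched against each other after sanitisation.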

Testing sanitisation #

To test the new sanitisation script I ran it with the test data from earlier in the “dirty_data.txt” file:

./data-sanitiser dirty_data.txt -o "f1:MASK=HASH(SHA256,hi) f2:MASK=HASH(SHA256,hi) f5:MASK=HASH(SHA256)" > clean_data.txt

The options passed do the following:

  • hash the first field with SHA256 and a salt of “hi”
  • hash the second field with SHA256 and a salt of “hi”
  • hash the fifth field with SHA256 but without a salt

The output from the command:


To do #

There are still a number of items on the to-do list, including:

  • Adding proper error handling
  • Making the options rules more efficient
  • Writing out to files rather than stdout
  • Writing errors to stderr
  • Containerising the script
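
Two of these items, writing to a file and sending errors to stderr, could look something like the sketch below. None of this exists in the script yet: `write_records`, `report_error`, and the `output_path` parameter are hypothetical names chosen for the example.

```python
import sys


def write_records(records, output_path=None):
    """Write cleaned records to output_path, or to stdout when no path is given."""
    if output_path is None:
        for record in records:
            print(record)
        return
    with open(output_path, "w") as out_file:
        for record in records:
            out_file.write(record + "\n")


def report_error(message):
    """Send diagnostics to stderr so they never mix with the cleaned data."""
    print(f"Error: {message}", file=sys.stderr)


write_records(["a,b,c", "d,e,f"])              # no path given, so this prints to stdout
report_error("Hash algo is not supported.")    # goes to stderr, not into the data
```

Keeping errors on stderr matters most when the output is redirected with `>`, as in the test command above, because otherwise error messages end up inside the cleaned file.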

Code #

The current implementation only allows a single rule per field. If multiple rules are defined for the same field, only the first is used. The options syntax is not validated, so errors in it may produce incorrect output.

The code is under the MIT license.

At some point I may make the GitHub repository public, but until then the code is below:


import sys
import random
import argparse
import hashlib

# data-sanitiser dirty_data.txt -o "f1:MASK=HASH(MD5,hi) f2:MASK=HASH(MD5,hi) f5:MASK=HASH(MD5,hi)" > clean_data.txt

def sanitise_file(input_file, DELIMITER, OPTIONS): #output_file
    with open(input_file) as in_file:
        for line in in_file:
            in_file_fields = line.strip().split(DELIMITER)

            cleaned_record = ""

            for field_num,field in enumerate(in_file_fields):
                action_performed = False
                opts = OPTIONS.split(" ")
                for opt in opts:
                    if action_performed:
                        break  # a rule has already been applied to this field
                    if ("f" + str(field_num + 1) + ":") in opt:
                        # The first rule that matches for the field is used, the rest are ignored - so only create 1 rule per field

                        if "MASK=REMOVE" in opt:
                            action_performed = True
                        elif "MASK=REPLACE" in opt:
                            replacement_value = opt[opt.find("(")+1:-1]
                            cleaned_record += replacement_value + DELIMITER
                            action_performed = True
                        elif "MASK=BLANK" in opt:
                            cleaned_record += "" + DELIMITER
                            action_performed = True
                        elif "MASK=HASH" in opt:
                            replacement_value = opt[opt.find("(")+1:-1]
                            hashalgo = ""
                            hash = ""
                            seed = ""

                            if "," in replacement_value:
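                                # Split once so the algorithm and the seed are separated
                                hashalgo_seed = replacement_value.split(",", 1)
                                hashalgo = hashalgo_seed[0]
                                seed = hashalgo_seed[1]
                            else:
                                # No seed supplied, so the whole argument is the algorithm name
                                hashalgo = replacement_value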

                            if hashalgo == "SHA256":
                                h = hashlib.sha256()
                                h.update((str(seed) + str(field)).strip().encode())
                                hash = h.hexdigest()
                            elif hashalgo == "MD5":
                                h = hashlib.md5()
                                h.update((str(seed) + str(field)).strip().encode())
                                hash = h.hexdigest()
                            else:
                                print("Error.  Hash algo is not supported.")

                            cleaned_record += hash + DELIMITER
                            action_performed = True

                if not action_performed:
                    cleaned_record += str(field) + DELIMITER

            print(cleaned_record[:-1]) #Remove the last delimiter

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('input_file_loc', nargs='?')
    parser.add_argument("-d","--delimiter",type=str, default=",", help="Set the delimiter character.")
    parser.add_argument("-f","--fields",type=str, default="", help="Set the fields to sanitise.")
    parser.add_argument("-o","--options",type=str, default="", help="Space separated sanitisation rules in the format f<fieldnumber>:<action>.")
    parser.add_argument("-v","--verbose", action='store_true', help="Enable verbose output to help with debugging.")
    args = parser.parse_args()

    sanitise_file(args.input_file_loc, args.delimiter, args.options) #args.output_file_loc