
Simple data generation with Python


It’s always useful to have a way of generating data to populate databases, create realistic-looking datasets for honeypots, and test data processing scripts.

I’ve had a few different scripts that generate names and email addresses, but those have now been retired in favour of a new, very basic script that generates mock personal data.

The script can output mock personal data files containing:

  • a unique ID (which could serve as a customer ID)
  • first name (from a names list)
  • surname (from a names list)
  • date of birth (generated within a defined age range)
  • age (as determined by the date of birth and the current date)
  • email address (derived from the name and a made-up email domain).

For the data to look realistic I’ve made use of some wordlists from https://github.com/dominictarr/random-name.

Using wget I pulled down the first names list, and the names list, which serves as the list of last names:

wget https://raw.githubusercontent.com/dominictarr/random-name/master/first-names.txt
wget https://raw.githubusercontent.com/dominictarr/random-name/master/names.txt
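
If wget is not available, the same two files could be fetched with Python’s standard library. This is only a minimal sketch and not part of the original script; the helper names wordlist_urls and download_wordlists and the dest_dir parameter are my own assumptions.

```python
import os
import urllib.request

# Base URL of the raw files in the random-name repository (same URLs as the wget commands)
BASE_URL = "https://raw.githubusercontent.com/dominictarr/random-name/master/"
WORDLISTS = ["first-names.txt", "names.txt"]


def wordlist_urls():
    """Return the full download URL for each wordlist."""
    return [BASE_URL + name for name in WORDLISTS]


def download_wordlists(dest_dir="."):
    """Hypothetical helper: save each wordlist into dest_dir (needs network access)."""
    for url in wordlist_urls():
        target = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        urllib.request.urlretrieve(url, target)
        print("Saved", target)
```

Calling download_wordlists() should leave first-names.txt and names.txt in the current directory, ready to be passed to the script.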

I also defined a few fake domains in a file called fake-domains.txt to serve as a third input. It looks like:

ghmail.com
ghmail.co.uk
hatmail.co
fam-domain.uk
reallylonghardtorememberdomain.com
orangesandlemons.cc
totallymadeupdomain.ie
pretenditaliandomain.it

After playing about with some code for a bit I had a working script that was too complicated to use: I’d defined each output option as its own argument or flag, and they were hard to remember. I reworked the code to take a single options string as an input that defines what the script should do.
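
To illustrate the options-string idea in isolation, here is a minimal sketch of turning a space-delimited KEY=value string into settings. The names parse_options and DEFAULTS are illustrative assumptions; only the option names and their defaults come from the script.

```python
def parse_options(options, defaults):
    """Parse a space-delimited "KEY=value" string into a dict, starting from defaults."""
    parsed = dict(defaults)
    for option in options.split():
        key, sep, value = option.partition("=")
        if not sep or key not in parsed:
            continue  # ignore anything that is not a known KEY=value pair
        if isinstance(parsed[key], bool):
            # Boolean options are only switched on by the literal string "True"
            parsed[key] = value == "True"
        else:
            parsed[key] = int(value)
    return parsed


DEFAULTS = {"DOB_GEN": False, "DOB_LOWER": 18, "DOB_UPPER": 100, "DOB_AGE": False}
opts = parse_options("DOB_GEN=True DOB_LOWER=16", DEFAULTS)
print(opts)
# {'DOB_GEN': True, 'DOB_LOWER': 16, 'DOB_UPPER': 100, 'DOB_AGE': False}
```

Keeping the defaults in one dictionary means a new option only needs one new entry, at the cost of supporting just booleans and integers.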

I settled on the following code:

#!/usr/bin/env python3

import sys
import random
import datetime
from dateutil.relativedelta import relativedelta  # third-party: pip install python-dateutil
import argparse

# data-generator first-names.txt names.txt fake-domains.txt -c 1000000 -u -o "DOB_GEN=True DOB_AGE=True DOB_LOWER=1 DOB_UPPER=100"

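class file_read(object):
    """ Reads an input file into a list of lines and exits if the file cannot be used """

    def __init__(self,input_file):
        self.file_data = []

        try:
            with open(input_file) as f:
                for line in f:
                    # Skip blank lines so rand() never returns an empty value
                    if line.strip():
                        self.file_data.append(line.strip())
        except (OSError, UnicodeDecodeError) as err:
            print("Error: Issue reading input file:",input_file,err,file=sys.stderr)
            sys.exit(1)

        self.file_data_total = len(self.file_data)
        if self.file_data_total == 0:
            print("Error: Input file is empty:",input_file,file=sys.stderr)
            sys.exit(1)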

    def print(self):
        for line in self.file_data:
            print(line)

    def rand(self):
        return self.file_data[random.randrange(self.file_data_total)]

    def total(self):
        return self.file_data_total

def generate_data(firstname_file_loc,surname_file_loc,domain_file_loc,COUNT,UUID_GEN,DELIMITER,OPTIONS="",VERBOSE=False):
    # UUID_GEN: when True, each output record is prefixed with an incremental unique ID

    # Option flags and their defaults
    DOB_GEN = False   # when True, each output record has a generated date of birth
    DOB_LOWER = 18    # lower age limit (in years) for generated dates of birth
    DOB_UPPER = 100   # upper age limit (in years) for generated dates of birth
    DOB_AGE = False   # when True, the age at the time of generation is output after the DOB

    # Options read from user input. OPTIONS is space delimited and uses = to assign, e.g. DOB_GEN=True
    for option in OPTIONS.split():
        key, _, value = option.partition("=")
        if key == "DOB_GEN":
            DOB_GEN = value == "True"
        elif key == "DOB_LOWER":
            DOB_LOWER = int(value)
        elif key == "DOB_UPPER":
            DOB_UPPER = int(value)
        elif key == "DOB_AGE":
            DOB_AGE = value == "True"

    # Try to read the files
    if VERBOSE:
        print("Attempting to read all input files")

    num_to_generate = int(COUNT)

    firstname_file = file_read(firstname_file_loc)
    surname_file = file_read(surname_file_loc)
    domain_file = file_read(domain_file_loc)


    ts_now = datetime.datetime.now() # Used for age comparisons
    # Subtract calendar years (not days*365) so leap days do not shift the requested age range
    min_years_ago = int((ts_now - relativedelta(years=DOB_LOWER)).timestamp()) # timestamp of the youngest allowed DOB
    max_years_ago = int((ts_now - relativedelta(years=DOB_UPPER)).timestamp()) # timestamp of the oldest allowed DOB

    for i in range(num_to_generate):
        uuid = ""
        if UUID_GEN:
            uuid = str(i+1) + DELIMITER # incremental ID starting at 1
        dob = ""
        age = ""
        if DOB_GEN:
            # Pick a random moment between the oldest and youngest allowed DOB
            gend = datetime.datetime.fromtimestamp(random.randrange(max_years_ago,min_years_ago))
            dob = gend.strftime('%Y-%m-%d') + DELIMITER

            if DOB_AGE:
                # Age in whole years; gend and ts_now are both naive local times
                age = str(relativedelta(ts_now, gend).years) + DELIMITER

        firstname = firstname_file.rand()
        lastname = surname_file.rand()
        domain = domain_file.rand()
        email = firstname + "." +  lastname + "@" + domain
        gen_data = uuid + firstname + DELIMITER + lastname + DELIMITER +  dob + age + email

        print(gen_data)


if __name__ == "__main__":
    
    parser = argparse.ArgumentParser()
    parser.add_argument('firstname_file_loc', nargs='?')
    parser.add_argument('lastname_file_loc', nargs='?')
    parser.add_argument('domain_file_loc', nargs='?')
    parser.add_argument("-d","--delimiter",type=str, default=",", help="Set the delimiter character.")
    parser.add_argument("-u","--uniqueid", action='store_true', help="Prefix each record with an incremental unique ID.")
    parser.add_argument("-o","--options",type=str,default="",help="Space-delimited options string, e.g. \"DOB_GEN=True DOB_AGE=True DOB_LOWER=16 DOB_UPPER=75\"")
    parser.add_argument("-v","--verbose", action='store_true', help="Enable verbose output to help with debugging.")
    parser.add_argument("-c","--count",type=int, default=1,help="Number of records to create.")
    args = parser.parse_args()

    # Exit with a non-zero status when a required input file is missing
    if args.firstname_file_loc is None:
        print("You need to provide a firstname file")
        sys.exit(1)
    if args.lastname_file_loc is None:
        print("You need to provide a lastname file")
        sys.exit(1)
    if args.domain_file_loc is None:
        print("You need to provide a domainname file")
        sys.exit(1)

    generate_data(args.firstname_file_loc,args.lastname_file_loc,args.domain_file_loc,args.count,args.uniqueid,args.delimiter,args.options,args.verbose)
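
The date-of-birth and age logic can also be expressed with the standard library only. The sketch below is an alternative under my own assumptions, not the script’s method: it works on whole dates instead of timestamps, and the helper names years_before, random_dob and age_on are hypothetical.

```python
import datetime
import random


def years_before(day, years):
    """Return the date `years` calendar years before `day` (29 Feb falls back to 28 Feb)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def random_dob(today, lower, upper, rng=random):
    """Pick a random date of birth whose age on `today` lies between lower and upper."""
    # The day after turning upper+1 is the oldest date that is still `upper` years old
    oldest = years_before(today, upper + 1) + datetime.timedelta(days=1)
    youngest = years_before(today, lower)
    span = (youngest - oldest).days
    return oldest + datetime.timedelta(days=rng.randrange(span + 1))


def age_on(dob, today):
    """Age in whole years on `today`."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


today = datetime.date.today()
dob = random_dob(today, 16, 75)
print(dob.isoformat(), age_on(dob, today))
```

Working on dates avoids timestamps altogether, which also sidesteps any platform limits on timestamps before 1970.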

To run the script I pass the downloaded name files and my made-up list of domain names, ask for 10 records (-c 10), enable unique IDs (-u), and use the options string to request a date of birth for ages 16 to 75 along with the age itself:

data-generator first-names.txt names.txt fake-domains.txt -c 10 -u -o "DOB_GEN=True DOB_AGE=True DOB_LOWER=16 DOB_UPPER=75"

Running the command provided the following output:

1,Giacinta,Randene,1972-01-13,48,Giacinta.Randene@hatmail.co
2,Shawnee,Sharona,1993-11-07,26,Shawnee.Sharona@ghmail.co.uk
3,Bernetta,Vrablik,1953-03-03,67,Bernetta.Vrablik@pretenditaliandomain.it
4,Loleta,Elaine,1996-06-09,24,Loleta.Elaine@ghmail.co.uk
5,Dehlia,Warp,1977-09-24,42,Dehlia.Warp@ghmail.co.uk
6,Lora,Betthel,2004-02-05,16,Lora.Betthel@pretenditaliandomain.it
7,Elisha,Bradlee,1989-01-15,31,Elisha.Bradlee@pretenditaliandomain.it
8,Hedy,Sky,1948-05-19,72,Hedy.Sky@ghmail.co.uk
9,Babara,Krenek,1969-03-27,51,Babara.Krenek@pretenditaliandomain.it
10,Doralin,Hilario,1968-06-28,52,Doralin.Hilario@pretenditaliandomain.it
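
Since one purpose of the data is testing data processing scripts, here is a small sketch that reads records in the format shown above with the csv module. The column names in FIELDS are my own labels for the six fields, and the two sample rows are copied from the output above.

```python
import csv
import io

# Assumed labels for the six comma-separated fields in the sample output
FIELDS = ["id", "firstname", "surname", "dob", "age", "email"]

SAMPLE = """1,Giacinta,Randene,1972-01-13,48,Giacinta.Randene@hatmail.co
6,Lora,Betthel,2004-02-05,16,Lora.Betthel@pretenditaliandomain.it
"""


def read_records(handle):
    """Yield one dict per generated record, converting id and age to integers."""
    for row in csv.DictReader(handle, fieldnames=FIELDS):
        row["id"] = int(row["id"])
        row["age"] = int(row["age"])
        yield row


records = list(read_records(io.StringIO(SAMPLE)))
print(records[1]["email"])  # Lora.Betthel@pretenditaliandomain.it
```

The same function should work on a real file by passing an open file handle instead of the io.StringIO wrapper, provided the records were generated with the same fields and a comma delimiter.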

To do

There’s a lot that can still be done. It would be good to:

  • Allow the position of fields to be defined
  • Allow the user to define the start unique ID value
  • Generate other files with related data that links back to the original data set (using the unique ID as a key)
  • Support mysqldump-style output
  • Increase the speed
  • Add postcode and address generation
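
As a rough idea of how the first two items might work, the sketch below builds each record from a dictionary and lets the caller choose the field order and the first ID. It is only one possible design; format_record, number_records, field_order and start_id are hypothetical names that do not exist in the current script.

```python
import itertools


def format_record(record, field_order, delimiter=","):
    """Join the chosen fields of a record in the requested order."""
    return delimiter.join(str(record[name]) for name in field_order)


def number_records(records, start_id=1):
    """Attach an incremental ID to each record, starting at start_id."""
    for uid, record in zip(itertools.count(start_id), records):
        yield {"id": uid, **record}


people = [
    {"firstname": "Hedy", "surname": "Sky"},
    {"firstname": "Lora", "surname": "Betthel"},
]
for rec in number_records(people, start_id=1000):
    print(format_record(rec, ["surname", "firstname", "id"]))
# Sky,Hedy,1000
# Betthel,Lora,1001
```

Holding each record as a dictionary until the last moment keeps field ordering a pure formatting concern, which might also make other output formats easier to add later.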