One step at a time


Data is a messy business.

Genealogical data is easier to access than ever, but it still depends an awful lot on the fallible humans who index it. So whatever data I collate, I have to wrangle (or munge) before I can use it.

A few years ago, when I first contemplated a one-name study, I would in all likelihood have used Microsoft Excel. It was the one tool I knew from office work: easy to use, with automation available through Visual Basic for Applications.

Since then, I have continued to study my own family tree and have changed jobs and careers, moving into Data Transformation. This has expanded my toolset considerably. I now have SQL skills (for relational databases such as Oracle), NoSQL skills (for graph databases such as Neo4j) and some (admittedly basic) Python skills involving Pandas. For a layman, Pandas is best described as Excel for Python, but on steroids.

Although my toolset has grown, the data remains just as bad. I’m going to start with the England & Wales census records, beginning with 1841 and working my way through to 1911.

Storing the information, working out what I can from it and posting some results here will hopefully show how messy things can get.
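To give a flavour of the wrangling involved, here is a minimal sketch in plain Python (deliberately not Pandas, to keep it dependency-free). The records, the surname "Example", the field names and the `clean_record` helper are all invented for illustration; they are assumptions, not real census entries or the layout of any real index.

```python
import re

# Hypothetical index entries: invented values showing typical transcription mess.
RAW_RECORDS = [
    {"surname": " EXAMPLE ", "forename": "john", "year": "1841", "age": "35"},
    {"surname": "example", "forename": "Jno.", "year": "1851", "age": "45 "},
    {"surname": "Example", "forename": "Mary", "year": "1861", "age": "abt 50"},
]

# The census years this study intends to cover, 1841 through 1911.
CENSUS_YEARS = {1841, 1851, 1861, 1871, 1881, 1891, 1901, 1911}


def clean_record(raw):
    """Return a tidied copy of one raw index entry (illustrative only)."""
    # Collapse stray whitespace and normalise capitalisation.
    surname = " ".join(raw["surname"].split()).title()
    forename = " ".join(raw["forename"].split()).title()
    year = int(raw["year"])

    # Pull the first run of digits out of the age field, if there is one.
    age_text = raw["age"].strip()
    match = re.search(r"\d+", age_text)
    age = int(match.group()) if match else None

    return {
        "surname": surname,
        "forename": forename,
        "year": year,
        "age": age,
        # Anything other than a bare number (e.g. "abt 50") is flagged for review.
        "age_uncertain": not age_text.isdigit(),
        "valid_year": year in CENSUS_YEARS,
    }


cleaned = [clean_record(record) for record in RAW_RECORDS]
for record in cleaned:
    print(record)
```

Even this toy version leaves judgment calls open: `str.title()` would mangle a surname like "McDonald", and an abbreviation such as "Jno." still needs a human (or a lookup table) to decide it means "John".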
