Category: Data Science

  • Data Swamp

    My dad has spent some of his retirement doing hobbyist machine learning projects. He heard the term “data lake” a while back and has taken to calling his datasets a “data swamp.” Feels like a terminology improvement the whole field could get behind. This is brilliant; I’ve not come across this term before, but I…

  • Medieval Buzzfeed – Debugging Dodgy Datetimes in Pandas and Parquet

    I was recently attempting to cache the results of a long-running SQL query to a local Parquet file using SQL via a workflow like this: This ended up yielding the following slightly cryptic error message: So obviously there is an issue with my published_at timestamp column. Googling didn’t help me very much; lots of people…

  • Parsing Ingredient Strings with SpaCy PhraseMatcher

    As part of my work on Gastronaut, I’m building a form that lets users create recipes and that attempts to parse ingredient lists and find a suitable stock photo for each item the user adds to their recipe. As well as being cute and decorative, this step is important for later when we…

  • Prod-Ready Airbyte Sync

    Airbyte is a tool that allows you to periodically extract data from one database and then load and transform it into another. It provides a performant way to clone data between databases and gives us the flexibility to dictate what gets shared at the field level (for example, we can copy the users table but we…
