Category: Data Science
-
Data Swamp
My dad has spent some of his retirement doing hobbyist machine learning projects. He heard the term “data lake” a while back and has taken to calling his datasets a “data swamp.” Feels like a terminology improvement the whole field could get behind. This is brilliant. I’ve not come across this term before but I…
-
Medieval Buzzfeed – Debugging Dodgy Datetimes in Pandas and Parquet
I was recently attempting to cache the results of a long-running SQL query to a local Parquet file using SQL via a workflow like this: This ended up yielding the following slightly cryptic error message: So obviously there is an issue with my published_at timestamp column. Googling didn’t help me very much; lots of people…
-
Parsing Ingredient Strings with SpaCy PhraseMatcher
As part of my work on Gastronaut, I’m building a form that lets users create recipes and that attempts to parse ingredient lists and find a suitable stock photo for each item the user adds to their recipe. As well as being cute and decorative, this step is important for later when we…
-
Prod-Ready Airbyte Sync
Airbyte is a tool that allows you to periodically extract data from one database and then load and transform it into another. It provides a performant way to clone data between databases and gives us the flexibility to dictate what gets shared at the field level (for example, we can copy the users table but we…