Tech hack example: using regex to clean and format data

A creative tech hack - using regular expressions and spreadsheets to clean and format data for input

I created a data cleaning and formatting method using regular expressions to accelerate web content setup for customers at a wholesale food marketplace startup.

At the suggestion of the engineering team, I learned how to form regular expressions for this task and utilize the Regexextract function in Google Sheets.

  • View a screenshot of the implementation in Google Sheets

Company context: As the data team and onboarding manager at Buyer’s Best Friend, a B2B wholesale food online marketplace startup, fast onboarding and web page creation for customers was necessary to sustain growth and momentum. The task was to process the customer’s often messy product and pricing data so it could be uploaded to the web and displayed on the customer’s page.

Process for this document: Customer product and pricing content was pulled from PDF’s and websites and pasted raw into Google Sheets. The typical first pass using regular expressions was to clean out unwanted spacing and special characters. Near the end of the process, the cleaned data types needed additional separation and sometimes reformatting. The spreadsheet shown performs this for one customer.

Audience and Problems to solve: These spreadsheets were used internally by the data team, which I managed, and saved significant time in cleaning data vs. manual cleaning, which was usually accomplished by find-and-replace or line-by-line deletion. For some larger data sets consisting of hundreds of SKU’s, the time to process the data and complete uploading was reduced from 10-20 hours to 2-3 hours.

I maintain working copies of this and other spreadsheets using this technique. If you’d like access to try it out, let me know.