Automating data cleaning and formatting

Use AI to handle missing values, remove duplicates, standardise formats, and more.

Tags Streamline Icon: https://streamlinehq.com
Wrench Streamline Icon: https://streamlinehq.com
Uses
People Man Graduate Streamline Icon: https://streamlinehq.com
Beginner

By automating data cleaning, you can save hours of manual time and effort to get accurate and complete data sets for future analysis and reporting.

In this tutorial, we'll show you how to leverage ChatGPT to automate your data cleaning.

Many of these actions can be performed within spreadsheets, but working with ChatGPT becomes powerful when you start asking for non-traditional data transformations.

Steps:

  1. Upload your data file and ask ChatGPT to remove duplicates
  2. Request blank rows be removed
  3. Insert placeholder values in blank cells
  4. Ask for values to be standardized with specific formatting
  5. Label data based on defined rules

Once you complete this tutorial, you should check out our tutorial on how to analyse sales data with ChatGPT as a next step.

Products Give Gift 1 Streamline Icon: https://streamlinehq.com

Try before you bite?
This one's on us.

Sign up for a free account to view our free courses

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Upload your data file

Start by uploading your data file (e.g., CSV) to ChatGPT. Then, ask ChatGPT to remove any duplicate data. Provide a short description of what constitutes a duplicate in your data. In each step, ask ChatGPT to provide you a preview of the data transformation prior to having a new data file generated. This will save you time.

💡 Tip: Occasionally, ChatGPT will have an issue loading the data from the file you provided. In these cases, it will continue to re-run. If the issue persists, you can open a new chat window and re-upload your file.

Suggested prompt:

Here's my data file. Can you remove any duplicate data? Duplicates will have the same name. Please provide a preview before generating an updated data file.

If the preview looks good, instruct ChatGPT to proceed with generating the updated data file.

Request blanks rows be removed

To remove blank rows, instruct ChatGPT to remove them. Make sure to provide detail on what you consider a blank row and ask for a preview for confirmation.

💡 Tip: You do not need to re-upload your data file if you continue to work with ChatGPT in the same chat window. However, the context window can expire or timeout. If it does, you can re-upload the newest version of the data file or start a chat in a new window.

Suggested prompt:

Can you remove any blank rows? I consider any row with more than two columns blank to be a blank row. Please provide a preview before generating an updated data file.

If the preview looks good, instruct ChatGPT to proceed with generating the updated data file.

Insert placeholder values in blank cells

To insert placeholder values in blank cells, instruct ChatGPT to insert specific values for these cells. Make sure to provide detail on what you consider a blank cell, what value(s) you want inserted, and ask for a preview for confirmation.

Suggested prompt:

Can you insert the text “N/A” into any blank cells? I consider a blank cell any cell with NULL or NaN as a value. Please provide a preview before generating an updated data file.

Ask for values to be standardized with specific formatting

To standardize values with specific formatting, tell ChatGPT what columns you want formatted and how you want them formatted.

💡 Tip: If you have a specific example(s) within the dataset that is a good representation of what you want, provide this example to ChatGPT to ensure an accurate transformation.

Example prompt:

Two columns in my data file don’t have consistent or standardized formatting: Age and Income. For the Age column, can you update all cells to just be an integer? For instance, one Age value in the data file is “31.0 years”, I want this to be “31”. For the income column, can you reformat them all into USD?

Label data based on defined rules

To label data, provide clearly defined rules and column names for this new data. In this example, I’m going to have ChatGPT categorize anyone in my data set that is between the ages of 25-40 as a “Millennial”. I’m also going to instruct ChatGPT to categorize anyone with an income >$80K USD as a “Target Customer”.

Example prompt:

I want to create two new columns in my data set: Age Category and Customer Category. Can you create these two new columns and for any row with an Age between 25 - 40, set their Age Category to “Millennial”. For any row with an Income >$80K USD, set their Customer Category to “Target Customer”.

This tutorial was created by Garrett.

Get full access

✔️ All 100+ courses & tutorials in our catalog
✔️ New content added weekly
✔️ Private community access
✔️ No subscription, $150 paid once
✔️ Expense it using this template. Or get a team account.
✔️ 30-day refund policy. No questions asked
Join 3,107 learners from companies like Microsoft, Coca Cola, NBA, Adobe & Google

More tutorials like this

View all

If you scrolled this far, you must be a little interested...

Start learning ->

Join 3,107 professionals already learning