adriens

Posted on Mar 30, 2023

🪄 Enhance/fix data quality w. openai's API 🦾

#openai #ai #datascience #python

❔ About

🤔 Sometimes a lack of data, or data quality issues, prevents you from producing insights.

💡 What if you could call AI to the rescue to fix or enhance some of that data?

I first started with some prompt engineering on ChatGPT:

☝️ Notice

Guessing gender from first names can seem useless, a bit dumb, or nerdy. Yes, but...

🗺️ This work relies on OpenAI... which acts as a universal, cross-language first-name parser
💡 This work is just an illustration of how prompt engineering and OpenAI's API can help review or fix many kinds of data quality issues... and a concrete example of how you might enrich your enterprise data pipeline

🎯 Target

The purpose of this article is to see how OpenAI's API can help on a very specific, testable dataset.

📝 `Kaggle` Notebook

In this short notebook, I will:

📥 Download data
🐼 Load data in pandas
🦾 Call OpenAI's API to guess each first name's gender
⚖️ Compare guessed vs. real data
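
The four steps above can be sketched roughly as follows. This is a minimal sketch and not the notebook's actual code: the column names `prenom` and `sexe`, the sample rows, and the `fake_guess` helper are all assumptions made for illustration. In the real notebook the CSV would come from the open data portal and the guess would come from an OpenAI API call.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the downloaded CSV.
# The real dataset's column names and separator may differ.
SAMPLE_CSV = """prenom;sexe
Marie;F
Jean;M
Louise;F
"""


def load_names(csv_text: str) -> pd.DataFrame:
    """Load semicolon-separated CSV text into a DataFrame."""
    return pd.read_csv(io.StringIO(csv_text), sep=";")


def add_guesses(df: pd.DataFrame, guess_gender) -> pd.DataFrame:
    """Return a copy of df with one guessed gender per first name."""
    out = df.copy()
    out["sexe_guessed"] = out["prenom"].map(guess_gender)
    return out


def accuracy(df: pd.DataFrame) -> float:
    """Share of rows where the guess matches the recorded value."""
    return float((df["sexe"] == df["sexe_guessed"]).mean())


def fake_guess(firstname: str) -> str:
    """Offline stand-in for the API call (hypothetical lookup table)."""
    return {"Marie": "F", "Jean": "M"}.get(firstname, "M")


df = add_guesses(load_names(SAMPLE_CSV), fake_guess)
print(df)
print(f"accuracy: {accuracy(df):.2f}")
```

Keeping the guess in its own column, next to the recorded value, is what makes the final comparison a one-liner.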

🍿 Demo

🗃️ Input Dataset

I have used the top-10-prenoms-a-noumea-depuis-1860 open dataset from data.gouv.nc:

Top 10 des Prénoms à Nouméa depuis 1860 — Open Data NC

This dataset lists the ten most frequently given first names in Nouméa since 1860, according to the civil registry. Update frequency: annual

data.gouv.nc

🤖 The `text-davinci-003` model

I used text-davinci-003, one of the GPT-3.5 models, as they can:

"understand and generate natural language or code."

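To give an idea of what calling a completion model for this task involves, here is a minimal sketch. It is not the prompt used in the notebook: the prompt wording, the function names, and the `complete` callable are assumptions. `complete` stands for whatever function sends a prompt to the model and returns the raw text; it is replaced here by an offline stub so the example runs without an API key.

```python
from typing import Callable, Optional


def build_prompt(firstname: str) -> str:
    """Build an instruction that asks for a single-letter answer (assumed wording)."""
    return (
        "Answer with a single letter, M or F.\n"
        f"What is the most likely gender for the first name '{firstname}'?\n"
        "Answer:"
    )


def parse_gender(completion_text: str) -> Optional[str]:
    """Normalize the raw completion to 'M', 'F', or None when it is unusable."""
    answer = completion_text.strip().strip(".").upper()
    if answer in ("M", "MALE"):
        return "M"
    if answer in ("F", "FEMALE"):
        return "F"
    return None


def guess_gender(firstname: str, complete: Callable[[str], str]) -> Optional[str]:
    """Send the prompt through `complete` and parse the model's answer."""
    return parse_gender(complete(build_prompt(firstname)))


def fake_complete(prompt: str) -> str:
    """Offline stub standing in for the real model call."""
    return " F\n"


print(guess_gender("Marie", fake_complete))  # F
```

Parsing the answer strictly, and returning `None` for anything unexpected, avoids silently writing a bad guess into the dataset.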
📊 Results 👏

☝️ Notice

Notice that I put the guessed value in a dedicated structure... so it can easily be flagged as AI-generated when reporting its metadata:
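
One possible shape for such a structure is sketched below. The field names (`value`, `source`, `model`) and the record layout are assumptions for illustration, not the structure used in the notebook.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class GuessedValue:
    """A guessed value plus the metadata needed to flag it as AI-generated."""
    value: Optional[str]
    source: str = "ai"                # hypothetical provenance marker
    model: str = "text-davinci-003"   # model that produced the guess


def is_ai_generated(field) -> bool:
    """True when a field carries the AI provenance marker."""
    return isinstance(field, dict) and field.get("source") == "ai"


# The recorded value stays untouched; the guess lives in its own field.
record = {
    "prenom": "Marie",
    "sexe": None,
    "sexe_guessed": asdict(GuessedValue("F")),
}

print(record["sexe_guessed"])
print(is_ai_generated(record["sexe_guessed"]))  # True
```

Because the guess never overwrites the original column, downstream reports can always separate observed values from inferred ones.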

💰 Gains

📈 Data quality
💡 Better decisions & opportunities
💸 Makes the cost of poor data quality visible (API calls are not free)
🧠 Create more intelligence

👨‍🔬 Further optimizations

Benchmark models to spend as little money as possible while getting the best possible results

🔭 News & perspectives
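
Such a benchmark could be sketched as follows. The model names, accuracy counts, and costs are placeholder values invented for the example, not measurements; the idea is only to pick the cheapest model that still meets an accuracy target.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelRun:
    """Outcome of running one model over the same test set (placeholder data)."""
    name: str
    correct: int
    total: int
    cost_usd: float

    @property
    def accuracy(self) -> float:
        return self.correct / self.total


def cheapest_meeting_target(runs: List[ModelRun], min_accuracy: float) -> Optional[ModelRun]:
    """Return the lowest-cost run whose accuracy reaches the target, if any."""
    eligible = [run for run in runs if run.accuracy >= min_accuracy]
    return min(eligible, key=lambda run: run.cost_usd) if eligible else None


# Hypothetical results: names and numbers are made up for illustration.
runs = [
    ModelRun("model-a", correct=95, total=100, cost_usd=2.00),
    ModelRun("model-b", correct=93, total=100, cost_usd=0.20),
    ModelRun("model-c", correct=70, total=100, cost_usd=0.05),
]

best = cheapest_meeting_target(runs, min_accuracy=0.90)
print(best.name if best else "no model meets the target")  # model-b
```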

Top comments (9)

adriens • Jun 12 '23

Adrien SALES

@rastadidi

💡 Using #OpenAI #APIs to enhance #data quality
youtu.be/vDXkRrkfqRc

#ArtificialIntelligence #DataScience #DataScientist

22:39 PM - 12 Jun 2023

adriens • Apr 21 '23

Programmer Humor

@pr0grammerhum0r

Thanks to u/Calslock for the original idea reddit.com/r/programmerhu…

01:00 AM - 21 Apr 2023

adriens • Apr 17 '23

Making the most of AI: The latest lessons from MIT Sloan Management Review | MIT Sloan

Knowing how to evaluate AI tools, manage data effectively, and share data strategically will help leaders see the results from their AI investments.

mitsloan.mit.edu

adriens • Apr 10 '23

VentureBeat

@venturebeat

To make the most of #AI infrastructure, organizations must evaluate the value of deploying centrally, regionally or locally. Here's how businesses can get started: #ArtificialIntelligence #autonomousvehicles #privacy #compliance @Equinix venturebeat.com/ai/how-busines…

18:07 PM - 10 Apr 2023

adriens • Mar 30 '23

Adrien SALES

@rastadidi

Hi @OpenAI @kaggle @datagouvnc from #newcaledonia #OpenData @Opendatasoft twitter.com/rastadidi/stat…

21:17 PM - 30 Mar 2023

Adrien SALES @rastadidi
🥲 Lacking #data ❔ 🥹 Having #dataquality issues ❓ 😭 ... then stuck on building #intelligence and achieve serious #digitaltransformation ⁉️ 💡 See how putting OpenAI in your ingestion #pipeline can help 👇 https://t.co/Otjfj73Vdl #openai #openai #datascientist #Python

adriens • Mar 30 '23

Adrien SALES

@rastadidi

🎁 @datagouvnc
#OpenData #newcaledonia #nouvellecaledonie #noumea

21:15 PM - 30 Mar 2023

adriens • Mar 30 '23 • Edited

Adrien SALES

@rastadidi

🥲 Lacking #data ❔
🥹 Having #dataquality issues ❓
😭 ... then stuck on building #intelligence and achieve serious #digitaltransformation ⁉️
💡 See how putting OpenAI in your ingestion #pipeline can help 👇
dev.to/adriens/enhanc…
#openai #openai #datascientist #Python

20:40 PM - 30 Mar 2023
