3 Resources To Jumpstart Your Data Science Journey:
- Data Science For Beginners Curriculum - learn the fundamentals!
- Data Science In Visual Studio Code - learn to use core tools!
- PyDataNYC 2023 Workshop - fork it, and bring your own dataset!
Welcome to the fourth post in my This Week in AI News series! Today, I want to talk about Data Science - but first, some context.
1 | 2024 Resolution: Weekly Blog Series
Like many of you, I've had an aspirational goal to blog more. But finding time to plan content and publish it regularly was hard. Then I joined the AI Advocacy team and found myself learning, building, and collaborating on a number of interesting tools, technologies, and projects in AI, ML, and Data Science. Content problem solved.
Now all I needed was accountability. Say hello to This Week in AI News. In 2024, my resolution is to write one post a week - with a deep dive into one topic at a time. Want more ways to learn?
- Join AI Developers & Entrepreneurs - our in-person meetup in NYC
- Follow my LinkedIn feed - get AI news & updates as they happen
2 | This Week: Skilling Up On Data Science
As AI enters every aspect of our daily lives, it becomes useful (if not necessary) to build skills in both data science and app development. And you don't need to be a Python or Data Science expert to become productive! Instead, take advantage of core developer tools and AI assistance, and build your knowledge and expertise by playing with real data. That was the core message in my PyData NYC 2023 talk.
I had two core objectives:
- Focused Learning. Prioritize learning just what you need to learn to get closer to completing the targeted task. Don't boil the ocean. Optimize your time.
- Transferable Learning. Use the right tools and techniques to allow your learning to be shared with others, making it easy for you to collaboratively debug issues with experts (now) and reuse/extend as templates (later).
And my philosophy was simple. Don't try to learn everything. You cannot match the knowledge of those with decades of experience. Instead, focus on solving a problem so your knowledge is tied to practical usage.
3 | Resource: PyData NYC Workshop
Want to explore this further? First, here are 3 links to bookmark and revisit, from my PyData NYC talk:
3.1 Video Recording
Watch the talk replay on YouTube here:
3.2 Presentation Slides
Flip through the slides as you watch to get better context.
3.3 Workshop Repo
Fork this repo and follow along if you meet the prerequisites. Note: the repo will be actively updated over the next few weeks with more exercises. The PyData NYC version will remain archived in the pydata-nyc-2023 branch.
4 | Overview: What You'll Learn
By the end of that workshop / talk, you should be able to:
- Use Jupyter Notebooks - and explain how they help with transferable learning.
- Use GitHub Codespaces - and explain how they simplify setup with prebuilt dev environments.
- Use GitHub Copilot - and explain how it supports focused learning with context-driven suggestions.
- Analyze and Visualize Data - using an open-source dataset and traditional tools & techniques.
- Grow your Data Intuition - using Project LIDA (with OpenAI) to define goals and get suggestions.
The last two steps are particularly important to differentiate because data science is still more of an art than a science. Given a dataset, extracting insights can have two paths:
- Known questions. The first is where you want to use the data to answer a specific question - and you just need help with the right tools & process to get there.
- Unknown questions. The second is where you don't know what questions to ask in the first place - and you need to start building up your intuition on how to define "what insights are possible" with a given dataset.
5 | Exploratory Data Analysis: IPL 2022
The first step is to do exploratory data analysis (EDA) with the given dataset, using traditional tools and techniques to get preliminary insights. For my talk I used a Kaggle IPL 2022 dataset for 3 reasons:
- Kaggle datasets provide a Data Card that gives us details on the data structure, origins and usage requirements.
- The Kaggle community publishes sample EDA notebooks like this IPL 2022 example that can provide learning inspiration.
- You can build on your learning journey by trying new datasets and exploring more EDA notebooks, reusing the same development environment as a template.
How do you get started with your exploratory data analysis (EDA) task? Look at EDA examples and see how you can replicate those steps on your own, using GitHub Codespaces (prebuilt environment) and GitHub Copilot (code explanations and coaching) to streamline learning.
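To make this concrete, here is a minimal EDA sketch in Pandas. It assumes a hypothetical ipl_2022.csv export of the Kaggle dataset - the actual filename and column names will depend on the dataset version you download.

```python
# Minimal exploratory data analysis (EDA) sketch.
# Assumes a hypothetical "ipl_2022.csv" export of the Kaggle IPL 2022 dataset;
# adjust the filename and column names to match your copy.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("ipl_2022.csv")

# First look: shape, column types, and missing values.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Summary statistics for numeric columns.
print(df.describe())

# Example question: how many matches did each team win?
# ("match_winner" is a hypothetical column name, used for illustration.)
win_counts = df["match_winner"].value_counts()
win_counts.plot(kind="barh", title="Matches won per team")
plt.tight_layout()
plt.show()
```

Even this small loop - inspect, summarize, visualize one question - is enough to start forming (and checking) hypotheses about the data.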
What if you didn't have example EDAs to learn from? How can you build your own intuition for what you can do? Let's talk about Project LIDA from Microsoft.
6 | LIDA: Generate Data Visualizations with AI
Project LIDA is a Microsoft Research open-source project that can automatically generate data visualizations from your dataset, using the power of Large Language Models. The screenshot below shows the approach LIDA takes - converting a dataset into a compact natural language representation (context) that can be used with a variety of Large Language Model providers (OpenAI, Azure OpenAI, HuggingFace) and data visualization libraries (matplotlib, d3, seaborn, etc.).
Most importantly, it allows me (as a developer) to focus on defining my task using natural language prompts, with appropriate system contexts, not just to automate the visualization but also to get suggestions that build my intuition.
For example: I can specify a persona with an interest in a specific team (Mumbai Indians), and data visualization goals are auto-generated for me to derive insights relevant to that team.
And I can use natural language prompts to switch the underlying visualization libraries in a more intuitive way, without having to know the deeper details of the Python APIs or functionality.
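As a rough sketch of what this looks like in code (based on the LIDA README - exact parameter names may have changed since the talk, and the dataset path and persona text here are purely illustrative):

```python
# Sketch of LIDA usage; assumes an OpenAI API key is configured in the environment.
# The CSV path and persona text are illustrative placeholders, not part of LIDA.
from lida import Manager, TextGenerationConfig, llm

lida = Manager(text_gen=llm("openai"))
textgen_config = TextGenerationConfig(n=1, temperature=0.2, model="gpt-3.5-turbo")

# 1. Summarize the dataset into a compact natural-language context.
summary = lida.summarize("ipl_2022.csv", textgen_config=textgen_config)

# 2. Auto-generate visualization goals for a persona interested in one team.
goals = lida.goals(
    summary,
    n=3,
    persona="a fan interested in the Mumbai Indians",
    textgen_config=textgen_config,
)

# 3. Generate a chart for the first goal, choosing the plotting library by name.
charts = lida.visualize(
    summary=summary,
    goal=goals[0],
    library="seaborn",
    textgen_config=textgen_config,
)
```

Note how the library choice is just a string argument - swapping "seaborn" for "matplotlib" changes the generated code without you touching the plotting APIs yourself.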
If you find this interesting, check out the LIDA Project organization on GitHub for some side projects using LIDA - including a Codespaces template that you can instantiate and use for your own experiments.
7 | Visual Studio Code: For Data Science
We covered a lot, but the main goal here was to help you realize how much developer tools and AI can help you jumpstart your journey into data science, even as a non-Python developer. Every learning journey starts with setting up a development environment that empowers you - and I love Visual Studio Code. So here are two resources I recommend to help you get started:
7.1 | Data Science Tutorial With VS Code
This Visual Studio Code series will teach you everything you need to know to get started using VS Code for your data science development - from basic notebooks to training and deploying machine learning solutions on Azure.
7.2 | Data Wrangler Extension for VS Code
Data Wrangler is a code-centric data cleaning tool that is integrated into VS Code and VS Code Jupyter Notebooks - it was released in early 2023 as a tool for data scientists to speed up data preparation and analysis. It does this by providing a rich user interface that shows insightful data visualizations and automatically generates the corresponding Pandas code.
For example: in our IPL 2022 analysis, we could have run this tool early to identify how representative the data was in terms of coverage across all the features and contexts for our analysis. For a step-by-step tutorial on using Data Wrangler, check out their docs.
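To give a feel for the output, here is an illustrative, hand-written example of the style of Pandas cleanup code a tool like Data Wrangler generates from point-and-click operations - the actual generated code depends on the steps you perform in its UI, and the column names below are hypothetical.

```python
# Illustrative example of the style of Pandas code generated by a data-cleaning UI;
# column names ("match_winner", "venue", "match_date") are hypothetical placeholders.
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows with missing values in key columns.
    df = df.dropna(subset=["match_winner", "venue"])
    # Normalize team names to a consistent case.
    df["match_winner"] = df["match_winner"].str.strip().str.title()
    # Convert the match date column to a proper datetime type.
    df["match_date"] = pd.to_datetime(df["match_date"], errors="coerce")
    return df

df = pd.read_csv("ipl_2022.csv")
df_clean = clean_data(df)
print(df_clean.head())
```

The value is less in the code itself and more in having a reviewable, rerunnable record of every cleanup step you applied.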
Summary
That was a lot!! But hopefully it inspires you to go out and grab a dataset and start your own data analysis journey. Watch this space for an updated workshop and tutorials that will showcase a few more data science tools and tips later this year.
3 Resources To Jumpstart Your Data Science Journey:
- Data Science For Beginners Curriculum - learn the fundamentals!
- Data Science In Visual Studio Code - learn to use core tools!
- PyDataNYC 2023 Workshop - fork it, and bring your own dataset!
And follow the Azure org for more articles from the Developer Relations team.