Ever felt like you’re drowning in data, wishing you had a magic wand to transform it into compelling stories and future insights? Prepare to embark on an exhilarating journey where you’ll wield the power of Python’s mightiest tools!
Discover how NumPy can make numbers dance to your tune, how Pandas can tame unruly datasets into elegant structures, and how Matplotlib and Seaborn can paint breathtaking visualizations that reveal hidden patterns.
But that’s not all! Get ready to unlock the secrets of Machine Learning with scikit-learn, where you’ll build intelligent systems that can classify the world around you.
The best part? You can dive right in and experiment with live Python code using our built-in editor, no setup required!
Are you ready to transform from a data novice to a data wizard? Let’s begin!
NumPy – Your Numerical Powerhouse!
Ready to turbocharge your data manipulation? Enter NumPy! This library is the go-to for efficient numerical operations in Python, especially when dealing with arrays of numbers.
import numpy as np
# Creating NumPy arrays
my_list = [1, 2, 3, 4, 5]
numpy_array = np.array(my_list)
print(f"NumPy Array: {numpy_array}")
print(f"Data type: {numpy_array.dtype}")
zeros_array = np.zeros((2, 3))
print(f"Zeros Array:\n{zeros_array}")
random_array = np.random.rand(3, 2)
print(f"Random Array:\n{random_array}")
# Indexing and slicing
print(f"First element: {numpy_array[0]}")
print(f"Slice: {numpy_array[1:4]}")
Code Explanation
The code demonstrates basic operations with the NumPy library for numerical computing in Python:
- `import numpy as np`: Imports the NumPy library and assigns it the alias `np` for easier use.
- `my_list = [1, 2, 3, 4, 5]`: A Python list is created.
- `numpy_array = np.array(my_list)`: Converts the Python list into a NumPy array. NumPy arrays are more efficient for numerical operations.
- `print(f"NumPy Array: {numpy_array}")`: Prints the NumPy array.
- `print(f"Data type: {numpy_array.dtype}")`: Prints the data type of the elements in the array (e.g., int64). NumPy arrays have a single data type for all elements.
- `zeros_array = np.zeros((2, 3))`: Creates a 2×3 array filled with zeros.
- `print(f"Zeros Array:\n{zeros_array}")`: Prints the array of zeros.
- `random_array = np.random.rand(3, 2)`: Creates a 3×2 array with random values between 0 and 1.
- `print(f"Random Array:\n{random_array}")`: Prints the array of random numbers.
- `print(f"First element: {numpy_array[0]}")`: Accesses and prints the first element (index 0) of the NumPy array.
- `print(f"Slice: {numpy_array[1:4]}")`: Prints a slice of the NumPy array, containing the elements from index 1 up to (but not including) index 4.
The output of this code will be:
NumPy Array: [1 2 3 4 5]
Data type: int64
Zeros Array:
[[0. 0. 0.]
 [0. 0. 0.]]
Random Array:
[[0.372284   0.81945511]
 [0.72274141 0.71872124]
 [0.771274   0.21747818]]
First element: 1
Slice: [2 3 4]
Here, we’re creating NumPy arrays from Python lists, generating arrays filled with zeros, and even creating arrays with random numbers! NumPy arrays are incredibly efficient for storing and manipulating numerical data.
import numpy as np
# Array operations
array1 = np.array([10, 20, 30])
array2 = np.array([1, 2, 3])
addition = array1 + array2
subtraction = array1 - array2
multiplication = array1 * array2
division = array1 / array2
print(f"Addition: {addition}")
print(f"Subtraction: {subtraction}")
print(f"Multiplication: {multiplication}")
print(f"Division: {division}")
Code Explanation
The code demonstrates basic arithmetic operations on NumPy arrays:
- `import numpy as np`: Imports the NumPy library.
- `array1 = np.array([10, 20, 30])`: Creates a NumPy array named `array1` with the values [10, 20, 30].
- `array2 = np.array([1, 2, 3])`: Creates another NumPy array named `array2` with the values [1, 2, 3].
- `addition = array1 + array2`: Performs element-wise addition of `array1` and `array2`. The result is `[10+1, 20+2, 30+3] = [11, 22, 33]`.
- `subtraction = array1 - array2`: Performs element-wise subtraction of `array2` from `array1`. The result is `[10-1, 20-2, 30-3] = [9, 18, 27]`.
- `multiplication = array1 * array2`: Performs element-wise multiplication of `array1` and `array2`. The result is `[10*1, 20*2, 30*3] = [10, 40, 90]`.
- `division = array1 / array2`: Performs element-wise division of `array1` by `array2`. The result is `[10/1, 20/2, 30/3] = [10.0, 10.0, 10.0]`.
- The code then prints the results of each operation.
The output of this code will be:
Addition: [11 22 33]
Subtraction: [ 9 18 27]
Multiplication: [10 40 90]
Division: [10. 10. 10.]
NumPy makes it a breeze to perform element-wise operations on arrays (addition, subtraction, etc.) and to access specific elements or subsets of arrays using indexing and slicing. It’s like having a super-powered calculator for your data!
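Want to see that "super-powered calculator" idea in action? Here's a small, optional sketch (the `prices` array below is made-up example data, not part of the lesson above) showing how NumPy broadcasts a single scalar across a whole array and how comparisons produce boolean arrays:
import numpy as np
prices = np.array([10.0, 20.0, 30.0])
# Broadcasting: the scalar is applied to every element at once
discounted = prices * 0.9
with_tax = prices + 2.5
# Comparisons are also element-wise and return a boolean array
expensive = prices > 15
print(f"Discounted: {discounted}")
print(f"With tax: {with_tax}")
print(f"Expensive? {expensive}")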
Advanced NumPy Array Manipulation
Now, let’s delve deeper into the power of NumPy for numerical operations.
import numpy as np
# Reshaping arrays
arr = np.arange(12)
reshaped_arr = arr.reshape(3, 4)
print(f"Original array:\n{arr}")
print(f"Reshaped array (3x4):\n{reshaped_arr}")
# Stacking arrays
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
stacked_horizontally = np.hstack((arr1, arr2))
stacked_vertically = np.vstack((arr1, arr2))
print(f"Horizontally stacked:\n{stacked_horizontally}")
print(f"Vertically stacked:\n{stacked_vertically}")
# Boolean indexing
data = np.array([10, 25, 5, 30, 15])
mask = data > 20
filtered_data = data[mask]
print(f"Data: {data}")
print(f"Mask (data > 20): {mask}")
print(f"Filtered data (where mask is True): {filtered_data}")
# Basic linear algebra (dot product)
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
dot_product = np.dot(vector1, vector2)
print(f"Dot product of {vector1} and {vector2}: {dot_product}")
Code Explanation
The code demonstrates several fundamental operations using the NumPy library:
- `import numpy as np`: Imports the NumPy library.
- `arr = np.arange(12)`: Creates a 1-dimensional NumPy array named `arr` containing numbers from 0 to 11.
- `reshaped_arr = arr.reshape(3, 4)`: Reshapes the array `arr` into a 3×4 (3 rows, 4 columns) 2-dimensional array.
- `print(f"Original array:\n{arr}")`: Prints the original 1D array.
- `print(f"Reshaped array (3x4):\n{reshaped_arr}")`: Prints the reshaped 2D array.
- `arr1 = np.array([[1, 2], [3, 4]])` and `arr2 = np.array([[5, 6], [7, 8]])`: Creates two 2×2 NumPy arrays.
- `stacked_horizontally = np.hstack((arr1, arr2))`: Stacks `arr1` and `arr2` horizontally (side-by-side).
- `stacked_vertically = np.vstack((arr1, arr2))`: Stacks `arr1` and `arr2` vertically (one on top of the other).
- `print(f"Horizontally stacked:\n{stacked_horizontally}")`: Prints the horizontally stacked array.
- `print(f"Vertically stacked:\n{stacked_vertically}")`: Prints the vertically stacked array.
- `data = np.array([10, 25, 5, 30, 15])`: Creates a NumPy array named `data`.
- `mask = data > 20`: Creates a boolean array called `mask`. Each element in `mask` is `True` if the corresponding element in `data` is greater than 20, and `False` otherwise.
- `filtered_data = data[mask]`: Uses boolean indexing to select elements from `data` where the corresponding value in `mask` is `True`.
- `print(f"Data: {data}")`: Prints the original data array.
- `print(f"Mask (data > 20): {mask}")`: Prints the boolean mask.
- `print(f"Filtered data (where mask is True): {filtered_data}")`: Prints the filtered data.
- `vector1 = np.array([1, 2, 3])` and `vector2 = np.array([4, 5, 6])`: Creates two 1D NumPy arrays representing vectors.
- `dot_product = np.dot(vector1, vector2)`: Calculates the dot product of the two vectors.
- `print(f"Dot product of {vector1} and {vector2}: {dot_product}")`: Prints the result of the dot product calculation.
The output of this code will be:
Original array:
[ 0  1  2  3  4  5  6  7  8  9 10 11]
Reshaped array (3x4):
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
Horizontally stacked:
[[1 2 5 6]
 [3 4 7 8]]
Vertically stacked:
[[1 2]
 [3 4]
 [5 6]
 [7 8]]
Data: [10 25  5 30 15]
Mask (data > 20): [False  True False  True False]
Filtered data (where mask is True): [25 30]
Dot product of [1 2 3] and [4 5 6]: 32
Here, we’re exploring how to change the shape of NumPy arrays (`reshape`), combine arrays both horizontally and vertically (`hstack`, `vstack`), select elements based on boolean conditions (boolean indexing), and perform fundamental linear algebra operations like the dot product. These are essential skills for more advanced data analysis and machine learning tasks.
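If you'd like one more (optional) taste of these ideas, the short sketch below (using made-up `scores` data, not part of the example above) combines a boolean condition with `np.where` to label values, and uses the `@` operator as a shorthand for matrix multiplication on reshaped arrays:
import numpy as np
scores = np.array([55, 82, 47, 91, 68])
# np.where picks a value per element based on a condition
labels = np.where(scores >= 60, "pass", "fail")
print(f"Labels: {labels}")
# The @ operator performs matrix multiplication (like np.dot for 2D arrays)
a = np.arange(6).reshape(2, 3)
b = np.arange(6).reshape(3, 2)
print(f"Matrix product:\n{a @ b}")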
Pandas – Data Wrangling Wizardry!
Now, let’s talk about Pandas, the library that makes data analysis in Python a joy! Pandas provides powerful tools for working with structured data, like tables.
import pandas as pd
# Creating Pandas Series
my_series = pd.Series([10, 20, 30, 40, 50])
print(f"Pandas Series:\n{my_series}")
# Creating Pandas DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 35],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)
print(f"Pandas DataFrame:\n{df}")
Code Explanation
The code demonstrates the creation of Pandas Series and DataFrames:
- `import pandas as pd`: Imports the Pandas library, aliasing it as `pd`.
- `my_series = pd.Series([10, 20, 30, 40, 50])`: Creates a Pandas Series from a Python list. A Series is a one-dimensional labeled array.
- `print(f"Pandas Series:\n{my_series}")`: Prints the Series. The output shows the values and their corresponding index (which is automatically generated).
- `data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, 30, 22, 35], 'City': ['New York', 'London', 'Paris', 'Tokyo']}`: Creates a Python dictionary where keys are column names ('Name', 'Age', 'City') and values are lists of column data.
- `df = pd.DataFrame(data)`: Creates a Pandas DataFrame from the dictionary. A DataFrame is a 2-dimensional labeled data structure, similar to a table.
- `print(f"Pandas DataFrame:\n{df}")`: Prints the DataFrame. The output displays the data in a tabular format with labeled columns and row indices.
The output of this code will be:
Pandas Series:
0    10
1    20
2    30
3    40
4    50
dtype: int64
Pandas DataFrame:
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   22     Paris
3    David   35     Tokyo
With Pandas, you can create Series (one-dimensional labeled arrays) and DataFrames (two-dimensional tables). DataFrames are incredibly useful for organizing and analyzing data with rows and columns.
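As a quick, optional sketch (the temperature values and day labels here are invented purely for illustration), a Series can also carry a custom index, and a DataFrame offers handy inspection methods like `head()`, `describe()`, and `info()`:
import pandas as pd
# A Series with a custom index instead of the default 0, 1, 2, ...
temperatures = pd.Series([21.5, 23.0, 19.8], index=['Mon', 'Tue', 'Wed'])
print(temperatures['Tue'])
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
print(df.head())      # first rows of the DataFrame
print(df.describe())  # summary statistics for numeric columns
df.info()             # column types and non-null counts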
import pandas as pd
# Creating Pandas DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 35],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)
# Accessing data
print(f"Names column:\n{df['Name']}")
print(f"First row:\n{df.iloc[0]}")
# Filtering data
young_people = df[df['Age'] < 30]
print(f"Young People:\n{young_people}")
# Basic operations
print(f"Mean age: {df['Age'].mean()}")
print(f"Number of rows: {len(df)}")
Code Explanation
The code demonstrates basic operations on a Pandas DataFrame:
- `import pandas as pd`: Imports the Pandas library.
- `data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, 30, 22, 35], 'City': ['New York', 'London', 'Paris', 'Tokyo']}`: Creates a dictionary where keys are column names and values are lists of column data.
- `df = pd.DataFrame(data)`: Creates a Pandas DataFrame from the dictionary.
- `print(f"Names column:\n{df['Name']}")`: Accesses and prints the 'Name' column as a Pandas Series.
- `print(f"First row:\n{df.iloc[0]}")`: Accesses and prints the first row of the DataFrame using integer-based indexing (`iloc`).
- `young_people = df[df['Age'] < 30]`: Filters the DataFrame to create a new DataFrame containing only rows where the 'Age' is less than 30.
- `print(f"Young People:\n{young_people}")`: Prints the filtered DataFrame.
- `print(f"Mean age: {df['Age'].mean()}")`: Calculates and prints the mean of the 'Age' column.
- `print(f"Number of rows: {len(df)}")`: Prints the number of rows in the DataFrame.
The output of this code will be:
Names column:
0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object
First row:
Name       Alice
Age           25
City    New York
Name: 0, dtype: object
Young People:
      Name  Age      City
0    Alice   25  New York
2  Charlie   22     Paris
Mean age: 28.0
Number of rows: 4
Pandas allows you to easily access specific columns or rows, filter data based on conditions, and perform calculations like finding the mean. It's your ultimate tool for cleaning, transforming, and analyzing data!
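Here's an optional follow-up sketch on the same (made-up) people data, showing label-based selection with `.loc`, sorting with `sort_values`, and adding a derived column (the `Age_in_5_years` column name is just an illustrative choice):
import pandas as pd
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 22, 35],
                   'City': ['New York', 'London', 'Paris', 'Tokyo']})
# .loc selects by label: rows 1 through 2, only the Name and City columns
print(df.loc[1:2, ['Name', 'City']])
# Sort by a column, then add a new derived column
df_sorted = df.sort_values('Age', ascending=False)
df_sorted['Age_in_5_years'] = df_sorted['Age'] + 5
print(df_sorted)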
Powerful Pandas Operations
Pandas is your best friend when it comes to working with structured data. Let's explore some more advanced capabilities.
import pandas as pd
import numpy as np
# Grouping data
data = {'Category': ['A', 'B', 'A', 'B', 'A'],
        'Value': [10, 20, 15, 25, 12]}
df = pd.DataFrame(data)
grouped = df.groupby('Category')
print("Grouped DataFrame:\n", grouped)
print("\nMean value per category:\n", grouped['Value'].mean())
# Aggregation
aggregated = grouped.agg(['sum', 'mean', 'count'])
print("\nAggregation (sum, mean, count):\n", aggregated)
# Applying custom functions
def multiply_by_two(x):
    return x * 2
multiplied_value = df['Value'].apply(multiply_by_two)
print("\nValue multiplied by two:\n", multiplied_value)
# Handling missing data (a glimpse)
data_with_na = {'Col1': [1, 2, np.nan, 4],
                'Col2': [np.nan, 5, 6, 7]}
df_na = pd.DataFrame(data_with_na)
print("\nDataFrame with missing values:\n", df_na)
print("\nMissing values count per column:\n", df_na.isnull().sum())
df_filled = df_na.fillna(0)
print("\nDataFrame with missing values filled with 0:\n", df_filled)
Code Explanation
The code demonstrates advanced operations on Pandas DataFrames, including grouping, aggregation, custom function application, and handling missing data:
- `import pandas as pd`: Imports the Pandas library.
- `import numpy as np`: Imports the NumPy library, which is often used with Pandas.
- `data = {'Category': ['A', 'B', 'A', 'B', 'A'], 'Value': [10, 20, 15, 25, 12]}`: Creates a dictionary with data on categories and values.
- `df = pd.DataFrame(data)`: Creates a Pandas DataFrame from the dictionary.
- `grouped = df.groupby('Category')`: Groups the DataFrame by the 'Category' column. This creates a GroupBy object, which allows you to perform operations on each group.
- `print("Grouped DataFrame:\n", grouped)`: Prints the GroupBy object (which doesn't display the actual grouped data, but its structure).
- `print("\nMean value per category:\n", grouped['Value'].mean())`: Calculates and prints the mean of the 'Value' column for each group.
- `aggregated = grouped.agg(['sum', 'mean', 'count'])`: Performs multiple aggregations on the grouped data: calculates the sum, mean, and count of the 'Value' column for each category.
- `print("\nAggregation (sum, mean, count):\n", aggregated)`: Prints the aggregation results.
- `def multiply_by_two(x): return x * 2`: Defines a custom function that multiplies a value by 2.
- `multiplied_value = df['Value'].apply(multiply_by_two)`: Applies the `multiply_by_two` function to each element in the 'Value' column using the `apply` method.
- `print("\nValue multiplied by two:\n", multiplied_value)`: Prints the result of applying the custom function.
- `data_with_na = {'Col1': [1, 2, np.nan, 4], 'Col2': [np.nan, 5, 6, 7]}`: Creates a dictionary with missing values (represented by `np.nan`).
- `df_na = pd.DataFrame(data_with_na)`: Creates a DataFrame from the dictionary with missing data.
- `print("\nDataFrame with missing values:\n", df_na)`: Prints the DataFrame with missing values.
- `print("\nMissing values count per column:\n", df_na.isnull().sum())`: Calculates and prints the number of missing values in each column using `isnull()` and `sum()`.
- `df_filled = df_na.fillna(0)`: Fills the missing values with 0 using `fillna()`.
- `print("\nDataFrame with missing values filled with 0:\n", df_filled)`: Prints the DataFrame with the missing values filled.
The output of this code will be:
Grouped DataFrame:
 <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f93541b6790>

Mean value per category:
 Category
A    12.333333
B    22.500000
Name: Value, dtype: float64

Aggregation (sum, mean, count):
          Value
            sum       mean count
Category
A            37  12.333333     3
B            45  22.500000     2

Value multiplied by two:
 0    20
1    40
2    30
3    50
4    24
Name: Value, dtype: int64

DataFrame with missing values:
    Col1  Col2
0   1.0   NaN
1   2.0   5.0
2   NaN   6.0
3   4.0   7.0

Missing values count per column:
 Col1    1
Col2    1
dtype: int64

DataFrame with missing values filled with 0:
    Col1  Col2
0   1.0   0.0
1   2.0   5.0
2   0.0   6.0
3   4.0   7.0
We're now using Pandas to group data based on a column (`groupby`), perform aggregate calculations on these groups (`agg`), apply custom functions to columns (`apply`), and get a basic understanding of how to identify and handle missing data (`isnull()`, `fillna()`). These are crucial steps in any real-world data analysis workflow.
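If you want to experiment further, this optional sketch (reusing the same small category data, with result column names like `total` and `average` chosen purely for illustration) shows two other common moves: dropping or imputing missing values, and giving aggregation results readable names:
import pandas as pd
import numpy as np
df_na = pd.DataFrame({'Col1': [1, 2, np.nan, 4],
                      'Col2': [np.nan, 5, 6, 7]})
# dropna() removes any row that contains a missing value
print(df_na.dropna())
# fillna() can also use a per-column statistic instead of a constant
print(df_na.fillna(df_na.mean()))
# Named aggregation gives each result column a readable name
df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A'],
                   'Value': [10, 20, 15, 25, 12]})
summary = df.groupby('Category').agg(total=('Value', 'sum'),
                                     average=('Value', 'mean'))
print(summary)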
Visualizing Your Insights with Matplotlib and Seaborn!
What's data analysis without visualizing your findings? Matplotlib and Seaborn are your go-to libraries for creating stunning charts and graphs in Python.
import matplotlib.pyplot as plt
import seaborn as sns
# Matplotlib - Line plot
years = [2018, 2019, 2020, 2021, 2022]
sales = [100, 120, 90, 150, 130]
plt.plot(years, sales)
plt.xlabel("Year")
plt.ylabel("Sales (in units)")
plt.title("Annual Sales Trend")
plt.show()
# Matplotlib - Scatter plot
ages = [20, 25, 30, 35, 40, 45]
incomes = [30000, 45000, 60000, 75000, 90000, 105000]
plt.scatter(ages, incomes)
plt.xlabel("Age")
plt.ylabel("Income")
plt.title("Age vs. Income")
plt.show()
Code Explanation
The code generates two basic plots using Matplotlib, a popular Python library for creating visualizations:
- `import matplotlib.pyplot as plt`: Imports the Matplotlib library's pyplot module, which provides a collection of functions for creating plots. It's aliased as `plt` for brevity.
- `import seaborn as sns`: Imports the Seaborn library, which is built on top of Matplotlib and provides a higher-level interface for creating more visually appealing plots. While imported, it's not used in this specific code.
- Line Plot:
  - `years = [2018, 2019, 2020, 2021, 2022]`: Defines a list of years.
  - `sales = [100, 120, 90, 150, 130]`: Defines a list of sales figures corresponding to the years.
  - `plt.plot(years, sales)`: Creates a line plot with `years` on the x-axis and `sales` on the y-axis.
  - `plt.xlabel("Year")`: Sets the label for the x-axis.
  - `plt.ylabel("Sales (in units)")`: Sets the label for the y-axis.
  - `plt.title("Annual Sales Trend")`: Sets the title of the plot.
  - `plt.show()`: Displays the plot.
- Scatter Plot:
  - `ages = [20, 25, 30, 35, 40, 45]`: Defines a list of ages.
  - `incomes = [30000, 45000, 60000, 75000, 90000, 105000]`: Defines a list of incomes corresponding to the ages.
  - `plt.scatter(ages, incomes)`: Creates a scatter plot with `ages` on the x-axis and `incomes` on the y-axis.
  - `plt.xlabel("Age")`: Sets the label for the x-axis.
  - `plt.ylabel("Income")`: Sets the label for the y-axis.
  - `plt.title("Age vs. Income")`: Sets the title of the plot.
  - `plt.show()`: Displays the plot.
The code will generate two separate plots: a line plot showing the trend of sales over the years, and a scatter plot showing the relationship between age and income.
More on Matplotlib and Seaborn
Matplotlib lets you create basic plots like line plots and scatter plots. These are great for seeing trends and relationships in your data.
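Line and scatter plots are just the start. As an optional sketch (with invented product names and sales figures), a bar chart is another Matplotlib staple for comparing categories:
import matplotlib.pyplot as plt
products = ['A', 'B', 'C', 'D']
units_sold = [120, 85, 140, 60]
plt.bar(products, units_sold)
plt.xlabel("Product")
plt.ylabel("Units Sold")
plt.title("Units Sold per Product")
plt.show()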
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Generate some random data
np.random.seed(42)
data = pd.DataFrame({'Values': np.random.normal(loc=5, scale=2, size=100)})
# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Box plot
sns.boxplot(y='Values', data=data, ax=axes[0])
axes[0].set_title('Box Plot of Values')
axes[0].set_ylabel('Values')
# Histogram
sns.histplot(data['Values'], kde=True, ax=axes[1])
axes[1].set_title('Histogram of Values')
axes[1].set_xlabel('Values')
axes[1].set_ylabel('Frequency')
# Adjust layout to prevent overlapping titles
plt.tight_layout()
# Show the plot
plt.show()
Code Explanation
The code generates and displays two types of plots, a box plot and a histogram, using the Seaborn and Matplotlib libraries. These plots are used to visualize the distribution of a set of randomly generated data.
- `import seaborn as sns`: Imports the Seaborn library, which provides a high-level interface for creating informative statistical graphics.
- `import matplotlib.pyplot as plt`: Imports the Matplotlib library's pyplot module, used for creating plots.
- `import numpy as np`: Imports the NumPy library for numerical operations.
- `import pandas as pd`: Imports the Pandas library for data manipulation.
- `np.random.seed(42)`: Sets the random seed to 42. This ensures that the random data generated is the same every time the code is run, making the results reproducible.
- `data = pd.DataFrame({'Values': np.random.normal(loc=5, scale=2, size=100)})`:
  - `np.random.normal(loc=5, scale=2, size=100)`: Generates 100 random numbers from a normal distribution with a mean (loc) of 5 and a standard deviation (scale) of 2.
  - `pd.DataFrame(...)`: Creates a Pandas DataFrame with a single column named 'Values' containing the generated random numbers.
- `fig, axes = plt.subplots(1, 2, figsize=(12, 5))`:
  - `plt.subplots(1, 2, figsize=(12, 5))`: Creates a figure and a set of subplots.
  - `1, 2`: Arranges the subplots in 1 row and 2 columns.
  - `figsize=(12, 5)`: Sets the size of the entire figure to 12 inches wide and 5 inches tall.
  - `fig`: Represents the entire figure.
  - `axes`: Is a NumPy array containing the two subplot axes objects.
- Box Plot:
  - `sns.boxplot(y='Values', data=data, ax=axes[0])`: Creates a box plot using Seaborn.
    - `y='Values'`: Specifies that the 'Values' column from the DataFrame should be used for the y-axis of the box plot.
    - `data=data`: Specifies the DataFrame containing the data.
    - `ax=axes[0]`: Specifies that this box plot should be drawn in the first subplot (axes[0]).
  - `axes[0].set_title('Box Plot of Values')`: Sets the title of the first subplot.
  - `axes[0].set_ylabel('Values')`: Sets the label for the y-axis of the first subplot.
- Histogram:
  - `sns.histplot(data['Values'], kde=True, ax=axes[1])`: Creates a histogram using Seaborn.
    - `data['Values']`: Specifies the data to be plotted (the 'Values' column).
    - `kde=True`: Adds a Kernel Density Estimate (KDE) curve to the histogram, showing the estimated probability density function of the data.
    - `ax=axes[1]`: Specifies that this histogram should be drawn in the second subplot (axes[1]).
  - `axes[1].set_title('Histogram of Values')`: Sets the title of the second subplot.
  - `axes[1].set_xlabel('Values')`: Sets the x-axis label for the second subplot.
  - `axes[1].set_ylabel('Frequency')`: Sets the y-axis label for the second subplot.
- `plt.tight_layout()`: Adjusts the spacing between subplots to prevent overlapping titles and labels.
- `plt.show()`: Displays the entire figure with both subplots.
This code will generate a figure containing two plots side-by-side. The left plot will be a box plot, visualizing the distribution of the 'Values' data by showing quartiles, median, and potential outliers. The right plot will be a histogram, visualizing the frequency of different value ranges in the 'Values' data, with a smooth curve (KDE) overlaid to estimate the data's probability density.
Seaborn builds on top of Matplotlib and provides more sophisticated statistical visualizations like histograms (to see data distributions) and box plots (to compare groups). These tools help you tell compelling stories with your data!
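If you'd like to try one more Seaborn plot type, here's an optional sketch using Seaborn's bundled 'tips' example dataset (note that `sns.load_dataset` fetches the data online, so this assumes an internet connection). A violin plot combines a box plot with a smoothed view of the distribution:
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')  # small example dataset used in Seaborn's documentation
# A violin plot shows quartiles plus an estimate of the distribution's shape
sns.violinplot(x='day', y='total_bill', data=tips)
plt.title("Total Bill by Day")
plt.show()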
Informative Visualizations
Let's enhance our visualization skills with Matplotlib and Seaborn.
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Generate some more diverse random data
np.random.seed(42)
n_samples = 100
x = np.random.rand(n_samples) * 10
noise = np.random.normal(0, 2, n_samples)
y = 2 * x + 1 + noise
category = np.random.choice(['A', 'B', 'C'], size=n_samples)
data = pd.DataFrame({'X': x, 'Y': y, 'Category': category})
# Create a figure with four subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten() # Flatten the 2x2 array of axes for easier indexing
# 1. Regression Plot
sns.regplot(x='X', y='Y', data=data, ax=axes[0])
axes[0].set_title('Regression Plot of Y vs X')
axes[0].set_xlabel('X (Independent Variable)')
axes[0].set_ylabel('Y (Dependent Variable)')
# 2. Distribution Plot
sns.histplot(data['Y'], kde=True, stat='density', ax=axes[1])  # histplot is axes-level, so it can be drawn into a subplot
axes[1].set_title('Distribution Plot of Y')
axes[1].set_xlabel('Y')
axes[1].set_ylabel('Density')
# 3. Subplots (already done with fig, axes) - adding another example of a simple scatter plot
sns.scatterplot(x='X', y='Y', hue='Category', data=data, ax=axes[2])
axes[2].set_title('Scatter Plot of Y vs X by Category')
axes[2].set_xlabel('X')
axes[2].set_ylabel('Y')
axes[2].legend(title='Category')
# 4. Boxplot with Categorical Comparisons
sns.boxplot(x='Category', y='Y', data=data, ax=axes[3])
axes[3].set_title('Box Plot of Y by Category')
axes[3].set_xlabel('Category')
axes[3].set_ylabel('Y')
# Adjust layout to prevent overlapping titles
plt.tight_layout()
# Show the plot
plt.show()
Code Explanation
The code generates a dataset with a mix of numerical and categorical data and creates four different types of plots using Seaborn and Matplotlib to visualize relationships and distributions within the data.
- `import seaborn as sns`: Imports the Seaborn library for enhanced data visualization.
- `import matplotlib.pyplot as plt`: Imports Matplotlib's pyplot for basic plotting functions.
- `import numpy as np`: Imports NumPy for numerical operations, especially for generating random data.
- `import pandas as pd`: Imports Pandas for creating and manipulating DataFrames.
- Data Generation:
  - `np.random.seed(42)`: Sets the random seed for reproducibility.
  - `n_samples = 100`: Defines the number of data points to generate.
  - `x = np.random.rand(n_samples) * 10`: Generates 100 random numbers between 0 and 10 from a uniform distribution and assigns them to 'x'.
  - `noise = np.random.normal(0, 2, n_samples)`: Generates 100 random numbers from a normal distribution with mean 0 and standard deviation 2 (representing noise).
  - `y = 2 * x + 1 + noise`: Calculates 'y' based on a linear relationship with 'x' plus the added noise.
  - `category = np.random.choice(['A', 'B', 'C'], size=n_samples)`: Randomly assigns one of the categories 'A', 'B', or 'C' to each of the 100 data points.
  - `data = pd.DataFrame({'X': x, 'Y': y, 'Category': category})`: Creates a Pandas DataFrame containing the generated 'X', 'Y', and 'Category' data.
- Figure and Subplots:
  - `fig, axes = plt.subplots(2, 2, figsize=(14, 10))`: Creates a figure and a 2x2 grid of subplots with a figure size of 14x10 inches.
  - `axes = axes.flatten()`: Flattens the 2x2 array of axes into a 1D array, making it easier to access each subplot.
- Plot 1: Regression Plot:
  - `sns.regplot(x='X', y='Y', data=data, ax=axes[0])`: Creates a regression plot showing the relationship between 'X' and 'Y', including a regression line and confidence intervals.
  - `axes[0].set_title(...)`, `axes[0].set_xlabel(...)`, `axes[0].set_ylabel(...)`: Sets the title and labels for the x and y axes of the first subplot.
- Plot 2: Distribution Plot:
  - `sns.histplot(data['Y'], kde=True, stat='density', ax=axes[1])`: Creates a distribution plot (a density-scaled histogram with a KDE curve) of the 'Y' variable, drawn into the second subplot.
  - `axes[1].set_title(...)`, `axes[1].set_xlabel(...)`, `axes[1].set_ylabel(...)`: Sets the title and labels for the axes of the second subplot.
- Plot 3: Scatter Plot:
  - `sns.scatterplot(x='X', y='Y', hue='Category', data=data, ax=axes[2])`: Creates a scatter plot of 'Y' vs. 'X', with different colors for each 'Category'.
  - `axes[2].set_title(...)`, `axes[2].set_xlabel(...)`, `axes[2].set_ylabel(...)`: Sets the title and labels for the axes of the third subplot.
  - `axes[2].legend(title='Category')`: Adds a legend to the third subplot, with the title 'Category'.
- Plot 4: Box Plot:
  - `sns.boxplot(x='Category', y='Y', data=data, ax=axes[3])`: Creates a box plot showing the distribution of 'Y' for each 'Category'.
  - `axes[3].set_title(...)`, `axes[3].set_xlabel(...)`, `axes[3].set_ylabel(...)`: Sets the title and labels for the axes of the fourth subplot.
- `plt.tight_layout()`: Adjusts subplot parameters to prevent overlapping elements.
- `plt.show()`: Displays the entire figure with the four subplots.
This code generates a 2x2 grid of plots showing:
1. The relationship between X and Y with a regression line.
2. The distribution of Y.
3. A scatter plot of X and Y, with data points colored according to their category.
4. Box plots showing how Y varies across different categories.
In this example, we've expanded on the previous one to include the requested plot types:
- Boxplot with Categorical Comparisons: `sns.boxplot()` is used again, but this time we pass a categorical variable ('Category') to the `x` argument. This creates separate box plots for each category, allowing for a visual comparison of the distribution of 'Y' across different categories.
- Regression Plot: `sns.regplot()` visualizes the linear relationship between two variables ('X' and 'Y') along with a regression line and a confidence interval.
- Distribution Plot: `sns.histplot()` shows the distribution of a single variable ('Y'); `stat='density'` puts the y-axis on a density scale and `kde=True` adds the kernel density estimate. (Seaborn's figure-level `sns.displot()` is a more versatile alternative, but it creates its own figure and doesn't accept the `ax` argument, so it can't be drawn into one of our subplots.)
- Subplots: We've structured our figure using `plt.subplots(2, 2, figsize=(14, 10))` to create a grid of 2x2 subplots. The `axes.flatten()` line makes it easier to access individual subplots using a single index. We've added a simple scatter plot as another example of utilizing subplots, coloring points based on the 'Category'.
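One more optional visualization idea, sketched with the same kind of randomly generated X/Y data as above: a heatmap of the correlation matrix is a compact way to see how strongly numeric columns move together.
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
np.random.seed(42)
x = np.random.rand(100) * 10
y = 2 * x + 1 + np.random.normal(0, 2, 100)
data = pd.DataFrame({'X': x, 'Y': y})
# corr() computes pairwise correlations between the numeric columns
corr = data.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()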
A Glimpse into Machine Learning with Scikit-learn!
Ready for some AI magic? Scikit-learn is a powerful Python library that provides a wide range of Machine Learning algorithms. Let's take a peek at a simple linear regression example.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt
# Sample data (replace with your actual data)
X = np.array([[1], [2], [3], [4], [5]]) # Features
y = np.array([2, 4, 5, 4, 5]) # Target variable
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
# Visualize the results
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Linear Regression Prediction")
plt.show()
Code Explanation
The code demonstrates a simple linear regression analysis using scikit-learn. Here's a breakdown:
- `from sklearn.model_selection import train_test_split`: Imports the `train_test_split` function to split the dataset into training and testing sets.
- `from sklearn.linear_model import LinearRegression`: Imports the `LinearRegression` class, which is used to create a linear regression model.
- `from sklearn.metrics import mean_squared_error`: Imports the `mean_squared_error` function to evaluate the model's performance.
- `import numpy as np`: Imports the NumPy library for numerical operations.
- `import matplotlib.pyplot as plt`: Imports the Matplotlib library for plotting.
- Sample Data:
  - `X = np.array([[1], [2], [3], [4], [5]])`: Creates a NumPy array `X` representing the independent variable (feature). Each inner list represents a data point.
  - `y = np.array([2, 4, 5, 4, 5])`: Creates a NumPy array `y` representing the dependent variable (target).
- Data Splitting:
  - `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)`: Splits the data into training and testing sets.
    - `test_size=0.2`: 20% of the data is used for testing, and 80% for training.
    - `random_state=42`: Sets the random seed to ensure the same split is obtained each time the code is run.
- Model Training:
  - `model = LinearRegression()`: Creates an instance of the `LinearRegression` model.
  - `model.fit(X_train, y_train)`: Trains the model using the training data. The model learns the relationship between `X_train` and `y_train`.
- Prediction:
  - `y_pred = model.predict(X_test)`: Uses the trained model to predict the values of `y` for the test set (`X_test`).
- Evaluation:
  - `mse = mean_squared_error(y_test, y_pred)`: Calculates the Mean Squared Error (MSE) to evaluate the model's performance. MSE measures the average squared difference between the predicted and actual values.
  - `print(f"Mean Squared Error: {mse}")`: Prints the calculated MSE.
- Visualization:
  - `plt.scatter(X_test, y_test, color='black')`: Creates a scatter plot of the actual test data points.
  - `plt.plot(X_test, y_pred, color='blue', linewidth=3)`: Plots the predicted values (the regression line) in blue.
  - `plt.xlabel("X")`, `plt.ylabel("y")`, `plt.title("Linear Regression Prediction")`: Adds labels to the x-axis and y-axis, and sets the title of the plot.
  - `plt.show()`: Displays the plot.
This code performs a linear regression, predicts 'y' based on 'X', evaluates the prediction accuracy, and visualizes the results.
This code demonstrates how to train a linear regression model to predict a numerical value based on input features. We also evaluate the model's performance and visualize the results. Scikit-learn makes it surprisingly easy to get started with Machine Learning!
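Curious what the model actually learned? This optional sketch (reusing the same tiny X and y arrays from above) inspects the fitted slope and intercept and reports the R² score, another common evaluation metric:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])
model = LinearRegression()
model.fit(X, y)
# The fitted line is y = coef_ * x + intercept_
print(f"Slope: {model.coef_[0]:.3f}")
print(f"Intercept: {model.intercept_:.3f}")
# R^2 measures how much of the variance in y the line explains
print(f"R^2 on the training data: {r2_score(y, model.predict(X)):.3f}")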
Introduction to Classification
Let's get a taste of machine learning by tackling a multi-class classification problem using Logistic Regression.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Sample multi-class classification data
data_clf = {'Feature1': [1, 2, 3, 1.5, 2.5, 3.5, 1, 2, 3],
            'Feature2': [2, 4, 1, 2.5, 3.5, 1.5, 3, 1, 4],
            'Target': [0, 0, 1, 0, 1, 1, 2, 2, 2]}
df_multi_clf = pd.DataFrame(data_clf)
X_multi = df_multi_clf[['Feature1', 'Feature2']]
y_multi = df_multi_clf['Target']
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X_multi, y_multi, test_size=0.3, random_state=42)
# Train a Logistic Regression model
model_multi = LogisticRegression()  # handles multi-class targets automatically; the explicit multi_class='ovr' option is deprecated in recent scikit-learn
model_multi.fit(X_train_multi, y_train_multi)
# Make predictions
y_pred_multi = model_multi.predict(X_test_multi)
# Evaluate performance
accuracy_multi = accuracy_score(y_test_multi, y_pred_multi)
print(f"\nAccuracy of Multi-class Logistic Regression: {accuracy_multi}")
# Confusion matrix
cm_multi = confusion_matrix(y_test_multi, y_pred_multi)
sns.heatmap(cm_multi, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
Code Explanation
The code demonstrates multi-class classification using logistic regression with scikit-learn. Here's a breakdown:
- `from sklearn.model_selection import train_test_split`: Imports the `train_test_split` function for splitting the dataset.
- `from sklearn.linear_model import LogisticRegression`: Imports the `LogisticRegression` class for creating the model.
- `from sklearn.metrics import accuracy_score, confusion_matrix`: Imports functions for evaluating model performance: `accuracy_score` for overall accuracy and `confusion_matrix` to visualize classification results.
- `import seaborn as sns`: Imports the Seaborn library for enhanced visualization (used for the confusion matrix).
- `import matplotlib.pyplot as plt`: Imports Matplotlib's pyplot for plotting.
- `import pandas as pd`: Imports the Pandas library for data manipulation.
- Sample Data Creation:
  - `data_clf = {'Feature1': [1, 2, 3, 1.5, 2.5, 3.5, 1, 2, 3], 'Feature2': [2, 4, 1, 2.5, 3.5, 1.5, 3, 1, 4], 'Target': [0, 0, 1, 0, 1, 1, 2, 2, 2]}`: Creates a dictionary containing sample data for a multi-class classification problem. 'Feature1' and 'Feature2' are the independent variables, and 'Target' is the dependent variable with three classes (0, 1, and 2).
  - `df_multi_clf = pd.DataFrame(data_clf)`: Creates a Pandas DataFrame from the dictionary.
  - `X_multi = df_multi_clf[['Feature1', 'Feature2']]`: Creates a DataFrame `X_multi` containing the features (independent variables).
  - `y_multi = df_multi_clf['Target']`: Creates a Series `y_multi` containing the target variable (dependent variable).
- Data Splitting:
  - `X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X_multi, y_multi, test_size=0.3, random_state=42)`: Splits the data into training and testing sets, with 30% used for testing. `random_state` ensures consistent splitting.
- Model Training:
  - `model_multi = LogisticRegression()`: Creates a Logistic Regression model. scikit-learn handles multi-class targets automatically; conceptually this can be done with a "one-vs-rest" strategy, in which a separate logistic regression model is trained for each class, predicting the probability of an instance belonging to that class versus all other classes (the explicit `multi_class='ovr'` argument is deprecated in recent scikit-learn releases).
  - `model_multi.fit(X_train_multi, y_train_multi)`: Trains the logistic regression model using the training data.
- Prediction:
  - `y_pred_multi = model_multi.predict(X_test_multi)`: Uses the trained model to predict the target variable for the test set.
- Evaluation:
  - `accuracy_multi = accuracy_score(y_test_multi, y_pred_multi)`: Calculates the accuracy of the model's predictions. Accuracy is the proportion of correctly classified instances.
  - `print(f"\nAccuracy of Multi-class Logistic Regression: {accuracy_multi}")`: Prints the accuracy.
  - `cm_multi = confusion_matrix(y_test_multi, y_pred_multi)`: Computes the confusion matrix, which shows the distribution of predicted and actual classes. It's useful for understanding the types of errors the model is making.
  - `sns.heatmap(cm_multi, annot=True, fmt='d', cmap='Blues')`: Plots the confusion matrix as a heatmap.
    - `annot=True`: Displays the numerical values in each cell of the heatmap.
    - `fmt='d'`: Specifies that the values should be displayed as integers.
    - `cmap='Blues'`: Uses the 'Blues' color map.
  - `plt.xlabel('Predicted Label')`, `plt.ylabel('True Label')`, `plt.title('Confusion Matrix')`: Sets the labels for the x-axis and y-axis, and the title of the plot.
  - `plt.show()`: Displays the confusion matrix plot.
This code trains a logistic regression model for a multi-class classification problem, evaluates its performance using accuracy and a confusion matrix, and visualizes the confusion matrix.
We're now taking our first steps into classification by training a Logistic Regression model to predict one of three classes. We evaluate its performance using accuracy and a confusion matrix, which provides a detailed breakdown of the model's predictions versus the actual labels for each class.
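As an optional next step, this sketch (reusing the same toy features, with a new point whose values are made up for illustration) shows how a trained classifier can be asked for both a predicted class and per-class probabilities:
from sklearn.linear_model import LogisticRegression
import pandas as pd
df = pd.DataFrame({'Feature1': [1, 2, 3, 1.5, 2.5, 3.5, 1, 2, 3],
                   'Feature2': [2, 4, 1, 2.5, 3.5, 1.5, 3, 1, 4],
                   'Target': [0, 0, 1, 0, 1, 1, 2, 2, 2]})
model = LogisticRegression(max_iter=1000)
model.fit(df[['Feature1', 'Feature2']], df['Target'])
# Predict the class and the per-class probabilities for a new, unseen point
new_point = pd.DataFrame({'Feature1': [2.0], 'Feature2': [3.0]})
print("Predicted class:", model.predict(new_point)[0])
print("Class probabilities:", model.predict_proba(new_point)[0].round(3))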
Keep Leveling Up!
You've now armed yourself with a more powerful set of Python and data science skills! You're writing more concise and robust code, manipulating data with greater ease using NumPy and Pandas, creating insightful visualizations, and even dipping your toes into the world of multi-class classification.
Remember, the journey of a data scientist is one of continuous learning and exploration. Keep practicing these intermediate techniques, experiment with different datasets, and don't be afraid to explore the vast capabilities of these powerful libraries. The more you practice, the sharper your skills will become!
What intermediate data science concept are you excited to master next? Share your thoughts in the comments below!
#datascience #machinelearning #python #intermediatepython #numpy #pandas #datavisualization #matplotlib #seaborn #logisticregression #classification #coding #skilldevelopment
The Adventure Continues!
This is just a taste of the incredible journey that awaits you in the world of data science. There's so much more to explore:
- Data Cleaning and Preprocessing: Getting your data ready for analysis.
- More Machine Learning Algorithms: Expanding your toolkit with classification, clustering, and dimensionality reduction techniques.
- Natural Language Processing (NLP): Working with text data.
- Time Series Analysis: Analyzing data that changes over time.
- Databases and SQL: Managing and querying data in databases.
- Web Scraping: Extracting data from websites.
Remember, consistent practice and hands-on projects are key to mastering these skills. So, dive in, experiment, and most importantly, have fun!
This "Three steps to Machine Learning challenge" is not just a challenge; it's an invitation to transform yourself. Take that first step, run these codes, tweak them, and watch your confidenceโand competenceโgrow.