Chapter 7: Data Science Libraries - NumPy and Pandas

Python's data science ecosystem is built on libraries for numerical computation and data manipulation. NumPy and Pandas sit at its base: most other tools either build on them or accept their data structures, so fluency in both carries over to most data analysis tasks.

NumPy: Numerical Python

NumPy is a low-level library for multi-dimensional arrays. Many Python data science tools rely on it: Pandas and scikit-learn are built on NumPy arrays, and TensorFlow interoperates with them.

What is NumPy?

NumPy provides:

  • Fast multi-dimensional arrays
  • Mathematical operations on arrays
  • Linear algebra functions
  • Random number generation
  • The array type that higher-level data science libraries build on

Hello World NumPy Workflow
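
A first workflow might look like the sketch below: import the library, build an array, and compute on it without writing a loop. The temperature readings and variable names are made up for illustration.

```python
import numpy as np

# Build an array from a plain Python list (sample values are invented)
temperatures = np.array([20.5, 22.0, 19.5, 24.0])

# Vectorized arithmetic: convert every reading from Celsius to Fahrenheit
fahrenheit = temperatures * 9 / 5 + 32

print(temperatures.mean())  # average of the four readings
print(fahrenheit)
```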

Creating NumPy Arrays

One-Dimensional Arrays
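
As a minimal sketch, a one-dimensional array can be created from a Python list with np.array; the values here are arbitrary.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr.ndim)   # number of dimensions: 1
print(arr.shape)  # (5,)
print(arr.dtype)  # an integer type; the exact width depends on the platform
```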

Two-Dimensional Arrays
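
A nested list produces a two-dimensional array, provided every inner list has the same length. The example below is illustrative only.

```python
import numpy as np

# Two rows, three columns
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix.ndim)   # 2
print(matrix.shape)  # (2, 3): rows first, then columns
print(matrix.size)   # 6 elements in total
```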

Creating Sequences with arange
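
np.arange works much like Python's built-in range but returns an array. The sketch below assumes integer steps; like range, the stop value is excluded.

```python
import numpy as np

first_five = np.arange(5)       # 0 up to, but not including, 5
evens = np.arange(2, 11, 2)     # start=2, stop=11 (excluded), step=2

print(first_five)  # [0 1 2 3 4]
print(evens)       # [ 2  4  6  8 10]
```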

Creating Arrays of Zeros
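
A short illustration of np.zeros, which takes the desired shape and fills it with 0.0; the shapes chosen here are arbitrary.

```python
import numpy as np

zeros_1d = np.zeros(3)        # shape given as a single integer
zeros_2d = np.zeros((2, 3))   # shape given as a tuple: 2 rows, 3 columns

print(zeros_1d)        # [0. 0. 0.]
print(zeros_2d.shape)  # (2, 3)
print(zeros_2d.dtype)  # float64 by default
```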

Creating Arrays of Ones
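
np.ones follows the same pattern as np.zeros. The sketch also passes dtype=int to show that the element type can be overridden; that choice is illustrative.

```python
import numpy as np

ones_2d = np.ones((2, 2))         # floats by default
ones_int = np.ones(3, dtype=int)  # request integers instead

print(ones_2d)
print(ones_int)  # [1 1 1]
```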

Creating Identity Matrices
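
An identity matrix has ones on the diagonal and zeros elsewhere. One way to build it is np.eye, sketched below; the matrix m is an arbitrary example used to show that multiplying by the identity leaves it unchanged.

```python
import numpy as np

identity = np.eye(3)  # 3x3 identity matrix of floats

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Multiplying by the identity returns the same values
product = identity @ m
print(identity)
print(np.array_equal(product, m))
```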

Array Operations

Element-wise Arithmetic
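
Arithmetic operators on two arrays of the same shape act element by element. The arrays a and b below are sample values.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

total = a + b      # [11 22 33]
product = a * b    # [10 40 90]  (element-wise, not a dot product)
ratio = b / a      # [10. 10. 10.]  (true division returns floats)

print(total, product, ratio)
```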

Scalar Operations
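
When one operand is a single number, NumPy applies it to every element; this is the simplest case of broadcasting. A small sketch with arbitrary values:

```python
import numpy as np

arr = np.array([1, 2, 3])

scaled = arr * 10    # [10 20 30]
shifted = arr + 0.5  # [1.5 2.5 3.5]  (result is promoted to float)
squared = arr ** 2   # [1 4 9]

print(scaled, shifted, squared)
```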

Matrix Operations
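
The * operator stays element-wise even for 2-D arrays; matrix multiplication uses the @ operator (or np.matmul). The matrices A and B are illustrative.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

matmul = A @ B       # [[19 22], [43 50]]
elementwise = A * B  # [[ 5 12], [21 32]]
transposed = A.T     # [[1 3], [2 4]]

print(matmul)
print(elementwise)
print(transposed)
```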

Array Indexing and Slicing

Basic Indexing
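
One-dimensional arrays are indexed and sliced like Python lists. The sketch below uses made-up values; note that a NumPy slice is a view onto the original data rather than a copy.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

first = arr[0]        # 10
last = arr[-1]        # 50 (negative indices count from the end)
middle = arr[1:4]     # [20 30 40] (stop index is excluded)
every_other = arr[::2]  # [10 30 50]

print(first, last, middle, every_other)
```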

Two-Dimensional Indexing
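
For 2-D arrays, the row and column indices go inside one pair of brackets, separated by a comma. The grid below is an arbitrary example.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

element = grid[1, 2]      # row 1, column 2 -> 6
first_row = grid[0]       # [1 2 3]
second_col = grid[:, 1]   # [2 5 8]
top_right = grid[:2, 1:]  # first two rows, last two columns

print(element, first_row, second_col)
print(top_right)
```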

Boolean Indexing
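
A comparison on an array yields an array of booleans, which can then be used to select elements. The scores below are invented; conditions are combined with & and | and need parentheses around each comparison.

```python
import numpy as np

scores = np.array([55, 82, 91, 47, 73])

mask = scores >= 70           # [False  True  True False  True]
passing = scores[mask]        # [82 91 73]

# Combine conditions with & (and) or | (or)
mid_range = scores[(scores > 50) & (scores < 90)]  # [55 82 73]

print(mask, passing, mid_range)
```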

Aggregation Functions

Statistical Operations
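
Common summary statistics are available as array methods or top-level functions. The data values are a small illustrative sample; note that std() computes the population standard deviation unless ddof is set.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print(data.sum())       # 40
print(data.mean())      # 5.0
print(data.std())       # population standard deviation (ddof=0)
print(data.min(), data.max())  # 2 9
print(np.median(data))  # 4.5
```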

Axis-based Operations
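
Aggregations accept an axis argument that names the dimension to collapse: axis=0 aggregates down the rows (one result per column), axis=1 aggregates across the columns (one result per row). A sketch with a sample 2x3 array:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = m.sum(axis=0)    # [5 7 9]
row_sums = m.sum(axis=1)    # [ 6 15]
row_means = m.mean(axis=1)  # [2. 5.]

print(col_sums, row_sums, row_means)
```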

Array Reshaping
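
reshape changes the shape of an array without changing its data, as long as the total number of elements stays the same. In the sketch below, -1 asks NumPy to infer one dimension.

```python
import numpy as np

flat = np.arange(6)            # [0 1 2 3 4 5]

two_by_three = flat.reshape(2, 3)   # [[0 1 2], [3 4 5]]
three_by_two = flat.reshape(3, -1)  # -1 is inferred as 2
back_to_1d = two_by_three.flatten() # a flat copy of the data

print(two_by_three)
print(three_by_two.shape)  # (3, 2)
print(back_to_1d)
```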

Stacking Arrays
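
Arrays can be combined vertically or horizontally. The sketch uses np.vstack and np.hstack on two sample 1-D arrays.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

vertical = np.vstack([a, b])    # stack as rows -> shape (2, 3)
horizontal = np.hstack([a, b])  # join end to end -> shape (6,)

print(vertical)
print(horizontal)  # [1 2 3 4 5 6]
```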

Pandas: Data Analysis Library

Pandas is built on NumPy and provides high-level data structures for working with structured data. It's the primary tool for data manipulation in Python.

Pandas Series

A Series is a one-dimensional labeled array.
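
A minimal sketch of a Series built from a list: when no index is supplied, Pandas labels the values 0, 1, 2, and so on. The values are arbitrary.

```python
import pandas as pd

s = pd.Series([10, 20, 30])

print(s)
print(list(s.index))  # [0, 1, 2]
print(s.iloc[0])      # first value by position: 10
print(s.sum())        # 60
```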

Series with Custom Index
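
The index argument replaces the default integer labels. In this illustrative example, the fruit names and prices are made up.

```python
import pandas as pd

prices = pd.Series([1.2, 0.5, 3.0], index=["apple", "banana", "cherry"])

print(prices["banana"])             # look up a single label: 0.5
print(prices[["apple", "cherry"]])  # look up several labels at once
```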

Series from Dictionary
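
Passing a dictionary to pd.Series turns the keys into the index and the values into the data. The stock counts below are hypothetical.

```python
import pandas as pd

stock = pd.Series({"pens": 120, "notebooks": 45, "staplers": 8})

print(stock["pens"])   # 120
low = stock[stock < 50]  # boolean filtering works on a Series too
print(low)
```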

Pandas DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

Creating a DataFrame
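
One common way to build a DataFrame is from a dictionary that maps column names to lists of equal length. The people and cities below are invented sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 25, 35],
    "city": ["Paris", "Berlin", "Madrid"],
})

print(df)
print(df.shape)  # (3, 3): three rows, three columns
```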

DataFrame from Lists
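
A list of rows also works, with the column names supplied separately through the columns argument. The rows here are sample data.

```python
import pandas as pd

rows = [
    ["Alice", 30],
    ["Bob", 25],
]

# Each inner list becomes one row
df = pd.DataFrame(rows, columns=["name", "age"])
print(df)
```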

Exploring DataFrames

Basic Information
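
A few attributes and methods give a quick overview of a DataFrame's size and column types. This sketch runs them on a small made-up frame.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 25, 35],
})

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names
print(df.dtypes)         # data type of each column
df.info()                # prints index, columns, non-null counts, memory use
```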

Head and Tail
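
head() and tail() return the first and last rows; both default to five rows when called without an argument. The ten-row frame below is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

first_three = df.head(3)  # rows 0, 1, 2
last_two = df.tail(2)     # rows 8, 9

print(first_three)
print(last_two)
```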

Describe Statistics
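
describe() summarizes the numeric columns with count, mean, standard deviation, minimum, quartiles, and maximum. A sketch on sample ages:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 30, 35, 40]})

summary = df.describe()
print(summary)

# Individual statistics can be read with loc[row_label, column_label]
print(summary.loc["mean", "age"])  # 32.5
```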

Accessing DataFrame Data

Column Selection
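
Selecting with a single column name returns a Series, while a list of names returns a DataFrame. The frame below is sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 25, 35],
    "city": ["Paris", "Berlin", "Madrid"],
})

ages = df["age"]              # one column -> Series
subset = df[["name", "age"]]  # list of columns -> DataFrame

print(type(ages).__name__)    # Series
print(type(subset).__name__)  # DataFrame
```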

Row Selection with loc and iloc
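
loc selects by label and iloc by integer position. The sketch below uses a frame with string labels to make the difference visible; note that label slices include the end label, while position slices exclude the end position.

```python
import pandas as pd

df = pd.DataFrame({"age": [30, 25, 35]}, index=["a", "b", "c"])

by_label = df.loc["b", "age"]    # row labeled "b" -> 25
by_position = df.iloc[0]["age"]  # first row -> 30

label_slice = df.loc["a":"b"]    # includes "b": 2 rows
position_slice = df.iloc[0:1]    # excludes position 1: 1 row

print(by_label, by_position, len(label_slice), len(position_slice))
```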

Boolean Filtering
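
Rows are filtered by passing a boolean Series inside the brackets. The sample data is invented; as with NumPy, conditions are combined with & and | and each comparison needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 25, 35],
})

over_25 = df[df["age"] > 25]                       # Alice and Carol
between = df[(df["age"] > 25) & (df["age"] < 35)]  # Alice only

print(over_25)
print(between)
```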

Adding and Modifying Data

Adding Columns
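
Assigning to a new column name adds that column in place, while assign() returns a new DataFrame and leaves the original untouched. The prices and quantities are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"price": [2.0, 3.5], "quantity": [3, 2]})

# In-place: compute a new column from existing ones
df["total"] = df["price"] * df["quantity"]

# Non-mutating alternative: returns a copy with the extra column
discounted_df = df.assign(discounted=df["total"] - 1)

print(df)
print(discounted_df)
```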

Modifying Values
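
A reasonably safe pattern for changing existing values is to assign through loc, which avoids the ambiguity of chained indexing. This sketch uses invented names and ages.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

# Update the cells that match a condition
df.loc[df["name"] == "Bob", "age"] = 26

# Update a whole column at once
df["age"] = df["age"] + 1

# Update a single cell by row label and column name
df.at[0, "name"] = "Alicia"

print(df)
```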

Handling Missing Data

Detecting Missing Data
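
isnull() marks each missing cell with True; summing the result counts the missing values per column. The frame below, with deliberately missing entries, is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, "z"],
})

missing_per_column = df.isnull().sum()        # count of missing values per column
rows_with_missing = df.isnull().any(axis=1)   # True for rows with any gap

print(missing_per_column)
print(rows_with_missing)
```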

Dropping Missing Data
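
dropna() removes rows that contain missing values by default; axis=1 removes columns instead, and subset limits the check to particular columns. A sketch on sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, np.nan],
    "c": [7, 8, 9],
})

complete_rows = df.dropna()            # keeps only row 0
complete_cols = df.dropna(axis=1)      # keeps only column "c"
a_present = df.dropna(subset=["a"])    # keeps rows 0 and 2

print(complete_rows)
print(complete_cols)
print(a_present)
```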

Filling Missing Data
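
fillna() replaces missing values with a constant or a computed value, and ffill() carries the last valid value forward. Which strategy is appropriate depends on the data; the example values are arbitrary.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

with_zero = s.fillna(0)          # [1.0, 0.0, 3.0]
with_mean = s.fillna(s.mean())   # mean skips NaN: (1 + 3) / 2 = 2.0
carried = s.ffill()              # [1.0, 1.0, 3.0]

print(with_zero.tolist(), with_mean.tolist(), carried.tolist())
```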

Sorting Data
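
sort_values() orders rows by one or more columns and returns a new DataFrame. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 25, 35],
})

youngest_first = df.sort_values("age")
oldest_first = df.sort_values("age", ascending=False)

print(youngest_first)
print(oldest_first)
```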

Grouping and Aggregation
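
groupby() splits the rows by the values of a column, applies an aggregation to each group, and combines the results. The departments and salaries below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["eng", "eng", "ops", "ops", "ops"],
    "salary": [100, 120, 80, 90, 70],
})

# One aggregation per group
mean_by_dept = df.groupby("dept")["salary"].mean()

# Several aggregations at once
summary = df.groupby("dept")["salary"].agg(["mean", "max", "count"])

print(mean_by_dept)
print(summary)
```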

Merging DataFrames
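
merge() combines two DataFrames on a shared key, much like a SQL join. The sketch below assumes two invented tables that share a customer_id column and contrasts an inner join with a left join.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [50, 20, 75],
})

# Inner join: only customers that have at least one order
inner = customers.merge(orders, on="customer_id", how="inner")

# Left join: every customer; those without orders get NaN in "amount"
left = customers.merge(orders, on="customer_id", how="left")

print(inner)
print(left)
```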

Applying Functions
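
apply() runs a function on each element of a Series, or on each row or column of a DataFrame. The sketch below uses invented data; for simple arithmetic, built-in vectorized operations are usually faster than apply().

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "score": [82, 47]})

# Element-wise on a Series
titled = df["name"].apply(str.title)
grades = df["score"].apply(lambda s: "pass" if s >= 50 else "fail")

# Row-wise on a DataFrame (axis=1 passes each row to the function)
labels = df.apply(lambda row: f"{row['name']}:{row['score']}", axis=1)

print(titled.tolist(), grades.tolist(), labels.tolist())
```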

Working with CSV Data (In-Memory)
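
CSV text can be read without touching the file system by wrapping it in io.StringIO, which read_csv treats like an open file. The CSV content below is sample data; calling to_csv() without a path returns the CSV as a string.

```python
import io

import pandas as pd

csv_text = "name,age\nAlice,30\nBob,25\n"

# Parse CSV from an in-memory text buffer
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# Serialize back to CSV text; index=False omits the row labels
out = df.to_csv(index=False)
print(out)
```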

Key Takeaways

  1. NumPy Arrays: Fast, efficient multi-dimensional arrays for numerical computation
  2. Array Operations: Element-wise operations, broadcasting, and linear algebra
  3. Pandas Series: One-dimensional labeled arrays
  4. Pandas DataFrames: Two-dimensional labeled data structures, the workhorse of data analysis
  5. Data Selection: Use loc for labels, iloc for positions, boolean indexing for filtering
  6. Missing Data: Detect with isnull(), handle with dropna() or fillna()
  7. Aggregation: Use groupby() for split-apply-combine operations
  8. Data Merging: Combine datasets with merge() or join()

NumPy and Pandas form the foundation of data science in Python. With these tools, you can load, clean, transform, and analyze data efficiently.

Quiz

Which expression selects the rows of df where age is greater than 25?

  A. df.filter(age > 25)
  B. df[df['age'] > 25]
  C. df.select(age > 25)
  D. df.where('age' > 25)

Correct answer: B

  • A: Incorrect. Pandas does not have a filter() method for row selection in this way.
  • B: Correct! Boolean indexing with df[condition] is the standard way to filter rows in Pandas.
  • C: Incorrect. There is no select() method in Pandas for this purpose.
  • D: Incorrect. While where() exists, the syntax is incorrect and it works differently.
