Chapter 10: Pattern Matching

Pattern matching is essential for data science workflows - from cleaning text data to extracting information from logs to validating input formats. Python provides powerful tools for pattern matching, from simple string methods to sophisticated regular expressions.

Simple String Pattern Matching

Python's built-in string methods provide simple pattern matching capabilities:

The in operator checks for substring presence, returning a boolean value.
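
As a minimal sketch of the in operator on strings (the sample text is made up for illustration):

```python
# Hypothetical sample text, chosen only for illustration.
text = "Data science with Python"

has_python = "Python" in text   # substring is present
has_java = "Java" in text       # substring is absent
has_lower = "python" in text    # the check is case-sensitive

print(has_python, has_java, has_lower)  # True False False
```

Lower-casing both sides first (for example with text.lower()) is one common way to make the check case-insensitive.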

String Position Methods

Strings have methods to check patterns at specific positions:

These methods are efficient for checking prefixes and suffixes without regular expressions.
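
A short sketch of prefix and suffix checks with startswith() and endswith(); the file name is a hypothetical example:

```python
filename = "report_2024.csv"  # hypothetical file name

is_report = filename.startswith("report_")
is_csv = filename.endswith(".csv")

# Passing a tuple tests several alternatives in a single call.
is_tabular = filename.endswith((".csv", ".tsv"))
is_image = filename.endswith((".png", ".jpg"))

print(is_report, is_csv, is_tabular, is_image)  # True True True False
```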

Finding Substring Positions

The find() method returns the index of the first occurrence:

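find() returns -1 when the substring isn't found instead of raising an exception (as index() does). Because -1 is truthy and a match at position 0 is falsy, compare the result with -1 explicitly in conditional logic.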
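
The sketch below illustrates find() on a made-up log line, including the explicit comparison with -1:

```python
text = "error: disk full"  # hypothetical log line

pos = text.find("disk")        # index of the first occurrence
missing = text.find("memory")  # -1 because "memory" does not occur
at_start = text.find("error")  # 0, which is falsy in a bare `if`

# Compare with -1 explicitly; `if text.find(...)` would misfire on 0 and -1.
if pos != -1:
    print(f"'disk' found at index {pos}")

print(pos, missing, at_start)  # 7 -1 0
```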

List Membership Matching

Pattern matching works with lists using the in operator:

List membership checks are exact matches, not substring searches.
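
To make the exact-match behavior concrete, here is a small sketch with a hypothetical list of language names:

```python
languages = ["python", "r", "julia"]  # hypothetical sample data

exact = "python" in languages    # an element equals "python"
partial = "pyth" in languages    # no element equals "pyth"

# A substring search across elements needs an explicit loop or any().
any_partial = any("pyth" in lang for lang in languages)

print(exact, partial, any_partial)  # True False True
```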

Dictionary Membership Matching

Dictionaries support membership checks on both keys and values:

Use in directly to test keys. To search values, use in on values() or loop through items().
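
A minimal sketch of key and value lookups, using a hypothetical dictionary of ages:

```python
ages = {"alice": 30, "bob": 25}  # hypothetical sample data

has_key = "alice" in ages          # `in` on a dict tests keys
value_as_key = 30 in ages          # values are not searched by default
has_value = 30 in ages.values()    # search values explicitly

# items() gives access to the key when the match is on the value.
names_aged_25 = [name for name, age in ages.items() if age == 25]

print(has_key, value_as_key, has_value, names_aged_25)
```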

Introduction to Regular Expressions

Regular expressions (regex) provide powerful pattern matching beyond simple string methods:

re.match() only matches at the beginning of the string.
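
The following sketch shows re.match() succeeding at the start of a string and returning None elsewhere; the sentence is a made-up example:

```python
import re

text = "Python is popular"  # hypothetical sample text

at_start = re.match(r"Python", text)       # pattern begins at index 0
not_at_start = re.match(r"popular", text)  # present, but not at the start

# A failed match returns None, so test the result before calling group().
start_text = at_start.group() if at_start else None

print(start_text, not_at_start)  # Python None
```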

Regular Expression Search

Use re.search() to find patterns anywhere in the string:

search() finds the first occurrence anywhere in the string, while match() only checks the start.
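
As a quick comparison of the two functions on the same made-up input:

```python
import re

text = "I use Python daily"  # hypothetical sample text

found = re.search(r"Python", text)   # scans the whole string
matched = re.match(r"Python", text)  # only tries index 0

print(found.group(), found.span())  # Python (6, 12)
print(matched)                      # None
```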

Character Sets in Regex

Character sets match any single character from a specified set or range:

Character sets are written in brackets []. Quantifiers such as + (one or more) and {n} (exactly n) control how many times the set repeats.
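
A small sketch of bracketed sets combined with quantifiers; the order text and the patterns are illustrative assumptions:

```python
import re

text = "Order A17 shipped in 2024"  # hypothetical sample text

# [0-9]{4}: exactly four characters, each from the range 0-9.
year = re.search(r"[0-9]{4}", text)

# [A-Z][0-9]+: one uppercase letter followed by one or more digits.
code = re.search(r"[A-Z][0-9]+", text)

print(year.group(), code.group())  # 2024 A17
```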

Email Pattern Matching

Regular expressions excel at matching structured formats like emails:

This pattern matches: letters + @ + letters + . + letters.
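
The chapter's own pattern is not reproduced here, so the sketch below builds one from the description above (letters, @, letters, a dot, letters). The variable name email_pattern and the sample addresses are assumptions:

```python
import re

# Assumed pattern following the "letters + @ + letters + . + letters" description.
email_pattern = r"[a-zA-Z]+@[a-zA-Z]+\.[a-zA-Z]+"

text = "Contact alice@example.com for details"  # hypothetical sample text
m = re.search(email_pattern, text)
print(m.group())  # alice@example.com

# The pattern is deliberately simple: digits or dots in the local part fail.
strict = re.fullmatch(email_pattern, "john.doe99@example.com")
print(strict)  # None
```

A letters-only pattern like this is useful for teaching but too narrow for real addresses, which can contain digits, dots, hyphens, and more.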

Character Classes

Character classes provide shortcuts for common patterns:

Common character classes: \w (word character), \d (digit), \s (whitespace), \W (non-word character), \D (non-digit), \S (non-whitespace).
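
To see several of these classes in action, consider this sketch on made-up text:

```python
import re

text = "Room 42, Floor 7"  # hypothetical sample text

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)       # runs of word characters
parts = re.split(r"\s+", "a  b\tc")    # split on any whitespace run
only_digits = re.sub(r"\D", "", text)  # delete every non-digit

print(digits)       # ['42', '7']
print(words)        # ['Room', '42', 'Floor', '7']
print(parts)        # ['a', 'b', 'c']
print(only_digits)  # 427
```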

Capture Groups

Groups capture parts of the pattern for extraction:

Group 0 is always the full match. Groups 1, 2, 3... are captured subpatterns.
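
A minimal sketch of numbered groups, using an assumed ISO-style date pattern on made-up text:

```python
import re

# Three groups: year, month, day (an illustrative pattern).
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released on 2024-03-15")

full = m.group(0)   # the entire match
year = m.group(1)   # first parenthesized group
parts = m.groups()  # all captured groups as a tuple

print(full, year, parts)  # 2024-03-15 2024 ('2024', '03', '15')
```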

Named Capture Groups

Named groups make patterns more readable:

Named groups make code self-documenting and easier to maintain.
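
Named groups use the (?P<name>...) syntax. The sketch below applies it to a simplified, assumed email pattern; the group names username and domain are chosen for illustration:

```python
import re

# Simplified email pattern with two named groups (illustrative only).
pattern = r"(?P<username>[a-zA-Z]+)@(?P<domain>[a-zA-Z]+\.[a-zA-Z]+)"
m = re.search(pattern, "alice@example.com")

username = m.group("username")  # access by name
domain = m.group("domain")
by_number = m.group(1)          # named groups are still numbered

print(username, domain, by_number)  # alice example.com alice
```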

Finding All Matches

findall() returns all non-overlapping matches:

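findall() returns a list of strings when the pattern has no groups. With exactly one group it returns that group's text for each match, and with several groups it returns a list of tuples.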
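
The sketch below shows findall() without groups, with a single group, and on overlapping candidates; the sample text is made up:

```python
import re

text = "Prices: 10, 25 and 300 dollars"  # hypothetical sample text

numbers = re.findall(r"\d+", text)              # no groups: whole matches
in_dollars = re.findall(r"(\d+) dollars", text) # one group: group text only
pairs = re.findall(r"aa", "aaaa")               # matches never overlap

print(numbers)     # ['10', '25', '300']
print(in_dollars)  # ['300']
print(pairs)       # ['aa', 'aa']
```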

Finding All with Groups

When the pattern contains more than one group, findall() returns a list of tuples:

Each match becomes a tuple of captured groups.
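
As an illustration, this sketch extracts name=value pairs with two groups and turns the tuples into a dictionary; the input format is an assumption:

```python
import re

text = "alice=30, bob=25"  # hypothetical key=value text

# Two groups, so each match is returned as a (name, age) tuple of strings.
pairs = re.findall(r"(\w+)=(\d+)", text)

# Tuples unpack naturally in a comprehension.
ages = {name: int(age) for name, age in pairs}

print(pairs)  # [('alice', '30'), ('bob', '25')]
print(ages)   # {'alice': 30, 'bob': 25}
```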

Iterating Over Matches

finditer() returns an iterator of match objects:

finditer() is memory-efficient for large texts, processing matches lazily.
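
Unlike findall(), each item from finditer() is a match object, so positions are available too. A small sketch on made-up text:

```python
import re

text = "cat bat rat"  # hypothetical sample text

# Each iteration yields a match object produced on demand.
hits = [(m.group(), m.start()) for m in re.finditer(r"[a-z]at", text)]

print(hits)  # [('cat', 0), ('bat', 4), ('rat', 8)]
```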

Named Groups with Iterators

Combine iterators with named groups for powerful extraction:

groupdict() returns a dictionary of named groups, perfect for creating structured data.
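
Here is a minimal sketch that turns matches into a list of dictionaries; the group names name and age and the input text are assumptions:

```python
import re

text = "alice=30, bob=25"  # hypothetical key=value text
pattern = re.compile(r"(?P<name>\w+)=(?P<age>\d+)")

# One dict per match, keyed by group name; values are still strings.
records = [m.groupdict() for m in pattern.finditer(text)]

print(records)
```

A list of dictionaries like this can be passed to tools that accept records, though numeric fields need converting from strings first.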

Pattern Substitution

re.sub() replaces matches with new text:

sub() accepts either a string or a function as the replacement. A function receives each match object and returns the replacement text.
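
Both replacement styles can be sketched as follows; the helper name double and the sample strings are hypothetical:

```python
import re

# String replacement: mask every phone-like number.
masked = re.sub(r"\d{3}-\d{4}", "XXX-XXXX", "Call 555-1234 or 555-9876")

# Function replacement: called once per match with the match object.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

print(masked)   # Call XXX-XXXX or XXX-XXXX
print(doubled)  # 6 apples and 20 pears
```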

Substitution with Named Groups

Named groups can be referenced in replacements:

Use \g<name> to reference named groups in replacement strings.
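
For example, a date can be reordered by referring to its named parts in the replacement string. The pattern and group names below are illustrative assumptions:

```python
import re

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"

# \g<name> inserts the text captured by the named group.
us_style = re.sub(pattern, r"\g<month>/\g<day>/\g<year>", "Due 2024-03-15")

# Numbered references (\1, \2, ...) work on the same groups.
numbered = re.sub(pattern, r"\2/\3/\1", "Due 2024-03-15")

print(us_style)  # Due 03/15/2024
print(numbered)  # Due 03/15/2024
```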

Compiling Regular Expressions

Compile patterns for better performance when reusing:

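Compiling avoids re-processing the pattern on each call and lets the pattern object carry its flags. The re module also caches recently used patterns, so the speedup is often modest; the clearer gain is a reusable, named pattern object.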
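
A short sketch of compiling once and reusing the pattern object; the names word_re and ci_python and the sample lines are hypothetical:

```python
import re

# Compile once, then reuse the object's methods on many inputs.
word_re = re.compile(r"\b\w{5}\b")  # whole words of exactly five characters
lines = ["hello world", "regex rocks", "hi"]
five_letter = [word_re.findall(line) for line in lines]

# Flags passed to compile() stay attached to the pattern object.
ci_python = re.compile(r"python", re.IGNORECASE)
found = ci_python.search("I like PYTHON")

print(five_letter)    # [['hello', 'world'], ['regex', 'rocks'], []]
print(found.group())  # PYTHON
```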

Regex Flags

Flags modify regex behavior:

Common flags: re.IGNORECASE, re.MULTILINE, re.DOTALL, re.VERBOSE.
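
The sketch below touches three of these flags on made-up inputs; the phone pattern is an illustrative assumption:

```python
import re

# re.IGNORECASE: letters match regardless of case.
ci = re.search(r"python", "I like PYTHON", re.IGNORECASE)

# re.DOTALL: "." also matches a newline.
no_dotall = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"a.b", "a\nb", re.DOTALL)

# re.VERBOSE: whitespace and # comments inside the pattern are ignored.
phone_re = re.compile(r"""
    \d{3}   # prefix
    -       # separator
    \d{4}   # line number
""", re.VERBOSE)
phone = phone_re.search("Call 555-1234")

print(ci.group(), no_dotall, with_dotall is not None, phone.group())
```

Flags can be combined with the | operator, for example re.IGNORECASE | re.MULTILINE.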

Multiline Pattern Matching

Handle multiline text with ^ and $ anchors:

re.MULTILINE makes ^ and $ match line boundaries, not just string boundaries.
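
To illustrate the difference, this sketch runs the same anchored pattern with and without re.MULTILINE on a made-up three-line string:

```python
import re

text = "alpha\nbeta\ngamma"  # hypothetical multiline text

# Without the flag, ^ matches only at the start of the whole string.
first_only = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches right after each newline.
every_line = re.findall(r"^\w+", text, re.MULTILINE)

# Likewise, $ matches only at the end of the string by default.
last_only = re.findall(r"\w+$", text)

print(first_only)  # ['alpha']
print(every_line)  # ['alpha', 'beta', 'gamma']
print(last_only)   # ['gamma']
```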

Practical Example: Log Parsing

Extract structured data from log files:

Regular expressions excel at extracting structured information from semi-structured text.
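
One possible sketch of log parsing follows. The log format (date, time, level, message), the group names, and the sample lines are all assumptions made for illustration, not a standard format:

```python
import re

# Hypothetical log lines in an assumed "date time LEVEL message" layout.
log_lines = [
    "2024-03-15 10:23:45 ERROR Disk full",
    "2024-03-15 10:24:01 INFO Backup started",
    "malformed line",
]

log_re = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)

entries = []
for line in log_lines:
    m = log_re.match(line)
    if m:  # skip lines that do not fit the assumed format
        entries.append(m.groupdict())

errors = [e["message"] for e in entries if e["level"] == "ERROR"]

print(len(entries), errors)  # 2 ['Disk full']
```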

Practical Example: Data Validation

Validate data formats like phone numbers:

Regular expressions provide robust input validation for data pipelines.
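
A minimal validation sketch, assuming a single NNN-NNN-NNNN phone format; the helper name is_valid_phone is hypothetical, and real phone formats vary widely:

```python
import re

# Assumed format: three digits, three digits, four digits, hyphen-separated.
phone_re = re.compile(r"\d{3}-\d{3}-\d{4}")

def is_valid_phone(value):
    # fullmatch() requires the whole string to fit, unlike match() or search().
    return phone_re.fullmatch(value) is not None

print(is_valid_phone("555-123-4567"))      # True
print(is_valid_phone("555-1234"))          # False
print(is_valid_phone("555-123-4567 ext"))  # False
```

Using fullmatch() matters for validation: match() would accept the third input because it only anchors the start of the string.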

Practical Example: Data Extraction

Extract specific data from mixed text:

Combine regex with data processing for powerful text analytics.
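
As one illustration of extraction followed by processing, the sketch below pulls dollar amounts out of made-up text and summarizes them:

```python
import re

# Hypothetical sample text mixing words and prices.
text = "Apples cost $1.50, pears cost $2.25 and plums cost $0.75."

# The group captures the number after a literal dollar sign.
prices = [float(p) for p in re.findall(r"\$(\d+\.\d{2})", text)]

total = sum(prices)
highest = max(prices)

print(prices, total, highest)  # [1.5, 2.25, 0.75] 4.5 2.25
```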

Summary

Pattern matching is essential for data processing in Python. Simple string methods handle basic cases, while regular expressions provide powerful, flexible pattern matching for complex scenarios. Use in, startswith(), and endswith() for simple checks. Use the re module for structured patterns like emails, phone numbers, or log entries.

Key takeaways:

String methods: fast and simple for basic patterns
re.match(): anchored to string start
re.search(): finds first match anywhere
re.findall(): returns all matches as list
re.finditer(): lazy iteration over matches
Compile patterns for reuse
Named groups improve code readability
Use re.sub() for pattern-based replacement

Regular expressions are a superpower for data cleaning, validation, and extraction in data science workflows.

