【Introduction to Python Standard Library Part 5】The Partner for Tabular Data Processing! Manipulate CSV Files Freely with the csv Module 📊 #15

Welcome to Part 5 of our "Introduction to Python Standard Library" series! So far, we've explored modules for handling datetime, randomness, JSON data, and file system operations. Today, we're focusing on a format you'll encounter constantly when dealing with tabular data: CSV. We'll learn how to master it with Python's built-in csv module.

CSV (Comma-Separated Values) is a universal format for storing spreadsheet-like data. It's how you export data from applications like Excel or Google Sheets, how databases often dump report data, and a common format for datasets in data science. While it might seem simple enough to just split lines by commas, real-world CSV files can have complexities like commas within data fields or different line endings. The csv module is designed to handle these edge cases robustly, making it your reliable partner for all things CSV.

In this guide, we'll cover how to read from and write to CSV files using two different approaches: one based on lists and a more powerful, recommended approach based on dictionaries.


What is a CSV File?

CSV stands for Comma-Separated Values. It's a plain-text file format used to store tabular data. The structure is simple:

  • Each line in the file represents a row of data.
  • Within each row, values (or fields) are separated by a comma.
  • The first line is often a "header row," which contains the names of the columns.

A Simple CSV Example:

If you opened a file named employees.csv in a text editor, it might look like this:

name,department,employee_id,start_date
"Alice Smith",Engineering,101,2022-03-15
"Bob Johnson",Marketing,102,2021-08-20
"Charlie Brown, Jr.",HR,103,2023-01-10

Notice that "Charlie Brown, Jr." is enclosed in double quotes. This is how CSV format handles values that contain a comma. The `csv` module understands these quoting rules automatically, which is why it's so much better than manually splitting strings!


Reading CSV Files with the `csv` Module

Let's dive into reading data from a CSV file. First, let's make sure we have the module imported:

import csv

Crucial Note: open(..., newline='')
When you open a CSV file for reading or writing with Python's `csv` module, you should always specify the `newline=''` argument in the `open()` function. This prevents the `csv` module and your system's text handling from creating extra blank rows in your output. It's a small but vital detail!

1. The List-based Approach: csv.reader()

The csv.reader() function returns a reader object that iterates over lines in the given CSV file. Each row is returned as a list of strings.

# Let's assume we have the 'employees.csv' file from above
try:
    with open('employees.csv', mode='r', newline='', encoding='utf-8') as f:
        csv_reader = csv.reader(f)
        
        # The reader is an iterator. We can get the header row first.
        header = next(csv_reader)
        print(f"Header: {header}")

        # Now, iterate over the rest of the rows
        print("Data rows as lists:")
        for row in csv_reader:
            # Each 'row' is a list of strings
            print(row)
            # If you need numbers, you must convert them manually
            employee_id = int(row[2])
            print(f"  -> Employee {row[0]} has ID: {employee_id}")

except FileNotFoundError:
    print("employees.csv not found. Please create it first.")

This approach works, but you have to remember the index of each column (e.g., `row[0]` for name, `row[2]` for ID), which can be error-prone if the column order ever changes.

2. The Dictionary-based Approach: csv.DictReader() (Recommended)

A much more robust and readable approach is to use csv.DictReader(). It reads each row into a dictionary, using the values from the first (header) row as the keys. (Technically, it's an `OrderedDict` in older Python versions, but it behaves just like a regular `dict` for our purposes).

try:
    with open('employees.csv', mode='r', newline='', encoding='utf-8') as f:
        csv_dict_reader = csv.DictReader(f)

        print("\nData rows as dictionaries:")
        for row in csv_dict_reader:
            # Each 'row' is a dictionary
            print(row)
            # Access data by column name - much more readable!
            print(f"  -> Employee {row['name']} works in {row['department']}.")
except FileNotFoundError:
    print("\nemployees.csv not found. Please create it first.")

Using DictReader is highly recommended as your code becomes self-documenting and resilient to changes in column order.


Writing CSV Files with the `csv` Module

Now let's look at creating our own CSV files. Remember to use mode='w' and newline='' when opening the file.

1. The List-based Approach: csv.writer()

The csv.writer() returns a writer object responsible for converting the user’s data into delimited strings.

  • writer.writerow(row): Writes a single row from a list.
  • writer.writerows(rows): Writes multiple rows from a list of lists.
data_to_write_list = [
    ['product_id', 'product_name', 'price'],
    ['P101', 'Laptop', 1200],
    ['P102', 'Mouse', 25],
    ['P103', 'Keyboard', 75]
]

with open('products_list.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # Write all rows at once
    writer.writerows(data_to_write_list) 
    # Or write one by one:
    # writer.writerow(data_to_write_list[0]) # Header
    # writer.writerow(data_to_write_list[1]) # First data row...
print("\nproducts_list.csv has been created.")

2. The Dictionary-based Approach: csv.DictWriter() (Recommended)

To write data from dictionaries, you use csv.DictWriter(). This requires you to specify the column headers (fieldnames) when you create the writer.

  • writer.writeheader(): Writes the header row to the file based on the `fieldnames`.
  • writer.writerow(row_dict): Writes a single row from a dictionary.
  • writer.writerows(rows_list_of_dicts): Writes multiple rows from a list of dictionaries.
data_to_write_dict = [
    {'student_id': 'S001', 'name': 'Alice', 'grade': 88},
    {'student_id': 'S002', 'name': 'Bob', 'grade': 92},
    {'student_id': 'S003', 'name': 'Charlie', 'grade': 76}
]

# Define the field names (column headers)
field_names = ['student_id', 'name', 'grade']

with open('grades.csv', 'w', newline='', encoding='utf-8') as f:
    dict_writer = csv.DictWriter(f, fieldnames=field_names)
    
    # Write the header row
    dict_writer.writeheader()
    
    # Write all the data rows
    dict_writer.writerows(data_to_write_dict)

print("grades.csv has been created.")

The `DictWriter` ensures that the values are written to the correct columns, even if the dictionaries in your list have keys in a different order.


Handling Different CSV "Dialects"

While "CSV" implies commas, some files use other characters as delimiters, like tabs (\t) or semicolons (;). The `csv` module handles this easily with the delimiter parameter.

# Example with a semicolon-delimited string
semicolon_data = 'fruit;color;quantity\napple;red;5\nbanana;yellow;10'

# Use io.StringIO to treat the string like a file for this example
import io
with io.StringIO(semicolon_data) as f:
    reader = csv.reader(f, delimiter=';')
    for row in reader:
        print(row)

The `csv` module also has parameters like quotechar (to specify the character used for quoting, e.g., ') and `quoting` (to control when quotes are used), but for most standard CSV files, the defaults work perfectly.


For Complex Data Analysis: A Note on `pandas`

The csv module is lightweight and perfect for straightforward reading and writing of CSV data. However, if your task involves more complex operations like:

  • Data cleaning and manipulation (handling missing values, filtering rows, transforming columns)
  • Statistical analysis
  • Working with very large datasets that may not fit into memory easily

...then the third-party pandas library is the industry standard. It provides a powerful DataFrame object and can read a CSV into it with a single, highly optimized command:

# This requires the pandas library to be installed: pip install pandas
import pandas as pd

# df = pd.read_csv('employees.csv')
# print(df)

For now, think of `csv` as your go-to for simple, script-based CSV I/O, and keep `pandas` in mind for when your data analysis needs grow.


Practical Example: Processing a Product Inventory

Let's combine what we've learned. We'll write an inventory to a CSV file, read it back, and then perform a simple calculation.

import csv

# 1. Data to be written
inventory_data = [
    {'item': 'Laptop', 'quantity': 15, 'unit_price': 1200},
    {'item': 'Mouse', 'quantity': 150, 'unit_price': 25},
    {'item': 'Monitor', 'quantity': 40, 'unit_price': 300},
]
inventory_filename = 'inventory.csv'
field_names = ['item', 'quantity', 'unit_price']

# 2. Write the data to a CSV file
with open(inventory_filename, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=field_names)
    writer.writeheader()
    writer.writerows(inventory_data)
print(f"\nInventory data saved to {inventory_filename}")

# 3. Read the data back and process it
total_inventory_value = 0
try:
    with open(inventory_filename, 'r', newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        print("\n--- Calculating Inventory Value ---")
        for row in reader:
            item_name = row['item']
            # Remember to convert string values to numbers!
            quantity = int(row['quantity'])
            price = float(row['unit_price'])
            
            item_total_value = quantity * price
            total_inventory_value += item_total_value
            print(f"Value of {item_name}s: ${item_total_value:,.2f}")
    
    print(f"\nTotal value of all inventory: ${total_inventory_value:,.2f}")
except FileNotFoundError:
    print(f"Error: {inventory_filename} not found.")

This practical example demonstrates the full cycle: writing with DictWriter, reading with DictReader, and processing the results, including the crucial step of converting string data to numeric types.


Conclusion: Your Go-To for Tabular Data

The csv module is a simple, robust, and essential tool in the Python Standard Library for handling one of the most common data formats you'll ever encounter. You've now learned:

  • How to read CSV data into lists with csv.reader() and into dictionaries with csv.DictReader().
  • How to write data from lists with csv.writer() and from dictionaries with csv.DictWriter().
  • The critical importance of using newline='' when opening CSV files.
  • The benefits of the dictionary-based approach for creating readable and maintainable code.

The next time you need to export data from a spreadsheet or process a data dump, you'll be well-equipped to handle it with Python!

In our next installment, we'll continue our exploration of the standard library. Perhaps we'll dive into the world of regular expressions with re or explore system-specific parameters with sys. See you then!

Post Index


コメント

このブログの人気の投稿

Post Index

【Introduction to Python Standard Library Part 3】The Standard for Data Exchange! Handle JSON Freely with the json Module #13

Your First Step into Python: A Beginner-Friendly Installation Guide for Windows #0