Tutorial Use Case 01: Developing code for Loading/Processing CSV Files

Dusan_Perkin · February 13, 2024, 3:35pm

Tutorial Use Case 01: Developing code for Loading/Processing CSV Files

Today, we’ll introduce you to Emily, a data analyst at DataTech Solutions. Emily’s role involves processing large datasets to extract insights for business decision-making.

Emily’s Dilemma: Emily receives numerous CSV files from various departments within the company, each containing valuable data for analysis. However, manually processing these files one by one is time-consuming and prone to errors. Emily needs a more efficient way to automate the file processing tasks.

Your Mission: Your mission is to assist Emily in automating the file processing tasks using Agents. By creating Agents to monitor designated folders for new CSV files, extract, transform, and load (ETL) the data into new files, you’ll help Emily streamline her data analysis workflow and improve productivity.

Getting Started: We’ll guide you through the process step by step, using OpenFlow’s intuitive tools and resources. No prior programming knowledge is required, and you’ll have everything you need to succeed, including Visual Studio Code with the OpenFlow extension and access to the OpenFlow Control Room.

Course Overview:

Getting Started with Agents and OpenFlow

In this introductory module, you will learn the basics of preparing environments for creating Packages for your Agents using Visual Studio Code, NodeJS, and Python. Guided setup of OpenIAP assistant and initialization of a new project in Visual Studio Code.

Initializing Your Project

In this tutorial, we’ll begin by creating a project folder for our packages for OpenFlow Agents project and setting up its initial structure.

Understanding Emily’s Workflow

Before diving into the technical aspects of automation, it’s essential to understand Emily’s workflow at DataTech Solutions. You’ll gain insights into the challenges she faces with manual file processing tasks and the impact it has on her productivity. By understanding Emily’s workflow, we can identify opportunities for automation and tailor our solution to meet her specific needs.

Running and Debugging Your Code

In this module, we’ll set up folder monitoring Agents to automate the process of monitoring designated folders for new CSV files. You’ll learn how to install and configure the necessary tools, including Visual Studio Code and the OpenFlow extension. We’ll then walk through the process of configuring Agents to monitor specific folders and define monitoring parameters such as file types, directories, and file naming conventions.

Deploying Your Code as a Package

In this tutorial, you’ve learned how to package your code and publish it to an OpenIAP instance for deployment as an agent.

Support and Feedback: If you encounter any challenges or have feedback to share, our community is here to help. Join us on the OpenFlow channel and #help-developer-training for assistance and collaboration.

Let’s Get Started: Join Emily in revolutionizing data analysis at DataTech Solutions. Together, we’ll automate file processing tasks and unlock new insights from valuable datasets. Are you ready? Let’s dive in!
________________________________________

Developing code for Loading/Processing CSV Files:

!! Before we starting writing code in the main.py file we need to update the conda.yaml file with the following code:

channels:
  # Define conda channels here.
  - conda-forge

dependencies:
  # Define conda-forge packages here -> https://anaconda.org/search
  # When available, prefer the conda-forge packages over pip as installations are more efficient.
  - python=3.9.13               # https://pyreadiness.org/3.9/ 
  - pip=22.3                    # https://pip.pypa.io/en/stable/news/
  #- truststore=0.7.0              # https://github.com/sethmlarson/truststore/blob/main/CHANGELOG.md
  - pip:
      # Define pip packages here -> https://pypi.org/
      - openiap
      - pandas
      - matplotlib

This tutorial will guide you through a Python script that reads sales data from a CSV file, analyses it, and generates a summary report along with visualizations for further analysis. Let’s break down the code step by step.

To write the summary of sales data, including total revenue, top-selling products, and customer demographics, into a new file for reporting with dashboards, you can use Python’s built-in CSV module to create a new CSV file. Additionally, you may want to use libraries like pandas and matplotlib to create visualizations for the dashboard.

This code performs data analysis on sales data stored in a CSV file (‘SalesData.csv’). Here’s how you can do it:

Importing Required Libraries :

import openiap, asyncio, csv, os
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
from openiap import Client

Explanation:

The csv module is imported to handle reading and writing CSV files.
The Counter class from the collections module is imported to count occurrences of elements.
pandas is imported with the alias pd for data manipulation and analysis.
matplotlib.pyplot is imported with the alias plt for creating visualizations.

Reading Sales Data from CSV File :

sales_data = []
with open('your_abs_path/Example_code_tutorial_02/Input files/SalesData.csv', 'r',encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile,delimiter=';')
        for row in reader:
            sales_data.append(row)

Explanation:

A list named sales_data is initialized to store the data read from the CSV file.
The ‘SalesData.csv’ file is opened in read mode using a context manager.
The csv.DictReader is used to read the CSV file and iterate over its rows as dictionaries.
Each row (dictionary) is appended to the sales_data list.

Calculating Total Revenue :

total_revenue = sum(float(row['Amount']) for row in sales_data)

Explanation:

The sum function is used to calculate the total revenue by summing the ‘Amount’ column from each row in sales_data.
A generator expression is used to extract the ‘Amount’ value from each row as a floating-point number.

Identifying Top-Selling Products :

product_sales = Counter(row['Product'] for row in sales_data)
top_selling_products = product_sales.most_common(5)  # Get top 5 products

Explanation:

The Counter class is used to count occurrences of each product in the ‘Product’ column of sales_data.
The most_common method is called to retrieve the top 5 selling products along with their counts.

Analyzing Customer Demographics :

customer_demographics = {
    'age': Counter(),
    'gender': Counter(),
    'location': Counter()
}
for row in sales_data:
    customer_demographics['age'][row['Age']] += 1
    customer_demographics['gender'][row['Gender']] += 1
    customer_demographics['location'][row['Location']] += 1

Explanation:

A dictionary named customer_demographics is created with keys for ‘age’, ‘gender’, and ‘location’, each initialized with an empty Counter object.
A loop iterates over each row in sales_data and updates the corresponding Counter objects in customer_demographics based on age, gender, and location information.

Writing Summary Data to a New CSV File :

output_folder = 'your_abs_path\Example_code_tutorial_02\Output'
os.makedirs(output_folder, exist_ok=True)
# Write summary data to a new CSV file
with open(f'{output_folder}/SalesSummary.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=';')
        
    # Write total revenue
    writer.writerow(['Total Revenue', total_revenue])
    writer.writerow([])  # Blank row
        
    # Write top-selling products
    writer.writerow(['Top Selling Products'])
    writer.writerow(['Product', 'Quantity'])  # Header row
    for product, quantity in top_selling_products:
        writer.writerow([product, quantity])
    writer.writerow([])  # Blank row
        
    # Write customer demographics
    writer.writerow(['Customer Demographics'])
        
    # Write Age demographics
    writer.writerow(['Age', 'Count'])  # Header row
    for age, count in sorted(customer_demographics['age'].items()):
        writer.writerow([age, count])
        
    # Write Gender demographics
    writer.writerow(['Gender', 'Count'])  # Header row
    for gender, count in customer_demographics['gender'].items():
        writer.writerow([gender, count])
        
    # Write Location demographics
    writer.writerow(['Location', 'Count'])  # Header row
    for location, count in customer_demographics['location'].items():
        writer.writerow([location, count])

Explanation:

A new CSV file named ‘SalesSummary.csv’ is opened in write mode using a context manager.
Data such as total revenue, top-selling products, and customer demographics are written to the CSV file using the csv.writer object.
Separate sections are delineated with blank rows for better readability.

Creating Visualizations for the Dashboard:

# Create visualizations for the dashboard
# Age distribution
age_df = pd.DataFrame.from_dict(customer_demographics['age'], orient='index', columns=['Count'])
age_df.plot(kind='bar', xlabel='Age', ylabel='Count', title='Age Distribution')
plt.savefig(f'{output_folder}/age_distribution.png')
plt.clf()  # Clear the plot

# Gender distribution
gender_df = pd.DataFrame.from_dict(customer_demographics['gender'], orient='index', columns=['Count'])
gender_df.plot(kind='bar', xlabel='Gender', ylabel='Count', title='Gender Distribution')
plt.savefig(f'{output_folder}/gender_distribution.png')
plt.clf()  # Clear the plot

# Location distribution
location_df = pd.DataFrame.from_dict(customer_demographics['location'], orient='index', columns=['Count'])
location_df.plot(kind='bar', xlabel='Location', ylabel='Count', title='Location Distribution')
plt.savefig(f'{output_folder}/location_distribution.png')

Explanation:

Data from the customer_demographics dictionary is converted into pandas DataFrame objects for creating visualizations.
Matplotlib is used to create bar charts for age, gender, and location distributions.
Each visualization is saved as an image file using plt.savefig, and the plot is cleared using plt.clf() to prepare for the next visualization.

__________________________________

Whole code:

import openiap, asyncio, csv, os
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
from openiap import Client

    # cli.Close()
async def main():

    # Read sales data from CSV file
    sales_data = []
    with open('your_abs_path/Example_code_tutorial_02/Input files/SalesData.csv', 'r',encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile,delimiter=';')
        for row in reader:
            print(row.keys())  # Print keys of each row
            print(row.values())  # Print keys of each row
            sales_data.append(row)

    # Calculate total revenue
    total_revenue = sum(float(row['Amount']) for row in sales_data)

    # Identify top-selling products
    product_sales = Counter(row['Product'] for row in sales_data)
    top_selling_products = product_sales.most_common(5)  # Get top 5 products

    # Analyze customer demographics
    customer_demographics = {
        'age': Counter(),
        'gender': Counter(),
        'location': Counter()
    }
    for row in sales_data:
        customer_demographics['age'][row['Age']] += 1
        customer_demographics['gender'][row['Gender']] += 1
        customer_demographics['location'][row['Location']] += 1

    output_folder = 'your_abs_path\Example_code_tutorial_02\Output'
    os.makedirs(output_folder, exist_ok=True)
    # Write summary data to a new CSV file
    with open(f'{output_folder}/SalesSummary.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=';')
        
        # Write total revenue
        writer.writerow(['Total Revenue', total_revenue])
        writer.writerow([])  # Blank row
        
        # Write top-selling products
        writer.writerow(['Top Selling Products'])
        writer.writerow(['Product', 'Quantity'])  # Header row
        for product, quantity in top_selling_products:
            writer.writerow([product, quantity])
        writer.writerow([])  # Blank row
        
        # Write customer demographics
        writer.writerow(['Customer Demographics'])
        
        # Write Age demographics
        writer.writerow(['Age', 'Count'])  # Header row
        for age, count in sorted(customer_demographics['age'].items()):
            writer.writerow([age, count])
        
        # Write Gender demographics
        writer.writerow(['Gender', 'Count'])  # Header row
        for gender, count in customer_demographics['gender'].items():
            writer.writerow([gender, count])
        
        # Write Location demographics
        writer.writerow(['Location', 'Count'])  # Header row
        for location, count in customer_demographics['location'].items():
            writer.writerow([location, count])


    # Create visualizations for the dashboard
    # Age distribution
    age_df = pd.DataFrame.from_dict(customer_demographics['age'], orient='index', columns=['Count'])
    age_df.plot(kind='bar', xlabel='Age', ylabel='Count', title='Age Distribution')
    plt.savefig(f'{output_folder}/age_distribution.png')
    plt.clf()  # Clear the plot

    # Gender distribution
    gender_df = pd.DataFrame.from_dict(customer_demographics['gender'], orient='index', columns=['Count'])
    gender_df.plot(kind='bar', xlabel='Gender', ylabel='Count', title='Gender Distribution')
    plt.savefig(f'{output_folder}/gender_distribution.png')
    plt.clf()  # Clear the plot

    # Location distribution
    location_df = pd.DataFrame.from_dict(customer_demographics['location'], orient='index', columns=['Count'])
    location_df.plot(kind='bar', xlabel='Location', ylabel='Count', title='Location Distribution')
    plt.savefig(f'{output_folder}/location_distribution.png')

asyncio.run(main())

Allan_Zimmermann · February 13, 2024, 5:22pm

you should recommend people to use conda.yaml files and not requirements.txt

requirements.txt is a legacy from when every package was sharing the same python environment.
It’s much better to use dedicated environments using conda.yaml ( and for that reason using requirements.txt will be removed in the near future )

See Agents Page | OpenIAP Documentation for more details

Dusan_Perkin · February 13, 2024, 5:43pm

Thanks for the comment! I misunderstood the documentation, so I used only requirements.
It’s now updated!

How do you like tutorial?
And is there way to upload zip file of the project so anyone can download it, or maybe I should send when someone ask me through email? I have tried but it does not allowed all kind of extensions.