TP 1: Formatting and Sanitizing

Note: you do not have write access to this repo. One teammate should create a fork (a remote copy under your username, in GitHub terms) for your group to work on. Check this tutorial to learn about forks.

Instructions

Context

For this first practical assignment, we'll be working with data from Montpellier métropole. You will be downloading the raw "Defibrillators of Montpellier" dataset (the one you've played with in CodinGame).

Requirements

In groups, you will write a loader script that creates the clean (formatted + sanitized + framed) dataframe. From the original raw data, you want to extract a dataframe containing the information provided in the "Defibrillators" CodinGame challenge, plus some extra information:

Name
Address (including postal code and city name)
Contact phone number (one number)
Latest maintenance date
Maintenance frequency
Longitude (in degrees)
Latitude (in degrees)

Deliverables

A fork of this git repo with

specification of each test sample in loader_test.py.
a working version of loader.py made by your group, with comments on your cleaning choices.

Technical requisites

Throughout this assignment you'll use multiple pandas functionalities, such as:

dropping useless columns
converting string types to numeric or datetime types
manipulating strings with .str (aka the string accessor)
checking strings for patterns using regular expressions
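
As an illustration of these four operations, here is a minimal sketch on a made-up two-row extract. The column names (nom, tel, lat, date_maint, unused) and the values are assumptions invented for this example; they are not the columns of the real Montpellier dataset.

```python
import io
import pandas as pd

# Hypothetical raw extract: column names and values are invented for illustration.
raw_csv = io.StringIO(
    "nom,tel,lat,date_maint,unused\n"
    "Mairie,04 67 12 34 56,43.61,2021-03-15,x\n"
    "Piscine,+33 4 67 65 43 21,not_a_number,15/03/2021,y\n"
)

# Read everything as strings first, so that conversions are explicit.
df = pd.read_csv(raw_csv, dtype=str)

# 1. Drop a column that carries no useful information.
df = df.drop(columns=["unused"])

# 2. Convert strings to numeric / datetime types.
#    errors="coerce" turns unparseable values into NaN / NaT instead of raising.
df["lat"] = pd.to_numeric(df["lat"], errors="coerce")
df["date_maint"] = pd.to_datetime(df["date_maint"], format="%Y-%m-%d", errors="coerce")

# 3. Manipulate strings with the .str accessor: remove spaces from phone numbers.
df["tel"] = df["tel"].str.replace(" ", "", regex=False)

# 4. Check strings against a regular expression: ten digits starting with 0.
df["tel_ok"] = df["tel"].str.fullmatch(r"0\d{9}")
```

Coercing bad values to NaN or NaT is only one possible choice; whether to drop, repair, or keep such rows is a cleaning decision that your group should document in loader.py.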

To learn how to do this in Pandas, you may refer to the following documents:

Some useful examples applied to the data at hand are given in the examples directory.

Methodology

This assignment will also be an opportunity to practice some test-oriented development. To build the loading script, you will proceed with the following steps:

Test-oriented problem specification
Implementation of the cleaning functions
Automatic validation with tests

Step 1. Test-oriented problem specification

Prepare some test cases that ensure that your cleaning functions work as expected:

Select a handful of examples (between 10 and 15) from your data that cover the problems you have identified. Based on these examples, compose a sample dirty data file (sample_dirty.csv) that will serve as a testing reference. The teacher has already done this step; the result is in the data directory.
Manually compose a pd.DataFrame that corresponds to the sample_dirty data loaded with the correct types/formats (sample_formatted). Only the pertinent columns should be loaded (useless columns must not be read). This is already specified in loader_test.py.
Manually compose a pd.DataFrame in which all the sanity problems are fixed (sample_sanitized). This is already specified in loader_test.py.
Manually compose a pd.DataFrame framed as requested (sample_framed). Column renaming or merging should be performed at this step. You must complete this case in loader_test.py.

These test cases should be specified in the file loader_test.py where indicated.
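
To make the idea of a manually composed reference concrete, the sketch below writes an expected dataframe by hand and compares it with what pandas actually loads. The two-row sample and its columns (name, lat) are invented for illustration and do not reflect the real sample_dirty.csv.

```python
import io
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical two-row dirty sample (invented columns and values).
dirty_csv = "name,lat\nMairie,43.61\nPiscine,43.59\n"

# Expected result, written out by hand: this plays the role of sample_formatted.
sample_formatted = pd.DataFrame(
    {"name": ["Mairie", "Piscine"], "lat": [43.61, 43.59]}
)

# What the loading step actually produces.
loaded = pd.read_csv(io.StringIO(dirty_csv))

# Raises AssertionError with a readable diff if values, dtypes or columns differ.
assert_frame_equal(loaded, sample_formatted)
```

assert_frame_equal checks dtypes as well as values, so the hand-written reference also pins down the expected types, not only the expected content.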

Step 2. Implementation of the cleaning functions

Your loading function should be in a script loader.py. It should be as modular as possible. The current loader.py gives an example modularization template:

...

def load_formatted_data(data_fname:str) -> pd.DataFrame:
    """ One function to read csv into a dataframe with appropriate types/formats.
        Note: read only pertinent columns, ignore the others.
    """
    ...
    return df

def sanitize_data(df:pd.DataFrame) -> pd.DataFrame:
    """ One function to do all sanitizing"""
    ...
    return df

def frame_data(df:pd.DataFrame) -> pd.DataFrame:
    """ One function all framing (column renaming, column merge)"""
    ...
    return df


def load_clean_data(data_path:str=DATA_PATH)-> pd.DataFrame:
    """one function to run it all and return a clean dataframe"""
    df = (load_formatted_data(data_path)
          .pipe(sanitize_data)
          .pipe(frame_data)
    )
    return df

...
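
To show how one step of this template might be filled in, here is a minimal sketch of a framing function that merges address parts and renames a column, chained with .pipe as in load_clean_data. The function name frame_data_sketch and the input columns (nom, street, postal_code, city) are hypothetical; the real dataset and your own sanitizing step will dictate the actual names.

```python
import pandas as pd

def frame_data_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical framing step: merge address parts, rename, keep target columns."""
    df = df.copy()  # avoid modifying the caller's dataframe
    # Column merge: build one address string from its parts.
    df["address"] = df["street"] + ", " + df["postal_code"] + " " + df["city"]
    # Column renaming: switch to the requested output name.
    df = df.rename(columns={"nom": "name"})
    # Keep only the framed columns, in the requested order.
    return df[["name", "address"]]

# Invented one-row input standing in for the output of the sanitizing step.
sanitized = pd.DataFrame({
    "nom": ["Mairie"],
    "street": ["1 rue Exemple"],
    "postal_code": ["34000"],
    "city": ["Montpellier"],
})

framed = sanitized.pipe(frame_data_sketch)
```

Because each step takes a dataframe and returns a dataframe, the steps can be tested separately against sample_formatted, sample_sanitized and sample_framed.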

Step 3. Automatic validation with tests

Since the problem is defined in terms of test cases, the task is complete when every test case passes. The loader_test.py file contains test functions that check each of them (assuming you have implemented all the test cases). You can run them using pytest.
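
For reference, a pytest-style test is just a function whose name starts with test_ and whose body makes assertions. The sketch below uses an invented sanitizing helper (sanitize_phone_sketch) and an invented tel column; the actual tests in loader_test.py may be organized differently.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def sanitize_phone_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sanitizing step: remove spaces from the phone column."""
    return df.assign(tel=df["tel"].str.replace(" ", "", regex=False))

def test_sanitize_phone_sketch():
    # Dirty input and hand-written expected output, both invented for this example.
    dirty = pd.DataFrame({"tel": ["04 67 00 00 00"]})
    expected = pd.DataFrame({"tel": ["0467000000"]})
    # The test passes if no AssertionError is raised.
    assert_frame_equal(sanitize_phone_sketch(dirty), expected)
```

Running pytest from the repository root collects files named *_test.py or test_*.py, so loader_test.py is picked up automatically, and each test_ function is reported as passed or failed.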

EPF-MDE/data-clean-TP1

Folders and files

Latest commit

History

Repository files navigation

TP 1 : Formatting and Sanitizing -

Instructions

Context

Requirements

Deliverables

Technical requisites

Methodology

Step 1. Test-oriented problem specification

Step 2. Implementation of the cleaning functions

Step 3. Automatic validation with tests

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages