twhay/data-wrangling-toolkit

The Data Wrangling Toolkit

Preamble

Data strategists (the overlap in the Venn diagram of data analysts, data engineers, and data scientists) constantly grapple with the challenge of preparing and cleaning their data before they can use it in analyses.

While the days of spending up to 80% of one's time uploading and cleaning data are largely gone, Anaconda's 2022 State of Data Science survey found that data strategists self-report spending 22% of their time preparing data and 16% of their time cleaning data. A plurality of a data professional's time is therefore still spent on work upstream of the analysis that their organizations rely on. Fortunately, advances in tools and techniques mean that data strategists have more options than ever to load, profile, and clean their data, a process often referred to as data wrangling or data munging.

Having an understanding of the full scope of these tools provides a data strategist with options to decrease the time spent on data prep and cleaning.
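As a rough illustration of the load, profile, and clean steps mentioned above, here is a minimal sketch using only Python's standard library. The sample data and the function names (`load`, `profile`, `clean`) are hypothetical and invented for this example; they are not taken from this repository's tutorials, and real projects may well prefer a dedicated tool.

```python
import csv
import io

# Hypothetical sample data, invented for illustration only.
RAW_CSV = """name,age,city
Ada,36,London
 Grace ,,New York
Ada,36,London
Linus,28,
"""


def load(text):
    """Load: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def profile(rows):
    """Profile: count rows and blank values per column."""
    columns = list(rows[0].keys()) if rows else []
    missing = {
        col: sum(1 for row in rows if not (row[col] or "").strip())
        for col in columns
    }
    return {"rows": len(rows), "missing": missing}


def clean(rows):
    """Clean: trim whitespace, turn blanks into None, drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        tidy = {key: ((value or "").strip() or None) for key, value in row.items()}
        fingerprint = tuple(tidy.items())
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(tidy)
    return cleaned


rows = load(RAW_CSV)
report = profile(rows)
cleaned = clean(rows)
print(report)
print(cleaned)
```

Profiling before cleaning shows how many blanks and rows exist up front, which makes it easier to check that the cleaning step changed only what was intended.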

This repository serves as both a reference document and a tutorial that shows data strategists how to use the tool or tools they choose for data wrangling. It will cover tools as varied as spreadsheet applications (Excel and Google Sheets), R, Python, Stata, common BI tools (Tableau, Looker, and Power BI), the computer’s command line/terminal (Unix-based shells and PowerShell), and other proprietary software that aims to solve data wrangling challenges.

Table of Contents

  1. Comma Separated Values
  2. Spreadsheets
  3. Portable Document Format (PDF)
  4. Relational Databases
  5. Websites

Short meta-notes on building a resource like this

I took inspiration from work done on the ConsenSys Academy Basic Training and Blockchain Developer Bootcamp courses, which in turn took a lot of inspiration from many other sources. As part of that work, I personally referenced Adam Pritchard's Markdown Cheatsheet and the Writing on GitHub documentation. I also had to dive into the GitHub Flavored Markdown Spec in order to get some questions answered about specific formatting.

Because I seem to be incapable of remembering basic GitHub commands, I constantly referenced Pushing commits to a remote repository, and I would recommend that anyone working with GitHub work through the Getting started with GitHub documentation. Also, Stack Overflow (embarrassing examples of actual searches here, here, here, and here) was helpful as always.

