Data strategists (the Venn diagram of data analysts, data engineers, and data scientists) constantly grapple with the challenge of preparing and cleaning their data before they can use it in analyses.
While the days of spending up to 80% of time uploading and cleaning data are largely gone, Anaconda's 2022 State of Data Science Survey revealed that data strategists self-report spending 22% of their time preparing data and 16% of their time cleaning data, meaning that a plurality of a data professional's time (38% combined) is still spent on work upstream of the analysis that their organizations rely on. Fortunately, advancements in the tools and techniques for loading, profiling, and cleaning data (work often referred to as data wrangling or data munging) mean that data strategists have more options than ever.
Having an understanding of the full scope of these tools provides a data strategist with options to decrease the time spent on data prep and cleaning.
This repository serves as both a reference document and a tutorial for data strategists on how to use the tool or tools they choose for data wrangling. It covers tools as varied as spreadsheet applications (Excel and Google Sheets), R, Python, Stata, common BI tools (Tableau, Looker, and Power BI), the computer’s command line/terminal (Unix-based and PowerShell), and other proprietary software that aims to solve data wrangling challenges.
- Comma Separated Values
- Spreadsheets
- Portable Document Format (PDF)
- Relational Databases
- Websites
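As a rough illustration of the load, profile, and clean steps for the first format in the list above (comma-separated values), here is a minimal sketch using only Python's standard library. The sample data, column names, and the helper functions `load_rows`, `profile`, and `clean` are all hypothetical and exist only for this example; they are not part of this repository's tooling.

```python
import csv
import io

# Hypothetical sample data; a real project would read from a file instead.
RAW_CSV = """name,city,amount
 Alice ,Boston,10.5
Bob,,7
 Alice ,Boston,10.5
"""


def load_rows(text):
    """Load CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def profile(rows):
    """Count missing values (empty after stripping whitespace) per column."""
    missing = {}
    for row in rows:
        for column, value in row.items():
            if not (value or "").strip():
                missing[column] = missing.get(column, 0) + 1
    return missing


def clean(rows):
    """Strip surrounding whitespace and drop exact duplicate rows,
    keeping the first occurrence of each."""
    seen = set()
    cleaned = []
    for row in rows:
        stripped = {k: (v or "").strip() for k, v in row.items()}
        key = tuple(stripped.items())
        if key not in seen:
            seen.add(key)
            cleaned.append(stripped)
    return cleaned


rows = load_rows(RAW_CSV)
print(len(rows))      # number of data rows loaded
print(profile(rows))  # missing-value counts per column
print(clean(rows))    # de-duplicated, whitespace-trimmed rows
```

In practice a library such as pandas would likely replace these helpers, but the same three stages apply whichever tool is chosen.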
I took inspiration from work done on the ConsenSys Academy Basic Training and Blockchain Developer Bootcamp courses, which in turn took a lot of inspiration from many other sources. As part of that work, I personally referenced Adam Pritchard's Markdown Cheatsheet and the Writing on GitHub documentation. I also had to dive into the GitHub Flavored Markdown Spec in order to get some questions answered about specific formatting.
Because I seem to be incapable of remembering basic GitHub commands, I constantly referenced Pushing commits to a remote repository, and I would recommend that anyone working with GitHub work through the Getting started with GitHub documentation. Also, Stack Overflow (embarrassing examples of actual searches here, here, here, and here) was helpful as always.