From b054676caae3a385aa362cc58c560f4254ed1fc6 Mon Sep 17 00:00:00 2001
From: Josh Levy-Kramer
Date: Mon, 28 Dec 2020 12:49:06 +0000
Subject: [PATCH 1/2] Changes inspired by Dr-Irv

---
 README.md | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 68ff68e..d5bca52 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@ Pandas is a package for data manipulation and analytics in Python. It is highly
 
 This opinionated guide, about Pandas and using DataFrames, presents best practices to write code that is more consistent, reliable, maintainable and readable by practitioners. This is mainly aimed at code which is used in production systems and not ad-hoc exploratory work.
 
-*Why do we need this?* Users of Pandas have a divers background (e.g. Data Scientist, Data Engineer, Researcher, Software Engineer) and language experience (e.g. SQL, MATLAB, Java)
+*Why do we need this?* Users of Pandas have a diverse background (e.g. Data Scientist, Data Engineer, Researcher, Software Engineer) and language experience (e.g. SQL, MATLAB, Java)
 which can lead to various coding style.
 
 # Column selection
@@ -22,6 +22,28 @@ Why:
 * It makes it more explict to the user that you are accessing a column and not a standard property or method
 * Not all column names can be represented as a property - it must be a valid Python variable name. The column name also must not clash with an existing method or property
 
+# Avoid chained indexing
+
+```python
+# Good
+df.loc[df['A'] > 1, 'B'] = 1
+
+# Bad (chained indexing)
+df[df['A'] > 1]['B'] = 1
+```
+
+Indexing or slicing a dataframe twice can lead to unintended behaviour. In the second "chained indexing" example the column `B` **does not** get set with value `1`. A warning stating that "A value is trying to be set on a copy of a slice from a DataFrame." is also shown by Pandas.
+
+In these circumstances use the `loc` and `iloc` methods to select both rows and columns.
+
+Sometimes chained indexing is unavoidable. In these circumstances explicitly copy the dataframe. For example:
+
+```python
+df2 = df[df['A'] > 1].copy()
+df2['B'] = 1
+df2['C'] = 2
+```
+
 # Copy vs re-assignment
 
 ```python
@@ -81,7 +103,7 @@ Also, don't deduplicate rows after a merge to remove merge duplication. Remove d
 
 ```python
 # Good
-df['new_col_float'] = np.nan
+df['new_col_float'] = pd.NA
 df['new_col_int'] = pd.Series(dtype='int')
 df['new_col_str'] = pd.Series(dtype='object')
 
@@ -90,7 +112,7 @@ df['new_col_int'] = 0
 df['new_col_str'] = ''
 ```
 
-If a new empty column is needed always use NaN values. Never use "filler" values such as zeros or empty strings. This preserves the ability to use methods such as `isnull` or `notnull`.
+If a new empty column is needed always use `pd.NA` values. Never use "filler" values such as zeros or empty strings. This preserves the ability to use methods such as `isnull` or `notnull`.
 
 # Querying data
 
@@ -102,7 +124,7 @@ df[df['A'] > df['B']]
 df.query('A > B')
 ```
 
-Use idiomatic Pandas querying instead of SQL like string querying.
+Use idiomatic Pandas querying instead of SQL like string querying. This provides a more consistent use of the Pandas API.
 
 # Mutability of DataFrames
 

From 0bb0380c81c6eb8aa788480e35bf98804c31d9b3 Mon Sep 17 00:00:00 2001
From: Josh Levy-Kramer
Date: Mon, 28 Dec 2020 12:50:08 +0000
Subject: [PATCH 2/2] Changes inspired by Dr-Irv

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index d5bca52..ed643b7 100644
--- a/README.md
+++ b/README.md
@@ -124,7 +124,7 @@ df[df['A'] > df['B']]
 df.query('A > B')
 ```
 
-Use idiomatic Pandas querying instead of SQL like string querying. This provides a more consistent use of the Pandas API.
+Use idiomatic Pandas querying instead of SQL like string querying.
 
 # Mutability of DataFrames
 
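
The "Avoid chained indexing" section added in PATCH 1/2 can be checked with a small script. This is a minimal sketch, not part of the patch: the three-row frame and the names `chained_result` and `loc_result` are made up for illustration, and the warning is silenced only to keep the output clean.

```python
import warnings

import pandas as pd

# Hypothetical sample data, chosen so that two of three rows match the filter.
df = pd.DataFrame({"A": [1, 2, 3], "B": [0, 0, 0]})

# Chained indexing: the boolean filter returns a new object, so the
# assignment lands on a temporary and df itself is left unchanged.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the chained-assignment warning
    df[df["A"] > 1]["B"] = 1
chained_result = df["B"].tolist()

# Single .loc call: rows and column are selected together, so df is updated.
df.loc[df["A"] > 1, "B"] = 1
loc_result = df["B"].tolist()

# Explicit copy: later assignments affect only df2, never df.
df2 = df[df["A"] > 1].copy()
df2["C"] = 2
```

After running this, `chained_result` still holds the original zeros while `loc_result` reflects the update, and `df` has no column `C`, which matches the behaviour the new README section describes.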