A Python package providing enhanced pandas functionality with async support and optimized operations.
- Asynchronous DataFrame Loading: Load large CSV and Parquet files efficiently using async I/O
- Optimized DataFrame Merging: Merge large DataFrames using chunking or Dask
- Data Type Conversions: Convert between string representations and Python objects
- DataFrame Profiling: Generate detailed profiling reports and comparisons
- Data Transformation: Various utilities for DataFrame transformations
pip install mb_pandas
- Python >= 3.8
- numpy
- pandas
- colorama
Functions for DataFrame transformations and merging operations.
from mb_pandas.transform import merge_chunk, merge_dask, check_null, remove_unnamed, rename_columns
# Merge large DataFrames in chunks
result = merge_chunk(df1, df2, chunksize=10000)
# Merge using Dask for distributed computing
result = merge_dask(df1, df2)
# Check and handle null values
df = check_null('data.csv', fillna=True)
# Remove unnamed columns
df = remove_unnamed(df)
# Rename a column ('labels2' to 'labels')
df = rename_columns(df, 'labels2', 'labels')
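The chunked-merge technique used here can be sketched in plain pandas. This is a minimal illustration of the idea, not the package's actual implementation; `merge_in_chunks` is a hypothetical helper:

```python
import pandas as pd

def merge_in_chunks(left, right, on, chunksize=10000):
    """Merge `right` into `left` one slice at a time to bound peak memory."""
    pieces = []
    for start in range(0, len(left), chunksize):
        chunk = left.iloc[start:start + chunksize]
        pieces.append(chunk.merge(right, on=on))
    return pd.concat(pieces, ignore_index=True)

df1 = pd.DataFrame({'id': range(6), 'x': list('abcdef')})
df2 = pd.DataFrame({'id': range(6), 'y': [v * 2 for v in range(6)]})
result = merge_in_chunks(df1, df2, on='id', chunksize=2)
```

Only one chunk of the left frame is joined at a time, so peak memory grows with `chunksize` rather than with the full merge result.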
Asynchronous DataFrame loading utilities.
from mb_pandas.dfload import load_any_df
# Load any supported file format
df = load_any_df('data.csv')
df = load_any_df('data.parquet')
# Convert string columns to Python objects
df = load_any_df('data.csv', literal_ast_columns=['json_col'])
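Converting string columns to Python objects after loading is typically done with `ast.literal_eval` from the standard library; a standalone sketch of the idea (the column name and data are illustrative):

```python
import ast
import pandas as pd

# A column whose cells are string representations of Python dicts
df = pd.DataFrame({'json_col': ["{'a': 1}", "{'b': 2}"]})

# Parse each string into the Python object it represents
df['json_col'] = df['json_col'].apply(ast.literal_eval)
```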
Asynchronous I/O utilities.
from mb_pandas.aio import read_text, srun
# Read file asynchronously
content = await read_text('file.txt', context_vars={'async': True})
# Run async function synchronously
result = srun(async_function, *args)
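The "run an async function synchronously" pattern that `srun` provides maps onto the standard library's `asyncio.run`; a hypothetical reimplementation to show the idea (not the package's own code):

```python
import asyncio

async def fetch_value(x):
    await asyncio.sleep(0)  # stand-in for real async I/O
    return x * 2

def srun(coro_func, *args, **kwargs):
    """Drive an async function to completion from synchronous code."""
    return asyncio.run(coro_func(*args, **kwargs))

result = srun(fetch_value, 21)
```

This is convenient at the top level of scripts; inside an already-running event loop you would `await` the coroutine instead.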
Data type conversion utilities.
from mb_pandas.convert_data import convert_string_to_list, convert_string_to_dict, convert_string_to_type
# Convert string representations to lists
df = convert_string_to_list(df, 'list_column')
# Convert string representations to dictionaries
df = convert_string_to_dict(df, 'dict_column')
# Convert strings to specific types
df = convert_string_to_type(df, 'number_column', int)
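These conversions can also be expressed directly in pandas and the standard library, which is roughly what such helpers wrap (an illustrative sketch; the column names are made up):

```python
import ast
import pandas as pd

df = pd.DataFrame({
    'list_column': ['[1, 2]', '[3, 4]'],
    'number_column': ['10', '20'],
})

# String representation -> actual Python list
df['list_column'] = df['list_column'].apply(ast.literal_eval)

# Numeric strings -> numbers (pd.to_numeric can also coerce bad values)
df['number_column'] = pd.to_numeric(df['number_column'])
```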
DataFrame profiling and comparison utilities.
from mb_pandas.profiler import create_profile, profile_compare
# Generate profiling report
create_profile(df, 'report.html', target=['target_column'])
# Compare two DataFrames
profile_compare(df1, df2, 'comparison.html')
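A bare-bones column summary of the kind such reports aggregate can be produced with pandas alone. This sketch is not the package's report format; `quick_profile` is a hypothetical helper:

```python
import pandas as pd

def quick_profile(df):
    """One row per column: dtype, null count, and distinct-value count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'nulls': df.isna().sum(),
        'unique': df.nunique(),
    })

df = pd.DataFrame({'a': [1, 2, None], 'b': ['x', 'x', 'y']})
report = quick_profile(df)
```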
Merge two DataFrames in chunks to handle large datasets efficiently.
Merge two DataFrames using Dask for improved performance with large datasets.
Load DataFrames from various file formats with progress tracking.
Convert string representations of lists in a DataFrame column to actual lists.
Generate a detailed profiling report for a DataFrame.
All functions include comprehensive error handling with descriptive messages:
try:
    df = load_any_df('data.csv')
except ValueError as e:
    print(f"Error loading file: {e}")
Most functions accept an optional logger parameter for operation tracking:
import logging
logger = logging.getLogger()
df = load_any_df('data.csv', logger=logger)
- Use `merge_chunk` for large DataFrame merges that fit in memory
- Use `merge_dask` for very large datasets that benefit from distributed computing
- Enable `show_progress=True` to monitor long-running operations
- Use `minimal=True` in profiling for large datasets
- Consider sampling large datasets before profiling
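Sampling before profiling is one line in pandas; a sketch with a fixed seed so the sample is reproducible:

```python
import pandas as pd

df = pd.DataFrame({'value': range(1_000_000)})

# Profile a 1% sample instead of the full frame
sample = df.sample(frac=0.01, random_state=0)
```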