A comprehensive guide to NumPy functions and operations focused on data analysis and scientific computing workflows.
To get started with NumPy locally, you can follow these steps to set up your environment and clone the repository.
First, ensure you have Python installed on your system. You can download Python from the official website.
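To confirm the installation, you can check the interpreter version from a terminal (on macOS/Linux the command may be `python3` instead of `python`):

```bash
python --version
```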
Fork the repository to your own GitHub account by visiting the Complete-NumPy-Guide repository page and clicking the "Fork" button in the top-right corner.
Clone your forked repository to your local machine. Open a terminal or command prompt and run:
```bash
git clone https://github.com/Mrnidhi/Complete-NumPy-Guide.git
cd Complete-NumPy-Guide
```

(If you forked the repository, replace `Mrnidhi` with your GitHub username in the URL.)

Creating a virtual environment is good practice for managing each project's dependencies. Run the following command:

```bash
python -m venv numpy_env
```

Activate the virtual environment:
- On Windows:

  ```bash
  numpy_env\Scripts\activate
  ```

- On macOS/Linux:

  ```bash
  source numpy_env/bin/activate
  ```
To deactivate the virtual environment, run:
- On Windows:

  ```bash
  numpy_env\Scripts\deactivate.bat
  ```

- On macOS/Linux:

  ```bash
  deactivate
  ```
With the virtual environment activated, install NumPy and Jupyter:
```bash
pip install numpy jupyter
```

You can use your favorite code editor, such as Visual Studio Code or PyCharm. Open the cloned repository folder in your editor.
Create a new Jupyter Notebook file in your code editor:
- In Visual Studio Code, click the "New File" icon or press `Ctrl+N`, then save the file with a `.ipynb` extension.
- In PyCharm, right-click on the project folder, select "New", and then "Jupyter Notebook".
- If these options don't work, or you are using an editor that doesn't support Jupyter Notebooks, run the following command in your terminal:

```bash
jupyter notebook
```

This will open Jupyter Notebook in your web browser.
If you prefer not to set up things locally, you can use Google Colab, which allows you to run Python code in your browser without any setup.
Go to Google Colab and start a new notebook. You can start using NumPy immediately by importing it:
```python
import numpy as np
```

NumPy forms the foundation of the Python data science ecosystem:
- Performance: Vectorized operations are typically 10-100x faster than Python loops (see the timing sketch below)
- Memory efficiency: Contiguous memory storage with precise data typing
- Scientific capabilities: Linear algebra, Fourier transforms, random sampling
- Ecosystem integration: Core library for pandas, scikit-learn, TensorFlow, etc.
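As a rough illustration of the performance point above, here is a minimal timing sketch; the absolute numbers depend entirely on your machine, and the one-million-element array size is an arbitrary choice:

```python
import timeit

import numpy as np

data = np.arange(1_000_000, dtype=np.float64)

def python_sum(values):
    """Sum elements with an explicit Python loop."""
    total = 0.0
    for v in values:
        total += v
    return total

loop_time = timeit.timeit(lambda: python_sum(data), number=10)
vec_time = timeit.timeit(lambda: np.sum(data), number=10)
print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.3f}s")
```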
```python
import numpy as np

# From Python lists
arr = np.array([1, 2, 3, 4, 5]) # 1D array
arr_2d = np.array([[1, 2, 3], [4, 5, 6]]) # 2D array
# Specialized arrays
zeros = np.zeros((3, 4)) # Array of zeros
ones = np.ones((2, 3)) # Array of ones
identity = np.eye(3) # 3x3 identity matrix
range_arr = np.arange(0, 10, 2) # Values from 0 up to (not including) 10, step 2: [0, 2, 4, 6, 8]
linear_space = np.linspace(0, 1, 5) # 5 evenly spaced values
random_arr = np.random.rand(3, 3) # Random values in [0, 1)
```

```python
# Reshaping
arr = np.arange(12)
arr_reshaped = arr.reshape(3, 4) # Reshape to 3x4 array
arr_transposed = arr_reshaped.T # Transpose
arr_flattened = arr_reshaped.flatten() # Flatten to 1D (returns copy)
arr_raveled = arr_reshaped.ravel() # Flatten to 1D (returns view when possible)
# Adding/removing dimensions
expanded = np.expand_dims(arr, axis=0) # Add new axis at position 0
squeezed = np.squeeze(expanded) # Remove single-dimensional entries
# Stacking arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
vertical = np.vstack([a, b]) # Stack vertically [[1,2,3], [4,5,6]]
horizontal = np.hstack([a, b]) # Stack horizontally [1,2,3,4,5,6]
```

```python
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
# Basic indexing
arr[0, 0] # First element (1)
arr[1, -1] # Last element of second row (8)
# Slicing [start:stop:step]
arr[:2, 1:3] # First two rows, second and third columns
arr[::2, ::2] # Every 2nd row and every 2nd column
# Boolean indexing
mask = arr > 5
filtered = arr[mask] # Elements where condition is True, returns 1D array
# Fancy indexing
arr[[0, 2], [1, 3]] # Elements at positions (0,1) and (2,3)
```

Broadcasting allows operations between arrays of different shapes:

```python
# Add scalar to array
arr = np.array([1, 2, 3])
arr + 5 # [6, 7, 8]
# Operations between arrays of different shapes
a = np.array([[1, 2, 3]]) # Shape (1, 3)
b = np.array([[1], [2], [3]]) # Shape (3, 1)
a + b # Shape (3, 3) through broadcasting
```
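The underlying rule: shapes are compared from the trailing dimension, and each pair of dimensions must be equal or one of them must be 1. If your NumPy is 1.20 or newer, `np.broadcast_shapes` lets you check compatibility without allocating any arrays:

```python
import numpy as np

# Compatible: each trailing dimension pair is equal or contains a 1
print(np.broadcast_shapes((3, 1), (1, 4)))  # (3, 4)
print(np.broadcast_shapes((2, 3), (3,)))    # (2, 3)

# Incompatible shapes raise a ValueError
try:
    np.broadcast_shapes((2, 3), (4,))
except ValueError as e:
    print("cannot broadcast:", e)
```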
```python
data = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Descriptive statistics
np.min(data) # Minimum value
np.max(data) # Maximum value
np.mean(data) # Mean (average)
np.median(data) # Median
np.std(data, ddof=1) # Sample standard deviation (the default ddof=0 gives the population value)
np.var(data, ddof=1) # Sample variance (again, the default is ddof=0)
np.percentile(data, 75) # 75th percentile
# Multi-dimensional statistics
np.mean(matrix, axis=0) # Mean of each column
np.sum(matrix, axis=1) # Sum of each row
# Finding extrema
np.argmin(data) # Index of minimum value
np.argmax(data) # Index of maximum value
```

```python
# Handling missing values
data = np.array([1, 2, np.nan, 4, 5])
np.isnan(data) # Boolean mask for NaN values
np.nanmean(data) # Mean ignoring NaN values
np.nanstd(data) # Standard deviation ignoring NaN values
# Cumulative operations (NaNs propagate through these; use np.nancumsum/np.nancumprod to skip them)
np.cumsum(data) # Cumulative sum
np.cumprod(np.array([1, 2, 3])) # Cumulative product
# Sorting
sorted_data = np.sort(data) # Sort array (NaNs are sorted to the end)
sort_indices = np.argsort(data) # Get indices that would sort array
# Unique values
unique_values = np.unique(np.array([1, 2, 2, 3, 3, 3])) # [1, 2, 3]
```

```python
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
v = np.array([1, 2])
# Matrix operations
np.dot(a, b) # Matrix multiplication
a @ b # Matrix multiplication (Python 3.5+)
np.dot(a, v) # Matrix-vector multiplication
# Matrix decompositions
eigenvalues, eigenvectors = np.linalg.eig(a)
u, s, vh = np.linalg.svd(a) # Singular Value Decomposition
# Solving linear systems
x = np.linalg.solve(a, v) # Solve for x in a·x = v
# Matrix properties
np.linalg.det(a) # Determinant
np.trace(a) # Trace (sum of diagonal elements)
np.linalg.matrix_rank(a) # Rank
np.linalg.inv(a) # Inverse
np.linalg.norm(a, ord='fro') # Frobenius norm
```

```python
# Set seed for reproducibility
np.random.seed(42)
# NumPy 1.17+ style (preferred)
rng = np.random.default_rng(42)
```

```python
# Basic distributions
np.random.rand(3, 3) # Uniform distribution [0,1)
np.random.randn(3, 3) # Standard normal distribution
np.random.randint(0, 10, size=(3, 3)) # Random integers from 0 to 9
# Common statistical distributions
np.random.normal(loc=0, scale=1, size=1000) # Normal/Gaussian
np.random.binomial(n=10, p=0.5, size=1000) # Binomial
np.random.poisson(lam=5, size=1000) # Poisson
# Sampling from data
data = np.array([10, 20, 30, 40, 50])
np.random.choice(data, size=10, replace=True) # Sample with replacement
np.random.choice(data, size=3, replace=False, p=[0.1, 0.2, 0.4, 0.2, 0.1]) # Weighted sample without replacement
```
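The block above uses the legacy global `np.random` functions. As a sketch of the equivalent draws with the Generator created earlier (`rng`), where, for example, `integers` replaces `randint`:

```python
import numpy as np

rng = np.random.default_rng(42)

rng.random((3, 3))                     # Uniform [0, 1); replaces np.random.rand
rng.standard_normal((3, 3))            # Standard normal; replaces np.random.randn
rng.integers(0, 10, size=(3, 3))       # Random integers 0 to 9; replaces np.random.randint
rng.normal(loc=0, scale=1, size=1000)  # Normal/Gaussian
rng.choice(np.array([10, 20, 30, 40, 50]), size=3, replace=False)
```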
```python
# Generate sample data with outliers
data = np.random.normal(0, 1, 1000)
data[0] = 10 # Insert an outlier
# Z-score method for outlier detection
z_scores = (data - np.mean(data)) / np.std(data)
outliers = np.abs(z_scores) > 3
# Remove outliers
clean_data = data[~outliers]
# IQR method (alternative approach)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
iqr_outliers = (data < lower_bound) | (data > upper_bound)
```

```python
# Sample multi-feature dataset
X = np.random.randn(100, 4) * np.array([1, 10, 100, 1000])
# Min-Max scaling [0, 1]
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_minmax = (X - X_min) / (X_max - X_min)
# Standardization (Z-score normalization)
X_mean = X.mean(axis=0)
X_std = X.std(axis=0, ddof=1)
X_standardized = (X - X_mean) / X_std
```

```python
# Generate time series data
time_series = np.cumsum(np.random.normal(0, 1, 1000))
# Rolling window functions
def rolling_window(a, window):
    """Create rolling window views into array a."""
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
# Calculate moving average (3 different methods)
window_size = 5
# Method 1: Convolution
weights = np.ones(window_size) / window_size
moving_avg1 = np.convolve(time_series, weights, mode='valid')
# Method 2: Rolling window
windows = rolling_window(time_series, window_size)
moving_avg2 = np.mean(windows, axis=1)
# Method 3: Using pandas (if available)
# import pandas as pd
# moving_avg3 = pd.Series(time_series).rolling(window_size).mean().values[window_size-1:]
```
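On NumPy 1.20+, `sliding_window_view` is a safer alternative to the hand-rolled `as_strided` helper above: it validates the window size and returns a read-only view. Continuing with `time_series` and `window_size` from above:

```python
from numpy.lib.stride_tricks import sliding_window_view

# Shape (N - window_size + 1, window_size); same result as methods 1 and 2
windows = sliding_window_view(time_series, window_size)
moving_avg4 = windows.mean(axis=1)
```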
```python
# Generate correlated data
n_samples = 100
t = np.linspace(0, 2 * np.pi, n_samples)
x = np.sin(t) + 0.1 * np.random.randn(n_samples)
y = np.cos(t) + 0.1 * np.random.randn(n_samples)
data = np.column_stack((x, y))
# Step 1: Center the data
data_centered = data - np.mean(data, axis=0)
# Step 2: Compute covariance matrix
cov_matrix = np.cov(data_centered, rowvar=False)
# Step 3: Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
eigenvalues = eigenvalues[::-1] # Sort in descending order
eigenvectors = eigenvectors[:, ::-1] # Reorder eigenvectors accordingly
# Step 4: Project data onto principal components
pca_data = data_centered @ eigenvectors
# Step 5: Reconstruction from first principal component only
reconstructed = pca_data[:, 0:1] @ eigenvectors[:, 0:1].T + np.mean(data, axis=0)
```
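A standard follow-up diagnostic, not part of the steps above but falling straight out of the eigendecomposition, is the fraction of variance each component explains. Continuing with the `eigenvalues` sorted in descending order:

```python
explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio)           # Fraction of total variance per component
print(explained_ratio.cumsum())  # Cumulative variance captured
```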
NumPy arrays are stored in contiguous memory blocks, and understanding memory layout can greatly improve performance:

```python
# C-contiguous vs Fortran-contiguous
c_array = np.zeros((1000, 1000), order='C') # Row-major (C-style)
f_array = np.zeros((1000, 1000), order='F') # Column-major (Fortran-style)
# Check memory layout
c_array.flags['C_CONTIGUOUS'] # True
f_array.flags['F_CONTIGUOUS'] # True
# Choose appropriate layout based on access patterns
# - Row-wise operations: use C order
# - Column-wise operations: use F order
```
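To see the effect, a hedged timing sketch; the gap varies with array size and CPU cache, so treat the numbers as indicative only:

```python
import timeit

import numpy as np

c_array = np.zeros((1000, 1000), order='C')
f_array = np.zeros((1000, 1000), order='F')

# Row-wise reduction: contiguous reads for C order, strided reads for F order
t_c = timeit.timeit(lambda: c_array.sum(axis=1), number=100)
t_f = timeit.timeit(lambda: f_array.sum(axis=1), number=100)
print(f"C order: {t_c:.3f}s  F order: {t_f:.3f}s")
```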
Always prefer vectorized operations over loops:

```python
# Slow: using loops
def slow_euclidean(x, y):
    result = 0
    for i in range(len(x)):
        result += (x[i] - y[i]) ** 2
    return np.sqrt(result)
# Fast: vectorized
def fast_euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

# Even faster: using built-in norm
def fastest_euclidean(x, y):
    return np.linalg.norm(x - y)
```

```python
# Memory-mapped arrays for large datasets
mmap_array = np.memmap('large_data.npy', dtype='float32', mode='w+', shape=(10000, 10000))
# Work with small sections at a time
for i in range(0, 10000, 100):
    mmap_array[i:i+100] = np.random.random((100, 10000))
# Flush changes to disk and close
mmap_array.flush()
del mmap_array
```
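To read the data back later, reopen the file with the same dtype and shape. Note that `np.memmap` stores raw bytes with no `.npy` header, so this metadata is not recorded in the file and must be supplied again:

```python
# Reopen the raw buffer read-only; dtype and shape must match what was written
readonly = np.memmap('large_data.npy', dtype='float32', mode='r', shape=(10000, 10000))
print(readonly[0, :5])
```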
Understanding when NumPy returns views vs copies can prevent unexpected behavior:

```python
# Views (changes to one affect the other)
a = np.arange(10)
b = a[2:7] # b is a view of a
b[0] = 99 # This changes a[2] as well
# Copies (independent)
a = np.arange(10)
b = a[2:7].copy() # Explicit copy
b[0] = 99 # This doesn't affect a
# Note: Advanced indexing always returns copies
x = np.arange(10)
y = x[[0, 1, 2]] # y is a copy, not a view
```
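When in doubt, `np.shares_memory` reports whether two arrays overlap in memory:

```python
a = np.arange(10)
print(np.shares_memory(a, a[2:7]))        # True: basic slicing returns a view
print(np.shares_memory(a, a[[0, 1, 2]]))  # False: fancy indexing returns a copy
```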
Helpful resources for going deeper:

- Official NumPy Documentation
- From Python to NumPy
- 100 NumPy Exercises
- NumPy Illustrated: The Visual Guide to NumPy
This project is licensed under the MIT License - see the LICENSE file for details.