NumPy (Numerical Python): a library for scientific computing in Python
Chapter# 1: Introduction to NumPy
1. Introduction
1.1 What is NumPy?
- NumPy (Numerical Python) is a fundamental library for scientific computing in Python.
- It provides support for multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
- It is the foundation for many other libraries in the Python data science ecosystem, such as Pandas, SciPy, Scikit-learn, and TensorFlow.
1.2 Installing and Importing NumPy
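Installation: If you don’t have NumPy installed, you can install it using pip:
pip install numpy
Importing: By convention, NumPy is imported under the alias np, which the examples below assume:
import numpy as np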
2. NumPy Arrays
2.1 Creating Arrays
NumPy arrays are the core data structure in NumPy. Here’s how you can create them:
- From a Python list:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr) # [1 2 3 4 5]
- np.zeros(): Creates an array filled with zeros.
zeros_arr = np.zeros(5) # 1D array with 5 zeros
print(zeros_arr) # [0. 0. 0. 0. 0.]
- np.ones(): Creates an array filled with ones.
ones_arr = np.ones((3, 3)) # 2D array (3x3) with ones
print(ones_arr)
# Output
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
- np.arange(): Creates an array of evenly spaced values over a half-open range (the stop value is excluded).
range_arr = np.arange(0, 10, 2) # Start, Stop, Step
print(range_arr) # [0 2 4 6 8]
- np.linspace(): Creates an array with a specified number of evenly spaced values, including both endpoints by default.
linspace_arr = np.linspace(0, 1, 5) # Start, Stop, Number of points
print(linspace_arr) # [0. 0.25 0.5 0.75 1. ]
2.2 Array Attributes
NumPy arrays have several attributes that provide useful information:
- shape: Returns the dimensions of the array.
- dtype: Returns the data type of the array elements.
- size: Returns the total number of elements in the array.
- ndim: Returns the number of dimensions (axes) of the array.
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Shape:", arr.shape) # (2, 3)
print("Data type:", arr.dtype) # int64 on most platforms (int32 on Windows with NumPy 1.x)
print("Size:", arr.size) # 6
print("Number of dimensions:", arr.ndim) # 2
3. Array Indexing and Slicing
Indexing: Accessing individual elements of an array.
arr = np.array([1, 2, 3, 4, 5])
print(arr[0]) # First element: 1
print(arr[-1]) # Last element: 5
Slicing: Accessing a subset of an array.
print(arr[1:4]) # Elements from index 1 to 3: [2, 3, 4]
print(arr[:3]) # Elements from start to index 2: [1, 2, 3]
print(arr[::2]) # Every second element: [1, 3, 5]
Boolean Indexing: Filtering elements using a boolean condition.
arr = np.array([1, 2, 3, 4, 5])
print(arr[arr > 3]) # Elements greater than 3: [4, 5]
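The same indexing and slicing rules extend to multi-dimensional arrays, with one index or slice per axis separated by commas. A minimal sketch, assuming a small 2D array called grid (a name introduced here only for illustration):

```python
import numpy as np

# Hypothetical 3x4 array used only for illustration:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
grid = np.arange(12).reshape(3, 4)

element = grid[1, 2]       # row 1, column 2
first_row = grid[0]        # the whole first row
second_col = grid[:, 1]    # the whole second column
block = grid[:2, 1:3]      # rows 0-1, columns 1-2

print(element)      # 6
print(first_row)    # [0 1 2 3]
print(second_col)   # [1 5 9]
print(block)        # rows [1 2] and [5 6]
```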
4. Basic Operations
Arithmetic Operations: Element-wise addition, subtraction, multiplication, and division.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a + b) # [5, 7, 9]
print(a * b) # [4, 10, 18]
Aggregation Functions: Functions like sum, mean, min, max, etc.
arr = np.array([1, 2, 3, 4, 5])
print(np.sum(arr)) # 15
print(np.mean(arr, axis=0)) # 3.0 (axis=0 is the only axis of a 1D array)
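The axis argument matters most for arrays with two or more dimensions. The sketch below, which assumes a small illustrative matrix named m, shows that axis=0 aggregates down the rows (one result per column) and axis=1 aggregates across the columns (one result per row):

```python
import numpy as np

# Hypothetical 2x3 matrix used only for illustration
m = np.array([[1, 2, 3],
              [4, 5, 6]])

total = np.sum(m)                # all elements
col_sums = np.sum(m, axis=0)     # collapse the rows: one value per column
row_means = np.mean(m, axis=1)   # collapse the columns: one value per row

print(total)      # 21
print(col_sums)   # [5 7 9]
print(row_means)  # [2. 5.]
```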
Chapter# 2. Intermediate NumPy
2.1 Broadcasting
What is Broadcasting?
- Broadcasting allows NumPy to perform arithmetic operations on arrays of different shapes.
- It automatically “stretches” smaller arrays to match the shape of larger arrays, without actually copying data.
Rules of Broadcasting:
- If the arrays have different numbers of dimensions, the shape of the one with fewer dimensions is padded with ones on its left side.
- If the sizes differ in a dimension and one of them is 1, the array with size 1 in that dimension is stretched to match the other.
- If the sizes differ in a dimension and neither is 1, an error is raised.
Example:
a = np.array([1, 2, 3]) # Shape: (3,)
b = np.array([[10], [20]]) # Shape: (2, 1)
print(a + b)
# Output
[[11 12 13]
[21 22 23]]
Here, a is first treated as shape (1, 3) and stretched to (2, 3), while b is stretched from (2, 1) to (2, 3), before the addition.
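The three rules can be checked directly. The following is a small sketch rather than a complete treatment: it inspects the broadcast shape, shows that the stretch involves no copying (a stride of 0 means the same memory is reused), and triggers the error case. The variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])        # shape (3,)
b = np.array([[10], [20]])     # shape (2, 1)

# Rules 1 and 2: (3,) is treated as (1, 3), then size-1 axes are stretched
result_shape = np.broadcast(a, b).shape
print(result_shape)            # (2, 3)

# The stretch is virtual: a stride of 0 means every row reuses the same memory
stretched = np.broadcast_to(a, (2, 3))
print(stretched.strides[0])    # 0

# Rule 3: sizes 3 and 2 differ and neither is 1, so NumPy raises ValueError
try:
    np.array([1, 2, 3]) + np.array([1, 2])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)         # True
```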
2.2 Advanced Indexing
Integer Array Indexing:
Use integer arrays to index into another array.
arr = np.array([10, 20, 30, 40, 50])
indices = np.array([0, 2, 4])
print(arr[indices]) # [10, 30, 50]
Fancy Indexing:
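Pass one index array per axis to access multiple elements at once. The index arrays are paired element-wise, so the example below selects arr[0, 1] and arr[2, 0].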
arr = np.array([[1, 2], [3, 4], [5, 6]])
rows = np.array([0, 2])
cols = np.array([1, 0])
print(arr[rows, cols]) # [2, 5]
2.3 Universal Functions (ufuncs)
What are ufuncs?
- Universal functions are functions that operate element-wise on arrays.
- They are highly optimized and written in C, making them very fast.
Common ufuncs:
- Mathematical functions: np.sin, np.cos, np.exp, np.log, etc.
arr = np.array([0, np.pi/2, np.pi])
print(np.sin(arr)) # approximately [0., 1., 0.]; the last value prints as about 1.22e-16 due to floating-point rounding
- Comparison functions: np.greater, np.less, np.equal, etc.
a = np.array([1, 2, 3])
b = np.array([2, 2, 2])
print(np.greater(a, b)) # [False, False, True]
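- Custom ufuncs: You can wrap a Python function with np.frompyfunc (which returns a true ufunc that produces object arrays) or np.vectorize (a convenience wrapper). Both still call the Python function once per element, so they do not offer the speed of the built-in ufuncs.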
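As a rough illustration of both wrappers, the sketch below wraps a made-up scalar function, clip_add (hypothetical, not part of NumPy), and compares the results. Note that np.frompyfunc takes the number of inputs and outputs as its second and third arguments.

```python
import numpy as np

def clip_add(x, y):
    """Hypothetical scalar function: add two numbers, capping the result at 10."""
    return min(x + y, 10)

# np.frompyfunc(func, nin, nout) returns a ufunc whose output has dtype=object
clip_add_ufunc = np.frompyfunc(clip_add, 2, 1)

# np.vectorize wraps the same function and lets us request an output dtype
clip_add_vec = np.vectorize(clip_add, otypes=[int])

a = np.array([1, 5, 9])
b = np.array([2, 5, 9])

obj_result = clip_add_ufunc(a, b)
int_result = clip_add_vec(a, b)

print(obj_result.dtype, obj_result)   # object [3 10 10]
print(int_result)                     # [ 3 10 10]
```

Because the object-dtype result is awkward for further numeric work, a call such as obj_result.astype(int) is usually needed after np.frompyfunc.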
2.4 Matrix Operations
Matrix Multiplication:
Use np.dot or the @ operator for matrix multiplication.
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.dot(a, b)) # or a @ b
# Output
[[19 22]
[43 50]]
Matrix Inversion:
Use np.linalg.inv to compute the inverse of a matrix.
a = np.array([[1, 2], [3, 4]])
inv_a = np.linalg.inv(a)
print(inv_a)
# Output
[[-2. 1. ]
[ 1.5 -0.5]]
Determinant:
Use np.linalg.det to compute the determinant of a matrix.
det_a = np.linalg.det(a)
print(det_a) # approximately -2.0 (floating-point rounding may show -2.0000000000000004)
Eigenvalues and Eigenvectors:
Use np.linalg.eig to compute eigenvalues and eigenvectors. Column i of the returned eigenvector matrix corresponds to eigenvalue i.
eigenvalues, eigenvectors = np.linalg.eig(a)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:", eigenvectors)
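One way to sanity-check the result is to verify the defining relation A v = λ v for each pair. A minimal sketch, reusing the same 2x2 matrix:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
eigenvalues, eigenvectors = np.linalg.eig(a)

# Column i of `eigenvectors` pairs with eigenvalues[i]
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    lam = eigenvalues[i]
    print(np.allclose(a @ v, lam * v))   # True for each pair

# The same check for all pairs at once: A V = V diag(lambda).
# Broadcasting scales column i of `eigenvectors` by eigenvalues[i].
all_ok = np.allclose(a @ eigenvectors, eigenvectors * eigenvalues)
print(all_ok)                            # True
```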
2.5 Random Module
Generating Random Numbers:
- np.random.rand: Samples from a uniform distribution over [0, 1).
print(np.random.rand(3)) # e.g. [0.123 0.456 0.789] (values vary per run)
- np.random.randn: Samples from the standard normal distribution (mean=0, variance=1).
print(np.random.randn(3)) # e.g. [-0.123 0.456 -0.789]
- np.random.randint: Random integers from a half-open range (the upper bound is excluded).
print(np.random.randint(0, 10, size=5)) # e.g. [3 7 2 8 1]
Random Sampling:
- np.random.choice: Randomly samples from a given array (with replacement by default).
arr = np.array([1, 2, 3, 4, 5])
print(np.random.choice(arr, size=3)) # e.g. [2 5 1]
- np.random.shuffle: Shuffles an array in place.
np.random.shuffle(arr)
print(arr) # e.g. [3 1 5 2 4]
Setting Random Seeds:
Use np.random.seed to ensure reproducibility.
np.random.seed(42)
print(np.random.rand(3)) # [0.3745, 0.9507, 0.7320]
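Newer NumPy code often uses the Generator interface (np.random.default_rng, available since NumPy 1.17) instead of the global functions above. The sketch below mirrors the earlier calls with a seeded generator. Its output is deliberately not shown, since the values differ from those of np.random.seed with the same seed.

```python
import numpy as np

rng = np.random.default_rng(42)       # seeded Generator with its own state
uniform = rng.random(3)               # uniform on [0, 1)
normal = rng.standard_normal(3)       # standard normal
ints = rng.integers(0, 10, size=5)    # integers in [0, 10)
picked = rng.choice(np.array([1, 2, 3, 4, 5]), size=3)

# Re-creating the generator with the same seed reproduces the same stream
rng_again = np.random.default_rng(42)
same = np.array_equal(uniform, rng_again.random(3))
print(same)                           # True
```

Keeping the state in an explicit rng object avoids hidden coupling between different parts of a program that would otherwise share the global seed.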
Chapter# 3. Advanced NumPy
3.1 Structured Arrays
What are Structured Arrays?
- Structured arrays allow you to store heterogeneous data (e.g., integers, floats, strings) in a single NumPy array.
- Each element of the array is a structure (similar to a row in a table).
Creating Structured Arrays:
- Define a custom data type using the dtype argument.
data = np.array([(1, 2.5, 'Hello'), (2, 3.7, 'World')],
dtype=[('id', 'i4'), ('value', 'f4'), ('label', 'U10')])
print(data)
# Output
[(1, 2.5, 'Hello') (2, 3.7, 'World')]
Accessing Fields:
- Use field names to access specific columns.
print(data['id']) # [1, 2]
print(data['value']) # [2.5, 3.7]
print(data['label']) # ['Hello', 'World']
3.2 Memory Management
Views vs Copies:
- A view is a new array object that references the same data as the original array.
- A copy is a new array object with its own copy of the data.
arr = np.array([1, 2, 3, 4])
view = arr[1:3] # View (references original data)
copy = arr[1:3].copy() # Copy (new data)
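The practical difference shows up when you modify the result. In this short sketch, writing to the view changes the original array, while writing to the copy does not. np.shares_memory is used to confirm which is which.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
view = arr[1:3]          # basic slicing returns a view
copy = arr[1:3].copy()   # an explicit copy owns its data

view[0] = 99             # writes through to arr
copy[1] = -1             # leaves arr untouched

print(arr)                           # [ 1 99  3  4]
print(np.shares_memory(arr, view))   # True
print(np.shares_memory(arr, copy))   # False
print(view.base is arr)              # True: the view's data belongs to arr
```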
Memory Layout:
- NumPy arrays can be stored in C-order (row-major) or F-order (column-major).
- Use np.ascontiguousarray or np.asfortranarray to control memory layout.
arr = np.array([[1, 2], [3, 4]], order='C') # C-order (default)
print(arr.flags['C_CONTIGUOUS']) # True
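To see what the two layouts mean in practice, the sketch below converts a C-ordered array to Fortran order and reads both back in memory order. The values are identical; only the order in which they sit in memory differs.

```python
import numpy as np

c_arr = np.array([[1, 2, 3], [4, 5, 6]])   # C-order (row-major) by default
f_arr = np.asfortranarray(c_arr)           # same values, column-major layout

print(c_arr.flags['C_CONTIGUOUS'], c_arr.flags['F_CONTIGUOUS'])  # True False
print(f_arr.flags['C_CONTIGUOUS'], f_arr.flags['F_CONTIGUOUS'])  # False True
print(np.array_equal(c_arr, f_arr))                              # True

# ravel(order='K') reads elements in the order they occur in memory
print(c_arr.ravel(order='K'))  # [1 2 3 4 5 6]
print(f_arr.ravel(order='K'))  # [1 4 2 5 3 6]

# A transpose is a view with swapped strides, so it is F-contiguous
print(c_arr.T.flags['F_CONTIGUOUS'])  # True
```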
3.3 Performance Optimization
Vectorization
- Replace explicit loops with vectorized operations for better performance.
# Non-vectorized (slow)
result = []
for i in range(1000):
    result.append(i * 2)
# Vectorized (fast)
result = np.arange(1000) * 2
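Using np.vectorize:
- Convert a Python function written for scalars into one that accepts arrays. np.vectorize is provided mainly for convenience, not performance; internally it is essentially a Python loop.
def my_func(x):
    return x ** 2 + 1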
vectorized_func = np.vectorize(my_func)
print(vectorized_func(np.array([1, 2, 3]))) # [2, 5, 10]
Profiling NumPy Code:
- Use tools like timeit or cProfile to measure performance.
import timeit
setup = "import numpy as np; arr = np.random.rand(1000)"
print(timeit.timeit("arr * 2", setup=setup, number=1000))
3.4 Linear Algebra
Solving Linear Equations:
- Use
np.linalg.solve
to solve systems of linear equations.
A = np.array([[3, 2], [1, 4]])
b = np.array([8, 9])
x = np.linalg.solve(A, b)
print(x) # [1.4 1.9]
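A quick way to confirm a solution is to substitute it back into the system. The sketch below checks that A @ x reproduces b, and compares against the explicit-inverse approach, which gives the same answer here but is generally slower and less numerically robust than np.linalg.solve.

```python
import numpy as np

A = np.array([[3, 2], [1, 4]])
b = np.array([8, 9])
x = np.linalg.solve(A, b)

# Substituting x back in should reproduce the right-hand side
print(np.allclose(A @ x, b))   # True

# Same result via the explicit inverse (shown only for comparison)
x_inv = np.linalg.inv(A) @ b
print(np.allclose(x, x_inv))   # True
```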
Singular Value Decomposition (SVD):
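- Decompose a matrix A into three factors, U, Σ, and V^T, such that A = U Σ V^T. np.linalg.svd returns the singular values as a 1D array and the third factor already transposed.
A = np.array([[1, 2], [3, 4], [5, 6]])
U, S, Vh = np.linalg.svd(A)
print("U:", U)
print("S:", S) # singular values (the diagonal of Σ)
print("Vh:", Vh) # V transposed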
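To confirm the factors really multiply back to the original matrix, the sketch below uses the compact form (full_matrices=False) so the shapes line up without padding. Note that np.linalg.svd returns the third factor already transposed, named Vh here.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])

# Compact SVD: U is 3x2, S holds the 2 singular values, Vh is 2x2
U, S, Vh = np.linalg.svd(A, full_matrices=False)

# Rebuild A from its factors: U @ diag(S) @ Vh
A_rebuilt = U @ np.diag(S) @ Vh

print(U.shape, S.shape, Vh.shape)   # (3, 2) (2,) (2, 2)
print(np.allclose(A, A_rebuilt))    # True
```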
QR Decomposition:
- Decompose a matrix into a matrix Q with orthonormal columns and an upper triangular matrix R.
Q, R = np.linalg.qr(A)
print("Q:", Q)
print("R:", R)
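The properties of the two factors can be verified numerically. This sketch assumes the default reduced mode of np.linalg.qr, in which Q for a 3x2 input is 3x2 with orthonormal columns:

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])
Q, R = np.linalg.qr(A)   # default mode='reduced': Q is 3x2, R is 2x2

print(np.allclose(Q @ R, A))             # True: the factors rebuild A
print(np.allclose(Q.T @ Q, np.eye(2)))   # True: columns of Q are orthonormal
print(np.allclose(R, np.triu(R)))        # True: R is upper triangular
```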
3.5 Masked Arrays
What are Masked Arrays?
- Masked arrays are arrays that have a mask to indicate missing or invalid data.
- Useful for handling incomplete datasets.
Creating Masked Arrays:
- Use np.ma.masked_array to create a masked array.
data = np.array([1, 2, 3, -999, 5])
masked_data = np.ma.masked_array(data, mask=[0, 0, 0, 1, 0])
print(masked_data) # [1, 2, 3, --, 5]
Operations on Masked Arrays:
- Masked arrays support most NumPy operations while ignoring masked values.
print(masked_data.mean()) # 2.75 (ignores masked value)
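Writing the mask by hand does not scale to real datasets. As a sketch of two common shortcuts, np.ma.masked_equal masks a sentinel value and np.ma.masked_invalid masks NaN and infinite entries. The sample arrays are illustrative.

```python
import numpy as np

# Mask every occurrence of a sentinel value
data = np.array([1, 2, 3, -999, 5])
masked_sentinel = np.ma.masked_equal(data, -999)

# Mask NaN (and inf) entries
with_nan = np.array([1.0, np.nan, 3.0])
masked_nan = np.ma.masked_invalid(with_nan)

print(masked_sentinel.mean())      # 2.75
print(masked_sentinel.count())     # 4 unmasked values
print(masked_sentinel.filled(0))   # [1 2 3 0 5]
print(masked_nan.sum())            # 4.0
```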
3.6 File I/O
Saving and Loading Arrays:
- Use np.save and np.load to save and load arrays in .npy format.
arr = np.array([1, 2, 3])
np.save('my_array.npy', arr)
loaded_arr = np.load('my_array.npy')
print(loaded_arr)
- Use np.savetxt and np.loadtxt for text files.
np.savetxt('my_array.txt', arr)
loaded_arr = np.loadtxt('my_array.txt')
print(loaded_arr)
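When several related arrays belong together, np.savez stores them in a single .npz archive under names you choose. The sketch below writes to a temporary directory so it leaves no files behind. The array names features and labels are illustrative.

```python
import os
import tempfile
import numpy as np

features = np.arange(6).reshape(2, 3)   # hypothetical arrays
labels = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.npz")
    np.savez(path, features=features, labels=labels)

    # np.load returns a dict-like archive; close it before the directory is removed
    with np.load(path) as archive:
        names = sorted(archive.files)
        loaded_features = archive["features"]
        loaded_labels = archive["labels"]

print(names)             # ['features', 'labels']
print(loaded_features)
print(loaded_labels)
```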
Chapter# 4. NumPy for Machine Learning
4.1 Data Preprocessing
Normalization and Standardization:
- Normalization scales data to a range of [0, 1].
data = np.array([1, 2, 3, 4, 5])
normalized_data = (data - np.min(data)) / (np.max(data) - np.min(data))
print(normalized_data) # [0., 0.25, 0.5, 0.75, 1.]
- Standardization scales data to have a mean of 0 and a standard deviation of 1.
standardized_data = (data - np.mean(data)) / np.std(data)
print(standardized_data)
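In machine learning the data is usually a 2D matrix with one row per sample and one column per feature, and each feature is scaled independently. A minimal sketch under that assumption follows (the matrix X is made up). A constant column would cause a division by zero and needs separate handling.

```python
import numpy as np

# Hypothetical feature matrix: 4 samples (rows), 2 features (columns)
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Min-max normalization per column (axis=0 aggregates over the samples)
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_norm = (X - col_min) / (col_max - col_min)

# Standardization per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm.min(axis=0), X_norm.max(axis=0))   # [0. 0.] [1. 1.]
print(np.allclose(X_std.mean(axis=0), 0))       # True
print(np.allclose(X_std.std(axis=0), 1))        # True
```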
Handling Missing Values:
Replace missing values (e.g., NaN) with a specific value or an aggregate (e.g., mean).
data = np.array([1, 2, np.nan, 4, 5])
data[np.isnan(data)] = np.nanmean(data) # Replace NaNs with mean
print(data) # [1., 2., 3., 4., 5.]
4.2 Feature Engineering
Creating Polynomial Features:
Generate polynomial features for regression tasks.
from numpy.polynomial.polynomial import polyvander
data = np.array([1, 2, 3])
poly_features = polyvander(data, 2) # Degree 2 polynomial (the parameter is named deg, not degree)
print(poly_features)
# Output
[[1. 1. 1.]
[1. 2. 4.]
[1. 3. 9.]]
One-Hot Encoding:
Convert categorical data into binary vectors.
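categories = np.array(['red', 'blue', 'green'])
# Map each string to an integer code; np.unique sorts the labels: ['blue' 'green' 'red']
labels, codes = np.unique(categories, return_inverse=True)
one_hot = np.eye(len(labels))[codes]
print(one_hot)
# Output (columns follow the sorted labels: blue, green, red)
[[0. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]]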
4.3 Distance Metrics
Euclidean Distance:
- Compute the Euclidean distance between two vectors.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
distance = np.linalg.norm(a - b)
print(distance) # 5.196
Manhattan Distance:
- Compute the Manhattan distance (sum of absolute differences).
distance = np.sum(np.abs(a - b))
print(distance) # 9
Cosine Similarity:
- Compute the cosine of the angle between two vectors.
dot_product = np.dot(a, b)
norm_a = np.linalg.norm(a)
norm_b = np.linalg.norm(b)
cosine_sim = dot_product / (norm_a * norm_b)
print(cosine_sim) # approximately 0.9746
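For more than two vectors, broadcasting can produce the full matrix of pairwise distances without an explicit loop. This is one possible sketch, using three made-up 2D points. It builds an intermediate array of shape (n, n, d), so it suits small to medium n.

```python
import numpy as np

# Hypothetical points, one per row
points = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [6.0, 8.0]])

# diff[i, j] = points[i] - points[j], shape (3, 3, 2)
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]

euclidean = np.linalg.norm(diff, axis=-1)   # shape (3, 3)
manhattan = np.abs(diff).sum(axis=-1)       # shape (3, 3)

print(euclidean[0])   # [ 0.  5. 10.]
print(manhattan[0])   # [ 0.  7. 14.]
```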
4.4 Matrix Factorization
Principal Component Analysis (PCA):
Use SVD to perform PCA for dimensionality reduction.
data = np.array([[1, 2], [3, 4], [5, 6]])
mean = np.mean(data, axis=0)
centered_data = data - mean
U, S, Vh = np.linalg.svd(centered_data)
print("Principal Components:", Vh) # rows of Vh are the principal directions
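To complete the picture, the singular values give the variance explained by each component, and projecting the centered data onto the leading rows of the third SVD factor (named Vh here, since NumPy returns it transposed) reduces the dimensionality. A minimal sketch with the same toy data, whose points all lie on one line:

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])
centered = data - data.mean(axis=0)

U, S, Vh = np.linalg.svd(centered, full_matrices=False)

# Variance explained by each component (sample variance, n - 1 denominator)
explained_variance = S ** 2 / (len(data) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project onto the first principal component (2D -> 1D)
projected = centered @ Vh[0]

print(explained_ratio)   # approximately [1. 0.]: one direction carries all the variance
print(projected)         # magnitudes about [2.83 0. 2.83]; the sign depends on the SVD
```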
4.5 Gradient Calculations
Computing Gradients:
- Use NumPy to compute gradients for optimization (e.g., in gradient descent).
def loss_function(x):
    return x ** 2 + 3 * x + 2
def gradient(x):
    return 2 * x + 3
x = 2.0
print("Loss:", loss_function(x))
print("Gradient:", gradient(x))
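As a sketch of how such a gradient is used, the loop below runs plain gradient descent on the same loss and checks the analytic gradient against a central finite difference. The learning rate, the step count, and the helper numerical_gradient are illustrative choices, not prescribed values.

```python
def loss_function(x):
    return x ** 2 + 3 * x + 2

def gradient(x):
    return 2 * x + 3

def numerical_gradient(f, x, h=1e-6):
    """Hypothetical helper: central-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Check the analytic gradient at the starting point
x = 2.0
print(gradient(x), numerical_gradient(loss_function, x))   # both close to 7.0

# Gradient descent: repeatedly step against the gradient
learning_rate = 0.1   # assumed value
for _ in range(100):
    x = x - learning_rate * gradient(x)

print(x)                  # close to -1.5, where the gradient is zero
print(loss_function(x))   # close to -0.25, the minimum of the loss
```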
4.6 Simulating Data
Generating Synthetic Datasets:
- Create synthetic datasets for testing machine learning models.
# Linear dataset with noise
X = np.linspace(0, 10, 100)
y = 2 * X + 3 + np.random.normal(0, 1, 100)
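A synthetic dataset is useful precisely because the true parameters are known. The sketch below regenerates the data with a seeded generator (an assumption, added for reproducibility) and recovers the slope and intercept with a least-squares fit via np.linalg.lstsq:

```python
import numpy as np

rng = np.random.default_rng(0)            # seeded for reproducibility
X = np.linspace(0, 10, 100)
y = 2 * X + 3 + rng.normal(0, 1, 100)     # true slope 2, true intercept 3

# Design matrix with a column of ones for the intercept term
design = np.column_stack([X, np.ones_like(X)])
(slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)

print(slope, intercept)   # close to 2 and 3, up to noise
```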
Chapter# 5. Integration with Machine Learning Libraries
NumPy and Pandas:
- Convert between NumPy arrays and Pandas DataFrames.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
numpy_array = df.to_numpy()
print(numpy_array)
NumPy and Scikit-Learn:
Use NumPy arrays as input to Scikit-Learn models.
from sklearn.linear_model import LinearRegression
X = np.array([[1], [2], [3]])
y = np.array([2, 4, 6])
model = LinearRegression()
model.fit(X, y)
NumPy and TensorFlow/PyTorch:
Convert between NumPy arrays and TensorFlow/PyTorch tensors.
import tensorflow as tf
numpy_array = np.array([1, 2, 3])
tensor = tf.convert_to_tensor(numpy_array)
print(tensor)
This post is based on interaction with https://chat.deepseek.com.
Happy learning :-)