Tutorial: Working with Batches¶
This tutorial will guide you through the basics of working with batches of data using batcharray.
Introduction¶
In machine learning and data processing, we often work with batches of data - collections of samples
processed together.
batcharray provides convenient utilities to manipulate these batches, whether they're single
arrays or complex nested structures.
Basic Batch Operations¶
Creating a Batch¶
Let's start by creating a simple batch of data:
import numpy as np
from batcharray import array
# Create a batch of 5 samples, each with 3 features
batch = np.array(
[
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
[7.0, 8.0, 9.0],
[10.0, 11.0, 12.0],
[13.0, 14.0, 15.0],
]
)
Slicing Batches¶
You can extract a subset of samples from a batch:
```python continuation
Create a batch of 5 samples, each with 3 features¶
batch = np.array( [ [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0], [13.0, 14.0, 15.0], ] )
Get first 3 samples¶
first_three = array.slice_along_batch(batch, stop=3)
[[1. 2. 3.]¶
[4. 5. 6.]¶
[7. 8. 9.]]¶
Get samples 2-4 (indices 1, 2, 3)¶
middle_samples = array.slice_along_batch(batch, start=1, stop=4)
[[ 4. 5. 6.]¶
[ 7. 8. 9.]¶
[10. 11. 12.]]¶
Get last 2 samples¶
last_two = array.slice_along_batch(batch, start=3)
[[10. 11. 12.]¶
[13. 14. 15.]]¶
### Selecting Specific Samples
Use `index_select_along_batch` to select specific samples by index:
```python continuation
# Select samples at indices 0, 2, and 4
indices = np.array([0, 2, 4])
selected = array.index_select_along_batch(batch, indices=indices)
# [[ 1. 2. 3.]
# [ 7. 8. 9.]
# [13. 14. 15.]]
Splitting Batches¶
Split a batch into multiple smaller batches:
```python continuation
Split into batches of size 2¶
chunks = array.chunk_along_batch(batch, chunks=3)
[array([[1., 2., 3.], [4., 5., 6.]]),¶
array([[ 7., 8., 9.], [10., 11., 12.]]),¶
array([[13., 14., 15.]])]¶
Split at specific sizes¶
splits = array.split_along_batch(batch, split_size_or_sections=[2, 2, 1])
[array([[1., 2., 3.], [4., 5., 6.]]),¶
array([[ 7., 8., 9.], [10., 11., 12.]]),¶
array([[13., 14., 15.]])]¶
## Working with Nested Batches
Real-world data often comes in nested structures - dictionaries with multiple arrays, lists of
arrays, etc.
### Dictionary Batches
```python
import numpy as np
from batcharray import nested
# Create a batch as a dictionary
batch = {
"features": np.array(
[
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
[7.0, 8.0, 9.0],
]
),
"labels": np.array([0, 1, 0]),
"weights": np.array([1.0, 0.8, 1.2]),
}
# Slice all arrays together
train_batch = nested.slice_along_batch(batch, stop=2)
# {
# 'features': array([[1., 2., 3.],
# [4., 5., 6.]]),
# 'labels': array([0, 1]),
# 'weights': array([1. , 0.8])
# }
# Split into train/validation
splits = nested.split_along_batch(batch, split_size_or_sections=[2, 1])
train, val = splits[0], splits[1]
Maintaining Consistency¶
The key advantage of nested operations is that they maintain consistency across all arrays:
import numpy as np
from batcharray import nested
# Shuffle while keeping features and labels aligned
batch = {"features": np.array([[1, 2], [3, 4], [5, 6]]), "labels": np.array([0, 1, 0])}
shuffled = nested.shuffle_along_batch(batch)
# Features and labels are shuffled with the same permutation
Computing Statistics¶
Batch-level Statistics¶
Compute statistics across samples in a batch:
import numpy as np
from batcharray import array
data = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
# Mean across samples (for each feature)
mean_features = array.mean_along_batch(data)
# [4. 5. 6.]
# Maximum value for each feature
max_features = array.amax_along_batch(data)
# [7. 8. 9.]
# Sum across samples
sum_features = array.sum_along_batch(data)
# [12. 15. 18.]
Finding Extremes¶
import numpy as np
from batcharray import array
scores = np.array([[0.2, 0.5, 0.3], [0.1, 0.8, 0.1], [0.6, 0.2, 0.2]])
# Index of maximum value for each feature
max_indices = array.argmax_along_batch(scores)
# [2, 1, 0]
# Actual maximum values
max_values = array.amax_along_batch(scores)
# [0.6, 0.8, 0.3]
Sorting and Ordering¶
Sorting Batches¶
import numpy as np
from batcharray import array
# Unsorted batch
data = np.array([[5, 2], [1, 4], [3, 6]])
# Sort along batch dimension
sorted_data = array.sort_along_batch(data)
# [[1 2]
# [3 4]
# [5 6]]
# Get sorting indices
sort_indices = array.argsort_along_batch(data)
# [[1 0]
# [2 1]
# [0 2]]
Random Shuffling¶
import numpy as np
from batcharray import array
data = np.array([[1, 2], [3, 4], [5, 6]])
# Random shuffle
shuffled = array.shuffle_along_batch(data)
# Order is randomized, e.g.:
# [[5 6]
# [1 2]
# [3 4]]
Combining Batches¶
Concatenation¶
import numpy as np
from batcharray import array
batch1 = np.array([[1, 2], [3, 4]])
batch2 = np.array([[5, 6], [7, 8]])
# Combine batches
combined = array.concatenate_along_batch([batch1, batch2])
# [[1 2]
# [3 4]
# [5 6]
# [7 8]]
Nested Concatenation¶
import numpy as np
from batcharray import nested
batch1 = {"features": np.array([[1, 2], [3, 4]]), "labels": np.array([0, 1])}
batch2 = {"features": np.array([[5, 6]]), "labels": np.array([0])}
combined = nested.concatenate_along_batch([batch1, batch2])
# {
# 'features': array([[1, 2],
# [3, 4],
# [5, 6]]),
# 'labels': array([0, 1, 0])
# }
Working with Missing Data¶
NumPy masked arrays allow you to handle missing or invalid data:
import numpy as np
import numpy.ma as ma
from batcharray import array
# Create data with missing values (marked as masked)
data = ma.array(
[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
mask=[
[False, True, False], # 2nd value missing
[False, False, True], # 3rd value missing
[True, False, False],
], # 1st value missing
)
# Compute mean (ignoring masked values)
mean_vals = array.mean_along_batch(data)
# [2.5, 5.0, 4.5]
# Sort (masked values handled appropriately)
sorted_data = array.sort_along_batch(data)
# [[1.0 5.0 3.0]
# [4.0 8.0 9.0]
# [-- -- --]]
Next Steps¶
- Learn about sequence operations for time-series data
- Explore advanced nested operations
- See computation models for low-level operations
Common Patterns¶
Train/Test Split¶
import numpy as np
from batcharray import nested
# Full dataset
dataset = {
"X": np.random.randn(1000, 784), # MNIST-like
"y": np.random.randint(0, 10, 1000),
}
# 80/20 split
train_size = int(0.8 * 1000)
train_data = nested.slice_along_batch(dataset, stop=train_size)
test_data = nested.slice_along_batch(dataset, start=train_size)
Mini-batch Processing¶
import numpy as np
from batcharray import nested
# Large dataset
dataset = {"X": np.random.randn(1000, 10), "y": np.random.randint(0, 2, 1000)}
# Process in mini-batches
batch_size = 32
num_batches = (1000 + batch_size - 1) // batch_size
for i in range(num_batches):
start = i * batch_size
stop = min((i + 1) * batch_size, 1000)
mini_batch = nested.slice_along_batch(dataset, start=start, stop=stop)
Data Augmentation¶
import numpy as np
from batcharray import nested
batch = {"images": np.random.randn(32, 28, 28), "labels": np.random.randint(0, 10, 32)}
# Shuffle for augmentation
augmented = nested.shuffle_along_batch(batch)
# Select random subset
indices = np.random.choice(32, size=16, replace=False)
subset = nested.index_select_along_batch(augmented, indices=indices)