utils

grizz.utils

Contain utility functions.

grizz.utils.column

Contain DataFrame columns utility functions.

grizz.utils.column.find_common_columns

find_common_columns(
    frame_or_cols: DataFrame | Sequence,
    columns: Sequence[str],
) -> tuple[str, ...]

Find the common columns that are both in the DataFrame and the given columns.

Parameters:

Name Type Description Default
frame_or_cols DataFrame | Sequence

The DataFrame or its columns.

required
columns Sequence[str]

The columns to check.

required

Returns:

Type Description
tuple[str, ...]

The common columns, i.e. the columns that are in both columns and frame_or_cols.

Example usage:

>>> import polars as pl
>>> from grizz.utils.column import find_common_columns
>>> frame = pl.DataFrame(
...     {
...         "col1": [1, 2, 3, 4, 5],
...         "col2": ["1", "2", "3", "4", "5"],
...         "col3": ["a ", " b", "  c  ", "d", "e"],
...     }
... )
>>> cols = find_common_columns(frame, columns=["col1", "col2", "col3", "col4"])
>>> cols
('col1', 'col2', 'col3')

grizz.utils.column.find_missing_columns

find_missing_columns(
    frame_or_cols: DataFrame | Sequence,
    columns: Sequence[str],
) -> tuple[str, ...]

Find the columns that are in the given columns but not in the DataFrame.

Parameters:

Name Type Description Default
frame_or_cols DataFrame | Sequence

The DataFrame or its columns.

required
columns Sequence[str]

The columns to check.

required

Returns:

Type Description
tuple[str, ...]

The tuple of missing columns, i.e. the columns that are in columns but not in frame_or_cols.

Example usage:

>>> import polars as pl
>>> from grizz.utils.column import find_missing_columns
>>> frame = pl.DataFrame(
...     {
...         "col1": [1, 2, 3, 4, 5],
...         "col2": ["1", "2", "3", "4", "5"],
...         "col3": ["a ", " b", "  c  ", "d", "e"],
...     }
... )
>>> cols = find_missing_columns(frame, columns=["col1", "col2", "col3", "col4"])
>>> cols
('col4',)
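The two column helpers can be read as order-preserving set operations on column names. The following is a minimal standard-library sketch, not the grizz implementation: the function names are hypothetical, and it assumes the result follows the order of `columns`, which is consistent with the examples above but not confirmed by them.

```python
from __future__ import annotations

from collections.abc import Sequence


def common_columns_sketch(frame_cols: Sequence[str], columns: Sequence[str]) -> tuple[str, ...]:
    # Keep the requested columns that exist in the frame (assumed order: `columns`).
    existing = set(frame_cols)
    return tuple(col for col in columns if col in existing)


def missing_columns_sketch(frame_cols: Sequence[str], columns: Sequence[str]) -> tuple[str, ...]:
    # Keep the requested columns that do not exist in the frame.
    existing = set(frame_cols)
    return tuple(col for col in columns if col not in existing)


frame_cols = ["col1", "col2", "col3"]
requested = ["col1", "col2", "col3", "col4"]
common = common_columns_sketch(frame_cols, requested)
missing = missing_columns_sketch(frame_cols, requested)
```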

grizz.utils.datetime

Contain utility functions for datetime and date objects.

grizz.utils.datetime.find_end_datetime

find_end_datetime(
    start: datetime | date,
    interval: str | timedelta,
    periods: int,
) -> datetime

Find the upper bound of a datetime range from its lower bound, the interval, and the number of periods.

Parameters:

Name Type Description Default
start datetime | date

The lower bound of the datetime range.

required
interval str | timedelta

The interval of the range periods, specified as a Python timedelta object or using the Polars duration string language.

required
periods int

The number of periods after the start.

required

Returns:

Type Description
datetime

The upper bound of the datetime range.

Notes

interval is created according to the following string language:

- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 calendar day)
- 1w (1 calendar week)

Example usage:

>>> from datetime import timedelta, datetime, timezone
>>> from grizz.utils.datetime import find_end_datetime
>>> find_end_datetime(
...     start=datetime(year=2020, month=5, day=12, hour=4, tzinfo=timezone.utc),
...     interval=timedelta(hours=1),
...     periods=42,
... )
datetime.datetime(2020, 5, 13, 22, 0, tzinfo=datetime.timezone.utc)
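When interval is a timedelta, the result in the example above matches plain datetime arithmetic: the lower bound plus the interval repeated periods times. The sketch below reproduces that arithmetic with the standard library only; it assumes nothing about how grizz handles string intervals.

```python
from datetime import datetime, timedelta, timezone

start = datetime(year=2020, month=5, day=12, hour=4, tzinfo=timezone.utc)
interval = timedelta(hours=1)
periods = 42

# Upper bound = lower bound + (interval * number of periods).
end = start + interval * periods
print(end)  # 2020-05-13 22:00:00+00:00
```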

grizz.utils.datetime.to_datetime

to_datetime(dt: datetime | date) -> datetime

Convert a date object to a datetime object.

Parameters:

Name Type Description Default
dt datetime | date

The date object to convert.

required

Returns:

Type Description
datetime

The datetime object.

Example usage:

>>> from datetime import datetime, date, timezone
>>> from grizz.utils.datetime import to_datetime
>>> to_datetime(datetime(year=2020, month=5, day=12, hour=4, tzinfo=timezone.utc))
datetime.datetime(2020, 5, 12, 4, 0, tzinfo=datetime.timezone.utc)
>>> to_datetime(date(year=2020, month=5, day=12))
datetime.datetime(2020, 5, 12, 0, 0, tzinfo=datetime.timezone.utc)
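The examples above suggest two cases: a datetime is returned unchanged, and a date becomes midnight UTC on the same day. A minimal sketch of that behaviour follows; the name `to_datetime_sketch` is hypothetical and the UTC default is inferred from the example output only.

```python
from datetime import date, datetime, timezone


def to_datetime_sketch(dt: date) -> datetime:
    # datetime is a subclass of date, so the datetime check must come first.
    if isinstance(dt, datetime):
        return dt
    # Assumption: a plain date maps to midnight UTC, as in the example output.
    return datetime(year=dt.year, month=dt.month, day=dt.day, tzinfo=timezone.utc)


from_date = to_datetime_sketch(date(year=2020, month=5, day=12))
```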

grizz.utils.factory

Contain a function to instantiate an object from its configuration.

grizz.utils.factory.setup_object

setup_object(obj_or_config: T | dict) -> T

Set up an object from its configuration.

Parameters:

Name Type Description Default
obj_or_config T | dict

The object or its configuration.

required

Returns:

Type Description
T

The instantiated object.

Example usage:

>>> from grizz.utils.factory import setup_object
>>> obj = setup_object({"_target_": "collections.deque", "iterable": [1, 2, 1, 3]})
>>> obj
deque([1, 2, 1, 3])
>>> setup_object(obj)  # Do nothing because the object is already instantiated
deque([1, 2, 1, 3])
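The configuration format in the example uses a "_target_" key that names the class to instantiate, with the remaining keys passed as keyword arguments. The sketch below shows one way such a factory could work using importlib; `setup_object_sketch` is a hypothetical name, and the real function may rely on a dedicated factory library instead.

```python
import importlib
from typing import Any


def setup_object_sketch(obj_or_config: Any) -> Any:
    # Anything that is not a configuration dict is assumed to be already instantiated.
    if not isinstance(obj_or_config, dict):
        return obj_or_config
    config = dict(obj_or_config)  # copy so the caller's dict is not modified
    target = config.pop("_target_")
    module_name, _, attr_name = target.rpartition(".")
    factory = getattr(importlib.import_module(module_name), attr_name)
    return factory(**config)


delta = setup_object_sketch({"_target_": "datetime.timedelta", "hours": 2})
```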

grizz.utils.format

Contain utility functions to format strings.

grizz.utils.format.human_byte

human_byte(size: float, decimal: int = 2) -> str

Return a human-readable string representation of byte sizes.

Parameters:

Name Type Description Default
size float

The number of bytes.

required
decimal int

The number of decimal digits.

2

Returns:

Type Description
str

The human-readable string representation of byte sizes.

Example usage:

>>> from grizz.utils.format import human_byte
>>> human_byte(2)
'2.00 B'
>>> human_byte(2048)
'2.00 KB'
>>> human_byte(2097152)
'2.00 MB'
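The example outputs are consistent with dividing by 1024 until the value fits the current unit. The following sketch reproduces them; the function name and the list of units beyond MB are assumptions, not the grizz implementation.

```python
def human_byte_sketch(size: float, decimal: int = 2) -> str:
    # Assumed unit ladder; only B, KB and MB appear in the documented examples.
    units = ["B", "KB", "MB", "GB", "TB"]
    for unit in units[:-1]:
        if size < 1024:
            return f"{size:.{decimal}f} {unit}"
        size /= 1024
    # Anything larger stays in the last unit.
    return f"{size:.{decimal}f} {units[-1]}"
```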

grizz.utils.format.str_col_diff

str_col_diff(orig: int, final: int) -> str

Return a string that indicates the difference of columns.

Parameters:

Name Type Description Default
orig int

The original number of columns.

required
final int

The final number of columns.

required

Returns:

Type Description
str

The generated string with the difference of columns.

Example usage:

>>> from grizz.utils.format import str_col_diff
>>> str_col_diff(100, 10)
'90/100 (90.0000 %) columns have been removed'
>>> str_col_diff(100, 99)
'1/100 (1.0000 %) column has been removed'
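The message combines the number of removed items, the original count, and the percentage removed, and it switches between singular and plural. A minimal sketch of that formatting follows; `str_diff_sketch` is a hypothetical helper, and the handling of an original count of zero is an assumption.

```python
def str_diff_sketch(orig: int, final: int, noun: str = "column") -> str:
    removed = orig - final
    # Assumption: report NaN instead of dividing by zero when there was nothing to remove.
    pct = 100.0 * removed / orig if orig > 0 else float("nan")
    if removed == 1:
        return f"{removed}/{orig} ({pct:.4f} %) {noun} has been removed"
    return f"{removed}/{orig} ({pct:.4f} %) {noun}s have been removed"
```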

grizz.utils.format.str_kwargs

str_kwargs(mapping: Mapping) -> str

Return a string representation of the input mapping.

This function is designed to be used in __repr__ and __str__ methods.

Parameters:

Name Type Description Default
mapping Mapping

The mapping.

required

Returns:

Type Description
str

The generated string.

Example usage:

>>> from grizz.utils.format import str_kwargs
>>> str_kwargs({"key1": 1})
', key1=1'
>>> str_kwargs({"key1": 1, "key2": 2})
', key1=1, key2=2'
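The leading ", " in the output makes the string easy to append after fixed arguments in a __repr__. The sketch below uses a local stand-in that reproduces the documented output, plus a hypothetical `Reader` class; neither is part of grizz.

```python
from collections.abc import Mapping


def str_kwargs_sketch(mapping: Mapping) -> str:
    # Each item is prefixed with ", " so the result can follow another argument.
    return "".join(f", {key}={value}" for key, value in mapping.items())


class Reader:
    """Hypothetical class that stores extra keyword arguments."""

    def __init__(self, path: str, **kwargs: object) -> None:
        self._path = path
        self._kwargs = kwargs

    def __repr__(self) -> str:
        return f"{self.__class__.__qualname__}(path={self._path}{str_kwargs_sketch(self._kwargs)})"


text = repr(Reader("data.csv", separator=";", n_rows=5))
print(text)  # Reader(path=data.csv, separator=;, n_rows=5)
```

With an empty mapping the helper returns an empty string, so the __repr__ stays well-formed when no extra arguments are given.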

grizz.utils.format.str_row_diff

str_row_diff(orig: int, final: int) -> str

Return a string that indicates the difference of rows.

Parameters:

Name Type Description Default
orig int

The original number of rows.

required
final int

The final number of rows.

required

Returns:

Type Description
str

The generated string with the difference of rows.

Example usage:

>>> from grizz.utils.format import str_row_diff
>>> str_row_diff(100, 10)
'90/100 (90.0000 %) rows have been removed'
>>> str_row_diff(100, 99)
'1/100 (1.0000 %) row has been removed'

grizz.utils.imports

Implement some utility functions to manage optional dependencies.

grizz.utils.imports.check_clickhouse_connect

check_clickhouse_connect() -> None

Check if the clickhouse_connect package is installed.

Raises:

Type Description
RuntimeError

if the clickhouse_connect package is not installed.

Example usage:

>>> from grizz.utils.imports import check_clickhouse_connect
>>> check_clickhouse_connect()

grizz.utils.imports.check_pyarrow

check_pyarrow() -> None

Check if the pyarrow package is installed.

Raises:

Type Description
RuntimeError

if the pyarrow package is not installed.

Example usage:

>>> from grizz.utils.imports import check_pyarrow
>>> check_pyarrow()

grizz.utils.imports.check_tqdm

check_tqdm() -> None

Check if the tqdm package is installed.

Raises:

Type Description
RuntimeError

if the tqdm package is not installed.

Example usage:

>>> from grizz.utils.imports import check_tqdm
>>> check_tqdm()

grizz.utils.imports.clickhouse_connect_available

clickhouse_connect_available(
    fn: Callable[..., Any]
) -> Callable[..., Any]

Implement a decorator to execute a function only if the clickhouse_connect package is installed.

Parameters:

Name Type Description Default
fn Callable[..., Any]

The function to execute.

required

Returns:

Type Description
Callable[..., Any]

A wrapper around fn if the clickhouse_connect package is installed, otherwise None.

Example usage:

>>> from grizz.utils.imports import clickhouse_connect_available
>>> @clickhouse_connect_available
... def my_function(n: int = 0) -> int:
...     return 42 + n
...
>>> my_function()

grizz.utils.imports.is_clickhouse_connect_available cached

is_clickhouse_connect_available() -> bool

Indicate if the clickhouse_connect package is installed or not.

Returns:

Type Description
bool

True if clickhouse_connect is available, otherwise False.

Example usage:

>>> from grizz.utils.imports import is_clickhouse_connect_available
>>> is_clickhouse_connect_available()

grizz.utils.imports.is_pyarrow_available cached

is_pyarrow_available() -> bool

Indicate if the pyarrow package is installed or not.

Returns:

Type Description
bool

True if pyarrow is available, otherwise False.

Example usage:

>>> from grizz.utils.imports import is_pyarrow_available
>>> is_pyarrow_available()

grizz.utils.imports.is_tqdm_available cached

is_tqdm_available() -> bool

Indicate if the tqdm package is installed or not.

Returns:

Type Description
bool

True if tqdm is available, otherwise False.

Example usage:

>>> from grizz.utils.imports import is_tqdm_available
>>> is_tqdm_available()

grizz.utils.imports.pyarrow_available

pyarrow_available(
    fn: Callable[..., Any]
) -> Callable[..., Any]

Implement a decorator to execute a function only if the pyarrow package is installed.

Parameters:

Name Type Description Default
fn Callable[..., Any]

The function to execute.

required

Returns:

Type Description
Callable[..., Any]

A wrapper around fn if the pyarrow package is installed, otherwise None.

Example usage:

>>> from grizz.utils.imports import pyarrow_available
>>> @pyarrow_available
... def my_function(n: int = 0) -> int:
...     return 42 + n
...
>>> my_function()

grizz.utils.imports.tqdm_available

tqdm_available(
    fn: Callable[..., Any]
) -> Callable[..., Any]

Implement a decorator to execute a function only if the tqdm package is installed.

Parameters:

Name Type Description Default
fn Callable[..., Any]

The function to execute.

required

Returns:

Type Description
Callable[..., Any]

A wrapper around fn if the tqdm package is installed, otherwise None.

Example usage:

>>> from grizz.utils.imports import tqdm_available
>>> @tqdm_available
... def my_function(n: int = 0) -> int:
...     return 42 + n
...
>>> my_function()
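The helpers in this module follow one pattern per optional package: a cached availability test, a check that raises RuntimeError, and a decorator that only runs the function when the package is present. The sketch below generalizes that pattern to any package name using the standard library; all three function names are hypothetical, and the assumption that the decorated function returns None when the package is missing is inferred from the Returns descriptions above.

```python
from functools import lru_cache, wraps
from importlib.util import find_spec
from typing import Any, Callable


@lru_cache(maxsize=None)
def is_package_available(name: str) -> bool:
    # find_spec returns None when a top-level package cannot be found.
    return find_spec(name) is not None


def check_package(name: str) -> None:
    if not is_package_available(name):
        msg = f"'{name}' package is required but not installed."
        raise RuntimeError(msg)


def package_available(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Assumption: skip the call and return None if the package is missing.
            if not is_package_available(name):
                return None
            return fn(*args, **kwargs)

        return wrapper

    return decorator


@package_available("json")  # standard-library module, always present
def with_json(n: int = 0) -> int:
    return 42 + n


@package_available("package_that_does_not_exist_12345")
def without_package(n: int = 0) -> int:
    return 42 + n
```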

grizz.utils.interval

Contain interval utility functions.

grizz.utils.interval.find_time_unit

find_time_unit(interval: str) -> str

Find the time unit associated with a polars interval.

Parameters:

Name Type Description Default
interval str

The polars interval to analyze.

required

Returns:

Type Description
str

The found time unit.

Raises:

Type Description
RuntimeError

if no valid time unit can be found.

Example usage:

>>> from grizz.utils.interval import find_time_unit
>>> find_time_unit("3d12h4m")
'm'
>>> find_time_unit("3y5mo")
'mo'
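In both examples the returned unit is the smallest one present in the interval. The sketch below implements that reading with a regular expression; the smallest-unit rule is inferred from the two examples only, and `find_time_unit_sketch` is a hypothetical name.

```python
import re

# Polars duration units, ordered from the smallest to the largest.
UNITS = ("ns", "us", "ms", "s", "m", "h", "d", "w", "mo", "q", "y")
# Two-letter units come first in the alternation so "mo" is not read as "m".
UNIT_PATTERN = re.compile(r"\d+(ns|us|ms|mo|s|m|h|d|w|q|y)")


def find_time_unit_sketch(interval: str) -> str:
    found = set(UNIT_PATTERN.findall(interval))
    for unit in UNITS:
        if unit in found:
            return unit
    msg = f"could not find a valid time unit in {interval!r}"
    raise RuntimeError(msg)
```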

grizz.utils.interval.interval_to_strftime_format

interval_to_strftime_format(interval: str) -> str

Return the default strftime format for a given interval.

Parameters:

Name Type Description Default
interval str

The polars interval to analyze.

required

Returns:

Type Description
str

The default strftime format.

Example usage:

>>> from grizz.utils.interval import interval_to_strftime_format
>>> interval_to_strftime_format("1h")
'%Y-%m-%d %H:%M'
>>> interval_to_strftime_format("3y1mo")
'%Y-%m'

grizz.utils.interval.interval_to_timedelta

interval_to_timedelta(interval: str) -> timedelta

Convert an interval to a timedelta object.

Parameters:

Name Type Description Default
interval str

The input interval.

required

Returns:

Type Description
timedelta

The timedelta object generated from the interval.

Example usage:

>>> from grizz.utils.interval import interval_to_timedelta
>>> interval_to_timedelta("5d1h42m")
datetime.timedelta(days=5, seconds=6120)
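An interval string is a sequence of integer and unit pairs, so the conversion amounts to summing one timedelta per pair. The sketch below covers only the fixed-length units that timedelta can represent (microseconds to weeks); the name `interval_to_timedelta_sketch` and the choice to raise ValueError on other units are assumptions.

```python
import re
from datetime import timedelta

UNIT_TO_KWARG = {
    "us": "microseconds",
    "ms": "milliseconds",
    "s": "seconds",
    "m": "minutes",
    "h": "hours",
    "d": "days",
    "w": "weeks",
}
PAIR = re.compile(r"(\d+)(us|ms|s|m|h|d|w)")
WHOLE = re.compile(r"(?:\d+(?:us|ms|s|m|h|d|w))+")


def interval_to_timedelta_sketch(interval: str) -> timedelta:
    # Reject strings with unsupported units such as "mo" or "y".
    if WHOLE.fullmatch(interval) is None:
        msg = f"unsupported interval: {interval!r}"
        raise ValueError(msg)
    total = timedelta()
    for value, unit in PAIR.findall(interval):
        total += timedelta(**{UNIT_TO_KWARG[unit]: int(value)})
    return total
```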

grizz.utils.interval.time_unit_to_strftime_format

time_unit_to_strftime_format(time_unit: str) -> str

Return the default strftime format for a given time unit.

Parameters:

Name Type Description Default
time_unit str

The time unit.

required

Returns:

Type Description
str

The default strftime format.

Example usage:

>>> from grizz.utils.interval import time_unit_to_strftime_format
>>> time_unit_to_strftime_format("h")
'%Y-%m-%d %H:%M'
>>> time_unit_to_strftime_format("mo")
'%Y-%m'
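The returned strings are standard strftime formats, so they can be passed directly to datetime.strftime. The short example below applies the two formats shown above to an arbitrary datetime, using the standard library only.

```python
from datetime import datetime

moment = datetime(year=2020, month=5, day=12, hour=4, minute=30)

# Format shown above for the "h" time unit: minute resolution.
hourly = moment.strftime("%Y-%m-%d %H:%M")
# Format shown above for the "mo" time unit: month resolution.
monthly = moment.strftime("%Y-%m")

print(hourly)  # 2020-05-12 04:30
print(monthly)  # 2020-05
```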

grizz.utils.noop

Contain no-op functions.

grizz.utils.noop.tqdm

tqdm(
    iterable: Iterable, *args: Any, **kwargs: Any
) -> Iterable

Implement a no-op tqdm progressbar that is used when tqdm is not installed.

Parameters:

Name Type Description Default
iterable Iterable

Iterable to decorate with a progressbar.

required
*args Any

Positional arbitrary arguments.

()
**kwargs Any

Keyword arbitrary arguments.

{}

Returns:

Type Description
Iterable

The input iterable.
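A no-op replacement like this is typically used as an import fallback, so that calling code can wrap iterables the same way whether or not tqdm is installed. The sketch below shows that pattern with a locally defined fallback; it does not show how grizz wires the fallback internally.

```python
from collections.abc import Iterable
from typing import Any

try:
    from tqdm import tqdm
except ImportError:

    def tqdm(iterable: Iterable, *args: Any, **kwargs: Any) -> Iterable:
        # Fallback: ignore all progress bar options and return the input as-is.
        return iterable


# The call site is identical in both cases; disable=True keeps the output quiet.
total = sum(value for value in tqdm(range(5), disable=True))
```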

grizz.utils.path

Contain utility functions to manage paths.

grizz.utils.path.find_files

find_files(
    path: Path | str,
    filter_fn: Callable[[Path], bool],
    recursive: bool = True,
) -> list[Path]

Find the paths of all the files that match a filter in a given path.

This function does not check if a path is a symbolic link so be careful if you are using a path with symbolic links.

Parameters:

Name Type Description Default
path Path | str

The path where to look for the files.

required
filter_fn Callable[[Path], bool]

The path filtering function. The function should return True for the path to find, and False otherwise.

required
recursive bool

Indicate if it should also check the sub-folders.

True

Returns:

Type Description
list[Path]

The list of paths of the files that match the filter.

Example usage:

>>> from pathlib import Path
>>> from grizz.utils.path import find_files
>>> find_files(Path("something"), filter_fn=lambda path: path.name.endswith(".txt"))
[...]
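The effect of filter_fn and recursive can be illustrated with pathlib alone. The sketch below is an approximation, not the grizz implementation: `find_files_sketch` is a hypothetical name, and sorting the result and skipping directories are assumptions.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from typing import Callable


def find_files_sketch(
    path: Path | str, filter_fn: Callable[[Path], bool], recursive: bool = True
) -> list[Path]:
    path = Path(path)
    # rglob walks the sub-folders, glob only looks at the top level.
    candidates = path.rglob("*") if recursive else path.glob("*")
    return sorted(p for p in candidates if p.is_file() and filter_fn(p))


def is_txt(p: Path) -> bool:
    return p.suffix == ".txt"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("a")
    (root / "sub" / "b.txt").write_text("b")
    (root / "c.md").write_text("c")

    deep = [p.relative_to(root).as_posix() for p in find_files_sketch(root, is_txt)]
    shallow = [
        p.relative_to(root).as_posix()
        for p in find_files_sketch(root, is_txt, recursive=False)
    ]

print(deep)  # ['a.txt', 'sub/b.txt']
print(shallow)  # ['a.txt']
```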

grizz.utils.path.find_parquet_files

find_parquet_files(
    path: Path | str, recursive: bool = True
) -> list[Path]

Find the paths of all the parquet files in a given path.

This function does not check if a path is a symbolic link so be careful if you are using a path with symbolic links.

Parameters:

Name Type Description Default
path Path | str

The path where to look for the parquet files.

required
recursive bool

Indicate if it should also check the sub-folders.

True

Returns:

Type Description
list[Path]

The list of parquet files.

Example usage:

>>> from pathlib import Path
>>> from grizz.utils.path import find_parquet_files
>>> find_parquet_files(Path("something"))
[...]

grizz.utils.path.human_file_size

human_file_size(path: Path | str, decimal: int = 2) -> str

Get a human-readable representation of a file size.

Parameters:

Name Type Description Default
path Path | str

The path to the file.

required
decimal int

The number of decimal digits.

2

Returns:

Type Description
str

The file size in a human-readable format.

Example usage:

>>> from grizz.utils.path import human_file_size
>>> human_file_size("README.md")
'...B'

grizz.utils.path.sanitize_path

sanitize_path(path: Path | str) -> Path

Sanitize a given path.

Parameters:

Name Type Description Default
path Path | str

The path to sanitize.

required

Returns:

Type Description
Path

The sanitized path.

Example usage:

>>> from pathlib import Path
>>> from grizz.utils.path import sanitize_path
>>> sanitize_path("something")
PosixPath('.../something')
>>> sanitize_path("")
PosixPath('...')
>>> sanitize_path(Path("something"))
PosixPath('.../something')
>>> sanitize_path(Path("something/./../"))
PosixPath('...')
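The example outputs (absolute paths, with "." and ".." components collapsed) are consistent with expanding the user directory and resolving the path. The sketch below shows that reading with pathlib; it is an assumption about the behaviour, and `sanitize_path_sketch` is a hypothetical name.

```python
from __future__ import annotations

from pathlib import Path


def sanitize_path_sketch(path: Path | str) -> Path:
    # Expand "~" and turn the path into an absolute, normalized path.
    return Path(path).expanduser().resolve()


absolute = sanitize_path_sketch("something")
collapsed = sanitize_path_sketch(Path("something/./../"))
```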