utils

grizz.utils ¶

Contain utility functions.

grizz.utils.column ¶

Contain DataFrame columns utility functions.

grizz.utils.column.find_common_columns ¶

find_common_columns(
    frame_or_cols: DataFrame | Sequence,
    columns: Sequence[str],
) -> tuple[str, ...]

Find the columns that are in both the DataFrame and the given columns.

Parameters:

Name	Type	Description	Default
`frame_or_cols`	`DataFrame \| Sequence`	The DataFrame or its columns.	required
`columns`	`Sequence[str]`	The columns to check.	required

Returns:

Type	Description
`tuple[str, ...]`	The common columns, i.e. the columns that are in both `columns` and `frame_or_cols`.

Example usage:

>>> import polars as pl
>>> from grizz.utils.column import find_common_columns
>>> frame = pl.DataFrame(
...     {
...         "col1": [1, 2, 3, 4, 5],
...         "col2": ["1", "2", "3", "4", "5"],
...         "col3": ["a ", " b", "  c  ", "d", "e"],
...     }
... )
>>> cols = find_common_columns(frame, columns=["col1", "col2", "col3", "col4"])
>>> cols
('col1', 'col2', 'col3')

grizz.utils.column.find_missing_columns ¶

find_missing_columns(
    frame_or_cols: DataFrame | Sequence,
    columns: Sequence[str],
) -> tuple[str, ...]

Find the columns that are in the given columns but not in the DataFrame.

Parameters:

Name	Type	Description	Default
`frame_or_cols`	`DataFrame \| Sequence`	The DataFrame or its columns.	required
`columns`	`Sequence[str]`	The columns to check.	required

Returns:

Type	Description
`tuple[str, ...]`	The tuple of missing columns, i.e. the columns that are in `columns` but not in `frame_or_cols`.

Example usage:

>>> import polars as pl
>>> from grizz.utils.column import find_missing_columns
>>> frame = pl.DataFrame(
...     {
...         "col1": [1, 2, 3, 4, 5],
...         "col2": ["1", "2", "3", "4", "5"],
...         "col3": ["a ", " b", "  c  ", "d", "e"],
...     }
... )
>>> cols = find_missing_columns(frame, columns=["col1", "col2", "col3", "col4"])
>>> cols
('col4',)
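The two helpers are complementary: each name in `columns` lands either in the common tuple or in the missing tuple. The following plain-Python sketch illustrates that split. It is not the library's implementation: the helper name `split_columns` is hypothetical, and keeping the order of `columns` is an assumption suggested by the examples above.

```python
from collections.abc import Sequence


def split_columns(
    frame_cols: Sequence[str], columns: Sequence[str]
) -> tuple[tuple[str, ...], tuple[str, ...]]:
    """Split ``columns`` into (common, missing) relative to ``frame_cols``."""
    existing = set(frame_cols)  # set membership keeps each lookup cheap
    common = tuple(col for col in columns if col in existing)
    missing = tuple(col for col in columns if col not in existing)
    return common, missing


common, missing = split_columns(
    ["col1", "col2", "col3"], ["col1", "col2", "col3", "col4"]
)
print(common)
print(missing)
```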

grizz.utils.datetime ¶

Contain utility functions for datetime and date objects.

grizz.utils.datetime.find_end_datetime ¶

find_end_datetime(
    start: datetime | date,
    interval: str | timedelta,
    periods: int,
) -> datetime

Find the upper bound of a datetime range from its lower bound, the interval, and the number of periods.

Parameters:

Name	Type	Description	Default
`start`	`datetime \| date`	The lower bound of the datetime range.	required
`interval`	`str \| timedelta`	The interval of the range periods, specified as a Python timedelta object or using the Polars duration string language.	required
`periods`	`int`	The number of periods after the start.	required

Returns:

Type	Description
`datetime`	The upper bound of the datetime range.

Notes

A string `interval` is written with the following units of the Polars duration string language:

- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 calendar day)
- 1w (1 calendar week)

Example usage:

>>> from datetime import timedelta, datetime, timezone
>>> from grizz.utils.datetime import find_end_datetime
>>> find_end_datetime(
...     start=datetime(year=2020, month=5, day=12, hour=4, tzinfo=timezone.utc),
...     interval=timedelta(hours=1),
...     periods=42,
... )
datetime.datetime(2020, 5, 13, 22, 0, tzinfo=datetime.timezone.utc)
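The result above is consistent with adding `periods` intervals to `start`. A minimal sketch of that arithmetic with the standard library only, assuming a `timedelta` interval (string intervals would first need converting):

```python
from datetime import datetime, timedelta, timezone

start = datetime(year=2020, month=5, day=12, hour=4, tzinfo=timezone.utc)
interval = timedelta(hours=1)
periods = 42

# Assumed relation: the upper bound lies `periods` intervals after the start.
end = start + interval * periods
print(end)
```

Here 4 h + 42 h = 46 h, which rolls over to 22:00 on the next day.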

grizz.utils.datetime.to_datetime ¶

to_datetime(dt: datetime | date) -> datetime

Convert a date object to a datetime object. As the example below shows, a datetime input is returned unchanged.

Parameters:

Name	Type	Description	Default
`dt`	`datetime \| date`	The `date` or `datetime` object to convert.	required

Returns:

Type	Description
`datetime`	The `datetime` object.

Example usage:

>>> from datetime import datetime, date, timezone
>>> from grizz.utils.datetime import to_datetime
>>> to_datetime(datetime(year=2020, month=5, day=12, hour=4, tzinfo=timezone.utc))
datetime.datetime(2020, 5, 12, 4, 0, tzinfo=datetime.timezone.utc)
>>> to_datetime(date(year=2020, month=5, day=12))
datetime.datetime(2020, 5, 12, 0, 0, tzinfo=datetime.timezone.utc)
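One way to reproduce the behaviour shown above with the standard library is sketched below. The name `to_datetime_sketch` is hypothetical, and the midnight UTC default is an assumption taken from the example output. Because `datetime` is a subclass of `date`, the `datetime` check has to come first.

```python
from datetime import date, datetime, time, timezone
from typing import Union


def to_datetime_sketch(dt: Union[datetime, date]) -> datetime:
    """Return ``dt`` as a datetime, using midnight UTC for plain dates."""
    if isinstance(dt, datetime):  # checked first: datetime subclasses date
        return dt
    return datetime.combine(dt, time(), tzinfo=timezone.utc)


converted = to_datetime_sketch(date(year=2020, month=5, day=12))
unchanged = to_datetime_sketch(datetime(2020, 5, 12, 4, tzinfo=timezone.utc))
print(converted, unchanged)
```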

grizz.utils.factory ¶

Contain a function to instantiate an object from its configuration.

grizz.utils.factory.setup_object ¶

setup_object(obj_or_config: T | dict) -> T

Set up an object from its configuration.

Parameters:

Name	Type	Description	Default
`obj_or_config`	`T \| dict`	The object or its configuration.	required

Returns:

Type	Description
`T`	The instantiated object.

Example usage:

>>> from grizz.utils.factory import setup_object
>>> obj = setup_object({"_target_": "collections.deque", "iterable": [1, 2, 1, 3]})
>>> obj
deque([1, 2, 1, 3])
>>> setup_object(obj)  # Do nothing because the object is already instantiated
deque([1, 2, 1, 3])
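The `_target_` key names the class to instantiate, and the remaining keys are passed as keyword arguments. A rough sketch of that idea using `importlib` follows. The function `setup_object_sketch` is hypothetical and only illustrates the pattern; the actual function may rely on a dedicated factory library and handle more cases.

```python
import importlib
from datetime import timedelta
from typing import Any


def setup_object_sketch(obj_or_config: Any) -> Any:
    """Instantiate an object from a ``_target_`` config, or pass it through."""
    if not isinstance(obj_or_config, dict):
        return obj_or_config  # already instantiated: nothing to do
    kwargs = dict(obj_or_config)  # copy so the caller's dict is untouched
    module_name, _, attr_name = kwargs.pop("_target_").rpartition(".")
    factory = getattr(importlib.import_module(module_name), attr_name)
    return factory(**kwargs)


obj = setup_object_sketch({"_target_": "datetime.timedelta", "hours": 2})
same = setup_object_sketch(obj)
print(obj, same is obj)
```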

grizz.utils.format ¶

Contain utility functions to format strings.

grizz.utils.format.human_byte ¶

human_byte(size: float, decimal: int = 2) -> str

Return a human-readable string representation of byte sizes.

Parameters:

Name	Type	Description	Default
`size`	`float`	The number of bytes.	required
`decimal`	`int`	The number of decimal digits.	`2`

Returns:

Type	Description
`str`	The human-readable string representation of byte sizes.

Example usage:

>>> from grizz.utils.format import human_byte
>>> human_byte(2)
'2.00 B'
>>> human_byte(2048)
'2.00 KB'
>>> human_byte(2097152)
'2.00 MB'
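The outputs above are consistent with units that grow by a factor of 1024. A small sketch under that assumption; the name `human_byte_sketch` and the units beyond MB are hypothetical:

```python
def human_byte_sketch(size: float, decimal: int = 2) -> str:
    """Format a byte count with 1024-based units (assumed unit list)."""
    units = ("B", "KB", "MB", "GB", "TB")
    index = 0
    # Move to the next unit while the value still exceeds one unit step.
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    return f"{size:.{decimal}f} {units[index]}"


print(human_byte_sketch(2))
print(human_byte_sketch(2048))
print(human_byte_sketch(2097152, decimal=1))
```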

grizz.utils.format.str_col_diff ¶

str_col_diff(orig: int, final: int) -> str

Return a string that indicates the difference of columns.

Parameters:

Name	Type	Description	Default
`orig`	`int`	The original number of columns.	required
`final`	`int`	The final number of columns.	required

Returns:

Type	Description
`str`	The generated string with the difference of columns.

Example usage:

>>> from grizz.utils.format import str_col_diff
>>> str_col_diff(100, 10)
'90/100 (90.0000 %) columns have been removed'
>>> str_col_diff(100, 99)
'1/100 (1.0000 %) column has been removed'

grizz.utils.format.str_kwargs ¶

str_kwargs(mapping: Mapping) -> str

Return a string of the input mapping.

This function is designed to be used in __repr__ and __str__ methods.

Parameters:

Name	Type	Description	Default
`mapping`	`Mapping`	The mapping.	required

Returns:

Type	Description
`str`	The generated string.

Example usage:

>>> from grizz.utils.format import str_kwargs
>>> str_kwargs({"key1": 1})
', key1=1'
>>> str_kwargs({"key1": 1, "key2": 2})
', key1=1, key2=2'
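The leading `, ` lets the result be appended directly after fixed arguments in a `__repr__`, and an empty mapping then adds nothing. The sketch below shows that usage. `str_kwargs_sketch` and the `Scaler` class are hypothetical, and formatting values with `str` is an assumption, since the examples above do not distinguish `str` from `repr`.

```python
from collections.abc import Mapping


def str_kwargs_sketch(mapping: Mapping) -> str:
    """Render a mapping as ', key=value' fragments."""
    return "".join(f", {key}={value}" for key, value in mapping.items())


class Scaler:
    """Toy class that stores extra keyword arguments."""

    def __init__(self, column: str, **kwargs: object) -> None:
        self._column = column
        self._kwargs = kwargs

    def __repr__(self) -> str:
        extra = str_kwargs_sketch(self._kwargs)
        return f"{type(self).__name__}(column={self._column}{extra})"


with_extra = repr(Scaler("col1", factor=2, offset=1))
without_extra = repr(Scaler("col1"))
print(with_extra)
print(without_extra)
```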

grizz.utils.format.str_row_diff ¶

str_row_diff(orig: int, final: int) -> str

Return a string that indicates the difference of rows.

Parameters:

Name	Type	Description	Default
`orig`	`int`	The original number of rows.	required
`final`	`int`	The final number of rows.	required

Returns:

Type	Description
`str`	The generated string with the difference of rows.

Example usage:

>>> from grizz.utils.format import str_row_diff
>>> str_row_diff(100, 10)
'90/100 (90.0000 %) rows have been removed'
>>> str_row_diff(100, 99)
'1/100 (1.0000 %) row has been removed'
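`str_col_diff` and `str_row_diff` report the same quantity: the number of removed items is `orig - final`, and the percentage is taken relative to `orig`. A sketch of that computation follows. The helper `str_diff_sketch` is hypothetical, and its handling of `orig == 0` is an assumption, since the documentation does not cover that case.

```python
def str_diff_sketch(orig: int, final: int, singular: str, plural: str) -> str:
    """Describe how many items were removed between two counts."""
    removed = orig - final
    # Assumption: report NaN when there was nothing to remove from.
    pct = 100 * removed / orig if orig > 0 else float("nan")
    noun, verb = (singular, "has") if removed == 1 else (plural, "have")
    return f"{removed}/{orig} ({pct:.4f} %) {noun} {verb} been removed"


print(str_diff_sketch(100, 10, "row", "rows"))
print(str_diff_sketch(100, 99, "column", "columns"))
```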

grizz.utils.imports ¶

Implement some utility functions to manage optional dependencies.

grizz.utils.imports.check_clickhouse_connect ¶

check_clickhouse_connect() -> None

Check if the clickhouse_connect package is installed.

Raises:

Type	Description
`RuntimeError`	if the `clickhouse_connect` package is not installed.

Example usage:

>>> from grizz.utils.imports import check_clickhouse_connect
>>> check_clickhouse_connect()

grizz.utils.imports.check_pyarrow ¶

check_pyarrow() -> None

Check if the pyarrow package is installed.

Raises:

Type	Description
`RuntimeError`	if the `pyarrow` package is not installed.

Example usage:

>>> from grizz.utils.imports import check_pyarrow
>>> check_pyarrow()

grizz.utils.imports.check_tqdm ¶

check_tqdm() -> None

Check if the tqdm package is installed.

Raises:

Type	Description
`RuntimeError`	if the `tqdm` package is not installed.

Example usage:

>>> from grizz.utils.imports import check_tqdm
>>> check_tqdm()

grizz.utils.imports.clickhouse_connect_available ¶

clickhouse_connect_available(
    fn: Callable[..., Any]
) -> Callable[..., Any]

Implement a decorator to execute a function only if the clickhouse_connect package is installed.

Parameters:

Name	Type	Description	Default
`fn`	`Callable[..., Any]`	The function to execute.	required

Returns:

Type	Description
`Callable[..., Any]`	A wrapper around `fn` if the `clickhouse_connect` package is installed, otherwise `None`.

Example usage:

>>> from grizz.utils.imports import clickhouse_connect_available
>>> @clickhouse_connect_available
... def my_function(n: int = 0) -> int:
...     return 42 + n
...
>>> my_function()

grizz.utils.imports.is_clickhouse_connect_available `cached` ¶

is_clickhouse_connect_available() -> bool

Indicate if the clickhouse_connect package is installed or not.

Returns:

Type	Description
`bool`	`True` if `clickhouse_connect` is available otherwise `False`.

Example usage:

>>> from grizz.utils.imports import is_clickhouse_connect_available
>>> is_clickhouse_connect_available()

grizz.utils.imports.is_pyarrow_available `cached` ¶

is_pyarrow_available() -> bool

Indicate if the pyarrow package is installed or not.

Returns:

Type	Description
`bool`	`True` if `pyarrow` is available otherwise `False`.

Example usage:

>>> from grizz.utils.imports import is_pyarrow_available
>>> is_pyarrow_available()

grizz.utils.imports.is_tqdm_available `cached` ¶

is_tqdm_available() -> bool

Indicate if the tqdm package is installed or not.

Returns:

Type	Description
`bool`	`True` if `tqdm` is available otherwise `False`.

Example usage:

>>> from grizz.utils.imports import is_tqdm_available
>>> is_tqdm_available()
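The `is_*_available` functions are marked `cached`, and the `check_*` functions raise `RuntimeError` when the package is missing. A generic sketch of both patterns with the standard library is given below. The names `is_package_available` and `check_package` are hypothetical, and using `importlib.util.find_spec` with `functools.lru_cache` is an assumption about how such helpers can be written, not a description of the library's code.

```python
from functools import lru_cache
from importlib.util import find_spec


@lru_cache(maxsize=None)
def is_package_available(name: str) -> bool:
    """Return True if a top-level package can be found (result is cached)."""
    return find_spec(name) is not None


def check_package(name: str) -> None:
    """Raise RuntimeError if the package cannot be found."""
    if not is_package_available(name):
        msg = f"'{name}' package is required but not installed."
        raise RuntimeError(msg)


check_package("json")  # standard-library module: passes silently
print(is_package_available("package_that_should_not_exist_12345"))
```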

grizz.utils.imports.pyarrow_available ¶

pyarrow_available(
    fn: Callable[..., Any]
) -> Callable[..., Any]

Implement a decorator to execute a function only if the pyarrow package is installed.

Parameters:

Name	Type	Description	Default
`fn`	`Callable[..., Any]`	The function to execute.	required

Returns:

Type	Description
`Callable[..., Any]`	A wrapper around `fn` if the `pyarrow` package is installed, otherwise `None`.

Example usage:

>>> from grizz.utils.imports import pyarrow_available
>>> @pyarrow_available
... def my_function(n: int = 0) -> int:
...     return 42 + n
...
>>> my_function()

grizz.utils.imports.tqdm_available ¶

tqdm_available(
    fn: Callable[..., Any]
) -> Callable[..., Any]

Implement a decorator to execute a function only if the tqdm package is installed.

Parameters:

Name	Type	Description	Default
`fn`	`Callable[..., Any]`	The function to execute.	required

Returns:

Type	Description
`Callable[..., Any]`	A wrapper around `fn` if the `tqdm` package is installed, otherwise `None`.

Example usage:

>>> from grizz.utils.imports import tqdm_available
>>> @tqdm_available
... def my_function(n: int = 0) -> int:
...     return 42 + n
...
>>> my_function()
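The three `*_available` decorators share one pattern: the wrapped function runs only when the optional package can be found. The sketch below generalises that pattern to any package name. The factory `package_available` is hypothetical, and returning `None` from the call when the package is missing is one plausible reading of the description above.

```python
from functools import wraps
from importlib.util import find_spec
from typing import Any, Callable


def package_available(
    name: str,
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Build a decorator that runs ``fn`` only if ``name`` can be found."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(fn)  # keep the name and docstring of the wrapped function
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if find_spec(name) is None:
                return None  # package missing: skip the call
            return fn(*args, **kwargs)

        return wrapper

    return decorator


@package_available("json")
def with_installed(n: int = 0) -> int:
    return 42 + n


@package_available("package_that_should_not_exist_12345")
def with_missing(n: int = 0) -> int:
    return 42 + n


print(with_installed(), with_missing())
```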

grizz.utils.interval ¶

Contain interval utility functions.

grizz.utils.interval.find_time_unit ¶

find_time_unit(interval: str) -> str

Find the time unit associated with a polars interval.

Parameters:

Name	Type	Description	Default
`interval`	`str`	The `polars` interval to analyze.	required

Returns:

Type	Description
`str`	The found time unit.

Raises:

Type	Description
`RuntimeError`	if no valid time unit can be found.

Example usage:

>>> from grizz.utils.interval import find_time_unit
>>> find_time_unit("3d12h4m")
'm'
>>> find_time_unit("3y5mo")
'mo'
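In both examples the returned unit is the smallest one present in the interval. A regex-based sketch under that assumption follows. The name `smallest_time_unit` and the list of supported units are hypothetical, not taken from the library.

```python
import re

# Units ordered from smallest to largest (assumed set of units).
UNITS = ("ns", "us", "ms", "s", "m", "h", "d", "w", "mo", "q", "y")
# Two-letter units come first so "mo" is not read as "m" followed by "o".
PATTERN = re.compile(r"\d+(ns|us|ms|mo|s|m|h|d|w|q|y)")


def smallest_time_unit(interval: str) -> str:
    """Return the smallest time unit that appears in the interval string."""
    found = PATTERN.findall(interval)
    if not found:
        msg = f"could not find a valid time unit in {interval!r}"
        raise RuntimeError(msg)
    return min(found, key=UNITS.index)


print(smallest_time_unit("3d12h4m"))
print(smallest_time_unit("3y5mo"))
```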

grizz.utils.interval.interval_to_strftime_format ¶

interval_to_strftime_format(interval: str) -> str

Return the default strftime format for a given interval.

Parameters:

Name	Type	Description	Default
`interval`	`str`	The `polars` interval to analyze.	required

Returns:

Type	Description
`str`	The default strftime format.

Example usage:

>>> from grizz.utils.interval import interval_to_strftime_format
>>> interval_to_strftime_format("1h")
'%Y-%m-%d %H:%M'
>>> interval_to_strftime_format("3y1mo")
'%Y-%m'

grizz.utils.interval.interval_to_timedelta ¶

interval_to_timedelta(interval: str) -> timedelta

Convert an interval to a timedelta object.

Parameters:

Name	Type	Description	Default
`interval`	`str`	The input interval.	required

Returns:

Type	Description
`timedelta`	The timedelta object generated from the interval.

Example usage:

>>> from grizz.utils.interval import interval_to_timedelta
>>> interval_to_timedelta("5d1h42m")
datetime.timedelta(days=5, seconds=6120)
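The conversion amounts to summing one `timedelta` per `<number><unit>` token. A sketch limited to fixed-length units that `timedelta` can represent is shown below. The function `interval_to_timedelta_sketch`, the supported unit list, and the `ValueError` on unsupported input are all assumptions.

```python
import re
from datetime import timedelta

# Map interval units to timedelta keyword arguments (assumed subset).
UNIT_TO_KWARG = {
    "us": "microseconds",
    "ms": "milliseconds",
    "s": "seconds",
    "m": "minutes",
    "h": "hours",
    "d": "days",
    "w": "weeks",
}
TOKEN = re.compile(r"(\d+)(us|ms|s|m|h|d|w)")
FULL = re.compile(r"(?:\d+(?:us|ms|s|m|h|d|w))+")


def interval_to_timedelta_sketch(interval: str) -> timedelta:
    """Sum the timedelta of each '<number><unit>' token in the interval."""
    if not FULL.fullmatch(interval):  # rejects e.g. "5mo" or "abc"
        msg = f"unsupported interval: {interval!r}"
        raise ValueError(msg)
    total = timedelta()
    for value, unit in TOKEN.findall(interval):
        total += timedelta(**{UNIT_TO_KWARG[unit]: int(value)})
    return total


delta = interval_to_timedelta_sketch("5d1h42m")
print(repr(delta))
```

For `"5d1h42m"`, 1 h 42 min is 3600 + 2520 = 6120 seconds, which matches the `seconds=6120` shown above.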

grizz.utils.interval.time_unit_to_strftime_format ¶

time_unit_to_strftime_format(time_unit: str) -> str

Return the default strftime format for a given time unit.

Parameters:

Name	Type	Description	Default
`time_unit`	`str`	The time unit.	required

Returns:

Type	Description
`str`	The default strftime format.

Example usage:

>>> from grizz.utils.interval import time_unit_to_strftime_format
>>> time_unit_to_strftime_format("h")
'%Y-%m-%d %H:%M'
>>> time_unit_to_strftime_format("mo")
'%Y-%m'
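To see what the two formats above produce, they can be passed to `datetime.strftime`. The timestamp below is an arbitrary illustration:

```python
from datetime import datetime

moment = datetime(year=2020, month=5, day=12, hour=4, minute=30)

# Format returned for the "h" time unit in the example above.
hourly_label = moment.strftime("%Y-%m-%d %H:%M")
# Format returned for the "mo" time unit in the example above.
monthly_label = moment.strftime("%Y-%m")

print(hourly_label)
print(monthly_label)
```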

grizz.utils.noop ¶

Contain no-op functions.

grizz.utils.noop.tqdm ¶

tqdm(
    iterable: Iterable, *args: Any, **kwargs: Any
) -> Iterable

Implement a no-op tqdm progressbar that is used when tqdm is not installed.

Parameters:

Name	Type	Description	Default
`iterable`	`Iterable`	Iterable to decorate with a progressbar.	required
`*args`	`Any`	Positional arbitrary arguments.	`()`
`**kwargs`	`Any`	Keyword arbitrary arguments.	`{}`

Returns:

Type	Description
`Iterable`	The input iterable.
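A no-op like this is typically used as a drop-in fallback when the real package cannot be imported. The sketch below shows that pattern. `noop_tqdm` is a hypothetical stand-in for the function documented here, and the `try`/`except` import is an assumption about how a caller might select between the two.

```python
from typing import Any, Iterable


def noop_tqdm(iterable: Iterable, *args: Any, **kwargs: Any) -> Iterable:
    """Return the iterable unchanged, ignoring progress-bar options."""
    return iterable


try:
    from tqdm import tqdm  # real progress bar when the package is installed
except ImportError:
    tqdm = noop_tqdm  # same call signature, no progress bar

# The loop is written once and works with either implementation.
total = sum(tqdm(range(5), desc="summing", disable=True))
print(total)
```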

grizz.utils.path ¶

Contain utility functions to manage paths.

grizz.utils.path.find_files ¶

find_files(
    path: Path | str,
    filter_fn: Callable[[Path], bool],
    recursive: bool = True,
) -> list[Path]

Find the paths of all the files in a given path that match a filter.

This function does not check if a path is a symbolic link so be careful if you are using a path with symbolic links.

Parameters:

Name	Type	Description	Default
`path`	`Path \| str`	The path where to look for the files.	required
`filter_fn`	`Callable[[Path], bool]`	The path filtering function. The function should return `True` for the path to find, and `False` otherwise.	required
`recursive`	`bool`	Indicate if it should also check the sub-folders.	`True`

Returns:

Type	Description
`list[Path]`	The list of paths of the files that match the filter.

Example usage:

>>> from pathlib import Path
>>> from grizz.utils.path import find_files
>>> find_files(Path("something"), filter_fn=lambda path: path.name.endswith(".txt"))
[...]
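The effect of `filter_fn` and `recursive` can be illustrated with `pathlib` on a temporary directory. `find_files_sketch` is a hypothetical re-implementation, and sorting the result is an assumption made so that the output is deterministic.

```python
import tempfile
from pathlib import Path
from typing import Callable, Union


def find_files_sketch(
    path: Union[Path, str],
    filter_fn: Callable[[Path], bool],
    recursive: bool = True,
) -> list[Path]:
    """Collect the files under ``path`` accepted by ``filter_fn``."""
    root = Path(path)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates if p.is_file() and filter_fn(p))


def is_txt(path: Path) -> bool:
    return path.name.endswith(".txt")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("a")
    (root / "sub" / "b.txt").write_text("b")
    (root / "c.parquet").write_text("c")
    deep = [p.relative_to(root).as_posix() for p in find_files_sketch(root, is_txt)]
    shallow = [
        p.relative_to(root).as_posix()
        for p in find_files_sketch(root, is_txt, recursive=False)
    ]

print(deep)
print(shallow)
```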

grizz.utils.path.find_parquet_files ¶

find_parquet_files(
    path: Path | str, recursive: bool = True
) -> list[Path]

Find the paths of all the parquet files in a given path.

This function does not check if a path is a symbolic link so be careful if you are using a path with symbolic links.

Parameters:

Name	Type	Description	Default
`path`	`Path \| str`	The path where to look for the parquet files.	required
`recursive`	`bool`	Specifies if it should also check the sub-folders.	`True`

Returns:

Type	Description
`list[Path]`	The list of parquet files.

Example usage:

>>> from pathlib import Path
>>> from grizz.utils.path import find_parquet_files
>>> find_parquet_files(Path("something"))
[...]

grizz.utils.path.human_file_size ¶

human_file_size(path: Path | str, decimal: int = 2) -> str

Get a human-readable representation of a file size.

Parameters:

Name	Type	Description	Default
`path`	`Path \| str`	The path to the file.	required
`decimal`	`int`	The number of decimal digits.	`2`

Returns:

Type	Description
`str`	The file size in a human-readable format.

Example usage:

>>> from grizz.utils.path import human_file_size
>>> human_file_size("README.md")
'...B'

grizz.utils.path.sanitize_path ¶

sanitize_path(path: Path | str) -> Path

Sanitize a given path.

Parameters:

Name	Type	Description	Default
`path`	`Path \| str`	The path to sanitize.	required

Returns:

Type	Description
`Path`	The sanitized path.

Example usage:

>>> from pathlib import Path
>>> from grizz.utils.path import sanitize_path
>>> sanitize_path("something")
PosixPath('.../something')
>>> sanitize_path("")
PosixPath('...')
>>> sanitize_path(Path("something"))
PosixPath('.../something')
>>> sanitize_path(Path("something/./../"))
PosixPath('...')
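The outputs above (absolute paths, with `.` and `..` collapsed) match what `pathlib` produces when a path is expanded and resolved. A sketch under that assumption; `sanitize_path_sketch` is a hypothetical name and the real function may do more or less than this:

```python
from pathlib import Path
from typing import Union


def sanitize_path_sketch(path: Union[Path, str]) -> Path:
    """Expand '~' and resolve the path to an absolute, normalised form."""
    return Path(path).expanduser().resolve()


cwd = Path.cwd().resolve()
child = sanitize_path_sketch("something")
parent = sanitize_path_sketch(Path("something/./../"))
print(child.is_absolute(), parent == cwd)
```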
