utils
grizz.utils ¶
Contain utility functions.
grizz.utils.column ¶
Contain DataFrame columns utility functions.
grizz.utils.column.find_common_columns ¶
find_common_columns(
frame_or_cols: DataFrame | Sequence,
columns: Sequence[str],
) -> tuple[str, ...]
Find the common columns that are both in the DataFrame and the given columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`frame_or_cols` | `DataFrame \| Sequence` | The DataFrame or its columns. | required |
`columns` | `Sequence[str]` | The columns to check. | required |
Returns:
Type | Description |
---|---|
`tuple[str, ...]` | The common columns, i.e. the columns that are both in `columns` and `frame_or_cols`. |
Example usage:
>>> import polars as pl
>>> from grizz.utils.column import find_common_columns
>>> frame = pl.DataFrame(
... {
... "col1": [1, 2, 3, 4, 5],
... "col2": ["1", "2", "3", "4", "5"],
... "col3": ["a ", " b", " c ", "d", "e"],
... }
... )
>>> cols = find_common_columns(frame, columns=["col1", "col2", "col3", "col4"])
>>> cols
('col1', 'col2', 'col3')
grizz.utils.column.find_missing_columns ¶
find_missing_columns(
frame_or_cols: DataFrame | Sequence,
columns: Sequence[str],
) -> tuple[str, ...]
Find the columns that are in the given columns but not in the DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`frame_or_cols` | `DataFrame \| Sequence` | The DataFrame or its columns. | required |
`columns` | `Sequence[str]` | The columns to check. | required |
Returns:
Type | Description |
---|---|
`tuple[str, ...]` | The missing columns, i.e. the columns that are in `columns` but not in `frame_or_cols`. |
Example usage:
>>> import polars as pl
>>> from grizz.utils.column import find_missing_columns
>>> frame = pl.DataFrame(
... {
... "col1": [1, 2, 3, 4, 5],
... "col2": ["1", "2", "3", "4", "5"],
... "col3": ["a ", " b", " c ", "d", "e"],
... }
... )
>>> cols = find_missing_columns(frame, columns=["col1", "col2", "col3", "col4"])
>>> cols
('col4',)
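The relationship between `find_common_columns` and `find_missing_columns` can be illustrated with a small standard-library sketch. The helper names below are hypothetical, and keeping the order of the requested columns is an assumption inferred from the examples, not a documented guarantee.

```python
from collections.abc import Sequence


def common_columns_sketch(available: Sequence[str], columns: Sequence[str]) -> tuple[str, ...]:
    """Return the requested columns that exist, in the order they were requested."""
    present = set(available)
    return tuple(col for col in columns if col in present)


def missing_columns_sketch(available: Sequence[str], columns: Sequence[str]) -> tuple[str, ...]:
    """Return the requested columns that do not exist, in the order they were requested."""
    present = set(available)
    return tuple(col for col in columns if col not in present)


available = ["col1", "col2", "col3"]  # e.g. the column names of a DataFrame
requested = ["col1", "col2", "col3", "col4"]
common = common_columns_sketch(available, requested)
missing = missing_columns_sketch(available, requested)
print(common)  # ('col1', 'col2', 'col3')
print(missing)  # ('col4',)
```

Every requested column ends up in exactly one of the two tuples.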
grizz.utils.datetime ¶
Contain utility functions for datetime and date objects.
grizz.utils.datetime.find_end_datetime ¶
find_end_datetime(
start: datetime | date,
interval: str | timedelta,
periods: int,
) -> datetime
Find the upper bound of a datetime range from its lower bound, the interval, and the number of periods.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`start` | `datetime \| date` | The lower bound of the datetime range. | required |
`interval` | `str \| timedelta` | The interval of the range periods, specified as a Python `timedelta` object or using the Polars duration string language. | required |
`periods` | `int` | The number of periods after the start. | required |
Returns:
Type | Description |
---|---|
`datetime` | The upper bound of the datetime range. |
Notes
`interval` is created according to the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 calendar day)
- 1w (1 calendar week)
Example usage:
>>> from datetime import timedelta, datetime, timezone
>>> from grizz.utils.datetime import find_end_datetime
>>> find_end_datetime(
... start=datetime(year=2020, month=5, day=12, hour=4, tzinfo=timezone.utc),
... interval=timedelta(hours=1),
... periods=42,
... )
datetime.datetime(2020, 5, 13, 22, 0, tzinfo=datetime.timezone.utc)
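For a `timedelta` interval, the result shown above is consistent with plain datetime arithmetic. The following is a minimal sketch under that assumption; `end_datetime_sketch` is a hypothetical name and the sketch does not handle string intervals.

```python
from datetime import datetime, timedelta, timezone


def end_datetime_sketch(start: datetime, interval: timedelta, periods: int) -> datetime:
    """Return the datetime reached after ``periods`` intervals from ``start``."""
    return start + interval * periods


end = end_datetime_sketch(
    start=datetime(year=2020, month=5, day=12, hour=4, tzinfo=timezone.utc),
    interval=timedelta(hours=1),
    periods=42,
)
print(end)  # 2020-05-13 22:00:00+00:00
```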
grizz.utils.datetime.to_datetime ¶
to_datetime(dt: datetime | date) -> datetime
Convert a `date` object to a `datetime` object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`dt` | `datetime \| date` | The `date` or `datetime` object to convert. | required |
Returns:
Type | Description |
---|---|
`datetime` | The `datetime` object. |
Example usage:
>>> from datetime import datetime, date, timezone
>>> from grizz.utils.datetime import to_datetime
>>> to_datetime(datetime(year=2020, month=5, day=12, hour=4, tzinfo=timezone.utc))
datetime.datetime(2020, 5, 12, 4, 0, tzinfo=datetime.timezone.utc)
>>> to_datetime(date(year=2020, month=5, day=12))
datetime.datetime(2020, 5, 12, 0, 0, tzinfo=datetime.timezone.utc)
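The behavior shown in the example can be reproduced with the standard library alone. This is a sketch, not the library's implementation: `to_datetime_sketch` is a hypothetical name, and attaching UTC to plain dates is inferred from the example output.

```python
from __future__ import annotations

from datetime import date, datetime, timezone


def to_datetime_sketch(dt: datetime | date) -> datetime:
    """Return datetimes unchanged and promote dates to midnight UTC."""
    # datetime is a subclass of date, so it must be tested first.
    if isinstance(dt, datetime):
        return dt
    return datetime(year=dt.year, month=dt.month, day=dt.day, tzinfo=timezone.utc)


unchanged = to_datetime_sketch(datetime(2020, 5, 12, 4, tzinfo=timezone.utc))
promoted = to_datetime_sketch(date(2020, 5, 12))
print(promoted)  # 2020-05-12 00:00:00+00:00
```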
grizz.utils.factory ¶
Contain a function to instantiate an object from its configuration.
grizz.utils.factory.setup_object ¶
setup_object(obj_or_config: T | dict) -> T
Set up an object from its configuration.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`obj_or_config` | `T \| dict` | The object or its configuration. | required |
Returns:
Type | Description |
---|---|
`T` | The instantiated object. |
Example usage:
>>> from grizz.utils.factory import setup_object
>>> obj = setup_object({"_target_": "collections.deque", "iterable": [1, 2, 1, 3]})
>>> obj
deque([1, 2, 1, 3])
>>> setup_object(obj) # Do nothing because the object is already instantiated
deque([1, 2, 1, 3])
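To make the `_target_` convention concrete, here is a minimal sketch of how such a configuration could be resolved with `importlib`. The function name is hypothetical and the real implementation may rely on a dedicated factory library; only the behavior visible in the example above is reproduced.

```python
from __future__ import annotations

import importlib
from typing import Any


def setup_object_sketch(obj_or_config: Any) -> Any:
    """Instantiate the object described by a config, or return the object as-is."""
    if not isinstance(obj_or_config, dict):
        return obj_or_config  # already instantiated, nothing to do
    config = dict(obj_or_config)  # copy so the caller's dict is not modified
    module_name, _, attr = config.pop("_target_").rpartition(".")
    factory = getattr(importlib.import_module(module_name), attr)
    return factory(**config)  # remaining keys are passed as keyword arguments


obj = setup_object_sketch({"_target_": "collections.deque", "iterable": [1, 2, 1, 3]})
print(obj)  # deque([1, 2, 1, 3])
same = setup_object_sketch(obj)  # returned unchanged
```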
grizz.utils.format ¶
Contain utility functions to format strings.
grizz.utils.format.human_byte ¶
human_byte(size: float, decimal: int = 2) -> str
Return a human-readable string representation of byte sizes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`size` | `float` | The number of bytes. | required |
`decimal` | `int` | The number of decimal digits. | `2` |
Returns:
Type | Description |
---|---|
`str` | The human-readable string representation of byte sizes. |
Example usage:
>>> from grizz.utils.format import human_byte
>>> human_byte(2)
'2.00 B'
>>> human_byte(2048)
'2.00 KB'
>>> human_byte(2097152)
'2.00 MB'
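The outputs above are consistent with dividing by 1024 until the value fits the current unit. The sketch below assumes that scheme and a unit list ending at `PB`; both are assumptions, and `human_byte_sketch` is a hypothetical name.

```python
def human_byte_sketch(size: float, decimal: int = 2) -> str:
    """Format a byte count using binary (1024-based) steps."""
    units = ("B", "KB", "MB", "GB", "TB", "PB")
    value = float(size)
    for unit in units[:-1]:
        if abs(value) < 1024.0:
            return f"{value:.{decimal}f} {unit}"
        value /= 1024.0
    # Anything larger than the second-to-last unit is reported in the last unit.
    return f"{value:.{decimal}f} {units[-1]}"


small = human_byte_sketch(2)
medium = human_byte_sketch(2048)
large = human_byte_sketch(2097152, decimal=1)
print(small, medium, large)  # 2.00 B 2.00 KB 2.0 MB
```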
grizz.utils.format.str_col_diff ¶
str_col_diff(orig: int, final: int) -> str
Return a string that indicates the difference of columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`orig` | `int` | The original number of columns. | required |
`final` | `int` | The final number of columns. | required |
Returns:
Type | Description |
---|---|
`str` | The generated string with the difference of columns. |
Example usage:
>>> from grizz.utils.format import str_col_diff
>>> str_col_diff(100, 10)
'90/100 (90.0000 %) columns have been removed'
>>> str_col_diff(100, 99)
'1/100 (1.0000 %) column has been removed'
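The message format, including the singular/plural switch, can be sketched as follows. `str_diff_sketch` is a hypothetical helper, and the handling of `orig == 0` is an assumption since the source does not describe that case.

```python
def str_diff_sketch(orig: int, final: int, singular: str = "column", plural: str = "columns") -> str:
    """Build a message describing how many items were removed."""
    removed = orig - final
    # Assumption: report NaN instead of dividing by zero when there was nothing to start with.
    pct = 100.0 * removed / orig if orig else float("nan")
    noun, verb = (singular, "has") if removed == 1 else (plural, "have")
    return f"{removed}/{orig} ({pct:.4f} %) {noun} {verb} been removed"


many = str_diff_sketch(100, 10)
one = str_diff_sketch(100, 99)
rows = str_diff_sketch(100, 10, singular="row", plural="rows")
print(many)  # 90/100 (90.0000 %) columns have been removed
print(one)  # 1/100 (1.0000 %) column has been removed
```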
grizz.utils.format.str_kwargs ¶
str_kwargs(mapping: Mapping) -> str
Return a string of the input mapping.
This function is designed to be used in `__repr__` and `__str__` methods.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`mapping` | `Mapping` | The mapping. | required |
Returns:
Type | Description |
---|---|
`str` | The generated string. |
Example usage:
>>> from grizz.utils.format import str_kwargs
>>> str_kwargs({"key1": 1})
', key1=1'
>>> str_kwargs({"key1": 1, "key2": 2})
', key1=1, key2=2'
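The leading `', '` in the output makes the result easy to append after fixed arguments in a `__repr__`. The sketch below illustrates that use; `str_kwargs_sketch` and the `Scaler` class are hypothetical names introduced only for this example.

```python
from collections.abc import Mapping


def str_kwargs_sketch(mapping: Mapping) -> str:
    """Render each item as ', key=value' so it can follow other arguments."""
    return "".join(f", {key}={value}" for key, value in mapping.items())


class Scaler:
    """Toy class that stores extra keyword arguments."""

    def __init__(self, column: str, **kwargs: object) -> None:
        self._column = column
        self._kwargs = kwargs

    def __repr__(self) -> str:
        return f"{type(self).__name__}(column={self._column}{str_kwargs_sketch(self._kwargs)})"


text = repr(Scaler("col1", factor=2, offset=1))
print(text)  # Scaler(column=col1, factor=2, offset=1)
```

An empty mapping yields an empty string, so the representation stays valid when no extra arguments are given.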
grizz.utils.format.str_row_diff ¶
str_row_diff(orig: int, final: int) -> str
Return a string that indicates the difference of rows.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`orig` | `int` | The original number of rows. | required |
`final` | `int` | The final number of rows. | required |
Returns:
Type | Description |
---|---|
`str` | The generated string with the difference of rows. |
Example usage:
>>> from grizz.utils.format import str_row_diff
>>> str_row_diff(100, 10)
'90/100 (90.0000 %) rows have been removed'
>>> str_row_diff(100, 99)
'1/100 (1.0000 %) row has been removed'
grizz.utils.imports ¶
Implement some utility functions to manage optional dependencies.
grizz.utils.imports.check_clickhouse_connect ¶
check_clickhouse_connect() -> None
Check if the `clickhouse_connect` package is installed.
Raises:
Type | Description |
---|---|
`RuntimeError` | if the `clickhouse_connect` package is not installed. |
Example usage:
>>> from grizz.utils.imports import check_clickhouse_connect
>>> check_clickhouse_connect()
grizz.utils.imports.check_pyarrow ¶
check_pyarrow() -> None
Check if the `pyarrow` package is installed.
Raises:
Type | Description |
---|---|
`RuntimeError` | if the `pyarrow` package is not installed. |
Example usage:
>>> from grizz.utils.imports import check_pyarrow
>>> check_pyarrow()
grizz.utils.imports.check_tqdm ¶
check_tqdm() -> None
Check if the `tqdm` package is installed.
Raises:
Type | Description |
---|---|
`RuntimeError` | if the `tqdm` package is not installed. |
Example usage:
>>> from grizz.utils.imports import check_tqdm
>>> check_tqdm()
grizz.utils.imports.clickhouse_connect_available ¶
clickhouse_connect_available(
fn: Callable[..., Any]
) -> Callable[..., Any]
Implement a decorator to execute a function only if the `clickhouse_connect` package is installed.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`fn` | `Callable[..., Any]` | The function to execute. | required |
Returns:
Type | Description |
---|---|
`Callable[..., Any]` | A wrapper around `fn` that executes it only if the `clickhouse_connect` package is installed. |
Example usage:
>>> from grizz.utils.imports import clickhouse_connect_available
>>> @clickhouse_connect_available
... def my_function(n: int = 0) -> int:
... return 42 + n
...
>>> my_function()
grizz.utils.imports.is_clickhouse_connect_available `cached` ¶
is_clickhouse_connect_available() -> bool
Indicate if the `clickhouse_connect` package is installed or not.
Returns:
Type | Description |
---|---|
`bool` | `True` if the `clickhouse_connect` package is installed, otherwise `False`. |
Example usage:
>>> from grizz.utils.imports import is_clickhouse_connect_available
>>> is_clickhouse_connect_available()
grizz.utils.imports.is_pyarrow_available `cached` ¶
is_pyarrow_available() -> bool
Indicate if the `pyarrow` package is installed or not.
Returns:
Type | Description |
---|---|
`bool` | `True` if the `pyarrow` package is installed, otherwise `False`. |
Example usage:
>>> from grizz.utils.imports import is_pyarrow_available
>>> is_pyarrow_available()
grizz.utils.imports.is_tqdm_available `cached` ¶
is_tqdm_available() -> bool
Indicate if the `tqdm` package is installed or not.
Returns:
Type | Description |
---|---|
`bool` | `True` if the `tqdm` package is installed, otherwise `False`. |
Example usage:
>>> from grizz.utils.imports import is_tqdm_available
>>> is_tqdm_available()
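The `is_*_available` and `check_*` helpers follow a common pattern for optional dependencies: a cached lookup plus a guard that raises. The sketch below shows one way to write that pattern with the standard library; the generic function names and the error message are assumptions, not the library's code.

```python
from functools import lru_cache
from importlib.util import find_spec


@lru_cache
def is_package_available_sketch(name: str) -> bool:
    """Return True if a top-level package can be imported; the result is cached."""
    return find_spec(name) is not None


def check_package_sketch(name: str) -> None:
    """Raise a RuntimeError if the package is not installed."""
    if not is_package_available_sketch(name):
        msg = f"'{name}' package is required but not installed."
        raise RuntimeError(msg)


check_package_sketch("json")  # standard-library module, so nothing is raised
try:
    check_package_sketch("a_package_that_is_not_installed")
except RuntimeError as exc:
    error_message = str(exc)
    print(error_message)
```

Caching avoids repeating the import-system lookup every time a guarded function is called.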
grizz.utils.imports.pyarrow_available ¶
pyarrow_available(
fn: Callable[..., Any]
) -> Callable[..., Any]
Implement a decorator to execute a function only if the `pyarrow` package is installed.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`fn` | `Callable[..., Any]` | The function to execute. | required |
Returns:
Type | Description |
---|---|
`Callable[..., Any]` | A wrapper around `fn` that executes it only if the `pyarrow` package is installed. |
Example usage:
>>> from grizz.utils.imports import pyarrow_available
>>> @pyarrow_available
... def my_function(n: int = 0) -> int:
... return 42 + n
...
>>> my_function()
grizz.utils.imports.tqdm_available ¶
tqdm_available(
fn: Callable[..., Any]
) -> Callable[..., Any]
Implement a decorator to execute a function only if the `tqdm` package is installed.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`fn` | `Callable[..., Any]` | The function to execute. | required |
Returns:
Type | Description |
---|---|
`Callable[..., Any]` | A wrapper around `fn` that executes it only if the `tqdm` package is installed. |
Example usage:
>>> from grizz.utils.imports import tqdm_available
>>> @tqdm_available
... def my_function(n: int = 0) -> int:
... return 42 + n
...
>>> my_function()
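The three `*_available` decorators share the same idea: run the wrapped function only when a package is present. A generic sketch is given below. The factory name `package_available_sketch` is hypothetical, and returning `None` when the package is missing is an assumption, since the source does not state what the wrapper returns in that case.

```python
from functools import wraps
from importlib.util import find_spec
from typing import Any, Callable


def package_available_sketch(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Create a decorator that runs the function only if ``name`` is installed."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if find_spec(name) is None:
                return None  # assumption: skip silently when the package is missing
            return fn(*args, **kwargs)

        return wrapper

    return decorator


@package_available_sketch("json")  # standard-library module, always present
def my_function(n: int = 0) -> int:
    return 42 + n


@package_available_sketch("a_package_that_is_not_installed")
def skipped_function(n: int = 0) -> int:
    return 42 + n


print(my_function(1))  # 43
print(skipped_function(1))  # None
```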
grizz.utils.interval ¶
Contain interval utility functions.
grizz.utils.interval.find_time_unit ¶
find_time_unit(interval: str) -> str
Find the time unit associated with a `polars` interval.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`interval` | `str` | The `polars` interval to analyze. | required |
Returns:
Type | Description |
---|---|
`str` | The found time unit. |
Raises:
Type | Description |
---|---|
`RuntimeError` | if no valid time unit can be found. |
Example usage:
>>> from grizz.utils.interval import find_time_unit
>>> find_time_unit("3d12h4m")
'm'
>>> find_time_unit("3y5mo")
'mo'
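Both examples return the smallest unit present in the interval. Assuming that is the rule, it can be sketched with a regular expression over the Polars duration units; the function name and the exact unit list are assumptions made for this illustration.

```python
import re

# Units ordered from smallest to largest (assumed subset of the Polars duration language).
_UNITS = ("ns", "us", "ms", "s", "m", "h", "d", "w", "mo", "q", "y")
# Two-letter units are listed before "s" and "m" so that "5mo" is not read as "5m".
_PATTERN = re.compile(r"\d+(ns|us|ms|mo|s|m|h|d|w|q|y)")


def smallest_time_unit_sketch(interval: str) -> str:
    """Return the smallest time unit that appears in the interval string."""
    found = _PATTERN.findall(interval)
    if not found:
        msg = f"could not find a valid time unit in {interval!r}"
        raise RuntimeError(msg)
    return min(found, key=_UNITS.index)


unit_minutes = smallest_time_unit_sketch("3d12h4m")
unit_months = smallest_time_unit_sketch("3y5mo")
print(unit_minutes, unit_months)  # m mo
```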
grizz.utils.interval.interval_to_strftime_format ¶
interval_to_strftime_format(interval: str) -> str
Return the default strftime format for a given interval.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`interval` | `str` | The `polars` interval to analyze. | required |
Returns:
Type | Description |
---|---|
`str` | The default strftime format. |
Example usage:
>>> from grizz.utils.interval import interval_to_strftime_format
>>> interval_to_strftime_format("1h")
'%Y-%m-%d %H:%M'
>>> interval_to_strftime_format("3y1mo")
'%Y-%m'
grizz.utils.interval.interval_to_timedelta ¶
interval_to_timedelta(interval: str) -> timedelta
Convert an interval to a `timedelta` object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`interval` | `str` | The input interval. | required |
Returns:
Type | Description |
---|---|
`timedelta` | The `timedelta` object generated from the interval. |
Example usage:
>>> from grizz.utils.interval import interval_to_timedelta
>>> interval_to_timedelta("5d1h42m")
datetime.timedelta(days=5, seconds=6120)
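The conversion can be sketched by summing one `timedelta` per component. The version below only supports fixed-length units (microseconds to weeks) and rejects anything else with a `ValueError`; that restriction, the error type, and the function name are assumptions of this sketch.

```python
import re
from datetime import timedelta

# Map each supported unit to the matching timedelta keyword argument.
_UNIT_TO_KWARG = {
    "us": "microseconds",
    "ms": "milliseconds",
    "s": "seconds",
    "m": "minutes",
    "h": "hours",
    "d": "days",
    "w": "weeks",
}
_COMPONENT = re.compile(r"(\d+)(us|ms|s|m|h|d|w)")
_FULL = re.compile(r"(?:\d+(?:us|ms|s|m|h|d|w))+")


def interval_to_timedelta_sketch(interval: str) -> timedelta:
    """Sum the components of an interval such as '5d1h42m'."""
    if not _FULL.fullmatch(interval):
        msg = f"unsupported interval: {interval!r}"
        raise ValueError(msg)
    total = timedelta()
    for value, unit in _COMPONENT.findall(interval):
        total += timedelta(**{_UNIT_TO_KWARG[unit]: int(value)})
    return total


delta = interval_to_timedelta_sketch("5d1h42m")
print(repr(delta))  # datetime.timedelta(days=5, seconds=6120)
```

Calendar units such as months have no fixed length, which is why the sketch refuses them instead of guessing.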
grizz.utils.interval.time_unit_to_strftime_format ¶
time_unit_to_strftime_format(time_unit: str) -> str
Return the default strftime format for a given time unit.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`time_unit` | `str` | The time unit. | required |
Returns:
Type | Description |
---|---|
`str` | The default strftime format. |
Example usage:
>>> from grizz.utils.interval import time_unit_to_strftime_format
>>> time_unit_to_strftime_format("h")
'%Y-%m-%d %H:%M'
>>> time_unit_to_strftime_format("mo")
'%Y-%m'
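To show what the two formats above produce, the following standard-library snippet applies them to a sample datetime. Only the two format strings come from the examples; the sample value is arbitrary.

```python
from datetime import datetime

moment = datetime(year=2020, month=5, day=12, hour=4, minute=30)

# Format returned for the "h" time unit in the example above.
hourly_label = moment.strftime("%Y-%m-%d %H:%M")
# Format returned for the "mo" time unit in the example above.
monthly_label = moment.strftime("%Y-%m")

print(hourly_label)  # 2020-05-12 04:30
print(monthly_label)  # 2020-05
```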
grizz.utils.noop ¶
Contain no-op functions.
grizz.utils.noop.tqdm ¶
tqdm(
iterable: Iterable, *args: Any, **kwargs: Any
) -> Iterable
Implement a no-op tqdm progressbar that is used when tqdm is not installed.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`iterable` | `Iterable` | Iterable to decorate with a progressbar. | required |
`*args` | `Any` | Positional arbitrary arguments. | `()` |
`**kwargs` | `Any` | Keyword arbitrary arguments. | `{}` |
Returns:
Type | Description |
---|---|
`Iterable` | The input iterable. |
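A no-op like this is typically used as an import fallback, so calling code looks the same whether or not `tqdm` is installed. The sketch below shows that pattern with a locally defined fallback; it illustrates the idea and is not necessarily how the library wires it internally.

```python
from collections.abc import Iterable
from typing import Any

try:
    from tqdm import tqdm  # real progress bar when the package is installed
except ImportError:

    def tqdm(iterable: Iterable, *args: Any, **kwargs: Any) -> Iterable:
        """Fallback that ignores all options and returns the iterable unchanged."""
        return iterable


# The loop is written once and works with either implementation.
total = sum(value for value in tqdm(range(5), desc="summing"))
print(total)  # 10
```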
grizz.utils.path ¶
Contain utility functions to manage paths.
grizz.utils.path.find_files ¶
find_files(
path: Path | str,
filter_fn: Callable[[Path], bool],
recursive: bool = True,
) -> list[Path]
Find the paths of all the files in a given path that are accepted by the filter function.
This function does not check if a path is a symbolic link, so be careful if you are using a path with symbolic links.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`path` | `Path \| str` | The path where to look for the files. | required |
`filter_fn` | `Callable[[Path], bool]` | The path filtering function. The function should return `True` for the paths to keep and `False` for the paths to ignore. | required |
`recursive` | `bool` | Indicate if it should also check the sub-folders. | `True` |
Returns:
Type | Description |
---|---|
`list[Path]` | The list of paths of the files accepted by the filter function. |
Example usage:
>>> from pathlib import Path
>>> from grizz.utils.path import find_files
>>> find_files(Path("something"), filter_fn=lambda path: path.name.endswith(".txt"))
[...]
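The interplay between `filter_fn` and `recursive` can be sketched with `pathlib`. The function below is a hypothetical stand-in, and sorting the result is an assumption made to keep the output deterministic.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from typing import Callable


def find_files_sketch(
    path: Path | str, filter_fn: Callable[[Path], bool], recursive: bool = True
) -> list[Path]:
    """Collect the files under ``path`` that are accepted by ``filter_fn``."""
    root = Path(path)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates if p.is_file() and filter_fn(p))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").touch()
    (root / "sub" / "b.txt").touch()
    (root / "sub" / "c.parquet").touch()

    def is_txt(p: Path) -> bool:
        return p.suffix == ".txt"

    all_txt = [p.name for p in find_files_sketch(root, is_txt)]
    top_level_txt = [p.name for p in find_files_sketch(root, is_txt, recursive=False)]

print(all_txt)  # ['a.txt', 'b.txt']
print(top_level_txt)  # ['a.txt']
```

A parquet-specific helper can then be expressed as the same call with a filter on the `.parquet` suffix.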
grizz.utils.path.find_parquet_files ¶
find_parquet_files(
path: Path | str, recursive: bool = True
) -> list[Path]
Find the path of all the parquet files in a given path.
This function does not check if a path is a symbolic link so be careful if you are using a path with symbolic links.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`path` | `Path \| str` | The path where to look for the parquet files. | required |
`recursive` | `bool` | Specifies if it should also check the sub-folders. | `True` |
Returns:
Type | Description |
---|---|
`list[Path]` | The list of parquet files. |
Example usage:
>>> from pathlib import Path
>>> from grizz.utils.path import find_parquet_files
>>> find_parquet_files(Path("something"))
[...]
grizz.utils.path.human_file_size ¶
human_file_size(path: Path | str, decimal: int = 2) -> str
Get a human-readable representation of a file size.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`path` | `Path \| str` | The path to the file. | required |
`decimal` | `int` | The number of decimal digits. | `2` |
Returns:
Type | Description |
---|---|
`str` | The file size in a human-readable format. |
Example usage:
>>> from grizz.utils.path import human_file_size
>>> human_file_size("README.md")
'...B'
grizz.utils.path.sanitize_path ¶
sanitize_path(path: Path | str) -> Path
Sanitize a given path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`path` | `Path \| str` | The path to sanitize. | required |
Returns:
Type | Description |
---|---|
`Path` | The sanitized path. |
Example usage:
>>> from pathlib import Path
>>> from grizz.utils.path import sanitize_path
>>> sanitize_path("something")
PosixPath('.../something')
>>> sanitize_path("")
PosixPath('...')
>>> sanitize_path(Path("something"))
PosixPath('.../something')
>>> sanitize_path(Path("something/./../"))
PosixPath('...')
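The outputs above are consistent with converting the input to an absolute, normalized `Path`. A minimal sketch under that assumption follows; the function name is hypothetical, and expanding `~` is an extra assumption not shown in the examples.

```python
from __future__ import annotations

from pathlib import Path


def sanitize_path_sketch(path: Path | str) -> Path:
    """Return an absolute path with '~', '.' and '..' components resolved."""
    return Path(path).expanduser().resolve()


here = Path.cwd().resolve()
from_str = sanitize_path_sketch("something")
from_empty = sanitize_path_sketch("")
from_dots = sanitize_path_sketch(Path("something/./../"))
print(from_str.is_absolute())  # True
```

Relative inputs are resolved against the current working directory, which is why the documented outputs start with an elided prefix.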