ingestor

grizz.ingestor

Contain data ingestors.

grizz.ingestor.BaseIngestor

Bases: ABC

Define the base class to implement a DataFrame ingestor.

Example usage:

>>> from grizz.ingestor import ParquetIngestor
>>> ingestor = ParquetIngestor(path="/path/to/frame.parquet")
>>> ingestor
ParquetIngestor(path=/path/to/frame.parquet)
>>> frame = ingestor.ingest()  # doctest: +SKIP

grizz.ingestor.BaseIngestor.ingest

ingest() -> DataFrame

Ingest a DataFrame.

Returns:

DataFrame: The ingested DataFrame.

Example usage:

>>> from grizz.ingestor import ParquetIngestor
>>> ingestor = ParquetIngestor(path="/path/to/frame.parquet")
>>> frame = ingestor.ingest()  # doctest: +SKIP
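
The contract described above (an abstract base with a single ingest() method) can be sketched with the standard library alone. This is only an illustration of the pattern, not grizz's code: SketchIngestor and InMemoryCsvIngestor are hypothetical stand-ins, and a list of dicts stands in for the DataFrame so that the snippet runs without polars or grizz installed.

```python
import csv
import io
from abc import ABC, abstractmethod


class SketchIngestor(ABC):
    """Hypothetical stand-in for the ingestor contract: one abstract ingest()."""

    @abstractmethod
    def ingest(self) -> list:
        """Return the ingested data."""


class InMemoryCsvIngestor(SketchIngestor):
    """Hypothetical concrete ingestor that parses CSV text held in memory."""

    def __init__(self, text: str) -> None:
        self._text = text

    def __repr__(self) -> str:
        return f"{type(self).__name__}(chars={len(self._text)})"

    def ingest(self) -> list:
        # A list of dicts stands in for the DataFrame (assumption for this sketch).
        return list(csv.DictReader(io.StringIO(self._text)))


rows = InMemoryCsvIngestor("col1,col2\n1,a\n2,b\n").ingest()
```

Because ingest() is abstract in this sketch, instantiating SketchIngestor directly raises a TypeError; only subclasses that implement ingest() can be created.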

grizz.ingestor.ClickHouseIngestor

Bases: BaseIngestor

Implement a ClickHouse DataFrame ingestor.

This ingestor requires clickhouse_connect and pyarrow.

Parameters:

query (str, required): The query to get the data.

client (Client | dict, required): The ClickHouse client or its configuration. See the documentation of clickhouse_connect.get_client for more information.

Example usage:

>>> from grizz.ingestor import ClickHouseIngestor
>>> import clickhouse_connect
>>> client = clickhouse_connect.get_client()  # doctest: +SKIP
>>> ingestor = ClickHouseIngestor(query="", client=client)  # doctest: +SKIP
>>> frame = ingestor.ingest()  # doctest: +SKIP

grizz.ingestor.CsvIngestor

Bases: BaseIngestor

Implement a CSV DataFrame ingestor.

Parameters:

path (Path | str, required): The path to the CSV file to ingest.

**kwargs (Any, default: {}): Additional keyword arguments for polars.read_csv.

Example usage:

>>> from grizz.ingestor import CsvIngestor
>>> ingestor = CsvIngestor(path="/path/to/frame.csv")
>>> ingestor
CsvIngestor(path=/path/to/frame.csv)
>>> frame = ingestor.ingest()  # doctest: +SKIP

grizz.ingestor.Ingestor

Bases: BaseIngestor

Implement a simple DataFrame ingestor.

Parameters:

frame (DataFrame, required): The DataFrame to ingest.

Example usage:

>>> import polars as pl
>>> from grizz.ingestor import Ingestor
>>> ingestor = Ingestor(
...     frame=pl.DataFrame(
...         {
...             "col1": [1, 2, 3, 4, 5],
...             "col2": ["1", "2", "3", "4", "5"],
...             "col3": ["1", "2", "3", "4", "5"],
...             "col4": ["a", "b", "c", "d", "e"],
...         }
...     )
... )
>>> ingestor
Ingestor(shape=(5, 4))
>>> frame = ingestor.ingest()

grizz.ingestor.ParquetIngestor

Bases: BaseIngestor

Implement a Parquet DataFrame ingestor.

Parameters:

path (Path | str, required): The path to the Parquet file to ingest.

**kwargs (Any, default: {}): Additional keyword arguments for polars.read_parquet.

Example usage:

>>> from grizz.ingestor import ParquetIngestor
>>> ingestor = ParquetIngestor(path="/path/to/frame.parquet")
>>> ingestor
ParquetIngestor(path=/path/to/frame.parquet)
>>> frame = ingestor.ingest()  # doctest: +SKIP

grizz.ingestor.TransformIngestor

Bases: BaseIngestor

Implement an ingestor that also transforms the DataFrame.

Parameters:

ingestor (BaseIngestor | dict, required): The base ingestor or its configuration.

transformer (BaseTransformer | dict, required): The polars.DataFrame transformer or its configuration.

Example usage:

>>> import polars as pl
>>> from grizz.ingestor import TransformIngestor, Ingestor
>>> from grizz.transformer import Cast
>>> ingestor = TransformIngestor(
...     ingestor=Ingestor(
...         pl.DataFrame(
...             {
...                 "col1": ["1", "2", "3", "4", "5"],
...                 "col2": ["a", "b", "c", "d", "e"],
...                 "col3": [1.2, 2.2, 3.2, 4.2, 5.2],
...             }
...         )
...     ),
...     transformer=Cast(columns=["col1", "col3"], dtype=pl.Float32),
... )
>>> ingestor
TransformIngestor(
  (ingestor): Ingestor(shape=(5, 3))
  (transformer): CastTransformer(columns=('col1', 'col3'), dtype=Float32, ignore_missing=False)
)
>>> frame = ingestor.ingest()
>>> frame
shape: (5, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ f32  ┆ str  ┆ f32  │
╞══════╪══════╪══════╡
│ 1.0  ┆ a    ┆ 1.2  │
│ 2.0  ┆ b    ┆ 2.2  │
│ 3.0  ┆ c    ┆ 3.2  │
│ 4.0  ┆ d    ┆ 4.2  │
│ 5.0  ┆ e    ┆ 5.2  │
└──────┴──────┴──────┘

grizz.ingestor.is_ingestor_config

is_ingestor_config(config: dict) -> bool

Indicate if the input configuration is a configuration for a BaseIngestor.

This function only checks whether the value of the key _target_ is valid. It does not check the other values. If _target_ points to a function, the function's return type hint is used to check the class.

Parameters:

config (dict, required): The configuration to check.

Returns:

bool: True if the input configuration is a configuration for a BaseIngestor object.

Example usage:

>>> from grizz.ingestor import is_ingestor_config
>>> is_ingestor_config(
...     {"_target_": "grizz.ingestor.CsvIngestor", "path": "/path/to/data.csv"}
... )
True
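
The check described above (resolve _target_, then compare either the class itself or a function's return type hint) can be sketched with the standard library. This is a minimal illustration under stated assumptions, not grizz's implementation: resolve_target, target_produces, and is_config_for are hypothetical names, and _target_ is assumed to be a dotted import path.

```python
import importlib
import inspect
from typing import Any, get_type_hints


def resolve_target(path: str) -> Any:
    """Import the object named by a dotted path such as 'collections.OrderedDict'."""
    module_name, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def target_produces(target: Any, cls: type) -> bool:
    """Return True if calling ``target`` is expected to yield an instance of ``cls``."""
    if not inspect.isclass(target):
        if not callable(target):
            return False
        try:
            # For a function, fall back to its return type hint.
            target = get_type_hints(target).get("return")
        except (TypeError, NameError):
            return False
    try:
        return issubclass(target, cls)
    except TypeError:
        # Missing hint (None) or a hint that is not a plain class.
        return False


def is_config_for(config: dict, cls: type) -> bool:
    """Check only the '_target_' key; all other keys are ignored."""
    target = config.get("_target_")
    if not isinstance(target, str):
        return False
    try:
        return target_produces(resolve_target(target), cls)
    except (ImportError, AttributeError, ValueError):
        return False
```

For instance, is_config_for({"_target_": "collections.OrderedDict"}, dict) holds because OrderedDict subclasses dict, while a configuration without a _target_ key is rejected regardless of its other values.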

grizz.ingestor.setup_ingestor

setup_ingestor(
    ingestor: BaseIngestor | dict,
) -> BaseIngestor

Set up an ingestor.

The ingestor is instantiated from its configuration by using the BaseIngestor factory function.

Parameters:

ingestor (BaseIngestor | dict, required): The ingestor or its configuration.

Returns:

BaseIngestor: An instantiated ingestor.

Example usage:

>>> from grizz.ingestor import setup_ingestor
>>> ingestor = setup_ingestor(
...     {"_target_": "grizz.ingestor.CsvIngestor", "path": "/path/to/data.csv"}
... )
>>> ingestor
CsvIngestor(path=/path/to/data.csv)
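
The "object or configuration" pattern used by setup_ingestor can be sketched as follows. This is an illustration with hypothetical helper names (instantiate, setup_object), not grizz's factory: it assumes that _target_ is a dotted import path, that the remaining keys are passed as keyword arguments, and that an already-instantiated object is returned unchanged.

```python
import importlib
from typing import Any


def instantiate(config: dict) -> Any:
    """Build an object from a config whose '_target_' names the callable to use."""
    kwargs = dict(config)  # copy so the caller's configuration is not mutated
    module_name, _, attr = kwargs.pop("_target_").rpartition(".")
    target = getattr(importlib.import_module(module_name), attr)
    return target(**kwargs)


def setup_object(obj_or_config: Any) -> Any:
    """Instantiate from a dict configuration, or pass an existing object through."""
    if isinstance(obj_or_config, dict):
        return instantiate(obj_or_config)
    return obj_or_config


# Standard-library target used only so the sketch runs without grizz installed.
half = setup_object({"_target_": "fractions.Fraction", "numerator": 1, "denominator": 2})
```

Accepting both forms lets callers pass either a ready-made object or a plain dictionary loaded from a configuration file, which is what the BaseIngestor | dict type in the signature above suggests.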