grizz.ingestor ¶
Contain data ingestors.
grizz.ingestor.BaseIngestor ¶
Bases: ABC
Define the base class to implement a DataFrame ingestor.
Example usage:
>>> from grizz.ingestor import ParquetIngestor
>>> ingestor = ParquetIngestor(path="/path/to/frame.parquet")
>>> ingestor
ParquetIngestor(path=/path/to/frame.parquet)
>>> frame = ingestor.ingest() # doctest: +SKIP
grizz.ingestor.BaseIngestor.ingest ¶
ingest() -> DataFrame
Ingest a DataFrame.
Returns:
Type | Description |
---|---|
DataFrame | The ingested DataFrame. |
Example usage:
>>> from grizz.ingestor import ParquetIngestor
>>> ingestor = ParquetIngestor(path="/path/to/frame.parquet")
>>> frame = ingestor.ingest() # doctest: +SKIP
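The abstract-base-class pattern described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not grizz code: `BaseIngestorSketch` and `CsvTextIngestor` are hypothetical stand-ins, and rows are returned as a list of dictionaries where the real `ingest()` returns a DataFrame.

```python
from __future__ import annotations

import csv
import io
from abc import ABC, abstractmethod


class BaseIngestorSketch(ABC):
    """Hypothetical stand-in for grizz.ingestor.BaseIngestor (stdlib only)."""

    @abstractmethod
    def ingest(self) -> list[dict[str, str]]:
        """Return the ingested rows (the real method returns a DataFrame)."""


class CsvTextIngestor(BaseIngestorSketch):
    """Hypothetical ingestor that parses CSV text held in memory."""

    def __init__(self, text: str) -> None:
        self._text = text

    def __repr__(self) -> str:
        return f"{type(self).__name__}(chars={len(self._text)})"

    def ingest(self) -> list[dict[str, str]]:
        # Each CSV row becomes a dictionary keyed by the header line.
        return list(csv.DictReader(io.StringIO(self._text)))


ingestor = CsvTextIngestor("col1,col2\n1,a\n2,b\n")
rows = ingestor.ingest()
```

A subclass only has to implement `ingest()`; instantiating the abstract base directly raises a `TypeError`.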
grizz.ingestor.ClickHouseIngestor ¶
Bases: BaseIngestor
Implement a ClickHouse DataFrame ingestor.
This ingestor requires clickhouse_connect and pyarrow.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The query to get the data. | required |
client | Client \| dict | The clickhouse client or its configuration. Please check the documentation of | required |
Example usage:
>>> from grizz.ingestor import ClickHouseIngestor
>>> import clickhouse_connect
>>> client = clickhouse_connect.get_client() # doctest: +SKIP
>>> ingestor = ClickHouseIngestor(query="", client=client) # doctest: +SKIP
>>> frame = ingestor.ingest() # doctest: +SKIP
grizz.ingestor.CsvIngestor ¶
Bases: BaseIngestor
Implement a CSV DataFrame ingestor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Path \| str | The path to the CSV file to ingest. | required |
**kwargs | Any | Additional keyword arguments for | {} |
Example usage:
>>> from grizz.ingestor import CsvIngestor
>>> ingestor = CsvIngestor(path="/path/to/frame.csv")
>>> ingestor
CsvIngestor(path=/path/to/frame.csv)
>>> frame = ingestor.ingest() # doctest: +SKIP
grizz.ingestor.Ingestor ¶
Bases: BaseIngestor
Implement a simple DataFrame ingestor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
frame | DataFrame | The DataFrame to ingest. | required |
Example usage:
>>> import polars as pl
>>> from grizz.ingestor import Ingestor
>>> ingestor = Ingestor(
...     frame=pl.DataFrame(
...         {
...             "col1": [1, 2, 3, 4, 5],
...             "col2": ["1", "2", "3", "4", "5"],
...             "col3": ["1", "2", "3", "4", "5"],
...             "col4": ["a", "b", "c", "d", "e"],
...         }
...     )
... )
>>> ingestor
Ingestor(shape=(5, 4))
>>> frame = ingestor.ingest()
grizz.ingestor.ParquetIngestor ¶
Bases: BaseIngestor
Implement a Parquet DataFrame ingestor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Path \| str | The path to the Parquet file to ingest. | required |
**kwargs | Any | Additional keyword arguments for | {} |
Example usage:
>>> from grizz.ingestor import ParquetIngestor
>>> ingestor = ParquetIngestor(path="/path/to/frame.parquet")
>>> ingestor
ParquetIngestor(path=/path/to/frame.parquet)
>>> frame = ingestor.ingest() # doctest: +SKIP
grizz.ingestor.TransformIngestor ¶
Bases: BaseIngestor
Implement an ingestor that also transforms the DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ingestor | BaseIngestor \| dict | The base ingestor. | required |
transformer | BaseTransformer \| dict | The transformer or its configuration. | required |
Example usage:
>>> import polars as pl
>>> from grizz.ingestor import TransformIngestor, Ingestor
>>> from grizz.transformer import Cast
>>> ingestor = TransformIngestor(
...     ingestor=Ingestor(
...         pl.DataFrame(
...             {
...                 "col1": ["1", "2", "3", "4", "5"],
...                 "col2": ["a", "b", "c", "d", "e"],
...                 "col3": [1.2, 2.2, 3.2, 4.2, 5.2],
...             }
...         )
...     ),
...     transformer=Cast(columns=["col1", "col3"], dtype=pl.Float32),
... )
>>> ingestor
TransformIngestor(
  (ingestor): Ingestor(shape=(5, 3))
  (transformer): CastTransformer(columns=('col1', 'col3'), dtype=Float32, ignore_missing=False)
)
>>> frame = ingestor.ingest()
>>> frame
shape: (5, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ f32  ┆ str  ┆ f32  │
╞══════╪══════╪══════╡
│ 1.0  ┆ a    ┆ 1.2  │
│ 2.0  ┆ b    ┆ 2.2  │
│ 3.0  ┆ c    ┆ 3.2  │
│ 4.0  ┆ d    ┆ 4.2  │
│ 5.0  ┆ e    ┆ 5.2  │
└──────┴──────┴──────┘
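The ingest-then-transform composition can be pictured with a small stdlib-only sketch. The names `RowsIngestor`, `TransformIngestorSketch`, and `cast_col1_to_float` are hypothetical stand-ins introduced for illustration; the sketch assumes the transformer is a plain callable and uses lists of dictionaries in place of DataFrames, so it is not how grizz itself is implemented.

```python
from __future__ import annotations

from typing import Callable

Rows = list[dict[str, object]]


class RowsIngestor:
    """Hypothetical ingestor that returns rows already held in memory."""

    def __init__(self, rows: Rows) -> None:
        self._rows = rows

    def ingest(self) -> Rows:
        # Copy each row so callers cannot mutate the stored data.
        return [dict(row) for row in self._rows]


class TransformIngestorSketch:
    """Hypothetical wrapper: ingest with the inner ingestor, then transform."""

    def __init__(self, ingestor: RowsIngestor, transformer: Callable[[Rows], Rows]) -> None:
        self._ingestor = ingestor
        self._transformer = transformer

    def ingest(self) -> Rows:
        return self._transformer(self._ingestor.ingest())


def cast_col1_to_float(rows: Rows) -> Rows:
    """Stand-in transformer: cast the values of col1 to float."""
    return [{**row, "col1": float(row["col1"])} for row in rows]


ingestor = TransformIngestorSketch(
    RowsIngestor([{"col1": "1", "col2": "a"}, {"col1": "2", "col2": "b"}]),
    cast_col1_to_float,
)
rows = ingestor.ingest()
```

Because the wrapper exposes the same `ingest()` method as the ingestor it wraps, it can be used anywhere a plain ingestor is expected.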
grizz.ingestor.is_ingestor_config ¶
is_ingestor_config(config: dict) -> bool
Indicate if the input configuration is a configuration for a BaseIngestor.
This function only checks if the value of the key _target_
is valid. It does not check the other values. If _target_
indicates a function, the returned type hint is used to check
the class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config | dict | The configuration to check. | required |
Returns:
Type | Description |
---|---|
bool | True if the input configuration is a configuration for a BaseIngestor object, otherwise False. |
Example usage:
>>> from grizz.ingestor import is_ingestor_config
>>> is_ingestor_config(
...     {"_target_": "grizz.ingestor.CsvIngestor", "path": "/path/to/data.csv"}
... )
True
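A check of this kind can be sketched with the standard library. The helper names `resolve_target` and `is_config_for` are hypothetical, and the logic is only an assumption about how a `_target_` check might work (resolve the dotted path, then test the class or the function's return type hint); it is not taken from the grizz source.

```python
from __future__ import annotations

import importlib
import inspect
from typing import Any, get_type_hints


def resolve_target(path: str) -> Any:
    """Import the object named by a dotted path such as 'pkg.module.Name'."""
    module_name, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def is_config_for(config: dict, base: type) -> bool:
    """Check only the '_target_' value; the other keys are ignored."""
    target = config.get("_target_")
    if not isinstance(target, str):
        return False
    try:
        obj = resolve_target(target)
    except (ImportError, AttributeError, ValueError):
        return False
    if inspect.isclass(obj):
        return issubclass(obj, base)
    if callable(obj):
        # For a function, fall back to its return type hint.
        try:
            returned = get_type_hints(obj).get("return")
        except Exception:
            return False
        return inspect.isclass(returned) and issubclass(returned, base)
    return False


ordered_ok = is_config_for({"_target_": "collections.OrderedDict"}, dict)
wrong_base = is_config_for({"_target_": "collections.Counter"}, list)
missing = is_config_for({"_target_": "no.such.module.Thing"}, dict)
```

The sketch uses standard-library classes as targets so that it runs without grizz installed; with grizz available, `base` would be `BaseIngestor`.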
grizz.ingestor.setup_ingestor ¶
setup_ingestor(
ingestor: BaseIngestor | dict,
) -> BaseIngestor
Set up an ingestor.
The ingestor is instantiated from its configuration by using the BaseIngestor factory function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ingestor | BaseIngestor \| dict | Specifies an ingestor or its configuration. | required |
Returns:
Type | Description |
---|---|
BaseIngestor | An instantiated ingestor. |
Example usage:
>>> from grizz.ingestor import setup_ingestor
>>> ingestor = setup_ingestor(
...     {"_target_": "grizz.ingestor.CsvIngestor", "path": "/path/to/data.csv"}
... )
>>> ingestor
CsvIngestor(path=/path/to/data.csv)
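The "object or configuration" convention can be illustrated with a minimal stdlib-only sketch. `setup_object` is a hypothetical helper, not the grizz factory: it assumes the configuration is a dictionary whose `_target_` key holds a dotted import path and whose remaining keys are passed as keyword arguments.

```python
from __future__ import annotations

import importlib
from fractions import Fraction
from typing import Any


def setup_object(obj_or_config: Any) -> Any:
    """Return the object unchanged, or build it from a '_target_' config."""
    if isinstance(obj_or_config, dict):
        params = dict(obj_or_config)  # copy so the caller's dict is untouched
        target = params.pop("_target_")
        module_name, _, attr = target.rpartition(".")
        factory = getattr(importlib.import_module(module_name), attr)
        return factory(**params)
    return obj_or_config


# A standard-library class is used as the target so the sketch runs anywhere.
config = {"_target_": "fractions.Fraction", "numerator": 1, "denominator": 2}
half = setup_object(config)
same = setup_object(half)  # an already-built object is passed through
```

Accepting both forms lets callers pass a ready-made object in code and a plain dictionary when the setup comes from a configuration file.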