iden.dataset

Contain dataset implementations.

iden.dataset.BaseDataset

Bases: ABC, Generic[T]

Define the base class to implement a dataset.

Note that this dataset class is very different from the PyTorch dataset class because it has a different goal: one of its goals is to help organize and manage shards.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import VanillaDataset
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 [1, 2, 3], uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     dataset = VanillaDataset(
...         uri=Path(tmpdir).joinpath("uri").as_uri(), shards=shards, assets=assets
...     )
...     dataset
...
VanillaDataset(
  (uri): file:///.../uri
  (shards): ShardDict(
      (uri): file:///.../uri_shards
      (shards):
        (train): ShardTuple(
            (uri): file:///.../uri_train
            (shards):
              (0): JsonShard(uri=file:///.../shard/uri1)
              (1): JsonShard(uri=file:///.../shard/uri2)
          )
        (val): ShardTuple(
            (uri): file:///.../uri_val
            (shards):
          )
    )
  (assets): ShardDict(
      (uri): file:///.../uri_assets
      (shards):
        (stats): JsonShard(uri=file:///.../uri_stats)
    )
)

iden.dataset.BaseDataset.equal abstractmethod

equal(other: Any, equal_nan: bool = False) -> bool

Indicate if two datasets are equal or not.

Parameters:

Name Type Description Default
other Any

The object to compare with.

required
equal_nan bool

If True, then two NaNs will be considered equal.

False

Returns:

Type Description
bool

True if the two datasets are equal, otherwise False.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import VanillaDataset
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 [1, 2, 3], uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     dataset1 = VanillaDataset(
...         uri=Path(tmpdir).joinpath("uri").as_uri(), shards=shards, assets=assets
...     )
...     dataset2 = VanillaDataset(
...         uri=Path(tmpdir).joinpath("uri2").as_uri(), shards=shards, assets=assets
...     )
...     dataset1.equal(dataset2)
...
False
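The equal_nan flag exists because NaN never compares equal to itself under IEEE 754 rules. The sketch below illustrates only that comparison rule with a hypothetical helper named values_equal; it is not iden's implementation, and how equal() walks shards and assets is not shown in this page.

```python
import math

nan = float("nan")

# Under IEEE 754, NaN is not equal to anything, including itself.
print(nan == nan)  # False


def values_equal(a: float, b: float, equal_nan: bool = False) -> bool:
    """Hypothetical helper (not part of iden) mirroring the equal_nan idea."""
    if equal_nan and math.isnan(a) and math.isnan(b):
        # Treat two NaNs as equal only when the caller opts in.
        return True
    return a == b


print(values_equal(nan, nan))                  # False
print(values_equal(nan, nan, equal_nan=True))  # True
```

With the default equal_nan=False, two datasets whose data contains NaN at the same positions would be expected to compare as different, which is why the flag is worth setting when NaN is a legitimate value in the data.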

iden.dataset.BaseDataset.get_asset abstractmethod

get_asset(asset_id: str) -> Any

Get a data asset from this sharded dataset.

This method is useful to access data variables/parameters that are not available before the data is loaded/preprocessed.

Parameters:

Name Type Description Default
asset_id str

The asset ID used to find the asset.

required

Returns:

Type Description
Any

The asset.

Raises:

Type Description
AssetNotFoundError

if the asset does not exist.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import VanillaDataset
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 {"mean": 42}, uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     dataset = VanillaDataset(
...         uri=Path(tmpdir).joinpath("uri").as_uri(), shards=shards, assets=assets
...     )
...     dataset.get_asset("stats").get_data()
...
{'mean': 42}

iden.dataset.BaseDataset.get_num_shards abstractmethod

get_num_shards(split: str) -> int

Get the number of shards for a given split.

Returns:

Type Description
int

The number of shards in the dataset for a given split.

Raises:

Type Description
SplitNotFoundError

if the split does not exist.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import VanillaDataset
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 {"mean": 42}, uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     dataset = VanillaDataset(
...         uri=Path(tmpdir).joinpath("uri").as_uri(), shards=shards, assets=assets
...     )
...     dataset.get_num_shards("train")
...     dataset.get_num_shards("val")
...
2
0

iden.dataset.BaseDataset.get_shards abstractmethod

get_shards(split: str) -> tuple[BaseShard[T], ...]

Get the shards for a given split.

Returns:

Type Description
tuple[BaseShard[T], ...]

The shards for a given split. The shards are sorted by ascending order of URI.

Raises:

Type Description
SplitNotFoundError

if the split does not exist.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import VanillaDataset
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 {"mean": 42}, uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     dataset = VanillaDataset(
...         uri=Path(tmpdir).joinpath("uri").as_uri(), shards=shards, assets=assets
...     )
...     dataset.get_shards("train")
...     dataset.get_shards("val")
...
(JsonShard(uri=file:///.../uri1), JsonShard(uri=file:///.../uri2))
()
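Because the shards are returned in ascending order of URI, shard naming can affect the order. The sketch below assumes the ordering is a plain string comparison, which this page does not confirm; the URIs are hypothetical placeholders and no iden call is involved.

```python
# Hypothetical shard URIs; assumption: ordering is plain string comparison.
uris = [
    "file:///data/shard/uri2",
    "file:///data/shard/uri10",
    "file:///data/shard/uri1",
]

# String comparison puts "uri10" before "uri2" because "1" < "2".
ordered = sorted(uris)
print(ordered)

# Zero-padding the index makes string order match numeric order.
padded = sorted(f"file:///data/shard/uri{i:03d}" for i in (2, 10, 1))
print(padded)
```

If shard order matters for a workflow, zero-padded shard names are one way to keep it predictable under that assumption.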

iden.dataset.BaseDataset.get_splits abstractmethod

get_splits() -> set[str]

Get the available dataset splits.

Returns:

Type Description
set[str]

The dataset splits.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import VanillaDataset
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 {"mean": 42}, uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     dataset = VanillaDataset(
...         uri=Path(tmpdir).joinpath("uri").as_uri(), shards=shards, assets=assets
...     )
...     sorted(dataset.get_splits())
...
['train', 'val']

iden.dataset.BaseDataset.get_uri abstractmethod

get_uri() -> str

Get the Uniform Resource Identifier (URI) of the dataset.

Returns:

Type Description
str

The dataset's URI.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import VanillaDataset
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 {"mean": 42}, uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     dataset = VanillaDataset(
...         uri=Path(tmpdir).joinpath("uri").as_uri(), shards=shards, assets=assets
...     )
...     dataset.get_uri()
...
file:///.../uri

iden.dataset.BaseDataset.has_asset abstractmethod

has_asset(asset_id: str) -> bool

Indicate if the asset exists or not.

Parameters:

Name Type Description Default
asset_id str

The asset ID used to find the asset.

required

Returns:

Type Description
bool

True if the asset exists, otherwise False.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import VanillaDataset
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 {"mean": 42}, uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     dataset = VanillaDataset(
...         uri=Path(tmpdir).joinpath("uri").as_uri(), shards=shards, assets=assets
...     )
...     dataset.has_asset("stats")
...     dataset.has_asset("missing")
...
True
False

iden.dataset.BaseDataset.has_split abstractmethod

has_split(split: str) -> bool

Indicate if a dataset split exists or not.

Returns:

Type Description
bool

True if the split exists, otherwise False.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import VanillaDataset
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 {"mean": 42}, uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     dataset = VanillaDataset(
...         uri=Path(tmpdir).joinpath("uri").as_uri(), shards=shards, assets=assets
...     )
...     dataset.has_split("train")
...     dataset.has_split("missing")
...
True
False
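Since get_num_shards and get_shards raise SplitNotFoundError for an unknown split, has_split can serve as a guard. The sketch below shows that pattern against a small in-memory stand-in; ToyDataset, the local SplitNotFoundError class, and num_shards_or_zero are all hypothetical names written for this illustration, not iden classes.

```python
class SplitNotFoundError(Exception):
    """Local stand-in for the error named above (not imported from iden)."""


class ToyDataset:
    """Hypothetical in-memory stand-in; NOT iden's implementation."""

    def __init__(self, splits: dict[str, tuple[str, ...]]) -> None:
        self._splits = splits

    def has_split(self, split: str) -> bool:
        return split in self._splits

    def get_num_shards(self, split: str) -> int:
        if not self.has_split(split):
            raise SplitNotFoundError(f"split '{split}' does not exist")
        return len(self._splits[split])


def num_shards_or_zero(dataset: ToyDataset, split: str) -> int:
    """Return the shard count, or 0 when the split is missing."""
    return dataset.get_num_shards(split) if dataset.has_split(split) else 0


toy = ToyDataset({"train": ("uri1", "uri2"), "val": ()})
print(num_shards_or_zero(toy, "train"))    # 2
print(num_shards_or_zero(toy, "missing"))  # 0
```

Note that an existing but empty split ("val" here) and a missing split both give 0 with this helper; call has_split directly when the two cases must be told apart.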

iden.dataset.VanillaDataset

Bases: BaseDataset[T]

Implement a simple dataset for managing shards and assets.

This dataset provides a straightforward implementation for organizing data into shards (training, validation, test splits) and assets (metadata, statistics, etc.).

Parameters:

Name Type Description Default
uri str

The Uniform Resource Identifier (URI) associated with the dataset, used for identification and persistence.

required
shards ShardDict[ShardTuple[T]]

The dataset's shards. Each item in the mapping represents a dataset split, where the key is the dataset split and the value is the shards.

required
assets ShardDict[Any]

The dataset's assets.

required

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import VanillaDataset
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 [1, 2, 3], uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     dataset = VanillaDataset(
...         uri=Path(tmpdir).joinpath("uri").as_uri(), shards=shards, assets=assets
...     )
...     dataset
...
VanillaDataset(
  (uri): file:///.../uri
  (shards): ShardDict(
      (uri): file:///.../uri_shards
      (shards):
        (train): ShardTuple(
            (uri): file:///.../uri_train
            (shards):
              (0): JsonShard(uri=file:///.../shard/uri1)
              (1): JsonShard(uri=file:///.../shard/uri2)
          )
        (val): ShardTuple(
            (uri): file:///.../uri_val
            (shards):
          )
    )
  (assets): ShardDict(
      (uri): file:///.../uri_assets
      (shards):
        (stats): JsonShard(uri=file:///.../uri_stats)
    )
)

iden.dataset.VanillaDataset.from_uri classmethod

from_uri(uri: str) -> VanillaDataset[T]

Instantiate a dataset from its URI.

Parameters:

Name Type Description Default
uri str

The Uniform Resource Identifier (URI) of the dataset to load.

required

Returns:

Type Description
VanillaDataset[T]

The instantiated dataset.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import create_vanilla_dataset
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 [1, 2, 3], uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     uri = Path(tmpdir).joinpath("uri").as_uri()
...     create_vanilla_dataset(uri=uri, shards=shards, assets=assets)
...     dataset = VanillaDataset.from_uri(uri)
...     dataset
...
VanillaDataset(
  (uri): file:///.../uri
  (shards): ShardDict(
      (uri): file:///.../uri_shards
      (shards):
        (train): ShardTuple(
            (uri): file:///.../uri_train
            (shards):
              (0): JsonShard(uri=file:///.../shard/uri1)
              (1): JsonShard(uri=file:///.../shard/uri2)
          )
        (val): ShardTuple(
            (uri): file:///.../uri_val
            (shards):
          )
    )
  (assets): ShardDict(
      (uri): file:///.../uri_assets
      (shards):
        (stats): JsonShard(uri=file:///.../uri_stats)
    )
)

iden.dataset.VanillaDataset.generate_uri_config classmethod

generate_uri_config(
    shards: ShardDict[ShardTuple[BaseShard[T]]],
    assets: ShardDict[Any],
) -> dict[str, Any]

Generate the minimal config that is used to load the dataset from its URI.

The config must be compatible with the JSON format.

Parameters:

Name Type Description Default
shards ShardDict[ShardTuple[BaseShard[T]]]

The shards in the dataset. Each item in the mapping represents a dataset split, where the key is the dataset split and the value is the shards.

required
assets ShardDict[Any]

The dataset's assets.

required

Returns:

Type Description
dict[str, Any]

The minimal config to load the dataset from its URI.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import VanillaDataset
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 [1, 2, 3], uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     config = VanillaDataset.generate_uri_config(shards=shards, assets=assets)
...     config
...
{'loader': {'_target_': 'iden.dataset.loader.VanillaDatasetLoader'},
 'shards': 'file:///.../uri_shards',
 'assets': 'file:///.../uri_assets'}
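The requirement that the config be compatible with the JSON format can be checked with a round trip through the standard json module. The sketch below uses a dict shaped like the output above; the two file:///tmp/example/... URIs are hypothetical placeholders standing in for the elided paths, and no iden function is called.

```python
import json

# Dict shaped like the example output above; the URIs are hypothetical.
config = {
    "loader": {"_target_": "iden.dataset.loader.VanillaDatasetLoader"},
    "shards": "file:///tmp/example/uri_shards",
    "assets": "file:///tmp/example/uri_assets",
}

# A JSON-compatible config survives serialization and parsing unchanged.
restored = json.loads(json.dumps(config))
print(restored == config)  # True
```

Values such as sets or Path objects would make json.dumps raise TypeError, and tuples would come back as lists, so the round trip is a quick way to catch a config that is not plain JSON data.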

iden.dataset.create_vanilla_dataset

create_vanilla_dataset(
    shards: ShardDict[ShardTuple[BaseShard[T]]],
    assets: ShardDict[Any],
    uri: str,
) -> VanillaDataset[T]

Create a VanillaDataset from its shards.

Note

This is a utility function to create a VanillaDataset from its shards and URI. A VanillaDataset can also be created in other ways.

Parameters:

Name Type Description Default
shards ShardDict[ShardTuple[BaseShard[T]]]

The dataset's shards. Each item in the mapping represents a dataset split, where the key is the dataset split and the value is the shards.

required
assets ShardDict[Any]

The dataset's assets.

required
uri str

The URI associated with the dataset.

required

Returns:

Type Description
VanillaDataset[T]

The instantiated VanillaDataset object.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import create_vanilla_dataset
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 [1, 2, 3], uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     dataset = create_vanilla_dataset(
...         uri=Path(tmpdir).joinpath("uri").as_uri(), shards=shards, assets=assets
...     )
...     dataset
...
VanillaDataset(
  (uri): file:///.../uri
  (shards): ShardDict(
      (uri): file:///.../uri_shards
      (shards):
        (train): ShardTuple(
            (uri): file:///.../uri_train
            (shards):
              (0): JsonShard(uri=file:///.../shard/uri1)
              (1): JsonShard(uri=file:///.../shard/uri2)
          )
        (val): ShardTuple(
            (uri): file:///.../uri_val
            (shards):
          )
    )
  (assets): ShardDict(
      (uri): file:///.../uri_assets
      (shards):
        (stats): JsonShard(uri=file:///.../uri_stats)
    )
)

iden.dataset.load_from_uri

load_from_uri(uri: str) -> BaseDataset[T]

Load a dataset from its Uniform Resource Identifier (URI).

Parameters:

Name Type Description Default
uri str

The URI of the dataset.

required

Returns:

Type Description
BaseDataset[T]

The dataset associated with the URI.

Raises:

Type Description
FileNotFoundError

if the URI file does not exist.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import create_vanilla_dataset, load_from_uri
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 [1, 2, 3], uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     uri = Path(tmpdir).joinpath("uri").as_uri()
...     create_vanilla_dataset(uri=uri, shards=shards, assets=assets)
...     dataset = load_from_uri(uri)
...     dataset
...
VanillaDataset(
  (uri): file:///.../uri
  (shards): ShardDict(
      (uri): file:///.../uri_shards
      (shards):
        (train): ShardTuple(
            (uri): file:///.../uri_train
            (shards):
              (0): JsonShard(uri=file:///.../shard/uri1)
              (1): JsonShard(uri=file:///.../shard/uri2)
          )
        (val): ShardTuple(
            (uri): file:///.../uri_val
            (shards):
          )
    )
  (assets): ShardDict(
      (uri): file:///.../uri_assets
      (shards):
        (stats): JsonShard(uri=file:///.../uri_stats)
    )
)

iden.dataset.generator

Contain dataset generator implementations.

iden.dataset.generator.BaseDatasetGenerator

Bases: ABC, Generic[T]

Define the base class to create a dataset.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset.generator import VanillaDatasetGenerator
>>> from iden.shard.generator import ShardDictGenerator
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     generator = VanillaDatasetGenerator(
...         path_uri=Path(tmpdir).joinpath("uri"),
...         shards=ShardDictGenerator(
...             path_uri=Path(tmpdir).joinpath("uri/shards"), shards={}
...         ),
...         assets=ShardDictGenerator(
...             path_uri=Path(tmpdir).joinpath("uri/assets"), shards={}
...         ),
...     )
...     generator
...     dataset = generator.generate("dataset1")
...     dataset
...
VanillaDatasetGenerator(
  (path_uri): PosixPath('/.../uri')
  (shards): ShardDictGenerator(
      (path_uri): PosixPath('/.../uri/shards')
      (shards):
    )
  (assets): ShardDictGenerator(
      (path_uri): PosixPath('/.../uri/assets')
      (shards):
    )
)
VanillaDataset(
  (uri): file:///.../uri/dataset1
  (shards): ShardDict(
      (uri): file:///.../uri/shards/shards
      (shards):
    )
  (assets): ShardDict(
      (uri): file:///.../uri/assets/assets
      (shards):
    )
)

iden.dataset.generator.BaseDatasetGenerator.equal abstractmethod

equal(other: Any, equal_nan: bool = False) -> bool

Indicate if two objects are equal or not.

Parameters:

Name Type Description Default
other Any

The object to compare with.

required
equal_nan bool

If True, then two NaNs will be considered equal.

False

Returns:

Type Description
bool

True if the two objects are equal, otherwise False.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset.generator import VanillaDatasetGenerator
>>> from iden.shard.generator import ShardDictGenerator
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = ShardDictGenerator(path_uri=Path(tmpdir).joinpath("uri/shards"), shards={})
...     assets = ShardDictGenerator(path_uri=Path(tmpdir).joinpath("uri/assets"), shards={})
...     generator1 = VanillaDatasetGenerator(
...         path_uri=Path(tmpdir).joinpath("uri"),
...         shards=shards,
...         assets=assets,
...     )
...     generator2 = VanillaDatasetGenerator(
...         path_uri=Path(tmpdir).joinpath("uri"),
...         shards=shards,
...         assets=assets,
...     )
...     generator3 = VanillaDatasetGenerator(
...         path_uri=Path(tmpdir).joinpath("uri2"),
...         shards=shards,
...         assets=assets,
...     )
...     generator1.equal(generator2)
...     generator1.equal(generator3)
...
True
False
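
The equal_nan flag matters because, in Python, a NaN never compares equal to itself. The following is a minimal sketch of that behaviour using only the standard library. The helper name values_equal is hypothetical and is not part of iden; it only illustrates the kind of comparison the flag controls.

```python
import math
from typing import Any


def values_equal(a: Any, b: Any, equal_nan: bool = False) -> bool:
    """Hypothetical helper: compare two values, optionally treating NaNs as equal."""
    both_nan = (
        isinstance(a, float)
        and isinstance(b, float)
        and math.isnan(a)
        and math.isnan(b)
    )
    if both_nan:
        # NaN != NaN under the usual rules, so the flag decides the result.
        return equal_nan
    return a == b


nan = float("nan")
print(values_equal(nan, nan))  # False
print(values_equal(nan, nan, equal_nan=True))  # True
print(values_equal(1.0, 1.0))  # True
```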

iden.dataset.generator.BaseDatasetGenerator.generate abstractmethod

generate(dataset_id: str) -> BaseDataset[T]

Generate a dataset.

Parameters:

Name Type Description Default
dataset_id str

The dataset ID.

required

Returns:

Type Description
BaseDataset[T]

The generated dataset.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset.generator import VanillaDatasetGenerator
>>> from iden.shard.generator import ShardDictGenerator
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     generator = VanillaDatasetGenerator(
...         path_uri=Path(tmpdir).joinpath("uri"),
...         shards=ShardDictGenerator(
...             path_uri=Path(tmpdir).joinpath("uri/shards"), shards={}
...         ),
...         assets=ShardDictGenerator(
...             path_uri=Path(tmpdir).joinpath("uri/assets"), shards={}
...         ),
...     )
...     dataset = generator.generate("dataset1")
...     dataset
...
VanillaDataset(
  (uri): file:///.../uri/dataset1
  (shards): ShardDict(
      (uri): file:///.../uri/shards/shards
      (shards):
    )
  (assets): ShardDict(
      (uri): file:///.../uri/assets/assets
      (shards):
    )
)
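
Judging from the output above, the URI of the generated dataset appears to be the generator's path_uri joined with the dataset ID and rendered as a file URI. The sketch below reproduces that shape with pathlib only. The helper dataset_uri is hypothetical and makes no claim about how iden builds its URIs internally.

```python
import tempfile
from pathlib import Path


def dataset_uri(path_uri: Path, dataset_id: str) -> str:
    """Hypothetical helper: build a file URI for a dataset ID under a base path."""
    # Path.as_uri() only accepts absolute paths, so resolve the base first.
    return path_uri.resolve().joinpath(dataset_id).as_uri()


with tempfile.TemporaryDirectory() as tmpdir:
    uri = dataset_uri(Path(tmpdir).joinpath("uri"), "dataset1")
    print(uri.startswith("file://"))  # True
    print(uri.endswith("/uri/dataset1"))  # True
```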

iden.dataset.generator.VanillaDatasetGenerator

Bases: BaseDatasetGenerator[tuple[BaseShard[T], ...]]

Implement a VanillaDataset generator.

Parameters:

Name Type Description Default
path_uri Path

The path where the URI file is saved.

required
shards ShardDictGenerator[T] | dict[Any, Any]

The shards generator or its configuration.

required
assets ShardDictGenerator[Any] | dict[Any, Any]

The assets generator or its configuration.

required

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset.generator import VanillaDatasetGenerator
>>> from iden.shard.generator import ShardDictGenerator
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     generator = VanillaDatasetGenerator(
...         path_uri=Path(tmpdir).joinpath("uri"),
...         shards=ShardDictGenerator(
...             path_uri=Path(tmpdir).joinpath("uri/shards"), shards={}
...         ),
...         assets=ShardDictGenerator(
...             path_uri=Path(tmpdir).joinpath("uri/assets"), shards={}
...         ),
...     )
...     generator
...     dataset = generator.generate("dataset1")
...     dataset
...
VanillaDatasetGenerator(
  (path_uri): PosixPath('/.../uri')
  (shards): ShardDictGenerator(
      (path_uri): PosixPath('/.../uri/shards')
      (shards):
    )
  (assets): ShardDictGenerator(
      (path_uri): PosixPath('/.../uri/assets')
      (shards):
    )
)
VanillaDataset(
  (uri): file:///.../uri/dataset1
  (shards): ShardDict(
      (uri): file:///.../uri/shards/shards
      (shards):
    )
  (assets): ShardDict(
      (uri): file:///.../uri/assets/assets
      (shards):
    )
)

iden.dataset.generator.is_dataset_generator_config

is_dataset_generator_config(config: dict[Any, Any]) -> bool

Indicate if the input configuration is a configuration for a BaseDatasetGenerator.

This function only checks if the value of the key _target_ is valid. It does not check the other values. If _target_ indicates a function, the return type hint is used to check the class.

Parameters:

Name Type Description Default
config dict[Any, Any]

The configuration to check.

required

Returns:

Type Description
bool

True if the input configuration is a configuration for a BaseDatasetGenerator object.

Example
>>> from iden.dataset.generator import is_dataset_generator_config
>>> is_dataset_generator_config(
...     {"_target_": "iden.dataset.generator.VanillaDatasetGenerator"}
... )
True
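
To make the _target_ check concrete, here is a minimal sketch of how such a check could work: resolve the dotted path to an object and test whether it is a subclass of the expected base class. The names resolve_target and is_config_for are hypothetical, the sketch uses standard-library classes as stand-ins, and it only covers the case where _target_ names a class (not a function with a return type hint).

```python
import importlib
from typing import Any


def resolve_target(target: str) -> Any:
    """Import the object named by a dotted path such as 'collections.OrderedDict'."""
    module_name, _, attr = target.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def is_config_for(config: dict[Any, Any], base: type) -> bool:
    """Hypothetical check: does config['_target_'] name a subclass of ``base``?"""
    target = config.get("_target_")
    if not isinstance(target, str):
        return False
    try:
        obj = resolve_target(target)
    except (ImportError, AttributeError, ValueError):
        # Unknown module, unknown attribute, or a path without a module part.
        return False
    # Only the _target_ value is inspected; other keys are ignored.
    return isinstance(obj, type) and issubclass(obj, base)


print(is_config_for({"_target_": "collections.OrderedDict"}, dict))  # True
print(is_config_for({"_target_": "collections.Counter"}, list))  # False
print(is_config_for({"_target_": "no.such.module.Thing"}, dict))  # False
```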

iden.dataset.generator.setup_dataset_generator

setup_dataset_generator(
    dataset_generator: (
        BaseDatasetGenerator[T] | dict[Any, Any]
    ),
) -> BaseDatasetGenerator[T]

Set up a dataset generator.

The dataset generator is instantiated from its configuration by using the BaseDatasetGenerator factory function.

Parameters:

Name Type Description Default
dataset_generator BaseDatasetGenerator[T] | dict[Any, Any]

The dataset generator or its configuration.

required

Returns:

Type Description
BaseDatasetGenerator[T]

The instantiated dataset generator.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset.generator import setup_dataset_generator
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     generator = setup_dataset_generator(
...         {
...             "_target_": "iden.dataset.generator.VanillaDatasetGenerator",
...             "path_uri": Path(tmpdir).joinpath("uri"),
...             "shards": {
...                 "_target_": "iden.shard.generator.ShardDictGenerator",
...                 "path_uri": Path(tmpdir).joinpath("uri/shards"),
...                 "shards": {},
...             },
...             "assets": {
...                 "_target_": "iden.shard.generator.ShardDictGenerator",
...                 "path_uri": Path(tmpdir).joinpath("uri/assets"),
...                 "shards": {},
...             },
...         }
...     )
...     generator
...
VanillaDatasetGenerator(
  (path_uri): PosixPath('/.../uri')
  (shards): ShardDictGenerator(
      (path_uri): PosixPath('/.../uri/shards')
      (shards):
    )
  (assets): ShardDictGenerator(
      (path_uri): PosixPath('/.../uri/assets')
      (shards):
    )
)
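
The "object or configuration" pattern used by this function can be sketched as follows: an object that is already instantiated is returned unchanged, while a dict is turned into an object by importing its _target_ and passing the remaining keys as keyword arguments. The helper setup_object is hypothetical and uses fractions.Fraction as a stand-in target. Unlike the example above, it does not instantiate nested configurations.

```python
import importlib
from typing import Any


def setup_object(obj_or_config: Any) -> Any:
    """Hypothetical sketch: return objects as-is, instantiate dict configs."""
    if not isinstance(obj_or_config, dict):
        return obj_or_config
    kwargs = dict(obj_or_config)  # copy so the caller's config is not mutated
    module_name, _, attr = kwargs.pop("_target_").rpartition(".")
    factory = getattr(importlib.import_module(module_name), attr)
    return factory(**kwargs)


config = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
frac = setup_object(config)
print(frac)  # 3/4
print(setup_object(frac) is frac)  # True
print("_target_" in config)  # True
```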

iden.dataset.loader

Contain dataset loader implementations.

iden.dataset.loader.BaseDatasetLoader

Bases: ABC, Generic[T]

Define the base class to implement a dataset loader.

A dataset loader loads a BaseDataset object from its Uniform Resource Identifier (URI).

Example
>>> from iden.dataset.loader import VanillaDatasetLoader
>>> loader = VanillaDatasetLoader()
>>> loader
VanillaDatasetLoader()

iden.dataset.loader.BaseDatasetLoader.equal abstractmethod

equal(other: Any, equal_nan: bool = False) -> bool

Indicate if two objects are equal or not.

Parameters:

Name Type Description Default
other Any

The object to compare with.

required
equal_nan bool

If True, then two NaNs will be considered equal.

False

Returns:

Type Description
bool

True if the two objects are equal, otherwise False.

Example
>>> from iden.dataset.loader import VanillaDatasetLoader
>>> VanillaDatasetLoader().equal(VanillaDatasetLoader())
True

iden.dataset.loader.BaseDatasetLoader.load abstractmethod

load(uri: str) -> BaseDataset[T]

Load a dataset from its Uniform Resource Identifier (URI).

Parameters:

Name Type Description Default
uri str

The URI of the dataset to load.

required

Returns:

Type Description
BaseDataset[T]

The loaded dataset.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import create_vanilla_dataset
>>> from iden.dataset.loader import VanillaDatasetLoader
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 [1, 2, 3], uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     uri = Path(tmpdir).joinpath("uri").as_uri()
...     create_vanilla_dataset(uri=uri, shards=shards, assets=assets)
...     loader = VanillaDatasetLoader()
...     dataset = loader.load(uri)
...     dataset
...
VanillaDataset(
  (uri): file:///.../uri
  (shards): ShardDict(
      (uri): file:///.../uri_shards
      (shards):
        (train): ShardTuple(
            (uri): file:///.../uri_train
            (shards):
              (0): JsonShard(uri=file:///.../shard/uri1)
              (1): JsonShard(uri=file:///.../shard/uri2)
          )
        (val): ShardTuple(
            (uri): file:///.../uri_val
            (shards):
          )
    )
  (assets): ShardDict(
      (uri): file:///.../uri_assets
      (shards):
        (stats): JsonShard(uri=file:///.../uri_stats)
    )
)
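
The URIs in these examples are file URIs produced by Path.as_uri(). As a rough illustration of the reverse direction, the sketch below converts a file URI back to a local path with the standard library. The helper uri_to_path is hypothetical; how iden resolves URIs internally is not described here.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def uri_to_path(uri: str) -> Path:
    """Hypothetical helper: convert a file URI back to a local path."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        msg = f"Expected a file URI but received: {uri}"
        raise ValueError(msg)
    # url2pathname also decodes percent-escapes such as %20.
    return Path(url2pathname(parsed.path))


with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir).resolve().joinpath("my data", "uri")
    uri = path.as_uri()  # spaces are percent-encoded in the URI
    print(uri_to_path(uri) == path)  # True
```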

iden.dataset.loader.VanillaDatasetLoader

Bases: BaseDatasetLoader[T]

Implement a VanillaDataset loader.

Example
>>> import tempfile
>>> from pathlib import Path
>>> from iden.dataset import create_vanilla_dataset
>>> from iden.dataset.loader import VanillaDatasetLoader
>>> from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = create_shard_dict(
...         shards={
...             "train": create_shard_tuple(
...                 [
...                     create_json_shard(
...                         [1, 2, 3], uri=Path(tmpdir).joinpath("shard/uri1").as_uri()
...                     ),
...                     create_json_shard(
...                         [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shard/uri2").as_uri()
...                     ),
...                 ],
...                 uri=Path(tmpdir).joinpath("uri_train").as_uri(),
...             ),
...             "val": create_shard_tuple(
...                 shards=[],
...                 uri=Path(tmpdir).joinpath("uri_val").as_uri(),
...             ),
...         },
...         uri=Path(tmpdir).joinpath("uri_shards").as_uri(),
...     )
...     assets = create_shard_dict(
...         shards={
...             "stats": create_json_shard(
...                 [1, 2, 3], uri=Path(tmpdir).joinpath("uri_stats").as_uri()
...             )
...         },
...         uri=Path(tmpdir).joinpath("uri_assets").as_uri(),
...     )
...     uri = Path(tmpdir).joinpath("uri").as_uri()
...     create_vanilla_dataset(uri=uri, shards=shards, assets=assets)
...     loader = VanillaDatasetLoader()
...     dataset = loader.load(uri)
...     dataset
...
VanillaDataset(
  (uri): file:///.../uri
  (shards): ShardDict(
      (uri): file:///.../uri_shards
      (shards):
        (train): ShardTuple(
            (uri): file:///.../uri_train
            (shards):
              (0): JsonShard(uri=file:///.../shard/uri1)
              (1): JsonShard(uri=file:///.../shard/uri2)
          )
        (val): ShardTuple(
            (uri): file:///.../uri_val
            (shards):
          )
    )
  (assets): ShardDict(
      (uri): file:///.../uri_assets
      (shards):
        (stats): JsonShard(uri=file:///.../uri_stats)
    )
)

iden.dataset.loader.is_dataset_loader_config

is_dataset_loader_config(config: dict[Any, Any]) -> bool

Indicate if the input configuration is a configuration for a BaseDatasetLoader.

This function only checks if the value of the key _target_ is valid. It does not check the other values. If _target_ indicates a function, the return type hint is used to check the class.

Parameters:

Name Type Description Default
config dict[Any, Any]

The configuration to check.

required

Returns:

Type Description
bool

True if the input configuration is a configuration for a BaseDatasetLoader object.

Example
>>> from iden.dataset.loader import is_dataset_loader_config
>>> is_dataset_loader_config({"_target_": "iden.dataset.loader.VanillaDatasetLoader"})
True

iden.dataset.loader.setup_dataset_loader

setup_dataset_loader(
    dataset_loader: BaseDatasetLoader[T] | dict[Any, Any],
) -> BaseDatasetLoader[T]

Set up a dataset loader.

The dataset loader is instantiated from its configuration by using the BaseDatasetLoader factory function.

Parameters:

Name Type Description Default
dataset_loader BaseDatasetLoader[T] | dict[Any, Any]

The dataset loader or its configuration.

required

Returns:

Type Description
BaseDatasetLoader[T]

The instantiated dataset loader.

Example
>>> from iden.dataset.loader import setup_dataset_loader
>>> dataset_loader = setup_dataset_loader(
...     {"_target_": "iden.dataset.loader.VanillaDatasetLoader"}
... )
>>> dataset_loader
VanillaDatasetLoader()