Shard

Overview

A shard is an abstraction that represents a unit of data. It lets you get the data without knowing how they are stored. Each shard must have a unique Uniform Resource Identifier (URI), which identifies the shard and makes it possible to instantiate the shard from its URI. The get_uri method returns the URI of a shard:

>>> import tempfile
>>> from pathlib import Path
>>> from iden.shard import create_json_shard
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shard = create_json_shard([1, 2, 3], uri=Path(tmpdir).joinpath("my_uri").as_uri())
...     uri = shard.get_uri()
...     uri
...
'file:///.../my_uri'

To be scalable, a shard does not contain the data; it contains the logic to get the data. This makes it possible to create and manage a large number of shards independently of the total size of the data. The get_data method returns the data of the shard:

>>> import tempfile
>>> from pathlib import Path
>>> from iden.shard import create_json_shard
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shard = create_json_shard([1, 2, 3], uri=Path(tmpdir).joinpath("my_uri").as_uri())
...     data = shard.get_data()
...     data
...
[1, 2, 3]
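The idea that a shard holds the logic to get the data rather than the data themselves can be sketched with the standard library alone. ToyJsonShard and demo_lazy_shard are hypothetical names used for illustration; this is not how iden implements JsonShard.

```python
import json
import tempfile
from pathlib import Path


class ToyJsonShard:
    """Hypothetical stand-in for a shard: it stores a path, not the data."""

    def __init__(self, path: Path) -> None:
        self._path = path

    def get_data(self):
        # The file is read only when the data are requested.
        with self._path.open() as f:
            return json.load(f)


def demo_lazy_shard() -> list:
    with tempfile.TemporaryDirectory() as tmpdir:
        path = Path(tmpdir) / "data.json"
        path.write_text(json.dumps([1, 2, 3]))
        shard = ToyJsonShard(path)  # cheap: nothing is loaded yet
        return shard.get_data()  # the data are read from disk here
```

Because the object only keeps a path, creating thousands of such objects costs almost no memory, whatever the size of the files they point to.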

If the data from a shard are used often, they can be cached by specifying cache=True:

>>> import tempfile
>>> from pathlib import Path
>>> from iden.shard import create_json_shard
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shard = create_json_shard([1, 2, 3], uri=Path(tmpdir).joinpath("my_uri").as_uri())
...     data = shard.get_data(cache=True)
...     data
...
[1, 2, 3]

Most shards can cache the data in memory. As the following example shows, an in-place modification of cached data is visible in later get_data calls until the cache is cleared with the clear method:

>>> import tempfile
>>> from pathlib import Path
>>> from iden.shard import create_json_shard
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shard = create_json_shard([1, 2, 3], uri=Path(tmpdir).joinpath("my_uri").as_uri())
...     data = shard.get_data(cache=True)
...     data
...     data.append(4)  # in-place modification
...     data = shard.get_data()
...     data
...     shard.clear()
...     data = shard.get_data()
...     data
...
...
[1, 2, 3]
[1, 2, 3, 4]
[1, 2, 3]
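The output above is consistent with a cache that stores the loaded object itself rather than a copy. The following minimal sketch reproduces that behaviour under this assumption; ToyCachedShard and its loader argument are hypothetical and do not describe iden's internals.

```python
class ToyCachedShard:
    """Hypothetical shard with an optional in-memory cache."""

    def __init__(self, loader) -> None:
        self._loader = loader  # callable that reloads the data from storage
        self._cache = None
        self._is_cached = False

    def get_data(self, cache: bool = False):
        if self._is_cached:
            return self._cache  # same object as before, not a copy
        data = self._loader()
        if cache:
            self._cache = data
            self._is_cached = True
        return data

    def clear(self) -> None:
        self._cache = None
        self._is_cached = False


shard = ToyCachedShard(lambda: [1, 2, 3])
data = shard.get_data(cache=True)
data.append(4)  # mutates the cached object in place
after_mutation = shard.get_data()  # served from the cache
shard.clear()
after_clear = shard.get_data()  # reloaded from the source
```

If the cached data must stay pristine, one option is to copy them before modifying them.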

It is important to clear the cache when a shard is no longer used, because caching the data of too many shards in memory at the same time can lead to out-of-memory (OOM) issues. The is_cached method indicates whether the data of the shard are cached:

>>> import tempfile
>>> from pathlib import Path
>>> from iden.shard import create_json_shard
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shard = create_json_shard([1, 2, 3], uri=Path(tmpdir).joinpath("my_uri").as_uri())
...     shard.is_cached()
...     data = shard.get_data(cache=True)
...     shard.is_cached()
...     shard.clear()
...     shard.is_cached()
...
...
False
True
False
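One way to avoid forgetting the clear call is to wrap the cached access in a context manager. The helper below is a sketch, not part of iden: cached_data is a hypothetical name, and it only assumes an object exposing the get_data(cache=True) and clear methods described above. FakeShard is a hypothetical in-memory stand-in so that the example runs without iden.

```python
from contextlib import contextmanager


@contextmanager
def cached_data(shard):
    """Yield the data of ``shard`` with caching enabled, then clear the cache."""
    try:
        yield shard.get_data(cache=True)
    finally:
        shard.clear()  # runs even if the body raises


class FakeShard:
    """Hypothetical stand-in used only to exercise the helper."""

    def __init__(self, values) -> None:
        self._values = values
        self._cache = None

    def get_data(self, cache: bool = False):
        if self._cache is not None:
            return self._cache
        data = list(self._values)
        if cache:
            self._cache = data
        return data

    def clear(self) -> None:
        self._cache = None

    def is_cached(self) -> bool:
        return self._cache is not None


shards = [FakeShard([1, 2]), FakeShard([3, 4, 5])]
total = 0
for shard in shards:
    with cached_data(shard) as data:  # at most one shard is cached at a time
        total += sum(data)

still_cached = [shard.is_cached() for shard in shards]
```

With this pattern the peak memory use is bounded by the largest single shard rather than by the sum of all shards.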

Finally, there is the equal method to check if two shards are equal or not:

>>> import tempfile
>>> from pathlib import Path
>>> from iden.shard import JsonShard, create_json_shard
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     uri1 = Path(tmpdir).joinpath("my_uri1").as_uri()
...     uri2 = Path(tmpdir).joinpath("my_uri2").as_uri()
...     shard1 = create_json_shard([1, 2, 3], uri=uri1)
...     shard2 = create_json_shard([4, 5, 6], uri=uri2)
...     shard3 = JsonShard.from_uri(uri=uri1)
...     shard1.equal(shard2)
...     shard1.equal(shard3)
...
...
False
True

Built-in shards

iden has some built-in shard implementations that can be used out of the box. It is possible to extend iden to support more shard implementations. This page explains how to add a new shard implementation.

Each shard implementation is different and has different properties. You need to choose the best shard based on your requirements; there is no one-size-fits-all implementation. For example, the PickleShard implementation supports many types of data, whereas the TorchSafetensorsShard implementation only supports dictionaries of torch.Tensors. The following table summarizes the supported data for some of the built-in shards.

shard                  supported data
FileShard              depends on the file format
JsonShard              any data compatible with JSON format
PickleShard            any serializable data
TorchSafetensorsShard  a dictionary of torch.Tensors
TorchShard             any serializable data
YamlShard              any data compatible with YAML format
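The difference between "any data compatible with JSON format" and "any serializable data" can be illustrated with the standard json and pickle modules, independently of iden. The record below is made-up example data.

```python
import json
import pickle

record = {"ids": (1, 2, 3), "tags": {"a", "b"}}

# pickle round-trips the object unchanged, including the tuple and the set.
pickle_roundtrip = pickle.loads(pickle.dumps(record))

# json rejects the set outright ...
try:
    json.dumps(record)
    json_error = ""
except TypeError as exc:
    json_error = str(exc)

# ... and silently turns tuples into lists.
json_roundtrip = json.loads(json.dumps({"ids": (1, 2, 3)}))
```

In practice, a JSON-based shard is a reasonable fit for lists, dictionaries with string keys, numbers, and strings, while data containing sets, tuples that must stay tuples, or custom objects call for a pickle-based shard.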

File-based shards. iden has some shard implementations to load data from files. iden relies on existing packages to save and load data in a shard. Most of these packages are optional and should be installed if necessary. The following table shows some of the supported file formats, the packages used to save and load data, and the associated shard implementations.

shard                  file format       package
JsonShard              JSON file         json
PickleShard            pickle file       pickle
TorchSafetensorsShard  safetensors file  safetensors
TorchShard             pytorch file      torch
YamlShard              YAML file         yaml

FileShard is a generic file-based shard that supports most file formats.

Special shards. iden has some special shards that combine multiple shards. ShardTuple is the shard implementation to manage a tuple of shards.

>>> import tempfile
>>> from pathlib import Path
>>> from iden.shard import ShardTuple, create_json_shard
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = [
...         create_json_shard([1, 2, 3], uri=Path(tmpdir).joinpath("shards/uri1").as_uri()),
...         create_json_shard(
...             [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shards/uri2").as_uri()
...         ),
...     ]
...     sl = ShardTuple(uri=Path(tmpdir).joinpath("uri").as_uri(), shards=shards)
...     sl
...
ShardTuple(
  (uri): file:///.../uri
  (shards):
    (0): JsonShard(uri=file:///.../shards/uri1)
    (1): JsonShard(uri=file:///.../shards/uri2)
)

ShardDict is the shard implementation to manage a dictionary of shards.

>>> import tempfile
>>> from pathlib import Path
>>> from iden.shard import ShardDict, create_json_shard
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     shards = {
...         "train": create_json_shard(
...             [1, 2, 3], uri=Path(tmpdir).joinpath("shards/uri1").as_uri()
...         ),
...         "val": create_json_shard(
...             [4, 5, 6, 7], uri=Path(tmpdir).joinpath("shards/uri2").as_uri()
...         ),
...     }
...     sd = ShardDict(uri=Path(tmpdir).joinpath("uri").as_uri(), shards=shards)
...     sd
...
ShardDict(
  (uri): file:///.../uri
  (shards):
    (train): JsonShard(uri=file:///.../shards/uri1)
    (val): JsonShard(uri=file:///.../shards/uri2)
)

Instantiating a shard from its URI

iden can instantiate a shard from its Uniform Resource Identifier (URI). A shard can be represented by its URI, which makes data management more scalable: it is easier to manage URIs than all the data. The load_from_uri function instantiates a shard from its URI.

>>> import tempfile
>>> from pathlib import Path
>>> from iden.shard import create_json_shard, load_from_uri
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     uri = Path(tmpdir).joinpath("my_uri").as_uri()
...     _ = create_json_shard([1, 2, 3], uri=uri)
...     shard = load_from_uri(uri)
...     shard
...
JsonShard(uri=file:///.../my_uri)

Under the hood, the load_from_uri function relies on a shard loader object to instantiate a shard object. A shard loader contains the logic to instantiate a shard object from its URI. For instance, in the previous example, the JsonShardLoader class is used to instantiate the JsonShard object. load_from_uri is a universal function that can load any shard, but it is also possible to use a specific shard loader. For instance, the following example is equivalent to the previous example:

>>> import tempfile
>>> from pathlib import Path
>>> from iden.shard import create_json_shard
>>> from iden.shard.loader import JsonShardLoader
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     uri = Path(tmpdir).joinpath("my_uri").as_uri()
...     _ = create_json_shard([1, 2, 3], uri=uri)
...     loader = JsonShardLoader()
...     shard = loader.load(uri)
...     shard
...
JsonShard(uri=file:///.../my_uri)

Uniform Resource Identifier (URI)

The URI file contains enough information to instantiate the shard object, and it is encoded as a JSON file. All URI files should contain a dictionary with at least one key that indicates which shard loader to use to instantiate the shard. The following example shows the generated configuration for a JsonShard object:

>>> import tempfile
>>> from pathlib import Path
>>> from iden.shard import JsonShard
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     path = Path(tmpdir).joinpath("data.json")
...     config = JsonShard.generate_uri_config(path)
...     config
...
{'kwargs': {'path': '/.../data.json'},
 'loader': {'_target_': 'iden.shard.loader.JsonShardLoader'}}

The 'kwargs' key is specific to JsonShard and indicates where to find the JSON file associated with the shard.
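To make the role of the URI file more concrete, the sketch below writes a configuration with the same shape as the one above to a JSON file, reads it back, and separates the loader target from the shard arguments. The helper names write_uri_file, read_uri_file, and resolve_target are hypothetical, and resolving '_target_' with importlib is an assumption about one common approach, not a description of what iden does. The resolution step is shown on json.JSONDecoder so that the example runs without iden installed.

```python
import importlib
import json
import tempfile
from pathlib import Path


def resolve_target(dotted_path: str):
    """Import and return the object named by a dotted path."""
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def write_uri_file(uri_path: Path, config: dict) -> None:
    """Store the configuration as JSON at the given path."""
    uri_path.write_text(json.dumps(config))


def read_uri_file(uri_path: Path) -> tuple[str, dict]:
    """Return the loader target and the shard-specific arguments."""
    config = json.loads(uri_path.read_text())
    return config["loader"]["_target_"], config.get("kwargs", {})


with tempfile.TemporaryDirectory() as tmpdir:
    uri_path = Path(tmpdir) / "my_uri"
    config = {
        "kwargs": {"path": str(Path(tmpdir) / "data.json")},
        "loader": {"_target_": "iden.shard.loader.JsonShardLoader"},
    }
    write_uri_file(uri_path, config)
    target, kwargs = read_uri_file(uri_path)

# Stand-in target from the standard library (assumption: iden is not installed).
decoder_cls = resolve_target("json.JSONDecoder")
```

Keeping the loader name inside the URI file is what allows a single universal function to load any shard: the file itself says which loader understands it.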