arkas.analyzer

Contain DataFrame analyzers.

arkas.analyzer.AccuracyAnalyzer

Bases: BaseTruePredAnalyzer

Implement the accuracy analyzer.

Parameters:

Name Type Description Default
y_true str

The column name of the ground truth target labels.

required
y_pred str

The column name of the predicted labels.

required
drop_nulls bool

If True, the rows with null values in y_true or y_pred columns are dropped.

True
missing_policy str

The policy on how to handle missing columns. The following options are available: 'ignore', 'warn', and 'raise'. If 'raise', an exception is raised if at least one column is missing. If 'warn', a warning is raised if at least one column is missing and the missing columns are ignored. If 'ignore', the missing columns are ignored and no warning message appears.

'raise'
nan_policy str

The policy on how to handle NaN values in the input arrays. The following options are available: 'omit', 'propagate', and 'raise'.

'propagate'

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import AccuracyAnalyzer
>>> analyzer = AccuracyAnalyzer(y_true="target", y_pred="pred")
>>> analyzer
AccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
>>> frame = pl.DataFrame({"pred": [3, 2, 0, 1, 0, 1], "target": [3, 2, 0, 1, 0, 1]})
>>> output = analyzer.analyze(frame)
>>> output
AccuracyOutput(
  (state): AccuracyState(y_true=(6,), y_pred=(6,), y_true_name='target', y_pred_name='pred', nan_policy='propagate')
)
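
The effect of drop_nulls on the score can be pictured with plain Python lists. This is only a sketch: the accuracy helper below is hypothetical and is not the arkas implementation, and it assumes that a null value is represented by None.

```python
def accuracy(y_true, y_pred, drop_nulls=True):
    """Hypothetical helper: fraction of rows where prediction equals target."""
    pairs = list(zip(y_true, y_pred))
    if drop_nulls:
        # Drop every row in which either value is null (None).
        pairs = [(t, p) for t, p in pairs if t is not None and p is not None]
    if not pairs:
        return float("nan")
    return sum(t == p for t, p in pairs) / len(pairs)


# Same values as the example above: every prediction matches its target.
perfect = accuracy([3, 2, 0, 1, 0, 1], [3, 2, 0, 1, 0, 1])
# The row with a null target is dropped, then 2 of the 3 remaining rows match.
with_null = accuracy([1, 0, None, 1], [1, 1, 0, 1])
print(perfect, with_null)
```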

arkas.analyzer.BalancedAccuracyAnalyzer

Bases: BaseTruePredAnalyzer

Implement the balanced accuracy analyzer.

Parameters:

Name Type Description Default
y_true str

The column name of the ground truth target labels.

required
y_pred str

The column name of the predicted labels.

required
drop_nulls bool

If True, the rows with null values in y_true or y_pred columns are dropped.

True
missing_policy str

The policy on how to handle missing columns. The following options are available: 'ignore', 'warn', and 'raise'. If 'raise', an exception is raised if at least one column is missing. If 'warn', a warning is raised if at least one column is missing and the missing columns are ignored. If 'ignore', the missing columns are ignored and no warning message appears.

'raise'
nan_policy str

The policy on how to handle NaN values in the input arrays. The following options are available: 'omit', 'propagate', and 'raise'.

'propagate'

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import BalancedAccuracyAnalyzer
>>> analyzer = BalancedAccuracyAnalyzer(y_true="target", y_pred="pred")
>>> analyzer
BalancedAccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
>>> frame = pl.DataFrame({"pred": [3, 2, 0, 1, 0, 1], "target": [3, 2, 0, 1, 0, 1]})
>>> output = analyzer.analyze(frame)
>>> output
BalancedAccuracyOutput(
  (state): AccuracyState(y_true=(6,), y_pred=(6,), y_true_name='target', y_pred_name='pred', nan_policy='propagate')
)
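
Balanced accuracy is commonly defined as the mean of the per-class recalls, which makes it less sensitive to class imbalance than plain accuracy. The sketch below assumes that definition; balanced_accuracy is a hypothetical helper, not the arkas implementation.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Hypothetical helper: mean of the recall obtained on each class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for target, pred in zip(y_true, y_pred):
        total[target] += 1
        correct[target] += int(target == pred)
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Imbalanced labels: always predicting class 0 is right on 4 rows out of 5,
# but the recall is 1.0 on class 0 and 0.0 on class 1.
score = balanced_accuracy([0, 0, 0, 0, 1], [0, 0, 0, 0, 0])
print(score)
```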

arkas.analyzer.BaseAnalyzer

Bases: ABC

Define the base class to analyze a DataFrame.

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import AccuracyAnalyzer
>>> analyzer = AccuracyAnalyzer(y_true="target", y_pred="pred")
>>> analyzer
AccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
>>> data = pl.DataFrame({"pred": [3, 2, 0, 1, 0, 1], "target": [3, 2, 0, 1, 0, 1]})
>>> output = analyzer.analyze(data)
>>> output
AccuracyOutput(
  (state): AccuracyState(y_true=(6,), y_pred=(6,), y_true_name='target', y_pred_name='pred', nan_policy='propagate')
)

arkas.analyzer.BaseAnalyzer.analyze abstractmethod

analyze(frame: DataFrame, lazy: bool = True) -> BaseOutput

Analyze the DataFrame.

Parameters:

Name Type Description Default
frame DataFrame

The DataFrame to analyze.

required
lazy bool

If False, it forces the computation of the output; otherwise it returns an output object that contains the logic.

True

Returns:

Type Description
BaseOutput

The generated output.

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import AccuracyAnalyzer
>>> analyzer = AccuracyAnalyzer(y_true="target", y_pred="pred")
>>> data = pl.DataFrame({"pred": [3, 2, 0, 1, 0, 1], "target": [3, 2, 0, 1, 0, 1]})
>>> output = analyzer.analyze(data)
>>> output
AccuracyOutput(
  (state): AccuracyState(y_true=(6,), y_pred=(6,), y_true_name='target', y_pred_name='pred', nan_policy='propagate')
)
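
The difference between an output object that contains the logic and an output whose computation has been forced can be pictured with a small deferred object. This is a sketch under assumptions: DeferredOutput and its compute method are hypothetical names used for illustration, not the arkas output API.

```python
class DeferredOutput:
    """Hypothetical output that stores the logic and runs it on demand."""

    def __init__(self, func):
        self._func = func
        self.calls = 0  # number of times the stored logic has been executed

    def compute(self):
        self.calls += 1
        return self._func()


y_true = [3, 2, 0, 1, 0, 1]
y_pred = [3, 2, 0, 1, 0, 0]


def score():
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Building the output object does not score anything yet.
output = DeferredOutput(score)
calls_before = output.calls

# Forcing the computation runs the stored logic.
result = output.compute()
print(calls_before, output.calls, result)
```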

arkas.analyzer.BaseInNLazyAnalyzer

Bases: BaseAnalyzer

Define a base class to implement analyzers that analyze DataFrames by using multiple input columns.

Parameters:

Name Type Description Default
columns Sequence[str] | None

The columns to analyze. If None, it analyzes all the columns.

None
exclude_columns Sequence[str]

The columns to exclude from the input columns. If any column is not found, it will be ignored during the filtering process.

()
missing_policy str

The policy on how to handle missing columns. The following options are available: 'ignore', 'warn', and 'raise'. If 'raise', an exception is raised if at least one column is missing. If 'warn', a warning is raised if at least one column is missing and the missing columns are ignored. If 'ignore', the missing columns are ignored and no warning message appears.

'raise'

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ColumnCooccurrenceAnalyzer
>>> analyzer = ColumnCooccurrenceAnalyzer()
>>> analyzer
ColumnCooccurrenceAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', ignore_self=False, figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [0, 0, 0, 0, 1, 1, 1],
...     }
... )
>>> output = analyzer.analyze(frame)
>>> output
ColumnCooccurrenceOutput(
  (state): ColumnCooccurrenceState(matrix=(3, 3), figure_config=MatplotlibFigureConfig())
)
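
The three missing_policy options described above can be sketched as follows. The resolve_columns helper is hypothetical, and the exception type raised by arkas is an assumption; only the ignore, warn, and raise behaviours come from the parameter description.

```python
import warnings


def resolve_columns(frame_columns, requested, missing_policy="raise"):
    """Hypothetical helper: keep the requested columns that exist."""
    if missing_policy not in {"ignore", "warn", "raise"}:
        raise ValueError(f"unknown missing_policy: {missing_policy!r}")
    missing = [col for col in requested if col not in frame_columns]
    if missing and missing_policy == "raise":
        # Assumption: the real library may raise a different exception type.
        raise ValueError(f"missing columns: {missing}")
    if missing and missing_policy == "warn":
        warnings.warn(f"missing columns are ignored: {missing}", stacklevel=2)
    # With 'ignore' and 'warn', the missing columns are skipped.
    return tuple(col for col in requested if col in frame_columns)


frame_columns = ["col1", "col2", "col3"]
kept = resolve_columns(frame_columns, ["col2", "col5"], missing_policy="ignore")
print(kept)
```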

arkas.analyzer.BaseInNLazyAnalyzer.find_columns

find_columns(frame: DataFrame) -> tuple[str, ...]

Find the columns to analyze.

Parameters:

Name Type Description Default
frame DataFrame

The input DataFrame. Sometimes the columns to analyze are found by inspecting the input DataFrame.

required

Returns:

Type Description
tuple[str, ...]

The columns to analyze.

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ColumnCooccurrenceAnalyzer
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [0, 0, 0, 0, 1, 1, 1],
...     }
... )
>>> analyzer = ColumnCooccurrenceAnalyzer(columns=["col2", "col3"])
>>> analyzer.find_columns(frame)
('col2', 'col3')
>>> analyzer = ColumnCooccurrenceAnalyzer()
>>> analyzer.find_columns(frame)
('col1', 'col2', 'col3')

arkas.analyzer.BaseInNLazyAnalyzer.find_common_columns

find_common_columns(frame: DataFrame) -> tuple[str, ...]

Find the common columns between the DataFrame columns and the input columns.

Parameters:

Name Type Description Default
frame DataFrame

The input DataFrame. Sometimes the columns to analyze are found by inspecting the input DataFrame.

required

Returns:

Type Description
tuple[str, ...]

The common columns.

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ColumnCooccurrenceAnalyzer
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [0, 0, 0, 0, 1, 1, 1],
...     }
... )
>>> analyzer = ColumnCooccurrenceAnalyzer(columns=["col2", "col3", "col5"])
>>> analyzer.find_common_columns(frame)
('col2', 'col3')
>>> analyzer = ColumnCooccurrenceAnalyzer()
>>> analyzer.find_common_columns(frame)
('col1', 'col2', 'col3')

arkas.analyzer.BaseInNLazyAnalyzer.find_missing_columns

find_missing_columns(frame: DataFrame) -> tuple[str, ...]

Find the missing columns.

Parameters:

Name Type Description Default
frame DataFrame

The input DataFrame. Sometimes the columns to analyze are found by inspecting the input DataFrame.

required

Returns:

Type Description
tuple[str, ...]

The missing columns.

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ColumnCooccurrenceAnalyzer
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [0, 0, 0, 0, 1, 1, 1],
...     }
... )
>>> analyzer = ColumnCooccurrenceAnalyzer(columns=["col2", "col3", "col5"])
>>> analyzer.find_missing_columns(frame)
('col5',)
>>> analyzer = ColumnCooccurrenceAnalyzer()
>>> analyzer.find_missing_columns(frame)
()
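
The relationship between the requested columns, the common columns, and the missing columns can be sketched with plain tuples. The three functions below are hypothetical stand-ins that reproduce the outputs shown in the examples above. They are not the arkas implementation, and the column order that arkas returns in other cases is an assumption.

```python
def find_columns(frame_columns, columns=None, exclude_columns=()):
    """Hypothetical: requested columns, or all columns if none is given."""
    selected = frame_columns if columns is None else columns
    return tuple(col for col in selected if col not in exclude_columns)


def find_common_columns(frame_columns, columns=None, exclude_columns=()):
    """Hypothetical: requested columns that exist in the DataFrame."""
    wanted = find_columns(frame_columns, columns, exclude_columns)
    return tuple(col for col in wanted if col in frame_columns)


def find_missing_columns(frame_columns, columns=None, exclude_columns=()):
    """Hypothetical: requested columns that are absent from the DataFrame."""
    wanted = find_columns(frame_columns, columns, exclude_columns)
    return tuple(col for col in wanted if col not in frame_columns)


frame_columns = ("col1", "col2", "col3")
requested = ["col2", "col3", "col5"]
common = find_common_columns(frame_columns, requested)
missing = find_missing_columns(frame_columns, requested)
print(common, missing)
```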

arkas.analyzer.BaseInNLazyAnalyzer.get_args

get_args() -> dict

Get the arguments of the analyzer.

Returns:

Type Description
dict

The arguments.

arkas.analyzer.BaseLazyAnalyzer

Bases: BaseAnalyzer

Define a base class to implement a lazy analyzer.

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import SummaryAnalyzer
>>> analyzer = SummaryAnalyzer()
>>> analyzer
SummaryAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', top=5)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
...     },
...     schema={"col1": pl.Int64, "col2": pl.Int32, "col3": pl.Float64},
... )
>>> output = analyzer.analyze(frame)
>>> output
SummaryOutput(
  (state): DataFrameState(dataframe=(7, 3), nan_policy='propagate', figure_config=MatplotlibFigureConfig(), top=5)
)

arkas.analyzer.BaseTruePredAnalyzer

Bases: BaseAnalyzer

Define a base class to implement a polars.DataFrame analyzer that takes two input columns: y_true and y_pred.

Parameters:

Name Type Description Default
y_true str

The column name of the ground truth target labels.

required
y_pred str

The column name of the predicted labels.

required
drop_nulls bool

If True, the rows with null values in y_true or y_pred columns are dropped.

required
missing_policy str

The policy on how to handle missing columns. The following options are available: 'ignore', 'warn', and 'raise'. If 'raise', an exception is raised if at least one column is missing. If 'warn', a warning is raised if at least one column is missing and the missing columns are ignored. If 'ignore', the missing columns are ignored and no warning message appears.

required

arkas.analyzer.ColumnCooccurrenceAnalyzer

Bases: BaseInNLazyAnalyzer

Implement a pairwise column co-occurrence analyzer.

Parameters:

Name Type Description Default
columns Sequence[str] | None

The columns to analyze. If None, it analyzes all the columns.

None
exclude_columns Sequence[str]

The columns to exclude from the input columns. If any column is not found, it will be ignored during the filtering process.

()
missing_policy str

The policy on how to handle missing columns. The following options are available: 'ignore', 'warn', and 'raise'. If 'raise', an exception is raised if at least one column is missing. If 'warn', a warning is raised if at least one column is missing and the missing columns are ignored. If 'ignore', the missing columns are ignored and no warning message appears.

'raise'
ignore_self bool

If True, the diagonal of the co-occurrence matrix (a.k.a. self-co-occurrence) is set to 0.

False
figure_config BaseFigureConfig | None

The figure configuration.

None

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ColumnCooccurrenceAnalyzer
>>> analyzer = ColumnCooccurrenceAnalyzer()
>>> analyzer
ColumnCooccurrenceAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', ignore_self=False, figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [0, 0, 0, 0, 1, 1, 1],
...     }
... )
>>> output = analyzer.analyze(frame)
>>> output
ColumnCooccurrenceOutput(
  (state): ColumnCooccurrenceState(matrix=(3, 3), figure_config=MatplotlibFigureConfig())
)
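
A pairwise co-occurrence matrix and the effect of ignore_self can be sketched as below. The sketch assumes that two columns co-occur in a row when both values are non-zero, which the source does not state, and cooccurrence_matrix is a hypothetical helper rather than the arkas implementation.

```python
def cooccurrence_matrix(frame, ignore_self=False):
    """Hypothetical helper: count the rows where both columns are non-zero."""
    names = list(frame)
    matrix = [[0] * len(names) for _ in names]
    for i, first in enumerate(names):
        for j, second in enumerate(names):
            if ignore_self and i == j:
                continue  # leave the diagonal (self-co-occurrence) at 0
            matrix[i][j] = sum(
                1 for x, y in zip(frame[first], frame[second]) if x and y
            )
    return matrix


frame = {
    "col1": [0, 1, 1, 0, 0, 1, 0],
    "col2": [0, 1, 0, 1, 0, 1, 0],
    "col3": [0, 0, 0, 0, 1, 1, 1],
}
full = cooccurrence_matrix(frame)
no_diagonal = cooccurrence_matrix(frame, ignore_self=True)
print(full)
print(no_diagonal)
```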

arkas.analyzer.ColumnCorrelationAnalyzer

Bases: BaseInNLazyAnalyzer

Implement an analyzer to analyze the correlation between numeric columns.

Parameters:

Name Type Description Default
columns Sequence[str] | None

The columns to analyze. If None, it analyzes all the columns.

None
exclude_columns Sequence[str]

The columns to exclude from the input columns. If any column is not found, it will be ignored during the filtering process.

()
missing_policy str

The policy on how to handle missing columns. The following options are available: 'ignore', 'warn', and 'raise'. If 'raise', an exception is raised if at least one column is missing. If 'warn', a warning is raised if at least one column is missing and the missing columns are ignored. If 'ignore', the missing columns are ignored and no warning message appears.

'raise'
nan_policy str

The policy on how to handle NaN values in the input arrays. The following options are available: 'omit', 'propagate', and 'raise'.

'propagate'
sort_metric str

The key used to sort the correlation table.

'spearman_coeff'

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ColumnCorrelationAnalyzer
>>> analyzer = ColumnCorrelationAnalyzer(target_column="col3")
>>> analyzer
ColumnCorrelationAnalyzer(target_column='col3', sort_metric='spearman_coeff', columns=None, exclude_columns=(), missing_policy='raise', nan_policy='propagate')
>>> frame = pl.DataFrame(
...     {
...         "col1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
...         "col2": [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
...     },
... )
>>> output = analyzer.analyze(frame)
>>> output
ColumnCorrelationOutput(
  (state): TargetDataFrameState(dataframe=(7, 3), target_column='col3', nan_policy='propagate', figure_config=MatplotlibFigureConfig(), sort_metric='spearman_coeff')
)
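
The default sort_metric refers to the Spearman coefficient, i.e. a correlation computed on ranks. The sketch below uses the closed-form formula that is valid only when there are no tied values, and sorts the columns in descending order of the coefficient. Both the helper name spearman_no_ties and the sort direction are assumptions, not arkas behaviour.

```python
def spearman_no_ties(x, y):
    """Hypothetical helper: Spearman coefficient for inputs without ties."""
    n = len(x)

    def ranks(values):
        order = sorted(range(n), key=lambda i: values[i])
        out = [0] * n
        for rank, index in enumerate(order, start=1):
            out[index] = rank
        return out

    # Sum of squared rank differences, then the classic no-ties formula.
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n**2 - 1))


frame = {
    "col1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "col2": [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
}
scores = {
    name: spearman_no_ties(frame[name], frame["col3"]) for name in ("col1", "col2")
}
# Assumption: the table is sorted from the highest to the lowest coefficient.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```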

arkas.analyzer.ContentAnalyzer

Bases: BaseLazyAnalyzer

Implement an analyzer that generates an output with the given custom content.

Parameters:

Name Type Description Default
content str

The content to use in the HTML code.

required

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ContentAnalyzer
>>> analyzer = ContentAnalyzer(content="meow")
>>> analyzer
ContentAnalyzer()
>>> frame = pl.DataFrame({"pred": [3, 2, 0, 1, 0, 1], "target": [3, 2, 0, 1, 0, 1]})
>>> output = analyzer.analyze(frame)
>>> output
ContentOutput(
  (content): ContentGenerator()
  (evaluator): Evaluator(count=0)
)

arkas.analyzer.ContinuousColumnAnalyzer

Bases: BaseLazyAnalyzer

Implement an analyzer that analyzes a column with continuous values.

Parameters:

Name Type Description Default
column str

The column to analyze.

required
figure_config BaseFigureConfig | None

The figure configuration.

None

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ContinuousColumnAnalyzer
>>> analyzer = ContinuousColumnAnalyzer(column="col1")
>>> analyzer
ContinuousColumnAnalyzer(column='col1', figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 0, 1],
...         "col2": [1, 0, 1, 0],
...         "col3": [1, 1, 1, 1],
...     },
...     schema={"col1": pl.Int64, "col2": pl.Int64, "col3": pl.Int64},
... )
>>> output = analyzer.analyze(frame)
>>> output
ContinuousSeriesOutput(
  (state): SeriesState(name='col1', values=(4,), figure_config=MatplotlibFigureConfig())
)

arkas.analyzer.CorrelationAnalyzer

Bases: BaseLazyAnalyzer

Implement an analyzer that analyzes the correlation between two columns.

Parameters:

Name Type Description Default
x str

The first column.

required
y str

The second column.

required
drop_nulls bool

If True, the rows with null values in x or y columns are dropped.

True
missing_policy str

The policy on how to handle missing columns. The following options are available: 'ignore', 'warn', and 'raise'. If 'raise', an exception is raised if at least one column is missing. If 'warn', a warning is raised if at least one column is missing and the missing columns are ignored. If 'ignore', the missing columns are ignored and no warning message appears.

'raise'
nan_policy str

The policy on how to handle NaN values in the input arrays. The following options are available: 'omit', 'propagate', and 'raise'.

'propagate'
figure_config BaseFigureConfig | None

The figure configuration.

None

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import CorrelationAnalyzer
>>> analyzer = CorrelationAnalyzer(x="col1", y="col2")
>>> analyzer
CorrelationAnalyzer(x='col1', y='col2', drop_nulls=True, missing_policy='raise', nan_policy='propagate', figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
...         "col2": [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
...     },
...     schema={"col1": pl.Float64, "col2": pl.Float64, "col3": pl.Float64},
... )
>>> output = analyzer.analyze(frame)
>>> output
CorrelationOutput(
  (state): TwoColumnDataFrameState(dataframe=(7, 2), column1='col1', column2='col2', nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)
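
The three nan_policy options use the same names as the nan_policy argument found in SciPy. A minimal sketch of the usual meaning of each option on a pair of columns is shown below. The apply_nan_policy helper is hypothetical, and whether arkas follows exactly this behaviour is an assumption.

```python
import math


def apply_nan_policy(x, y, nan_policy="propagate"):
    """Hypothetical helper: prepare two columns according to nan_policy."""

    def is_nan_pair(a, b):
        return math.isnan(a) or math.isnan(b)

    if nan_policy == "propagate":
        # NaN values are kept, so downstream statistics become NaN.
        return list(x), list(y)
    if nan_policy == "omit":
        # Rows with a NaN in either column are removed.
        kept = [(a, b) for a, b in zip(x, y) if not is_nan_pair(a, b)]
        return [a for a, _ in kept], [b for _, b in kept]
    if nan_policy == "raise":
        if any(is_nan_pair(a, b) for a, b in zip(x, y)):
            raise ValueError("input contains NaN")
        return list(x), list(y)
    raise ValueError(f"unknown nan_policy: {nan_policy!r}")


x = [1.0, float("nan"), 3.0]
y = [2.0, 5.0, float("nan")]
omitted = apply_nan_policy(x, y, nan_policy="omit")
propagated = apply_nan_policy(x, y, nan_policy="propagate")
print(omitted)
```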

arkas.analyzer.HexbinColumnAnalyzer

Bases: BaseLazyAnalyzer

Implement an analyzer that draws a hexbin plot of two columns.

Parameters:

Name Type Description Default
x str

The x-axis data column.

required
y str

The y-axis data column.

required
color str | None

An optional color axis data column.

None
figure_config BaseFigureConfig | None

The figure configuration.

None

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import HexbinColumnAnalyzer
>>> analyzer = HexbinColumnAnalyzer(x="col1", y="col2")
>>> analyzer
HexbinColumnAnalyzer(x='col1', y='col2', color=None, figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 0, 1],
...         "col2": [1, 0, 1, 0],
...         "col3": [1, 1, 1, 1],
...     },
...     schema={"col1": pl.Int64, "col2": pl.Int64, "col3": pl.Int64},
... )
>>> output = analyzer.analyze(frame)
>>> output
HexbinColumnOutput(
  (state): ScatterDataFrameState(dataframe=(4, 2), x='col1', y='col2', color=None, nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)

arkas.analyzer.MappingAnalyzer

Bases: BaseAnalyzer

Implement an analyzer that processes a mapping of analyzers.

Parameters:

Name Type Description Default
analyzers Mapping[str, BaseAnalyzer]

The mapping of analyzers.

required

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import (
...     MappingAnalyzer,
...     AccuracyAnalyzer,
...     BalancedAccuracyAnalyzer,
... )
>>> analyzer = MappingAnalyzer(
...     {
...         "one": AccuracyAnalyzer(y_true="target", y_pred="pred"),
...         "two": BalancedAccuracyAnalyzer(y_true="target", y_pred="pred"),
...     }
... )
>>> analyzer
MappingAnalyzer(
  (one): AccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
  (two): BalancedAccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
)
>>> frame = pl.DataFrame({"pred": [1, 0, 0, 1, 1], "target": [1, 0, 0, 1, 1]})
>>> output = analyzer.analyze(frame)
>>> output
OutputDict(count=2)
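
Processing a mapping of analyzers amounts to running each analyzer on the same DataFrame and collecting the outputs under the same keys. The sketch below uses plain callables and a dict of lists as stand-ins. The name analyze_mapping is hypothetical and the code is not the arkas implementation.

```python
def analyze_mapping(analyzers, frame):
    """Hypothetical helper: apply every analyzer to the same frame."""
    return {name: analyzer(frame) for name, analyzer in analyzers.items()}


def accuracy(frame):
    pairs = list(zip(frame["target"], frame["pred"]))
    return sum(t == p for t, p in pairs) / len(pairs)


def num_rows(frame):
    return len(frame["target"])


frame = {"pred": [1, 0, 0, 1, 1], "target": [1, 0, 0, 1, 0]}
outputs = analyze_mapping({"one": accuracy, "two": num_rows}, frame)
print(outputs)
```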

arkas.analyzer.NullValueAnalyzer

Bases: BaseInNLazyAnalyzer

Implement an analyzer that analyzes the number of null values in each column.

Parameters:

Name Type Description Default
columns Sequence[str] | None

The columns to analyze. If None, it analyzes all the columns.

None
exclude_columns Sequence[str]

The columns to exclude from the input columns. If any column is not found, it will be ignored during the filtering process.

()
missing_policy str

The policy on how to handle missing columns. The following options are available: 'ignore', 'warn', and 'raise'. If 'raise', an exception is raised if at least one column is missing. If 'warn', a warning is raised if at least one column is missing and the missing columns are ignored. If 'ignore', the missing columns are ignored and no warning message appears.

'raise'
figure_config BaseFigureConfig | None

The figure configuration.

None

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import NullValueAnalyzer
>>> analyzer = NullValueAnalyzer()
>>> analyzer
NullValueAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, None],
...         "col2": [0, 1, None, None, 0, 1, 0],
...         "col3": [None, 0, 0, 0, None, 1, None],
...     }
... )
>>> output = analyzer.analyze(frame)
>>> output
NullValueOutput(
  (state): NullValueState(num_columns=3, figure_config=MatplotlibFigureConfig())
)
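
The quantity behind a null value analysis is the number of nulls in each column. A minimal sketch on the data of the example above, with None standing for a null value, could look as follows. The null_counts helper is hypothetical and the statistics reported by arkas may differ.

```python
def null_counts(frame):
    """Hypothetical helper: number of null (None) values in each column."""
    return {
        name: sum(value is None for value in values)
        for name, values in frame.items()
    }


frame = {
    "col1": [0, 1, 1, 0, 0, 1, None],
    "col2": [0, 1, None, None, 0, 1, 0],
    "col3": [None, 0, 0, 0, None, 1, None],
}
counts = null_counts(frame)
print(counts)
```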

arkas.analyzer.NumericSummaryAnalyzer

Bases: BaseInNLazyAnalyzer

Implement an analyzer to show a summary of the numeric columns of a DataFrame.

Parameters:

Name Type Description Default
columns Sequence[str] | None

The columns to analyze. If None, it analyzes all the columns.

None
exclude_columns Sequence[str]

The columns to exclude from the input columns. If any column is not found, it will be ignored during the filtering process.

()
missing_policy str

The policy on how to handle missing columns. The following options are available: 'ignore', 'warn', and 'raise'. If 'raise', an exception is raised if at least one column is missing. If 'warn', a warning is raised if at least one column is missing and the missing columns are ignored. If 'ignore', the missing columns are ignored and no warning message appears.

'raise'

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import NumericSummaryAnalyzer
>>> analyzer = NumericSummaryAnalyzer()
>>> analyzer
NumericSummaryAnalyzer(columns=None, exclude_columns=(), missing_policy='raise')
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
...     },
...     schema={"col1": pl.Int64, "col2": pl.Int32, "col3": pl.Float64},
... )
>>> output = analyzer.analyze(frame)
>>> output
NumericSummaryOutput(
  (state): DataFrameState(dataframe=(7, 3), nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)
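
The source does not list which statistics the numeric summary contains. As an illustration only, the sketch below computes a few common ones for a single column with the standard library. The numeric_summary helper and the choice of statistics are assumptions, and the helper expects at least one non-null value.

```python
import statistics


def numeric_summary(values):
    """Hypothetical helper: a few descriptive statistics of one column."""
    present = [value for value in values if value is not None]
    return {
        "count": len(present),
        "null_count": len(values) - len(present),
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
    }


summary = numeric_summary([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
print(summary)
```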

arkas.analyzer.PlotColumnAnalyzer

Bases: BaseInNLazyAnalyzer

Implement an analyzer that plots the content of each column.

Parameters:

Name Type Description Default
columns Sequence[str] | None

The columns to analyze. If None, it analyzes all the columns.

None
exclude_columns Sequence[str]

The columns to exclude from the input columns. If any column is not found, it will be ignored during the filtering process.

()
missing_policy str

The policy on how to handle missing columns. The following options are available: 'ignore', 'warn', and 'raise'. If 'raise', an exception is raised if at least one column is missing. If 'warn', a warning is raised if at least one column is missing and the missing columns are ignored. If 'ignore', the missing columns are ignored and no warning message appears.

'raise'
figure_config BaseFigureConfig | None

The figure configuration.

None

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import PlotColumnAnalyzer
>>> analyzer = PlotColumnAnalyzer()
>>> analyzer
PlotColumnAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 0, 1],
...         "col2": [1, 0, 1, 0],
...         "col3": [1, 1, 1, 1],
...     },
...     schema={"col1": pl.Int64, "col2": pl.Int64, "col3": pl.Int64},
... )
>>> output = analyzer.analyze(frame)
>>> output
PlotColumnOutput(
  (state): DataFrameState(dataframe=(4, 3), nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)

arkas.analyzer.ScatterColumnAnalyzer

Bases: BaseLazyAnalyzer

Implement an analyzer that draws a scatter plot of two columns.

Parameters:

Name Type Description Default
x str

The x-axis data column.

required
y str

The y-axis data column.

required
color str | None

An optional color axis data column.

None
figure_config BaseFigureConfig | None

The figure configuration.

None

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ScatterColumnAnalyzer
>>> analyzer = ScatterColumnAnalyzer(x="col1", y="col2")
>>> analyzer
ScatterColumnAnalyzer(x='col1', y='col2', color=None, figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 0, 1],
...         "col2": [1, 0, 1, 0],
...         "col3": [1, 1, 1, 1],
...     },
...     schema={"col1": pl.Int64, "col2": pl.Int64, "col3": pl.Int64},
... )
>>> output = analyzer.analyze(frame)
>>> output
ScatterColumnOutput(
  (state): ScatterDataFrameState(dataframe=(4, 2), x='col1', y='col2', color=None, nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)

arkas.analyzer.SummaryAnalyzer

Bases: BaseInNLazyAnalyzer

Implement an analyzer to show a summary of a DataFrame.

Parameters:

Name Type Description Default
columns Sequence[str] | None

The columns to analyze. If None, it analyzes all the columns.

None
exclude_columns Sequence[str]

The columns to exclude from the input columns. If any column is not found, it will be ignored during the filtering process.

()
missing_policy str

The policy on how to handle missing columns. The following options are available: 'ignore', 'warn', and 'raise'. If 'raise', an exception is raised if at least one column is missing. If 'warn', a warning is raised if at least one column is missing and the missing columns are ignored. If 'ignore', the missing columns are ignored and no warning message appears.

'raise'
top int

The number of most frequent values to show.

5

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import SummaryAnalyzer
>>> analyzer = SummaryAnalyzer()
>>> analyzer
SummaryAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', top=5)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
...     },
...     schema={"col1": pl.Int64, "col2": pl.Int32, "col3": pl.Float64},
... )
>>> output = analyzer.analyze(frame)
>>> output
SummaryOutput(
  (state): DataFrameState(dataframe=(7, 3), nan_policy='propagate', figure_config=MatplotlibFigureConfig(), top=5)
)
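
The top parameter limits how many of the most frequent values are shown. The idea can be sketched with collections.Counter. The most_frequent helper is hypothetical and is not how arkas necessarily computes the summary.

```python
from collections import Counter


def most_frequent(values, top=5):
    """Hypothetical helper: the `top` most frequent values with counts."""
    return Counter(values).most_common(top)


col1 = [0, 1, 1, 0, 0, 1, 0]
print(most_frequent(col1))         # at most 5 distinct values are returned
print(most_frequent(col1, top=1))  # only the single most frequent value
```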

arkas.analyzer.TemporalContinuousColumnAnalyzer

Bases: BaseLazyAnalyzer

Implement an analyzer that analyzes the temporal distribution of a column with continuous values.

Parameters:

Name Type Description Default
target_column str

The column to analyze.

required
temporal_column str

The temporal column in the DataFrame.

required
period str | None

An optional temporal period, e.g. monthly or daily.

None
nan_policy str

The policy on how to handle NaN values in the input arrays. The following options are available: 'omit', 'propagate', and 'raise'.

'propagate'
figure_config BaseFigureConfig | None

The figure configuration.

None

Example usage:

>>> from datetime import datetime, timezone
>>> import polars as pl
>>> from arkas.analyzer import TemporalContinuousColumnAnalyzer
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 0, 1],
...         "col2": [0, 1, 2, 3],
...         "datetime": [
...             datetime(year=2020, month=1, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=2, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=3, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=4, day=3, tzinfo=timezone.utc),
...         ],
...     },
...     schema={
...         "col1": pl.Int64,
...         "col2": pl.Int64,
...         "datetime": pl.Datetime(time_unit="us", time_zone="UTC"),
...     },
... )
>>> analyzer = TemporalContinuousColumnAnalyzer(
...     target_column="col2", temporal_column="datetime"
... )
>>> analyzer
TemporalContinuousColumnAnalyzer(target_column='col2', temporal_column='datetime', period=None, nan_policy='propagate', figure_config=None)
>>> output = analyzer.analyze(frame)
>>> output
TemporalContinuousColumnOutput(
  (state): TemporalColumnState(dataframe=(4, 3), target_column='col2', temporal_column='datetime', period=None, nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)

arkas.analyzer.TemporalNullValueAnalyzer

Bases: BaseInNLazyAnalyzer

Implement an analyzer that analyzes the temporal distribution of null values in a DataFrame.

Parameters:

Name Type Description Default
temporal_column str

The temporal column in the DataFrame.

required
period str

The temporal period, e.g. monthly or daily.

required
columns Sequence[str] | None

The columns to analyze. If None, it analyzes all the columns.

None
exclude_columns Sequence[str]

The columns to exclude from the input columns. If any column is not found, it will be ignored during the filtering process.

()
missing_policy str

The policy on how to handle missing columns. The following options are available: 'ignore', 'warn', and 'raise'. If 'raise', an exception is raised if at least one column is missing. If 'warn', a warning is raised if at least one column is missing and the missing columns are ignored. If 'ignore', the missing columns are ignored and no warning message appears.

'raise'
figure_config BaseFigureConfig | None

The figure configuration.

None

Example usage:

>>> from datetime import datetime, timezone
>>> import polars as pl
>>> from arkas.analyzer import TemporalNullValueAnalyzer
>>> analyzer = TemporalNullValueAnalyzer(temporal_column="datetime", period="1d")
>>> analyzer
TemporalNullValueAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', temporal_column='datetime', period='1d', figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0],
...         "col2": [0, 1, 0, 1],
...         "col3": [1, 0, 0, 0],
...         "datetime": [
...             datetime(year=2020, month=1, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=2, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=3, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=4, day=3, tzinfo=timezone.utc),
...         ],
...     },
...     schema={
...         "col1": pl.Int64,
...         "col2": pl.Int64,
...         "col3": pl.Int64,
...         "datetime": pl.Datetime(time_unit="us", time_zone="UTC"),
...     },
... )
>>> output = analyzer.analyze(frame)
>>> output
TemporalNullValueOutput(
  (state): TemporalDataFrameState(dataframe=(4, 4), temporal_column='datetime', period='1d', nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)
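
Analyzing null values over time means grouping the rows by temporal period and counting the nulls in each group. The sketch below hard-codes a monthly period and uses None for nulls. The nulls_per_month helper is hypothetical, and it does not parse period strings such as '1d'.

```python
from collections import defaultdict
from datetime import datetime, timezone


def nulls_per_month(values, timestamps):
    """Hypothetical helper: (null count, row count) for each (year, month)."""
    counts = defaultdict(lambda: [0, 0])
    for value, timestamp in zip(values, timestamps):
        bucket = (timestamp.year, timestamp.month)
        counts[bucket][0] += int(value is None)
        counts[bucket][1] += 1
    return {bucket: tuple(pair) for bucket, pair in sorted(counts.items())}


timestamps = [
    datetime(year=2020, month=1, day=3, tzinfo=timezone.utc),
    datetime(year=2020, month=1, day=20, tzinfo=timezone.utc),
    datetime(year=2020, month=2, day=3, tzinfo=timezone.utc),
    datetime(year=2020, month=2, day=10, tzinfo=timezone.utc),
]
per_month = nulls_per_month([None, 1, None, None], timestamps)
print(per_month)
```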

arkas.analyzer.TemporalPlotColumnAnalyzer

Bases: BaseInNLazyAnalyzer

Implement an analyzer that plots the content of each column over time.

Parameters:

Name Type Description Default
temporal_column str

The temporal column in the DataFrame.

required
period str | None

An optional temporal period, e.g. monthly or daily.

None
columns Sequence[str] | None

The columns to analyze. If None, it analyzes all the columns.

None
exclude_columns Sequence[str]

The columns to exclude from the input columns. If any column is not found, it will be ignored during the filtering process.

()
missing_policy str

The policy on how to handle missing columns. The following options are available: 'ignore', 'warn', and 'raise'. If 'raise', an exception is raised if at least one column is missing. If 'warn', a warning is raised if at least one column is missing and the missing columns are ignored. If 'ignore', the missing columns are ignored and no warning message appears.

'raise'
figure_config BaseFigureConfig | None

The figure configuration.

None

Example usage:

>>> from datetime import datetime, timezone
>>> import polars as pl
>>> from arkas.analyzer import TemporalPlotColumnAnalyzer
>>> analyzer = TemporalPlotColumnAnalyzer(temporal_column="datetime")
>>> analyzer
TemporalPlotColumnAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', temporal_column='datetime', period=None, figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0],
...         "col2": [0, 1, 0, 1],
...         "col3": [1, 0, 0, 0],
...         "datetime": [
...             datetime(year=2020, month=1, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=2, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=3, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=4, day=3, tzinfo=timezone.utc),
...         ],
...     },
...     schema={
...         "col1": pl.Int64,
...         "col2": pl.Int64,
...         "col3": pl.Int64,
...         "datetime": pl.Datetime(time_unit="us", time_zone="UTC"),
...     },
... )
>>> output = analyzer.analyze(frame)
>>> output
TemporalPlotColumnOutput(
  (state): TemporalDataFrameState(dataframe=(4, 4), temporal_column='datetime', period=None, nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)

arkas.analyzer.TransformAnalyzer

Bases: BaseAnalyzer

Implement an analyzer that transforms the data before analyzing it.

Parameters:

Name Type Description Default
transformer BaseTransformer | dict

The transformer or its configuration.

required
analyzer BaseAnalyzer | dict

The analyzer or its configuration.

required

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import AccuracyAnalyzer, TransformAnalyzer
>>> from grizz.transformer import DropNullRow
>>> analyzer = TransformAnalyzer(
...     transformer=DropNullRow(), analyzer=AccuracyAnalyzer(y_true="target", y_pred="pred")
... )
>>> analyzer
TransformAnalyzer(
  (transformer): DropNullRowTransformer(columns=None, exclude_columns=(), missing_policy='raise')
  (analyzer): AccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
)
>>> frame = pl.DataFrame(
...     {"pred": [3, 2, 0, 1, 0, 1, None], "target": [3, 2, 0, 1, 0, 1, None]}
... )
>>> output = analyzer.analyze(frame)
>>> output
AccuracyOutput(
  (state): AccuracyState(y_true=(6,), y_pred=(6,), y_true_name='target', y_pred_name='pred', nan_policy='propagate')
)
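
The composition performed here, a transformation followed by an analysis of the transformed data, can be sketched with two plain functions. All names below are hypothetical stand-ins that operate on a dict of lists. They are not the grizz or arkas implementations.

```python
def drop_null_rows(frame):
    """Hypothetical transformer: remove rows that contain a null (None)."""
    names = list(frame)
    rows = [
        row for row in zip(*frame.values()) if all(v is not None for v in row)
    ]
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def accuracy(frame):
    """Hypothetical analyzer: fraction of rows where pred equals target."""
    pairs = list(zip(frame["target"], frame["pred"]))
    return sum(t == p for t, p in pairs) / len(pairs)


def transform_then_analyze(transformer, analyzer, frame):
    # The analyzer only ever sees the transformed data.
    return analyzer(transformer(frame))


frame = {"pred": [3, 2, 0, 1, 0, 1, None], "target": [3, 2, 0, 1, 0, 1, None]}
cleaned = drop_null_rows(frame)
result = transform_then_analyze(drop_null_rows, accuracy, frame)
print(len(cleaned["pred"]), result)
```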

arkas.analyzer.is_analyzer_config

is_analyzer_config(config: dict) -> bool

Indicate if the input configuration is a configuration for a BaseAnalyzer.

This function only checks if the value of the key _target_ is valid. It does not check the other values. If _target_ indicates a function, the function's return type hint is used to check the class.

Parameters:

Name Type Description Default
config dict

The configuration to check.

required

Returns:

Type Description
bool

True if the input configuration is a configuration for a BaseAnalyzer object.

Example usage:

>>> from arkas.analyzer import is_analyzer_config
>>> is_analyzer_config({"_target_": "arkas.analyzer.AccuracyAnalyzer"})
True
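
The kind of check described above, looking only at the _target_ key, can be sketched with importlib and a standard-library base class. The helper is_config_for is hypothetical. It does not handle the case where _target_ is a function, and it is not how arkas resolves targets.

```python
import importlib
from collections.abc import Mapping


def is_config_for(config, base):
    """Hypothetical helper: does `_target_` name a subclass of `base`?"""
    target = config.get("_target_")
    if not isinstance(target, str):
        return False
    module_name, _, attr = target.rpartition(".")
    try:
        obj = getattr(importlib.import_module(module_name), attr)
    except (ImportError, AttributeError, ValueError):
        return False  # the dotted path cannot be resolved
    # The other keys of the configuration are deliberately not inspected.
    return isinstance(obj, type) and issubclass(obj, base)


print(is_config_for({"_target_": "collections.OrderedDict"}, Mapping))
print(is_config_for({"_target_": "fractions.Fraction"}, Mapping))
```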

arkas.analyzer.setup_analyzer

setup_analyzer(
    analyzer: BaseAnalyzer | dict,
) -> BaseAnalyzer

Set up an analyzer.

The analyzer is instantiated from its configuration by using the BaseAnalyzer factory function.

Parameters:

Name Type Description Default
analyzer BaseAnalyzer | dict

An analyzer or its configuration.

required

Returns:

Type Description
BaseAnalyzer

An instantiated analyzer.

Example usage:

>>> from arkas.analyzer import setup_analyzer
>>> analyzer = setup_analyzer(
...     {
...         "_target_": "arkas.analyzer.AccuracyAnalyzer",
...         "y_true": "target",
...         "y_pred": "pred",
...     }
... )
>>> analyzer
AccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
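
Instantiating an object from a configuration with a _target_ key can be sketched as below. The helper setup_object is hypothetical, and datetime.timedelta is used only as a convenient standard-library target. Returning a non-dict input unchanged is an assumption, and the code is not the factory used by arkas.

```python
import importlib
from datetime import timedelta


def setup_object(obj_or_config):
    """Hypothetical helper: build an object from a `_target_` configuration."""
    if not isinstance(obj_or_config, dict):
        # Assumption: an already instantiated object is returned unchanged.
        return obj_or_config
    kwargs = dict(obj_or_config)  # copy so the input config is not modified
    module_name, _, attr = kwargs.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_name), attr)
    # The remaining keys are passed as keyword arguments.
    return cls(**kwargs)


config = {"_target_": "datetime.timedelta", "days": 1, "hours": 12}
built = setup_object(config)
same = setup_object(built)
print(built)
```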