arkas.analyzer ¶

Contain DataFrame analyzers.

arkas.analyzer.AccuracyAnalyzer ¶

Bases: BaseTruePredAnalyzer

Implement the accuracy analyzer.

Parameters:

Name	Type	Description	Default
`y_true`	`str`	The column name of the ground truth target labels.	required
`y_pred`	`str`	The column name of the predicted labels.	required
`drop_nulls`	`bool`	If `True`, the rows with null values in `y_true` or `y_pred` columns are dropped.	`True`
`missing_policy`	`str`	The policy on how to handle missing columns. The following options are available: `'ignore'`, `'warn'`, and `'raise'`. If `'raise'`, an exception is raised if at least one column is missing. If `'warn'`, a warning is raised if at least one column is missing and the missing columns are ignored. If `'ignore'`, the missing columns are ignored and no warning message appears.	`'raise'`
`nan_policy`	`str`	The policy on how to handle NaN values in the input arrays. The following options are available: `'omit'`, `'propagate'`, and `'raise'`.	`'propagate'`

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import AccuracyAnalyzer
>>> analyzer = AccuracyAnalyzer(y_true="target", y_pred="pred")
>>> analyzer
AccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
>>> frame = pl.DataFrame({"pred": [3, 2, 0, 1, 0, 1], "target": [3, 2, 0, 1, 0, 1]})
>>> output = analyzer.analyze(frame)
>>> output
AccuracyOutput(
  (state): AccuracyState(y_true=(6,), y_pred=(6,), y_true_name='target', y_pred_name='pred', nan_policy='propagate')
)
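
For intuition about the quantity this analyzer reports, here is a minimal plain-Python sketch of accuracy. It is not the library's implementation: the helper name `accuracy` is hypothetical, and the `None` filtering only mimics the documented `drop_nulls=True` behaviour.

```python
def accuracy(y_true, y_pred, drop_nulls=True):
    """Return the fraction of rows where the prediction equals the target."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    if drop_nulls:
        # Drop every row where either value is null (None in this sketch).
        pairs = [(t, p) for t, p in pairs if t is not None and p is not None]
    if not pairs:
        return float("nan")
    return sum(t == p for t, p in pairs) / len(pairs)


target = [3, 2, 0, 1, 0, 1]
pred = [3, 2, 0, 1, 0, 1]
print(accuracy(target, pred))  # 1.0
print(accuracy([1, 0, None, 1], [1, 1, 0, None]))  # 0.5
```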

arkas.analyzer.BalancedAccuracyAnalyzer ¶

Bases: BaseTruePredAnalyzer

Implement the balanced accuracy analyzer.

Parameters:

Name	Type	Description	Default
`y_true`	`str`	The column name of the ground truth target labels.	required
`y_pred`	`str`	The column name of the predicted labels.	required
`drop_nulls`	`bool`	If `True`, the rows with null values in `y_true` or `y_pred` columns are dropped.	`True`
`missing_policy`	`str`	The policy on how to handle missing columns. The following options are available: `'ignore'`, `'warn'`, and `'raise'`. If `'raise'`, an exception is raised if at least one column is missing. If `'warn'`, a warning is raised if at least one column is missing and the missing columns are ignored. If `'ignore'`, the missing columns are ignored and no warning message appears.	`'raise'`
`nan_policy`	`str`	The policy on how to handle NaN values in the input arrays. The following options are available: `'omit'`, `'propagate'`, and `'raise'`.	`'propagate'`

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import BalancedAccuracyAnalyzer
>>> analyzer = BalancedAccuracyAnalyzer(y_true="target", y_pred="pred")
>>> analyzer
BalancedAccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
>>> frame = pl.DataFrame({"pred": [3, 2, 0, 1, 0, 1], "target": [3, 2, 0, 1, 0, 1]})
>>> output = analyzer.analyze(frame)
>>> output
BalancedAccuracyOutput(
  (state): AccuracyState(y_true=(6,), y_pred=(6,), y_true_name='target', y_pred_name='pred', nan_policy='propagate')
)
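
Balanced accuracy is commonly defined as the mean of the per-class recalls, which makes it less sensitive to class imbalance than plain accuracy. The sketch below assumes that definition; the helper name `balanced_accuracy` is hypothetical and the code is not taken from the library.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Average the recall obtained on each class present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Illustrative imbalanced data: class 0 is fully correct, class 1 is half correct.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

On the same data, plain accuracy would be 5/6, which hides the weaker performance on the minority class.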

arkas.analyzer.BaseAnalyzer ¶

Bases: ABC

Define the base class to analyze a DataFrame.

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import AccuracyAnalyzer
>>> analyzer = AccuracyAnalyzer(y_true="target", y_pred="pred")
>>> analyzer
AccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
>>> data = pl.DataFrame({"pred": [3, 2, 0, 1, 0, 1], "target": [3, 2, 0, 1, 0, 1]})
>>> output = analyzer.analyze(data)
>>> output
AccuracyOutput(
  (state): AccuracyState(y_true=(6,), y_pred=(6,), y_true_name='target', y_pred_name='pred', nan_policy='propagate')
)

arkas.analyzer.BaseAnalyzer.analyze `abstractmethod` ¶

analyze(frame: DataFrame, lazy: bool = True) -> BaseOutput

Analyze the DataFrame.

Parameters:

Name	Type	Description	Default
`frame`	`DataFrame`	The DataFrame to analyze.	required
`lazy`	`bool`	If `True`, it returns an output object that contains the logic and defers the computation. If `False`, it forces the computation of the output.	`True`

Returns:

Type	Description
`BaseOutput`	The generated output.

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import AccuracyAnalyzer
>>> analyzer = AccuracyAnalyzer(y_true="target", y_pred="pred")
>>> data = pl.DataFrame({"pred": [3, 2, 0, 1, 0, 1], "target": [3, 2, 0, 1, 0, 1]})
>>> output = analyzer.analyze(data)
>>> output
AccuracyOutput(
  (state): AccuracyState(y_true=(6,), y_pred=(6,), y_true_name='target', y_pred_name='pred', nan_policy='propagate')
)

arkas.analyzer.BaseInNLazyAnalyzer ¶

Bases: BaseAnalyzer

Define a base class to implement analyzers that analyze DataFrames by using multiple input columns.

Parameters:

Name	Type	Description	Default
`columns`	`Sequence[str] \| None`	The columns to analyze. If `None`, it analyzes all the columns.	`None`
`exclude_columns`	`Sequence[str]`	The columns to exclude from the input `columns`. If any column is not found, it will be ignored during the filtering process.	`()`
`missing_policy`	`str`	The policy on how to handle missing columns. The following options are available: `'ignore'`, `'warn'`, and `'raise'`. If `'raise'`, an exception is raised if at least one column is missing. If `'warn'`, a warning is raised if at least one column is missing and the missing columns are ignored. If `'ignore'`, the missing columns are ignored and no warning message appears.	`'raise'`

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ColumnCooccurrenceAnalyzer
>>> analyzer = ColumnCooccurrenceAnalyzer()
>>> analyzer
ColumnCooccurrenceAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', ignore_self=False, figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [0, 0, 0, 0, 1, 1, 1],
...     }
... )
>>> output = analyzer.analyze(frame)
>>> output
ColumnCooccurrenceOutput(
  (state): ColumnCooccurrenceState(matrix=(3, 3), figure_config=MatplotlibFigureConfig())
)

arkas.analyzer.BaseInNLazyAnalyzer.find_columns ¶

find_columns(frame: DataFrame) -> tuple[str, ...]

Find the columns to analyze.

Parameters:

Name	Type	Description	Default
`frame`	`DataFrame`	The input DataFrame. Sometimes the columns to analyze are found by inspecting the input DataFrame.	required

Returns:

Type	Description
`tuple[str, ...]`	The columns to analyze.

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ColumnCooccurrenceAnalyzer
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [0, 0, 0, 0, 1, 1, 1],
...     }
... )
>>> analyzer = ColumnCooccurrenceAnalyzer(columns=["col2", "col3"])
>>> analyzer.find_columns(frame)
('col2', 'col3')
>>> analyzer = ColumnCooccurrenceAnalyzer()
>>> analyzer.find_columns(frame)
('col1', 'col2', 'col3')
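
The selection rule described by the `columns` and `exclude_columns` parameters can be sketched in plain Python. This is an assumption-based illustration, not the library's code: the function below is a hypothetical stand-alone helper that works on a list of column names instead of a DataFrame.

```python
def find_columns(frame_columns, columns=None, exclude_columns=()):
    """Select the columns to process from the DataFrame column names."""
    # columns=None means "use every column of the DataFrame".
    selected = list(frame_columns) if columns is None else list(columns)
    # Excluded names that do not exist are simply ignored by the filter.
    excluded = set(exclude_columns)
    return tuple(col for col in selected if col not in excluded)


frame_columns = ["col1", "col2", "col3"]
print(find_columns(frame_columns, columns=["col2", "col3"]))  # ('col2', 'col3')
print(find_columns(frame_columns))  # ('col1', 'col2', 'col3')
print(find_columns(frame_columns, exclude_columns=["col1", "col9"]))  # ('col2', 'col3')
```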

arkas.analyzer.BaseInNLazyAnalyzer.find_common_columns ¶

find_common_columns(frame: DataFrame) -> tuple[str, ...]

Find the common columns between the DataFrame columns and the input columns.

Parameters:

Name	Type	Description	Default
`frame`	`DataFrame`	The input DataFrame. Sometimes the columns to analyze are found by inspecting the input DataFrame.	required

Returns:

Type	Description
`tuple[str, ...]`	The common columns.

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ColumnCooccurrenceAnalyzer
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [0, 0, 0, 0, 1, 1, 1],
...     }
... )
>>> analyzer = ColumnCooccurrenceAnalyzer(columns=["col2", "col3", "col5"])
>>> analyzer.find_common_columns(frame)
('col2', 'col3')
>>> analyzer = ColumnCooccurrenceAnalyzer()
>>> analyzer.find_common_columns(frame)
('col1', 'col2', 'col3')

arkas.analyzer.BaseInNLazyAnalyzer.find_missing_columns ¶

find_missing_columns(frame: DataFrame) -> tuple[str, ...]

Find the missing columns.

Parameters:

Name	Type	Description	Default
`frame`	`DataFrame`	The input DataFrame. Sometimes the columns to analyze are found by inspecting the input DataFrame.	required

Returns:

Type	Description
`tuple[str, ...]`	The missing columns.

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ColumnCooccurrenceAnalyzer
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [0, 0, 0, 0, 1, 1, 1],
...     }
... )
>>> analyzer = ColumnCooccurrenceAnalyzer(columns=["col2", "col3", "col5"])
>>> analyzer.find_missing_columns(frame)
('col5',)
>>> analyzer = ColumnCooccurrenceAnalyzer()
>>> analyzer.find_missing_columns(frame)
()
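
The relationship between common columns, missing columns, and `missing_policy` can be sketched as follows. The helper names `split_columns` and `check_missing_columns` are hypothetical, and the exception and warning types are assumptions; the library may use different ones.

```python
import warnings


def split_columns(frame_columns, columns):
    """Split the requested columns into (common, missing) tuples."""
    present = set(frame_columns)
    common = tuple(col for col in columns if col in present)
    missing = tuple(col for col in columns if col not in present)
    return common, missing


def check_missing_columns(missing, missing_policy="raise"):
    """Apply the documented 'ignore' / 'warn' / 'raise' policy."""
    if missing_policy not in {"ignore", "warn", "raise"}:
        raise ValueError(f"unknown missing_policy: {missing_policy!r}")
    if not missing or missing_policy == "ignore":
        return
    message = f"{len(missing)} column(s) are missing in the DataFrame: {missing}"
    if missing_policy == "warn":
        # Warn, then carry on without the missing columns.
        warnings.warn(message, RuntimeWarning, stacklevel=2)
        return
    # Assumption: KeyError is used here only for illustration.
    raise KeyError(message)


frame_columns = ["col1", "col2", "col3"]
common, missing = split_columns(frame_columns, ["col2", "col3", "col5"])
print(common)  # ('col2', 'col3')
print(missing)  # ('col5',)
check_missing_columns(missing, missing_policy="ignore")  # continues silently
```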

arkas.analyzer.BaseInNLazyAnalyzer.get_args ¶

get_args() -> dict

Get the arguments of the analyzer.

Returns:

Type	Description
`dict`	The arguments.

arkas.analyzer.BaseLazyAnalyzer ¶

Bases: BaseAnalyzer

Define a base class to implement a lazy analyzer.

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import SummaryAnalyzer
>>> analyzer = SummaryAnalyzer()
>>> analyzer
SummaryAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', top=5)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
...     },
...     schema={"col1": pl.Int64, "col2": pl.Int32, "col3": pl.Float64},
... )
>>> output = analyzer.analyze(frame)
>>> output
SummaryOutput(
  (state): DataFrameState(dataframe=(7, 3), nan_policy='propagate', figure_config=MatplotlibFigureConfig(), top=5)
)

arkas.analyzer.BaseTruePredAnalyzer ¶

Bases: BaseAnalyzer

Define a base class to implement a `polars.DataFrame` analyzer that takes two input columns: `y_true` and `y_pred`.

Parameters:

Name	Type	Description	Default
`y_true`	`str`	The column name of the ground truth target labels.	required
`y_pred`	`str`	The column name of the predicted labels.	required
`drop_nulls`	`bool`	If `True`, the rows with null values in `y_true` or `y_pred` columns are dropped.	required
`missing_policy`	`str`	The policy on how to handle missing columns. The following options are available: `'ignore'`, `'warn'`, and `'raise'`. If `'raise'`, an exception is raised if at least one column is missing. If `'warn'`, a warning is raised if at least one column is missing and the missing columns are ignored. If `'ignore'`, the missing columns are ignored and no warning message appears.	required

arkas.analyzer.ColumnCooccurrenceAnalyzer ¶

Bases: BaseInNLazyAnalyzer

Implement a pairwise column co-occurrence analyzer.

Parameters:

Name	Type	Description	Default
`columns`	`Sequence[str] \| None`	The columns to analyze. If `None`, it analyzes all the columns.	`None`
`exclude_columns`	`Sequence[str]`	The columns to exclude from the input `columns`. If any column is not found, it will be ignored during the filtering process.	`()`
`missing_policy`	`str`	The policy on how to handle missing columns. The following options are available: `'ignore'`, `'warn'`, and `'raise'`. If `'raise'`, an exception is raised if at least one column is missing. If `'warn'`, a warning is raised if at least one column is missing and the missing columns are ignored. If `'ignore'`, the missing columns are ignored and no warning message appears.	`'raise'`
`ignore_self`	`bool`	If `True`, the diagonal of the co-occurrence matrix (a.k.a. self-co-occurrence) is set to 0.	`False`
`figure_config`	`BaseFigureConfig \| None`	The figure configuration.	`None`

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ColumnCooccurrenceAnalyzer
>>> analyzer = ColumnCooccurrenceAnalyzer()
>>> analyzer
ColumnCooccurrenceAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', ignore_self=False, figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [0, 0, 0, 0, 1, 1, 1],
...     }
... )
>>> output = analyzer.analyze(frame)
>>> output
ColumnCooccurrenceOutput(
  (state): ColumnCooccurrenceState(matrix=(3, 3), figure_config=MatplotlibFigureConfig())
)
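
To make the `matrix=(3, 3)` state more concrete, the sketch below computes a pairwise co-occurrence matrix in plain Python. It assumes that a co-occurrence is a row where both columns are non-zero; the helper name `cooccurrence_matrix` is hypothetical and the library may compute the matrix differently.

```python
def cooccurrence_matrix(columns, ignore_self=False):
    """Count, for each pair of columns, the rows where both are non-zero."""
    names = list(columns)
    matrix = []
    for i, first in enumerate(names):
        row = []
        for j, second in enumerate(names):
            if ignore_self and i == j:
                # Mirror the documented option: zero the diagonal.
                row.append(0)
            else:
                pairs = zip(columns[first], columns[second])
                row.append(sum(1 for a, b in pairs if a and b))
        matrix.append(row)
    return matrix


data = {
    "col1": [0, 1, 1, 0, 0, 1, 0],
    "col2": [0, 1, 0, 1, 0, 1, 0],
    "col3": [0, 0, 0, 0, 1, 1, 1],
}
print(cooccurrence_matrix(data))  # [[3, 2, 1], [2, 3, 1], [1, 1, 3]]
print(cooccurrence_matrix(data, ignore_self=True))  # [[0, 2, 1], [2, 0, 1], [1, 1, 0]]
```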

arkas.analyzer.ColumnCorrelationAnalyzer ¶

Bases: BaseInNLazyAnalyzer

Implement an analyzer to analyze the correlation between numeric columns.

Parameters:

Name	Type	Description	Default
`target_column`	`str`	The column used as the target when computing the correlations.	required
`columns`	`Sequence[str] \| None`	The columns to analyze. If `None`, it analyzes all the columns.	`None`
`exclude_columns`	`Sequence[str]`	The columns to exclude from the input `columns`. If any column is not found, it will be ignored during the filtering process.	`()`
`missing_policy`	`str`	The policy on how to handle missing columns. The following options are available: `'ignore'`, `'warn'`, and `'raise'`. If `'raise'`, an exception is raised if at least one column is missing. If `'warn'`, a warning is raised if at least one column is missing and the missing columns are ignored. If `'ignore'`, the missing columns are ignored and no warning message appears.	`'raise'`
`nan_policy`	`str`	The policy on how to handle NaN values in the input arrays. The following options are available: `'omit'`, `'propagate'`, and `'raise'`.	`'propagate'`
`sort_metric`	`str`	The key used to sort the correlation table.	`'spearman_coeff'`

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ColumnCorrelationAnalyzer
>>> analyzer = ColumnCorrelationAnalyzer(target_column="col3")
>>> analyzer
ColumnCorrelationAnalyzer(target_column='col3', sort_metric='spearman_coeff', columns=None, exclude_columns=(), missing_policy='raise', nan_policy='propagate')
>>> frame = pl.DataFrame(
...     {
...         "col1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
...         "col2": [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
...     },
... )
>>> output = analyzer.analyze(frame)
>>> output
ColumnCorrelationOutput(
  (state): TargetDataFrameState(dataframe=(7, 3), target_column='col3', nan_policy='propagate', figure_config=MatplotlibFigureConfig(), sort_metric='spearman_coeff')
)
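
The default `sort_metric` name suggests that the correlation table includes a Spearman coefficient. As a rough illustration, the sketch below computes Pearson and Spearman coefficients of each column against a target in plain Python. The helper names, the `pearson_coeff` key, and the descending sort order are assumptions made for this sketch, not details confirmed by the library.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return float("nan")  # undefined for constant input
    return cov / (std_x * std_y)


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


columns = {
    "col1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "col2": [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
}
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  # plays the role of "col3"

table = [
    {
        "column": name,
        "pearson_coeff": pearson(values, target),
        "spearman_coeff": spearman(values, target),
    }
    for name, values in columns.items()
]
# Assumption: sort in descending order of the chosen metric.
table.sort(key=lambda row: row["spearman_coeff"], reverse=True)
for row in table:
    print(row["column"], round(row["pearson_coeff"], 6), round(row["spearman_coeff"], 6))
```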

arkas.analyzer.ContentAnalyzer ¶

Bases: BaseLazyAnalyzer

Implement an analyzer that generates an output with the given custom content.

Parameters:

Name	Type	Description	Default
`content`	`str`	The content to use in the HTML code.	required

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ContentAnalyzer
>>> analyzer = ContentAnalyzer(content="meow")
>>> analyzer
ContentAnalyzer()
>>> frame = pl.DataFrame({"pred": [3, 2, 0, 1, 0, 1], "target": [3, 2, 0, 1, 0, 1]})
>>> output = analyzer.analyze(frame)
>>> output
ContentOutput(
  (content): ContentGenerator()
  (evaluator): Evaluator(count=0)
)

arkas.analyzer.ContinuousColumnAnalyzer ¶

Bases: BaseLazyAnalyzer

Implement an analyzer that analyzes a column with continuous values.

Parameters:

Name	Type	Description	Default
`column`	`str`	The column to analyze.	required
`figure_config`	`BaseFigureConfig \| None`	The figure configuration.	`None`

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ContinuousColumnAnalyzer
>>> analyzer = ContinuousColumnAnalyzer(column="col1")
>>> analyzer
ContinuousColumnAnalyzer(column='col1', figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 0, 1],
...         "col2": [1, 0, 1, 0],
...         "col3": [1, 1, 1, 1],
...     },
...     schema={"col1": pl.Int64, "col2": pl.Int64, "col3": pl.Int64},
... )
>>> output = analyzer.analyze(frame)
>>> output
ContinuousSeriesOutput(
  (state): SeriesState(name='col1', values=(4,), figure_config=MatplotlibFigureConfig())
)

arkas.analyzer.CorrelationAnalyzer ¶

Bases: BaseLazyAnalyzer

Implement an analyzer that analyzes the correlation between two columns.

Parameters:

Name	Type	Description	Default
`x`	`str`	The first column.	required
`y`	`str`	The second column.	required
`drop_nulls`	`bool`	If `True`, the rows with null values in `x` or `y` columns are dropped.	`True`
`missing_policy`	`str`	The policy on how to handle missing columns. The following options are available: `'ignore'`, `'warn'`, and `'raise'`. If `'raise'`, an exception is raised if at least one column is missing. If `'warn'`, a warning is raised if at least one column is missing and the missing columns are ignored. If `'ignore'`, the missing columns are ignored and no warning message appears.	`'raise'`
`nan_policy`	`str`	The policy on how to handle NaN values in the input arrays. The following options are available: `'omit'`, `'propagate'`, and `'raise'`.	`'propagate'`
`figure_config`	`BaseFigureConfig \| None`	The figure configuration.	`None`

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import CorrelationAnalyzer
>>> analyzer = CorrelationAnalyzer(x="col1", y="col2")
>>> analyzer
CorrelationAnalyzer(x='col1', y='col2', drop_nulls=True, missing_policy='raise', nan_policy='propagate', figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
...         "col2": [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
...     },
...     schema={"col1": pl.Float64, "col2": pl.Float64, "col3": pl.Float64},
... )
>>> output = analyzer.analyze(frame)
>>> output
CorrelationOutput(
  (state): TwoColumnDataFrameState(dataframe=(7, 2), column1='col1', column2='col2', nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)
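
The `nan_policy` options are only listed by name above. The sketch below shows one plausible reading, borrowed from the convention used by other scientific Python libraries: `'omit'` drops rows containing NaN, `'propagate'` leaves them in place so that downstream results become NaN, and `'raise'` fails on the first NaN. The helper `apply_nan_policy` is hypothetical, and the library's exact semantics may differ.

```python
import math


def apply_nan_policy(x, y, nan_policy="propagate"):
    """Return the (x, y) values to use after applying the NaN policy."""
    if nan_policy not in {"omit", "propagate", "raise"}:
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")
    has_nan = [math.isnan(a) or math.isnan(b) for a, b in zip(x, y)]
    if nan_policy == "raise" and any(has_nan):
        raise ValueError("input contains NaN")
    if nan_policy == "omit":
        # Keep only the rows where both values are valid numbers.
        kept = [(a, b) for a, b, bad in zip(x, y, has_nan) if not bad]
        return [a for a, _ in kept], [b for _, b in kept]
    # 'propagate' (or 'raise' without NaN): leave the data untouched.
    return list(x), list(y)


x = [1.0, float("nan"), 3.0]
y = [2.0, 5.0, float("nan")]
print(apply_nan_policy(x, y, nan_policy="omit"))  # ([1.0], [2.0])
```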

arkas.analyzer.HexbinColumnAnalyzer ¶

Bases: BaseLazyAnalyzer

Implement an analyzer that draws a hexbin plot of two columns, optionally colored by a third column.

Parameters:

Name	Type	Description	Default
`x`	`str`	The x-axis data column.	required
`y`	`str`	The y-axis data column.	required
`color`	`str \| None`	An optional color axis data column.	`None`
`figure_config`	`BaseFigureConfig \| None`	The figure configuration.	`None`

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import HexbinColumnAnalyzer
>>> analyzer = HexbinColumnAnalyzer(x="col1", y="col2")
>>> analyzer
HexbinColumnAnalyzer(x='col1', y='col2', color=None, figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 0, 1],
...         "col2": [1, 0, 1, 0],
...         "col3": [1, 1, 1, 1],
...     },
...     schema={"col1": pl.Int64, "col2": pl.Int64, "col3": pl.Int64},
... )
>>> output = analyzer.analyze(frame)
>>> output
HexbinColumnOutput(
  (state): ScatterDataFrameState(dataframe=(4, 2), x='col1', y='col2', color=None, nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)

arkas.analyzer.MappingAnalyzer ¶

Bases: BaseAnalyzer

Implement an analyzer that processes a mapping of analyzers.

Parameters:

Name	Type	Description	Default
`analyzers`	`Mapping[str, BaseAnalyzer]`	The mapping of analyzers.	required

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import (
...     MappingAnalyzer,
...     AccuracyAnalyzer,
...     BalancedAccuracyAnalyzer,
... )
>>> analyzer = MappingAnalyzer(
...     {
...         "one": AccuracyAnalyzer(y_true="target", y_pred="pred"),
...         "two": BalancedAccuracyAnalyzer(y_true="target", y_pred="pred"),
...     }
... )
>>> analyzer
MappingAnalyzer(
  (one): AccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
  (two): BalancedAccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
)
>>> frame = pl.DataFrame({"pred": [1, 0, 0, 1, 1], "target": [1, 0, 0, 1, 1]})
>>> output = analyzer.analyze(frame)
>>> output
OutputDict(count=2)
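
Conceptually, a mapping of analyzers applies each analyzer to the same data and collects the results under the same keys. The sketch below illustrates that idea with plain functions; `analyze_mapping`, `accuracy`, and `num_rows` are hypothetical stand-ins, not library objects.

```python
def accuracy(data):
    """Fraction of rows where the prediction equals the target."""
    pairs = list(zip(data["target"], data["pred"]))
    return sum(t == p for t, p in pairs) / len(pairs)


def num_rows(data):
    """Number of rows in the data."""
    return len(data["target"])


def analyze_mapping(analyzers, data):
    """Run every analyzer on the same data and keep the mapping keys."""
    return {key: analyzer(data) for key, analyzer in analyzers.items()}


data = {"pred": [1, 0, 0, 1, 1], "target": [1, 0, 0, 1, 1]}
outputs = analyze_mapping({"one": accuracy, "two": num_rows}, data)
print(outputs)  # {'one': 1.0, 'two': 5}
```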

arkas.analyzer.NullValueAnalyzer ¶

Bases: BaseInNLazyAnalyzer

Implement an analyzer that analyzes the null values in each column.

Parameters:

Name	Type	Description	Default
`columns`	`Sequence[str] \| None`	The columns to analyze. If `None`, it analyzes all the columns.	`None`
`exclude_columns`	`Sequence[str]`	The columns to exclude from the input `columns`. If any column is not found, it will be ignored during the filtering process.	`()`
`missing_policy`	`str`	The policy on how to handle missing columns. The following options are available: `'ignore'`, `'warn'`, and `'raise'`. If `'raise'`, an exception is raised if at least one column is missing. If `'warn'`, a warning is raised if at least one column is missing and the missing columns are ignored. If `'ignore'`, the missing columns are ignored and no warning message appears.	`'raise'`
`figure_config`	`BaseFigureConfig \| None`	The figure configuration.	`None`

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import NullValueAnalyzer
>>> analyzer = NullValueAnalyzer()
>>> analyzer
NullValueAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, None],
...         "col2": [0, 1, None, None, 0, 1, 0],
...         "col3": [None, 0, 0, 0, None, 1, None],
...     }
... )
>>> output = analyzer.analyze(frame)
>>> output
NullValueOutput(
  (state): NullValueState(num_columns=3, figure_config=MatplotlibFigureConfig())
)
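
As a rough illustration of what a null-value analysis reports, the sketch below counts the null entries per column in plain Python, using `None` for nulls. The helper name `null_counts` and the output layout are hypothetical.

```python
def null_counts(columns):
    """Return the number of null values and the total size of each column."""
    return {
        name: {"null": sum(value is None for value in values), "total": len(values)}
        for name, values in columns.items()
    }


data = {
    "col1": [0, 1, 1, 0, 0, 1, None],
    "col2": [0, 1, None, None, 0, 1, 0],
    "col3": [None, 0, 0, 0, None, 1, None],
}
for name, stats in null_counts(data).items():
    print(name, stats)
# col1 {'null': 1, 'total': 7}
# col2 {'null': 2, 'total': 7}
# col3 {'null': 3, 'total': 7}
```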

arkas.analyzer.NumericSummaryAnalyzer ¶

Bases: BaseInNLazyAnalyzer

Implement an analyzer to show a summary of the numeric columns of a DataFrame.

Parameters:

Name	Type	Description	Default
`columns`	`Sequence[str] \| None`	The columns to analyze. If `None`, it analyzes all the columns.	`None`
`exclude_columns`	`Sequence[str]`	The columns to exclude from the input `columns`. If any column is not found, it will be ignored during the filtering process.	`()`
`missing_policy`	`str`	The policy on how to handle missing columns. The following options are available: `'ignore'`, `'warn'`, and `'raise'`. If `'raise'`, an exception is raised if at least one column is missing. If `'warn'`, a warning is raised if at least one column is missing and the missing columns are ignored. If `'ignore'`, the missing columns are ignored and no warning message appears.	`'raise'`

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import NumericSummaryAnalyzer
>>> analyzer = NumericSummaryAnalyzer()
>>> analyzer
NumericSummaryAnalyzer(columns=None, exclude_columns=(), missing_policy='raise')
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
...     },
...     schema={"col1": pl.Int64, "col2": pl.Int32, "col3": pl.Float64},
... )
>>> output = analyzer.analyze(frame)
>>> output
NumericSummaryOutput(
  (state): DataFrameState(dataframe=(7, 3), nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)
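
The statistics included in the numeric summary are not listed here. The sketch below shows the kind of per-column figures such a summary typically contains, computed with the standard library. The chosen statistics and the helper name `numeric_summary` are assumptions for illustration only.

```python
import statistics


def numeric_summary(columns):
    """Compute a few descriptive statistics for each numeric column."""
    summary = {}
    for name, values in columns.items():
        present = [value for value in values if value is not None]
        summary[name] = {
            "count": len(present),
            "mean": statistics.fmean(present),
            "std": statistics.stdev(present),  # sample standard deviation
            "min": min(present),
            "median": statistics.median(present),
            "max": max(present),
        }
    return summary


data = {
    "col1": [0, 1, 1, 0, 0, 1, 0],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
}
summary = numeric_summary(data)
print(summary["col3"]["mean"], summary["col3"]["median"])  # 4.0 4.0
```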

arkas.analyzer.PlotColumnAnalyzer ¶

Bases: BaseInNLazyAnalyzer

Implement an analyzer that plots the content of each column.

Parameters:

Name	Type	Description	Default
`columns`	`Sequence[str] \| None`	The columns to analyze. If `None`, it analyzes all the columns.	`None`
`exclude_columns`	`Sequence[str]`	The columns to exclude from the input `columns`. If any column is not found, it will be ignored during the filtering process.	`()`
`missing_policy`	`str`	The policy on how to handle missing columns. The following options are available: `'ignore'`, `'warn'`, and `'raise'`. If `'raise'`, an exception is raised if at least one column is missing. If `'warn'`, a warning is raised if at least one column is missing and the missing columns are ignored. If `'ignore'`, the missing columns are ignored and no warning message appears.	`'raise'`
`figure_config`	`BaseFigureConfig \| None`	The figure configuration.	`None`

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import PlotColumnAnalyzer
>>> analyzer = PlotColumnAnalyzer()
>>> analyzer
PlotColumnAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 0, 1],
...         "col2": [1, 0, 1, 0],
...         "col3": [1, 1, 1, 1],
...     },
...     schema={"col1": pl.Int64, "col2": pl.Int64, "col3": pl.Int64},
... )
>>> output = analyzer.analyze(frame)
>>> output
PlotColumnOutput(
  (state): DataFrameState(dataframe=(4, 3), nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)

arkas.analyzer.ScatterColumnAnalyzer ¶

Bases: BaseLazyAnalyzer

Implement an analyzer that draws a scatter plot of two columns, optionally colored by a third column.

Parameters:

Name	Type	Description	Default
`x`	`str`	The x-axis data column.	required
`y`	`str`	The y-axis data column.	required
`color`	`str \| None`	An optional color axis data column.	`None`
`figure_config`	`BaseFigureConfig \| None`	The figure configuration.	`None`

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import ScatterColumnAnalyzer
>>> analyzer = ScatterColumnAnalyzer(x="col1", y="col2")
>>> analyzer
ScatterColumnAnalyzer(x='col1', y='col2', color=None, figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 0, 1],
...         "col2": [1, 0, 1, 0],
...         "col3": [1, 1, 1, 1],
...     },
...     schema={"col1": pl.Int64, "col2": pl.Int64, "col3": pl.Int64},
... )
>>> output = analyzer.analyze(frame)
>>> output
ScatterColumnOutput(
  (state): ScatterDataFrameState(dataframe=(4, 2), x='col1', y='col2', color=None, nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)

arkas.analyzer.SummaryAnalyzer ¶

Bases: BaseInNLazyAnalyzer

Implement an analyzer to show a summary of a DataFrame.

Parameters:

Name	Type	Description	Default
`columns`	`Sequence[str] \| None`	The columns to analyze. If `None`, it analyzes all the columns.	`None`
`exclude_columns`	`Sequence[str]`	The columns to exclude from the input `columns`. If any column is not found, it will be ignored during the filtering process.	`()`
`missing_policy`	`str`	The policy on how to handle missing columns. The following options are available: `'ignore'`, `'warn'`, and `'raise'`. If `'raise'`, an exception is raised if at least one column is missing. If `'warn'`, a warning is raised if at least one column is missing and the missing columns are ignored. If `'ignore'`, the missing columns are ignored and no warning message appears.	`'raise'`
`top`	`int`	The number of most frequent values to show.	`5`

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import SummaryAnalyzer
>>> analyzer = SummaryAnalyzer()
>>> analyzer
SummaryAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', top=5)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0, 0, 1, 0],
...         "col2": [0, 1, 0, 1, 0, 1, 0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
...     },
...     schema={"col1": pl.Int64, "col2": pl.Int32, "col3": pl.Float64},
... )
>>> output = analyzer.analyze(frame)
>>> output
SummaryOutput(
  (state): DataFrameState(dataframe=(7, 3), nan_policy='propagate', figure_config=MatplotlibFigureConfig(), top=5)
)
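
The `top` parameter limits how many of the most frequent values are shown. A minimal sketch of that idea, using `collections.Counter` from the standard library (the helper name `most_frequent` is hypothetical, and the library may present the values differently):

```python
from collections import Counter


def most_frequent(values, top=5):
    """Return up to `top` (value, count) pairs, most frequent first."""
    return Counter(values).most_common(top)


col1 = [0, 1, 1, 0, 0, 1, 0]
col3 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(most_frequent(col1))  # [(0, 4), (1, 3)]
print(len(most_frequent(col3)))  # 5, because col3 has 7 distinct values
```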

arkas.analyzer.TemporalContinuousColumnAnalyzer ¶

Bases: BaseLazyAnalyzer

Implement an analyzer that analyzes the temporal distribution of a column with continuous values.

Parameters:

Name	Type	Description	Default
`target_column`	`str`	The column to analyze.	required
`temporal_column`	`str`	The temporal column in the DataFrame.	required
`period`	`str \| None`	An optional temporal period e.g. monthly or daily.	`None`
`nan_policy`	`str`	The policy on how to handle NaN values in the input arrays. The following options are available: `'omit'`, `'propagate'`, and `'raise'`.	`'propagate'`
`figure_config`	`BaseFigureConfig \| None`	The figure configuration.	`None`

Example usage:

>>> from datetime import datetime, timezone
>>> import polars as pl
>>> from arkas.analyzer import TemporalContinuousColumnAnalyzer
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 0, 1],
...         "col2": [0, 1, 2, 3],
...         "datetime": [
...             datetime(year=2020, month=1, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=2, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=3, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=4, day=3, tzinfo=timezone.utc),
...         ],
...     },
...     schema={
...         "col1": pl.Int64,
...         "col2": pl.Int64,
...         "datetime": pl.Datetime(time_unit="us", time_zone="UTC"),
...     },
... )
>>> analyzer = TemporalContinuousColumnAnalyzer(
...     target_column="col2", temporal_column="datetime"
... )
>>> analyzer
TemporalContinuousColumnAnalyzer(target_column='col2', temporal_column='datetime', period=None, nan_policy='propagate', figure_config=None)
>>> output = analyzer.analyze(frame)
>>> output
TemporalContinuousColumnOutput(
  (state): TemporalColumnState(dataframe=(4, 3), target_column='col2', temporal_column='datetime', period=None, nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)
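
To illustrate what a temporal period does, the sketch below groups values by calendar month and averages each group in plain Python. The data is illustrative (two observations per month), the helper name `mean_per_month` is hypothetical, and the library's handling of `period` may differ.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import fmean


def mean_per_month(timestamps, values):
    """Group the values by (year, month) and average each group."""
    groups = defaultdict(list)
    for timestamp, value in zip(timestamps, values):
        groups[(timestamp.year, timestamp.month)].append(value)
    return {period: fmean(group) for period, group in sorted(groups.items())}


timestamps = [
    datetime(2020, 1, 3, tzinfo=timezone.utc),
    datetime(2020, 1, 20, tzinfo=timezone.utc),
    datetime(2020, 2, 3, tzinfo=timezone.utc),
    datetime(2020, 2, 25, tzinfo=timezone.utc),
]
values = [0, 1, 2, 3]
print(mean_per_month(timestamps, values))  # {(2020, 1): 0.5, (2020, 2): 2.5}
```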

arkas.analyzer.TemporalNullValueAnalyzer ¶

Bases: BaseInNLazyAnalyzer

Implement an analyzer that analyzes the number of null values in a DataFrame over time.

Parameters:

Name	Type	Description	Default
`temporal_column`	`str`	The temporal column in the DataFrame.	required
`period`	`str`	The temporal period e.g. monthly or daily.	required
`columns`	`Sequence[str] \| None`	The columns to analyze. If `None`, it analyzes all the columns.	`None`
`exclude_columns`	`Sequence[str]`	The columns to exclude from the input `columns`. If any column is not found, it will be ignored during the filtering process.	`()`
`missing_policy`	`str`	The policy on how to handle missing columns. The following options are available: `'ignore'`, `'warn'`, and `'raise'`. If `'raise'`, an exception is raised if at least one column is missing. If `'warn'`, a warning is raised if at least one column is missing and the missing columns are ignored. If `'ignore'`, the missing columns are ignored and no warning message appears.	`'raise'`
`figure_config`	`BaseFigureConfig \| None`	The figure configuration.	`None`

Example usage:

>>> from datetime import datetime, timezone
>>> import polars as pl
>>> from arkas.analyzer import TemporalNullValueAnalyzer
>>> analyzer = TemporalNullValueAnalyzer(temporal_column="datetime", period="1d")
>>> analyzer
TemporalNullValueAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', temporal_column='datetime', period='1d', figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0],
...         "col2": [0, 1, 0, 1],
...         "col3": [1, 0, 0, 0],
...         "datetime": [
...             datetime(year=2020, month=1, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=2, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=3, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=4, day=3, tzinfo=timezone.utc),
...         ],
...     },
...     schema={
...         "col1": pl.Int64,
...         "col2": pl.Int64,
...         "col3": pl.Int64,
...         "datetime": pl.Datetime(time_unit="us", time_zone="UTC"),
...     },
... )
>>> output = analyzer.analyze(frame)
>>> output
TemporalNullValueOutput(
  (state): TemporalDataFrameState(dataframe=(4, 4), temporal_column='datetime', period='1d', nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)
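
A per-period null count can be sketched in plain Python by bucketing rows by day, which is what the `'1d'` period in the example suggests. The data below is illustrative and contains `None` values on purpose; the helper name `null_count_per_day` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def null_count_per_day(timestamps, values):
    """Count null and total values for each calendar day."""
    counts = defaultdict(lambda: {"null": 0, "total": 0})
    for timestamp, value in zip(timestamps, values):
        bucket = counts[timestamp.date().isoformat()]
        bucket["total"] += 1
        bucket["null"] += int(value is None)
    return dict(sorted(counts.items()))


timestamps = [
    datetime(2020, 1, 3, 8, tzinfo=timezone.utc),
    datetime(2020, 1, 3, 17, tzinfo=timezone.utc),
    datetime(2020, 1, 4, 9, tzinfo=timezone.utc),
]
values = [0, None, None]
print(null_count_per_day(timestamps, values))
# {'2020-01-03': {'null': 1, 'total': 2}, '2020-01-04': {'null': 1, 'total': 1}}
```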

arkas.analyzer.TemporalPlotColumnAnalyzer ¶

Bases: BaseInNLazyAnalyzer

Implement an analyzer that plots the content of each column over time.

Parameters:

Name	Type	Description	Default
`temporal_column`	`str`	The temporal column in the DataFrame.	required
`period`	`str \| None`	An optional temporal period e.g. monthly or daily.	`None`
`columns`	`Sequence[str] \| None`	The columns to analyze. If `None`, it analyzes all the columns.	`None`
`exclude_columns`	`Sequence[str]`	The columns to exclude from the input `columns`. If any column is not found, it will be ignored during the filtering process.	`()`
`missing_policy`	`str`	The policy on how to handle missing columns. The following options are available: `'ignore'`, `'warn'`, and `'raise'`. If `'raise'`, an exception is raised if at least one column is missing. If `'warn'`, a warning is raised if at least one column is missing and the missing columns are ignored. If `'ignore'`, the missing columns are ignored and no warning message appears.	`'raise'`
`figure_config`	`BaseFigureConfig \| None`	The figure configuration.	`None`

Example usage:

>>> from datetime import datetime, timezone
>>> import polars as pl
>>> from arkas.analyzer import TemporalPlotColumnAnalyzer
>>> analyzer = TemporalPlotColumnAnalyzer(temporal_column="datetime")
>>> analyzer
TemporalPlotColumnAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', temporal_column='datetime', period=None, figure_config=None)
>>> frame = pl.DataFrame(
...     {
...         "col1": [0, 1, 1, 0],
...         "col2": [0, 1, 0, 1],
...         "col3": [1, 0, 0, 0],
...         "datetime": [
...             datetime(year=2020, month=1, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=2, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=3, day=3, tzinfo=timezone.utc),
...             datetime(year=2020, month=4, day=3, tzinfo=timezone.utc),
...         ],
...     },
...     schema={
...         "col1": pl.Int64,
...         "col2": pl.Int64,
...         "col3": pl.Int64,
...         "datetime": pl.Datetime(time_unit="us", time_zone="UTC"),
...     },
... )
>>> output = analyzer.analyze(frame)
>>> output
TemporalPlotColumnOutput(
  (state): TemporalDataFrameState(dataframe=(4, 4), temporal_column='datetime', period=None, nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)

arkas.analyzer.TransformAnalyzer ¶

Bases: BaseAnalyzer

Implement an analyzer that transforms the data before analyzing it.

Parameters:

Name	Type	Description	Default
`transformer`	`BaseTransformer \| dict`	The transformer or its configuration.	required
`analyzer`	`BaseAnalyzer \| dict`	The analyzer or its configuration.	required

Example usage:

>>> import polars as pl
>>> from arkas.analyzer import AccuracyAnalyzer, TransformAnalyzer
>>> from grizz.transformer import DropNullRow
>>> analyzer = TransformAnalyzer(
...     transformer=DropNullRow(), analyzer=AccuracyAnalyzer(y_true="target", y_pred="pred")
... )
>>> analyzer
TransformAnalyzer(
  (transformer): DropNullRowTransformer(columns=None, exclude_columns=(), missing_policy='raise')
  (analyzer): AccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
)
>>> frame = pl.DataFrame(
...     {"pred": [3, 2, 0, 1, 0, 1, None], "target": [3, 2, 0, 1, 0, 1, None]}
... )
>>> output = analyzer.analyze(frame)
>>> output
AccuracyOutput(
  (state): AccuracyState(y_true=(6,), y_pred=(6,), y_true_name='target', y_pred_name='pred', nan_policy='propagate')
)
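
The composition can be pictured as a two-step pipeline: the transformer is applied to the data first, and the wrapped analyzer then runs on the transformed result. The sketch below illustrates that pattern with plain Python lists of rows instead of polars DataFrames. The names `drop_null_rows`, `accuracy`, and `TransformThenAnalyze` are hypothetical and exist only for illustration; they are not part of arkas or grizz, and the real classes may differ in their details.

```python
from collections.abc import Callable
from typing import Any

Row = dict[str, Any]


def drop_null_rows(rows: list[Row]) -> list[Row]:
    """Hypothetical transformer: keep only the rows without ``None`` values."""
    return [row for row in rows if all(value is not None for value in row.values())]


def accuracy(rows: list[Row]) -> float:
    """Hypothetical analyzer: fraction of rows where ``pred`` equals ``target``."""
    return sum(row["pred"] == row["target"] for row in rows) / len(rows)


class TransformThenAnalyze:
    """Hypothetical stand-in for the transform-then-analyze pattern."""

    def __init__(
        self,
        transformer: Callable[[list[Row]], list[Row]],
        analyzer: Callable[[list[Row]], float],
    ) -> None:
        self._transformer = transformer
        self._analyzer = analyzer

    def analyze(self, rows: list[Row]) -> float:
        # Step 1: transform the data. Step 2: analyze the transformed data.
        return self._analyzer(self._transformer(rows))


preds = [3, 2, 0, 1, 0, 1, None]
targets = [3, 2, 0, 1, 0, 1, None]
rows = [{"pred": p, "target": t} for p, t in zip(preds, targets)]

pipeline = TransformThenAnalyze(transformer=drop_null_rows, analyzer=accuracy)
score = pipeline.analyze(rows)
```

As in the example above, the row containing nulls is removed by the transformer, so the analyzer only sees the six remaining rows.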

arkas.analyzer.is_analyzer_config ¶

is_analyzer_config(config: dict) -> bool

Indicate if the input configuration is a configuration for a BaseAnalyzer.

This function only checks if the value of the key `_target_` is valid. It does not check the other values. If `_target_` indicates a function, its return type hint is used to check the class.

Parameters:

Name	Type	Description	Default
`config`	`dict`	The configuration to check.	required

Returns:

Type	Description
`bool`	`True` if the input configuration is a configuration for a `BaseAnalyzer` object.

Example usage:

>>> from arkas.analyzer import is_analyzer_config
>>> is_analyzer_config({"_target_": "arkas.analyzer.AccuracyAnalyzer"})
True
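
To make the "only `_target_` is checked" behavior concrete, here is a minimal sketch of that kind of check, written against the standard library only. `resolve` and `is_config_for` are hypothetical helpers, not arkas functions, and the sketch handles only the case where `_target_` names a class; the function case described above (using the return type hint) is left out.

```python
import importlib


def resolve(path: str) -> object:
    """Import the object designated by a dotted path such as 'pkg.module.Name'."""
    module_name, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def is_config_for(config: dict, base: type) -> bool:
    """Hypothetical check: does ``_target_`` point to a subclass of ``base``?

    Only the ``_target_`` value is inspected; every other key is ignored.
    """
    target = config.get("_target_")
    if not isinstance(target, str):
        return False
    try:
        obj = resolve(target)
    except (ImportError, AttributeError, ValueError):
        # The path cannot be imported, so it cannot be a valid target.
        return False
    return isinstance(obj, type) and issubclass(obj, base)


valid = is_config_for({"_target_": "collections.OrderedDict"}, dict)
extra_keys = is_config_for({"_target_": "collections.OrderedDict", "bogus": 1}, dict)
wrong_class = is_config_for({"_target_": "collections.deque"}, dict)
missing_module = is_config_for({"_target_": "no.such.module.Thing"}, dict)
```

In this sketch, a configuration with unrelated or invalid extra keys still passes as long as `_target_` resolves to a suitable class, which mirrors the caveat stated in the description.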

arkas.analyzer.setup_analyzer ¶

setup_analyzer(
    analyzer: BaseAnalyzer | dict,
) -> BaseAnalyzer

Set up an analyzer.

The analyzer is instantiated from its configuration by using the BaseAnalyzer factory function.

Parameters:

Name	Type	Description	Default
`analyzer`	`BaseAnalyzer \| dict`	An analyzer or its configuration.	required

Returns:

Type	Description
`BaseAnalyzer`	An instantiated analyzer.

Example usage:

>>> from arkas.analyzer import setup_analyzer
>>> analyzer = setup_analyzer(
...     {
...         "_target_": "arkas.analyzer.AccuracyAnalyzer",
...         "y_true": "target",
...         "y_pred": "pred",
...     }
... )
>>> analyzer
AccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
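
The "object or configuration" pattern behind this function can be sketched with the standard library alone. The helper `setup_object` below is hypothetical and is not the arkas implementation: it assumes that a configuration is a `dict` whose `_target_` key holds a dotted class path and whose remaining keys are constructor keyword arguments, and that anything else is already an instantiated object.

```python
import importlib
from fractions import Fraction
from typing import Any


def setup_object(obj_or_config: Any) -> Any:
    """Hypothetical helper: instantiate from a config, or return the object as is."""
    if not isinstance(obj_or_config, dict):
        # Already an object: nothing to instantiate.
        return obj_or_config
    kwargs = dict(obj_or_config)  # copy so the caller's config is not mutated
    module_name, _, attr = kwargs.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_name), attr)
    return cls(**kwargs)


config = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
from_config = setup_object(config)

existing = Fraction(1, 2)
passthrough = setup_object(existing)
```

Accepting both forms lets callers pass either a ready-made object or a serializable configuration (for example, one loaded from a YAML file) through the same code path.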
