arkas.analyzer ¶
Contain DataFrame analyzers.
arkas.analyzer.AccuracyAnalyzer ¶
Bases: BaseTruePredAnalyzer
Implement the accuracy analyzer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
y_true | str | The column name of the ground truth target labels. | required |
y_pred | str | The column name of the predicted labels. | required |
drop_nulls | bool | If … | True |
missing_policy | str | The policy on how to handle missing columns. The following options are available: … | 'raise' |
nan_policy | str | The policy on how to handle NaN values in the input arrays. The following options are available: … | 'propagate' |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import AccuracyAnalyzer
>>> analyzer = AccuracyAnalyzer(y_true="target", y_pred="pred")
>>> analyzer
AccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
>>> frame = pl.DataFrame({"pred": [3, 2, 0, 1, 0, 1], "target": [3, 2, 0, 1, 0, 1]})
>>> output = analyzer.analyze(frame)
>>> output
AccuracyOutput(
(state): AccuracyState(y_true=(6,), y_pred=(6,), y_true_name='target', y_pred_name='pred', nan_policy='propagate')
)
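As a rough illustration of the metric behind this analyzer, the sketch below computes plain accuracy (the fraction of predictions equal to the ground truth) in pure Python. The helper `accuracy` is hypothetical and is not part of arkas; it only reuses the values from the example above.

```python
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Return the fraction of positions where the prediction equals the target."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Same values as the example above: every prediction matches its target.
perfect = accuracy([3, 2, 0, 1, 0, 1], [3, 2, 0, 1, 0, 1])
# One wrong prediction out of four samples.
partial = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```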
arkas.analyzer.BalancedAccuracyAnalyzer ¶
Bases: BaseTruePredAnalyzer
Implement the balanced accuracy analyzer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
y_true | str | The column name of the ground truth target labels. | required |
y_pred | str | The column name of the predicted labels. | required |
drop_nulls | bool | If … | True |
missing_policy | str | The policy on how to handle missing columns. The following options are available: … | 'raise' |
nan_policy | str | The policy on how to handle NaN values in the input arrays. The following options are available: … | 'propagate' |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import BalancedAccuracyAnalyzer
>>> analyzer = BalancedAccuracyAnalyzer(y_true="target", y_pred="pred")
>>> analyzer
BalancedAccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
>>> frame = pl.DataFrame({"pred": [3, 2, 0, 1, 0, 1], "target": [3, 2, 0, 1, 0, 1]})
>>> output = analyzer.analyze(frame)
>>> output
BalancedAccuracyOutput(
(state): AccuracyState(y_true=(6,), y_pred=(6,), y_true_name='target', y_pred_name='pred', nan_policy='propagate')
)
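To show how balanced accuracy differs from plain accuracy, here is a minimal pure-Python sketch that averages the per-class recall. The function `balanced_accuracy` is a hypothetical helper written for this illustration, not the arkas implementation.

```python
from collections import defaultdict


def balanced_accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Return the unweighted mean of the recall obtained on each class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and have the same length")
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # number of correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Always predicting the majority class: recall is 1.0 on class 0 and 0.0 on class 1.
imbalanced = balanced_accuracy([0, 0, 0, 1], [0, 0, 0, 0])
```

On this imbalanced input, plain accuracy would be 0.75, while the balanced variant penalizes the class that is never predicted.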
arkas.analyzer.BaseAnalyzer ¶
Bases: ABC
Define the base class to analyze a DataFrame.
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import AccuracyAnalyzer
>>> analyzer = AccuracyAnalyzer(y_true="target", y_pred="pred")
>>> analyzer
AccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
>>> data = pl.DataFrame({"pred": [3, 2, 0, 1, 0, 1], "target": [3, 2, 0, 1, 0, 1]})
>>> output = analyzer.analyze(data)
>>> output
AccuracyOutput(
(state): AccuracyState(y_true=(6,), y_pred=(6,), y_true_name='target', y_pred_name='pred', nan_policy='propagate')
)
arkas.analyzer.BaseAnalyzer.analyze abstractmethod ¶
analyze(frame: DataFrame, lazy: bool = True) -> BaseOutput
Analyze the DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
frame | DataFrame | The DataFrame to analyze. | required |
lazy | bool | If … | True |
Returns:
Type | Description |
---|---|
BaseOutput | The generated output. |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import AccuracyAnalyzer
>>> analyzer = AccuracyAnalyzer(y_true="target", y_pred="pred")
>>> data = pl.DataFrame({"pred": [3, 2, 0, 1, 0, 1], "target": [3, 2, 0, 1, 0, 1]})
>>> output = analyzer.analyze(data)
>>> output
AccuracyOutput(
(state): AccuracyState(y_true=(6,), y_pred=(6,), y_true_name='target', y_pred_name='pred', nan_policy='propagate')
)
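The abstract-method pattern described here can be sketched with the standard `abc` module. `ToyAnalyzer` and `RowCountAnalyzer` are hypothetical stand-ins that operate on a plain dictionary of columns; they are not arkas classes, and the `lazy` flag is accepted only to mirror the signature, without modelling its behavior.

```python
from abc import ABC, abstractmethod


class ToyAnalyzer(ABC):
    """Hypothetical stand-in for an analyzer base class (not the arkas class)."""

    @abstractmethod
    def analyze(self, frame: dict[str, list], lazy: bool = True) -> dict:
        """Analyze the data and return an output."""


class RowCountAnalyzer(ToyAnalyzer):
    """Concrete analyzer that reports the number of rows."""

    def analyze(self, frame: dict[str, list], lazy: bool = True) -> dict:
        # Use the first column to determine the number of rows (0 if there are no columns).
        num_rows = len(next(iter(frame.values()), []))
        return {"num_rows": num_rows}


output = RowCountAnalyzer().analyze({"pred": [3, 2, 0], "target": [3, 2, 0]})
```

Instantiating `ToyAnalyzer` directly raises a `TypeError`, because `analyze` is abstract and must be implemented by a subclass.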
arkas.analyzer.BaseInNLazyAnalyzer ¶
Bases: BaseAnalyzer
Define a base class to implement analyzers that analyze DataFrames by using multiple input columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | Sequence[str] \| None | The columns to analyze. If … | None |
exclude_columns | Sequence[str] | The columns to exclude from the input … | () |
missing_policy | str | The policy on how to handle missing columns. The following options are available: … | 'raise' |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import ColumnCooccurrenceAnalyzer
>>> analyzer = ColumnCooccurrenceAnalyzer()
>>> analyzer
ColumnCooccurrenceAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', ignore_self=False, figure_config=None)
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 1, 0, 0, 1, 0],
... "col2": [0, 1, 0, 1, 0, 1, 0],
... "col3": [0, 0, 0, 0, 1, 1, 1],
... }
... )
>>> output = analyzer.analyze(frame)
>>> output
ColumnCooccurrenceOutput(
(state): ColumnCooccurrenceState(matrix=(3, 3), figure_config=MatplotlibFigureConfig())
)
arkas.analyzer.BaseInNLazyAnalyzer.find_columns ¶
find_columns(frame: DataFrame) -> tuple[str, ...]
Find the columns to transform.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
frame | DataFrame | The input DataFrame. Sometimes the columns to transform are found by analyzing the input DataFrame. | required |
Returns:
Type | Description |
---|---|
tuple[str, ...] | The columns to transform. |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import ColumnCooccurrenceAnalyzer
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 1, 0, 0, 1, 0],
... "col2": [0, 1, 0, 1, 0, 1, 0],
... "col3": [0, 0, 0, 0, 1, 1, 1],
... }
... )
>>> analyzer = ColumnCooccurrenceAnalyzer(columns=["col2", "col3"])
>>> analyzer.find_columns(frame)
('col2', 'col3')
>>> analyzer = ColumnCooccurrenceAnalyzer()
>>> analyzer.find_columns(frame)
('col1', 'col2', 'col3')
arkas.analyzer.BaseInNLazyAnalyzer.find_common_columns ¶
find_common_columns(frame: DataFrame) -> tuple[str, ...]
Find the common columns between the DataFrame columns and the input columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
frame | DataFrame | The input DataFrame. Sometimes the columns to transform are found by analyzing the input DataFrame. | required |
Returns:
Type | Description |
---|---|
tuple[str, ...] | The common columns. |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import ColumnCooccurrenceAnalyzer
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 1, 0, 0, 1, 0],
... "col2": [0, 1, 0, 1, 0, 1, 0],
... "col3": [0, 0, 0, 0, 1, 1, 1],
... }
... )
>>> analyzer = ColumnCooccurrenceAnalyzer(columns=["col2", "col3", "col5"])
>>> analyzer.find_common_columns(frame)
('col2', 'col3')
>>> analyzer = ColumnCooccurrenceAnalyzer()
>>> analyzer.find_common_columns(frame)
('col1', 'col2', 'col3')
arkas.analyzer.BaseInNLazyAnalyzer.find_missing_columns ¶
find_missing_columns(frame: DataFrame) -> tuple[str, ...]
Find the missing columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
frame | DataFrame | The input DataFrame. Sometimes the columns to transform are found by analyzing the input DataFrame. | required |
Returns:
Type | Description |
---|---|
tuple[str, ...] | The missing columns. |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import ColumnCooccurrenceAnalyzer
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 1, 0, 0, 1, 0],
... "col2": [0, 1, 0, 1, 0, 1, 0],
... "col3": [0, 0, 0, 0, 1, 1, 1],
... }
... )
>>> analyzer = ColumnCooccurrenceAnalyzer(columns=["col2", "col3", "col5"])
>>> analyzer.find_missing_columns(frame)
('col5',)
>>> analyzer = ColumnCooccurrenceAnalyzer()
>>> analyzer.find_missing_columns(frame)
()
arkas.analyzer.BaseInNLazyAnalyzer.get_args ¶
get_args() -> dict
Get the arguments of the analyzer.
Returns:
Type | Description |
---|---|
dict | The arguments. |
arkas.analyzer.BaseLazyAnalyzer ¶
Bases: BaseAnalyzer
Define a base class to implement a lazy analyzer.
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import SummaryAnalyzer
>>> analyzer = SummaryAnalyzer()
>>> analyzer
SummaryAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', top=5)
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 1, 0, 0, 1, 0],
... "col2": [0, 1, 0, 1, 0, 1, 0],
... "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
... },
... schema={"col1": pl.Int64, "col2": pl.Int32, "col3": pl.Float64},
... )
>>> output = analyzer.analyze(frame)
>>> output
SummaryOutput(
(state): DataFrameState(dataframe=(7, 3), nan_policy='propagate', figure_config=MatplotlibFigureConfig(), top=5)
)
arkas.analyzer.BaseTruePredAnalyzer ¶
Bases: BaseAnalyzer
Define a base class to implement a polars.DataFrame analyzer that takes two input columns: y_true and y_pred.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
y_true | str | The column name of the ground truth target labels. | required |
y_pred | str | The column name of the predicted labels. | required |
drop_nulls | bool | If … | required |
missing_policy | str | The policy on how to handle missing columns. The following options are available: … | required |
arkas.analyzer.ColumnCooccurrenceAnalyzer ¶
Bases: BaseInNLazyAnalyzer
Implement a pairwise column co-occurrence analyzer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | Sequence[str] \| None | The columns to analyze. If … | None |
exclude_columns | Sequence[str] | The columns to exclude from the input … | () |
missing_policy | str | The policy on how to handle missing columns. The following options are available: … | 'raise' |
ignore_self | bool | If … | False |
figure_config | BaseFigureConfig \| None | The figure configuration. | None |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import ColumnCooccurrenceAnalyzer
>>> analyzer = ColumnCooccurrenceAnalyzer()
>>> analyzer
ColumnCooccurrenceAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', ignore_self=False, figure_config=None)
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 1, 0, 0, 1, 0],
... "col2": [0, 1, 0, 1, 0, 1, 0],
... "col3": [0, 0, 0, 0, 1, 1, 1],
... }
... )
>>> output = analyzer.analyze(frame)
>>> output
ColumnCooccurrenceOutput(
(state): ColumnCooccurrenceState(matrix=(3, 3), figure_config=MatplotlibFigureConfig())
)
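One plausible reading of a pairwise co-occurrence matrix is sketched below: entry (i, j) counts the rows where columns i and j are both non-zero. This is an assumption made for illustration, not a statement of how arkas computes its matrix. The function `cooccurrence_matrix` is hypothetical, and treating `ignore_self` as "zero the diagonal" is also an assumption.

```python
def cooccurrence_matrix(columns: dict[str, list[int]], ignore_self: bool = False) -> list[list[int]]:
    """Count, for each pair of columns, the rows where both values are non-zero."""
    names = list(columns)
    matrix = []
    for i, first in enumerate(names):
        row = []
        for j, second in enumerate(names):
            if ignore_self and i == j:
                row.append(0)  # assumption: ignore_self zeroes the diagonal
            else:
                row.append(sum(1 for a, b in zip(columns[first], columns[second]) if a and b))
        matrix.append(row)
    return matrix


# Same values as the example above.
data = {
    "col1": [0, 1, 1, 0, 0, 1, 0],
    "col2": [0, 1, 0, 1, 0, 1, 0],
    "col3": [0, 0, 0, 0, 1, 1, 1],
}
matrix = cooccurrence_matrix(data)
```

The result is a symmetric 3 x 3 matrix, which is consistent with the `matrix=(3, 3)` shape reported in the example output.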
arkas.analyzer.ColumnCorrelationAnalyzer ¶
Bases: BaseInNLazyAnalyzer
Implement an analyzer to analyze the correlation between numeric columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | Sequence[str] \| None | The columns to analyze. If … | None |
exclude_columns | Sequence[str] | The columns to exclude from the input … | () |
missing_policy | str | The policy on how to handle missing columns. The following options are available: … | 'raise' |
nan_policy | str | The policy on how to handle NaN values in the input arrays. The following options are available: … | 'propagate' |
sort_metric | str | The key used to sort the correlation table. | 'spearman_coeff' |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import ColumnCorrelationAnalyzer
>>> analyzer = ColumnCorrelationAnalyzer(target_column="col3")
>>> analyzer
ColumnCorrelationAnalyzer(target_column='col3', sort_metric='spearman_coeff', columns=None, exclude_columns=(), missing_policy='raise', nan_policy='propagate')
>>> frame = pl.DataFrame(
... {
... "col1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
... "col2": [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
... "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
... },
... )
>>> output = analyzer.analyze(frame)
>>> output
ColumnCorrelationOutput(
(state): TargetDataFrameState(dataframe=(7, 3), target_column='col3', nan_policy='propagate', figure_config=MatplotlibFigureConfig(), sort_metric='spearman_coeff')
)
arkas.analyzer.ContentAnalyzer ¶
Bases: BaseLazyAnalyzer
Implement an analyzer that generates an output with the given custom content.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
content | str | The content to use in the HTML code. | required |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import ContentAnalyzer
>>> analyzer = ContentAnalyzer(content="meow")
>>> analyzer
ContentAnalyzer()
>>> frame = pl.DataFrame({"pred": [3, 2, 0, 1, 0, 1], "target": [3, 2, 0, 1, 0, 1]})
>>> output = analyzer.analyze(frame)
>>> output
ContentOutput(
(content): ContentGenerator()
(evaluator): Evaluator(count=0)
)
arkas.analyzer.ContinuousColumnAnalyzer ¶
Bases: BaseLazyAnalyzer
Implement an analyzer that analyzes a column with continuous values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The column to analyze. | required |
figure_config | BaseFigureConfig \| None | The figure configuration. | None |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import ContinuousColumnAnalyzer
>>> analyzer = ContinuousColumnAnalyzer(column="col1")
>>> analyzer
ContinuousColumnAnalyzer(column='col1', figure_config=None)
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 0, 1],
... "col2": [1, 0, 1, 0],
... "col3": [1, 1, 1, 1],
... },
... schema={"col1": pl.Int64, "col2": pl.Int64, "col3": pl.Int64},
... )
>>> output = analyzer.analyze(frame)
>>> output
ContinuousSeriesOutput(
(state): SeriesState(name='col1', values=(4,), figure_config=MatplotlibFigureConfig())
)
arkas.analyzer.CorrelationAnalyzer ¶
Bases: BaseLazyAnalyzer
Implement an analyzer that analyzes the correlation between two columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | str | The first column. | required |
y | str | The second column. | required |
drop_nulls | bool | If … | True |
missing_policy | str | The policy on how to handle missing columns. The following options are available: … | 'raise' |
nan_policy | str | The policy on how to handle NaN values in the input arrays. The following options are available: … | 'propagate' |
figure_config | BaseFigureConfig \| None | The figure configuration. | None |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import CorrelationAnalyzer
>>> analyzer = CorrelationAnalyzer(x="col1", y="col2")
>>> analyzer
CorrelationAnalyzer(x='col1', y='col2', drop_nulls=True, missing_policy='raise', nan_policy='propagate', figure_config=None)
>>> frame = pl.DataFrame(
... {
... "col1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
... "col2": [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
... "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
... },
... schema={"col1": pl.Float64, "col2": pl.Float64, "col3": pl.Float64},
... )
>>> output = analyzer.analyze(frame)
>>> output
CorrelationOutput(
(state): TwoColumnDataFrameState(dataframe=(7, 2), column1='col1', column2='col2', nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)
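For intuition about what a correlation between two columns measures, the sketch below computes the Pearson coefficient in pure Python. Pearson is used here only as one common measure; this section does not state which coefficients the analyzer reports, and the helper `pearson` is hypothetical.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Return the Pearson correlation coefficient between two sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length and at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)


# Same values as col1 and col2 in the example above: col2 decreases as col1 increases.
col1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
col2 = [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
coeff = pearson(col1, col2)
```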
arkas.analyzer.HexbinColumnAnalyzer ¶
Bases: BaseLazyAnalyzer
Implement an analyzer that makes a hexbin plot of two columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | str | The x-axis data column. | required |
y | str | The y-axis data column. | required |
color | str \| None | An optional color axis data column. | None |
figure_config | BaseFigureConfig \| None | The figure configuration. | None |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import HexbinColumnAnalyzer
>>> analyzer = HexbinColumnAnalyzer(x="col1", y="col2")
>>> analyzer
HexbinColumnAnalyzer(x='col1', y='col2', color=None, figure_config=None)
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 0, 1],
... "col2": [1, 0, 1, 0],
... "col3": [1, 1, 1, 1],
... },
... schema={"col1": pl.Int64, "col2": pl.Int64, "col3": pl.Int64},
... )
>>> output = analyzer.analyze(frame)
>>> output
HexbinColumnOutput(
(state): ScatterDataFrameState(dataframe=(4, 2), x='col1', y='col2', color=None, nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)
arkas.analyzer.MappingAnalyzer ¶
Bases: BaseAnalyzer
Implement an analyzer that processes a mapping of analyzers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
analyzers | Mapping[str, BaseAnalyzer] | The mapping of analyzers. | required |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import (
... MappingAnalyzer,
... AccuracyAnalyzer,
... BalancedAccuracyAnalyzer,
... )
>>> analyzer = MappingAnalyzer(
... {
... "one": AccuracyAnalyzer(y_true="target", y_pred="pred"),
... "two": BalancedAccuracyAnalyzer(y_true="target", y_pred="pred"),
... }
... )
>>> analyzer
MappingAnalyzer(
(one): AccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
(two): BalancedAccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
)
>>> frame = pl.DataFrame({"pred": [1, 0, 0, 1, 1], "target": [1, 0, 0, 1, 1]})
>>> output = analyzer.analyze(frame)
>>> output
OutputDict(count=2)
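The idea of applying a mapping of analyzers to the same data can be sketched with plain callables. This is an illustration only: `analyze_mapping` is a hypothetical helper, each "analyzer" is a simple function over a dictionary of columns, and the real class returns an `OutputDict` rather than a plain `dict`.

```python
from collections.abc import Callable, Mapping


def analyze_mapping(
    analyzers: Mapping[str, Callable[[dict], object]], frame: dict
) -> dict[str, object]:
    """Run every analyzer on the same frame and collect the outputs under the same keys."""
    return {key: analyzer(frame) for key, analyzer in analyzers.items()}


frame = {"pred": [1, 0, 0, 1, 1], "target": [1, 0, 0, 1, 1]}
outputs = analyze_mapping(
    {
        # Toy analyzer: number of rows.
        "one": lambda f: len(f["pred"]),
        # Toy analyzer: number of correct predictions.
        "two": lambda f: sum(p == t for p, t in zip(f["pred"], f["target"])),
    },
    frame,
)
```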
arkas.analyzer.NullValueAnalyzer ¶
Bases: BaseInNLazyAnalyzer
Implement an analyzer that analyzes the number of null values in a DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | Sequence[str] \| None | The columns to analyze. If … | None |
exclude_columns | Sequence[str] | The columns to exclude from the input … | () |
missing_policy | str | The policy on how to handle missing columns. The following options are available: … | 'raise' |
figure_config | BaseFigureConfig \| None | The figure configuration. | None |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import NullValueAnalyzer
>>> analyzer = NullValueAnalyzer()
>>> analyzer
NullValueAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', figure_config=None)
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 1, 0, 0, 1, None],
... "col2": [0, 1, None, None, 0, 1, 0],
... "col3": [None, 0, 0, 0, None, 1, None],
... }
... )
>>> output = analyzer.analyze(frame)
>>> output
NullValueOutput(
(state): NullValueState(num_columns=3, figure_config=MatplotlibFigureConfig())
)
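A per-column null count, the kind of statistic a null-value analysis is built on, can be sketched in a few lines. The helper `count_nulls` is hypothetical and works on a plain dictionary of columns instead of a polars DataFrame.

```python
def count_nulls(frame: dict[str, list]) -> dict[str, int]:
    """Return the number of None values in each column."""
    return {name: sum(value is None for value in values) for name, values in frame.items()}


# Same values as the example above.
frame = {
    "col1": [0, 1, 1, 0, 0, 1, None],
    "col2": [0, 1, None, None, 0, 1, 0],
    "col3": [None, 0, 0, 0, None, 1, None],
}
null_counts = count_nulls(frame)
```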
arkas.analyzer.NumericSummaryAnalyzer ¶
Bases: BaseInNLazyAnalyzer
Implement an analyzer to show a summary of the numeric columns of a DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | Sequence[str] \| None | The columns to analyze. If … | None |
exclude_columns | Sequence[str] | The columns to exclude from the input … | () |
missing_policy | str | The policy on how to handle missing columns. The following options are available: … | 'raise' |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import NumericSummaryAnalyzer
>>> analyzer = NumericSummaryAnalyzer()
>>> analyzer
NumericSummaryAnalyzer(columns=None, exclude_columns=(), missing_policy='raise')
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 1, 0, 0, 1, 0],
... "col2": [0, 1, 0, 1, 0, 1, 0],
... "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
... },
... schema={"col1": pl.Int64, "col2": pl.Int32, "col3": pl.Float64},
... )
>>> output = analyzer.analyze(frame)
>>> output
NumericSummaryOutput(
(state): DataFrameState(dataframe=(7, 3), nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)
arkas.analyzer.PlotColumnAnalyzer ¶
Bases: BaseInNLazyAnalyzer
Implement an analyzer that plots the content of each column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | Sequence[str] \| None | The columns to analyze. If … | None |
exclude_columns | Sequence[str] | The columns to exclude from the input … | () |
missing_policy | str | The policy on how to handle missing columns. The following options are available: … | 'raise' |
figure_config | BaseFigureConfig \| None | The figure configuration. | None |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import PlotColumnAnalyzer
>>> analyzer = PlotColumnAnalyzer()
>>> analyzer
PlotColumnAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', figure_config=None)
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 0, 1],
... "col2": [1, 0, 1, 0],
... "col3": [1, 1, 1, 1],
... },
... schema={"col1": pl.Int64, "col2": pl.Int64, "col3": pl.Int64},
... )
>>> output = analyzer.analyze(frame)
>>> output
PlotColumnOutput(
(state): DataFrameState(dataframe=(4, 3), nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)
arkas.analyzer.ScatterColumnAnalyzer ¶
Bases: BaseLazyAnalyzer
Implement an analyzer that makes a scatter plot of two columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | str | The x-axis data column. | required |
y | str | The y-axis data column. | required |
color | str \| None | An optional color axis data column. | None |
figure_config | BaseFigureConfig \| None | The figure configuration. | None |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import ScatterColumnAnalyzer
>>> analyzer = ScatterColumnAnalyzer(x="col1", y="col2")
>>> analyzer
ScatterColumnAnalyzer(x='col1', y='col2', color=None, figure_config=None)
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 0, 1],
... "col2": [1, 0, 1, 0],
... "col3": [1, 1, 1, 1],
... },
... schema={"col1": pl.Int64, "col2": pl.Int64, "col3": pl.Int64},
... )
>>> output = analyzer.analyze(frame)
>>> output
ScatterColumnOutput(
(state): ScatterDataFrameState(dataframe=(4, 2), x='col1', y='col2', color=None, nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)
arkas.analyzer.SummaryAnalyzer ¶
Bases: BaseInNLazyAnalyzer
Implement an analyzer to show a summary of a DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | Sequence[str] \| None | The columns to analyze. If … | None |
exclude_columns | Sequence[str] | The columns to exclude from the input … | () |
missing_policy | str | The policy on how to handle missing columns. The following options are available: … | 'raise' |
top | int | The number of most frequent values to show. | 5 |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import SummaryAnalyzer
>>> analyzer = SummaryAnalyzer()
>>> analyzer
SummaryAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', top=5)
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 1, 0, 0, 1, 0],
... "col2": [0, 1, 0, 1, 0, 1, 0],
... "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
... },
... schema={"col1": pl.Int64, "col2": pl.Int32, "col3": pl.Float64},
... )
>>> output = analyzer.analyze(frame)
>>> output
SummaryOutput(
(state): DataFrameState(dataframe=(7, 3), nan_policy='propagate', figure_config=MatplotlibFigureConfig(), top=5)
)
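The `top` parameter limits how many of the most frequent values are shown. A minimal sketch of that idea, using `collections.Counter` from the standard library, is given below; `most_frequent` is a hypothetical helper and not the code used by the analyzer.

```python
from collections import Counter


def most_frequent(values: list, top: int = 5) -> list[tuple[object, int]]:
    """Return up to `top` (value, count) pairs, most frequent first."""
    return Counter(values).most_common(top)


# Same values as col1 in the example above: 0 appears four times and 1 three times.
top_values = most_frequent([0, 1, 1, 0, 0, 1, 0], top=5)
```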
arkas.analyzer.TemporalContinuousColumnAnalyzer ¶
Bases: BaseLazyAnalyzer
Implement an analyzer that analyzes the temporal distribution of a column with continuous values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
target_column | str | The column to analyze. | required |
temporal_column | str | The temporal column in the DataFrame. | required |
period | str \| None | An optional temporal period e.g. monthly or daily. | None |
nan_policy | str | The policy on how to handle NaN values in the input arrays. The following options are available: … | 'propagate' |
figure_config | BaseFigureConfig \| None | The figure configuration. | None |
Example usage:
>>> from datetime import datetime, timezone
>>> import polars as pl
>>> from arkas.analyzer import TemporalContinuousColumnAnalyzer
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 0, 1],
... "col2": [0, 1, 2, 3],
... "datetime": [
... datetime(year=2020, month=1, day=3, tzinfo=timezone.utc),
... datetime(year=2020, month=2, day=3, tzinfo=timezone.utc),
... datetime(year=2020, month=3, day=3, tzinfo=timezone.utc),
... datetime(year=2020, month=4, day=3, tzinfo=timezone.utc),
... ],
... },
... schema={
... "col1": pl.Int64,
... "col2": pl.Int64,
... "datetime": pl.Datetime(time_unit="us", time_zone="UTC"),
... },
... )
>>> analyzer = TemporalContinuousColumnAnalyzer(
... target_column="col2", temporal_column="datetime"
... )
>>> analyzer
TemporalContinuousColumnAnalyzer(target_column='col2', temporal_column='datetime', period=None, nan_policy='propagate', figure_config=None)
>>> output = analyzer.analyze(frame)
>>> output
TemporalContinuousColumnOutput(
(state): TemporalColumnState(dataframe=(4, 3), target_column='col2', temporal_column='datetime', period=None, nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)
arkas.analyzer.TemporalNullValueAnalyzer ¶
Bases: BaseInNLazyAnalyzer
Implement an analyzer that analyzes the number of null values in a DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
temporal_column | str | The temporal column in the DataFrame. | required |
period | str | The temporal period e.g. monthly or daily. | required |
columns | Sequence[str] \| None | The columns to analyze. If … | None |
exclude_columns | Sequence[str] | The columns to exclude from the input … | () |
missing_policy | str | The policy on how to handle missing columns. The following options are available: … | 'raise' |
figure_config | BaseFigureConfig \| None | The figure configuration. | None |
Example usage:
>>> from datetime import datetime, timezone
>>> import polars as pl
>>> from arkas.analyzer import TemporalNullValueAnalyzer
>>> analyzer = TemporalNullValueAnalyzer(temporal_column="datetime", period="1d")
>>> analyzer
TemporalNullValueAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', temporal_column='datetime', period='1d', figure_config=None)
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 1, 0],
... "col2": [0, 1, 0, 1],
... "col3": [1, 0, 0, 0],
... "datetime": [
... datetime(year=2020, month=1, day=3, tzinfo=timezone.utc),
... datetime(year=2020, month=2, day=3, tzinfo=timezone.utc),
... datetime(year=2020, month=3, day=3, tzinfo=timezone.utc),
... datetime(year=2020, month=4, day=3, tzinfo=timezone.utc),
... ],
... },
... schema={
... "col1": pl.Int64,
... "col2": pl.Int64,
... "col3": pl.Int64,
... "datetime": pl.Datetime(time_unit="us", time_zone="UTC"),
... },
... )
>>> output = analyzer.analyze(frame)
>>> output
TemporalNullValueOutput(
(state): TemporalDataFrameState(dataframe=(4, 4), temporal_column='datetime', period='1d', nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)
arkas.analyzer.TemporalPlotColumnAnalyzer ¶
Bases: BaseInNLazyAnalyzer
Implement an analyzer that plots the content of each column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
temporal_column | str | The temporal column in the DataFrame. | required |
period | str \| None | An optional temporal period e.g. monthly or daily. | None |
columns | Sequence[str] \| None | The columns to analyze. If … | None |
exclude_columns | Sequence[str] | The columns to exclude from the input … | () |
missing_policy | str | The policy on how to handle missing columns. The following options are available: … | 'raise' |
figure_config | BaseFigureConfig \| None | The figure configuration. | None |
Example usage:
>>> from datetime import datetime, timezone
>>> import polars as pl
>>> from arkas.analyzer import TemporalPlotColumnAnalyzer
>>> analyzer = TemporalPlotColumnAnalyzer(temporal_column="datetime")
>>> analyzer
TemporalPlotColumnAnalyzer(columns=None, exclude_columns=(), missing_policy='raise', temporal_column='datetime', period=None, figure_config=None)
>>> frame = pl.DataFrame(
... {
... "col1": [0, 1, 1, 0],
... "col2": [0, 1, 0, 1],
... "col3": [1, 0, 0, 0],
... "datetime": [
... datetime(year=2020, month=1, day=3, tzinfo=timezone.utc),
... datetime(year=2020, month=2, day=3, tzinfo=timezone.utc),
... datetime(year=2020, month=3, day=3, tzinfo=timezone.utc),
... datetime(year=2020, month=4, day=3, tzinfo=timezone.utc),
... ],
... },
... schema={
... "col1": pl.Int64,
... "col2": pl.Int64,
... "col3": pl.Int64,
... "datetime": pl.Datetime(time_unit="us", time_zone="UTC"),
... },
... )
>>> output = analyzer.analyze(frame)
>>> output
TemporalPlotColumnOutput(
(state): TemporalDataFrameState(dataframe=(4, 4), temporal_column='datetime', period=None, nan_policy='propagate', figure_config=MatplotlibFigureConfig())
)
arkas.analyzer.TransformAnalyzer ¶
Bases: BaseAnalyzer
Implement an analyzer that transforms the data before analyzing it.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transformer | BaseTransformer \| dict | The transformer or its configuration. | required |
analyzer | BaseAnalyzer \| dict | The analyzer or its configuration. | required |
Example usage:
>>> import polars as pl
>>> from arkas.analyzer import AccuracyAnalyzer, TransformAnalyzer
>>> from grizz.transformer import DropNullRow
>>> analyzer = TransformAnalyzer(
... transformer=DropNullRow(), analyzer=AccuracyAnalyzer(y_true="target", y_pred="pred")
... )
>>> analyzer
TransformAnalyzer(
(transformer): DropNullRowTransformer(columns=None, exclude_columns=(), missing_policy='raise')
(analyzer): AccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
)
>>> frame = pl.DataFrame(
... {"pred": [3, 2, 0, 1, 0, 1, None], "target": [3, 2, 0, 1, 0, 1, None]}
... )
>>> output = analyzer.analyze(frame)
>>> output
AccuracyOutput(
(state): AccuracyState(y_true=(6,), y_pred=(6,), y_true_name='target', y_pred_name='pred', nan_policy='propagate')
)
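The transform-then-analyze composition can be sketched with two plain functions. Everything here is a stand-in: `drop_null_rows` imitates a row-dropping transformer on a dictionary of columns, and the analyzer is a simple callable; neither is the arkas or grizz implementation.

```python
from collections.abc import Callable


def drop_null_rows(frame: dict[str, list]) -> dict[str, list]:
    """Return a copy of the frame without the rows that contain a None value."""
    names = list(frame)
    rows = [row for row in zip(*frame.values()) if all(value is not None for value in row)]
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def transform_then_analyze(
    frame: dict[str, list],
    transformer: Callable[[dict], dict],
    analyzer: Callable[[dict], object],
) -> object:
    """Apply the transformer first, then run the analyzer on the transformed data."""
    return analyzer(transformer(frame))


# Same values as the example above: the last row is null in both columns.
frame = {"pred": [3, 2, 0, 1, 0, 1, None], "target": [3, 2, 0, 1, 0, 1, None]}
num_rows = transform_then_analyze(frame, drop_null_rows, lambda f: len(f["pred"]))
```

The analyzer sees six rows, which is consistent with the `y_true=(6,)` shape reported in the example output.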
arkas.analyzer.is_analyzer_config ¶
is_analyzer_config(config: dict) -> bool
Indicate if the input configuration is a configuration for a BaseAnalyzer.
This function only checks if the value of the key _target_ is valid. It does not check the other values. If _target_ indicates a function, the returned type hint is used to check the class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config | dict | The configuration to check. | required |
Returns:
Type | Description |
---|---|
bool | True if the input configuration is a configuration for a BaseAnalyzer, otherwise False. |
Example usage:
>>> from arkas.analyzer import is_analyzer_config
>>> is_analyzer_config({"_target_": "arkas.analyzer.AccuracyAnalyzer"})
True
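A `_target_` check of this kind can be approximated with `importlib`. The sketch below is an assumption-laden illustration: `resolve_target` and `is_config_for` are hypothetical helpers, standard-library classes stand in for analyzers, and the function case (checking the returned type hint) is not modelled. It does not reproduce the library's own factory machinery.

```python
import importlib


def resolve_target(path: str) -> object:
    """Import and return the object designated by a dotted path such as 'pkg.mod.Name'."""
    module_name, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def is_config_for(config: dict, base: type) -> bool:
    """Check only that the '_target_' value designates a subclass of `base`."""
    target = resolve_target(config["_target_"])
    return isinstance(target, type) and issubclass(target, base)


# Stand-in targets from the standard library.
is_dict_config = is_config_for({"_target_": "collections.OrderedDict"}, dict)
is_not_dict_config = is_config_for({"_target_": "fractions.Fraction"}, dict)
```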
arkas.analyzer.setup_analyzer ¶
setup_analyzer(analyzer: BaseAnalyzer | dict) -> BaseAnalyzer
Set up an analyzer.
The analyzer is instantiated from its configuration by using the BaseAnalyzer factory function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
analyzer | BaseAnalyzer \| dict | An analyzer or its configuration. | required |
Returns:
Type | Description |
---|---|
BaseAnalyzer | An instantiated analyzer. |
Example usage:
>>> from arkas.analyzer import setup_analyzer
>>> analyzer = setup_analyzer(
... {
... "_target_": "arkas.analyzer.AccuracyAnalyzer",
... "y_true": "target",
... "y_pred": "pred",
... }
... )
>>> analyzer
AccuracyAnalyzer(y_true='target', y_pred='pred', drop_nulls=True, missing_policy='raise', nan_policy='propagate')
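The "object or configuration" pattern used by this function can be sketched as follows. `setup_from_config` is a hypothetical helper: it returns non-dict inputs unchanged and otherwise imports the class named by `_target_` and calls it with the remaining keys. A standard-library class is used as the target, and this is not the factory that arkas relies on.

```python
import importlib
from fractions import Fraction


def setup_from_config(obj_or_config: object) -> object:
    """Return the object as is, or instantiate it from a '_target_' configuration."""
    if not isinstance(obj_or_config, dict):
        return obj_or_config  # already an instantiated object
    kwargs = dict(obj_or_config)  # copy so the caller's configuration is not modified
    module_name, _, attr = kwargs.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_name), attr)
    return cls(**kwargs)


# Stand-in target: build a Fraction from its configuration.
half = setup_from_config({"_target_": "fractions.Fraction", "numerator": 1, "denominator": 2})
# Passing an existing object returns it unchanged.
same = setup_from_config(half)
```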