polars

sumeh.engines.polars_engine

all_date_checks

all_date_checks(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Default date check - filters past dates.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

are_complete

are_complete(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where any of the specified fields are null and adds a data quality status column.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be checked.

required
rule RuleDef

Rule definition containing fields (list), check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.
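As a rough illustration of the completeness logic described above, the following plain-Python sketch flags rows in which any listed field is null. It uses neither Polars nor sumeh. The helper name `find_incomplete` and the text written into `dq_status` are assumptions made for the example, not the library's API.

```python
from typing import Any, Dict, List


def find_incomplete(rows: List[Dict[str, Any]], fields: List[str]) -> List[Dict[str, Any]]:
    """Return copies of the rows where at least one of `fields` is None."""
    # Hypothetical status label; the real dq_status format is not documented here.
    status = "are_complete:" + ",".join(fields)
    violations = []
    for row in rows:
        if any(row.get(field) is None for field in fields):
            violations.append({**row, "dq_status": status})
    return violations


rows = [
    {"id": 1, "name": "Ana", "email": "ana@example.com"},
    {"id": 2, "name": None, "email": "bob@example.com"},
    {"id": 3, "name": "Cy", "email": None},
]
incomplete = find_incomplete(rows, ["name", "email"])
```

Only the violating rows come back, each carrying the extra status column, which mirrors the "violations and dq_status" shape described in the Returns entry.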

are_unique

are_unique(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Identifies duplicate rows based on a combination of specified fields and adds a data quality status column.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to check.

required
rule RuleDef

Rule definition containing fields (list), check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: DataFrame containing rows where the field combination is not unique with dq_status column.
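The composite-key duplicate check can be sketched in plain Python as below. This is not sumeh's implementation. The helper `find_duplicates` is hypothetical, and returning every member of a duplicated group (rather than only the later occurrences) is an assumption.

```python
from collections import Counter
from typing import Any, Dict, List


def find_duplicates(rows: List[Dict[str, Any]], fields: List[str]) -> List[Dict[str, Any]]:
    """Return rows whose combination of `fields` occurs more than once."""
    keys = [tuple(row[field] for field in fields) for row in rows]
    counts = Counter(keys)
    # Assumption: all rows in a duplicated group are reported, including the first.
    return [row for row, key in zip(rows, keys) if counts[key] > 1]


orders = [
    {"id": 1, "customer": "a", "day": 1},
    {"id": 2, "customer": "a", "day": 1},
    {"id": 3, "customer": "b", "day": 2},
]
duplicated = find_duplicates(orders, ["customer", "day"])
```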

extract_schema

extract_schema(df: DataFrame) -> List[Dict[str, Any]]

Extracts schema from Polars DataFrame.

Parameters:

Name Type Description Default
df DataFrame

Input Polars DataFrame.

required

Returns:

Type Description
List[Dict[str, Any]]

List[Dict[str, Any]]: List of dictionaries containing field information.

has_cardinality

has_cardinality(df: DataFrame, rule: RuleDef) -> dict

Checks if the cardinality (distinct count) of the specified field meets expectations.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be checked.

required
rule RuleDef

Rule definition containing field, value (expected cardinality), and optional threshold.

required

Returns:

Name Type Description
dict dict

Validation result with status, expected, actual, and message.
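To make the shape of a table-level result concrete, here is a minimal sketch of a cardinality check in plain Python. The keys `status`, `expected`, `actual`, and `message` come from the description above. The `"PASS"`/`"FAIL"` strings, the exclusion of nulls, the exact-equality comparison, and the helper name `check_cardinality` are all assumptions, and the optional threshold is deliberately left out because its semantics are not specified here.

```python
from typing import Any, Dict, Iterable


def check_cardinality(values: Iterable[Any], expected: int) -> Dict[str, Any]:
    """Compare the number of distinct non-null values against `expected`."""
    actual = len({v for v in values if v is not None})  # assumption: nulls ignored
    passed = actual == expected  # assumption: exact match required
    return {
        "status": "PASS" if passed else "FAIL",  # hypothetical status labels
        "expected": expected,
        "actual": actual,
        "message": None if passed else f"expected {expected} distinct values, found {actual}",
    }


result = check_cardinality(["a", "b", "a", None], expected=2)
```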

has_entropy

has_entropy(df: DataFrame, rule: RuleDef) -> dict

Checks if the entropy of the specified field meets expectations.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to evaluate.

required
rule RuleDef

Rule definition containing field, value (expected entropy), and optional threshold.

required

Returns:

Name Type Description
dict dict

Validation result with status, expected, actual, and message.
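For readers unfamiliar with the metric, the sketch below computes Shannon entropy over the observed values of a column. It is a generic formula in plain Python, not the library's code. The base-2 logarithm is an assumption, since the page does not say which base sumeh uses.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def shannon_entropy(values: Iterable[Hashable]) -> float:
    """Entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    # An empty input yields an empty sum, so the result is 0 without dividing by zero.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


balanced = shannon_entropy(["a", "b", "a", "b"])
constant = shannon_entropy(["x", "x", "x", "x"])
```

A column with two equally frequent values carries one bit of entropy, while a constant column carries none.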

has_infogain

has_infogain(df: DataFrame, rule: RuleDef) -> dict

Checks if the information gain of the specified field meets expectations.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to evaluate.

required
rule RuleDef

Rule definition containing field, value (expected info gain), and optional threshold.

required

Returns:

Name Type Description
dict dict

Validation result with status, expected, actual, and message.

has_max

has_max(df: DataFrame, rule: RuleDef) -> dict

Checks if the maximum value of the specified field meets expectations.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be checked.

required
rule RuleDef

Rule definition containing field, value (expected max), and optional threshold.

required

Returns:

Name Type Description
dict dict

Validation result with status, expected, actual, and message.

has_mean

has_mean(df: DataFrame, rule: RuleDef) -> dict

Checks if the mean of the specified field meets expectations.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame.

required
rule RuleDef

Rule definition containing field, value (expected mean), and optional threshold.

required

Returns:

Name Type Description
dict dict

Validation result with status, expected, actual, and message.

has_min

has_min(df: DataFrame, rule: RuleDef) -> dict

Checks if the minimum value of the specified field meets expectations.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be checked.

required
rule RuleDef

Rule definition containing field, value (expected min), and optional threshold.

required

Returns:

Name Type Description
dict dict

Validation result with status, expected, actual, and message.

has_pattern

has_pattern(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field does not match the given regex pattern.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule RuleDef

Rule definition containing field, check_type, and value (regex pattern).

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.
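A minimal sketch of the pattern check with Python's `re` module follows. The helper `pattern_violations` is hypothetical, and two behaviours are assumptions: nulls count as violations, and the pattern may match anywhere in the string (as with `re.search`) unless it is anchored.

```python
import re
from typing import List, Optional


def pattern_violations(values: List[Optional[str]], pattern: str) -> List[Optional[str]]:
    """Return the values that do not match `pattern`."""
    regex = re.compile(pattern)
    # Assumption: a null value cannot match and is therefore reported.
    return [v for v in values if v is None or regex.search(v) is None]


codes = ["AB123", "ab123", "XY9", None]
bad_codes = pattern_violations(codes, r"^[A-Z]{2}\d{3}$")
```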

has_std

has_std(df: DataFrame, rule: RuleDef) -> dict

Checks if the standard deviation of the specified field meets expectations.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame.

required
rule RuleDef

Rule definition containing field, value (expected std), and optional threshold.

required

Returns:

Name Type Description
dict dict

Validation result with status, expected, actual, and message.

has_sum

has_sum(df: DataFrame, rule: RuleDef) -> dict

Checks if the sum of the specified field meets expectations.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame.

required
rule RuleDef

Rule definition containing field, value (expected sum), and optional threshold.

required

Returns:

Name Type Description
dict dict

Validation result with status, expected, actual, and message.

is_between

is_between(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field is not within the given range.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame.

required
rule RuleDef

Rule definition containing field, check_type, and value (range format).

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_complete

is_complete(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field is null and adds a data quality status column.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be checked.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_contained_in

is_contained_in(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field is not in the given list of values.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule RuleDef

Rule definition containing field, check_type, and value (list format).

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_date_after

is_date_after(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field has a date earlier than the date given in the rule.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be checked.

required
rule RuleDef

Rule definition containing field, check_type, and value (target date).

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_date_before

is_date_before(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field has a date later than the date given in the rule.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be checked.

required
rule RuleDef

Rule definition containing field, check_type, and value (target date).

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_date_between

is_date_between(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the date field is not between two dates given in the format "[<start_date>, <end_date>]".

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be checked.

required
rule RuleDef

Rule definition containing field, check_type, and value (date range).

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.
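The date-range check can be pictured with the sketch below, which parses a bracketed pair of dates and reports everything outside it. The ISO `YYYY-MM-DD` date syntax, the inclusive bounds, and the helper names are assumptions made for illustration only.

```python
from datetime import date
from typing import List, Tuple


def parse_date_range(value: str) -> Tuple[date, date]:
    """Parse a string such as "[2024-01-01, 2024-01-31]" into two dates."""
    start_text, end_text = (part.strip() for part in value.strip("[] ").split(","))
    return date.fromisoformat(start_text), date.fromisoformat(end_text)


def outside_range(dates: List[date], value: str) -> List[date]:
    """Return the dates that fall outside the range (bounds assumed inclusive)."""
    start, end = parse_date_range(value)
    return [d for d in dates if not (start <= d <= end)]


sample_dates = [date(2023, 12, 31), date(2024, 1, 15), date(2024, 2, 1)]
out_of_range = outside_range(sample_dates, "[2024-01-01, 2024-01-31]")
```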

is_equal

is_equal(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field is not equal to the given value.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_equal_than

is_equal_than(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Alias for is_equal.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_future_date

is_future_date(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field has a date later than the current date.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be checked.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_greater_or_equal_than

is_greater_or_equal_than(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field is less than the given value.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_greater_than

is_greater_than(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field is less than or equal to the given value.

Parameters:

Name Type Description Default
df DataFrame

The Polars DataFrame to filter.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_in

is_in(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Alias for is_contained_in.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to evaluate.

required
rule RuleDef

Rule definition.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations.

is_in_billions

is_in_billions(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the field value is less than 1,000,000,000.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_in_millions

is_in_millions(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the field value is less than 1,000,000.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to filter and modify.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_legit

is_legit(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field is null or does not match a non-whitespace pattern.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be validated.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.
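One way to read "null or does not match a non-whitespace pattern" is sketched below. The regular expression `^\S+$` (one or more non-whitespace characters and nothing else) is an assumption about what that pattern looks like. The actual expression used by the library is not shown on this page.

```python
import re
from typing import Any

# Assumed pattern: the whole value consists of non-whitespace characters.
NON_WHITESPACE = re.compile(r"^\S+$")


def is_legit_value(value: Any) -> bool:
    """True when the value is non-null and matches the assumed pattern."""
    return value is not None and NON_WHITESPACE.match(str(value)) is not None


candidates = ["abc", None, "a b", "", "   "]
not_legit = [v for v in candidates if not is_legit_value(v)]
```

Under this reading, empty strings, blank strings, and values containing spaces are all reported alongside nulls.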

is_less_or_equal_than

is_less_or_equal_than(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field is greater than the given value.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_less_than

is_less_than(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field is greater than or equal to the given value.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.
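The four ordering checks (`is_greater_than`, `is_greater_or_equal_than`, `is_less_than`, `is_less_or_equal_than`) each return the rows that fail the named condition, so the filter applied is the logical opposite of the function name. The sketch below spells out that mapping in plain Python. The table `VIOLATION_OPS` and the helper `ordering_violations` are illustrative names, not part of sumeh.

```python
import operator
from typing import Callable, Dict, List

# Each check maps to the comparison that identifies a violation.
VIOLATION_OPS: Dict[str, Callable[[float, float], bool]] = {
    "is_greater_than": operator.le,           # violation: value <= threshold
    "is_greater_or_equal_than": operator.lt,  # violation: value < threshold
    "is_less_than": operator.ge,              # violation: value >= threshold
    "is_less_or_equal_than": operator.gt,     # violation: value > threshold
}


def ordering_violations(values: List[float], check: str, threshold: float) -> List[float]:
    """Return the values that violate the named ordering check."""
    is_violation = VIOLATION_OPS[check]
    return [v for v in values if is_violation(v, threshold)]


amounts = [1, 5, 10]
```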

is_negative

is_negative(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field is non-negative and adds a data quality status column.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_on_friday

is_on_friday(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where date is not Friday.

is_on_monday

is_on_monday(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where date is not Monday.

is_on_saturday

is_on_saturday(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where date is not Saturday.

is_on_sunday

is_on_sunday(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where date is not Sunday.

is_on_thursday

is_on_thursday(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where date is not Thursday.

is_on_tuesday

is_on_tuesday(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where date is not Tuesday.

is_on_wednesday

is_on_wednesday(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where date is not Wednesday.

is_on_weekday

is_on_weekday(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the date field does not fall on a weekday (Mon-Fri).

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_on_weekend

is_on_weekend(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the date field does not fall on a weekend (Sat-Sun).

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.
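The day-of-week checks can be sketched with `datetime.date.weekday()`, which numbers Monday as 0 and Sunday as 6. The helper names below are hypothetical, and the sketch only illustrates the filtering logic, not how the Polars engine computes it.

```python
from datetime import date
from typing import List


def weekday_violations(dates: List[date]) -> List[date]:
    """Dates that are NOT Monday to Friday."""
    return [d for d in dates if d.weekday() >= 5]


def weekend_violations(dates: List[date]) -> List[date]:
    """Dates that are NOT Saturday or Sunday."""
    return [d for d in dates if d.weekday() < 5]


def day_violations(dates: List[date], weekday_index: int) -> List[date]:
    """Dates that do not fall on the given day (0 = Monday ... 6 = Sunday)."""
    return [d for d in dates if d.weekday() != weekday_index]


# 2024-01-05 is a Friday; the following days are Saturday, Sunday, Monday.
sample_days = [date(2024, 1, 5), date(2024, 1, 6), date(2024, 1, 7), date(2024, 1, 8)]
```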

is_past_date

is_past_date(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field has a date earlier than the current date.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be checked.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_positive

is_positive(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field is negative and adds a data quality status column.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_t_minus_1

is_t_minus_1(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the date field does not equal one day ago (T-1).

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be filtered.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_t_minus_2

is_t_minus_2(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the date field does not equal two days ago (T-2).

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be filtered.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_t_minus_3

is_t_minus_3(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the date field does not equal three days ago (T-3).

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be filtered.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.
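The T-1, T-2, and T-3 checks differ only in the offset subtracted from the current date. A minimal sketch, with a hypothetical helper that takes the reference date as a parameter so the result is reproducible:

```python
from datetime import date, timedelta
from typing import List


def t_minus_violations(dates: List[date], days_back: int, today: date) -> List[date]:
    """Return the dates that are not exactly `days_back` days before `today`."""
    target = today - timedelta(days=days_back)
    return [d for d in dates if d != target]


reference_day = date(2024, 3, 10)
loaded = [date(2024, 3, 9), date(2024, 3, 8), date(2024, 3, 10)]
not_t_minus_1 = t_minus_violations(loaded, 1, reference_day)
```

In real use the reference date would be `date.today()`. Passing it explicitly is a design choice for this example that keeps the check deterministic in tests.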

is_today

is_today(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the date field does not equal today.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to filter.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

is_unique

is_unique(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Identifies duplicate rows based on the specified field and adds a data quality status column.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to check for uniqueness.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: DataFrame containing rows where the field is not unique with dq_status column.

is_yesterday

is_yesterday(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Alias for is_t_minus_1. Filters rows where the date field does not equal yesterday's date.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame.

required
rule RuleDef

Rule definition containing field, check_type, and value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

not_contained_in

not_contained_in(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field is in the given list of values.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule RuleDef

Rule definition containing field, check_type, and value (list format).

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.

not_in

not_in(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Alias for not_contained_in.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be filtered.

required
rule RuleDef

Rule definition.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations.

satisfies

satisfies(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified expression is not satisfied.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule RuleDef

Rule definition with value containing SQL expression.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.
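To show what "rows where the expression is not satisfied" means for a SQL predicate, the sketch below evaluates the expression with Python's built-in `sqlite3` module. This swaps in SQLite purely for illustration. It is not how the Polars engine evaluates the rule, SQL dialects differ, and the helper `unsatisfied_rows` is hypothetical.

```python
import sqlite3
from typing import List, Sequence, Tuple


def unsatisfied_rows(rows: Sequence[Tuple], columns: List[str], expression: str) -> List[Tuple]:
    """Return the rows for which the SQL `expression` evaluates to false."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(f"CREATE TABLE t ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO t VALUES ({placeholders})", rows)
        # The expression is interpolated as-is, so it must come from a trusted source.
        # Rows where the expression evaluates to NULL are not returned by NOT (...).
        return conn.execute(f"SELECT * FROM t WHERE NOT ({expression})").fetchall()
    finally:
        conn.close()


payments = [(1, 10), (2, -5), (3, 0)]
failing = unsatisfied_rows(payments, ["id", "amount"], "amount > 0")
```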

summarize

summarize(rules: List[RuleDef], total_rows: int, df_with_errors: Optional[DataFrame] = None, table_error: Optional[DataFrame] = None) -> pl.DataFrame

Summarizes validation results from both row-level and table-level checks.

Parameters:

Name Type Description Default
rules List[RuleDef]

List of all validation rules.

required
total_rows int

Total number of rows in the input DataFrame.

required
df_with_errors Optional[DataFrame]

DataFrame with row-level violations.

None
table_error Optional[DataFrame]

DataFrame with table-level results.

None

Returns:

Type Description
DataFrame

pl.DataFrame: Summary DataFrame with aggregated validation metrics.
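The kind of aggregation a summary step performs can be sketched as follows: count violations per rule and relate them to the total row count. The output keys (`rule`, `violations`, `pass_rate`) and the helper name are assumptions. The actual columns of the summary DataFrame are not listed on this page.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_violations(statuses: List[str], total_rows: int) -> List[Dict[str, Any]]:
    """Aggregate one status label per violating row into per-rule metrics."""
    counts = Counter(statuses)
    summary = []
    for rule, violations in sorted(counts.items()):
        # Guard against an empty input table to avoid dividing by zero.
        pass_rate = (total_rows - violations) / total_rows if total_rows else 1.0
        summary.append({"rule": rule, "violations": violations, "pass_rate": pass_rate})
    return summary


report = summarize_violations(["email:is_complete", "email:is_complete", "id:is_unique"], total_rows=10)
```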

validate

validate(df: DataFrame, rules: List[RuleDef]) -> Tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame]

Main validation function that orchestrates row-level and table-level validations.

Parameters:

Name Type Description Default
df DataFrame

Input Polars DataFrame to validate.

required
rules List[RuleDef]

List of all validation rules.

required

Returns:

Type Description
Tuple[DataFrame, DataFrame, DataFrame]

Tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame]:
- DataFrame with row-level violations and dq_status
- Raw row-level violations DataFrame
- Table-level summary DataFrame

validate_date_format

validate_date_format(df: DataFrame, rule: RuleDef) -> pl.DataFrame

Filters rows where the specified field does not match the date format given in the rule.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be checked.

required
rule RuleDef

Rule definition containing field, check_type, and value (date format).

required

Returns:

Type Description
DataFrame

pl.DataFrame: Filtered DataFrame with violations and dq_status column.
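A date-format check amounts to attempting a parse and reporting the values that fail. The sketch below uses `datetime.strptime` with `%`-style directives. Whether the rule's value uses that notation or another one (for example `YYYY-MM-DD`) is not stated here, so the format syntax and the helper names are assumptions.

```python
from datetime import datetime
from typing import List, Optional


def matches_format(text: Optional[str], fmt: str) -> bool:
    """True when `text` parses as a date under the strptime format `fmt`."""
    try:
        datetime.strptime(text, fmt)
        return True
    except (ValueError, TypeError):  # TypeError covers None and other non-strings
        return False


def format_violations(values: List[Optional[str]], fmt: str) -> List[Optional[str]]:
    """Return the values that do not conform to `fmt`."""
    return [v for v in values if not matches_format(v, fmt)]


raw_dates = ["2024-01-31", "31/01/2024", "2024-02-30", None]
badly_formatted = format_violations(raw_dates, "%Y-%m-%d")
```

Note that a parse-based check also rejects well-formed but impossible dates such as 30 February.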

validate_row_level

validate_row_level(df: DataFrame, rules: List[RuleDef]) -> Tuple[pl.DataFrame, pl.DataFrame]

Validates DataFrame at row level using specified rules.

Parameters:

Name Type Description Default
df DataFrame

Input Polars DataFrame to validate.

required
rules List[RuleDef]

List of row-level validation rules.

required

Returns:

Type Description
Tuple[DataFrame, DataFrame]

Tuple[pl.DataFrame, pl.DataFrame]:
- DataFrame with violations and dq_status column
- Raw violations DataFrame

validate_schema

validate_schema(df: DataFrame, expected) -> Tuple[bool, List[Dict[str, Any]]]

Validates the schema of a Polars DataFrame against an expected schema.

Parameters:

Name Type Description Default
df DataFrame

The Polars DataFrame whose schema is to be validated.

required
expected list

The expected schema.

required

Returns:

Type Description
Tuple[bool, List[Dict[str, Any]]]

Tuple[bool, List[Dict[str, Any]]]:
- Boolean indicating whether the schema matches
- List of schema errors/mismatches
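The `(matches, errors)` return shape can be illustrated with a small comparison routine. The dictionary keys `field` and `data_type`, the error wording, and the helper `compare_schema` are hypothetical. The structure that `extract_schema` actually produces is not spelled out on this page.

```python
from typing import Any, Dict, List, Tuple


def compare_schema(
    actual: List[Dict[str, Any]], expected: List[Dict[str, Any]]
) -> Tuple[bool, List[Dict[str, Any]]]:
    """Compare two schemas given as lists of {"field", "data_type"} dicts."""
    actual_types = {col["field"]: col["data_type"] for col in actual}
    errors: List[Dict[str, Any]] = []
    for col in expected:
        name, wanted = col["field"], col["data_type"]
        if name not in actual_types:
            errors.append({"field": name, "error": "missing"})
        elif actual_types[name] != wanted:
            errors.append({"field": name, "error": f"expected {wanted}, got {actual_types[name]}"})
    # The schema matches only when no errors were collected.
    return (not errors, errors)


actual_schema = [{"field": "id", "data_type": "Int64"}, {"field": "name", "data_type": "String"}]
expected_schema = [
    {"field": "id", "data_type": "Int64"},
    {"field": "name", "data_type": "Int64"},
    {"field": "email", "data_type": "String"},
]
schema_ok, schema_errors = compare_schema(actual_schema, expected_schema)
```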

validate_table_level

validate_table_level(df: DataFrame, rules: List[RuleDef]) -> pl.DataFrame

Validates DataFrame at table level using specified rules.

Parameters:

Name Type Description Default
df DataFrame

Input Polars DataFrame to validate.

required
rules List[RuleDef]

List of table-level validation rules.

required

Returns:

Type Description
DataFrame

pl.DataFrame: Summary DataFrame with validation results.