polars¶
sumeh.engines.polars_engine ¶
all_date_checks ¶
Default date check - filters past dates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be filtered. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
are_complete ¶
Filters rows where any of the specified fields are null and adds a data quality status column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be checked. | required |
| `rule` | `RuleDef` | Rule definition containing fields (list), check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
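
The completeness check described above can be illustrated with a minimal sketch. It uses plain Python dictionaries as a stand-in for a Polars DataFrame so that it runs without dependencies; the helper name `incomplete_rows` and the `dq_status` label format are assumptions made for illustration, not the engine's actual implementation.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def incomplete_rows(rows: List[Row], fields: List[str]) -> List[Row]:
    """Return rows where any of `fields` is None, tagged with a dq_status label."""
    # The label format is an assumption made for this sketch.
    status = f"{','.join(fields)}:are_complete"
    violations = []
    for row in rows:
        if any(row.get(name) is None for name in fields):
            violations.append({**row, "dq_status": status})
    return violations


rows = [
    {"id": 1, "name": "Ana", "email": "ana@example.com"},
    {"id": 2, "name": None, "email": "bob@example.com"},
    {"id": 3, "name": "Cy", "email": None},
]
violations = incomplete_rows(rows, ["name", "email"])
```

Only the rows with at least one missing value come back, each carrying the extra status field.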
are_unique ¶
Identifies duplicate rows based on a combination of specified fields and adds a data quality status column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to check. | required |
| `rule` | `RuleDef` | Rule definition containing fields (list), check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: DataFrame containing rows where the field combination is not unique with dq_status column. |
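
As a rough illustration of a multi-field uniqueness check, the sketch below counts each combination of field values and returns every row whose combination appears more than once. It is plain Python rather than Polars; the helper name `non_unique_rows` and the label format are hypothetical, and whether the engine returns all occurrences or only the repeats is not stated in this reference.

```python
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]


def non_unique_rows(rows: List[Row], fields: List[str]) -> List[Row]:
    """Return every row whose combination of `fields` occurs more than once."""
    keys = [tuple(row.get(name) for name in fields) for row in rows]
    counts = Counter(keys)
    # The dq_status label format is an assumption made for this sketch.
    status = f"{','.join(fields)}:are_unique"
    return [
        {**row, "dq_status": status}
        for row, key in zip(rows, keys)
        if counts[key] > 1
    ]


orders = [
    {"order_id": 1, "customer": "a", "day": "2024-01-01"},
    {"order_id": 2, "customer": "a", "day": "2024-01-01"},
    {"order_id": 3, "customer": "a", "day": "2024-01-02"},
    {"order_id": 4, "customer": "b", "day": "2024-01-01"},
]
duplicates = non_unique_rows(orders, ["customer", "day"])
```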
extract_schema ¶
Extracts schema from Polars DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input Polars DataFrame. | required |
Returns:
| Type | Description |
|---|---|
| `List[Dict[str, Any]]` | List[Dict[str, Any]]: List of dictionaries containing field information. |
has_cardinality ¶
Checks if the cardinality (distinct count) of the specified field meets expectations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be checked. | required |
| `rule` | `RuleDef` | Rule definition containing field, value (expected cardinality), and optional threshold. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | Validation result with status, expected, actual, and message. |
has_entropy ¶
Checks if the entropy of the specified field meets expectations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to evaluate. | required |
| `rule` | `RuleDef` | Rule definition containing field, value (expected entropy), and optional threshold. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | Validation result with status, expected, actual, and message. |
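
For readers unfamiliar with the metric, the sketch below computes Shannon entropy for a column of values in plain Python. The logarithm base (2, giving bits) is an assumption; this reference does not state which base or estimator the engine uses, and `shannon_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def shannon_entropy(values: Sequence[Hashable]) -> float:
    """Entropy in bits of the empirical distribution of `values`."""
    total = len(values)
    if total == 0:
        return 0.0
    counts = Counter(values)
    # H = -sum(p * log2(p)) over the observed categories.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


balanced = shannon_entropy(["a", "a", "b", "b"])   # two equally likely values
uniform4 = shannon_entropy(["a", "b", "c", "d"])   # four equally likely values
constant = shannon_entropy(["x", "x", "x"])        # a single repeated value
```

A constant column has zero entropy, and a column with k equally frequent values has entropy log2(k), which is why the metric is useful for spotting columns that have collapsed to a few values.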
has_infogain ¶
Checks if the information gain of the specified field meets expectations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to evaluate. | required |
| `rule` | `RuleDef` | Rule definition containing field, value (expected info gain), and optional threshold. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | Validation result with status, expected, actual, and message. |
has_max ¶
Checks if the maximum value of the specified field meets expectations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be checked. | required |
| `rule` | `RuleDef` | Rule definition containing field, value (expected max), and optional threshold. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | Validation result with status, expected, actual, and message. |
has_mean ¶
Checks if the mean of the specified field meets expectations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame. | required |
| `rule` | `RuleDef` | Rule definition containing field, value (expected mean), and optional threshold. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | Validation result with status, expected, actual, and message. |
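
The table-level checks all return a dictionary with status, expected, actual, and message. The sketch below shows one way such a result could be assembled for a mean check. The comparison rule (pass when the actual mean is within `threshold` of the expected value), the `"PASS"`/`"FAIL"` strings, and the helper name `check_mean` are all assumptions for illustration; the engine's actual comparison semantics are not documented here.

```python
from statistics import fmean
from typing import Any, Dict, Sequence


def check_mean(values: Sequence[float], expected: float,
               threshold: float = 0.0) -> Dict[str, Any]:
    """Compare the mean of `values` with `expected`, allowing an absolute tolerance."""
    actual = fmean(values)
    # Assumed rule: pass when |actual - expected| <= threshold.
    passed = abs(actual - expected) <= threshold
    message = None if passed else (
        f"mean {actual} differs from expected {expected} by more than {threshold}"
    )
    return {
        "status": "PASS" if passed else "FAIL",
        "expected": expected,
        "actual": actual,
        "message": message,
    }


ok = check_mean([10.0, 20.0, 30.0], expected=20.0, threshold=0.5)
off = check_mean([10.0, 20.0, 30.0], expected=25.0, threshold=1.0)
```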
has_min ¶
Checks if the minimum value of the specified field meets expectations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be checked. | required |
| `rule` | `RuleDef` | Rule definition containing field, value (expected min), and optional threshold. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | Validation result with status, expected, actual, and message. |
has_pattern ¶
Filters rows where the specified field does not match the given regex pattern.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be filtered. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value (regex pattern). | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
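
A pattern check of this kind can be sketched in plain Python with the standard `re` module. The sketch assumes search semantics (the pattern may match anywhere unless anchored) and treats null values as violations; neither choice is confirmed by this reference, and `pattern_violations` is a hypothetical helper name.

```python
import re
from typing import Any, Dict, List

Row = Dict[str, Any]


def pattern_violations(rows: List[Row], field: str, pattern: str) -> List[Row]:
    """Return rows whose `field` is None or does not match `pattern`."""
    regex = re.compile(pattern)
    # The dq_status label format is an assumption made for this sketch.
    status = f"{field}:has_pattern:{pattern}"
    violations = []
    for row in rows:
        value = row.get(field)
        if value is None or regex.search(str(value)) is None:
            violations.append({**row, "dq_status": status})
    return violations


rows = [{"sku": "AB-1234"}, {"sku": "ab-12"}, {"sku": None}]
bad = pattern_violations(rows, "sku", r"^[A-Z]{2}-\d{4}$")
```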
has_std ¶
Checks if the standard deviation of the specified field meets expectations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame. | required |
| `rule` | `RuleDef` | Rule definition containing field, value (expected std), and optional threshold. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | Validation result with status, expected, actual, and message. |
has_sum ¶
Checks if the sum of the specified field meets expectations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame. | required |
| `rule` | `RuleDef` | Rule definition containing field, value (expected sum), and optional threshold. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | Validation result with status, expected, actual, and message. |
is_between ¶
Filters rows where the specified field is not within the given range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value (range format). | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_complete ¶
Filters rows where the specified field is null and adds a data quality status column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be checked. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_contained_in ¶
Filters rows where the specified field is not in the given list of values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to filter. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value (list format). | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_date_after ¶
Filters rows where the specified field has a date earlier than the date given in the rule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be checked. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value (target date). | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_date_before ¶
Filters rows where the specified field has a date later than the date given in the rule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be checked. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value (target date). | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_date_between ¶
Filters rows where date field is not between two dates in format: "[
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be checked. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value (date range). | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_equal ¶
Filters rows where the specified field is not equal to the given value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be filtered. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_equal_than ¶
Alias for is_equal.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be filtered. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_future_date ¶
Filters rows where the specified field has a date later than the current date.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be checked. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_greater_or_equal_than ¶
Filters rows where the specified field is less than the given value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to filter. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_greater_than ¶
Filters rows where the specified field is less than or equal to the given value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The Polars DataFrame to filter. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_in ¶
Alias for is_contained_in.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input DataFrame to evaluate. | required |
| `rule` | `RuleDef` | Rule definition. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations. |
is_in_billions ¶
Filters rows where the field value is less than 1,000,000,000.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to filter. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_in_millions ¶
Filters rows where the field value is less than 1,000,000.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input DataFrame to filter and modify. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_legit ¶
Filters rows where the specified field is null or does not match a non-whitespace pattern.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be validated. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_less_or_equal_than ¶
Filters rows where the specified field is greater than the given value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be filtered. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_less_than ¶
Filters rows where the specified field is greater than or equal to the given value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to filter. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_negative ¶
Filters rows where the specified field is non-negative and adds a data quality status column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be filtered. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_on_friday ¶
Filters rows where date is not Friday.
is_on_monday ¶
Filters rows where date is not Monday.
is_on_saturday ¶
Filters rows where date is not Saturday.
is_on_sunday ¶
Filters rows where date is not Sunday.
is_on_thursday ¶
Filters rows where date is not Thursday.
is_on_tuesday ¶
Filters rows where date is not Tuesday.
is_on_wednesday ¶
Filters rows where date is not Wednesday.
is_on_weekday ¶
Filters rows where the date field does not fall on a weekday (Mon-Fri).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
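
The day-of-week checks share one idea: derive the weekday from the date and keep the rows that fall outside the allowed set. Here is a minimal plain-Python sketch of the weekday (Mon-Fri) case; the helper name `not_on_weekday` and the label format are hypothetical, and the engine itself operates on Polars date columns.

```python
from datetime import date
from typing import Any, Dict, List

Row = Dict[str, Any]


def not_on_weekday(rows: List[Row], field: str) -> List[Row]:
    """Return rows whose date in `field` falls on Saturday or Sunday."""
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    return [
        {**row, "dq_status": f"{field}:is_on_weekday"}  # assumed label format
        for row in rows
        if row[field].weekday() >= 5
    ]


rows = [
    {"trade_date": date(2024, 1, 5)},  # Friday
    {"trade_date": date(2024, 1, 6)},  # Saturday
    {"trade_date": date(2024, 1, 7)},  # Sunday
    {"trade_date": date(2024, 1, 8)},  # Monday
]
weekend_rows = not_on_weekday(rows, "trade_date")
```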
is_on_weekend ¶
Filters rows where the date field does not fall on a weekend (Sat-Sun).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_past_date ¶
Filters rows where the specified field has a date earlier than the current date.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be checked. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_positive ¶
Filters rows where the specified field is negative and adds a data quality status column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be filtered. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_t_minus_1 ¶
Filters rows where the date field does not equal one day ago (T-1).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input DataFrame to be filtered. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
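
The T-minus checks compare each date with a target computed relative to the current date. The sketch below generalizes this to any offset in plain Python; `not_t_minus` is a hypothetical helper, and the injectable `today` argument is an addition made here so the example is reproducible.

```python
from datetime import date, timedelta
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def not_t_minus(rows: List[Row], field: str, days_back: int,
                today: Optional[date] = None) -> List[Row]:
    """Return rows whose date in `field` is not exactly `days_back` days before today."""
    reference = today if today is not None else date.today()
    target = reference - timedelta(days=days_back)
    return [row for row in rows if row[field] != target]


rows = [
    {"load_date": date(2024, 2, 29)},
    {"load_date": date(2024, 2, 28)},
    {"load_date": date(2024, 3, 1)},
]
# With today fixed at 2024-03-01, T-1 is 2024-02-29 (2024 is a leap year).
stale = not_t_minus(rows, "load_date", days_back=1, today=date(2024, 3, 1))
```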
is_t_minus_2 ¶
Filters rows where the date field does not equal two days ago (T-2).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input DataFrame to be filtered. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_t_minus_3 ¶
Filters rows where the date field does not equal three days ago (T-3).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input DataFrame to be filtered. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_today ¶
Filters rows where the date field does not equal today.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input DataFrame to filter. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
is_unique ¶
Identifies duplicate rows based on the specified field and adds a data quality status column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to check for uniqueness. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: DataFrame containing rows where the field is not unique with dq_status column. |
is_yesterday ¶
Alias for is_t_minus_1. Filters rows where date field does not equal yesterday.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
not_contained_in ¶
Filters rows where the specified field is in the given list of values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to filter. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value (list format). | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
not_in ¶
Alias for not_contained_in.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input DataFrame to be filtered. | required |
| `rule` | `RuleDef` | Rule definition. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations. |
satisfies ¶
Filters rows where the specified expression is not satisfied.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be filtered. | required |
| `rule` | `RuleDef` | Rule definition with value containing SQL expression. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
summarize ¶
summarize(rules: List[RuleDef], total_rows: int, df_with_errors: Optional[DataFrame] = None, table_error: Optional[DataFrame] = None) -> pl.DataFrame
Summarizes validation results from both row-level and table-level checks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `rules` | `List[RuleDef]` | List of all validation rules. | required |
| `total_rows` | `int` | Total number of rows in the input DataFrame. | required |
| `df_with_errors` | `Optional[DataFrame]` | DataFrame with row-level violations. | `None` |
| `table_error` | `Optional[DataFrame]` | DataFrame with table-level results. | `None` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Summary DataFrame with aggregated validation metrics. |
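
To give a feel for what aggregated validation metrics might look like, the sketch below derives a violation count and pass rate per rule from a list of row-level status labels. The output keys (`rule`, `rows`, `violations`, `pass_rate`), the label strings, and the helper name `summarize_violations` are assumptions for illustration; the columns actually produced by `summarize` are not listed in this reference.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_violations(rule_labels: List[str], total_rows: int,
                         violation_labels: List[str]) -> List[Dict[str, Any]]:
    """Build one summary record per rule from the labels of violating rows."""
    counts = Counter(violation_labels)
    summary = []
    for label in rule_labels:
        violations = counts.get(label, 0)
        # Guard against an empty input table.
        pass_rate = (total_rows - violations) / total_rows if total_rows else 1.0
        summary.append({
            "rule": label,
            "rows": total_rows,
            "violations": violations,
            "pass_rate": pass_rate,
        })
    return summary


report = summarize_violations(
    rule_labels=["id:is_unique", "email:is_complete"],
    total_rows=4,
    violation_labels=["email:is_complete"],
)
```

Rules with no violations still appear in the report with a pass rate of 1.0, which keeps the summary aligned with the full rule list.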
validate ¶
Main validation function that orchestrates row-level and table-level validations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input Polars DataFrame to validate. | required |
| `rules` | `List[RuleDef]` | List of all validation rules. | required |
Returns:
| Type | Description |
|---|---|
| `Tuple[DataFrame, DataFrame, DataFrame]` | Tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame]: DataFrame with row-level violations and dq_status; raw row-level violations DataFrame; table-level summary DataFrame. |
validate_date_format ¶
Filters rows where the specified field does not match the date format given in the rule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input Polars DataFrame to be checked. | required |
| `rule` | `RuleDef` | Rule definition containing field, check_type, and value (date format). | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Filtered DataFrame with violations and dq_status column. |
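
A date-format check can be sketched by attempting to parse each value and collecting the rows that fail. The version below assumes `strptime`-style directives such as `%Y-%m-%d`; the format syntax the engine expects in the rule value is not specified here, and `bad_date_format` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Any, Dict, List

Row = Dict[str, Any]


def bad_date_format(rows: List[Row], field: str, fmt: str) -> List[Row]:
    """Return rows whose `field` cannot be parsed with the strptime format `fmt`."""
    violations = []
    for row in rows:
        try:
            datetime.strptime(row.get(field), fmt)
        except (TypeError, ValueError):
            # TypeError covers None; ValueError covers unparsable or impossible dates.
            violations.append({**row, "dq_status": f"{field}:validate_date_format:{fmt}"})
    return violations


rows = [
    {"created": "2024-01-15"},
    {"created": "15/01/2024"},
    {"created": "2024-02-30"},
    {"created": None},
]
malformed = bad_date_format(rows, "created", "%Y-%m-%d")
```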
validate_row_level ¶
Validates DataFrame at row level using specified rules.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input Polars DataFrame to validate. | required |
| `rules` | `List[RuleDef]` | List of row-level validation rules. | required |
Returns:
| Type | Description |
|---|---|
| `Tuple[DataFrame, DataFrame]` | Tuple[pl.DataFrame, pl.DataFrame]: DataFrame with violations and dq_status column; raw violations DataFrame. |
validate_schema ¶
Validates the schema of a Polars DataFrame against an expected schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The Polars DataFrame whose schema is to be validated. | required |
| `expected` | `list` | The expected schema. | required |
Returns:
| Type | Description |
|---|---|
| `Tuple[bool, List[Dict[str, Any]]]` | Tuple[bool, List[Dict[str, Any]]]: Boolean indicating whether the schema matches; list of schema errors/mismatches. |
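
The return shape, a boolean plus a list of mismatches, can be illustrated with a small comparison routine. The sketch assumes each schema entry is a dictionary with `field` and `data_type` keys; those key names, the error record layout, and the helper name `compare_schema` are hypothetical and not taken from the library.

```python
from typing import Any, Dict, List, Tuple

Column = Dict[str, Any]


def compare_schema(actual: List[Column],
                   expected: List[Column]) -> Tuple[bool, List[Dict[str, Any]]]:
    """Check that every expected column exists in `actual` with the same type."""
    actual_by_name = {col["field"]: col for col in actual}
    errors: List[Dict[str, Any]] = []
    for col in expected:
        found = actual_by_name.get(col["field"])
        if found is None:
            errors.append({"field": col["field"], "error": "missing"})
        elif found["data_type"] != col["data_type"]:
            errors.append({
                "field": col["field"],
                "error": "type mismatch",
                "expected": col["data_type"],
                "actual": found["data_type"],
            })
    return (not errors, errors)


actual = [{"field": "id", "data_type": "Int64"}, {"field": "name", "data_type": "String"}]
expected = [
    {"field": "id", "data_type": "Int64"},
    {"field": "name", "data_type": "Int64"},
    {"field": "email", "data_type": "String"},
]
matches, problems = compare_schema(actual, expected)
```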
validate_table_level ¶
Validates DataFrame at table level using specified rules.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input Polars DataFrame to validate. | required |
| `rules` | `List[RuleDef]` | List of table-level validation rules. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pl.DataFrame: Summary DataFrame with validation results. |