polars

sumeh.engines.polars_engine

This module provides a set of data quality validation functions using the Polars library. It includes various checks for data validation, such as completeness, uniqueness, range checks, pattern matching, and schema validation.

Functions:

Name Description
is_positive

Filters rows where the specified field is less than zero.

is_negative

Filters rows where the specified field is greater than or equal to zero.

is_complete

Filters rows where the specified field is null.

is_unique

Filters rows with duplicate values in the specified field.

are_complete

Filters rows where any of the specified fields are null.

are_unique

Filters rows with duplicate combinations of the specified fields.

is_greater_than

Filters rows where the specified field is less than or equal to the given value.

is_greater_or_equal_than

Filters rows where the specified field is less than the given value.

is_less_than

Filters rows where the specified field is greater than or equal to the given value.

is_less_or_equal_than

Filters rows where the specified field is greater than the given value.

is_equal

Filters rows where the specified field is not equal to the given value.

is_equal_than

Alias for is_equal.

is_in_millions

Retains rows where the field value is less than 1,000,000 and flags them with dq_status.

is_in_billions

Retains rows where the field value is less than 1,000,000,000 and flags them with dq_status.

is_t_minus_1

Retains rows where the date field does not equal yesterday (T-1) and flags them with dq_status.

is_t_minus_2

Retains rows where the date field does not equal two days ago (T-2) and flags them with dq_status.

is_t_minus_3

Retains rows where the date field does not equal three days ago (T-3) and flags them with dq_status.

is_today

Retains rows where the date field does not equal today and flags them with dq_status.

is_yesterday

Retains rows where the date field does not equal yesterday and flags them with dq_status.

is_on_weekday

Retains rows where the date field does not fall on a weekday (Mon-Fri) and flags them with dq_status.

is_on_weekend

Retains rows where the date field is not on a weekend (Sat-Sun) and flags them with dq_status.

is_on_monday

Retains rows where the date field is not on Monday and flags them with dq_status.

is_on_tuesday

Retains rows where the date field is not on Tuesday and flags them with dq_status.

is_on_wednesday

Retains rows where the date field is not on Wednesday and flags them with dq_status.

is_on_thursday

Retains rows where the date field is not on Thursday and flags them with dq_status.

is_on_friday

Retains rows where the date field is not on Friday and flags them with dq_status.

is_on_saturday

Retains rows where the date field is not on Saturday and flags them with dq_status.

is_on_sunday

Retains rows where the date field is not on Sunday and flags them with dq_status.

is_contained_in

Filters rows where the specified field is not in the given list of values.

not_contained_in

Filters rows where the specified field is in the given list of values.

is_between

Filters rows where the specified field is not within the given range.

has_pattern

Filters rows where the specified field does not match the given regex pattern.

is_legit

Filters rows where the specified field is null or contains whitespace.

has_max

Filters rows where the specified field exceeds the given maximum value.

has_min

Filters rows where the specified field is below the given minimum value.

has_std

Checks if the standard deviation of the specified field exceeds the given value.

has_mean

Checks if the mean of the specified field exceeds the given value.

has_sum

Checks if the sum of the specified field exceeds the given value.

has_cardinality

Checks if the cardinality (number of unique values) of the specified field exceeds the given value.

has_infogain

Placeholder for information gain validation (currently uses cardinality).

has_entropy

Placeholder for entropy validation (currently uses cardinality).

satisfies

Filters rows that do not satisfy the given SQL condition.

validate_date_format

Filters rows where the specified field does not match the expected date format or is null.

is_future_date

Filters rows where the specified date field is after today.

is_past_date

Filters rows where the specified date field is before today.

is_date_between

Filters rows where the specified date field is not within the given [start,end] range.

is_date_after

Filters rows where the specified date field is before the given date.

is_date_before

Filters rows where the specified date field is after the given date.

all_date_checks

Alias for is_past_date (checks date against today).

validate

Validates a DataFrame against a list of rules and returns the original DataFrame with data quality status and a DataFrame of violations.

__build_rules_df

Converts a list of rules into a Polars DataFrame for summarization.

summarize

Summarizes the results of data quality checks, including pass rates and statuses.

__polars_schema_to_list

Converts a Polars DataFrame schema into a list of dictionaries.

validate_schema

Validates the schema of a DataFrame against an expected schema and returns a boolean result and a list of errors.

__build_rules_df

__build_rules_df(rules: list[dict]) -> pl.DataFrame

Builds a Polars DataFrame from a list of rule dictionaries.

This function processes a list of rule dictionaries, filters out rules that are not marked for execution, and constructs a DataFrame with the relevant rule information. It ensures uniqueness of rows based on specific columns and casts the data to appropriate types.

Parameters:

Name Type Description Default
rules list[dict]

A list of dictionaries, where each dictionary represents a rule. Each rule dictionary may contain the following keys:
  • "field" (str or list): The column(s) the rule applies to.
  • "check_type" (str): The type of rule or check.
  • "threshold" (float, optional): The pass threshold for the rule. Defaults to 1.0.
  • "value" (any, optional): Additional value associated with the rule.
  • "execute" (bool, optional): Whether the rule should be executed. Defaults to True.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A Polars DataFrame containing the processed rules with the following columns:
  • "column" (str): The column(s) the rule applies to, joined by commas if multiple.
  • "rule" (str): The type of rule or check.
  • "pass_threshold" (float): The pass threshold for the rule.
  • "value" (str): The value associated with the rule, or an empty string if not provided.

all_date_checks

all_date_checks(df: DataFrame, rule: dict) -> pl.DataFrame

Applies all date-related validation checks on the given DataFrame based on the specified rule.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to validate.

required
rule dict

A dictionary containing the validation rules to apply.

required

Returns:

Type Description
DataFrame

pl.DataFrame: The DataFrame after applying the date validation checks.

are_complete

are_complete(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to identify rows where specified fields contain null values and tags them with a data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be checked.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'fields': A list of column names to check for null values. - 'check': A string representing the type of check (e.g., "is_null"). - 'value': A value associated with the check (not used in this function).

required

Returns:

Type Description
DataFrame

pl.DataFrame: A filtered DataFrame containing only rows where at least one of the specified fields is null, with an additional column "dq_status" indicating the data quality status.

are_unique

are_unique(df: DataFrame, rule: dict) -> pl.DataFrame

Checks for duplicate combinations of specified fields in a Polars DataFrame and returns a DataFrame containing the rows with duplicates along with a data quality status column.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to check for duplicates.

required
rule dict

A dictionary containing the rule parameters. It is expected to include the following keys: - 'fields': A list of column names to check for uniqueness. - 'check': A string representing the type of check (e.g., "unique"). - 'value': A value associated with the check (e.g., "True").

required

Returns:

Type Description
DataFrame

pl.DataFrame: A DataFrame containing rows with duplicate combinations of the specified fields. An additional column, "dq_status", is added to indicate the data quality status in the format "{fields}:{check}:{value}".

extract_schema

extract_schema(df) -> List[Dict[str, Any]]

Converts the schema of a Polars DataFrame into a list of dictionaries, where each dictionary represents a field in the schema.

Parameters:

Name Type Description Default
df DataFrame

The Polars DataFrame whose schema is to be converted.

required

Returns:

Type Description
List[Dict[str, Any]]

List[Dict[str, Any]]: A list of dictionaries, each containing the following keys:
  • "field" (str): The name of the field.
  • "data_type" (str): The data type of the field, converted to lowercase.
  • "nullable" (bool): Always set to True, as Polars does not expose nullability in the schema.
  • "max_length" (None): Always set to None, as max length is not applicable.

has_cardinality

has_cardinality(df: DataFrame, rule: dict) -> pl.DataFrame

Checks if the cardinality (number of unique values) of a specified field in the given DataFrame satisfies a condition defined in the rule. If the cardinality exceeds the specified value, a new column "dq_status" is added to the DataFrame with a string indicating the rule violation. Otherwise, an empty DataFrame is returned.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to evaluate.

required
rule dict

A dictionary containing the rule parameters. It should include: - "field" (str): The column name to check. - "check" (str): The type of check (e.g., "greater_than"). - "value" (int): The threshold value for the cardinality.

required

Returns:

Type Description
DataFrame

pl.DataFrame: The original DataFrame with an added "dq_status" column if the rule is violated, or an empty DataFrame if the rule is not violated.

has_entropy

has_entropy(df: DataFrame, rule: dict) -> pl.DataFrame

Evaluates the entropy of a specified field in a Polars DataFrame based on a given rule.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to evaluate.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field' (str): The column name in the DataFrame to evaluate. - 'check' (str): The type of check to perform (not used directly in this function). - 'value' (float): The threshold value for entropy comparison.

required

Returns:

Type Description
DataFrame

pl.DataFrame:
  • If the entropy of the specified field exceeds the given threshold (value), returns the original DataFrame with an additional column dq_status indicating the rule that was applied.
  • If the entropy does not exceed the threshold, returns an empty DataFrame with the same schema as the input DataFrame.

Notes
  • The entropy is calculated as the number of unique values in the specified field.
  • The dq_status column contains a string in the format "{field}:{check}:{value}".

has_infogain

has_infogain(df: DataFrame, rule: dict) -> pl.DataFrame

Evaluates whether a given DataFrame satisfies an information gain condition based on a specified rule. If the condition is met, a new column indicating the rule is added; otherwise, an empty DataFrame is returned.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to evaluate.

required
rule dict

A dictionary containing the rule parameters. It should include the following keys: - 'field': The column name to evaluate. - 'check': The type of check to perform (not used directly in this function). - 'value': The threshold value for the information gain.

required

Returns:

Type Description
DataFrame

pl.DataFrame: The original DataFrame with an additional column named "dq_status" if the condition is met, or an empty DataFrame if the condition is not met.

has_max

has_max(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the value in a specified column exceeds a given threshold, and adds a new column indicating the rule applied.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field' (str): The column name to apply the filter on. - 'check' (str): The type of check being performed (e.g., "max"). - 'value' (numeric): The threshold value to compare against.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame containing only the rows that satisfy the condition, with an additional column named "dq_status" that describes the applied rule.

has_mean

has_mean(df: DataFrame, rule: dict) -> pl.DataFrame

Checks if the mean value of a specified column in a Polars DataFrame satisfies a given condition.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field' (str): The name of the column to calculate the mean for. - 'check' (str): The condition to check (e.g., 'greater than'). - 'value' (float): The threshold value to compare the mean against.

required

Returns:

Type Description
DataFrame

pl.DataFrame: - If the mean value of the specified column is greater than the threshold value, returns the original DataFrame with an additional column "dq_status" containing a string in the format "{field}:{check}:{value}". - If the condition is not met, returns an empty DataFrame with the same schema as the input.

has_min

has_min(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the value of a specified column is less than a given threshold and adds a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field': The name of the column to apply the filter on. - 'check': A string representing the type of check (e.g., 'min'). - 'value': The threshold value for the filter.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame containing only the rows that satisfy the condition, with an additional column named "dq_status" indicating the applied rule in the format "field:check:value".

has_pattern

has_pattern(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame based on a pattern-matching rule and adds a data quality status column.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field': The column name in the DataFrame to apply the pattern check. - 'check': A descriptive label for the check being performed. - 'pattern': The regex pattern to match against the column values.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame with rows not matching the pattern removed and an additional column named "dq_status" indicating the rule applied in the format "field:check:pattern".

has_std

has_std(df: DataFrame, rule: dict) -> pl.DataFrame

Evaluates whether the standard deviation of a specified column in a Polars DataFrame exceeds a given threshold and returns a modified DataFrame accordingly.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to evaluate.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field' (str): The name of the column to calculate the standard deviation for. - 'check' (str): A descriptive label for the check being performed. - 'value' (float): The threshold value for the standard deviation.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A modified DataFrame. If the standard deviation of the specified column exceeds the threshold, the DataFrame will include a new column dq_status with a descriptive string. Otherwise, an empty DataFrame with the dq_status column is returned.

has_sum

has_sum(df: DataFrame, rule: dict) -> pl.DataFrame

Checks if the sum of a specified column in a Polars DataFrame exceeds a given value.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field': The name of the column to sum. - 'check': A string representing the check type (not used in this function). - 'value': The threshold value to compare the sum against.

required

Returns:

Type Description
DataFrame

pl.DataFrame: If the sum of the specified column exceeds the given value, returns the original DataFrame with an additional column dq_status containing a string in the format "{field}:{check}:{value}". Otherwise, returns an empty DataFrame.

is_between

is_between(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to exclude rows where the specified field's value falls within a given range, and adds a column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field': The name of the column to check. - 'check': The type of check being performed (e.g., "is_between"). - 'value': A string representing the range in the format "[lo,hi]".

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame containing the rows outside the specified range, with an additional column named "dq_status" indicating the rule applied.

Raises:

Type Description
ValueError

If the 'value' parameter is not in the expected format "[lo,hi]".

is_complete

is_complete(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field is not null and appends a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered and modified.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field' (str): The name of the column to check for non-null values. - 'check' (str): A descriptive string for the type of check being performed. - 'value' (str): A value associated with the rule for status annotation.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame with rows filtered based on the rule and an additional column named "dq_status" containing the data quality status.

is_composite_key

is_composite_key(df: DataFrame, rule: dict) -> pl.DataFrame

Determines if the given DataFrame satisfies the composite key condition based on the provided rule.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to evaluate.

required
rule dict

A dictionary defining the rule to check for composite key uniqueness.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A DataFrame indicating whether the composite key condition is met.

is_contained_in

is_contained_in(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to exclude rows where the specified field's value is contained in a given list of values, and adds a new column indicating the rule applied.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field': The column name to check. - 'check': The type of check being performed (e.g., "is_contained_in"). - 'value': A string representation of a list of values to check against, e.g., "[value1, value2, value3]".

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame with rows filtered based on the rule and an additional column "dq_status" indicating the rule applied.

is_date_after

is_date_after(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field is earlier than a given date, and adds a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field' (str): The name of the column containing date strings. - 'check' (str): A descriptive label for the check being performed. - 'date_str' (str): The date string in the format "%Y-%m-%d" to compare against.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame with rows filtered based on the date condition and an additional column named "dq_status" indicating the applied rule.

is_date_before

is_date_before(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field is after a given date, and adds a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field' (str): The name of the column to check. - 'check' (str): A descriptive label for the check being performed. - 'date_str' (str): The date string in the format "%Y-%m-%d" to compare against.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame with rows filtered based on the date condition and an additional column named "dq_status" indicating the applied rule.

is_date_between

is_date_between(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to exclude rows where the specified date field is within a given range.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule dict

A dictionary containing the filtering rule. It should include: - 'field': The name of the column to check. - 'check': A string representing the type of check (e.g., "is_date_between"). - 'value': A string representing the date range in the format "[YYYY-MM-DD,YYYY-MM-DD]".

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame excluding rows where the date in the specified field falls within the given inclusive range, with an additional column "dq_status" indicating the rule applied.

is_equal

is_equal(df: DataFrame, rule: dict) -> pl.DataFrame

Filters rows in a Polars DataFrame that do not match a specified equality condition and adds a column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field': The column name to apply the equality check on. - 'check': The type of check (expected to be 'eq' for equality). - 'value': The value to compare against.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame with rows filtered based on the rule and an additional column named "dq_status" indicating the rule applied.

is_equal_than

is_equal_than(df: DataFrame, rule: dict) -> pl.DataFrame

Filters rows in a Polars DataFrame where the specified field is not equal to a given value and adds a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field': The name of the column to check. - 'check': The type of check (expected to be 'equal' for this function). - 'value': The value to compare against.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame with rows filtered based on the rule and an additional column named "dq_status" indicating the applied rule.

is_future_date

is_future_date(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field contains a future date, based on the current date.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule dict

A dictionary containing the rule parameters. It is expected to include the field name to check, the check type, and additional parameters (ignored in this function).

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame containing only rows where the specified date field is in the future. An additional column "dq_status" is added to indicate the field, check type, and today's date in the format "field:check:today".

is_greater_or_equal_than

is_greater_or_equal_than(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field is greater than or equal to a given value, and adds a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule dict

A dictionary containing the filtering rule. It should include the following keys: - 'field': The name of the column to be checked. - 'check': The type of check being performed (e.g., "greater_or_equal"). - 'value': The threshold value for the comparison.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame with rows filtered based on the specified rule and an additional column named "dq_status" indicating the data quality status in the format "field:check:value".

is_greater_than

is_greater_than(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field's value is less than or equal to a given value, and adds a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The Polars DataFrame to filter.

required
rule dict

A dictionary containing the filtering rule. It should include: - 'field': The name of the column to apply the filter on. - 'check': A string describing the check (e.g., "greater_than"). - 'value': The value to compare against.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame with rows filtered based on the rule and an additional column named "dq_status" indicating the applied rule.

is_in

is_in(df: DataFrame, rule: dict) -> pl.DataFrame

Checks if the rows in the given DataFrame satisfy the conditions specified in the rule.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to evaluate.

required
rule dict

A dictionary specifying the conditions to check against the DataFrame.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A DataFrame containing rows that satisfy the specified conditions.

is_in_billions

is_in_billions(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field's value is less than one billion and adds a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include: - field (str): The name of the column to check. - check (str): The type of check being performed (e.g., "less_than"). - value (any): The value associated with the rule (not used in this function).

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame with rows filtered based on the rule and an additional column named "dq_status" containing a string in the format "{field}:{check}:{value}".

is_in_millions

is_in_millions(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field's value is less than one million and adds a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field': The name of the column to check. - 'check': A string describing the check being performed. - 'value': A value associated with the rule (used for status annotation).

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame with rows filtered based on the rule and an additional column named "dq_status" containing the data quality status.

is_legit

is_legit(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame based on a validation rule and appends a data quality status column.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to validate.

required
rule dict

A dictionary containing the validation rule. It should include: - 'field': The name of the column to validate. - 'check': The type of validation check (e.g., regex, condition). - 'value': The value or pattern to validate against.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame containing rows that failed the validation, with an additional column 'dq_status' indicating the validation rule applied.

is_less_or_equal_than

is_less_or_equal_than(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field's value is greater than the given value, and adds a new column indicating the rule applied.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field': The name of the column to apply the filter on. - 'check': The type of check being performed (e.g., 'less_or_equal_than'). - 'value': The value to compare against.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame with rows filtered based on the rule and an additional column named "dq_status" indicating the rule applied.

is_less_than

is_less_than(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field is greater than or equal to a given value. Adds a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule dict

A dictionary containing the filtering rule. It should include the following keys: - 'field': The name of the column to apply the filter on. - 'check': A string representing the type of check (not used in logic). - 'value': The threshold value for the filter.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame with rows filtered based on the condition and an additional column named "dq_status" containing the rule description in the format "field:check:value".

is_negative

is_negative(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to exclude rows where the specified field is negative and adds a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include:
  • 'field': The name of the column to check.
  • 'check': The type of check being performed (e.g., "is_negative").
  • 'value': The value associated with the rule (not used in this function).

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame with rows where the specified field is non-negative and an additional column named "dq_status" containing the rule details.

is_on_friday

is_on_friday(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the date corresponds to a Friday.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame containing the data to filter.

required
rule dict

A dictionary containing filtering rules or parameters.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame containing only the rows where the date is a Friday.

is_on_monday

is_on_monday(df: DataFrame, rule: dict) -> pl.DataFrame

Filters the given DataFrame to include only rows where the date corresponds to a Monday.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to filter.

required
rule dict

A dictionary containing rules or parameters for filtering.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame containing only the rows where the date is a Monday.

is_on_saturday

is_on_saturday(df: DataFrame, rule: dict) -> pl.DataFrame

Determines if the dates in the given DataFrame fall on a Saturday.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing date information.

required
rule dict

A dictionary containing rules or parameters for the operation.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A DataFrame with the result of the operation, indicating whether each date is on a Saturday.

is_on_sunday

is_on_sunday(df: DataFrame, rule: dict) -> pl.DataFrame

Filters the given DataFrame to include only rows where the date corresponds to Sunday.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing date-related data.

required
rule dict

A dictionary containing rules or parameters for filtering.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A filtered DataFrame containing only rows where the date is a Sunday.

is_on_thursday

is_on_thursday(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the date corresponds to a Thursday.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame containing the data to filter.

required
rule dict

A dictionary containing filtering rules or parameters.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame containing only the rows where the date is a Thursday.

is_on_tuesday

is_on_tuesday(df: DataFrame, rule: dict) -> pl.DataFrame

Filters the given DataFrame to include only rows where the day of the week matches Tuesday.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to filter.

required
rule dict

A dictionary containing rules or parameters for filtering.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame containing only rows where the day of the week is Tuesday.

is_on_wednesday

is_on_wednesday(df: DataFrame, rule: dict) -> pl.DataFrame

Filters the given DataFrame to include only rows where the day of the week matches Wednesday.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to filter.

required
rule dict

A dictionary containing rules or parameters for filtering.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A filtered DataFrame containing only rows corresponding to Wednesday.

is_on_weekday

is_on_weekday(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field falls on a weekday (Monday to Friday). Adds a new column indicating the rule applied.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame.

required
rule dict

A dictionary containing the rule parameters. It is expected to have keys that can be extracted using the __extract_params function.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame filtered to include only rows where the date field falls on a weekday, with an additional column named "dq_status" indicating the applied rule in the format "field:check:value".

is_on_weekend

is_on_weekend(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field falls on a weekend (Saturday or Sunday). Adds a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame.

required
rule dict

A dictionary containing the rule parameters. It is expected to include the following keys:
  • 'field': The name of the column containing date strings.
  • 'check': A string representing the type of check being performed.
  • 'value': A value associated with the rule (not used in the logic).

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame filtered to include only rows where the specified date field falls on a weekend. The resulting DataFrame also includes an additional column named "dq_status" with a string indicating the rule applied.

is_past_date

is_past_date(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field contains a date earlier than today. Adds a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule dict

A dictionary containing the rule parameters. It is expected to include the field name to check, a check identifier, and additional parameters.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame containing only rows where the specified date field is in the past, with an additional column named "dq_status" that contains a string in the format "{field}:{check}:{today}".

is_positive

is_positive(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to identify rows where the specified field contains negative values and appends a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It is expected to include the following keys:
  • 'field': The name of the column to check.
  • 'check': The type of check being performed (e.g., "is_positive").
  • 'value': The reference value for the check.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame containing only the rows where the specified field has negative values, with an additional column named "dq_status" that describes the rule applied.

is_primary_key

is_primary_key(df: DataFrame, rule: dict) -> pl.DataFrame

Checks if the specified rule identifies a primary key in the given DataFrame.

A primary key is a set of columns in a DataFrame that uniquely identifies each row. This function delegates the check to the is_unique function.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to check for primary key uniqueness.

required
rule dict

A dictionary specifying the rule or criteria to determine the primary key.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A DataFrame indicating whether the rule satisfies the primary key condition.

is_t_minus_1

is_t_minus_1(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field matches the date of "yesterday" (T-1) and appends a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule dict

A dictionary containing the rule parameters. It is expected to include the following keys:
  • 'field': The name of the column to check.
  • 'check': A string representing the type of check (used for metadata).
  • 'value': A value associated with the check (used for metadata).

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame filtered to include only rows where the specified field matches the date of yesterday (T-1). The resulting DataFrame also includes an additional column named "dq_status" that contains metadata about the rule applied.
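
The T-1, T-2 and T-3 checks differ only in how many days are subtracted from the current date. A small library-independent sketch of that arithmetic follows; `t_minus_iso` is a hypothetical helper, and comparing against an ISO-formatted string is an assumption about how the date column is stored.

```python
from datetime import date, timedelta
from typing import Optional


def t_minus_iso(days: int, today: Optional[date] = None) -> str:
    """Return the ISO date string for `days` days before `today`."""
    # Passing `today` explicitly keeps the helper deterministic in tests.
    reference = today if today is not None else date.today()
    return (reference - timedelta(days=days)).isoformat()


# Fixed reference dates so the output does not depend on when this runs.
print(t_minus_iso(1, date(2024, 3, 1)))
print(t_minus_iso(3, date(2024, 1, 2)))
```

The first call crosses a leap day and the second crosses a year boundary, which are the cases where hand-written date arithmetic tends to go wrong.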

is_t_minus_2

is_t_minus_2(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field matches the date two days prior to the current date. Adds a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule dict

A dictionary containing the rule parameters. It is expected to include the following keys:
  • 'field': The name of the date field to check.
  • 'check': A string representing the type of check (not used in filtering).
  • 'value': A value associated with the rule (not used in filtering).

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame filtered to include only rows where the specified date field matches the date two days ago. The resulting DataFrame includes an additional column named "dq_status" with a string indicating the rule applied.

is_t_minus_3

is_t_minus_3(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field matches the date three days prior to the current date. Additionally, adds a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule dict

A dictionary containing the rule parameters. It should include:
  • 'field': The name of the date column to check.
  • 'check': A string representing the type of check (used for status annotation).
  • 'value': A value associated with the rule (used for status annotation).

required

Returns:

Type Description
DataFrame

pl.DataFrame: A filtered Polars DataFrame with an additional column named "dq_status" that contains a string in the format "{field}:{check}:{value}".

is_today

is_today(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field matches today's date. Additionally, adds a new column "dq_status" with a formatted string indicating the rule applied.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule dict

A dictionary containing the rule parameters. It is expected to have the following keys:
  • field (str): The name of the column to check.
  • check (str): A descriptive string for the type of check (used in the "dq_status" column).
  • value (str): A value associated with the rule (used in the "dq_status" column).

required

Returns:

Type Description
DataFrame

pl.DataFrame: A filtered Polars DataFrame with rows matching today's date in the specified field and an additional "dq_status" column describing the rule applied.

Raises:

Type Description
ValueError

If the rule dictionary does not contain the required keys or if the date parsing fails.

is_unique

is_unique(df: DataFrame, rule: dict) -> pl.DataFrame

Checks for duplicate values in a specified field of a Polars DataFrame and returns a filtered DataFrame containing only the rows with duplicate values. Additionally, it adds a new column 'dq_status' with a formatted string indicating the field, check type, and value.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to check for duplicates.

required
rule dict

A dictionary containing the rule parameters. It is expected to have keys that allow extraction of the field to check, the type of check, and a value.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A filtered DataFrame containing rows with duplicate values in the specified field, along with an additional column 'dq_status' describing the rule applied.

not_contained_in

not_contained_in(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field's value is in a given list, and adds a new column indicating the data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule dict

A dictionary containing the filtering rule. It should include:
  • 'field': The column name to apply the filter on.
  • 'check': A string representing the type of check (not used in logic).
  • 'value': A string representation of a list of values (e.g., "[value1, value2]").

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new Polars DataFrame with rows filtered based on the rule and an additional column "dq_status" indicating the applied rule.
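
Because the rule's 'value' arrives as a string such as "[value1, value2]", it has to be turned into a real list before any membership test. The sketch below shows one plausible way to do that in plain Python. `parse_value_list` is a hypothetical helper, and the exact parsing rules the library applies (quoting, whitespace, type conversion) are not documented here, so treat these as assumptions.

```python
from typing import List


def parse_value_list(raw: str) -> List[str]:
    """Turn a string like "[a, b]" into ["a", "b"] (assumed parsing rules)."""
    inner = raw.strip().strip("[]")
    # Drop surrounding whitespace and optional quotes around each item.
    return [item.strip().strip("'\"") for item in inner.split(",") if item.strip()]


forbidden = parse_value_list("[value1, value2]")
values = ["value1", "value3", "value2"]
# Rows whose value appears in the forbidden list are the ones flagged.
flagged = [v for v in values if v in forbidden]
print(forbidden, flagged)
```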

not_in

not_in(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame by excluding rows where the specified rule applies.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to filter.

required
rule dict

A dictionary specifying the filtering rule. The structure and expected keys of this dictionary depend on the implementation of the not_contained_in function.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame with rows excluded based on the given rule.

satisfies

satisfies(df: DataFrame, rule: dict) -> pl.DataFrame

Evaluates a given rule against a Polars DataFrame and returns rows that do not satisfy the rule.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to be evaluated.

required
rule dict

A dictionary containing the rule to be applied. The rule should include the following keys:
  • 'field': The column name in the DataFrame to be checked.
  • 'check': The type of check or condition to be applied.
  • 'value': The value or expression to validate against.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A DataFrame containing rows that do not satisfy the rule, with an additional column dq_status indicating the rule that was violated in the format "field:check:value".

Example

rule = {"field": "age", "check": ">", "value": "18"}
result = satisfies(df, rule)

summarize

summarize(qc_df: DataFrame, rules: list[dict], total_rows: int) -> pl.DataFrame

Summarizes quality check results by processing a DataFrame containing data quality statuses and comparing them against defined rules.

Parameters:

Name Type Description Default
qc_df DataFrame

A Polars DataFrame containing a column dq_status with semicolon-separated strings representing data quality statuses in the format "column:rule:value".

required
rules list[dict]

A list of dictionaries where each dictionary defines a rule with keys such as "column", "rule", "value", and "pass_threshold".

required
total_rows int

The total number of rows in the original dataset, used to calculate the pass rate.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A summarized DataFrame containing the following columns:
  • id: A unique identifier for each rule.
  • timestamp: The timestamp when the summary was generated.
  • check: A label indicating the type of check (e.g., "Quality Check").
  • level: The severity level of the check (e.g., "WARNING").
  • column: The column name associated with the rule.
  • rule: The rule being evaluated.
  • value: The specific value associated with the rule.
  • rows: The total number of rows in the dataset.
  • violations: The number of rows that violated the rule.
  • pass_rate: The proportion of rows that passed the rule.
  • pass_threshold: The threshold for passing the rule.
  • status: The status of the rule evaluation ("PASS" or "FAIL").
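
The arithmetic behind the violations, pass_rate and status columns can be sketched without Polars. The helpers below are hypothetical, and two details are assumptions rather than documented behavior: that pass_rate is computed as (rows - violations) / rows, and that a rule passes when pass_rate is greater than or equal to pass_threshold.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_violations(statuses: Iterable[str]) -> Counter:
    """Count each "column:rule:value" entry in semicolon-separated statuses."""
    counts: Counter = Counter()
    for status in statuses:
        for entry in (status or "").split(";"):
            entry = entry.strip()
            if entry:
                counts[entry] += 1
    return counts


def evaluate(violations: int, total_rows: int, pass_threshold: float) -> Tuple[float, str]:
    """Return (pass_rate, status) under the assumptions stated above."""
    pass_rate = (total_rows - violations) / total_rows if total_rows else 1.0
    return pass_rate, "PASS" if pass_rate >= pass_threshold else "FAIL"


statuses = ["age:is_positive:", "age:is_positive:;name:is_complete:", ""]
counts = count_violations(statuses)
pass_rate, status = evaluate(counts["age:is_positive:"], total_rows=10, pass_threshold=0.9)
print(counts, pass_rate, status)
```

With two violations in ten rows the pass rate is 0.8, which falls below the 0.9 threshold in this example.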

validate

validate(df: DataFrame, rules: list[dict]) -> Tuple[pl.DataFrame, pl.DataFrame]

Validates a Polars DataFrame against a set of rules and returns the updated DataFrame with validation statuses and a DataFrame containing the validation violations.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to validate.

required
rules list[dict]

A list of dictionaries representing validation rules. Each rule should contain the following keys:
  • "check_type" (str): The type of validation to perform (e.g., "is_primary_key", "is_composite_key", "has_pattern", etc.).
  • "value" (optional): The value to validate against, depending on the rule type.
  • "execute" (bool, optional): Whether to execute the rule. Defaults to True.

required

Returns:

Type Description
Tuple[DataFrame, DataFrame]

Tuple[pl.DataFrame, pl.DataFrame]: A tuple containing:
  • The original DataFrame with an additional "dq_status" column indicating the validation status for each row.
  • A DataFrame containing rows that violated the validation rules, including details of the violations.

Notes
  • The function dynamically resolves validation functions based on the "check_type" specified in the rules.
  • If a rule's "check_type" is unknown, a warning is issued, and the rule is skipped.
  • The "__id" column is temporarily added to the DataFrame for internal processing and is removed in the final output.
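
The dispatch behavior in the notes above (resolve a function from "check_type", warn and skip when it is unknown, honor "execute") can be illustrated with a small library-independent sketch. Everything named here is hypothetical: the `CHECKS` registry, the `flag_negative` helper, the use of plain lists of dicts instead of DataFrames, and the "field" key on each rule.

```python
import warnings


def flag_negative(rows, rule):
    """Hypothetical check: return rows whose field is negative."""
    return [row for row in rows if row[rule["field"]] < 0]


# Hypothetical registry mapping check_type names to check functions.
CHECKS = {"is_positive": flag_negative}


def run_rules(rows, rules, checks=CHECKS):
    violations = []
    for rule in rules:
        if not rule.get("execute", True):
            continue  # rule is switched off
        func = checks.get(rule["check_type"])
        if func is None:
            # Unknown check types are reported and skipped, not fatal.
            warnings.warn(f"Unknown rule: {rule['check_type']}")
            continue
        violations.extend(func(rows, rule))
    return violations


rows = [{"amount": 5}, {"amount": -2}]
rules = [
    {"field": "amount", "check_type": "is_positive"},
    {"field": "amount", "check_type": "no_such_check"},
    {"field": "amount", "check_type": "is_positive", "execute": False},
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    violations = run_rules(rows, rules)

print(violations, len(caught))
```

Only the first rule contributes a violation: the second triggers a warning and the third is skipped because "execute" is False.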

validate_date_format

validate_date_format(df: DataFrame, rule: dict) -> pl.DataFrame

Validates the date format of a specified field in a Polars DataFrame based on a given rule.

Parameters:

Name Type Description Default
df DataFrame

The input Polars DataFrame to validate.

required
rule dict

A dictionary containing the validation rule. It should include:
  • field (str): The name of the column to validate.
  • check (str): The name of the validation check.
  • fmt (str): The expected date format to validate against.

required

Returns:

Type Description
DataFrame

pl.DataFrame: A new DataFrame containing only the rows where the specified field does not match the expected date format or is null. An additional column "dq_status" is added to indicate the validation status in the format "{field}:{check}:{fmt}".

validate_schema

validate_schema(df, expected) -> tuple[bool, list[dict[str, Any]]]

Validates the schema of a given DataFrame against an expected schema.

Parameters:

Name Type Description Default
df

The DataFrame whose schema needs to be validated.

required
expected

The expected schema, represented as a list of tuples where each tuple contains the column name and its data type.

required

Returns:

Type Description
tuple[bool, list[dict[str, Any]]]

Tuple[bool, List[Tuple[str, str]]]: A tuple containing:
  • A boolean indicating whether the schema matches the expected schema.
  • A list of tuples representing the errors, where each tuple contains the column name and a description of the mismatch.
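
A library-independent sketch of the comparison described above. Note that the signature shows the error list as `list[dict[str, Any]]` while the description speaks of tuples; this sketch follows the description. `compare_schema`, the error messages, and the use of plain type-name strings are all assumptions made for illustration.

```python
from typing import List, Tuple

Schema = List[Tuple[str, str]]


def compare_schema(actual: Schema, expected: Schema) -> Tuple[bool, List[Tuple[str, str]]]:
    """Compare (column, dtype) pairs and collect one error per mismatch."""
    actual_types = dict(actual)
    errors: List[Tuple[str, str]] = []
    for name, dtype in expected:
        if name not in actual_types:
            errors.append((name, "missing column"))
        elif actual_types[name] != dtype:
            errors.append((name, f"expected {dtype}, found {actual_types[name]}"))
    # The schema matches only when no errors were collected.
    return (not errors, errors)


actual = [("id", "Int64"), ("name", "Utf8")]
expected = [("id", "Int64"), ("name", "Int64"), ("email", "Utf8")]
ok, errors = compare_schema(actual, expected)
print(ok, errors)
```

This version ignores extra columns present in the DataFrame but absent from the expected schema; whether the library reports those is not stated here.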