polars¶

sumeh.engines.polars_engine ¶

This module provides a set of data quality validation functions using the Polars library. It includes various checks for data validation, such as completeness, uniqueness, range checks, pattern matching, and schema validation.

Functions:

Name	Description
`is_positive`	Filters rows where the specified field is less than zero.
`is_negative`	Filters rows where the specified field is greater than or equal to zero.
`is_complete`	Filters rows where the specified field is null.
`is_unique`	Filters rows with duplicate values in the specified field.
`are_complete`	Filters rows where any of the specified fields are null.
`are_unique`	Filters rows with duplicate combinations of the specified fields.
`is_greater_than`	Filters rows where the specified field is less than or equal to the given value.
`is_greater_or_equal_than`	Filters rows where the specified field is less than the given value.
`is_less_than`	Filters rows where the specified field is greater than or equal to the given value.
`is_less_or_equal_than`	Filters rows where the specified field is greater than the given value.
`is_equal`	Filters rows where the specified field is not equal to the given value.
`is_equal_than`	Alias for `is_equal`.
`is_in_millions`	Retains rows where the field value is less than 1,000,000 and flags them with dq_status.
`is_in_billions`	Retains rows where the field value is less than 1,000,000,000 and flags them with dq_status.
`is_t_minus_1`	Retains rows where the date field does not equal yesterday (T-1) and flags them with dq_status.
`is_t_minus_2`	Retains rows where the date field does not equal two days ago (T-2) and flags them with dq_status.
`is_t_minus_3`	Retains rows where the date field does not equal three days ago (T-3) and flags them with dq_status.
`is_today`	Retains rows where the date field does not equal today and flags them with dq_status.
`is_yesterday`	Retains rows where the date field does not equal yesterday and flags them with dq_status.
`is_on_weekday`	Retains rows where the date field does not fall on a weekday (Mon-Fri) and flags them with dq_status.
`is_on_weekend`	Retains rows where the date field is not on a weekend (Sat-Sun) and flags them with dq_status.
`is_on_monday`	Retains rows where the date field is not on Monday and flags them with dq_status.
`is_on_tuesday`	Retains rows where the date field is not on Tuesday and flags them with dq_status.
`is_on_wednesday`	Retains rows where the date field is not on Wednesday and flags them with dq_status.
`is_on_thursday`	Retains rows where the date field is not on Thursday and flags them with dq_status.
`is_on_friday`	Retains rows where the date field is not on Friday and flags them with dq_status.
`is_on_saturday`	Retains rows where the date field is not on Saturday and flags them with dq_status.
`is_on_sunday`	Retains rows where the date field is not on Sunday and flags them with dq_status.
`is_contained_in`	Filters rows where the specified field is not in the given list of values.
`not_contained_in`	Filters rows where the specified field is in the given list of values.
`is_between`	Filters rows where the specified field is not within the given range.
`has_pattern`	Filters rows where the specified field does not match the given regex pattern.
`is_legit`	Filters rows where the specified field is null or contains whitespace.
`has_max`	Filters rows where the specified field exceeds the given maximum value.
`has_min`	Filters rows where the specified field is below the given minimum value.
`has_std`	Checks if the standard deviation of the specified field exceeds the given value.
`has_mean`	Checks if the mean of the specified field exceeds the given value.
`has_sum`	Checks if the sum of the specified field exceeds the given value.
`has_cardinality`	Checks if the cardinality (number of unique values) of the specified field exceeds the given value.
`has_infogain`	Placeholder for information gain validation (currently uses cardinality).
`has_entropy`	Placeholder for entropy validation (currently uses cardinality).
`satisfies`	Filters rows that do not satisfy the given SQL condition.
`validate_date_format`	Filters rows where the specified field does not match the expected date format or is null.
`is_future_date`	Filters rows where the specified date field is after today.
`is_past_date`	Filters rows where the specified date field is before today.
`is_date_between`	Filters rows where the specified date field is not within the given [start,end] range.
`is_date_after`	Filters rows where the specified date field is before the given date.
`is_date_before`	Filters rows where the specified date field is after the given date.
`all_date_checks`	Alias for `is_past_date` (checks date against today).
`validate`	Validates a DataFrame against a list of rules and returns the original DataFrame with data quality status and a DataFrame of violations.
`__build_rules_df`	Converts a list of rules into a Polars DataFrame for summarization.
`summarize`	Summarizes the results of data quality checks, including pass rates and statuses.
`__polars_schema_to_list`	Converts a Polars DataFrame schema into a list of dictionaries.
`validate_schema`	Validates the schema of a DataFrame against an expected schema and returns a boolean result and a list of errors.

__build_rules_df ¶

__build_rules_df(rules: list[dict]) -> pl.DataFrame

Builds a Polars DataFrame from a list of rule dictionaries.

This function processes a list of rule dictionaries, filters out rules that are not marked for execution, and constructs a DataFrame with the relevant rule information. It ensures uniqueness of rows based on specific columns and casts the data to appropriate types.

Parameters:

Name	Type	Description	Default
`rules`	`list[dict]`	A list of dictionaries, where each dictionary represents a rule. Each rule dictionary may contain the following keys: - "field" (str or list): The column(s) the rule applies to. - "check_type" (str): The type of rule or check. - "threshold" (float, optional): The pass threshold for the rule. Defaults to 1.0. - "value" (any, optional): Additional value associated with the rule. - "execute" (bool, optional): Whether the rule should be executed. Defaults to True.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A Polars DataFrame containing the processed rules with the following columns: - "column" (str): The column(s) the rule applies to, joined by commas if multiple. - "rule" (str): The type of rule or check. - "pass_threshold" (float): The pass threshold for the rule. - "value" (str): The value associated with the rule, or an empty string if not provided.

all_date_checks ¶

all_date_checks(df: DataFrame, rule: dict) -> pl.DataFrame

Runs the date validation for the given rule. Per the function summary above, this is an alias for `is_past_date`, which checks the date field against today.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input DataFrame to validate.	required
`rule`	`dict`	A dictionary containing the validation rules to apply.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: The DataFrame after applying the date validation checks.

are_complete ¶

are_complete(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to identify rows where specified fields contain null values and tags them with a data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to be checked.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'fields': A list of column names to check for null values. - 'check': A string representing the type of check (e.g., "is_null"). - 'value': A value associated with the check (not used in this function).	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A filtered DataFrame containing only rows where at least one of the specified fields is null, with an additional column "dq_status" indicating the data quality status.

are_unique ¶

are_unique(df: DataFrame, rule: dict) -> pl.DataFrame

Checks for duplicate combinations of specified fields in a Polars DataFrame and returns a DataFrame containing the rows with duplicates along with a data quality status column.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to check for duplicates.	required
`rule`	`dict`	A dictionary containing the rule parameters. It is expected to include the following keys: - 'fields': A list of column names to check for uniqueness. - 'check': A string representing the type of check (e.g., "unique"). - 'value': A value associated with the check (e.g., "True").	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A DataFrame containing rows with duplicate combinations of the specified fields. An additional column, "dq_status", is added to indicate the data quality status in the format "{fields}:{check}:{value}".

extract_schema ¶

extract_schema(df) -> List[Dict[str, Any]]

Converts the schema of a Polars DataFrame into a list of dictionaries, where each dictionary represents a field in the schema.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The Polars DataFrame whose schema is to be converted.	required

Returns:

Type	Description
`List[Dict[str, Any]]`	List[Dict[str, Any]]: A list of dictionaries, each containing the following keys: - "field" (str): The name of the field. - "data_type" (str): The data type of the field, converted to lowercase. - "nullable" (bool): Always set to True, as Polars does not expose nullability in the schema. - "max_length" (None): Always set to None, as max length is not applicable.

has_cardinality ¶

has_cardinality(df: DataFrame, rule: dict) -> pl.DataFrame

Checks if the cardinality (number of unique values) of a specified field in the given DataFrame satisfies a condition defined in the rule. If the cardinality exceeds the specified value, a new column "dq_status" is added to the DataFrame with a string indicating the rule violation. Otherwise, an empty DataFrame is returned.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input DataFrame to evaluate.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - "field" (str): The column name to check. - "check" (str): The type of check (e.g., "greater_than"). - "value" (int): The threshold value for the cardinality.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: The original DataFrame with an added "dq_status" column if the rule is violated, or an empty DataFrame if the rule is not violated.

has_entropy ¶

has_entropy(df: DataFrame, rule: dict) -> pl.DataFrame

Evaluates the entropy of a specified field in a Polars DataFrame based on a given rule.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to evaluate.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field' (str): The column name in the DataFrame to evaluate. - 'check' (str): The type of check to perform (not used directly in this function). - 'value' (float): The threshold value for entropy comparison.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: - If the entropy of the specified field exceeds the given threshold (`value`), returns the original DataFrame with an additional column `dq_status` indicating the rule that was applied. - If the entropy does not exceed the threshold, returns an empty DataFrame with the same schema as the input DataFrame.

Notes

Entropy is not actually computed yet: as a placeholder, the function uses the number of unique values in the specified field (its cardinality).
The dq_status column contains a string in the format "{field}:{check}:{value}".

has_infogain ¶

has_infogain(df: DataFrame, rule: dict) -> pl.DataFrame

Evaluates whether a given DataFrame satisfies an information gain condition based on a specified rule. If the condition is met, a new column indicating the rule is added; otherwise, an empty DataFrame is returned. Per the function summary, this is currently a placeholder that uses the field's cardinality.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input DataFrame to evaluate.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include the following keys: - 'field': The column name to evaluate. - 'check': The type of check to perform (not used directly in this function). - 'value': The threshold value for the information gain.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: The original DataFrame with an additional column named "dq_status" if the condition is met, or an empty DataFrame if the condition is not met.

has_max ¶

has_max(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the value in a specified column exceeds a given threshold, and adds a new column indicating the rule applied.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to be filtered.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field' (str): The column name to apply the filter on. - 'check' (str): The type of check being performed (e.g., "max"). - 'value' (numeric): The threshold value to compare against.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame containing only the rows that satisfy the condition, with an additional column named "dq_status" that describes the applied rule.

has_mean ¶

has_mean(df: DataFrame, rule: dict) -> pl.DataFrame

Checks if the mean value of a specified column in a Polars DataFrame satisfies a given condition.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field' (str): The name of the column to calculate the mean for. - 'check' (str): The condition to check (e.g., 'greater than'). - 'value' (float): The threshold value to compare the mean against.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: - If the mean value of the specified column is greater than the threshold value, returns the original DataFrame with an additional column "dq_status" containing a string in the format "{field}:{check}:{value}". - If the condition is not met, returns an empty DataFrame with the same schema as the input.

has_min ¶

has_min(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the value of a specified column is less than a given threshold and adds a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to be filtered.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field': The name of the column to apply the filter on. - 'check': A string representing the type of check (e.g., 'min'). - 'value': The threshold value for the filter.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame containing only the rows that satisfy the condition, with an additional column named "dq_status" indicating the applied rule in the format "field:check:value".

has_pattern ¶

has_pattern(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame based on a pattern-matching rule and adds a data quality status column.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to be filtered.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field': The column name in the DataFrame to apply the pattern check. - 'check': A descriptive label for the check being performed. - 'pattern': The regex pattern to match against the column values.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame with rows not matching the pattern removed and an additional column named "dq_status" indicating the rule applied in the format "field:check:pattern".

has_std ¶

has_std(df: DataFrame, rule: dict) -> pl.DataFrame

Evaluates whether the standard deviation of a specified column in a Polars DataFrame exceeds a given threshold and returns a modified DataFrame accordingly.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to evaluate.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field' (str): The name of the column to calculate the standard deviation for. - 'check' (str): A descriptive label for the check being performed. - 'value' (float): The threshold value for the standard deviation.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A modified DataFrame. If the standard deviation of the specified column exceeds the threshold, the DataFrame will include a new column `dq_status` with a descriptive string. Otherwise, an empty DataFrame with the `dq_status` column is returned.

has_sum ¶

has_sum(df: DataFrame, rule: dict) -> pl.DataFrame

Checks if the sum of a specified column in a Polars DataFrame exceeds a given value.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field': The name of the column to sum. - 'check': A string representing the check type (not used in this function). - 'value': The threshold value to compare the sum against.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: If the sum of the specified column exceeds the given value, returns the original DataFrame with an additional column `dq_status` containing a string in the format "{field}:{check}:{value}". Otherwise, returns an empty DataFrame.

is_between ¶

is_between(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to exclude rows where the specified field's value falls within a given range, and adds a column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to filter.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field': The name of the column to check. - 'check': The type of check being performed (e.g., "is_between"). - 'value': A string representing the range in the format "[lo,hi]".	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame containing the rows outside the specified range, with an additional column named "dq_status" indicating the rule applied.

Raises:

Type	Description
`ValueError`	If the 'value' parameter is not in the expected format "[lo,hi]".

is_complete ¶

is_complete(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field is not null and appends a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to be filtered and modified.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field' (str): The name of the column to check for non-null values. - 'check' (str): A descriptive string for the type of check being performed. - 'value' (str): A value associated with the rule for status annotation.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame with rows filtered based on the rule and an additional column named "dq_status" containing the data quality status.

is_composite_key ¶

is_composite_key(df: DataFrame, rule: dict) -> pl.DataFrame

Determines if the given DataFrame satisfies the composite key condition based on the provided rule.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input DataFrame to evaluate.	required
`rule`	`dict`	A dictionary defining the rule to check for composite key uniqueness.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A DataFrame indicating whether the composite key condition is met.

is_contained_in ¶

is_contained_in(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to exclude rows where the specified field's value is contained in a given list of values, and adds a new column indicating the rule applied.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to be filtered.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field': The column name to check. - 'check': The type of check being performed (e.g., "is_contained_in"). - 'value': A string representation of a list of values to check against, e.g., "[value1, value2, value3]".	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame with rows filtered based on the rule and an additional column "dq_status" indicating the rule applied.

is_date_after ¶

is_date_after(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field is earlier than a given date, and adds a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to filter.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field' (str): The name of the column containing date strings. - 'check' (str): A descriptive label for the check being performed. - 'date_str' (str): The date string in the format "%Y-%m-%d" to compare against.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame with rows filtered based on the date condition and an additional column named "dq_status" indicating the applied rule.

is_date_before ¶

is_date_before(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field is after a given date, and adds a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to filter.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field' (str): The name of the column to check. - 'check' (str): A descriptive label for the check being performed. - 'date_str' (str): The date string in the format "%Y-%m-%d" to compare against.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame with rows filtered based on the date condition and an additional column named "dq_status" indicating the applied rule.

is_date_between ¶

is_date_between(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to exclude rows where the specified date field is within a given range.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to filter.	required
`rule`	`dict`	A dictionary containing the filtering rule. It should include: - 'field': The name of the column to check. - 'check': A string representing the type of check (e.g., "is_date_between"). - 'value': A string representing the date range in the format "[YYYY-MM-DD,YYYY-MM-DD]".	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame excluding rows where the date in the specified field falls within the given inclusive range, with an additional column "dq_status" indicating the rule applied.

is_equal ¶

is_equal(df: DataFrame, rule: dict) -> pl.DataFrame

Filters rows in a Polars DataFrame that do not match a specified equality condition and adds a column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to be filtered.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field': The column name to apply the equality check on. - 'check': The type of check (expected to be 'eq' for equality). - 'value': The value to compare against.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame with rows filtered based on the rule and an additional column named "dq_status" indicating the rule applied.

is_equal_than ¶

is_equal_than(df: DataFrame, rule: dict) -> pl.DataFrame

Filters rows in a Polars DataFrame where the specified field is not equal to a given value and adds a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to be filtered.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field': The name of the column to check. - 'check': The type of check (expected to be 'equal' for this function). - 'value': The value to compare against.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame with rows filtered based on the rule and an additional column named "dq_status" indicating the applied rule.

is_future_date ¶

is_future_date(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field contains a future date, based on the current date.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to filter.	required
`rule`	`dict`	A dictionary containing the rule parameters. It is expected to include the field name to check, the check type, and additional parameters (ignored in this function).	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame containing only rows where the specified date field is in the future. An additional column "dq_status" is added to indicate the field, check type, and today's date in the format "field:check:today".

is_greater_or_equal_than ¶

is_greater_or_equal_than(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field is greater than or equal to a given value, and adds a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to be filtered.	required
`rule`	`dict`	A dictionary containing the filtering rule. It should include the following keys: - 'field': The name of the column to be checked. - 'check': The type of check being performed (e.g., "greater_or_equal"). - 'value': The threshold value for the comparison.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame with rows filtered based on the specified rule and an additional column named "dq_status" indicating the data quality status in the format "field:check:value".

is_greater_than ¶

is_greater_than(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field's value is less than or equal to a given value, and adds a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The Polars DataFrame to filter.	required
`rule`	`dict`	A dictionary containing the filtering rule. It should include: - 'field': The name of the column to apply the filter on. - 'check': A string describing the check (e.g., "greater_than"). - 'value': The value to compare against.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame with rows filtered based on the rule and an additional column named "dq_status" indicating the applied rule.

is_in ¶

is_in(df: DataFrame, rule: dict) -> pl.DataFrame

Checks if the rows in the given DataFrame satisfy the conditions specified in the rule.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input DataFrame to evaluate.	required
`rule`	`dict`	A dictionary specifying the conditions to check against the DataFrame.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A DataFrame containing rows that satisfy the specified conditions.

is_in_billions ¶

is_in_billions(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field's value is less than one billion and adds a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to be filtered.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - field (str): The name of the column to check. - check (str): The type of check being performed (e.g., "less_than"). - value (any): The value associated with the rule (not used in this function).	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame with rows filtered based on the rule and an additional column named "dq_status" containing a string in the format "{field}:{check}:{value}".

is_in_millions ¶

is_in_millions(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field's value is less than one million and adds a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to be filtered.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field': The name of the column to check. - 'check': A string describing the check being performed. - 'value': A value associated with the rule (used for status annotation).	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame with rows filtered based on the rule and an additional column named "dq_status" containing the data quality status.

is_legit ¶

is_legit(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame based on a validation rule and appends a data quality status column.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to validate.	required
`rule`	`dict`	A dictionary containing the validation rule. It should include: - 'field': The name of the column to validate. - 'check': The type of validation check (e.g., regex, condition). - 'value': The value or pattern to validate against.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame containing rows that failed the validation, with an additional column 'dq_status' indicating the validation rule applied.

is_less_or_equal_than ¶

is_less_or_equal_than(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field's value is greater than the given value, and adds a new column indicating the rule applied.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to be filtered.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field': The name of the column to apply the filter on. - 'check': The type of check being performed (e.g., 'less_or_equal_than'). - 'value': The value to compare against.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame with rows filtered based on the rule and an additional column named "dq_status" indicating the rule applied.

is_less_than ¶

is_less_than(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field is greater than or equal to a given value. Adds a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to filter.	required
`rule`	`dict`	A dictionary containing the filtering rule. It should include the following keys: - 'field': The name of the column to apply the filter on. - 'check': A string representing the type of check (not used in logic). - 'value': The threshold value for the filter.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame with rows filtered based on the condition and an additional column named "dq_status" containing the rule description in the format "field:check:value".

is_negative ¶

is_negative(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to exclude rows where the specified field is negative and adds a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to be filtered.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field': The name of the column to check. - 'check': The type of check being performed (e.g., "is_negative"). - 'value': The value associated with the rule (not used in this function).	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame with rows where the specified field is non-negative and an additional column named "dq_status" containing the rule details.

is_on_friday ¶

is_on_friday(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the date corresponds to a Friday.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame containing the data to filter.	required
`rule`	`dict`	A dictionary containing filtering rules or parameters.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame containing only the rows where the date is a Friday.

is_on_monday ¶

is_on_monday(df: DataFrame, rule: dict) -> pl.DataFrame

Filters the given DataFrame to include only rows where the date corresponds to a Monday.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input DataFrame to filter.	required
`rule`	`dict`	A dictionary containing rules or parameters for filtering.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame containing only the rows where the date is a Monday.

is_on_saturday ¶

is_on_saturday(df: DataFrame, rule: dict) -> pl.DataFrame

Determines if the dates in the given DataFrame fall on a Saturday.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input DataFrame containing date information.	required
`rule`	`dict`	A dictionary containing rules or parameters for the operation.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A DataFrame with the result of the operation, indicating whether each date is on a Saturday.

is_on_sunday ¶

is_on_sunday(df: DataFrame, rule: dict) -> pl.DataFrame

Filters the given DataFrame to include only rows where the date corresponds to Sunday.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input DataFrame containing date-related data.	required
`rule`	`dict`	A dictionary containing rules or parameters for filtering.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A filtered DataFrame containing only rows where the date is a Sunday.

is_on_thursday ¶

is_on_thursday(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the date corresponds to a Thursday.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame containing the data to filter.	required
`rule`	`dict`	A dictionary containing filtering rules or parameters.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame containing only the rows where the date is a Thursday.

is_on_tuesday ¶

is_on_tuesday(df: DataFrame, rule: dict) -> pl.DataFrame

Filters the given DataFrame to include only rows where the day of the week matches Tuesday.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input DataFrame to filter.	required
`rule`	`dict`	A dictionary containing rules or parameters for filtering.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame containing only rows where the day of the week is Tuesday.

is_on_wednesday ¶

is_on_wednesday(df: DataFrame, rule: dict) -> pl.DataFrame

Filters the given DataFrame to include only rows where the day of the week matches Wednesday.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input DataFrame to filter.	required
`rule`	`dict`	A dictionary containing rules or parameters for filtering.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A filtered DataFrame containing only rows corresponding to Wednesday.

is_on_weekday ¶

is_on_weekday(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field falls on a weekday (Monday to Friday). Adds a new column indicating the rule applied.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame.	required
`rule`	`dict`	A dictionary containing the rule parameters. It is expected to have keys that can be extracted using the `__extract_params` function.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame filtered to include only rows where the date field falls on a weekday, with an additional column named "dq_status" indicating the applied rule in the format "field:check:value".

is_on_weekend ¶

is_on_weekend(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field falls on a weekend (Saturday or Sunday). Adds a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame.	required
`rule`	`dict`	A dictionary containing the rule parameters. It is expected to include the following keys: - 'field': The name of the column containing date strings. - 'check': A string representing the type of check being performed. - 'value': A value associated with the rule (not used in the logic).	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame filtered to include only rows where the specified date field falls on a weekend. The resulting DataFrame also includes an additional column named "dq_status" with a string indicating the rule applied.

is_past_date ¶

is_past_date(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field contains a date earlier than today. Adds a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to filter.	required
`rule`	`dict`	A dictionary containing the rule parameters. It is expected to include the field name to check, a check identifier, and additional parameters.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame containing only rows where the specified date field is in the past, with an additional column named "dq_status" that contains a string in the format "{field}:{check}:{today}".

is_positive ¶

is_positive(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to identify rows where the specified field contains negative values and appends a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to be filtered.	required
`rule`	`dict`	A dictionary containing the rule parameters. It is expected to include the following keys: - 'field': The name of the column to check. - 'check': The type of check being performed (e.g., "is_positive"). - 'value': The reference value for the check.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame containing only the rows where the specified field has negative values, with an additional column named "dq_status" that describes the rule applied.

is_primary_key ¶

is_primary_key(df: DataFrame, rule: dict) -> pl.DataFrame

Checks if the specified rule identifies a primary key in the given DataFrame.

A primary key is a set of columns in a DataFrame that uniquely identifies each row. This function delegates the check to the is_unique function.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The DataFrame to check for primary key uniqueness.	required
`rule`	`dict`	A dictionary specifying the rule or criteria to determine the primary key.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: The result of `is_unique` for the given rule: the rows that contain duplicate values in the specified field, with an additional column 'dq_status' describing the rule applied.

is_t_minus_1 ¶

is_t_minus_1(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field matches the date of "yesterday" (T-1) and appends a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to filter.	required
`rule`	`dict`	A dictionary containing the rule parameters. It is expected to include the following keys: - 'field': The name of the column to check. - 'check': A string representing the type of check (used for metadata). - 'value': A value associated with the check (used for metadata).	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame filtered to include only rows where the specified field matches the date of yesterday (T-1). The resulting DataFrame also includes an additional column named "dq_status" that contains metadata about the rule applied.

is_t_minus_2 ¶

is_t_minus_2(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field matches the date two days prior to the current date. Adds a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to filter.	required
`rule`	`dict`	A dictionary containing the rule parameters. It is expected to include the following keys: - 'field': The name of the date field to check. - 'check': A string representing the type of check (not used in filtering). - 'value': A value associated with the rule (not used in filtering).	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame filtered to include only rows where the specified date field matches the date two days ago. The resulting DataFrame includes an additional column named "dq_status" with a string indicating the rule applied.

is_t_minus_3 ¶

is_t_minus_3(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field matches the date three days prior to the current date. Additionally, adds a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to filter.	required
`rule`	`dict`	A dictionary containing the rule parameters. It should include: - 'field': The name of the date column to check. - 'check': A string representing the type of check (used for status annotation). - 'value': A value associated with the rule (used for status annotation).	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A filtered Polars DataFrame with an additional column named "dq_status" that contains a string in the format "{field}:{check}:{value}".

is_today ¶

is_today(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified date field matches today's date. Additionally, adds a new column "dq_status" with a formatted string indicating the rule applied.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to filter.	required
`rule`	`dict`	A dictionary containing the rule parameters. It is expected to have the following keys: - field (str): The name of the column to check. - check (str): A descriptive string for the type of check (used in the "dq_status" column). - value (str): A value associated with the rule (used in the "dq_status" column).	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A filtered Polars DataFrame with rows matching today's date in the specified field and an additional "dq_status" column describing the rule applied.

Raises:

Type	Description
`ValueError`	If the rule dictionary does not contain the required keys or if the date parsing fails.

is_unique ¶

is_unique(df: DataFrame, rule: dict) -> pl.DataFrame

Checks for duplicate values in a specified field of a Polars DataFrame and returns a filtered DataFrame containing only the rows with duplicate values. Additionally, it adds a new column 'dq_status' with a formatted string indicating the field, check type, and value.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to check for duplicates.	required
`rule`	`dict`	A dictionary containing the rule parameters. It is expected to have keys that allow extraction of the field to check, the type of check, and a value.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A filtered DataFrame containing rows with duplicate values in the specified field, along with an additional column 'dq_status' describing the rule applied.

not_contained_in ¶

not_contained_in(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame to include only rows where the specified field's value is in a given list, and adds a new column indicating the data quality status.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to filter.	required
`rule`	`dict`	A dictionary containing the filtering rule. It should include: - 'field': The column name to apply the filter on. - 'check': A string representing the type of check (not used in logic). - 'value': A string representation of a list of values (e.g., "[value1, value2]").	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new Polars DataFrame with rows filtered based on the rule and an additional column "dq_status" indicating the applied rule.

not_in ¶

not_in(df: DataFrame, rule: dict) -> pl.DataFrame

Filters a Polars DataFrame using the same rule handling as `not_contained_in`; the `rule` dictionary is interpreted exactly as that function interprets it.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to filter.	required
`rule`	`dict`	A dictionary specifying the filtering rule. The structure and expected keys of this dictionary depend on the implementation of the `not_contained_in` function.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame with rows excluded based on the given rule.

satisfies ¶

satisfies(df: DataFrame, rule: dict) -> pl.DataFrame

Evaluates a given rule against a Polars DataFrame and returns rows that do not satisfy the rule.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to be evaluated.	required
`rule`	`dict`	A dictionary containing the rule to be applied. The rule should include the following keys: - 'field': The column name in the DataFrame to be checked. - 'check': The type of check or condition to be applied. - 'value': The value or expression to validate against.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A DataFrame containing rows that do not satisfy the rule, with an additional column `dq_status` indicating the rule that was violated in the format "field:check:value".

Example

rule = {"field": "age", "check": ">", "value": "18"}
result = satisfies(df, rule)

summarize ¶

summarize(qc_df: DataFrame, rules: list[dict], total_rows: int) -> pl.DataFrame

Summarizes quality check results by processing a DataFrame containing data quality statuses and comparing them against defined rules.

Parameters:

Name	Type	Description	Default
`qc_df`	`DataFrame`	A Polars DataFrame containing a column `dq_status` with semicolon-separated strings representing data quality statuses in the format "column:rule:value".	required
`rules`	`list[dict]`	A list of dictionaries where each dictionary defines a rule with keys such as "column", "rule", "value", and "pass_threshold".	required
`total_rows`	`int`	The total number of rows in the original dataset, used to calculate the pass rate.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A summarized DataFrame containing the following columns: - id: A unique identifier for each rule. - timestamp: The timestamp when the summary was generated. - check: A label indicating the type of check (e.g., "Quality Check"). - level: The severity level of the check (e.g., "WARNING"). - column: The column name associated with the rule. - rule: The rule being evaluated. - value: The specific value associated with the rule. - rows: The total number of rows in the dataset. - violations: The number of rows that violated the rule. - pass_rate: The proportion of rows that passed the rule. - pass_threshold: The threshold for passing the rule. - status: The status of the rule evaluation ("PASS" or "FAIL").
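The arithmetic behind `violations`, `pass_rate`, and `status` can be sketched in plain Python, with a list of status strings standing in for the `dq_status` column. This is not the library's implementation. The helper name `summarize_statuses` is hypothetical, and the rule that a check passes when `pass_rate >= pass_threshold` is an assumption.

```python
from collections import Counter


def summarize_statuses(statuses: list[str], rules: list[dict], total_rows: int) -> list[dict]:
    """Hypothetical helper: count violations per rule and derive pass/fail."""
    # Each status is a semicolon-separated list of "column:rule:value" tokens.
    counts = Counter(
        token for status in statuses for token in status.split(";") if token
    )
    summary = []
    for rule in rules:
        key = f"{rule['column']}:{rule['rule']}:{rule['value']}"
        violations = counts.get(key, 0)
        pass_rate = (total_rows - violations) / total_rows if total_rows else 1.0
        summary.append({
            "column": rule["column"],
            "rule": rule["rule"],
            "value": rule["value"],
            "rows": total_rows,
            "violations": violations,
            "pass_rate": pass_rate,
            "pass_threshold": rule["pass_threshold"],
            # Assumption: PASS when the pass rate meets or exceeds the threshold.
            "status": "PASS" if pass_rate >= rule["pass_threshold"] else "FAIL",
        })
    return summary


statuses = ["age:is_greater_than:18", "age:is_greater_than:18;email:is_complete:", ""]
rules = [
    {"column": "age", "rule": "is_greater_than", "value": "18", "pass_threshold": 0.9},
    {"column": "email", "rule": "is_complete", "value": "", "pass_threshold": 0.9},
]
summary = summarize_statuses(statuses, rules, total_rows=10)
```

With 10 rows in total, the `age` rule has 2 violations (pass rate 0.8, below 0.9) and the `email` rule has 1 (pass rate 0.9, which meets the threshold).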

validate ¶

validate(df: DataFrame, rules: list[dict]) -> Tuple[pl.DataFrame, pl.DataFrame]

Validates a Polars DataFrame against a set of rules and returns the updated DataFrame with validation statuses and a DataFrame containing the validation violations.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to validate.	required
`rules`	`list[dict]`	A list of dictionaries representing validation rules. Each rule should contain the following keys: - "check_type" (str): The type of validation to perform (e.g., "is_primary_key", "is_composite_key", "has_pattern", etc.). - "value" (optional): The value to validate against, depending on the rule type. - "execute" (bool, optional): Whether to execute the rule. Defaults to True.	required

Returns:

Type	Description
`Tuple[DataFrame, DataFrame]`	Tuple[pl.DataFrame, pl.DataFrame]: A tuple containing: - The original DataFrame with an additional "dq_status" column indicating the validation status for each row. - A DataFrame containing rows that violated the validation rules, including details of the violations.

Notes

The function dynamically resolves validation functions based on the "check_type" specified in the rules.
If a rule's "check_type" is unknown, a warning is issued, and the rule is skipped.
The "__id" column is temporarily added to the DataFrame for internal processing and is removed in the final output.
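The dispatch behaviour in these notes can be illustrated in plain Python, with a list of dicts standing in for the DataFrame. This is a sketch, not the library's code. The `CHECKS` registry, the toy `is_complete` check, and the `run_rules` helper are hypothetical, and the 'field' key on each rule is an assumption.

```python
import warnings


def is_complete(rows: list[dict], rule: dict) -> list[int]:
    """Toy check: return the indices of rows whose field is None."""
    return [i for i, row in enumerate(rows) if row.get(rule["field"]) is None]


# Hypothetical registry mapping "check_type" names to check functions.
CHECKS = {"is_complete": is_complete}


def run_rules(rows: list[dict], rules: list[dict]) -> dict[int, str]:
    """Resolve each rule by name, skip disabled or unknown ones, collect statuses."""
    violations: dict[int, list[str]] = {}
    for rule in rules:
        if not rule.get("execute", True):
            continue  # rule explicitly disabled
        check = CHECKS.get(rule["check_type"])
        if check is None:
            warnings.warn(f"Unknown check_type: {rule['check_type']}")
            continue  # unknown rules are skipped, not fatal
        for idx in check(rows, rule):
            violations.setdefault(idx, []).append(
                f"{rule['field']}:{rule['check_type']}:{rule.get('value', '')}"
            )
    # Join multiple violations per row with semicolons.
    return {idx: ";".join(tokens) for idx, tokens in violations.items()}


rows = [{"name": "a"}, {"name": None}]
rules = [
    {"field": "name", "check_type": "is_complete"},
    {"field": "name", "check_type": "no_such_check"},
    {"field": "name", "check_type": "is_complete", "execute": False},
]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    statuses = run_rules(rows, rules)
```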

validate_date_format ¶

validate_date_format(df: DataFrame, rule: dict) -> pl.DataFrame

Validates the date format of a specified field in a Polars DataFrame based on a given rule.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Polars DataFrame to validate.	required
`rule`	`dict`	A dictionary containing the validation rule. It should include: - field (str): The name of the column to validate. - check (str): The name of the validation check. - fmt (str): The expected date format to validate against.	required

Returns:

Type	Description
`DataFrame`	pl.DataFrame: A new DataFrame containing only the rows where the specified field does not match the expected date format or is null. An additional column "dq_status" is added to indicate the validation status in the format "{field}:{check}:{fmt}".

validate_schema ¶

validate_schema(df, expected) -> tuple[bool, list[dict[str, Any]]]

Validates the schema of a given DataFrame against an expected schema.

Parameters:

Name	Type	Description	Default
`df`		The DataFrame whose schema needs to be validated.	required
`expected`		The expected schema, represented as a list of tuples where each tuple contains the column name and its data type.	required

Returns:

Type	Description
`tuple[bool, list[dict[str, Any]]]`	tuple[bool, list[dict[str, Any]]]: A tuple containing: - A boolean indicating whether the schema matches the expected schema. - A list of errors, one entry per mismatch, each identifying the column and describing the mismatch.
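The comparison can be sketched in plain Python over a name-to-type mapping; for a Polars DataFrame such a mapping could be built from `df.schema`. This is an illustration under assumptions, not the library's code. The helper name `compare_schema`, the string type names, and the error keys "column" and "error" are all hypothetical.

```python
from typing import Any


def compare_schema(
    actual: dict[str, str], expected: list[tuple[str, str]]
) -> tuple[bool, list[dict[str, Any]]]:
    """Hypothetical helper: report missing columns and type mismatches."""
    errors: list[dict[str, Any]] = []
    for name, dtype in expected:
        if name not in actual:
            errors.append({"column": name, "error": "missing column"})
        elif actual[name] != dtype:
            errors.append(
                {"column": name, "error": f"expected {dtype}, got {actual[name]}"}
            )
    # The schema is valid only when no errors were collected.
    return (not errors, errors)


actual = {"id": "Int64", "price": "Float64"}
expected = [("id", "Int64"), ("price", "Int64"), ("name", "String")]
ok, errors = compare_schema(actual, expected)
```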