polars¶
sumeh.engines.polars_engine ¶
This module provides a set of data quality validation functions using the Polars library. It includes various checks for data validation, such as completeness, uniqueness, range checks, pattern matching, and schema validation.
Functions:
Name | Description |
---|---|
is_positive | Filters rows where the specified field is less than zero. |
is_negative | Filters rows where the specified field is greater than or equal to zero. |
is_complete | Filters rows where the specified field is null. |
is_unique | Filters rows with duplicate values in the specified field. |
are_complete | Filters rows where any of the specified fields are null. |
are_unique | Filters rows with duplicate combinations of the specified fields. |
is_greater_than | Filters rows where the specified field is less than or equal to the given value. |
is_greater_or_equal_than | Filters rows where the specified field is less than the given value. |
is_less_than | Filters rows where the specified field is greater than or equal to the given value. |
is_less_or_equal_than | Filters rows where the specified field is greater than the given value. |
is_equal | Filters rows where the specified field is not equal to the given value. |
is_equal_than | Alias for |
is_in_millions | Retains rows where the field value is less than 1,000,000 and flags them with dq_status. |
is_in_billions | Retains rows where the field value is less than 1,000,000,000 and flags them with dq_status. |
is_t_minus_1 | Retains rows where the date field does not equal yesterday (T-1) and flags them with dq_status. |
is_t_minus_2 | Retains rows where the date field does not equal two days ago (T-2) and flags them with dq_status. |
is_t_minus_3 | Retains rows where the date field does not equal three days ago (T-3) and flags them with dq_status. |
is_today | Retains rows where the date field does not equal today and flags them with dq_status. |
is_yesterday | Retains rows where the date field does not equal yesterday and flags them with dq_status. |
is_on_weekday | Retains rows where the date field does not fall on a weekday (Mon-Fri) and flags them with dq_status. |
is_on_weekend | Retains rows where the date field is not on a weekend (Sat-Sun) and flags them with dq_status. |
is_on_monday | Retains rows where the date field is not on Monday and flags them with dq_status. |
is_on_tuesday | Retains rows where the date field is not on Tuesday and flags them with dq_status. |
is_on_wednesday | Retains rows where the date field is not on Wednesday and flags them with dq_status. |
is_on_thursday | Retains rows where the date field is not on Thursday and flags them with dq_status. |
is_on_friday | Retains rows where the date field is not on Friday and flags them with dq_status. |
is_on_saturday | Retains rows where the date field is not on Saturday and flags them with dq_status. |
is_on_sunday | Retains rows where the date field is not on Sunday and flags them with dq_status. |
is_contained_in | Filters rows where the specified field is not in the given list of values. |
not_contained_in | Filters rows where the specified field is in the given list of values. |
is_between | Filters rows where the specified field is not within the given range. |
has_pattern | Filters rows where the specified field does not match the given regex pattern. |
is_legit | Filters rows where the specified field is null or contains whitespace. |
has_max | Filters rows where the specified field exceeds the given maximum value. |
has_min | Filters rows where the specified field is below the given minimum value. |
has_std | Checks if the standard deviation of the specified field exceeds the given value. |
has_mean | Checks if the mean of the specified field exceeds the given value. |
has_sum | Checks if the sum of the specified field exceeds the given value. |
has_cardinality | Checks if the cardinality (number of unique values) of the specified field exceeds the given value. |
has_infogain | Placeholder for information gain validation (currently uses cardinality). |
has_entropy | Placeholder for entropy validation (currently uses cardinality). |
satisfies | Filters rows that do not satisfy the given SQL condition. |
validate_date_format | Filters rows where the specified field does not match the expected date format or is null. |
is_future_date | Filters rows where the specified date field is after today. |
is_past_date | Filters rows where the specified date field is before today. |
is_date_between | Filters rows where the specified date field is not within the given [start,end] range. |
is_date_after | Filters rows where the specified date field is before the given date. |
is_date_before | Filters rows where the specified date field is after the given date. |
all_date_checks | Alias for |
validate | Validates a DataFrame against a list of rules and returns the original DataFrame with data quality status and a DataFrame of violations. |
__build_rules_df | Converts a list of rules into a Polars DataFrame for summarization. |
summarize | Summarizes the results of data quality checks, including pass rates and statuses. |
__polars_schema_to_list | Converts a Polars DataFrame schema into a list of dictionaries. |
validate_schema | Validates the schema of a DataFrame against an expected schema and returns a boolean result and a list of errors. |
__build_rules_df ¶
Builds a Polars DataFrame from a list of rule dictionaries.
This function processes a list of rule dictionaries, filters out rules that are not marked for execution, and constructs a DataFrame with the relevant rule information. It ensures uniqueness of rows based on specific columns and casts the data to appropriate types.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
rules | list[dict] | A list of dictionaries, where each dictionary represents a rule. Each rule dictionary may contain the following keys: - "field" (str or list): The column(s) the rule applies to. - "check_type" (str): The type of rule or check. - "threshold" (float, optional): The pass threshold for the rule. Defaults to 1.0. - "value" (any, optional): Additional value associated with the rule. - "execute" (bool, optional): Whether the rule should be executed. Defaults to True. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A Polars DataFrame containing the processed rules with the following columns: - "column" (str): The column(s) the rule applies to, joined by commas if multiple. - "rule" (str): The type of rule or check. - "pass_threshold" (float): The pass threshold for the rule. - "value" (str): The value associated with the rule, or an empty string if not provided. |
all_date_checks ¶
Applies all date-related validation checks on the given DataFrame based on the specified rule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input DataFrame to validate. | required |
rule | dict | A dictionary containing the validation rules to apply. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: The DataFrame after applying the date validation checks. |
are_complete ¶
Filters a Polars DataFrame to identify rows where specified fields contain null values and tags them with a data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to be checked. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'fields': A list of column names to check for null values. - 'check': A string representing the type of check (e.g., "is_null"). - 'value': A value associated with the check (not used in this function). | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A filtered DataFrame containing only rows where at least one of the specified fields is null, with an additional column "dq_status" indicating the data quality status. |
are_unique ¶
Checks for duplicate combinations of specified fields in a Polars DataFrame and returns a DataFrame containing the rows with duplicates along with a data quality status column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to check for duplicates. | required |
rule | dict | A dictionary containing the rule parameters. It is expected to include the following keys: - 'fields': A list of column names to check for uniqueness. - 'check': A string representing the type of check (e.g., "unique"). - 'value': A value associated with the check (e.g., "True"). | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A DataFrame containing rows with duplicate combinations of the specified fields. An additional column, "dq_status", is added to indicate the data quality status in the format "{fields}:{check}:{value}". |
extract_schema ¶
Converts the schema of a Polars DataFrame into a list of dictionaries, where each dictionary represents a field in the schema.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The Polars DataFrame whose schema is to be converted. | required |
Returns:
Type | Description |
---|---|
List[Dict[str, Any]] | List[Dict[str, Any]]: A list of dictionaries, each containing the following keys: - "field" (str): The name of the field. - "data_type" (str): The data type of the field, converted to lowercase. - "nullable" (bool): Always set to True, as Polars does not expose nullability in the schema. - "max_length" (None): Always set to None, as max length is not applicable. |
has_cardinality ¶
Checks if the cardinality (number of unique values) of a specified field in the given DataFrame satisfies a condition defined in the rule. If the cardinality exceeds the specified value, a new column "dq_status" is added to the DataFrame with a string indicating the rule violation. Otherwise, an empty DataFrame is returned.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input DataFrame to evaluate. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - "field" (str): The column name to check. - "check" (str): The type of check (e.g., "greater_than"). - "value" (int): The threshold value for the cardinality. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: The original DataFrame with an added "dq_status" column if the rule is violated, or an empty DataFrame if the rule is not violated. |
has_entropy ¶
Evaluates the entropy of a specified field in a Polars DataFrame based on a given rule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to evaluate. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field' (str): The column name in the DataFrame to evaluate. - 'check' (str): The type of check to perform (not used directly in this function). - 'value' (float): The threshold value for entropy comparison. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: - If the entropy of the specified field exceeds the given threshold ( |
Notes
- The entropy is calculated as the number of unique values in the specified field.
- The "dq_status" column contains a string in the format "{field}:{check}:{value}".
has_infogain ¶
Evaluates whether a given DataFrame satisfies an information gain condition based on a specified rule. If the condition is met, a new column indicating the rule is added; otherwise, an empty DataFrame is returned.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input DataFrame to evaluate. | required |
rule | dict | A dictionary containing the rule parameters. It should include the following keys: - 'field': The column name to evaluate. - 'check': The type of check to perform (not used directly in this function). - 'value': The threshold value for the information gain. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: The original DataFrame with an additional column named "dq_status" if the condition is met, or an empty DataFrame if the condition is not met. |
has_max ¶
Filters a Polars DataFrame to include only rows where the value in a specified column exceeds a given threshold, and adds a new column indicating the rule applied.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to be filtered. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field' (str): The column name to apply the filter on. - 'check' (str): The type of check being performed (e.g., "max"). - 'value' (numeric): The threshold value to compare against. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new DataFrame containing only the rows that satisfy the condition, with an additional column named "dq_status" that describes the applied rule. |
has_mean ¶
Checks if the mean value of a specified column in a Polars DataFrame satisfies a given condition.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field' (str): The name of the column to calculate the mean for. - 'check' (str): The condition to check (e.g., 'greater than'). - 'value' (float): The threshold value to compare the mean against. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: - If the mean value of the specified column is greater than the threshold value, returns the original DataFrame with an additional column "dq_status" containing a string in the format "{field}:{check}:{value}". - If the condition is not met, returns an empty DataFrame with the same schema as the input. |
has_min ¶
Filters a Polars DataFrame to include only rows where the value of a specified column is less than a given threshold and adds a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to be filtered. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field': The name of the column to apply the filter on. - 'check': A string representing the type of check (e.g., 'min'). - 'value': The threshold value for the filter. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new Polars DataFrame containing only the rows that satisfy the condition, with an additional column named "dq_status" indicating the applied rule in the format "field:check:value". |
has_pattern ¶
Filters a Polars DataFrame based on a pattern-matching rule and adds a data quality status column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to be filtered. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field': The column name in the DataFrame to apply the pattern check. - 'check': A descriptive label for the check being performed. - 'pattern': The regex pattern to match against the column values. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new DataFrame with rows not matching the pattern removed and an additional column named "dq_status" indicating the rule applied in the format "field:check:pattern". |
has_std ¶
Evaluates whether the standard deviation of a specified column in a Polars DataFrame exceeds a given threshold and returns a modified DataFrame accordingly.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to evaluate. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field' (str): The name of the column to calculate the standard deviation for. - 'check' (str): A descriptive label for the check being performed. - 'value' (float): The threshold value for the standard deviation. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A modified DataFrame. If the standard deviation of the specified column exceeds the threshold, the DataFrame will include a new column "dq_status" with a descriptive string. Otherwise, an empty DataFrame is returned. |
has_sum ¶
Checks if the sum of a specified column in a Polars DataFrame exceeds a given value.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field': The name of the column to sum. - 'check': A string representing the check type (not used in this function). - 'value': The threshold value to compare the sum against. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: If the sum of the specified column exceeds the given value, returns the original DataFrame with an additional column "dq_status" containing a string in the format "{field}:{check}:{value}". Otherwise, returns an empty DataFrame. |
is_between ¶
Filters a Polars DataFrame to exclude rows where the specified field's value falls within a given range, and adds a column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to filter. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field': The name of the column to check. - 'check': The type of check being performed (e.g., "is_between"). - 'value': A string representing the range in the format "[lo,hi]". | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new Polars DataFrame containing the rows outside the specified range, with an additional column named "dq_status" indicating the rule applied. |
Raises:
Type | Description |
---|---|
ValueError | If the 'value' parameter is not in the expected format "[lo,hi]". |
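Parsing the "[lo,hi]" string and rejecting malformed input might look like the sketch below. The function name parse_range, numeric bounds, and inclusive comparison are all assumptions; the source only specifies the string format and the ValueError.

```python
def parse_range(value: str) -> tuple[float, float]:
    """Parse a '[lo,hi]' string into two floats, or raise ValueError."""
    text = value.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"expected '[lo,hi]', got {value!r}")
    parts = [part.strip() for part in text[1:-1].split(",")]
    if len(parts) != 2:
        raise ValueError(f"expected exactly two bounds in {value!r}")
    try:
        lo, hi = float(parts[0]), float(parts[1])
    except ValueError as exc:
        raise ValueError(f"bounds must be numeric in {value!r}") from exc
    return lo, hi


lo, hi = parse_range("[10, 20]")
values = [5, 10, 15, 25]
# Assumed inclusive bounds: 10 and 20 themselves count as inside the range.
outside = [v for v in values if not lo <= v <= hi]
```

With these inputs, 5 and 25 fall outside the range; a value such as "10,20" without brackets raises ValueError.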
is_complete ¶
Filters a Polars DataFrame to include only rows where the specified field is not null and appends a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to be filtered and modified. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field' (str): The name of the column to check for non-null values. - 'check' (str): A descriptive string for the type of check being performed. - 'value' (str): A value associated with the rule for status annotation. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new Polars DataFrame with rows filtered based on the rule and an additional column named "dq_status" containing the data quality status. |
is_composite_key ¶
Determines if the given DataFrame satisfies the composite key condition based on the provided rule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input DataFrame to evaluate. | required |
rule | dict | A dictionary defining the rule to check for composite key uniqueness. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A DataFrame indicating whether the composite key condition is met. |
is_contained_in ¶
Filters a Polars DataFrame to exclude rows where the specified field's value is contained in a given list of values, and adds a new column indicating the rule applied.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to be filtered. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field': The column name to check. - 'check': The type of check being performed (e.g., "is_contained_in"). - 'value': A string representation of a list of values to check against, e.g., "[value1, value2, value3]". | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new DataFrame with rows filtered based on the rule and an additional column "dq_status" indicating the rule applied. |
is_date_after ¶
Filters a Polars DataFrame to include only rows where the specified date field is earlier than a given date, and adds a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to filter. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field' (str): The name of the column containing date strings. - 'check' (str): A descriptive label for the check being performed. - 'date_str' (str): The date string in the format "%Y-%m-%d" to compare against. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new Polars DataFrame with rows filtered based on the date condition and an additional column named "dq_status" indicating the applied rule. |
is_date_before ¶
Filters a Polars DataFrame to include only rows where the specified date field is after a given date, and adds a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to filter. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field' (str): The name of the column to check. - 'check' (str): A descriptive label for the check being performed. - 'date_str' (str): The date string in the format "%Y-%m-%d" to compare against. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new Polars DataFrame with rows filtered based on the date condition and an additional column named "dq_status" indicating the applied rule. |
is_date_between ¶
Filters a Polars DataFrame to exclude rows where the specified date field is within a given range.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to filter. | required |
rule | dict | A dictionary containing the filtering rule. It should include: - 'field': The name of the column to check. - 'check': A string representing the type of check (e.g., "is_date_between"). - 'value': A string representing the date range in the format "[YYYY-MM-DD,YYYY-MM-DD]". | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new DataFrame excluding rows where the date in the specified field falls within the given inclusive range, with an additional column "dq_status" indicating the rule applied. |
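The "[YYYY-MM-DD,YYYY-MM-DD]" value can be split into two dates with the standard library before any filtering happens. This sketch uses plain Python lists to show the inclusive-range logic; parse_date_range is a hypothetical helper, not a function of the module.

```python
from datetime import date, datetime


def parse_date_range(value: str) -> tuple[date, date]:
    # "[2024-01-01,2024-01-31]" -> (date(2024, 1, 1), date(2024, 1, 31))
    start_text, end_text = value.strip().strip("[]").split(",")
    start = datetime.strptime(start_text.strip(), "%Y-%m-%d").date()
    end = datetime.strptime(end_text.strip(), "%Y-%m-%d").date()
    return start, end


start, end = parse_date_range("[2024-01-01,2024-01-31]")
dates = [date(2023, 12, 31), date(2024, 1, 1), date(2024, 1, 31), date(2024, 2, 1)]
# Inclusive range: both endpoints count as inside, so only the first and
# last dates are left over.
outside = [d for d in dates if not start <= d <= end]
```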
is_equal ¶
Filters rows in a Polars DataFrame that do not match a specified equality condition and adds a column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to be filtered. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field': The column name to apply the equality check on. - 'check': The type of check (expected to be 'eq' for equality). - 'value': The value to compare against. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new DataFrame with rows filtered based on the rule and an additional column named "dq_status" indicating the rule applied. |
is_equal_than ¶
Filters rows in a Polars DataFrame where the specified field is not equal to a given value and adds a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to be filtered. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field': The name of the column to check. - 'check': The type of check (expected to be 'equal' for this function). - 'value': The value to compare against. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new Polars DataFrame with rows filtered based on the rule and an additional column named "dq_status" indicating the applied rule. |
is_future_date ¶
Filters a Polars DataFrame to include only rows where the specified date field contains a future date, based on the current date.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to filter. | required |
rule | dict | A dictionary containing the rule parameters. It is expected to include the field name to check, the check type, and additional parameters (ignored in this function). | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new DataFrame containing only rows where the specified date field is in the future. An additional column "dq_status" is added to indicate the field, check type, and today's date in the format "field:check:today". |
is_greater_or_equal_than ¶
Filters a Polars DataFrame to include only rows where the specified field is greater than or equal to a given value, and adds a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to be filtered. | required |
rule | dict | A dictionary containing the filtering rule. It should include the following keys: - 'field': The name of the column to be checked. - 'check': The type of check being performed (e.g., "greater_or_equal"). - 'value': The threshold value for the comparison. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new Polars DataFrame with rows filtered based on the specified rule and an additional column named "dq_status" indicating the data quality status in the format "field:check:value". |
is_greater_than ¶
Filters a Polars DataFrame to include only rows where the specified field's value is less than or equal to a given value, and adds a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The Polars DataFrame to filter. | required |
rule | dict | A dictionary containing the filtering rule. It should include: - 'field': The name of the column to apply the filter on. - 'check': A string describing the check (e.g., "greater_than"). - 'value': The value to compare against. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new DataFrame with rows filtered based on the rule and an additional column named "dq_status" indicating the applied rule. |
is_in ¶
Checks if the rows in the given DataFrame satisfy the conditions specified in the rule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input DataFrame to evaluate. | required |
rule | dict | A dictionary specifying the conditions to check against the DataFrame. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A DataFrame containing rows that satisfy the specified conditions. |
is_in_billions ¶
Filters a Polars DataFrame to include only rows where the specified field's value is less than one billion and adds a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to be filtered. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - field (str): The name of the column to check. - check (str): The type of check being performed (e.g., "less_than"). - value (any): The value associated with the rule (not used in this function). | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new DataFrame with rows filtered based on the rule and an additional column named "dq_status" containing a string in the format "{field}:{check}:{value}". |
is_in_millions ¶
Filters a Polars DataFrame to include only rows where the specified field's value is less than one million and adds a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to be filtered. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field': The name of the column to check. - 'check': A string describing the check being performed. - 'value': A value associated with the rule (used for status annotation). | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new Polars DataFrame with rows filtered based on the rule and an additional column named "dq_status" containing the data quality status. |
is_legit ¶
Filters a Polars DataFrame based on a validation rule and appends a data quality status column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to validate. | required |
rule | dict | A dictionary containing the validation rule. It should include: - 'field': The name of the column to validate. - 'check': The type of validation check (e.g., regex, condition). - 'value': The value or pattern to validate against. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new DataFrame containing rows that failed the validation, with an additional column 'dq_status' indicating the validation rule applied. |
is_less_or_equal_than ¶
Filters a Polars DataFrame to include only rows where the specified field's value is greater than the given value, and adds a new column indicating the rule applied.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to be filtered. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field': The name of the column to apply the filter on. - 'check': The type of check being performed (e.g., 'less_or_equal_than'). - 'value': The value to compare against. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new DataFrame with rows filtered based on the rule and an additional column named "dq_status" indicating the rule applied. |
is_less_than ¶
Filters a Polars DataFrame to include only rows where the specified field is greater than or equal to a given value. Adds a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to filter. | required |
rule | dict | A dictionary containing the filtering rule. It should include the following keys: - 'field': The name of the column to apply the filter on. - 'check': A string representing the type of check (not used in logic). - 'value': The threshold value for the filter. | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new Polars DataFrame with rows filtered based on the condition and an additional column named "dq_status" containing the rule description in the format "field:check:value". |
is_negative ¶
Filters a Polars DataFrame to exclude rows where the specified field is negative and adds a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to be filtered. | required |
rule | dict | A dictionary containing the rule parameters. It should include: - 'field': The name of the column to check. - 'check': The type of check being performed (e.g., "is_negative"). - 'value': The value associated with the rule (not used in this function). | required |
Returns:
Type | Description |
---|---|
DataFrame | pl.DataFrame: A new DataFrame containing the rows where the specified field is non-negative, with an additional column named "dq_status" containing the rule details. |
is_on_friday ¶
Filters a Polars DataFrame to include only rows where the date corresponds to a Friday.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame containing the data to filter. | required |
rule | dict | A dictionary containing filtering rules or parameters. | required |
Returns:
Type | Description |
---|---|
DataFrame | A new Polars DataFrame containing only the rows where the date is a Friday. |
is_on_monday ¶
Filters the given DataFrame to include only rows where the date corresponds to a Monday.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input DataFrame to filter. | required |
rule | dict | A dictionary containing rules or parameters for filtering. | required |
Returns:
Type | Description |
---|---|
DataFrame | A new DataFrame containing only the rows where the date is a Monday. |
is_on_saturday ¶
Determines if the dates in the given DataFrame fall on a Saturday.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input DataFrame containing date information. | required |
rule | dict | A dictionary containing rules or parameters for the operation. | required |
Returns:
Type | Description |
---|---|
DataFrame | A DataFrame with the result of the operation, indicating whether each date is on a Saturday. |
is_on_sunday ¶
Filters the given DataFrame to include only rows where the date corresponds to Sunday.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input DataFrame containing date-related data. | required |
rule | dict | A dictionary containing rules or parameters for filtering. | required |
Returns:
Type | Description |
---|---|
DataFrame | A filtered DataFrame containing only rows where the date is a Sunday. |
is_on_thursday ¶
Filters a Polars DataFrame to include only rows where the date corresponds to a Thursday.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame containing the data to filter. | required |
rule | dict | A dictionary containing filtering rules or parameters. | required |
Returns:
Type | Description |
---|---|
DataFrame | A new Polars DataFrame containing only the rows where the date is a Thursday. |
is_on_tuesday ¶
Filters the given DataFrame to include only rows where the day of the week matches Tuesday.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input DataFrame to filter. | required |
rule | dict | A dictionary containing rules or parameters for filtering. | required |
Returns:
Type | Description |
---|---|
DataFrame | A new DataFrame containing only rows where the day of the week is Tuesday. |
is_on_wednesday ¶
Filters the given DataFrame to include only rows where the day of the week matches Wednesday.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input DataFrame to filter. | required |
rule | dict | A dictionary containing rules or parameters for filtering. | required |
Returns:
Type | Description |
---|---|
DataFrame | A filtered DataFrame containing only rows corresponding to Wednesday. |
is_on_weekday ¶
Filters a Polars DataFrame to include only rows where the specified date field falls on a weekday (Monday to Friday). Adds a new column indicating the rule applied.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame. | required |
rule | dict | A dictionary containing the rule parameters. It is expected to have keys that can be extracted using the | required |
Returns:
Type | Description |
---|---|
DataFrame | A new DataFrame filtered to include only rows where the date field falls on a weekday, with an additional column named "dq_status" indicating the applied rule in the format "field:check:value". |
is_on_weekend ¶
Filters a Polars DataFrame to include only rows where the specified date field falls on a weekend (Saturday or Sunday). Adds a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame. | required |
rule | dict | A dictionary containing the rule parameters. It is expected to include the following keys: 'field', the name of the column containing date strings; 'check', a string representing the type of check being performed; 'value', a value associated with the rule (not used in the logic). | required |
Returns:
Type | Description |
---|---|
DataFrame | A new Polars DataFrame filtered to include only rows where the specified date field falls on a weekend. The resulting DataFrame also includes an additional column named "dq_status" with a string indicating the rule applied. |
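The weekday and weekend split described above comes down to a day-of-week test. The following is a standard-library sketch, not the Polars implementation; it assumes the date strings are in ISO format (`YYYY-MM-DD`), and the helper name `is_weekend` is hypothetical.

```python
from datetime import date


def is_weekend(value: str) -> bool:
    """True when an ISO date string falls on Saturday or Sunday."""
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    return date.fromisoformat(value).weekday() >= 5


dates = ["2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08"]
weekend_dates = [d for d in dates if is_weekend(d)]
weekday_dates = [d for d in dates if not is_weekend(d)]
```

Polars' `dt.weekday()` uses ISO numbering instead (Monday is 1, Sunday is 7), so the equivalent threshold there would differ by one.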
is_past_date ¶
Filters a Polars DataFrame to include only rows where the specified date field contains a date earlier than today. Adds a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to filter. | required |
rule | dict | A dictionary containing the rule parameters. It is expected to include the field name to check, a check identifier, and additional parameters. | required |
Returns:
Type | Description |
---|---|
DataFrame | A new DataFrame containing only rows where the specified date field is in the past, with an additional column named "dq_status" that contains a string in the format "{field}:{check}:{today}". |
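As an illustration of the "{field}:{check}:{today}" label, here is a plain-Python sketch of the past-date filter. It is not the library's code: rows are dictionaries, dates are assumed to be ISO strings, today's date is assumed to be rendered in ISO format, and `past_date_rows` with its `today` override is a hypothetical helper added to keep the example deterministic.

```python
from datetime import date


def past_date_rows(rows, rule, today=None):
    """Keep rows whose date field is strictly earlier than today."""
    today = today or date.today()
    field, check = rule["field"], rule["check"]
    status = f"{field}:{check}:{today.isoformat()}"  # assumed ISO rendering
    return [
        {**row, "dq_status": status}
        for row in rows
        if date.fromisoformat(row[field]) < today
    ]


rows = [
    {"created_at": "2024-06-14"},
    {"created_at": "2024-06-15"},
    {"created_at": "2024-06-16"},
]
rule = {"field": "created_at", "check": "is_past_date"}
past = past_date_rows(rows, rule, today=date(2024, 6, 15))
```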
is_positive ¶
Filters a Polars DataFrame to identify rows where the specified field contains negative values and appends a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to be filtered. | required |
rule | dict | A dictionary containing the rule parameters. It is expected to include the following keys: 'field', the name of the column to check; 'check', the type of check being performed (e.g., "is_positive"); 'value', the reference value for the check. | required |
Returns:
Type | Description |
---|---|
DataFrame | A new Polars DataFrame containing only the rows where the specified field has negative values, with an additional column named "dq_status" that describes the rule applied. |
is_primary_key ¶
Checks if the specified rule identifies a primary key in the given DataFrame.
A primary key is a set of columns in a DataFrame that uniquely identifies each row.
This function delegates the check to the is_unique function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The DataFrame to check for primary key uniqueness. | required |
rule | dict | A dictionary specifying the rule or criteria to determine the primary key. | required |
Returns:
Type | Description |
---|---|
DataFrame | A DataFrame indicating whether the rule satisfies the primary key condition. |
is_t_minus_1 ¶
Filters a Polars DataFrame to include only rows where the specified field matches the date of "yesterday" (T-1) and appends a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to filter. | required |
rule | dict | A dictionary containing the rule parameters. It is expected to include the following keys: 'field', the name of the column to check; 'check', a string representing the type of check (used for metadata); 'value', a value associated with the check (used for metadata). | required |
Returns:
Type | Description |
---|---|
DataFrame | A new Polars DataFrame filtered to include only rows where the specified field matches the date of yesterday (T-1). The resulting DataFrame also includes an additional column named "dq_status" that contains metadata about the rule applied. |
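The T-minus checks (is_t_minus_1 through is_t_minus_3) differ only in the number of days subtracted from the current date. A minimal sketch of that reference-date arithmetic follows; the helper `t_minus` and its `today` override are hypothetical and only serve to make the example reproducible.

```python
from datetime import date, timedelta


def t_minus(days, today=None):
    """Return the calendar date `days` days before today."""
    today = today or date.today()
    return today - timedelta(days=days)


reference = date(2024, 3, 1)  # fixed "today" for a reproducible example
targets = {n: t_minus(n, reference) for n in (1, 2, 3)}
```

Because the subtraction is done on calendar dates, month and leap-year boundaries are handled by `timedelta` rather than by manual day arithmetic.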
is_t_minus_2 ¶
Filters a Polars DataFrame to include only rows where the specified date field matches the date two days prior to the current date. Adds a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to filter. | required |
rule | dict | A dictionary containing the rule parameters. It is expected to include the following keys: 'field', the name of the date field to check; 'check', a string representing the type of check (not used in filtering); 'value', a value associated with the rule (not used in filtering). | required |
Returns:
Type | Description |
---|---|
DataFrame | A new Polars DataFrame filtered to include only rows where the specified date field matches the date two days ago. The resulting DataFrame includes an additional column named "dq_status" with a string indicating the rule applied. |
is_t_minus_3 ¶
Filters a Polars DataFrame to include only rows where the specified date field matches the date three days prior to the current date. Additionally, adds a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to filter. | required |
rule | dict | A dictionary containing the rule parameters. It should include: 'field', the name of the date column to check; 'check', a string representing the type of check (used for status annotation); 'value', a value associated with the rule (used for status annotation). | required |
Returns:
Type | Description |
---|---|
DataFrame | A filtered Polars DataFrame with an additional column named "dq_status" that contains a string in the format "{field}:{check}:{value}". |
is_today ¶
Filters a Polars DataFrame to include only rows where the specified date field matches today's date. Additionally, adds a new column "dq_status" with a formatted string indicating the rule applied.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to filter. | required |
rule | dict | A dictionary containing the rule parameters. It is expected to have the following keys: field (str), the name of the column to check; check (str), a descriptive string for the type of check (used in the "dq_status" column); value (str), a value associated with the rule (used in the "dq_status" column). | required |
Returns:
Type | Description |
---|---|
DataFrame | A filtered Polars DataFrame with rows matching today's date in the specified field and an additional "dq_status" column describing the rule applied. |
Raises:
Type | Description |
---|---|
ValueError | If the rule dictionary does not contain the required keys or if the date parsing fails. |
is_unique ¶
Checks for duplicate values in a specified field of a Polars DataFrame and returns a filtered DataFrame containing only the rows with duplicate values. Additionally, it adds a new column 'dq_status' with a formatted string indicating the field, check type, and value.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to check for duplicates. | required |
rule | dict | A dictionary containing the rule parameters. It is expected to have keys that allow extraction of the field to check, the type of check, and a value. | required |
Returns:
Type | Description |
---|---|
DataFrame | A filtered DataFrame containing rows with duplicate values in the specified field, along with an additional column 'dq_status' describing the rule applied. |
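The duplicate detection can be illustrated with a standard-library sketch. It assumes that every occurrence of a repeated value counts as a duplicate (the text above does not say whether the first occurrence is kept), models rows as dictionaries, and uses the hypothetical helper name `duplicate_rows`.

```python
from collections import Counter


def duplicate_rows(rows, rule):
    """Return every row whose field value appears more than once."""
    field, check, value = rule["field"], rule["check"], rule["value"]
    counts = Counter(row[field] for row in rows)
    status = f"{field}:{check}:{value}"
    return [
        {**row, "dq_status": status}
        for row in rows
        if counts[row[field]] > 1
    ]


rows = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": 3}]
rule = {"field": "id", "check": "is_unique", "value": ""}
duplicates = duplicate_rows(rows, rule)
```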
not_contained_in ¶
Filters a Polars DataFrame to include only rows where the specified field's value is in a given list, and adds a new column indicating the data quality status.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to filter. | required |
rule | dict | A dictionary containing the filtering rule. It should include: 'field', the column name to apply the filter on; 'check', a string representing the type of check (not used in logic); 'value', a string representation of a list of values (e.g., "[value1, value2]"). | required |
Returns:
Type | Description |
---|---|
DataFrame | A new Polars DataFrame with rows filtered based on the rule and an additional column "dq_status" indicating the applied rule. |
not_in ¶
Filters a Polars DataFrame by excluding rows where the specified rule applies.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to filter. | required |
rule | dict | A dictionary specifying the filtering rule. The structure and expected keys of this dictionary depend on the implementation of the | required |
Returns:
Type | Description |
---|---|
DataFrame | A new DataFrame with rows excluded based on the given rule. |
satisfies ¶
Evaluates a given rule against a Polars DataFrame and returns rows that do not satisfy the rule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to be evaluated. | required |
rule | dict | A dictionary containing the rule to be applied. The rule should include the following keys: 'field', the column name in the DataFrame to be checked; 'check', the type of check or condition to be applied; 'value', the value or expression to validate against. | required |
Returns:
Type | Description |
---|---|
DataFrame | A DataFrame containing rows that do not satisfy the rule, with an additional column |
Example
rule = {"field": "age", "check": ">", "value": "18"}
result = satisfies(df, rule)
summarize ¶
Summarizes quality check results by processing a DataFrame containing data quality statuses and comparing them against defined rules.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
qc_df | DataFrame | A Polars DataFrame containing a column | required |
rules | list[dict] | A list of dictionaries where each dictionary defines a rule with keys such as "column", "rule", "value", and "pass_threshold". | required |
total_rows | int | The total number of rows in the original dataset, used to calculate the pass rate. | required |
Returns:
Type | Description |
---|---|
DataFrame | A summarized DataFrame containing the following columns: |
- id: A unique identifier for each rule.
- timestamp: The timestamp when the summary was generated.
- check: A label indicating the type of check (e.g., "Quality Check").
- level: The severity level of the check (e.g., "WARNING").
- column: The column name associated with the rule.
- rule: The rule being evaluated.
- value: The specific value associated with the rule.
- rows: The total number of rows in the dataset.
- violations: The number of rows that violated the rule.
- pass_rate: The proportion of rows that passed the rule.
- pass_threshold: The threshold for passing the rule.
- status: The status of the rule evaluation ("PASS" or "FAIL").
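The relationship between rows, violations, pass_rate, pass_threshold, and status can be sketched as follows. This is an assumption-laden illustration, not the library's code: it assumes a rule passes when pass_rate is greater than or equal to pass_threshold, and `summarize_rule` is a hypothetical helper.

```python
def summarize_rule(rule, violations, total_rows):
    """Build one summary record from a violation count."""
    # pass_rate is the proportion of rows that did not violate the rule.
    pass_rate = (total_rows - violations) / total_rows if total_rows else 1.0
    # Assumption: PASS when the pass rate reaches the threshold.
    status = "PASS" if pass_rate >= rule["pass_threshold"] else "FAIL"
    return {
        "column": rule["column"],
        "rule": rule["rule"],
        "value": rule["value"],
        "rows": total_rows,
        "violations": violations,
        "pass_rate": pass_rate,
        "pass_threshold": rule["pass_threshold"],
        "status": status,
    }


rule = {"column": "age", "rule": "is_positive", "value": "", "pass_threshold": 0.9}
ok = summarize_rule(rule, violations=5, total_rows=100)
bad = summarize_rule(rule, violations=20, total_rows=100)
```

With 5 violations in 100 rows the pass rate is 0.95, which clears the 0.9 threshold; with 20 violations it drops to 0.8 and the record is marked "FAIL".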
validate ¶
Validates a Polars DataFrame against a set of rules and returns the updated DataFrame with validation statuses and a DataFrame containing the validation violations.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to validate. | required |
rules | list[dict] | A list of dictionaries representing validation rules. Each rule should contain the following keys: "check_type" (str), the type of validation to perform (e.g., "is_primary_key", "is_composite_key", "has_pattern"); "value" (optional), the value to validate against, depending on the rule type; "execute" (bool, optional), whether to execute the rule (defaults to True). | required |
Returns:
Type | Description |
---|---|
Tuple[DataFrame, DataFrame] | A tuple containing: the original DataFrame with an additional "dq_status" column indicating the validation status for each row; and a DataFrame containing rows that violated the validation rules, including details of the violations. |
Notes
- The function dynamically resolves validation functions based on the "check_type" specified in the rules.
- If a rule's "check_type" is unknown, a warning is issued, and the rule is skipped.
- The "__id" column is temporarily added to the DataFrame for internal processing and is removed in the final output.
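The dispatch behaviour in the notes above (resolve by "check_type", skip unknown checks with a warning, honour "execute") can be sketched in plain Python. Everything here is illustrative: the `CHECKS` registry, the toy `is_complete` check, the `run_rules` helper, and the use of a "field" key in each rule are assumptions, and rows are dictionaries rather than a Polars DataFrame.

```python
import warnings


def is_complete(rows, rule):
    """Toy check: rows whose field is None count as violations."""
    return [row for row in rows if row.get(rule["field"]) is None]


# Hypothetical registry mapping check_type names to check functions.
CHECKS = {"is_complete": is_complete}


def run_rules(rows, rules):
    violations = []
    for rule in rules:
        if not rule.get("execute", True):  # "execute" defaults to True
            continue
        check = CHECKS.get(rule["check_type"])
        if check is None:
            warnings.warn(f"Unknown check_type: {rule['check_type']}")
            continue  # unknown rules are skipped, not fatal
        violations.extend(check(rows, rule))
    return violations


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": None}]
rules = [
    {"check_type": "is_complete", "field": "name"},
    {"check_type": "is_complete", "field": "id", "execute": False},
    {"check_type": "no_such_check", "field": "id"},
]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    violations = run_rules(rows, rules)
```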
validate_date_format ¶
Validates the date format of a specified field in a Polars DataFrame based on a given rule.
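A format check of this kind can be sketched with the standard library. The sketch assumes the expected format is given as a `strptime` pattern such as "%Y-%m-%d" (the library may use a different token syntax), treats null values as violations, models rows as dictionaries, and uses the hypothetical helper name `bad_date_rows`.

```python
from datetime import datetime


def bad_date_rows(rows, rule):
    """Return rows whose field is None or does not parse with the format."""
    field, check, fmt = rule["field"], rule["check"], rule["fmt"]
    status = f"{field}:{check}:{fmt}"

    def matches(value):
        if value is None:
            return False
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            return False

    return [
        {**row, "dq_status": status}
        for row in rows
        if not matches(row[field])
    ]


rows = [{"day": "2024-01-31"}, {"day": "31/01/2024"}, {"day": None}]
rule = {"field": "day", "check": "validate_date_format", "fmt": "%Y-%m-%d"}
invalid = bad_date_rows(rows, rule)
```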
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Polars DataFrame to validate. | required |
rule | dict | A dictionary containing the validation rule. It should include: field (str), the name of the column to validate; check (str), the name of the validation check; fmt (str), the expected date format to validate against. | required |
Returns:
Type | Description |
---|---|
DataFrame | A new DataFrame containing only the rows where the specified field does not match the expected date format or is null. An additional column "dq_status" is added to indicate the validation status in the format "{field}:{check}:{fmt}". |
validate_schema ¶
Validates the schema of a given DataFrame against an expected schema.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | | The DataFrame whose schema needs to be validated. | required |
expected | | The expected schema, represented as a list of tuples where each tuple contains the column name and its data type. | required |
Returns:
Type | Description |
---|---|
tuple[bool, list[dict[str, Any]]] | Tuple[bool, List[Tuple[str, str]]]: A tuple containing: a boolean indicating whether the schema matches the expected schema; and a list of tuples representing the errors, where each tuple contains the column name and a description of the mismatch. |
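A schema comparison along these lines can be sketched without Polars. The sketch is hypothetical throughout: the actual schema is modelled as a name-to-type mapping, type names are plain strings, `compare_schema` is an invented helper, and errors are (column, description) tuples as the description states, although the annotated return type suggests dictionaries.

```python
def compare_schema(actual, expected):
    """Compare a {name: dtype} mapping against a list of (name, dtype)."""
    errors = []
    for name, dtype in expected:
        if name not in actual:
            errors.append((name, "missing column"))
        elif actual[name] != dtype:
            errors.append(
                (name, f"type mismatch: expected {dtype}, got {actual[name]}")
            )
    return (not errors, errors)


actual = {"id": "Int64", "name": "Utf8"}
expected = [("id", "Int64"), ("name", "Int64"), ("age", "Int64")]
matches, errors = compare_schema(actual, expected)
```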