
Module sumeh.engine.pandas_engine

This module provides a set of data quality validation functions using the Pandas library. It includes various checks for data validation, such as completeness, uniqueness, range checks, pattern matching, date validations, SQL-style custom expressions, and schema validation.

Functions:

Name Description
is_positive

Filters rows where the specified field is less than zero.

is_negative

Filters rows where the specified field is greater than or equal to zero.

is_in_millions

Retains rows where the field value is at least 1,000,000 and flags them with dq_status.

is_in_billions

Retains rows where the field value is at least 1,000,000,000 and flags them with dq_status.

is_t_minus_1

Retains rows where the date field equals yesterday (T-1) and flags them with dq_status.

is_t_minus_2

Retains rows where the date field equals two days ago (T-2) and flags them with dq_status.

is_t_minus_3

Retains rows where the date field equals three days ago (T-3) and flags them with dq_status.

is_today

Retains rows where the date field equals today and flags them with dq_status.

is_yesterday

Retains rows where the date field equals yesterday and flags them with dq_status.

is_on_weekday

Retains rows where the date field falls on a weekday (Mon-Fri) and flags them with dq_status.

is_on_weekend

Retains rows where the date field is on a weekend (Sat-Sun) and flags them with dq_status.

is_on_monday

Retains rows where the date field is on Monday and flags them with dq_status.

is_on_tuesday

Retains rows where the date field is on Tuesday and flags them with dq_status.

is_on_wednesday

Retains rows where the date field is on Wednesday and flags them with dq_status.

is_on_thursday

Retains rows where the date field is on Thursday and flags them with dq_status.

is_on_friday

Retains rows where the date field is on Friday and flags them with dq_status.

is_on_saturday

Retains rows where the date field is on Saturday and flags them with dq_status.

is_on_sunday

Retains rows where the date field is on Sunday and flags them with dq_status.

is_complete

Filters rows where the specified field is null.

is_unique

Filters rows with duplicate values in the specified field.

are_complete

Filters rows where any of the specified fields are null.

are_unique

Filters rows with duplicate combinations of the specified fields.

is_greater_than

Filters rows where the specified field is less than or equal to the given value.

is_greater_or_equal_than

Filters rows where the specified field is less than the given value.

is_less_than

Filters rows where the specified field is greater than or equal to the given value.

is_less_or_equal_than

Filters rows where the specified field is greater than the given value.

is_equal

Filters rows where the specified field is not equal to the given value.

is_equal_than

Alias for is_equal.

is_contained_in

Filters rows where the specified field is not in the given list of values.

not_contained_in

Filters rows where the specified field is in the given list of values.

is_between

Filters rows where the specified field is not within the given range.

has_pattern

Filters rows where the specified field does not match the given regex pattern.

is_legit

Filters rows where the specified field is null or contains whitespace.

has_max

Filters rows where the specified field exceeds the given maximum value.

has_min

Filters rows where the specified field is below the given minimum value.

has_std

Checks if the standard deviation of the specified field exceeds the given value.

has_mean

Checks if the mean of the specified field exceeds the given value.

has_sum

Checks if the sum of the specified field exceeds the given value.

has_cardinality

Checks if the cardinality (number of unique values) of the specified field exceeds the given value.

has_infogain

Placeholder for information gain validation (currently uses cardinality).

has_entropy

Placeholder for entropy validation (currently uses cardinality).

satisfies

Filters rows that do not satisfy the given custom expression.

validate_date_format

Filters rows where the specified field does not match the expected date format or is null.

is_future_date

Filters rows where the specified date field is after today’s date.

is_past_date

Filters rows where the specified date field is before today’s date.

is_date_between

Filters rows where the specified date field is not within the given [start,end] range.

is_date_after

Filters rows where the specified date field is before the given date.

is_date_before

Filters rows where the specified date field is after the given date.

all_date_checks

Alias for is_past_date (checks date against today).

validate

Validates a DataFrame against a list of rules and returns the original DataFrame with data quality status and a DataFrame of violations.

__build_rules_df

Converts a list of rules into a Pandas DataFrame for summarization.

summarize

Summarizes the results of data quality checks, including pass rates and statuses.

validate_schema

Validates the schema of a DataFrame against an expected schema and returns a boolean result and a list of errors.

__build_rules_df(rules)

Builds a pandas DataFrame from a list of rule dictionaries.

Parameters:

Name Type Description Default
rules List[dict]

A list of dictionaries, each representing one rule. A rule dictionary may contain the following keys:
  • "field" (str or list): The column(s) the rule applies to.
  • "check_type" (str): The type of check or rule to apply.
  • "value" (optional): The value associated with the rule.
  • "threshold" (optional): A numeric threshold for the rule. Defaults to 1.0 if not provided or invalid.
  • "execute" (optional): A boolean indicating whether the rule should be executed. Defaults to True.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing the processed rules with the following columns:
  • "column": The column(s) the rule applies to, as a comma-separated string if multiple.
  • "rule": The type of check or rule.
  • "value": The value associated with the rule, or an empty string if not provided.
  • "pass_threshold": The numeric threshold for the rule.

Notes
  • Rules with "execute" set to False are skipped.
  • Duplicate rows based on "column", "rule", and "value" are removed from the resulting DataFrame.
Source code in sumeh/engine/pandas_engine.py
def __build_rules_df(rules: List[dict]) -> pd.DataFrame:
    """
    Builds a pandas DataFrame from a list of rule dictionaries.

    Args:
        rules (List[dict]): A list of dictionaries where each dictionary represents a rule.
            Each rule dictionary may contain the following keys:
                - "field" (str or list): The column(s) the rule applies to.
                - "check_type" (str): The type of check or rule to apply.
                - "value" (optional): The value associated with the rule.
                - "threshold" (optional): A numeric threshold for the rule. Defaults to 1.0 if not provided or invalid.
                - "execute" (optional): A boolean indicating whether the rule should be executed. Defaults to True.

    Returns:
        pd.DataFrame: A DataFrame containing the processed rules with the following columns:
            - "column": The column(s) the rule applies to, as a comma-separated string if multiple.
            - "rule": The type of check or rule.
            - "value": The value associated with the rule, or an empty string if not provided.
            - "pass_threshold": The numeric threshold for the rule.

    Notes:
        - Rules with "execute" set to False are skipped.
        - Duplicate rows based on "column", "rule", and "value" are removed from the resulting DataFrame.
    """
    rows = []
    for r in rules:
        if not r.get("execute", True):
            continue

        col = ",".join(r["field"]) if isinstance(r["field"], list) else r["field"]

        thr_raw = r.get("threshold")
        try:
            thr = float(thr_raw) if thr_raw is not None else 1.0
        except (TypeError, ValueError):
            thr = 1.0

        val = r.get("value")
        rows.append(
            {
                "column": col,
                "rule": r["check_type"],
                "value": val if val is not None else "",
                "pass_threshold": thr,
            }
        )

    df_rules = pd.DataFrame(rows)
    if not df_rules.empty:
        df_rules = df_rules.drop_duplicates(subset=["column", "rule", "value"])
    return df_rules
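
As a rough illustration of the threshold fallback and de-duplication described above, the sketch below re-creates the same steps in a standalone function. The name `build_rules_df` and the sample rules are hypothetical, chosen only for this example; this is not the library's public API.

```python
import pandas as pd


def build_rules_df(rules):
    """Standalone sketch mirroring the steps of __build_rules_df (hypothetical name)."""
    rows = []
    for r in rules:
        if not r.get("execute", True):
            continue  # rules switched off are skipped
        field = r["field"]
        col = ",".join(field) if isinstance(field, list) else field
        try:
            thr = float(r["threshold"]) if r.get("threshold") is not None else 1.0
        except (TypeError, ValueError):
            thr = 1.0  # thresholds that cannot be parsed fall back to 1.0
        val = r.get("value")
        rows.append(
            {
                "column": col,
                "rule": r["check_type"],
                "value": "" if val is None else val,
                "pass_threshold": thr,
            }
        )
    df = pd.DataFrame(rows)
    return df if df.empty else df.drop_duplicates(subset=["column", "rule", "value"])


# Hypothetical rules: one valid threshold, one invalid, one duplicate, one disabled.
sample_rules = [
    {"field": "id", "check_type": "is_unique", "threshold": "0.95"},
    {"field": ["first", "last"], "check_type": "are_complete", "threshold": "n/a"},
    {"field": "id", "check_type": "is_unique"},
    {"field": "age", "check_type": "is_positive", "execute": False},
]
rules_df = build_rules_df(sample_rules)
print(rules_df)
```

Only two rows survive here: the disabled rule is skipped and the second `is_unique` rule on `id` is dropped as a duplicate, so the threshold of the first occurrence is the one that is kept.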

__compare_schemas(actual, expected)

Compare two lists of schema definitions and identify discrepancies.

Parameters:

Name Type Description Default
actual List[SchemaDef]

The list of actual schema definitions.

required
expected List[SchemaDef]

The list of expected schema definitions.

required

Returns:

Type Description
Tuple[bool, List[Tuple[str, str]]]

A tuple whose first element is True if the schemas match and False otherwise, and whose second element is a list of tuples describing the discrepancies. Each tuple contains:
  • The field name (str).
  • A description of the discrepancy (str), such as "missing", "type mismatch", "nullable but expected non-nullable", or "extra column".

Notes
  • A field is considered "missing" if it exists in the expected schema but not in the actual schema.
  • A "type mismatch" occurs if the data type of a field in the actual schema does not match the expected data type.
  • A field is considered "nullable but expected non-nullable" if it is nullable in the actual schema but not nullable in the expected schema.
  • An "extra column" is a field that exists in the actual schema but not in the expected schema.
Source code in sumeh/services/utils.py
def __compare_schemas(
    actual: List[SchemaDef],
    expected: List[SchemaDef],
) -> Tuple[bool, List[Tuple[str, str]]]:
    """
    Compare two lists of schema definitions and identify discrepancies.

    Args:
        actual (List[SchemaDef]): The list of actual schema definitions.
        expected (List[SchemaDef]): The list of expected schema definitions.

    Returns:
        Tuple[bool, List[Tuple[str, str]]]: A tuple where the first element is a boolean indicating
        whether the schemas match (True if they match, False otherwise), and the second element
        is a list of tuples describing the discrepancies. Each tuple contains:
            - The field name (str).
            - A description of the discrepancy (str), such as "missing", "type mismatch",
              "nullable but expected non-nullable", or "extra column".

    Notes:
        - A field is considered "missing" if it exists in the expected schema but not in the actual schema.
        - A "type mismatch" occurs if the data type of a field in the actual schema does not match
          the expected data type.
        - A field is considered "nullable but expected non-nullable" if it is nullable in the actual
          schema but not nullable in the expected schema.
        - An "extra column" is a field that exists in the actual schema but not in the expected schema.
    """

    exp_map = {c["field"]: c for c in expected}
    act_map = {c["field"]: c for c in actual}

    erros: List[Tuple[str, str]] = []

    for fld, exp in exp_map.items():
        if fld not in act_map:
            erros.append((fld, "missing"))
            continue
        act = act_map[fld]
        if act["data_type"] != exp["data_type"]:
            erros.append(
                (
                    fld,
                    f"type mismatch (got {act['data_type']}, expected {exp['data_type']})",
                )
            )

        if act["nullable"] and not exp["nullable"]:
            erros.append((fld, "nullable but expected non-nullable"))

        if exp.get("max_length") is not None:
            pass

    # 2. extra fields (optional)
    extras = set(act_map) - set(exp_map)
    for fld in extras:
        erros.append((fld, "extra column"))

    return len(erros) == 0, erros
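
To make the four kinds of discrepancy concrete, here is a small self-contained sketch that follows the same comparison rules. The function name `compare_schemas` and the example schemas are made up for illustration; the extra columns are sorted here only so that the output order is deterministic.

```python
from typing import Dict, List, Tuple


def compare_schemas(
    actual: List[Dict], expected: List[Dict]
) -> Tuple[bool, List[Tuple[str, str]]]:
    """Standalone sketch of the comparison rules (hypothetical name)."""
    exp_map = {c["field"]: c for c in expected}
    act_map = {c["field"]: c for c in actual}
    errors: List[Tuple[str, str]] = []

    for name, exp in exp_map.items():
        if name not in act_map:
            errors.append((name, "missing"))
            continue
        act = act_map[name]
        if act["data_type"] != exp["data_type"]:
            errors.append(
                (name, f"type mismatch (got {act['data_type']}, expected {exp['data_type']})")
            )
        if act["nullable"] and not exp["nullable"]:
            errors.append((name, "nullable but expected non-nullable"))

    # Fields present in the data but absent from the expected schema.
    for name in sorted(set(act_map) - set(exp_map)):
        errors.append((name, "extra column"))

    return len(errors) == 0, errors


expected = [
    {"field": "id", "data_type": "int64", "nullable": False},
    {"field": "name", "data_type": "object", "nullable": True},
    {"field": "created_at", "data_type": "datetime64[ns]", "nullable": True},
]
actual = [
    {"field": "id", "data_type": "float64", "nullable": True},
    {"field": "name", "data_type": "object", "nullable": True},
    {"field": "comment", "data_type": "object", "nullable": True},
]

ok, problems = compare_schemas(actual, expected)
for field, message in problems:
    print(f"{field}: {message}")
```

Note that a single field can produce more than one entry: `id` is reported both for its type and for its nullability.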

__convert_value(value)

Converts the provided value to the appropriate type (date, float, or int).

Depending on the format of the input value, it will be converted to a datetime object, a floating-point number (float), or an integer (int).

Parameters:

Name Type Description Default
value str

The value to be converted, represented as a string.

required

Returns:

Type Description

Union[datetime, float, int]: The converted value, which can be a datetime object, float, or int.

Raises:

Type Description
ValueError

If the value does not match an expected format.

Source code in sumeh/services/utils.py
def __convert_value(value):
    """
    Converts the provided value to the appropriate type (date, float, or int).

    Depending on the format of the input value, it will be converted to a datetime object,
    a floating-point number (float), or an integer (int).

    Args:
        value (str): The value to be converted, represented as a string.

    Returns:
        Union[datetime, float, int]: The converted value, which can be a datetime object, float, or int.

    Raises:
        ValueError: If the value does not match an expected format.
    """
    from datetime import datetime

    value = value.strip()
    try:
        if "-" in value:
            return datetime.strptime(value, "%Y-%m-%d")
        else:
            return datetime.strptime(value, "%d/%m/%Y")
    except ValueError:
        if "." in value:
            return float(value)
        return int(value)
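
The conversion order (date first, then float, then int) is easier to see with a few inputs. The sketch below is a standalone copy under the hypothetical name `convert_value`; the sample strings are illustrative only.

```python
from datetime import datetime


def convert_value(value: str):
    """Standalone copy of the conversion order, for illustration (hypothetical name)."""
    value = value.strip()
    try:
        # A "-" selects the ISO-style format, anything else the day-first format.
        fmt = "%Y-%m-%d" if "-" in value else "%d/%m/%Y"
        return datetime.strptime(value, fmt)
    except ValueError:
        # Not a date in either format: fall back to a number.
        return float(value) if "." in value else int(value)


examples = ["2024-01-31", "31/01/2024", " 3.14 ", "42", "-5"]
converted = {text: convert_value(text) for text in examples}
for text, result in converted.items():
    print(repr(text), "->", repr(result))

# Anything that is neither a date nor a number raises ValueError.
try:
    convert_value("abc")
    raised = False
except ValueError:
    raised = True
print("non-numeric text raises ValueError:", raised)
```

A negative number such as "-5" is first tried as a date because it contains "-", and only then parsed as an integer. Since only a "." selects the float branch, a string in exponent notation such as "1e3" would end up in `int()` and raise ValueError.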

__extract_params(rule)

Source code in sumeh/services/utils.py
def __extract_params(rule: dict) -> tuple:
    rule_name = rule["check_type"]
    field = rule["field"]
    raw_value = rule.get("value")
    if isinstance(raw_value, str) and raw_value not in (None, "", "NULL"):
        try:
            value = __convert_value(raw_value)
        except ValueError:
            value = raw_value
    else:
        value = raw_value
    value = value if value not in (None, "", "NULL") else ""
    return field, rule_name, value
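
This helper has no docstring, so a short sketch of its observable behaviour may help. The version below is a standalone approximation under the hypothetical names `extract_params` and `convert_value`; the rule dictionaries are invented for the example.

```python
from datetime import datetime


def convert_value(value: str):
    """Date first, then float, then int (hypothetical stand-in for __convert_value)."""
    value = value.strip()
    try:
        fmt = "%Y-%m-%d" if "-" in value else "%d/%m/%Y"
        return datetime.strptime(value, fmt)
    except ValueError:
        return float(value) if "." in value else int(value)


def extract_params(rule: dict) -> tuple:
    """Return (field, check_type, value) with the value normalised."""
    raw = rule.get("value")
    if isinstance(raw, str) and raw not in ("", "NULL"):
        try:
            value = convert_value(raw)
        except ValueError:
            value = raw  # strings that are not dates or numbers are kept as they are
    else:
        value = raw
    # None, "" and "NULL" all collapse to an empty string.
    value = "" if value in (None, "", "NULL") else value
    return rule["field"], rule["check_type"], value


numeric = extract_params({"field": "price", "check_type": "has_max", "value": "100.5"})
dated = extract_params({"field": "dt", "check_type": "is_date_after", "value": "2024-01-31"})
textual = extract_params({"field": "email", "check_type": "has_pattern", "value": "^\\S+@\\S+$"})
null_like = extract_params({"field": "id", "check_type": "is_unique", "value": "NULL"})
absent = extract_params({"field": "id", "check_type": "is_unique"})

for result in (numeric, dated, textual, null_like, absent):
    print(result)
```

The order of the returned tuple is field, check name, value, which is why callers in this module unpack it as `field, check, value`.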

__pandas_schema_to_list(df, expected)

Source code in sumeh/engine/pandas_engine.py
def __pandas_schema_to_list(df, expected) -> Tuple[bool, List[Tuple[str, str]]]:
    actual = [
        {
            "field": c,
            "data_type": str(dtype).lower(),
            "nullable": True,
            "max_length": None,
        }
        for c, dtype in df.dtypes.items()
    ]
    return __compare_schemas(actual, expected)
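
A quick way to see what the `actual` list looks like is to build it for a small DataFrame. The sketch below repeats only the list comprehension from the source; the DataFrame and its column names are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "price": [9.5, 12.0, 7.25]})

# Same shape as the `actual` list built by __pandas_schema_to_list.
actual_schema = [
    {
        "field": col,
        "data_type": str(dtype).lower(),  # e.g. "int64", "float64"
        "nullable": True,                 # hard-coded in the source
        "max_length": None,
    }
    for col, dtype in df.dtypes.items()
]

for entry in actual_schema:
    print(entry)
```

Because "nullable" is always True on the pandas side, the comparison shown in `__compare_schemas` would, as far as the code on this page goes, report "nullable but expected non-nullable" for every expected field declared as non-nullable.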

__transform_date_format_in_pattern(date_format)

Source code in sumeh/services/utils.py
def __transform_date_format_in_pattern(date_format):
    date_patterns = {
        "DD": "(0[1-9]|[12][0-9]|3[01])",
        "MM": "(0[1-9]|1[012])",
        "YYYY": "(19|20)\\d\\d",
        "YY": "\\d\\d",
        " ": "\\s",
        ".": "\\.",
    }

    date_pattern = date_format
    for single_format, pattern in date_patterns.items():
        date_pattern = date_pattern.replace(single_format, pattern)

    return date_pattern
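
Since this helper is undocumented, the following sketch shows what the substitution produces for two common formats. It is a standalone copy under the hypothetical name `date_format_to_pattern`, and `re.fullmatch` is used here because the generated pattern carries no anchors of its own.

```python
import re


def date_format_to_pattern(date_format: str) -> str:
    """Standalone copy of the token substitution (hypothetical name)."""
    tokens = {
        "DD": "(0[1-9]|[12][0-9]|3[01])",
        "MM": "(0[1-9]|1[012])",
        "YYYY": "(19|20)\\d\\d",
        "YY": "\\d\\d",
        " ": "\\s",
        ".": "\\.",
    }
    pattern = date_format
    # Dicts keep insertion order, so "YYYY" is replaced before "YY".
    for token, regex in tokens.items():
        pattern = pattern.replace(token, regex)
    return pattern


slash_pattern = date_format_to_pattern("DD/MM/YYYY")
dot_pattern = date_format_to_pattern("DD.MM.YYYY")

print(slash_pattern)
print(bool(re.fullmatch(slash_pattern, "31/12/2024")))  # valid day and month
print(bool(re.fullmatch(slash_pattern, "31/13/2024")))  # month 13 is rejected
print(bool(re.fullmatch(dot_pattern, "01.02.1999")))    # literal dots are escaped
```

The pattern checks the shape of the string rather than the calendar: "31/02/2024" still matches, and four-digit years are limited to those starting with 19 or 20.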

_day_of_week(df, rule, dow)

Returns the rows of a DataFrame whose datetime field does not fall on the given day of the week. These are the rows that violate a day-of-week rule.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing a datetime field.

required
rule dict

A dictionary containing rule parameters. The function expects this to be parsed by __extract_params.

required
dow int

The day of the week to filter by (0=Monday, 6=Sunday).

required

Returns:

Type Description
DataFrame

pd.DataFrame: A copy of the rows whose day of the week differs from dow. An additional column, "dq_status", is added to indicate the rule applied.

Source code in sumeh/engine/pandas_engine.py
def _day_of_week(df: pd.DataFrame, rule: dict, dow: int) -> pd.DataFrame:
    """
    Returns the rows of a DataFrame whose datetime field does not fall on the given day of the week (the rows that violate the rule).

    Args:
        df (pd.DataFrame): The input DataFrame containing a datetime field.
        rule (dict): A dictionary containing rule parameters. The function expects this to be parsed by `__extract_params`.
        dow (int): The day of the week to filter by (0=Monday, 6=Sunday).

    Returns:
        pd.DataFrame: A copy of the rows whose day of the week differs from `dow`.
                      An additional column, "dq_status", is added to indicate the rule applied.
    """
    field, check, value = __extract_params(rule)
    mask = df[field].dt.dayofweek != dow
    out = df[mask].copy()
    out["dq_status"] = f"{field}:{check}:{value}"
    return out
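
The mask in the source uses `!=`, so the rows that come back are the ones that are not on the requested day. A minimal sketch of that behaviour, with an invented `order_date` column and an assumed check label `is_on_monday`:

```python
import pandas as pd

df = pd.DataFrame(
    {"order_date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-06"])}
)

monday = 0  # pandas convention: 0=Monday ... 6=Sunday
weekdays = df["order_date"].dt.dayofweek

mask = weekdays != monday  # True for rows that are NOT on Monday
flagged = df[mask].copy()
flagged["dq_status"] = "order_date:is_on_monday:"  # field:check:value, with an empty value

print(weekdays.tolist())
print(flagged)
```

Here 2024-01-01 is a Monday and is left out of the result, while the Tuesday and Saturday rows are returned with a `dq_status` label.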

all_date_checks(df, rule)

Alias for is_past_date: the call is delegated unchanged to is_past_date(df, rule), and no other date check is applied.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to be validated.

required
rule dict

A dictionary specifying the validation rules to be applied.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame with the results of the date validation checks.

Source code in sumeh/engine/pandas_engine.py
def all_date_checks(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Alias for is_past_date: the call is delegated unchanged to is_past_date(df, rule), and no other date check is applied.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to be validated.
        rule (dict): A dictionary specifying the validation rules to be applied.

    Returns:
        pd.DataFrame: A DataFrame with the results of the date validation checks.
    """
    return is_past_date(df, rule)

are_complete(df, rule)

Checks for completeness of specified fields in a DataFrame based on a given rule.

This function identifies rows in the DataFrame where any of the specified fields contain missing values (NaN). It returns a DataFrame containing only the rows that violate the completeness rule, along with an additional column dq_status that describes the rule violation.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to check for completeness.

required
rule dict

A dictionary containing the rule parameters, read through __extract_params. It is expected to include the following keys:
  • "field": A list of column names to check for completeness.
  • "check_type": A string describing the type of check (e.g., "completeness").
  • "value": A value associated with the rule (e.g., a threshold or description).

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing rows that violate the completeness rule. The returned DataFrame includes all original columns and an additional column dq_status that describes the rule violation in the format "fields:check:value".

Source code in sumeh/engine/pandas_engine.py
def are_complete(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Checks for completeness of specified fields in a DataFrame based on a given rule.

    This function identifies rows in the DataFrame where any of the specified fields
    contain missing values (NaN). It returns a DataFrame containing only the rows
    that violate the completeness rule, along with an additional column `dq_status`
    that describes the rule violation.

    Args:
        df (pd.DataFrame): The input DataFrame to check for completeness.
        rule (dict): A dictionary containing the rule parameters. It is expected to
            include the following keys:
            - field: A list of column names to check for completeness.
            - check_type: A string describing the type of check (e.g., "completeness").
            - value: A value associated with the rule (e.g., a threshold or description).

    Returns:
        pd.DataFrame: A DataFrame containing rows that violate the completeness rule.
        The returned DataFrame includes all original columns and an additional column
        `dq_status` that describes the rule violation in the format "fields:check:value".
    """
    fields, check, value = __extract_params(rule)
    mask = df[fields].isna().any(axis=1)
    viol = df[mask].copy()
    viol["dq_status"] = f"{fields}:{check}:{value}"
    return viol
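
The row-wise null test can be tried on a small frame. In this sketch the column names and the check label `are_complete` are assumptions made for the example; the masking line is the same expression used in the source.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "email": ["a@x.io", None, "c@x.io", None],
        "phone": ["111", "222", None, None],
    }
)
fields = ["email", "phone"]

mask = df[fields].isna().any(axis=1)  # True where at least one listed field is missing
violations = df[mask].copy()
violations["dq_status"] = f"{fields}:are_complete:"

print(violations)
```

Because the list of fields is interpolated directly into the f-string, the label starts with the Python representation of the list, here `['email', 'phone']`.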

are_unique(df, rule)

Checks for duplicate rows in the specified fields of a DataFrame based on a given rule.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to check for uniqueness.

required
rule dict

A dictionary containing the rule parameters, read through __extract_params. It should include:
  • "field": A list of column names to check for uniqueness.
  • "check_type": A string representing the type of check (e.g., "unique").
  • "value": A value associated with the rule (e.g., a description or identifier).

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing the rows that violate the uniqueness rule. An additional column 'dq_status' is added to indicate the rule that was violated in the format "{fields}:{check}:{value}".

Source code in sumeh/engine/pandas_engine.py
def are_unique(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Checks for duplicate rows in the specified fields of a DataFrame based on a given rule.

    Args:
        df (pd.DataFrame): The input DataFrame to check for uniqueness.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - field: A list of column names to check for uniqueness.
            - check_type: A string representing the type of check (e.g., "unique").
            - value: A value associated with the rule (e.g., a description or identifier).

    Returns:
        pd.DataFrame: A DataFrame containing the rows that violate the uniqueness rule.
                      An additional column 'dq_status' is added to indicate the rule
                      that was violated in the format "{fields}:{check}:{value}".
    """
    fields, check, value = __extract_params(rule)
    combo = df[fields].astype(str).agg("|".join, axis=1)
    dup = combo.duplicated(keep=False)
    viol = df[dup].copy()
    viol["dq_status"] = f"{fields}:{check}:{value}"
    return viol
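
The source builds one string key per row by joining the fields with "|". The sketch below reproduces that step on invented data and, for comparison only, shows pandas' own `DataFrame.duplicated(subset=...)`, which is a different technique from the one in the source.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "first": ["ann", "ann", "bob", "a|b", "a"],
        "last": ["lee", "lee", "ray", "c", "b|c"],
    }
)
fields = ["first", "last"]

# Approach shown in the source: join the fields into one string key per row.
combo = df[fields].astype(str).agg("|".join, axis=1)
dup_by_join = combo.duplicated(keep=False)

# Alternative (not what the source does): compare the columns directly.
dup_by_columns = df.duplicated(subset=fields, keep=False)

print(combo.tolist())
print(dup_by_join.tolist())
print(dup_by_columns.tolist())
```

The last two rows are distinct combinations, yet both join to "a|b|c", so the string-key approach flags them as duplicates when values themselves contain the separator. With `keep=False`, every member of a duplicate group is flagged, not only the later occurrences.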

has_cardinality(df, rule)

Checks if the cardinality (number of unique values) of a specified field in the DataFrame exceeds a given value and returns a modified DataFrame if the condition is met.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to check.

required
rule dict

A dictionary containing the rule parameters, read through __extract_params. It should include:
  • 'field': The column name in the DataFrame to check.
  • 'check_type': The type of check being performed (e.g., 'cardinality').
  • 'value': The threshold value for the cardinality.

required

Returns:

Type Description
DataFrame

pd.DataFrame: - If the cardinality of the specified field exceeds the given value, a copy of the DataFrame is returned with an additional column 'dq_status' indicating the field, check, and value. - If the cardinality does not exceed the value, an empty DataFrame is returned.

Source code in sumeh/engine/pandas_engine.py
def has_cardinality(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Checks if the cardinality (number of unique values) of a specified field in the DataFrame
    exceeds a given value and returns a modified DataFrame if the condition is met.

    Parameters:
        df (pd.DataFrame): The input DataFrame to check.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - 'field': The column name in the DataFrame to check.
            - 'check_type': The type of check being performed (e.g., 'cardinality').
            - 'value': The threshold value for the cardinality.

    Returns:
        pd.DataFrame:
            - If the cardinality of the specified field exceeds the given value,
              a copy of the DataFrame is returned with an additional column 'dq_status'
              indicating the field, check, and value.
            - If the cardinality does not exceed the value, an empty DataFrame is returned.
    """
    field, check, value = __extract_params(rule)
    card = df[field].nunique(dropna=True) or 0
    if card > value:
        out = df.copy()
        out["dq_status"] = f"{field}:{check}:{value}"
        return out
    return df.iloc[0:0].copy()
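
Unlike the row-level checks, this one is all-or-nothing: either every row is returned with a label or none is. A minimal sketch of that behaviour, using an invented `country` column and a hypothetical limit of 2 distinct values:

```python
import pandas as pd

df = pd.DataFrame({"country": ["BR", "US", "BR", None, "DE"]})

max_distinct = 2
cardinality = df["country"].nunique(dropna=True)  # nulls are not counted

if cardinality > max_distinct:
    result = df.copy()  # every row is flagged, not only some of them
    result["dq_status"] = f"country:has_cardinality:{max_distinct}"
else:
    result = df.iloc[0:0].copy()  # empty frame with the same columns

print(cardinality, len(result))
```

`df.iloc[0:0]` keeps the column layout of the input, so callers can concatenate results without special-casing the "no violation" branch.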

has_entropy(df, rule)

Checks if the given DataFrame satisfies a specific rule related to entropy.

This function is a wrapper around the has_cardinality function, delegating the rule-checking logic to it.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be evaluated.

required
rule dict

A dictionary containing the rule to be applied.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The resulting DataFrame after applying the rule.

Source code in sumeh/engine/pandas_engine.py
def has_entropy(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Checks if the given DataFrame satisfies a specific rule related to entropy.

    This function is a wrapper around the `has_cardinality` function, delegating
    the rule-checking logic to it.

    Args:
        df (pd.DataFrame): The input DataFrame to be evaluated.
        rule (dict): A dictionary containing the rule to be applied.

    Returns:
        pd.DataFrame: The resulting DataFrame after applying the rule.
    """
    return has_cardinality(df, rule)

has_infogain(df, rule)

Checks if the given DataFrame satisfies the information gain criteria defined by the provided rule. This function internally delegates the operation to the has_cardinality function.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be evaluated.

required
rule dict

A dictionary defining the rule for information gain.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The resulting DataFrame after applying the rule.

Source code in sumeh/engine/pandas_engine.py
def has_infogain(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Checks if the given DataFrame satisfies the information gain criteria
    defined by the provided rule. This function internally delegates the
    operation to the `has_cardinality` function.

    Args:
        df (pd.DataFrame): The input DataFrame to be evaluated.
        rule (dict): A dictionary defining the rule for information gain.

    Returns:
        pd.DataFrame: The resulting DataFrame after applying the rule.
    """
    return has_cardinality(df, rule)

has_max(df, rule)

Identifies rows in a DataFrame where the value in a specified field exceeds a given maximum value.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be checked.

required
rule dict

A dictionary containing the rule parameters, read through __extract_params. It should include:
  • 'field' (str): The column name to check.
  • 'check_type' (str): The type of check being performed (e.g., 'max').
  • 'value' (numeric): The maximum allowable value for the specified field.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing rows that violate the rule, with an additional column 'dq_status' indicating the rule violation in the format "field:check:value".

Source code in sumeh/engine/pandas_engine.py
def has_max(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Identifies rows in a DataFrame where the value in a specified field exceeds a given maximum value.

    Args:
        df (pd.DataFrame): The input DataFrame to be checked.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - 'field' (str): The column name to check.
            - 'check_type' (str): The type of check being performed (e.g., 'max').
            - 'value' (numeric): The maximum allowable value for the specified field.

    Returns:
        pd.DataFrame: A DataFrame containing rows that violate the rule, with an additional column
        'dq_status' indicating the rule violation in the format "field:check:value".
    """
    field, check, value = __extract_params(rule)
    viol = df[df[field] > value].copy()
    viol["dq_status"] = f"{field}:{check}:{value}"
    return viol
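
A short sketch of the row filter, assuming a rule dictionary with the keys "field", "check_type" and "value" (the keys read by `__extract_params`). The data and the check label `has_max` are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"sku": ["a", "b", "c"], "price": [10.0, 150.0, 100.0]})

rule = {"field": "price", "check_type": "has_max", "value": 100}
field, check, value = rule["field"], rule["check_type"], rule["value"]

violations = df[df[field] > value].copy()  # strictly greater: 100.0 itself passes
violations["dq_status"] = f"{field}:{check}:{value}"

print(violations)
```

The comparison is strict, so a value equal to the maximum is not reported; `has_min` mirrors this with `<`.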

has_mean(df, rule)

Checks if the mean of a specified column in a DataFrame exceeds a given threshold.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to evaluate.

required
rule dict

A dictionary containing the rule parameters, read through __extract_params. It should include:
  • 'field' (str): The column name to calculate the mean for.
  • 'check_type' (str): The name of the check. The comparison itself is always "mean greater than value".
  • 'value' (float): The threshold value to compare the mean against.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A copy of the input DataFrame with an additional column 'dq_status' if the condition is met. The 'dq_status' column contains a string in the format "{field}:{check}:{value}". If the condition is not met, an empty DataFrame is returned.

Source code in sumeh/engine/pandas_engine.py
def has_mean(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Checks if the mean of a specified column in a DataFrame satisfies a given condition.

    Parameters:
        df (pd.DataFrame): The input DataFrame to evaluate.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - 'field' (str): The column name to calculate the mean for.
            - 'check_type' (str): The name of the check; the comparison is always "mean > value".
            - 'value' (float): The threshold value to compare the mean against.

    Returns:
        pd.DataFrame: A copy of the input DataFrame with an additional column 'dq_status'
        if the condition is met. The 'dq_status' column contains a string in the format
        "{field}:{check}:{value}". If the condition is not met, an empty DataFrame is returned.
    """
    field, check, value = __extract_params(rule)
    mean_val = df[field].mean(skipna=True) or 0.0
    if mean_val > value:
        out = df.copy()
        out["dq_status"] = f"{field}:{check}:{value}"
        return out
    return df.iloc[0:0].copy()

has_min(df, rule)

Filters a DataFrame to identify rows where a specified field's value is less than a given threshold.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be checked.

required
rule dict

A dictionary containing the rule parameters, read through __extract_params. It should include:
  • 'field': The column name in the DataFrame to be checked.
  • 'check_type': The type of check being performed (e.g., 'min').
  • 'value': The threshold value for the check.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A new DataFrame containing rows that violate the rule, with an additional column 'dq_status' indicating the field, check type, and threshold value.

Source code in sumeh/engine/pandas_engine.py
def has_min(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters a DataFrame to identify rows where a specified field's value is less than a given threshold.

    Args:
        df (pd.DataFrame): The input DataFrame to be checked.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - 'field': The column name in the DataFrame to be checked.
            - 'check': The type of check being performed (e.g., 'min').
            - 'value': The threshold value for the check.

    Returns:
        pd.DataFrame: A new DataFrame containing rows that violate the rule, with an additional
        column 'dq_status' indicating the field, check type, and threshold value.
    """
    field, check, value = __extract_params(rule)
    viol = df[df[field] < value].copy()
    viol["dq_status"] = f"{field}:{check}:{value}"
    return viol
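
A minimal sketch of the same row-level filter, using made-up data and a hard-coded rule (column `qty`, threshold 1), shows how missing values behave: a comparison with NaN is False, so rows with a missing value are not reported by this check.

```python
import pandas as pd

# Illustrative data; the None becomes NaN in a float column
df = pd.DataFrame({"qty": [5, 0, -3, None]})

# Rows strictly below the threshold are the violations
viol = df[df["qty"] < 1].copy()
viol["dq_status"] = "qty:has_min:1"
```

A separate completeness check is needed to catch the missing row.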

has_pattern(df, rule)

Checks if the values in a specified column of a DataFrame match a given pattern.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to check.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field': The column name in the DataFrame to check. - 'check': A descriptive label for the check being performed. - 'pattern': The regex pattern to match against the column values.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing rows that do not match the pattern. An additional column 'dq_status' is added to indicate the field, check, and pattern that caused the violation.

Source code in sumeh/engine/pandas_engine.py
def has_pattern(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Checks if the values in a specified column of a DataFrame match a given pattern.

    Args:
        df (pd.DataFrame): The input DataFrame to check.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - 'field': The column name in the DataFrame to check.
            - 'check': A descriptive label for the check being performed.
            - 'pattern': The regex pattern to match against the column values.

    Returns:
        pd.DataFrame: A DataFrame containing rows that do not match the pattern.
                      An additional column 'dq_status' is added to indicate the
                      field, check, and pattern that caused the violation.
    """
    field, check, pattern = __extract_params(rule)
    viol = df[~df[field].astype(str).str.contains(pattern, na=False)].copy()
    viol["dq_status"] = f"{field}:{check}:{pattern}"
    return viol
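
Because the check relies on `Series.str.contains`, the pattern is searched for anywhere in the value rather than matched against the whole value. The sketch below, on made-up data, contrasts an unanchored pattern with one anchored by `^` and `$`.

```python
import pandas as pd

df = pd.DataFrame({"code": ["AB12", "xxAB12xx", "12AB", None]})

# Unanchored: the pattern may match anywhere inside the value
unanchored = df[~df["code"].astype(str).str.contains(r"[A-Z]{2}\d{2}", na=False)]

# Anchored with ^ and $: the whole value has to match
anchored = df[~df["code"].astype(str).str.contains(r"^[A-Z]{2}\d{2}$", na=False)]
```

The value "xxAB12xx" passes the unanchored check but is reported by the anchored one. The missing value is reported in both cases.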

has_std(df, rule)

Checks if the standard deviation of a specified field in the DataFrame exceeds a given value.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to evaluate.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field': The column name in the DataFrame to calculate the standard deviation for. - 'check': A string representing the type of check (not used in the logic but included in the output). - 'value': A numeric threshold to compare the standard deviation against.

required

Returns:

Type Description
DataFrame

pd.DataFrame: - If the standard deviation of the specified field exceeds the given value, returns a copy of the DataFrame with an additional column 'dq_status' indicating the rule details. - If the standard deviation does not exceed the value, returns an empty DataFrame with the same structure as the input.

Source code in sumeh/engine/pandas_engine.py
def has_std(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Checks if the standard deviation of a specified field in the DataFrame exceeds a given value.

    Parameters:
        df (pd.DataFrame): The input DataFrame to evaluate.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - 'field': The column name in the DataFrame to calculate the standard deviation for.
            - 'check': A string representing the type of check (not used in the logic but included in the output).
            - 'value': A numeric threshold to compare the standard deviation against.

    Returns:
        pd.DataFrame:
            - If the standard deviation of the specified field exceeds the given value,
              returns a copy of the DataFrame with an additional column 'dq_status' indicating the rule details.
            - If the standard deviation does not exceed the value, returns an empty DataFrame with the same structure as the input.
    """
    field, check, value = __extract_params(rule)
    std_val = df[field].std(skipna=True) or 0.0
    if std_val > value:
        out = df.copy()
        out["dq_status"] = f"{field}:{check}:{value}"
        return out
    return df.iloc[0:0].copy()
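
One detail worth keeping in mind when choosing the threshold: `Series.std` computes the sample standard deviation (`ddof=1`) by default. The sketch below uses made-up data and an arbitrary threshold to show that the sample and population values can fall on different sides of it.

```python
import pandas as pd

df = pd.DataFrame({"x": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})
threshold = 2.1

# Series.std defaults to ddof=1 (sample standard deviation), as used by the check
sample_std = df["x"].std(skipna=True)
# Population standard deviation (ddof=0), shown only for comparison
population_std = df["x"].std(ddof=0)

flagged_with_sample = sample_std > threshold
flagged_with_population = population_std > threshold
```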

has_sum(df, rule)

Checks if the sum of values in a specified column of a DataFrame exceeds a given threshold.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to be checked.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field' (str): The column name to calculate the sum for. - 'check' (str): A descriptive label for the check (used in the output). - 'value' (float): The threshold value to compare the sum against.

required

Returns:

Type Description
DataFrame

pd.DataFrame: - If the sum of the specified column exceeds the threshold, returns a copy of the input DataFrame with an additional column 'dq_status' indicating the rule that was applied. - If the sum does not exceed the threshold, returns an empty DataFrame with the same structure as the input.

Source code in sumeh/engine/pandas_engine.py
def has_sum(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Checks if the sum of values in a specified column of a DataFrame exceeds a given threshold.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to be checked.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - 'field' (str): The column name to calculate the sum for.
            - 'check' (str): A descriptive label for the check (used in the output).
            - 'value' (float): The threshold value to compare the sum against.

    Returns:
        pd.DataFrame:
            - If the sum of the specified column exceeds the threshold, returns a copy of the input DataFrame
              with an additional column 'dq_status' indicating the rule that was applied.
            - If the sum does not exceed the threshold, returns an empty DataFrame with the same structure as the input.
    """
    field, check, value = __extract_params(rule)
    sum_val = df[field].sum(skipna=True) or 0.0
    if sum_val > value:
        out = df.copy()
        out["dq_status"] = f"{field}:{check}:{value}"
        return out
    return df.iloc[0:0].copy()

is_between(df, rule)

Filters a DataFrame to identify rows where a specified field's values are not within a given range.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be checked.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field': The column name in the DataFrame to check. - 'check': A descriptive label for the check being performed. - 'value': A string representation of the range in the format '[lo, hi]'.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing rows that violate the range condition. An additional column 'dq_status' is added to indicate the rule violation in the format 'field:check:value'.

Source code in sumeh/engine/pandas_engine.py
def is_between(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters a DataFrame to identify rows where a specified field's values are not within a given range.

    Args:
        df (pd.DataFrame): The input DataFrame to be checked.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - 'field': The column name in the DataFrame to check.
            - 'check': A descriptive label for the check being performed.
            - 'value': A string representation of the range in the format '[lo, hi]'.

    Returns:
        pd.DataFrame: A DataFrame containing rows that violate the range condition.
                      An additional column 'dq_status' is added to indicate the rule violation in the format 'field:check:value'.
    """
    field, check, value = __extract_params(rule)
    lo, hi = [__convert_value(x) for x in str(value).strip("[]").split(",")]
    viol = df[~df[field].between(lo, hi)].copy()
    viol["dq_status"] = f"{field}:{check}:{value}"
    return viol
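
The range parsing can be sketched as follows. The private `__convert_value` helper is not shown on this page, so the sketch substitutes `float()` for it, which assumes numeric bounds. The data and column name are made up.

```python
import pandas as pd

df = pd.DataFrame({"age": [17, 18, 40, 65, 66]})
value = "[18, 65]"

# Assumption: numeric bounds; float() stands in for the library's __convert_value
lo, hi = [float(x) for x in value.strip("[]").split(",")]

# Series.between is inclusive on both ends by default
viol = df[~df["age"].between(lo, hi)].copy()
viol["dq_status"] = f"age:is_between:{value}"
```

Both bounds count as inside the range, so only 17 and 66 are reported.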

is_complete(df, rule)

Checks for missing values in a specified field of a DataFrame based on a given rule.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to check for completeness.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field': The name of the field/column to check for missing values. - 'check': The type of check being performed (used only in the 'dq_status' annotation). - 'value': Additional value associated with the rule (used only in the 'dq_status' annotation).

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing rows where the specified field has missing values. An additional column 'dq_status' is added to indicate the rule that was violated.

Source code in sumeh/engine/pandas_engine.py
def is_complete(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Checks for missing values in a specified field of a DataFrame based on a given rule.

    Args:
        df (pd.DataFrame): The input DataFrame to check for completeness.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - 'field': The name of the field/column to check for missing values.
            - 'check': The type of check being performed (used only in the 'dq_status' annotation).
            - 'value': Additional value associated with the rule (used only in the 'dq_status' annotation).

    Returns:
        pd.DataFrame: A DataFrame containing rows where the specified field has missing values.
                      An additional column 'dq_status' is added to indicate the rule that was violated.
    """
    field, check, value = __extract_params(rule)
    viol = df[df[field].isna()].copy()
    viol["dq_status"] = f"{field}:{check}:{value}"
    return viol
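
A short sketch on made-up data shows what `isna()` treats as missing: `None` and NaN are reported, while an empty string is a regular value and passes the check.

```python
import pandas as pd

df = pd.DataFrame({"email": ["a@x.io", None, "", float("nan")]})

# Only None/NaN count as missing; the empty string does not
viol = df[df["email"].isna()].copy()
viol["dq_status"] = "email:is_complete:"
```

If empty strings should count as incomplete, they need a separate rule, for example a pattern check.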

is_contained_in(df, rule)

Filters a DataFrame to identify rows where the values in a specified field are not contained within a given set of values.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be checked.

required
rule dict

A dictionary containing the rule parameters. It is expected to include the following keys: - 'field': The column name in the DataFrame to check. - 'check': A descriptive string for the check being performed. - 'value': A list or string representation of the allowed values.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing rows from the input DataFrame that do not meet the rule criteria. An additional column 'dq_status' is added to indicate the rule violation in the format "field:check:value".

Source code in sumeh/engine/pandas_engine.py
def is_contained_in(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters a DataFrame to identify rows where the values in a specified field
    are not contained within a given set of values.

    Args:
        df (pd.DataFrame): The input DataFrame to be checked.
        rule (dict): A dictionary containing the rule parameters. It is expected
                     to include the following keys:
                     - 'field': The column name in the DataFrame to check.
                     - 'check': A descriptive string for the check being performed.
                     - 'value': A list or string representation of the allowed values.

    Returns:
        pd.DataFrame: A DataFrame containing rows from the input DataFrame that
                      do not meet the rule criteria. An additional column
                      'dq_status' is added to indicate the rule violation in
                      the format "field:check:value".
    """
    field, check, value = __extract_params(rule)
    vals = re.findall(r"'([^']*)'", str(value)) or [
        v.strip() for v in str(value).strip("[]").split(",")
    ]
    viol = df[~df[field].isin(vals)].copy()
    viol["dq_status"] = f"{field}:{check}:{value}"
    return viol
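
The parsing step always produces a list of strings, which matters for non-string columns. The sketch below pulls the same expression into a hypothetical helper, `parse_allowed`, and applies it to made-up data.

```python
import re

import pandas as pd


def parse_allowed(value) -> list:
    """Same parsing expression as in is_contained_in, pulled out for inspection."""
    return re.findall(r"'([^']*)'", str(value)) or [
        v.strip() for v in str(value).strip("[]").split(",")
    ]


quoted = parse_allowed("['BR', 'US']")  # quoted items are extracted by the regex
unquoted = parse_allowed("[1, 2]")      # falls back to splitting on commas

df = pd.DataFrame({"country": ["BR", "US", "FR"], "tier": [1, 2, 3]})
bad_country = df[~df["country"].isin(quoted)]
# The parsed values are strings, so an integer column never matches them
bad_tier = df[~df["tier"].isin(unquoted)]
```

Here every row of the integer column `tier` is reported, because `isin` does not treat the integer 1 and the string "1" as equal.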

is_date_after(df, rule)

Filters a DataFrame to return rows where a specified date field is earlier than a given target date.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to be checked.

required
rule dict

A dictionary containing the rule parameters. It should include: - field (str): The name of the column in the DataFrame to check. - check (str): A descriptive label for the check being performed. - date_str (str): The target date as a string in a format parsable by pd.to_datetime.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing rows where the date in the specified field is earlier than the target date. An additional column dq_status is added to indicate the rule that was violated in the format "{field}:{check}:{date_str}".

Source code in sumeh/engine/pandas_engine.py
def is_date_after(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters a DataFrame to return rows where a specified date field is earlier than a given target date.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to be checked.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - field (str): The name of the column in the DataFrame to check.
            - check (str): A descriptive label for the check being performed.
            - date_str (str): The target date as a string in a format parsable by `pd.to_datetime`.

    Returns:
        pd.DataFrame: A DataFrame containing rows where the date in the specified field is earlier
        than the target date. An additional column `dq_status` is added to indicate the rule that
        was violated in the format "{field}:{check}:{date_str}".
    """
    field, check, date_str = __extract_params(rule)
    target = pd.to_datetime(date_str)
    dates = pd.to_datetime(df[field], errors="coerce")
    viol = df[dates < target].copy()
    viol["dq_status"] = f"{field}:{check}:{date_str}"
    return viol
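
The comparison can be sketched on made-up data with a hard-coded target date. Two details stand out: the comparison is strict, so the target date itself is not a violation, and values that cannot be parsed become NaT, which never compares as earlier.

```python
import pandas as pd

df = pd.DataFrame({"signup": ["2023-12-31", "2024-01-01", "2024-06-15", "not a date"]})
target = pd.to_datetime("2024-01-01")

# Unparseable values become NaT instead of raising
dates = pd.to_datetime(df["signup"], errors="coerce")

# Strict comparison: only dates before the target are reported;
# NaT compares False, so the unparseable row is not flagged
viol = df[dates < target].copy()
viol["dq_status"] = "signup:is_date_after:2024-01-01"
```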

is_date_before(df, rule)

Filters a DataFrame to identify rows where a date field is after a specified target date.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to be checked.

required
rule dict

A dictionary containing the rule parameters. It should include: - field (str): The name of the column in the DataFrame containing date values. - check (str): A descriptive label for the check being performed. - date_str (str): The target date as a string in a format parsable by pd.to_datetime.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing rows where the date in the specified field is after the target date. An additional column dq_status is added to indicate the rule that was violated in the format "{field}:{check}:{date_str}".

Source code in sumeh/engine/pandas_engine.py
def is_date_before(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters a DataFrame to identify rows where a date field is after a specified target date.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to be checked.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - field (str): The name of the column in the DataFrame containing date values.
            - check (str): A descriptive label for the check being performed.
            - date_str (str): The target date as a string in a format parsable by `pd.to_datetime`.

    Returns:
        pd.DataFrame: A DataFrame containing rows where the date in the specified field is after
        the target date. An additional column `dq_status` is added to indicate the rule that was
        violated in the format "{field}:{check}:{date_str}".
    """
    field, check, date_str = __extract_params(rule)
    target = pd.to_datetime(date_str)
    dates = pd.to_datetime(df[field], errors="coerce")
    viol = df[dates > target].copy()
    viol["dq_status"] = f"{field}:{check}:{date_str}"
    return viol

is_date_between(df, rule)

Filters rows in a DataFrame where the values in a specified date column are not within a given date range.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to be checked.

required
rule dict

A dictionary containing the rule parameters. It is expected to include the following: - field: The name of the column to check. - check: A string representing the type of check (used for status annotation). - raw: A string representing the date range in the format '[start_date, end_date]'.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing the rows where the date values in the specified column are outside the given range. An additional column 'dq_status' is added to indicate the rule that was violated.

Source code in sumeh/engine/pandas_engine.py
def is_date_between(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters rows in a DataFrame where the values in a specified date column
    are not within a given date range.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to be checked.
        rule (dict): A dictionary containing the rule parameters. It is expected
                     to include the following:
                     - field: The name of the column to check.
                     - check: A string representing the type of check (used for
                              status annotation).
                     - raw: A string representing the date range in the format
                            '[start_date, end_date]'.

    Returns:
        pd.DataFrame: A DataFrame containing the rows where the date values in
                      the specified column are outside the given range. An
                      additional column 'dq_status' is added to indicate the
                      rule that was violated.
    """
    field, check, raw = __extract_params(rule)
    start_str, end_str = [s.strip() for s in raw.strip("[]").split(",")]
    start = pd.to_datetime(start_str)
    end = pd.to_datetime(end_str)
    dates = pd.to_datetime(df[field], errors="coerce")
    mask = ~dates.between(start, end)
    viol = df[mask].copy()
    viol["dq_status"] = f"{field}:{check}:{raw}"
    return viol
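
A sketch of the range check on made-up data follows. Unlike the strict one-sided comparisons, the negated `between` mask also reports values that could not be parsed, because NaT is never inside the range.

```python
import pandas as pd

raw = "[2024-01-01, 2024-01-31]"
start_str, end_str = [s.strip() for s in raw.strip("[]").split(",")]
start, end = pd.to_datetime(start_str), pd.to_datetime(end_str)

df = pd.DataFrame({"day": ["2023-12-31", "2024-01-01", "2024-01-31", "2024-02-01", "bad"]})
dates = pd.to_datetime(df["day"], errors="coerce")

# between() includes both endpoints; NaT is never "between",
# so after negation the unparseable row is reported as a violation
viol = df[~dates.between(start, end)].copy()
viol["dq_status"] = f"day:is_date_between:{raw}"
```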

is_equal(df, rule)

Filters a DataFrame to identify rows where the value in a specified field does not match a given value, and annotates these rows with a data quality status.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be checked.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field': The column name in the DataFrame to check. - 'check': A string describing the check being performed (e.g., "is_equal"). - 'value': The value to compare against.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing rows that do not satisfy the equality check. An additional column 'dq_status' is added to indicate the data quality status in the format "{field}:{check}:{value}".

Source code in sumeh/engine/pandas_engine.py
def is_equal(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters a DataFrame to identify rows where the value in a specified field
    does not match a given value, and annotates these rows with a data quality status.

    Args:
        df (pd.DataFrame): The input DataFrame to be checked.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - 'field': The column name in the DataFrame to check.
            - 'check': A string describing the check being performed (e.g., "is_equal").
            - 'value': The value to compare against.

    Returns:
        pd.DataFrame: A DataFrame containing rows that do not satisfy the equality check.
        An additional column 'dq_status' is added to indicate the data quality status
        in the format "{field}:{check}:{value}".
    """
    field, check, value = __extract_params(rule)
    viol = df[df[field] != value].copy()
    viol["dq_status"] = f"{field}:{check}:{value}"
    return viol

is_equal_than(df, rule)

Returns the rows where the value in the specified field does not equal the value given in the rule.

This function acts as a wrapper for the is_equal function, passing the given DataFrame and rule to it.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to be evaluated.

required
rule dict

A dictionary containing the comparison rule.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The rows that fail the equality check, with an additional 'dq_status' column, exactly as returned by is_equal.

Source code in sumeh/engine/pandas_engine.py
def is_equal_than(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Compares the values in a DataFrame against a specified rule and returns the result.

    This function acts as a wrapper for the `is_equal` function, passing the given
    DataFrame and rule to it.

    Args:
        df (pd.DataFrame): The DataFrame to be evaluated.
        rule (dict): A dictionary containing the comparison rule.

    Returns:
        pd.DataFrame: A DataFrame indicating the result of the comparison.
    """
    return is_equal(df, rule)

is_future_date(df, rule)

Identifies rows in a DataFrame where the date in a specified field is in the future.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to be checked.

required
rule dict

A dictionary containing the rule parameters. It is expected to include the field name to check and the check type.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing only the rows where the date in the specified field is in the future. An additional column 'dq_status' is added to indicate the field, check type, and the current date in ISO format.

Source code in sumeh/engine/pandas_engine.py
def is_future_date(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Identifies rows in a DataFrame where the date in a specified field is in the future.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to be checked.
        rule (dict): A dictionary containing the rule parameters. It is expected to include
                     the field name to check and the check type.

    Returns:
        pd.DataFrame: A DataFrame containing only the rows where the date in the specified
                      field is in the future. An additional column 'dq_status' is added to
                      indicate the field, check type, and the current date in ISO format.
    """
    field, check, _ = __extract_params(rule)
    today = date.today()
    dates = pd.to_datetime(df[field], errors="coerce")
    viol = df[dates > today].copy()
    viol["dq_status"] = f"{field}:{check}:{today.isoformat()}"
    return viol
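
Depending on the pandas version, ordering comparisons between a datetime64 column and a plain `datetime.date` may not be supported. The sketch below is a variation on the check, not the library's code: it compares against a normalized `pd.Timestamp` instead, and assumes that naive local time is acceptable. The column name and data are made up.

```python
import pandas as pd

# Midnight of the current day as a Timestamp (assumption: naive local time)
today = pd.Timestamp.today().normalize()

df = pd.DataFrame(
    {
        "due": [
            today - pd.Timedelta(days=1),
            today,
            today + pd.Timedelta(days=1),
        ]
    }
)
dates = pd.to_datetime(df["due"], errors="coerce")

# Only dates strictly after today count as future dates
viol = df[dates > today].copy()
viol["dq_status"] = f"due:is_future_date:{today.date().isoformat()}"
```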

is_greater_or_equal_than(df, rule)

Filters a DataFrame to include only rows where the value in a specified field is greater than or equal to a given threshold. Adds a 'dq_status' column to indicate the rule applied.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field' (str): The column name to apply the rule on. - 'check' (str): The type of check being performed (e.g., 'greater_or_equal'). - 'value' (numeric): The threshold value for the comparison.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A new DataFrame containing only the rows that satisfy the rule, with an additional 'dq_status' column describing the rule applied.

Source code in sumeh/engine/pandas_engine.py
def is_greater_or_equal_than(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters a DataFrame to include only rows where the value in a specified field
    is greater than or equal to a given threshold. Adds a 'dq_status' column to
    indicate the rule applied.

    Args:
        df (pd.DataFrame): The input DataFrame to be filtered.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - 'field' (str): The column name to apply the rule on.
            - 'check' (str): The type of check being performed (e.g., 'greater_or_equal').
            - 'value' (numeric): The threshold value for the comparison.

    Returns:
        pd.DataFrame: A new DataFrame containing only the rows that satisfy the rule,
        with an additional 'dq_status' column describing the rule applied.
    """
    field, check, value = __extract_params(rule)
    viol = df[df[field] >= value].copy()
    viol["dq_status"] = f"{field}:{check}:{value}"
    return viol

is_greater_than(df, rule)

Filters a DataFrame to return rows where a specified field's value is greater than a given threshold.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field' (str): The column name in the DataFrame to be checked. - 'check' (str): The type of check being performed (e.g., 'greater_than'). - 'value' (numeric): The threshold value to compare against.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A new DataFrame containing rows where the specified field's value is greater than the given threshold. An additional column 'dq_status' is added to indicate the rule applied in the format "field:check:value".

Source code in sumeh/engine/pandas_engine.py
def is_greater_than(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters a DataFrame to return rows where a specified field's value is greater than a given threshold.

    Args:
        df (pd.DataFrame): The input DataFrame to be filtered.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - 'field' (str): The column name in the DataFrame to be checked.
            - 'check' (str): The type of check being performed (e.g., 'greater_than').
            - 'value' (numeric): The threshold value to compare against.

    Returns:
        pd.DataFrame: A new DataFrame containing rows where the specified field's value is greater than the given threshold.
                      An additional column 'dq_status' is added to indicate the rule applied in the format "field:check:value".
    """
    field, check, value = __extract_params(rule)
    viol = df[df[field] > value].copy()
    viol["dq_status"] = f"{field}:{check}:{value}"
    return viol

is_in(df, rule)

Returns the rows whose value in the specified field is not in the allowed set of values. This is an alias that delegates the operation to the is_contained_in function.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be evaluated.

required
rule dict

A dictionary defining the rule to check against the DataFrame.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The rows that violate the rule, with an additional 'dq_status' column, exactly as returned by is_contained_in.

Source code in sumeh/engine/pandas_engine.py
def is_in(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Checks if the values in a DataFrame satisfy a given rule by delegating
    the operation to the `is_contained_in` function.

    Args:
        df (pd.DataFrame): The input DataFrame to be evaluated.
        rule (dict): A dictionary defining the rule to check against the DataFrame.

    Returns:
        pd.DataFrame: A DataFrame indicating whether each element satisfies the rule.
    """
    return is_contained_in(df, rule)

is_in_billions(df, rule)

Filters a DataFrame to return the rows where the specified field's value is less than one billion, that is, the rows that fail the "in billions" check, and adds a data quality status column.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to filter.

required
rule dict

A dictionary containing the rule parameters. It should include: - field (str): The column name to check. - check (str): The type of check being performed (used for status annotation). - value (any): The value associated with the rule (used for status annotation).

required

Returns:

Type Description
DataFrame

pd.DataFrame: A new DataFrame containing rows where the specified field's value is less than one billion. Includes an additional column dq_status with the format "{field}:{check}:{value}".

Source code in sumeh/engine/pandas_engine.py
def is_in_billions(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters a DataFrame to return the rows where the specified field's value
    is less than one billion (the rows that fail the check), and adds a data
    quality status column.

    Args:
        df (pd.DataFrame): The input DataFrame to filter.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - field (str): The column name to check.
            - check (str): The type of check being performed (used for status annotation).
            - value (any): The value associated with the rule (used for status annotation).

    Returns:
        pd.DataFrame: A new DataFrame containing rows where the specified field's
        value is less than one billion. Includes an additional
        column `dq_status` with the format "{field}:{check}:{value}".
    """
    field, check, value = __extract_params(rule)
    out = df[df[field] < 1_000_000_000].copy()
    out["dq_status"] = f"{field}:{check}:{value}"
    return out

is_in_millions(df, rule)

Filters rows in the DataFrame where the specified field's value is less than one million, that is, the rows that fail the "in millions" check, and adds a "dq_status" column with a formatted string indicating the rule applied.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to filter.

required
rule dict

A dictionary containing the rule parameters. It is expected to include: - field (str): The column name to check. - check (str): The type of check being performed (e.g., "greater_than"). - value (any): The value associated with the rule (not used in the comparison; it appears only in dq_status).

required

Returns:

Type Description
DataFrame

pd.DataFrame: A new DataFrame containing the rows where the specified field's value is < 1,000,000, i.e. the rows that fail the check. Includes an additional "dq_status" column with the rule details.

Source code in sumeh/engine/pandas_engine.py
def is_in_millions(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Returns the rows in the DataFrame where the specified field's value is below one million
    (the rows that fail the check) and adds a "dq_status" column with a formatted string
    indicating the rule applied.

    Args:
        df (pd.DataFrame): The input DataFrame to filter.
        rule (dict): A dictionary containing the rule parameters. It is expected to include:
            - field (str): The column name to check.
            - check (str): The type of check being performed (e.g., "greater_than").
            - value (any): The value associated with the rule (used only in the dq_status string).

    Returns:
        pd.DataFrame: A new DataFrame containing rows where the specified field's value is < 1,000,000.
                      Includes an additional "dq_status" column with the rule details.
    """
    field, check, value = __extract_params(rule)
    out = df[df[field] < 1_000_000].copy()
    out["dq_status"] = f"{field}:{check}:{value}"
    return out

is_legit(df, rule)

Identifies rows where the specified field is null or, once cast to a string, is not a single run of non-whitespace characters (it must match ^\S+$).

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to validate.

required
rule dict

A dictionary containing the validation rule. It is expected to have keys that define the field to check, the type of check, and the value to validate against.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing rows that violate the rule. An additional column 'dq_status' is added to indicate the field, check, and value that caused the violation in the format "{field}:{check}:{value}".

Source code in sumeh/engine/pandas_engine.py
def is_legit(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Validates a DataFrame against a specified rule and identifies rows that violate the rule.

    Args:
        df (pd.DataFrame): The input DataFrame to validate.
        rule (dict): A dictionary containing the validation rule. It is expected to have
                     keys that define the field to check, the type of check, and the value
                     to validate against.

    Returns:
        pd.DataFrame: A DataFrame containing rows that violate the rule. An additional
                      column 'dq_status' is added to indicate the field, check, and value
                      that caused the violation in the format "{field}:{check}:{value}".
    """
    field, check, value = __extract_params(rule)
    mask = df[field].notna() & df[field].astype(str).str.contains(r"^\S+$", na=False)
    viol = df[~mask].copy()
    viol["dq_status"] = f"{field}:{check}:{value}"
    return viol
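
Example (an illustrative sketch, not part of sumeh): the mask from the listing above, applied with plain pandas to a small mixed column. The column name `code` and the sample values are assumptions chosen to hit each branch of the mask.

```python
import pandas as pd

# Hypothetical input; "code" is an illustrative column name.
df = pd.DataFrame({"code": ["abc", "a b", "", None, 42]})
field = "code"

# Same mask as the listing: non-null and, once cast to str, free of whitespace.
mask = df[field].notna() & df[field].astype(str).str.contains(r"^\S+$", na=False)
viol = df[~mask]

print(viol.index.tolist())
```

Rows 1 (embedded space), 2 (empty string) and 3 (missing) are returned. The integer 42 passes because it is cast to the string "42" before matching.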

is_less_or_equal_than(df, rule)

Filters rows in a DataFrame where the value in a specified field is less than or equal to a given value.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field' (str): The column name in the DataFrame to apply the rule on. - 'check' (str): A descriptive label for the check being performed. - 'value' (numeric): The threshold value to compare against.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A new DataFrame containing only the rows that satisfy the condition. An additional column 'dq_status' is added to indicate the rule applied in the format "{field}:{check}:{value}".

Source code in sumeh/engine/pandas_engine.py
def is_less_or_equal_than(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters rows in a DataFrame where the value in a specified field is less than or equal to a given value.

    Args:
        df (pd.DataFrame): The input DataFrame to be filtered.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - 'field' (str): The column name in the DataFrame to apply the rule on.
            - 'check' (str): A descriptive label for the check being performed.
            - 'value' (numeric): The threshold value to compare against.

    Returns:
        pd.DataFrame: A new DataFrame containing only the rows that satisfy the condition.
                      An additional column 'dq_status' is added to indicate the rule applied
                      in the format "{field}:{check}:{value}".
    """
    field, check, value = __extract_params(rule)
    viol = df[df[field] <= value].copy()
    viol["dq_status"] = f"{field}:{check}:{value}"
    return viol

is_less_than(df, rule)

Filters a DataFrame to return rows where a specified field's value is less than a given threshold.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include: - 'field' (str): The column name in the DataFrame to be checked. - 'check' (str): A descriptive string for the check (e.g., "less_than"). - 'value' (numeric): The threshold value to compare against.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A new DataFrame containing only the rows where the specified field's value is less than the given threshold. An additional column 'dq_status' is added to indicate the rule applied in the format "{field}:{check}:{value}".

Source code in sumeh/engine/pandas_engine.py
def is_less_than(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters a DataFrame to return rows where a specified field's value is less than a given threshold.

    Args:
        df (pd.DataFrame): The input DataFrame to be filtered.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - 'field' (str): The column name in the DataFrame to be checked.
            - 'check' (str): A descriptive string for the check (e.g., "less_than").
            - 'value' (numeric): The threshold value to compare against.

    Returns:
        pd.DataFrame: A new DataFrame containing only the rows where the specified field's value
        is less than the given threshold. An additional column 'dq_status' is added to indicate
        the rule applied in the format "field:check:value".
    """
    field, check, value = __extract_params(rule)
    viol = df[df[field] < value].copy()
    viol["dq_status"] = f"{field}:{check}:{value}"
    return viol
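
Example (an illustrative sketch, not part of sumeh): the only difference between is_less_than and is_less_or_equal_than is how a value equal to the threshold is treated. The column name `qty` and the threshold are assumptions made up for the demonstration.

```python
import pandas as pd

df = pd.DataFrame({"qty": [4, 5, 6]})  # "qty" is an illustrative column name
threshold = 5                          # stands in for the rule's value

# Comparison used by is_less_than: the boundary value is left out.
strictly_less = df[df["qty"] < threshold]

# Comparison used by is_less_or_equal_than: the boundary value is selected.
less_or_equal = df[df["qty"] <= threshold]

print(strictly_less["qty"].tolist(), less_or_equal["qty"].tolist())
```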

is_negative(df, rule)

Filters a DataFrame to identify rows where a specified field does not satisfy a "negative" condition.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be checked.

required
rule dict

A dictionary containing the rule parameters. It is expected to include: - 'field': The column name in the DataFrame to check. - 'check': The type of check being performed (e.g., "negative"). - 'value': Additional value associated with the rule (not used in the comparison; it appears only in dq_status).

required

Returns:

Type Description
DataFrame

pd.DataFrame: A new DataFrame containing rows where the specified field is non-negative (>= 0). An additional column 'dq_status' is added to indicate the rule violation in the format "{field}:{check}:{value}".

Source code in sumeh/engine/pandas_engine.py
def is_negative(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters a DataFrame to identify rows where a specified field does not satisfy a "negative" condition.

    Args:
        df (pd.DataFrame): The input DataFrame to be checked.
        rule (dict): A dictionary containing the rule parameters. It is expected to include:
            - 'field': The column name in the DataFrame to check.
            - 'check': The type of check being performed (e.g., "negative").
            - 'value': Additional value associated with the rule (not used in the comparison; it appears only in dq_status).

    Returns:
        pd.DataFrame: A new DataFrame containing rows where the specified field is non-negative (>= 0).
                      An additional column 'dq_status' is added to indicate the rule violation in the format
                      "{field}:{check}:{value}".
    """
    field, check, value = __extract_params(rule)
    viol = df[df[field] >= 0].copy()
    viol["dq_status"] = f"{field}:{check}:{value}"
    return viol
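
Example (an illustrative sketch, not part of sumeh): the comparison from the listing above on a column that contains zero and a missing value. The column name `balance` is an assumption.

```python
import pandas as pd

# Hypothetical input; "balance" is an illustrative column name.
df = pd.DataFrame({"balance": [-10.5, 0.0, 3.0, float("nan")]})

# Same comparison as the listing: rows that are not negative.
viol = df[df["balance"] >= 0]

print(viol["balance"].tolist())
```

Zero counts as non-negative, so it is returned. The NaN row is not, because comparisons against NaN evaluate to False.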

is_on_friday(df, rule)

Filters the rows of a DataFrame based on whether a specific date column corresponds to a Friday.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to be filtered.

required
rule dict

A dictionary containing the rules or parameters for filtering. It should specify the column to check for the day of the week.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A filtered DataFrame containing only the rows where the specified date column corresponds to a Friday.

Source code in sumeh/engine/pandas_engine.py
def is_on_friday(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters the rows of a DataFrame based on whether a specific date column corresponds to a Friday.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to be filtered.
        rule (dict): A dictionary containing the rules or parameters for filtering.
                     It should specify the column to check for the day of the week.

    Returns:
        pd.DataFrame: A filtered DataFrame containing only the rows where the specified date column corresponds to a Friday.
    """
    return _day_of_week(df, rule, 4)
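
Example (an illustrative sketch, not part of sumeh): the third argument passed to `_day_of_week` is a day index. The helper's body is not shown on this page; assuming it compares the index against pandas' `Series.dt.dayofweek`, as is_on_weekday and is_on_weekend do, the numbering runs from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# 2024-01-01 was a Monday, so this range covers Monday through Sunday.
dates = pd.Series(pd.date_range("2024-01-01", periods=7, freq="D"))

day_index = dates.dt.dayofweek.tolist()
day_names = dates.dt.day_name().tolist()

for idx, name in zip(day_index, day_names):
    print(idx, name)
```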

is_on_monday(df, rule)

Filters the rows of a DataFrame based on whether a specific date column corresponds to a Monday.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to be filtered.

required
rule dict

A dictionary containing the filtering rules, including the column to check.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A filtered DataFrame containing only the rows where the specified date column corresponds to a Monday.

Source code in sumeh/engine/pandas_engine.py
def is_on_monday(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters the rows of a DataFrame based on whether a specific date column corresponds to a Monday.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to be filtered.
        rule (dict): A dictionary containing the filtering rules, including the column to check.

    Returns:
        pd.DataFrame: A filtered DataFrame containing only the rows where the specified date column corresponds to a Monday.
    """
    return _day_of_week(df, rule, 0)

is_on_saturday(df, rule)

Filters a DataFrame to include only rows where the date corresponds to a Saturday.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing date information.

required
rule dict

A dictionary containing rules or parameters for filtering.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A filtered DataFrame containing only rows where the date is a Saturday.

Source code in sumeh/engine/pandas_engine.py
def is_on_saturday(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters a DataFrame to include only rows where the date corresponds to a Saturday.

    Args:
        df (pd.DataFrame): The input DataFrame containing date information.
        rule (dict): A dictionary containing rules or parameters for filtering.

    Returns:
        pd.DataFrame: A filtered DataFrame containing only rows where the date is a Saturday.
    """
    return _day_of_week(df, rule, 5)

is_on_sunday(df, rule)

Filters the rows of a DataFrame based on whether the date column named in the rule corresponds to a Sunday.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing date-related data.

required
rule dict

A dictionary containing rules or parameters for the operation.

required

Returns:

Type Description
DataFrame

pd.DataFrame: The DataFrame returned by _day_of_week(df, rule, 6), where 6 is the day index for Sunday.

Source code in sumeh/engine/pandas_engine.py
def is_on_sunday(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters the rows of a DataFrame based on whether the date column named in the rule corresponds to a Sunday.

    Args:
        df (pd.DataFrame): The input DataFrame containing date-related data.
        rule (dict): A dictionary containing rules or parameters for the operation.

    Returns:
        pd.DataFrame: The DataFrame returned by `_day_of_week(df, rule, 6)`, where 6 is the day index for Sunday.
    """
    return _day_of_week(df, rule, 6)

is_on_thursday(df, rule)

Filters the rows of a DataFrame based on whether a date column corresponds to a Thursday.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to be filtered.

required
rule dict

A dictionary containing the filtering rules, including the column to check.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A filtered DataFrame containing only the rows where the specified date column corresponds to a Thursday.

Source code in sumeh/engine/pandas_engine.py
def is_on_thursday(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters the rows of a DataFrame based on whether a date column corresponds to a Thursday.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to be filtered.
        rule (dict): A dictionary containing the filtering rules, including the column to check.

    Returns:
        pd.DataFrame: A filtered DataFrame containing only the rows where the specified date column
                      corresponds to a Thursday.
    """
    return _day_of_week(df, rule, 3)

is_on_tuesday(df, rule)

Filters the rows of a DataFrame based on whether a specific date column corresponds to a Tuesday.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to be filtered.

required
rule dict

A dictionary containing the filtering rules, including the column to check.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A filtered DataFrame containing only the rows where the specified date column corresponds to a Tuesday.

Source code in sumeh/engine/pandas_engine.py
def is_on_tuesday(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters the rows of a DataFrame based on whether a specific date column corresponds to a Tuesday.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to be filtered.
        rule (dict): A dictionary containing the filtering rules, including the column to check.

    Returns:
        pd.DataFrame: A filtered DataFrame containing only the rows where the specified date column corresponds to a Tuesday.
    """
    return _day_of_week(df, rule, 1)

is_on_wednesday(df, rule)

Filters the rows of a DataFrame based on whether a date column corresponds to Wednesday.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to be filtered.

required
rule dict

A dictionary containing the rule configuration. It is expected to specify the column to evaluate.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A filtered DataFrame containing only the rows where the specified date column corresponds to Wednesday.

Source code in sumeh/engine/pandas_engine.py
def is_on_wednesday(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters the rows of a DataFrame based on whether a date column corresponds to Wednesday.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to be filtered.
        rule (dict): A dictionary containing the rule configuration.
                     It is expected to specify the column to evaluate.

    Returns:
        pd.DataFrame: A filtered DataFrame containing only the rows where the specified date column
                      corresponds to Wednesday.
    """
    return _day_of_week(df, rule, 2)

is_on_weekday(df, rule)

Returns the rows of a DataFrame where the specified date field does not fall on a weekday (Monday to Friday), i.e. the rows that fail the check, and adds a "dq_status" column indicating the rule applied.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include: - field (str): The name of the date column to check. - check (str): A descriptive string for the check being performed. - value (str): A value associated with the rule for documentation purposes.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A filtered DataFrame containing only the rows where the specified date field does not fall on a weekday, with an additional "dq_status" column describing the rule applied.

Source code in sumeh/engine/pandas_engine.py
def is_on_weekday(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Returns the rows of a DataFrame where the specified date field does not fall on a weekday
    (Monday to Friday), i.e. the rows that fail the check, and adds a "dq_status" column
    indicating the rule applied.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to be filtered.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - field (str): The name of the date column to check.
            - check (str): A descriptive string for the check being performed.
            - value (str): A value associated with the rule for documentation purposes.

    Returns:
        pd.DataFrame: A filtered DataFrame containing only the rows where the specified date field
        does not fall on a weekday, with an additional "dq_status" column describing the rule applied.
    """
    field, check, value = __extract_params(rule)
    mask = ~df[field].dt.dayofweek.between(0, 4)
    out = df[mask].copy()
    out["dq_status"] = f"{field}:{check}:{value}"
    return out
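
Example (an illustrative sketch, not part of sumeh): the mask from the listing above, applied with plain pandas to one calendar week. The column name `d` is an assumption.

```python
import pandas as pd

# Hypothetical input: one week starting Monday 2024-01-01; "d" is an illustrative column name.
df = pd.DataFrame({"d": pd.date_range("2024-01-01", periods=7, freq="D")})

# Same mask as the listing: True where the day index is outside 0..4 (Mon..Fri).
mask = ~df["d"].dt.dayofweek.between(0, 4)
out = df[mask]

print(out["d"].dt.day_name().tolist())
```

Because the mask is negated, the rows that come back are the Saturday and Sunday entries.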

is_on_weekend(df, rule)

Returns the rows of a DataFrame where the specified date field does not fall on a weekend (Saturday or Sunday), i.e. the rows that fail the check, and adds a "dq_status" column indicating the rule applied.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to be filtered.

required
rule dict

A dictionary containing the rule parameters. It is expected to include: - field (str): The name of the date column to check. - check (str): A descriptive string for the type of check being performed. - value (str): A value associated with the rule for documentation purposes.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A new DataFrame containing only the rows where the specified date field does not fall on a weekend. Includes an additional "dq_status" column with the rule details.

Source code in sumeh/engine/pandas_engine.py
def is_on_weekend(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Returns the rows of a DataFrame where the specified date field does not fall on a weekend
    (Saturday or Sunday), i.e. the rows that fail the check, and adds a "dq_status" column
    indicating the rule applied.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to be filtered.
        rule (dict): A dictionary containing the rule parameters. It is expected to include:
            - field (str): The name of the date column to check.
            - check (str): A descriptive string for the type of check being performed.
            - value (str): A value associated with the rule for documentation purposes.

    Returns:
        pd.DataFrame: A new DataFrame containing only the rows where the specified date field
        does not fall on a weekend. Includes an additional "dq_status" column with the rule details.
    """
    field, check, value = __extract_params(rule)
    mask = ~df[field].dt.dayofweek.isin([5, 6])
    out = df[mask].copy()
    out["dq_status"] = f"{field}:{check}:{value}"
    return out

is_past_date(df, rule)

Identifies rows in a DataFrame where the date in a specified column is in the past.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to be checked.

required
rule dict

A dictionary containing the rule parameters. It is expected to include the field name to check and the check type.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing the rows where the date in the specified column is earlier than the current date. An additional column 'dq_status' is added to indicate the field, check type, and the current date.

Notes
  • The function uses pd.to_datetime to convert the specified column to datetime format. Any invalid date entries will be coerced to NaT (Not a Time).
  • Rows with invalid or missing dates are excluded from the result.
Source code in sumeh/engine/pandas_engine.py
def is_past_date(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Identifies rows in a DataFrame where the date in a specified column is in the past.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to be checked.
        rule (dict): A dictionary containing the rule parameters. It is expected to include
                     the field name to check and the check type.

    Returns:
        pd.DataFrame: A DataFrame containing the rows where the date in the specified column
                      is earlier than the current date. An additional column 'dq_status' is
                      added to indicate the field, check type, and the current date.

    Notes:
        - The function uses `pd.to_datetime` to convert the specified column to datetime format.
          Any invalid date entries will be coerced to NaT (Not a Time).
        - Rows with invalid or missing dates are excluded from the result.
    """
    field, check, _ = __extract_params(rule)
    today = date.today()
    dates = pd.to_datetime(df[field], errors="coerce")
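    # Compare against a Timestamp so both sides are datetime-like
    # (a plain datetime.date may not be orderable against datetime64 values).
    viol = df[dates < pd.Timestamp(today)].copy()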
    viol["dq_status"] = f"{field}:{check}:{today.isoformat()}"
    return viol
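
Example (an illustrative sketch, not part of sumeh): how coercion to NaT keeps invalid and missing dates out of the result, as the notes above describe. The column name `due` is an assumption, and the sketch compares against `pd.Timestamp(today)` so that both sides of the comparison are datetime-like.

```python
from datetime import date, timedelta

import pandas as pd

today = date.today()

# Hypothetical input; "due" is an illustrative column name.
df = pd.DataFrame(
    {
        "due": [
            (today - timedelta(days=1)).isoformat(),  # past
            (today + timedelta(days=1)).isoformat(),  # future
            "not a date",                             # unparseable
            None,                                     # missing
        ]
    }
)

# Unparseable and missing entries become NaT.
dates = pd.to_datetime(df["due"], errors="coerce")

# NaT compares as False, so those rows never reach the result.
viol = df[dates < pd.Timestamp(today)]

print(viol)
```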

is_positive(df, rule)

Identifies rows in a DataFrame where the specified field contains negative values.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be checked.

required
rule dict

A dictionary containing the rule parameters. It is expected to include: - 'field': The column name in the DataFrame to check. - 'check': A descriptive label for the type of check being performed. - 'value': A value associated with the rule (not directly used in this function).

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing only the rows where the specified field has negative values. An additional column 'dq_status' is added to indicate the rule violation in the format "{field}:{check}:{value}".

Source code in sumeh/engine/pandas_engine.py
def is_positive(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Identifies rows in a DataFrame where the specified field contains negative values.

    Args:
        df (pd.DataFrame): The input DataFrame to be checked.
        rule (dict): A dictionary containing the rule parameters. It is expected to include:
            - 'field': The column name in the DataFrame to check.
            - 'check': A descriptive label for the type of check being performed.
            - 'value': A value associated with the rule (not directly used in this function).

    Returns:
        pd.DataFrame: A DataFrame containing only the rows where the specified field has negative values.
                      An additional column 'dq_status' is added to indicate the rule violation in the format
                      "{field}:{check}:{value}".
    """
    field, check, value = __extract_params(rule)
    viol = df[df[field] < 0].copy()
    viol["dq_status"] = f"{field}:{check}:{value}"
    return viol

is_t_minus_2(df, rule)

Returns the rows of a DataFrame where the specified date field does not match the date two days prior to the current date, i.e. the rows that fail the check.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to filter.

required
rule dict

A dictionary containing the rule parameters. It is expected to include the field name, check type, and value.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A filtered DataFrame containing only the rows where the specified date field does not match the target date (two days prior). An additional column "dq_status" is added to indicate the rule applied.

Source code in sumeh/engine/pandas_engine.py
def is_t_minus_2(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Returns the rows of a DataFrame where the specified date field does not
    match the date two days prior to the current date (the rows that fail the check).

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to filter.
        rule (dict): A dictionary containing the rule parameters. It is expected
            to include the field name, check type, and value.

    Returns:
        pd.DataFrame: A filtered DataFrame containing only the rows where the
        specified date field does not match the target date (two days prior). An
        additional column "dq_status" is added to indicate the rule applied.
    """
    field, check, value = __extract_params(rule)
    target = pd.Timestamp(date.today() - timedelta(days=2))
    mask = df[field].dt.normalize() != target
    out = df[mask].copy()
    out["dq_status"] = f"{field}:{check}:{value}"
    return out
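
Example (an illustrative sketch, not part of sumeh): `dt.normalize()` drops the time of day before the comparison, so a timestamp at any hour on the target date counts as a match. The column name `ts` is an assumption.

```python
from datetime import date, timedelta

import pandas as pd

base = date.today()
today = pd.Timestamp(base)

# Hypothetical input; "ts" is an illustrative column name.
df = pd.DataFrame(
    {
        "ts": [
            today - pd.Timedelta(days=2) + pd.Timedelta(hours=15),  # T-2 at 15:00
            today - pd.Timedelta(days=1),                           # T-1
            today,                                                  # T
        ]
    }
)

# Same target and mask as the listing.
target = pd.Timestamp(base - timedelta(days=2))
mask = df["ts"].dt.normalize() != target
out = df[mask]

print(out.index.tolist())
```

The T-2 row matches the target after normalization, so the rows returned are the T-1 and T entries. The same pattern applies to is_t_minus_3, is_today and is_yesterday with a different offset.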

is_t_minus_3(df, rule)

Returns the rows of a DataFrame where the specified date field does not match the date three days prior to the current date, i.e. the rows that fail the check.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to filter.

required
rule dict

A dictionary containing the rule parameters. The rule should include the field to check, the type of check, and the value.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A filtered DataFrame containing only the rows where the specified date field does not match the target date (three days prior). An additional column "dq_status" is added to indicate the rule applied.

Source code in sumeh/engine/pandas_engine.py
def is_t_minus_3(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Returns the rows of a DataFrame where the specified date field does not
    match the date three days prior to the current date (the rows that fail the check).

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to filter.
        rule (dict): A dictionary containing the rule parameters. The rule
            should include the field to check, the type of check, and the value.

    Returns:
        pd.DataFrame: A filtered DataFrame containing only the rows where the
        specified date field does not match the target date (three days prior). An
        additional column "dq_status" is added to indicate the rule applied.
    """
    field, check, value = __extract_params(rule)
    target = pd.Timestamp(date.today() - timedelta(days=3))
    mask = df[field].dt.normalize() != target
    out = df[mask].copy()
    out["dq_status"] = f"{field}:{check}:{value}"
    return out

is_today(df, rule)

Returns the rows of a DataFrame where the specified date field does not match today's date, i.e. the rows that fail the check.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to filter.

required
rule dict

A dictionary containing the rule parameters. It is expected to include the field name, a check operation, and a value.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A new DataFrame containing only the rows where the specified date field does not match today's date. An additional column "dq_status" is added to indicate the rule applied in the format "{field}:{check}:{value}".

Source code in sumeh/engine/pandas_engine.py
def is_today(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Returns the rows of a DataFrame where the specified date field does not match today's date
    (the rows that fail the check).

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to filter.
        rule (dict): A dictionary containing the rule parameters. It is expected to include
                     the field name, a check operation, and a value.

    Returns:
        pd.DataFrame: A new DataFrame containing only the rows where the specified date field
                      does not match today's date. An additional column "dq_status" is added to
                      indicate the rule applied in the format "{field}:{check}:{value}".
    """
    field, check, value = __extract_params(rule)
    today = pd.Timestamp(date.today())
    mask = df[field].dt.normalize() != today
    out = df[mask].copy()
    out["dq_status"] = f"{field}:{check}:{value}"
    return out

is_unique(df, rule)

Checks for duplicate values in a specified field of a DataFrame based on a rule.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to check for duplicates.

required
rule dict

A dictionary containing the rule parameters. It is expected to include the field to check, the type of check, and a value.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing the rows with duplicate values in the specified field. An additional column 'dq_status' is added to indicate the field, check type, and value associated with the rule.

Source code in sumeh/engine/pandas_engine.py
def is_unique(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Checks for duplicate values in a specified field of a DataFrame based on a rule.

    Args:
        df (pd.DataFrame): The input DataFrame to check for duplicates.
        rule (dict): A dictionary containing the rule parameters. It is expected to
                     include the field to check, the type of check, and a value.

    Returns:
        pd.DataFrame: A DataFrame containing the rows with duplicate values in the
                      specified field. An additional column 'dq_status' is added
                      to indicate the field, check type, and value associated with
                      the rule.
    """
    field, check, value = __extract_params(rule)
    dup = df[field].duplicated(keep=False)
    viol = df[dup].copy()
    viol["dq_status"] = f"{field}:{check}:{value}"
    return viol
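
The effect of `keep=False` in the duplicate check above can be seen in a small sketch. The sample data and the `id` column name are made up for illustration, and the `dq_status` value is written by hand rather than taken from a real rule.

```python
import pandas as pd

# Hypothetical sample data; "id" stands in for the rule's field.
df = pd.DataFrame({"id": [1, 2, 2, 3, 3, 3], "name": list("abcdef")})

# keep=False marks every row that belongs to a duplicate group.
dup_all = df["id"].duplicated(keep=False)

# The default (keep="first") leaves the first occurrence of each group unmarked.
dup_later = df["id"].duplicated()

violations = df[dup_all].copy()
violations["dq_status"] = "id:is_unique:"  # value part left empty for illustration
```

With `keep=False`, all five rows sharing an `id` are reported, which is usually what a uniqueness check needs: the first occurrence is as much a part of the conflict as the later ones.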

is_yesterday(df, rule)

Filters a DataFrame to include only rows where the specified date field matches yesterday's date.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to filter.

required
rule dict

A dictionary containing the rule parameters. It is expected to have keys that allow __extract_params(rule) to return the field name, check type, and value.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A filtered DataFrame containing only rows where the specified date field matches yesterday's date. An additional column dq_status is added to indicate the data quality status in the format "{field}:{check}:{value}".

Source code in sumeh/engine/pandas_engine.py
def is_yesterday(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters a DataFrame to include only rows where the specified date field matches yesterday's date.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to filter.
        rule (dict): A dictionary containing the rule parameters. It is expected to have
                     keys that allow `__extract_params(rule)` to return the field name,
                     check type, and value.

    Returns:
        pd.DataFrame: A filtered DataFrame containing only rows where the specified date field
                      matches yesterday's date. An additional column `dq_status` is added to
                      indicate the data quality status in the format "{field}:{check}:{value}".
    """
    field, check, value = __extract_params(rule)
    target = pd.Timestamp(date.today() - timedelta(days=1))
    mask = df[field].dt.normalize() == target
    out = df[mask].copy()
    out["dq_status"] = f"{field}:{check}:{value}"
    return out
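
The date comparison described above relies on `dt.normalize()` to drop the time of day before comparing against a midnight timestamp. A minimal sketch, assuming a datetime column with the hypothetical name `event_ts`:

```python
from datetime import date, timedelta

import pandas as pd

today = date.today()
target = pd.Timestamp(today - timedelta(days=1))  # midnight at the start of yesterday

# Hypothetical column name "event_ts"; both timestamps carry a time of day.
df = pd.DataFrame(
    {
        "event_ts": [
            target + pd.Timedelta(hours=15),              # yesterday, 15:00
            pd.Timestamp(today) + pd.Timedelta(hours=9),  # today, 09:00
        ]
    }
)

# normalize() resets the time to 00:00 so only the calendar date is compared.
normalized = df["event_ts"].dt.normalize()
matches_yesterday = normalized == target
```

Without the normalization step, a timestamp such as yesterday at 15:00 would not compare equal to `target`, because `target` represents midnight.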

not_contained_in(df, rule)

Filters a DataFrame to return rows where the specified field contains values that are not allowed according to the provided rule.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be filtered.

required
rule dict

A dictionary containing the rule parameters. It should include:
  • 'field': The column name in the DataFrame to check.
  • 'check': The type of check being performed (used for status annotation).
  • 'value': A list or string representation of values that are not allowed.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing rows that violate the rule. An additional column 'dq_status' is added to indicate the rule violation in the format "{field}:{check}:{value}".

Source code in sumeh/engine/pandas_engine.py
def not_contained_in(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters a DataFrame to return rows where the specified field contains values
    that are not allowed according to the provided rule.

    Args:
        df (pd.DataFrame): The input DataFrame to be filtered.
        rule (dict): A dictionary containing the rule parameters. It should include:
            - 'field': The column name in the DataFrame to check.
            - 'check': The type of check being performed (used for status annotation).
            - 'value': A list or string representation of values that are not allowed.

    Returns:
        pd.DataFrame: A DataFrame containing rows that violate the rule. An additional
        column 'dq_status' is added to indicate the rule violation in the format
        "{field}:{check}:{value}".
    """
    field, check, value = __extract_params(rule)
    vals = re.findall(r"'([^']*)'", str(value)) or [
        v.strip() for v in str(value).strip("[]").split(",")
    ]
    viol = df[df[field].isin(vals)].copy()
    viol["dq_status"] = f"{field}:{check}:{value}"
    return viol
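
The value parsing in the source above accepts either quoted items or a plain comma-separated list. The sketch below isolates that expression in a helper named `parse_values`, which is a hypothetical name used only for illustration and is not part of the module.

```python
import re


def parse_values(value):
    """Illustrative helper mirroring the parsing expression in not_contained_in."""
    text = str(value)
    # First try to pull out items wrapped in single quotes.
    quoted = re.findall(r"'([^']*)'", text)
    # If nothing is quoted, fall back to splitting on commas.
    return quoted or [v.strip() for v in text.strip("[]").split(",")]


from_quoted = parse_values("['red', 'blue']")  # quoted items
from_plain = parse_values("[red, blue]")       # unquoted, comma-separated
from_numbers = parse_values([1, 2])            # str([1, 2]) is "[1, 2]"
```

Note that both paths produce strings, so a rule written against a numeric column yields string values such as `'1'` and `'2'`. Whether that matches the column's values depends on its dtype, which is worth checking when a rule unexpectedly reports no violations.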

not_in(df, rule)

Returns the rows of a DataFrame whose field holds one of the values disallowed by the rule.

This function is a wrapper around the not_contained_in function, which performs the actual filtering logic.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be filtered.

required
rule dict

A dictionary specifying the filtering criteria.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A new DataFrame containing the rows that violate the rule, with the 'dq_status' column added by not_contained_in.

Source code in sumeh/engine/pandas_engine.py
def not_in(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Returns the rows of a DataFrame whose field holds one of the values disallowed by the rule.

    This function is a wrapper around the `not_contained_in` function,
    which performs the actual filtering logic.

    Args:
        df (pd.DataFrame): The input DataFrame to be filtered.
        rule (dict): A dictionary specifying the filtering criteria.

    Returns:
        pd.DataFrame: A new DataFrame containing the rows that violate the rule, with the
        'dq_status' column added by `not_contained_in`.
    """
    return not_contained_in(df, rule)

satisfies(df, rule)

Filters a DataFrame based on a rule and returns rows that do not satisfy the rule.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to be evaluated.

required
rule dict

A dictionary containing the rule to be applied. It is expected to contain parameters that can be extracted using the __extract_params function.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing rows that do not satisfy the rule. An additional column dq_status is added to indicate the field, check, and expression that failed.

Source code in sumeh/engine/pandas_engine.py
def satisfies(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Filters a DataFrame based on a rule and returns rows that do not satisfy the rule.

    Args:
        df (pd.DataFrame): The input DataFrame to be evaluated.
        rule (dict): A dictionary containing the rule to be applied. It is expected
            to contain parameters that can be extracted using the `__extract_params` function.

    Returns:
        pd.DataFrame: A DataFrame containing rows that do not satisfy the rule. An additional
        column `dq_status` is added to indicate the field, check, and expression that failed.
    """
    field, check, expr = __extract_params(rule)
    mask = df.eval(expr)
    viol = df[~mask].copy()
    viol["dq_status"] = f"{field}:{check}:{expr}"
    return viol
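
As a rough illustration of the expression check above, the sketch below evaluates a boolean expression with `DataFrame.eval` and keeps the rows where it is false. The column names and the expression are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical data and rule expression.
df = pd.DataFrame({"price": [10, -5, 30], "qty": [1, 2, 0]})
expr = "(price > 0) & (qty > 0)"

# eval() returns a boolean Series: True where the row satisfies the expression.
mask = df.eval(expr)

# Violations are the rows where the expression is False.
violations = df[~mask].copy()
violations["dq_status"] = f"price:satisfies:{expr}"
```

Here the second row fails on `price` and the third on `qty`, so both appear in `violations`.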

summarize(qc_df, rules, total_rows)

Summarizes quality check results for a given DataFrame based on specified rules.

Parameters:

Name Type Description Default
qc_df DataFrame

The input DataFrame containing a 'dq_status' column with quality check results in the format 'column:rule:value', separated by semicolons.

required
rules list[dict]

A list of dictionaries representing the quality check rules. Each dictionary should define the 'column', 'rule', 'value', and 'pass_threshold'.

required
total_rows int

The total number of rows in the original dataset.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame summarizing the quality check results with the following columns:
  • 'id': A unique identifier for each rule.
  • 'timestamp': The timestamp of the summary generation.
  • 'check': The type of check performed (e.g., 'Quality Check').
  • 'level': The severity level of the check (e.g., 'WARNING').
  • 'column': The column name associated with the rule.
  • 'rule': The rule being checked.
  • 'value': The value associated with the rule.
  • 'rows': The total number of rows in the dataset.
  • 'violations': The number of rows that violated the rule.
  • 'pass_rate': The proportion of rows that passed the rule.
  • 'pass_threshold': The threshold for passing the rule.
  • 'status': The status of the rule ('PASS' or 'FAIL') based on the pass rate.

Notes
  • The function calculates the number of violations for each rule and merges it with the provided rules to compute the pass rate and status.
  • The 'timestamp' column is set to the current time with seconds and microseconds set to zero.
Source code in sumeh/engine/pandas_engine.py
def summarize(qc_df: pd.DataFrame, rules: list[dict], total_rows: int) -> pd.DataFrame:
    """
    Summarizes quality check results for a given DataFrame based on specified rules.

    Args:
        qc_df (pd.DataFrame): The input DataFrame containing a 'dq_status' column with
            quality check results in the format 'column:rule:value', separated by semicolons.
        rules (list[dict]): A list of dictionaries representing the quality check rules.
            Each dictionary should define the 'column', 'rule', 'value', and 'pass_threshold'.
        total_rows (int): The total number of rows in the original dataset.

    Returns:
        pd.DataFrame: A DataFrame summarizing the quality check results with the following columns:
            - 'id': A unique identifier for each rule.
            - 'timestamp': The timestamp of the summary generation.
            - 'check': The type of check performed (e.g., 'Quality Check').
            - 'level': The severity level of the check (e.g., 'WARNING').
            - 'column': The column name associated with the rule.
            - 'rule': The rule being checked.
            - 'value': The value associated with the rule.
            - 'rows': The total number of rows in the dataset.
            - 'violations': The number of rows that violated the rule.
            - 'pass_rate': The proportion of rows that passed the rule.
            - 'pass_threshold': The threshold for passing the rule.
            - 'status': The status of the rule ('PASS' or 'FAIL') based on the pass rate.

    Notes:
        - The function calculates the number of violations for each rule and merges it with the
          provided rules to compute the pass rate and status.
        - The 'timestamp' column is set to the current time with seconds and microseconds set to zero.
    """
    split = qc_df["dq_status"].str.split(";").explode().dropna()
    parts = split.str.split(":", expand=True)
    parts.columns = ["column", "rule", "value"]
    viol_count = (
        parts.groupby(["column", "rule", "value"]).size().reset_index(name="violations")
    )
    rules_df = __build_rules_df(rules)
    df = rules_df.merge(viol_count, on=["column", "rule", "value"], how="left")
    df["violations"] = df["violations"].fillna(0).astype(int)
    df["rows"] = total_rows
    df["pass_rate"] = (total_rows - df["violations"]) / total_rows
    df["status"] = np.where(df["pass_rate"] >= df["pass_threshold"], "PASS", "FAIL")
    df["timestamp"] = datetime.now().replace(second=0, microsecond=0)
    df["check"] = "Quality Check"
    df["level"] = "WARNING"
    df.insert(0, "id", np.array([uuid.uuid4() for _ in range(len(df))], dtype="object"))
    return df[
        [
            "id",
            "timestamp",
            "check",
            "level",
            "column",
            "rule",
            "value",
            "rows",
            "violations",
            "pass_rate",
            "pass_threshold",
            "status",
        ]
    ]
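
The counting step of the summary can be followed on a small example. This sketch reproduces only the split, explode, and group-by portion together with the pass-rate arithmetic. The status strings are made up, and the value part of each status is left empty for illustration.

```python
import pandas as pd

# Hypothetical validation output: one dq_status string per row, None if clean.
qc_df = pd.DataFrame(
    {
        "dq_status": [
            "age:is_positive:;id:is_unique:",  # row violates two rules
            "age:is_positive:",                # row violates one rule
            None,                              # row has no violations
        ]
    }
)
total_rows = 10  # assumed size of the original dataset

# One entry per (row, violated rule); rows without violations are dropped.
split = qc_df["dq_status"].str.split(";").explode().dropna()
parts = split.str.split(":", expand=True)
parts.columns = ["column", "rule", "value"]

viol_count = (
    parts.groupby(["column", "rule", "value"]).size().reset_index(name="violations")
)
viol_count["pass_rate"] = (total_rows - viol_count["violations"]) / total_rows
```

Because each status is split on `:` into exactly three parts, a rule value that itself contains a colon would produce extra columns at this step, so such values may need special handling.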

validate(df, rules)

Validates a pandas DataFrame against a set of rules and returns the processed DataFrame along with a DataFrame containing validation violations.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame to validate.

required
rules list[dict]

A list of dictionaries, where each dictionary represents a validation rule. Each rule should contain the following keys:
  • 'check_type' (str): The type of validation to perform. This should correspond to a function name available in the global scope. Special cases include 'is_primary_key' and 'is_composite_key', which map to 'is_unique' and 'are_unique', respectively.
  • 'execute' (bool, optional): Whether to execute the rule. Defaults to True.

required

Returns:

Type Description
Tuple[DataFrame, DataFrame]

Tuple[pd.DataFrame, pd.DataFrame]: A tuple containing:
  • The processed DataFrame with validation statuses merged.
  • A DataFrame containing rows that violated the validation rules.

Notes
  • The input DataFrame is copied and reset to ensure the original data is not modified.
  • An '_id' column is temporarily added to track row indices during validation.
  • If a rule's 'check_type' does not correspond to a known function, a warning is issued.
  • The 'dq_status' column in the violations DataFrame summarizes validation issues for each row.
Source code in sumeh/engine/pandas_engine.py
def validate(df: pd.DataFrame, rules: list[dict]) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Validates a pandas DataFrame against a set of rules and returns the processed DataFrame
    along with a DataFrame containing validation violations.

    Args:
        df (pd.DataFrame): The input DataFrame to validate.
        rules (list[dict]): A list of dictionaries, where each dictionary represents a validation
            rule. Each rule should contain the following keys:
            - 'check_type' (str): The type of validation to perform. This should correspond to a
              function name available in the global scope. Special cases include 'is_primary_key'
              and 'is_composite_key', which map to 'is_unique' and 'are_unique', respectively.
            - 'execute' (bool, optional): Whether to execute the rule. Defaults to True.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame]: A tuple containing:
            - The processed DataFrame with validation statuses merged.
            - A DataFrame containing rows that violated the validation rules.

    Notes:
        - The input DataFrame is copied and reset to ensure the original data is not modified.
        - An '_id' column is temporarily added to track row indices during validation.
        - If a rule's 'check_type' does not correspond to a known function, a warning is issued.
        - The 'dq_status' column in the violations DataFrame summarizes validation issues for
          each row.
    """
    df = df.copy().reset_index(drop=True)
    df["_id"] = df.index
    raw_list = []
    for rule in rules:
        if not rule.get("execute", True):
            continue
        rt = rule["check_type"]
        fn = globals().get(
            rt
            if rt not in ("is_primary_key", "is_composite_key")
            else ("is_unique" if rt == "is_primary_key" else "are_unique")
        )
        if fn is None:
            warnings.warn(f"Unknown rule: {rt}")
            continue
        viol = fn(df, rule)
        raw_list.append(viol)
    raw = (
        pd.concat(raw_list, ignore_index=True)
        if raw_list
        else pd.DataFrame(columns=df.columns)
    )
    summary = raw.groupby("_id")["dq_status"].agg(";".join).reset_index()
    out = df.merge(summary, on="_id", how="left").drop(columns=["_id"])
    return out, raw
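
The final aggregation in the source above joins all statuses that share an `_id` and merges them back onto the input rows. The following sketch shows that step in isolation, with hand-written violation rows standing in for the output of the individual checks.

```python
import pandas as pd

# Hypothetical input with the temporary row identifier.
df = pd.DataFrame({"id": [1, 2, 2], "age": [5, -1, 30]})
df["_id"] = df.index

# Stand-in for the concatenated output of the individual checks:
# row 1 violates two rules, row 2 violates one (value part left empty).
raw = pd.DataFrame(
    {
        "_id": [1, 1, 2],
        "dq_status": ["age:is_positive:", "id:is_unique:", "id:is_unique:"],
    }
)

# Join the statuses of each row with ";" and attach them to the input rows.
summary = raw.groupby("_id")["dq_status"].agg(";".join).reset_index()
out = df.merge(summary, on="_id", how="left").drop(columns=["_id"])
```

Rows without any violation end up with a missing `dq_status` because of the left join, which is how clean rows can be told apart from flagged ones.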

validate_date_format(df, rule)

Validates the date format of a specified field in a DataFrame against a given format.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing the data to validate.

required
rule dict

A dictionary containing the validation rule. It should include:
  • 'field': The name of the column to validate.
  • 'check': A description or identifier for the validation check.
  • 'fmt': The expected date format to validate against.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame containing rows that violate the date format rule. An additional column 'dq_status' is added to indicate the validation status in the format "{field}:{check}:{fmt}".

Source code in sumeh/engine/pandas_engine.py
def validate_date_format(df: pd.DataFrame, rule: dict) -> pd.DataFrame:
    """
    Validates the date format of a specified field in a DataFrame against a given format.

    Args:
        df (pd.DataFrame): The input DataFrame containing the data to validate.
        rule (dict): A dictionary containing the validation rule. It should include:
            - 'field': The name of the column to validate.
            - 'check': A description or identifier for the validation check.
            - 'fmt': The expected date format to validate against.

    Returns:
        pd.DataFrame: A DataFrame containing rows that violate the date format rule.
                      An additional column 'dq_status' is added to indicate the
                      validation status in the format "{field}:{check}:{fmt}".
    """
    field, check, fmt = __extract_params(rule)
    pattern = __transform_date_format_in_pattern(fmt)
    mask = ~df[field].astype(str).str.match(pattern, na=False) | df[field].isna()
    viol = df[mask].copy()
    viol["dq_status"] = f"{field}:{check}:{fmt}"
    return viol
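
The private helper that turns a date format into a regular expression is not shown on this page. The sketch below is one possible way to do it, assuming tokens of the form `YYYY`, `MM`, and `DD`. The function `format_to_pattern` and its token table are hypothetical and may differ from what the library does.

```python
import re

import pandas as pd

# Hypothetical token table; the library's helper may support other tokens.
TOKENS = {
    "YYYY": r"\d{4}",
    "MM": r"(0[1-9]|1[0-2])",
    "DD": r"(0[1-9]|[12]\d|3[01])",
}


def format_to_pattern(fmt: str) -> str:
    """Build an anchored regex from a date format string (illustrative only)."""
    pattern = re.escape(fmt)  # escape separators such as "-" or "."
    for token, regex in TOKENS.items():
        pattern = pattern.replace(token, regex)
    return f"^{pattern}$"


pattern = format_to_pattern("YYYY-MM-DD")
dates = pd.Series(["2024-01-31", "31/01/2024", None, "2024-13-01"])

# Same mask shape as in the source: non-matching or missing values are violations.
mask = ~dates.astype(str).str.match(pattern, na=False) | dates.isna()
```

A pattern check of this kind validates the shape of the string only: it rejects month 13 but would still accept an impossible date such as 2024-02-31.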

validate_schema(df, expected)

Validates the schema of a given DataFrame against an expected schema.

Parameters:

Name Type Description Default
df

The DataFrame whose schema needs to be validated.

required
expected

The expected schema, represented as a list of tuples where each tuple contains the column name and its data type.

required

Returns:

Type Description
Tuple[bool, List[Tuple[str, str]]]

Tuple[bool, List[Tuple[str, str]]]: A tuple containing:
  • A boolean indicating whether the schema matches the expected schema.
  • A list of tuples representing the errors, where each tuple contains the column name and a description of the mismatch.

Source code in sumeh/engine/pandas_engine.py
def validate_schema(df, expected) -> Tuple[bool, List[Tuple[str, str]]]:
    """
    Validates the schema of a given DataFrame against an expected schema.

    Args:
        df: The DataFrame whose schema needs to be validated.
        expected: The expected schema, represented as a list of tuples where each tuple
                  contains the column name and its data type.

    Returns:
        Tuple[bool, List[Tuple[str, str]]]: A tuple containing:
            - A boolean indicating whether the schema matches the expected schema.
            - A list of tuples representing the errors, where each tuple contains
              the column name and a description of the mismatch.
    """
    actual = __pandas_schema_to_list(df)
    result, errors = __compare_schemas(actual, expected)
    return result, errors
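
The two private helpers used above are not documented on this page. As a rough sketch of the idea, the code below derives a list of (column, dtype) tuples from a DataFrame and compares it with an expected list. The functions `schema_of` and `compare` and the error messages are hypothetical stand-ins, not the library's implementation.

```python
import pandas as pd


def schema_of(df: pd.DataFrame) -> list[tuple[str, str]]:
    """Describe a DataFrame as (column name, dtype name) tuples (illustrative)."""
    return [(name, str(dtype)) for name, dtype in df.dtypes.items()]


def compare(actual, expected):
    """Return (matches, errors) for two schemas given as lists of tuples."""
    actual_map = dict(actual)
    errors = []
    for name, dtype in expected:
        if name not in actual_map:
            errors.append((name, "missing"))
        elif actual_map[name] != dtype:
            errors.append((name, f"type mismatch: got {actual_map[name]}, expected {dtype}"))
    return not errors, errors


df = pd.DataFrame(
    {
        "id": pd.Series([1, 2], dtype="int64"),
        "score": pd.Series([0.5, 1.5], dtype="float64"),
    }
)
expected = [("id", "int64"), ("score", "int64"), ("created_at", "datetime64[ns]")]

ok, errors = compare(schema_of(df), expected)
```

In this example `ok` is false: `score` has the wrong dtype and `created_at` is absent. Columns present in the DataFrame but not in `expected` are ignored by this sketch, which is a design choice that the real comparison may or may not share.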