Module sumeh.core

This module provides a set of functions and utilities for data validation, schema retrieval, and summarization. It supports multiple data sources and engines, including BigQuery, S3, CSV files, MySQL, PostgreSQL, AWS Glue, DuckDB, and Databricks.

Functions:

Name                Description
get_rules_config    get_rules_config(source: str, **kwargs) -> List[Dict[str, Any]]: Retrieves configuration rules based on the specified source.
get_schema_config   get_schema_config(source: str, **kwargs) -> List[Dict[str, Any]]: Retrieves the schema configuration based on the provided data source.
validate            validate(df, rules, **context): Validates a DataFrame against a set of rules using the engine detected from the input DataFrame type.
summarize           summarize(..., rules: list[dict], **context): Summarizes validation results by delegating to the engine detected from the input DataFrame type.
report              report(..., rules: list[dict], name: str = "Quality Check"): Runs cuallee-based quality checks covering completeness, uniqueness, value comparisons, patterns, and date-related rules.

Imports

  • cuallee: Provides the Check and CheckLevel classes for data validation.
  • warnings: Used to issue warnings for unknown rule names.
  • importlib: Dynamically imports modules based on engine detection.
  • typing: Provides type hints for function arguments and return values.
  • re: Used for regular expression matching in source string parsing.
  • sumeh.core: Contains functions for retrieving configurations and schemas from various data sources.
  • sumeh.core.utils: Provides utility functions for value conversion and URI parsing.

The module uses Python's structural pattern matching (match-case) to handle different data source types and validation rules. The report function supports a wide range of validation checks, including completeness, uniqueness, value comparisons, patterns, and date-related checks. The validate and summarize functions dynamically detect the appropriate engine based on the input DataFrame type and delegate the processing to the corresponding engine module.
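
A typical workflow loads the rule definitions with get_rules_config and then hands the rules and the DataFrame to validate, which detects the engine for you. A minimal sketch, assuming a pandas DataFrame and a local rules.csv file (both names are illustrative, and a pandas-compatible engine must be available in your installation):

>>> import pandas as pd
>>> from sumeh.core import get_rules_config, validate
>>> df = pd.DataFrame({"id": [1, 2, 2], "email": ["a@x.com", None, "c@x.com"]})
>>> rules = get_rules_config("rules.csv")
>>> result = validate(df, rules)  # engine detected from the DataFrame type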

get_rules_config

get_rules_config(source: str, **kwargs) -> List[Dict[str, Any]]

Retrieve configuration rules based on the specified source.

Dispatches to the appropriate loader according to the format of source, returning a list of parsed rule dictionaries.

Supported sources
  • "bigquery" <project>.<dataset>.<table>
  • s3://<bucket>/<path>
  • <file>.csv
  • "mysql" or "postgresql" (requires host/user/etc. in kwargs)
  • "glue" (AWS Glue Data Catalog)
  • duckdb://<db_path>.<table>
  • databricks://<catalog>.<schema>.<table>

Parameters:

  source (str, required)
      Identifier of the rules configuration location. Determines which handler is invoked.

  **kwargs (default: {})
      Loader-specific parameters (e.g. host, user, password, connection, query, delimiter).

Returns:

  List[Dict[str, Any]]
      A list of dictionaries, each representing a validation rule with keys like "field", "check_type", "value", "threshold", and "execute".

Raises:

  ValueError
      If source does not match any supported format.
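
Examples:

The returned dictionaries can be passed straight to validate. The calls below are illustrative sketches; the host, credentials, and rule values depend on your setup, and only the keys "field", "check_type", "value", "threshold", and "execute" are documented above:

>>> rules = get_rules_config("rules.csv", delimiter=";")
>>> rules = get_rules_config("mysql", host="db.internal", user="qa", password="***", database="quality")
>>> rules[0]
{'field': 'email', 'check_type': 'is_complete', 'value': None, 'threshold': 1.0, 'execute': True}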

get_schema_config

get_schema_config(source: str, **kwargs) -> List[Dict[str, Any]]

Retrieve the schema configuration based on the provided data source.

This function reads from a schema_registry table/file to get the expected schema for a given table. Supports various data sources such as BigQuery, S3, CSV files, MySQL, PostgreSQL, AWS Glue, DuckDB, and Databricks.

Parameters:

  source (str, required)
      A string representing the data source. Supported formats:
        • bigquery: BigQuery source
        • s3://<bucket>/<path>: S3 CSV file
        • <file>.csv: Local CSV file
        • mysql: MySQL database
        • postgresql: PostgreSQL database
        • glue: AWS Glue Data Catalog
        • duckdb: DuckDB database
        • databricks: Databricks Unity Catalog

  **kwargs (default: {})
      Source-specific parameters. Common ones:
        • table (str): Table name to look up (required for all sources)
        • environment (str): Environment filter (default: 'prod')
        • query (str): Additional WHERE filters (optional)

      Per-source parameters:
        • BigQuery: project_id, dataset_id, table_id
        • MySQL/PostgreSQL: host, user, password, database OR conn
        • Glue: glue_context, database_name, table_name
        • DuckDB: conn, table
        • Databricks: spark, catalog, schema, table
        • CSV/S3: file_path/s3_path, table

Returns:

  List[Dict[str, Any]]
      Schema configuration from the schema_registry.

Raises:

  ValueError
      If the source format is invalid or required parameters are missing.

Examples:

>>> get_schema_config("bigquery", project_id="proj", dataset_id="ds", table_id="users")
>>> get_schema_config("mysql", conn=my_conn, table="users")
>>> get_schema_config("s3://bucket/registry.csv", table="users", environment="prod")

validate

validate(df, rules, **context)

Validates a DataFrame against a set of rules using the appropriate engine.

This function dynamically detects the engine to use based on the input DataFrame and delegates the validation process to the corresponding engine's implementation.

Parameters:

  df (DataFrame, required)
      The input DataFrame to be validated.

  rules (list or dict, required)
      The validation rules to be applied to the DataFrame.

  **context (default: {})
      Additional context parameters that may be required by the engine.
        • conn (optional): A database connection object, required for certain engines such as "duckdb_engine".

Returns:

  bool or dict
      The result of the validation process. The return type and structure depend on the specific engine's implementation.

Raises:

  ImportError
      If the required engine module cannot be imported.

  AttributeError
      If the detected engine does not have a validate method.

Notes
  • The engine is dynamically determined based on the DataFrame type or other characteristics.
  • For "duckdb_engine", a database connection object should be provided in the context under the key "conn".