Module sumeh.core

This module provides a set of functions and utilities for data validation, schema retrieval, and summarization. It supports multiple data sources and engines, including BigQuery, S3, CSV files, MySQL, PostgreSQL, AWS Glue, DuckDB, and Databricks.

Functions:

Name                Description
get_rules_config    get_rules_config(source: str, **kwargs) -> List[Dict[str, Any]]: Retrieves configuration rules based on the specified source.
get_schema_config   get_schema_config(source: str, **kwargs) -> List[Dict[str, Any]]: Retrieves the schema configuration based on the provided data source.
validate            validate(df, rules, **context): Validates a DataFrame against a set of rules using the engine detected from the input DataFrame type.
summarize           summarize(..., rules: list[dict], **context): Summarizes validation results by delegating to the engine detected from the input DataFrame type.
report              report(..., rules: list[dict], name: str = "Quality Check"): Runs cuallee-based quality checks covering completeness, uniqueness, value comparisons, patterns, and date-related rules.

Imports

  • cuallee: Provides the Check and CheckLevel classes for data validation.
  • warnings: Used to issue warnings for unknown rule names.
  • importlib: Dynamically imports modules based on engine detection.
  • typing: Provides type hints for function arguments and return values.
  • re: Used for regular expression matching in source string parsing.
  • sumeh.core: Contains functions for retrieving configurations and schemas from various data sources.
  • sumeh.core.utils: Provides utility functions for value conversion and URI parsing.

The module uses Python's structural pattern matching (match-case) to handle different data source types and validation rules. The report function supports a wide range of validation checks, including completeness, uniqueness, value comparisons, patterns, and date-related checks. The validate and summarize functions dynamically detect the appropriate engine based on the input DataFrame type and delegate the processing to the corresponding engine module.
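
A typical workflow loads the rule definitions with get_rules_config and then hands the rules and the DataFrame to validate, which detects the engine for you. A minimal sketch, assuming a pandas DataFrame and a local rules.csv file (both names are illustrative, and a pandas-compatible engine must be available in your installation):

>>> import pandas as pd
>>> from sumeh.core import get_rules_config, validate
>>> df = pd.DataFrame({"id": [1, 2, 2], "email": ["a@x.com", None, "c@x.com"]})
>>> rules = get_rules_config("rules.csv")
>>> result = validate(df, rules)  # engine detected from the DataFrame type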

get_rules_config

get_rules_config(source: str, **kwargs) -> List[Dict[str, Any]]

Retrieve configuration rules based on the specified source.

Dispatches to the appropriate loader according to the format of source, returning a list of parsed rule dictionaries.

Supported sources
  • "bigquery" <project>.<dataset>.<table>
  • s3://<bucket>/<path>
  • <file>.csv
  • "mysql" or "postgresql" (requires host/user/etc. in kwargs)
  • "glue" (AWS Glue Data Catalog)
  • duckdb://<db_path>.<table>
  • databricks://<catalog>.<schema>.<table>

Parameters:

  source (str, required)
      Identifier of the rules configuration location. Determines which handler is invoked.

  **kwargs (default: {})
      Loader-specific parameters (e.g. host, user, password, connection, query, delimiter).

Returns:

  List[Dict[str, Any]]
      A list of dictionaries, each representing a validation rule with keys like "field", "check_type", "value", "threshold", and "execute".

Raises:

  ValueError
      If source does not match any supported format.
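
Examples:

The returned dictionaries can be passed straight to validate. The calls below are illustrative sketches; the host, credentials, and rule values depend on your setup, and only the keys "field", "check_type", "value", "threshold", and "execute" are documented above:

>>> rules = get_rules_config("rules.csv", delimiter=";")
>>> rules = get_rules_config("mysql", host="db.internal", user="qa", password="***", database="quality")
>>> rules[0]
{'field': 'email', 'check_type': 'is_complete', 'value': None, 'threshold': 1.0, 'execute': True}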

get_schema_config

get_schema_config(source: str, **kwargs) -> List[Dict[str, Any]]

Retrieve the schema configuration based on the provided data source.

This function reads from a schema_registry table/file to get the expected schema for a given table. Supports various data sources such as BigQuery, S3, CSV files, MySQL, PostgreSQL, AWS Glue, DuckDB, and Databricks.

Parameters:

  source (str, required)
      A string representing the data source. Supported formats:
        • bigquery: BigQuery source
        • s3://<bucket>/<path>: S3 CSV file
        • <file>.csv: Local CSV file
        • mysql: MySQL database
        • postgresql: PostgreSQL database
        • glue: AWS Glue Data Catalog
        • duckdb: DuckDB database
        • databricks: Databricks Unity Catalog

  **kwargs (default: {})
      Source-specific parameters. Common ones:
        • table (str): Table name to look up (required for all sources)
        • environment (str): Environment filter (default: 'prod')
        • query (str): Additional WHERE filters (optional)

      Per-source parameters:
        • BigQuery: project_id, dataset_id, table_id
        • MySQL/PostgreSQL: host, user, password, database OR conn
        • Glue: glue_context, database_name, table_name
        • DuckDB: conn, table
        • Databricks: spark, catalog, schema, table
        • CSV/S3: file_path/s3_path, table

Returns:

  List[Dict[str, Any]]
      Schema configuration from the schema_registry.

Raises:

  ValueError
      If the source format is invalid or required parameters are missing.

Examples:

>>> get_schema_config("bigquery", project_id="proj", dataset_id="ds", table_id="users")
>>> get_schema_config("mysql", conn=my_conn, table="users")
>>> get_schema_config("s3://bucket/registry.csv", table="users", environment="prod")

validate

validate(df, rules, **context)

Validates a DataFrame against a set of rules using the appropriate engine.

This function dynamically detects the engine to use based on the input DataFrame and delegates the validation process to the corresponding engine's implementation.

Parameters:

  df (DataFrame, required)
      The input DataFrame to be validated.

  rules (list or dict, required)
      The validation rules to be applied to the DataFrame.

  **context (default: {})
      Additional context parameters that may be required by the engine.
        • conn (optional): A database connection object, required for certain engines such as "duckdb_engine".

Returns:

  bool or dict
      The result of the validation process. The return type and structure depend on the specific engine's implementation.

Raises:

  ImportError
      If the required engine module cannot be imported.

  AttributeError
      If the detected engine does not have a validate method.

Notes
  • The engine is dynamically determined based on the DataFrame type or other characteristics.
  • For "duckdb_engine", a database connection object should be provided in the context under the key "conn".