Module sumeh.core
sumeh.core
This module provides a set of functions and utilities for data validation, schema retrieval, and summarization. It supports multiple data sources and engines, including BigQuery, S3, CSV files, MySQL, PostgreSQL, AWS Glue, DuckDB, and Databricks.
Functions:
Name | Description |
---|---|
get_rules_config | (source: str, **kwargs) -> List[Dict[str, Any]]: Retrieves configuration rules based on the specified source. |
get_schema_config | (source: str, **kwargs) -> List[Dict[str, Any]]: Retrieves the schema configuration based on the provided data source. |
validate | Validates a DataFrame against a set of rules using the appropriate engine. |
summarize | (…: list[dict], **context) |
report | (…: list[dict], name: str = "Quality Check") |
Imports
cuallee: Provides the Check and CheckLevel classes for data validation.
warnings: Used to issue warnings for unknown rule names.
importlib: Dynamically imports modules based on engine detection.
typing: Provides type hints for function arguments and return values.
re: Used for regular expression matching in source string parsing.
sumeh.core: Contains functions for retrieving configurations and schemas from various data sources.
sumeh.core.utils: Provides utility functions for value conversion and URI parsing.
The module uses Python's structural pattern matching (match-case) to handle different data source types and validation rules.
The report function supports a wide range of validation checks, including completeness, uniqueness, value comparisons, patterns, and date-related checks.
The validate and summarize functions dynamically detect the appropriate engine based on the input DataFrame type and delegate the processing to the corresponding engine module.
get_rules_config
Retrieve configuration rules based on the specified source.
Dispatches to the appropriate loader according to the format of source, returning a list of parsed rule dictionaries.
Supported sources
- "bigquery" <project>.<dataset>.<table>
- s3://<bucket>/<path>
- <file>.csv
- "mysql" or "postgresql" (requires host/user/etc. in kwargs)
- "glue" (AWS Glue Data Catalog)
- duckdb://<db_path>.<table>
- databricks://<catalog>.<schema>.<table>
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | str | Identifier of the rules configuration location. Determines which handler is invoked. | required |
**kwargs | | Loader-specific parameters (e.g. …) | {} |
Returns:
Type | Description |
---|---|
List[Dict[str, Any]] | A list of dictionaries, each representing a validation rule with keys like … |
Raises:
Type | Description |
---|---|
ValueError | If … |
get_schema_config
Retrieve the schema configuration based on the provided data source.
This function reads from a schema_registry table/file to get the expected schema for a given table. Supports various data sources such as BigQuery, S3, CSV files, MySQL, PostgreSQL, AWS Glue, DuckDB, and Databricks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | str | A string representing the data source. Supported formats: … | required |
**kwargs | | Source-specific parameters. Common ones: table (str): table name to look up (REQUIRED for all sources); environment (str): environment filter (default: 'prod'); query (str): additional WHERE filters (optional). For BigQuery: project_id, dataset_id, table_id. For MySQL/PostgreSQL: host, user, password, database OR conn. For Glue: glue_context, database_name, table_name. For DuckDB: conn, table. For Databricks: spark, catalog, schema, table. For CSV/S3: file_path/s3_path, table. | {} |
Returns:
Type | Description |
---|---|
List[Dict[str, Any]] | Schema configuration from schema_registry. |
Raises:
Type | Description |
---|---|
ValueError | If the source format is invalid or required params are missing. |
Examples:
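The lookup described above (one table, filtered by environment, with 'prod' as the default) can be illustrated on an in-memory registry. This is only a sketch of the filtering idea, not a call into sumeh: lookup_schema is a hypothetical helper, and the column names table_name, environment, field and data_type are assumptions about the registry layout.

```python
from typing import Any, Dict, List


def lookup_schema(
    registry: List[Dict[str, Any]],
    table: str,
    environment: str = "prod",
) -> List[Dict[str, Any]]:
    """Return the registry rows for one table in one environment."""
    if not table:
        # Mirrors the documented behaviour of rejecting missing required params.
        raise ValueError("'table' is required")
    return [
        row
        for row in registry
        # Column names are hypothetical; the real registry layout may differ.
        if row.get("table_name") == table and row.get("environment") == environment
    ]


# Toy registry with made-up rows, used only to exercise the helper.
registry = [
    {"table_name": "users", "environment": "prod", "field": "id", "data_type": "integer"},
    {"table_name": "users", "environment": "dev", "field": "id", "data_type": "string"},
    {"table_name": "orders", "environment": "prod", "field": "total", "data_type": "float"},
]
prod_users = lookup_schema(registry, table="users")
```

Because environment defaults to "prod", the call above selects only the first row; passing environment="dev" would select the second.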
validate
Validates a DataFrame against a set of rules using the appropriate engine.
This function dynamically detects the engine to use based on the input DataFrame and delegates the validation process to the corresponding engine's implementation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input DataFrame to be validated. | required |
rules | list or dict | The validation rules to be applied to the DataFrame. | required |
**context | | Additional context parameters that may be required by the engine. conn (optional): a database connection object, required for certain engines like "duckdb_engine". | {} |
Returns:
Type | Description |
---|---|
bool or dict | The result of the validation process. The return type and structure depend on the specific engine's implementation. |
Raises:
Type | Description |
---|---|
ImportError | If the required engine module cannot be imported. |
AttributeError | If the detected engine does not have a … |
Notes
- The engine is dynamically determined based on the DataFrame type or other characteristics.
- For "duckdb_engine", a database connection object should be provided in the context under the key "conn".
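The detect-then-delegate flow described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not sumeh's code: detect_engine, dispatch_validate, the demo_engines package and the FakeRelation class are all hypothetical, and the "<library>_engine" naming scheme is inferred only from the "duckdb_engine" name mentioned above. A stand-in engine module is registered in sys.modules so the example runs without any DataFrame library installed.

```python
import importlib
import sys
import types


def detect_engine(df) -> str:
    """Map a DataFrame-like object to an engine module name (assumed scheme)."""
    root = type(df).__module__.split(".")[0]
    # Assumed naming scheme: "<library>_engine", e.g. "duckdb_engine".
    return f"{root}_engine"


def dispatch_validate(df, rules, package="demo_engines", **context):
    """Import the detected engine and call its validate function."""
    engine_name = detect_engine(df)
    # import_module raises ImportError if the engine module is missing.
    engine = importlib.import_module(f"{package}.{engine_name}")
    # getattr raises AttributeError if the engine has no validate function.
    return getattr(engine, "validate")(df, rules, **context)


class FakeRelation:
    """Stand-in for a DuckDB relation; only its module name matters here."""
    __module__ = "duckdb"


# Register a stand-in engine so the import succeeds in this demo.
fake_engine = types.ModuleType("demo_engines.duckdb_engine")
fake_engine.validate = lambda df, rules, **ctx: {
    "rules": len(rules),
    "has_conn": "conn" in ctx,
}
sys.modules["demo_engines"] = types.ModuleType("demo_engines")
sys.modules["demo_engines.duckdb_engine"] = fake_engine

# The rule dictionary is a placeholder; conn is passed through the context.
result = dispatch_validate(FakeRelation(), [{"placeholder": "rule"}], conn=object())
```

Keeping detection and delegation separate means the two documented failure modes surface naturally: a missing engine module fails at import time, and an engine without the expected function fails at attribute lookup.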