Skip to content

Quickstart 🚀

A concise guide to get started with Sumeh’s unified data quality framework.

1. Installation 💻

Install Sumeh via pip (recommended) or conda:

pip install sumeh
# or
conda install -c conda-forge sumeh

2. Loading Rules and Schema Configuration ⚙️

Use get_rules_config and get_schema_config to fetch your validation rules and expected schema from various sources.

from sumeh import get_rules_config, get_schema_config

# Load rules from CSV
rules = get_rules_config("path/to/rules.csv", delimiter=';')

# Load expected schema from Glue Data Catalog
schema = get_schema_config(
    "glue",
    catalog_name="my_catalog",
    database_name="my_db",
    table_name="my_table"
)

Supported rule/schema sources include:

  • bigquery://project.dataset.table 🌐
  • s3://bucket/path ☁️
  • Local CSV (*.csv) 📄
  • Relational ("mysql", "postgresql") via kwargs 🗄️
  • AWS Glue ("glue") 🔥
  • DuckDB (duckdb://db_path.table) 🦆
  • Databricks (databricks://catalog.schema.table) 💎

3. Schema Validation 📐

Before validating data, ensure your DataFrame or connection matches the expected schema:

from sumeh import validate_schema

# For a Spark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("data.parquet")

is_valid, errors = validate_schema(
    df,
    expected=schema,
    engine="pyspark_engine"
)
if not is_valid:
    print("Schema mismatches:", errors)

4. Data Validation 🔍

Apply your loaded rules to any supported DataFrame using validate:

from sumeh import validate

# Example with Pandas:
import pandas as pd
df = pd.read_csv("data.csv")

# Validate (detects engine automatically)
result = validate(df, rules)
# `result` structure depends on engine (e.g., CheckResult for cuallee engines)

5. Summarization 📊

Generate a tabular summary of violations and pass rates with summarize:

from sumeh import summarize

# For DataFrames requiring manual total_rows (e.g., Pandas):
total = len(df)
summary_df = summarize(
    df=result,       # could be validation output or raw DataFrame
    rules=rules,
    total_rows=total
)
print(summary_df)

6. One-Step Reporting 📝

Use report for an end-to-end quality check and summary in one call:

from sumeh import report

report_df = report(
    df,              # your DataFrame or connection
    rules,
    name="My Quality Check"
)
print(report_df)

For deeper customization and engine-specific options, explore the full API and examples in the Sumeh repository.