fabrictools.quality.clean

Pure DataFrame cleaning helpers.

fabrictools.quality.clean.add_silver_metadata(df: pyspark.sql.DataFrame, source_lakehouse_name: str, source_relative_path: str, source_layer: str = 'bronze', ingestion_timestamp_col: str = 'ingestion_timestamp', source_layer_col: str = 'ingestion_source_layer', source_path_col: str = 'ingestion_source_path', year_col: str = 'ingestion_year', month_col: str = 'ingestion_month', day_col: str = 'ingestion_day', spark: pyspark.sql.SparkSession | None = None, verbose: bool = False) → pyspark.sql.DataFrame

Add Silver-layer metadata columns (ingestion time, source path, date parts).

Resolves source_relative_path with fabrictools.io.lakehouse.resolve_lakehouse_read_candidate(). Date partition columns (year_col / month_col / day_col) are derived from the current ingestion date.

Parameters:
  • df (DataFrame) – Bronze or intermediate dataframe.

  • source_lakehouse_name (str) – Source Lakehouse display name.

  • source_relative_path (str) – Source path passed to path resolution.

  • source_layer (str) – Literal stored in source_layer_col (default bronze).

  • ingestion_timestamp_col (str) – Column name for current_timestamp().

  • source_layer_col (str) – Column name for the layer literal.

  • source_path_col (str) – Column name for the resolved relative path string.

  • year_col (str) – Partition year column name.

  • month_col (str) – Partition month column name.

  • day_col (str) – Partition day-of-month column name.

  • spark (SparkSession | None) – Optional SparkSession for path resolution.

Returns:

df with metadata and partition columns appended/overwritten.

Return type:

DataFrame

Example

>>> silver_df = add_silver_metadata(
...     bronze_df,
...     source_lakehouse_name="BronzeLakehouse",
...     source_relative_path="dbo.RawOrders",
... )
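The derivation of the date partition columns described above can be sketched in plain Python. This is an illustrative stand-in only: the column names match the defaults (year_col / month_col / day_col), but the helper function and the use of datetime here are assumptions; the real implementation derives the parts from Spark's current ingestion timestamp.

```python
from datetime import datetime, timezone


def derive_date_parts(ingestion_time: datetime) -> dict:
    """Split an ingestion timestamp into the year/month/day values that
    add_silver_metadata stores in its partition columns (hypothetical sketch)."""
    return {
        "ingestion_year": ingestion_time.year,
        "ingestion_month": ingestion_time.month,
        "ingestion_day": ingestion_time.day,
    }


parts = derive_date_parts(datetime(2024, 3, 7, tzinfo=timezone.utc))
# parts == {"ingestion_year": 2024, "ingestion_month": 3, "ingestion_day": 7}
```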
fabrictools.quality.clean.clean_data(df: pyspark.sql.DataFrame, drop_duplicates: bool = True, drop_all_null_rows: bool = True, verbose: bool = False) → pyspark.sql.DataFrame

Normalize names, trim empty strings to null, infer types, optionally dedupe.

Renames columns to unique snake_case (via internal helpers), replaces blank strings with null on string columns, runs detect_and_cast_columns(), then optionally drops duplicate rows and rows that are all-null.

Parameters:
  • df (DataFrame) – Input dataframe.

  • drop_duplicates (bool) – If True, call dropDuplicates() after cleaning.

  • drop_all_null_rows (bool) – If True, call dropna(how="all").

Returns:

Cleaned dataframe.

Return type:

DataFrame

Example

>>> cleaned = clean_data(raw_df, drop_duplicates=True, drop_all_null_rows=True)
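The name normalization and blank-trimming steps can be illustrated with a minimal pure-Python sketch. Both helper names are hypothetical; the library's internal helpers additionally guarantee that the resulting column names are unique.

```python
import re


def to_snake_case(name: str) -> str:
    """Rough snake_case normalization (assumed behaviour): split camelCase,
    collapse non-alphanumeric runs to underscores, lowercase."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.strip("_").lower()


def blank_to_null(value):
    """Blank or whitespace-only strings become None, mirroring what
    clean_data does for string columns."""
    if isinstance(value, str) and value.strip() == "":
        return None
    return value


print(to_snake_case("Order ID"))      # order_id
print(to_snake_case("customerName"))  # customer_name
print(blank_to_null("   "))           # None
```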
fabrictools.quality.clean.detect_and_cast_columns(df: pyspark.sql.DataFrame, verbose: bool = False) → pyspark.sql.DataFrame

Infer primitive types from string columns and cast when the column is uniform.

Order of detection (first match wins):
  • date – uniform non-null success of a to_date / to_timestamp chain over several patterns, with European forms tried before US for ambiguous day/month. Strings with a trailing time-of-day may still yield a calendar day and are cast to date, dropping the time part; US slash dates with a 12-hour clock and AM/PM suffix are handled via h:mm[:ss] a patterns.

  • timestamp – to_timestamp with several patterns, including US 12-hour with AM/PM, 24-hour, and ISO T forms.

  • integer – the full string matches ^[+-]?\d+$.

  • double – decimal or scientific notation.

  • otherwise – the column remains string.

Columns that are all-null are skipped; null cells are preserved through casts.
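The numeric part of the first-match-wins chain can be illustrated with plain regular expressions. The integer pattern is the one quoted above; the double pattern is an assumed sketch of "decimal or scientific notation" and may differ from the library's exact expression.

```python
import re

# Integer pattern as documented; double pattern is illustrative.
INT_RE = re.compile(r"^[+-]?\d+$")
DOUBLE_RE = re.compile(r"^[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?$")


def classify(value: str) -> str:
    """First match wins: integer is checked before double, so a bare
    digit string never falls through to the double branch."""
    if INT_RE.fullmatch(value):
        return "integer"
    if DOUBLE_RE.fullmatch(value):
        return "double"
    return "string"


print(classify("42"))       # integer
print(classify("-3.14"))    # double
print(classify("6.02e23"))  # double
print(classify("abc"))      # string
```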

Sets spark.sql.legacy.timeParserPolicy to LEGACY for the duration of the call and restores the previous session value afterward.
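The set-then-restore behaviour follows the standard try/finally pattern. A sketch with a plain dict standing in for the session configuration (the real code works against spark.conf; the helper name here is hypothetical):

```python
def with_legacy_time_parser(conf: dict, work):
    """Set spark.sql.legacy.timeParserPolicy to LEGACY while `work` runs,
    then restore whatever value (or absence) was there before."""
    key = "spark.sql.legacy.timeParserPolicy"
    previous = conf.get(key)
    conf[key] = "LEGACY"
    try:
        return work()
    finally:
        if previous is None:
            conf.pop(key, None)
        else:
            conf[key] = previous


conf = {"spark.sql.legacy.timeParserPolicy": "CORRECTED"}
with_legacy_time_parser(conf, lambda: None)
# conf is back to {"spark.sql.legacy.timeParserPolicy": "CORRECTED"}
```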

Parameters:
  • df (DataFrame) – Input dataframe.

Returns:

Dataframe with qualifying string columns cast.

Return type:

DataFrame