Dimensions (calendar and geography)

Building the date, country, and city dimensions, plus their orchestration.

fabrictools.build_dimension_date(start_date: str | None = None, end_date: str | None = None, fiscal_year_start_month: int = 1, lakehouse_name: str | None = None, lakehouse_relative_path: str | None = None, warehouse_name: str | None = None, warehouse_table: str | None = None, default_relative_path: str = 'Dimension_Date', mode: str = 'overwrite', batch_size: int = 10000, spark: pyspark.sql.SparkSession | None = None) → pyspark.sql.DataFrame

Build a calendar date dimension (keys, labels, fiscal attributes, weekend flag).

When start_date / end_date are omitted, the range defaults (inclusive) to January 1st of current_year - (current_year % 100) through December 31st of current_year + 4.
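
The default bounds can be sketched as follows; default_date_range is an illustrative helper, not part of fabrictools:

```python
from datetime import date

def default_date_range(today: date) -> tuple[str, str]:
    # Illustrative only: reproduce the documented defaults used when
    # start_date / end_date are omitted.
    year = today.year
    start = date(year - (year % 100), 1, 1)   # January 1st of the century start
    end = date(year + 4, 12, 31)              # December 31st, four years ahead
    return start.isoformat(), end.isoformat()

# In 2024, the defaults span 2000-01-01 through 2028-12-31.
print(default_date_range(date(2024, 6, 1)))
```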

Parameters:
  • start_date (str | None) – Inclusive lower bound yyyy-MM-dd, or None for default.

  • end_date (str | None) – Inclusive upper bound yyyy-MM-dd, or None for default.

  • fiscal_year_start_month (int) – First fiscal month (1–12).

  • lakehouse_name (str | None) – If set with lakehouse_relative_path, write Delta there.

  • lakehouse_relative_path (str | None) – Path under the Lakehouse for the dimension table.

  • warehouse_name (str | None) – If set with warehouse_table, JDBC-write to Warehouse.

  • warehouse_table (str | None) – Fully qualified warehouse table name.

  • default_relative_path (str) – Fallback Lakehouse path segment when none given.

  • mode (str) – Spark write mode for persistence.

  • batch_size (int) – JDBC batch size when writing to the Warehouse.

  • spark (SparkSession | None) – Optional SparkSession.

Returns:

Ordered date dimension dataframe.

Return type:

DataFrame

Raises:

ValueError – If fiscal_year_start_month is outside 1..12.
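
One way the fiscal attributes and this validation might be computed is sketched below. The fiscal-year labelling convention (the fiscal year takes the calendar year in which it ends) is an assumption for illustration, not necessarily the one fabrictools uses:

```python
from datetime import date

def fiscal_year_and_quarter(d: date, fiscal_year_start_month: int) -> tuple[int, int]:
    # Guard mirrors the documented ValueError.
    if not 1 <= fiscal_year_start_month <= 12:
        raise ValueError("fiscal_year_start_month must be in 1..12")
    # Months elapsed since the fiscal year began.
    offset = (d.month - fiscal_year_start_month) % 12
    # Assumed convention: label the fiscal year by the year in which it ends.
    fiscal_year = d.year + (1 if fiscal_year_start_month > 1
                            and d.month >= fiscal_year_start_month else 0)
    fiscal_quarter = offset // 3 + 1
    return fiscal_year, fiscal_quarter

# With a July fiscal start, 2024-07-01 opens FY2025 Q1.
print(fiscal_year_and_quarter(date(2024, 7, 1), 7))
```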

Example

>>> dim_date = build_dimension_date(
...     start_date="2024-01-01",
...     end_date="2024-12-31",
...     lakehouse_name="GoldLakehouse",
...     lakehouse_relative_path="dimension_date",
... )
fabrictools.build_dimension_country(countries_limit: int | None = None, fail_on_source_error: bool = True, lakehouse_name: str | None = None, lakehouse_relative_path: str | None = None, warehouse_name: str | None = None, warehouse_table: str | None = None, default_relative_path: str = 'Dimension_Country', mode: str = 'overwrite', batch_size: int = 10000, spark: pyspark.sql.SparkSession | None = None) → pyspark.sql.DataFrame

Build dimension_country from the countrystatecity-countries package.

Parameters:
  • countries_limit (int | None) – Optional cap on countries processed (source list order).

  • fail_on_source_error (bool) – If True, raise on source errors; otherwise log and return an empty frame.

  • lakehouse_name (str | None) – Optional Lakehouse for Delta output.

  • lakehouse_relative_path (str | None) – Path under Lakehouse (defaults via default_relative_path).

  • warehouse_name (str | None) – Optional Warehouse for JDBC output.

  • warehouse_table (str | None) – Fully qualified warehouse table.

  • default_relative_path (str) – Default Lakehouse relative path segment.

  • mode (str) – Spark write mode.

  • batch_size (int) – JDBC batch size for warehouse writes.

  • spark (SparkSession | None) – Optional SparkSession.

Returns:

Country dimension dataframe.

Return type:

DataFrame

Raises:
  • ImportError – When countrystatecity-countries is not installed.

  • RuntimeError – When fail_on_source_error is True and building fails.

Example

>>> countries = build_dimension_country(
...     lakehouse_name="GoldLakehouse",
...     lakehouse_relative_path="dimension_country",
... )
fabrictools.build_dimension_city(countries_limit: int | None = None, include_states_metadata: bool = True, fail_on_source_error: bool = True, regions: list[str] | None = None, subregions: list[str] | None = None, countries: list[str] | None = None, lakehouse_name: str | None = None, lakehouse_relative_path: str | None = None, warehouse_name: str | None = None, warehouse_table: str | None = None, default_relative_path: str = 'Dimension_City', mode: str = 'overwrite', batch_size: int = 10000, spark: pyspark.sql.SparkSession | None = None) → pyspark.sql.DataFrame

Build dimension_city from countrystatecity-countries with optional filters.

countries may list country_code_2, country_code_3, or country_name values (case-insensitive). regions and subregions narrow by geography; filters combine with AND logic.
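
As a sketch of how those filters combine, assuming AND semantics and case-insensitive matching as documented (the rows and keep helper below are hypothetical, not the source package's real API):

```python
# Hypothetical rows standing in for the source package's country metadata.
rows = [
    {"country_code_2": "FR", "region": "EUROPE", "subregion": "WESTERN EUROPE"},
    {"country_code_2": "DE", "region": "EUROPE", "subregion": "WESTERN EUROPE"},
    {"country_code_2": "JP", "region": "ASIA", "subregion": "EASTERN ASIA"},
]

def keep(row, regions=None, subregions=None, countries=None):
    # Every provided filter must match (AND logic); comparisons are
    # case-insensitive, mirroring the documented behaviour.
    if regions is not None and row["region"] not in {r.upper() for r in regions}:
        return False
    if subregions is not None and row["subregion"] not in {s.upper() for s in subregions}:
        return False
    if countries is not None and row["country_code_2"] not in {c.upper() for c in countries}:
        return False
    return True

selected = [r["country_code_2"] for r in rows
            if keep(r, regions=["Europe"], countries=["fr", "de"])]
# selected == ["FR", "DE"]
```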

Parameters:
  • countries_limit (int | None) – Optional cap on countries iterated.

  • include_states_metadata (bool) – If True, resolve state names via get_states_of_country.

  • fail_on_source_error (bool) – If True, raise on failure; otherwise log and return an empty frame.

  • regions (list[str] | None) – Allow-list of region names (uppercased internally).

  • subregions (list[str] | None) – Allow-list of subregion names.

  • countries (list[str] | None) – Allow-list of country identifiers or names.

  • lakehouse_name (str | None) – Optional Lakehouse for Delta output.

  • lakehouse_relative_path (str | None) – Lakehouse path for the dimension table.

  • warehouse_name (str | None) – Optional Warehouse name.

  • warehouse_table (str | None) – Fully qualified warehouse table.

  • default_relative_path (str) – Default Lakehouse path segment.

  • mode (str) – Spark write mode.

  • batch_size (int) – JDBC batch size.

  • spark (SparkSession | None) – Optional SparkSession.

Returns:

City dimension dataframe.

Return type:

DataFrame

Example

>>> cities = build_dimension_city(
...     lakehouse_name="GoldLakehouse",
...     lakehouse_relative_path="dimension_city",
...     countries=["FR", "DE"],
... )
fabrictools.generate_dimensions(lakehouse_name: str | None = None, warehouse_name: str | None = None, include_date: bool = True, include_country: bool = True, include_city: bool = True, start_date: str | None = None, end_date: str | None = None, fiscal_year_start_month: int = 1, countries_limit: int | None = None, include_states_metadata: bool = True, fail_on_source_error: bool = True, city_regions: list[str] | None = None, city_subregions: list[str] | None = None, city_countries: list[str] | None = None, mode: str = 'overwrite', batch_size: int = 10000, date_relative_path: str = 'Dimension_Date', country_relative_path: str = 'Dimension_Country', city_relative_path: str = 'Dimension_City', date_warehouse_table: str = 'dbo.Dimension_Date', country_warehouse_table: str = 'dbo.Dimension_Country', city_warehouse_table: str = 'dbo.Dimension_City', spark: pyspark.sql.SparkSession | None = None) → dict[str, pyspark.sql.DataFrame]

Build enabled dimensions and persist each to the configured Lakehouse and/or Warehouse.

Keys in the returned map mirror the chosen relative path (or warehouse table name).
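
The key derivation might look like the sketch below; the precedence (Lakehouse relative path first, then warehouse table, then the default segment) is an assumption for illustration:

```python
def result_key(relative_path, warehouse_table, default_path):
    # Assumed precedence: an explicit Lakehouse relative path names the
    # entry; otherwise the warehouse table; otherwise the default segment.
    return relative_path or warehouse_table or default_path

keys = [
    result_key(None, None, "Dimension_Date"),
    result_key(None, "dbo.Dimension_Country", "Dimension_Country"),
    result_key("dimension_city", None, "Dimension_City"),
]
# keys == ["Dimension_Date", "dbo.Dimension_Country", "dimension_city"]
```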

Parameters:
  • lakehouse_name (str | None) – Optional Lakehouse for all dimension writes.

  • warehouse_name (str | None) – Optional Warehouse for JDBC writes.

  • include_date (bool) – Build date dimension when True.

  • include_country (bool) – Build country dimension when True.

  • include_city (bool) – Build city dimension when True.

  • start_date (str | None) – Passed to fabrictools.build_dimension_date().

  • end_date (str | None) – Passed to build_dimension_date.

  • fiscal_year_start_month (int) – Passed to build_dimension_date.

  • countries_limit (int | None) – Passed to geo builders.

  • include_states_metadata (bool) – Passed to fabrictools.build_dimension_city().

  • fail_on_source_error (bool) – Passed to geo builders.

  • city_regions (list[str] | None) – Passed as regions to build_dimension_city.

  • city_subregions (list[str] | None) – Passed as subregions to build_dimension_city.

  • city_countries (list[str] | None) – Passed as countries to build_dimension_city.

  • mode (str) – Write mode for all targets.

  • batch_size (int) – JDBC batch size for warehouse writes.

  • date_relative_path (str) – Lakehouse path for the date table.

  • country_relative_path (str) – Lakehouse path for the country table.

  • city_relative_path (str) – Lakehouse path for the city table.

  • date_warehouse_table (str) – Warehouse table for date dimension.

  • country_warehouse_table (str) – Warehouse table for country dimension.

  • city_warehouse_table (str) – Warehouse table for city dimension.

  • spark (SparkSession | None) – Optional SparkSession.

Returns:

Map of dimension key to dataframe.

Return type:

dict[str, DataFrame]

Raises:

ValueError – If all dimension flags are False.

Example

>>> dims = generate_dimensions(
...     lakehouse_name="GoldLakehouse",
...     include_date=True,
...     include_country=True,
...     include_city=False,
... )