Dimensions (calendar and geography)

Building the date, country, and city dimensions, plus their orchestration.

fabrictools.build_dimension_date(start_date: str | None = None, end_date: str | None = None, fiscal_year_start_month: int = 1, lakehouse_name: str | None = None, lakehouse_relative_path: str | None = None, warehouse_name: str | None = None, warehouse_table: str | None = None, default_relative_path: str = 'Dimension_Date', mode: str = 'overwrite', batch_size: int = 10000, spark: pyspark.sql.SparkSession | None = None) → pyspark.sql.DataFrame

Build a calendar date dimension (keys, labels, fiscal attributes, weekend flag).

When start_date / end_date are omitted, the range defaults (inclusive) to January 1st of current_year - (current_year % 100) through December 31st of current_year + 4.
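
The default bounds can be sketched as follows; default_date_range is an illustrative helper, not part of fabrictools:

```python
from datetime import date

def default_date_range(today: date) -> tuple[str, str]:
    # Illustrative only: reproduce the documented defaults used when
    # start_date / end_date are omitted.
    year = today.year
    start = date(year - (year % 100), 1, 1)   # January 1st of the century start
    end = date(year + 4, 12, 31)              # December 31st, four years ahead
    return start.isoformat(), end.isoformat()

# In 2024, the defaults span 2000-01-01 through 2028-12-31.
print(default_date_range(date(2024, 6, 1)))
```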

Parameters:
  • start_date (str | None) – Inclusive lower bound yyyy-MM-dd, or None for default.

  • end_date (str | None) – Inclusive upper bound yyyy-MM-dd, or None for default.

  • fiscal_year_start_month (int) – First fiscal month (1–12).

  • lakehouse_name (str | None) – If set with lakehouse_relative_path, write Delta there.

  • lakehouse_relative_path (str | None) – Path under the Lakehouse for the dimension table.

  • warehouse_name (str | None) – If set with warehouse_table, JDBC-write to Warehouse.

  • warehouse_table (str | None) – Fully qualified warehouse table name.

  • default_relative_path (str) – Fallback Lakehouse path segment when none given.

  • mode (str) – Spark write mode for persistence.

  • batch_size (int) – JDBC batch size when writing to the Warehouse.

  • spark (SparkSession | None) – Optional SparkSession.

Returns:

Ordered date dimension dataframe.

Return type:

DataFrame

Raises:

ValueError – If fiscal_year_start_month is outside 1..12.
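
One way the fiscal attributes and this validation might be computed is sketched below. The fiscal-year labelling convention (the fiscal year takes the calendar year in which it ends) is an assumption for illustration, not necessarily the one fabrictools uses:

```python
from datetime import date

def fiscal_year_and_quarter(d: date, fiscal_year_start_month: int) -> tuple[int, int]:
    # Guard mirrors the documented ValueError.
    if not 1 <= fiscal_year_start_month <= 12:
        raise ValueError("fiscal_year_start_month must be in 1..12")
    # Months elapsed since the fiscal year began.
    offset = (d.month - fiscal_year_start_month) % 12
    # Assumed convention: label the fiscal year by the year in which it ends.
    fiscal_year = d.year + (1 if fiscal_year_start_month > 1
                            and d.month >= fiscal_year_start_month else 0)
    fiscal_quarter = offset // 3 + 1
    return fiscal_year, fiscal_quarter

# With a July fiscal start, 2024-07-01 opens FY2025 Q1.
print(fiscal_year_and_quarter(date(2024, 7, 1), 7))
```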

Example

>>> dim_date = build_dimension_date(
...     start_date="2024-01-01",
...     end_date="2024-12-31",
...     lakehouse_name="GoldLakehouse",
...     lakehouse_relative_path="dimension_date",
... )
fabrictools.build_dimension_country(countries_limit: int | None = None, fail_on_source_error: bool = True, lakehouse_name: str | None = None, lakehouse_relative_path: str | None = None, warehouse_name: str | None = None, warehouse_table: str | None = None, default_relative_path: str = 'Dimension_Country', mode: str = 'overwrite', batch_size: int = 10000, spark: pyspark.sql.SparkSession | None = None) → pyspark.sql.DataFrame

Build dimension_country from the countrystatecity-countries package.

Parameters:
  • countries_limit (int | None) – Optional cap on countries processed (source list order).

  • fail_on_source_error (bool) – If True, raise on source errors; otherwise log and return an empty frame.

  • lakehouse_name (str | None) – Optional Lakehouse for Delta output.

  • lakehouse_relative_path (str | None) – Path under Lakehouse (defaults via default_relative_path).

  • warehouse_name (str | None) – Optional Warehouse for JDBC output.

  • warehouse_table (str | None) – Fully qualified warehouse table.

  • default_relative_path (str) – Default Lakehouse relative path segment.

  • mode (str) – Spark write mode.

  • batch_size (int) – JDBC batch size for warehouse writes.

  • spark (SparkSession | None) – Optional SparkSession.

Returns:

Country dimension dataframe.

Return type:

DataFrame

Raises:
  • ImportError – When countrystatecity-countries is not installed.

  • RuntimeError – When fail_on_source_error is True and building fails.

Example

>>> countries = build_dimension_country(
...     lakehouse_name="GoldLakehouse",
...     lakehouse_relative_path="dimension_country",
... )
fabrictools.build_dimension_city(countries_limit: int | None = None, include_states_metadata: bool = True, fail_on_source_error: bool = True, regions: list[str] | None = None, subregions: list[str] | None = None, countries: list[str] | None = None, lakehouse_name: str | None = None, lakehouse_relative_path: str | None = None, warehouse_name: str | None = None, warehouse_table: str | None = None, default_relative_path: str = 'Dimension_City', mode: str = 'overwrite', batch_size: int = 10000, spark: pyspark.sql.SparkSession | None = None) → pyspark.sql.DataFrame

Build dimension_city from countrystatecity-countries with optional filters.

countries may list country_code_2, country_code_3, or country_name values (case-insensitive). regions and subregions narrow by geography; filters combine with AND logic.
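
As a sketch of how those filters combine, assuming AND semantics and case-insensitive matching as documented (the rows and keep helper below are hypothetical, not the source package's real API):

```python
# Hypothetical rows standing in for the source package's country metadata.
rows = [
    {"country_code_2": "FR", "region": "EUROPE", "subregion": "WESTERN EUROPE"},
    {"country_code_2": "DE", "region": "EUROPE", "subregion": "WESTERN EUROPE"},
    {"country_code_2": "JP", "region": "ASIA", "subregion": "EASTERN ASIA"},
]

def keep(row, regions=None, subregions=None, countries=None):
    # Every provided filter must match (AND logic); comparisons are
    # case-insensitive, mirroring the documented behaviour.
    if regions is not None and row["region"] not in {r.upper() for r in regions}:
        return False
    if subregions is not None and row["subregion"] not in {s.upper() for s in subregions}:
        return False
    if countries is not None and row["country_code_2"] not in {c.upper() for c in countries}:
        return False
    return True

selected = [r["country_code_2"] for r in rows
            if keep(r, regions=["Europe"], countries=["fr", "de"])]
# selected == ["FR", "DE"]
```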

Parameters:
  • countries_limit (int | None) – Optional cap on countries iterated.

  • include_states_metadata (bool) – If True, resolve state names via get_states_of_country.

  • fail_on_source_error (bool) – If True, raise on failure; otherwise log and return an empty frame.

  • regions (list[str] | None) – Allow-list of region names (uppercased internally).

  • subregions (list[str] | None) – Allow-list of subregion names.

  • countries (list[str] | None) – Allow-list of country identifiers or names.

  • lakehouse_name (str | None) – Optional Lakehouse for Delta output.

  • lakehouse_relative_path (str | None) – Lakehouse path for the dimension table.

  • warehouse_name (str | None) – Optional Warehouse name.

  • warehouse_table (str | None) – Fully qualified warehouse table.

  • default_relative_path (str) – Default Lakehouse path segment.

  • mode (str) – Spark write mode.

  • batch_size (int) – JDBC batch size.

  • spark (SparkSession | None) – Optional SparkSession.

Returns:

City dimension dataframe.

Return type:

DataFrame

Example

>>> cities = build_dimension_city(
...     lakehouse_name="GoldLakehouse",
...     lakehouse_relative_path="dimension_city",
...     countries=["FR", "DE"],
... )
fabrictools.generate_dimensions(lakehouse_name: str | None = None, warehouse_name: str | None = None, include_date: bool = True, include_country: bool = True, include_city: bool = True, start_date: str | None = None, end_date: str | None = None, fiscal_year_start_month: int = 1, countries_limit: int | None = None, include_states_metadata: bool = True, fail_on_source_error: bool = True, city_regions: list[str] | None = None, city_subregions: list[str] | None = None, city_countries: list[str] | None = None, mode: str = 'overwrite', batch_size: int = 10000, date_relative_path: str = 'Dimension_Date', country_relative_path: str = 'Dimension_Country', city_relative_path: str = 'Dimension_City', date_warehouse_table: str = 'dbo.Dimension_Date', country_warehouse_table: str = 'dbo.Dimension_Country', city_warehouse_table: str = 'dbo.Dimension_City', spark: pyspark.sql.SparkSession | None = None) → dict[str, pyspark.sql.DataFrame]

Build enabled dimensions and persist each to the configured Lakehouse and/or Warehouse.

Keys in the returned map mirror the chosen relative path (or warehouse table name).
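
The key derivation might look like the sketch below; the precedence (Lakehouse relative path first, then warehouse table, then the default segment) is an assumption for illustration:

```python
def result_key(relative_path, warehouse_table, default_path):
    # Assumed precedence: an explicit Lakehouse relative path names the
    # entry; otherwise the warehouse table; otherwise the default segment.
    return relative_path or warehouse_table or default_path

keys = [
    result_key(None, None, "Dimension_Date"),
    result_key(None, "dbo.Dimension_Country", "Dimension_Country"),
    result_key("dimension_city", None, "Dimension_City"),
]
# keys == ["Dimension_Date", "dbo.Dimension_Country", "dimension_city"]
```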

Parameters:
  • lakehouse_name (str | None) – Optional Lakehouse for all dimension writes.

  • warehouse_name (str | None) – Optional Warehouse for JDBC writes.

  • include_date (bool) – Build date dimension when True.

  • include_country (bool) – Build country dimension when True.

  • include_city (bool) – Build city dimension when True.

  • start_date (str | None) – Passed to fabrictools.build_dimension_date().

  • end_date (str | None) – Passed to build_dimension_date.

  • fiscal_year_start_month (int) – Passed to build_dimension_date.

  • countries_limit (int | None) – Passed to geo builders.

  • include_states_metadata (bool) – Passed to fabrictools.build_dimension_city().

  • fail_on_source_error (bool) – Passed to geo builders.

  • city_regions (list[str] | None) – Passed as regions to build_dimension_city.

  • city_subregions (list[str] | None) – Passed as subregions to build_dimension_city.

  • city_countries (list[str] | None) – Passed as countries to build_dimension_city.

  • mode (str) – Write mode for all targets.

  • batch_size (int) – JDBC batch size for warehouse writes.

  • date_relative_path (str) – Lakehouse path for the date table.

  • country_relative_path (str) – Lakehouse path for the country table.

  • city_relative_path (str) – Lakehouse path for the city table.

  • date_warehouse_table (str) – Warehouse table for date dimension.

  • country_warehouse_table (str) – Warehouse table for country dimension.

  • city_warehouse_table (str) – Warehouse table for city dimension.

  • spark (SparkSession | None) – Optional SparkSession.

Returns:

Map of dimension key to dataframe.

Return type:

dict[str, DataFrame]

Raises:

ValueError – If all dimension flags are False.

Example

>>> dims = generate_dimensions(
...     lakehouse_name="GoldLakehouse",
...     include_date=True,
...     include_country=True,
...     include_city=False,
... )