Lakehouse and Warehouse I/O
Lakehouse filesystem discovery utilities.
- fabrictools.io.discovery.filter_pipeline_discovered_tables(relative_paths: list[str]) -> list[str]
Remove internal fabrictools tables and schema snapshot paths from a path list.
Drops names ending with _schema_snapshot and the fixed names pipeline_audit_log, prefix_rules, and profiling_cache.
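The drop rule is simple enough to sketch in plain Python (a simplified reimplementation for illustration, not the library's actual code):

```python
# Fixed internal table names dropped by the filter, per the description above.
INTERNAL_TABLES = {"pipeline_audit_log", "prefix_rules", "profiling_cache"}

def filter_pipeline_discovered_tables_sketch(relative_paths):
    """Drop internal fabrictools tables and *_schema_snapshot paths."""
    kept = []
    for path in relative_paths:
        # Compare against the leaf (table) name, not the full relative path.
        leaf = path.rstrip("/").rsplit("/", 1)[-1]
        if leaf.endswith("_schema_snapshot") or leaf in INTERNAL_TABLES:
            continue
        kept.append(path)
    return kept
```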
- fabrictools.io.discovery.get_fs_entry_name(fs_entry: Any) -> str
Return a directory or file name from a notebookutils.fs.ls entry.
- Parameters:
fs_entry (Any) – Object with name or path as provided by Fabric APIs.
- Returns:
Normalized leaf name, or empty string if none.
- Return type:
str
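A rough sketch of the normalization described above (the attribute-probing order is an assumption; real notebookutils.fs.ls entries expose both name and path):

```python
from types import SimpleNamespace

def get_fs_entry_name_sketch(fs_entry):
    """Return the leaf name from an object exposing ``name`` or ``path``."""
    for attr in ("name", "path"):
        value = getattr(fs_entry, attr, None)
        if value:
            # Strip a trailing slash, then keep only the last path segment.
            return str(value).rstrip("/").rsplit("/", 1)[-1]
    return ""
```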
- fabrictools.io.discovery.list_lakehouse_tables(lakehouse_name: str, include_schemas: List[str] | None = None, exclude_tables: List[str] | None = None) -> List[str]
List table paths under a Lakehouse as Tables/<schema>/<table>.
Uses filesystem listing under <abfs>/Tables/<schema>/<table>.
- Parameters:
lakehouse_name (str) – Lakehouse display name.
include_schemas (list[str] | None) – If set, only these schema names (case-insensitive).
exclude_tables (list[str] | None) – Table or schema.table names to skip (case-insensitive).
- Returns:
Sorted relative paths.
- Return type:
List[str]
- Raises:
ValueError – When notebookutils is unavailable (not in Fabric).
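The include/exclude semantics can be illustrated with an in-memory stand-in for the filesystem listing (the helper and its schemas mapping are hypothetical; the real function lists <abfs>/Tables via notebookutils):

```python
def list_tables_from_listing(schemas, include_schemas=None, exclude_tables=None):
    """Mimic the documented filtering over a {schema: [table, ...]} mapping."""
    include = {s.lower() for s in include_schemas} if include_schemas else None
    exclude = {t.lower() for t in (exclude_tables or [])}
    paths = []
    for schema, tables in schemas.items():
        if include is not None and schema.lower() not in include:
            continue
        for table in tables:
            # Exclusions match either the bare table name or schema.table.
            if table.lower() in exclude or f"{schema}.{table}".lower() in exclude:
                continue
            paths.append(f"Tables/{schema}/{table}")
    return sorted(paths)
```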
- fabrictools.io.discovery.list_lakehouse_tables_for_pipeline(lakehouse_name: str, include_schemas: List[str] | None = None, exclude_tables: List[str] | None = None) -> List[str]
Like list_lakehouse_tables(), then filter_pipeline_discovered_tables(). Bulk pipelines use this to skip internal metadata and schema snapshot tables.
Lakehouse I/O facade module.
- fabrictools.io.lakehouse.delete_all_lakehouse_tables(lakehouse_name: str, include_schemas: List[str] | None = None, exclude_tables: List[str] | None = None, continue_on_error: bool = False) -> dict[str, Any]
Hard-delete all discovered Lakehouse table folders.
Tables are discovered as Tables/<schema>/<table> and removed with notebookutils.fs.rm(<abfs>/Tables/<schema>/<table>, recurse=True).
- Parameters:
lakehouse_name (str) – Lakehouse display name to purge.
include_schemas (list[str] | None) – If set, only these schema names (case-insensitive).
exclude_tables (list[str] | None) – Table or schema.table names to skip (case-insensitive).
continue_on_error (bool) – If False (default), stop on the first delete failure.
- Returns:
Summary with counts and per-table relative_path / errors.
- Return type:
dict[str, Any]
- Raises:
ValueError – When notebookutils is unavailable (not in Fabric).
Example
>>> summary = delete_all_lakehouse_tables(
...     "DevLakehouse",
...     include_schemas=["dbo"],
...     exclude_tables=["dbo.KeepThis"],
...     continue_on_error=True,
... )
- fabrictools.io.lakehouse.merge_lakehouse(source_df: pyspark.sql.DataFrame, lakehouse_name: str, relative_path: str, merge_condition: str, update_set: dict | None = None, insert_set: dict | None = None, spark: pyspark.sql.SparkSession | None = None) -> None
Upsert (merge) a DataFrame into an existing Delta table in a Lakehouse.
Uses Delta Lake DeltaTable.forPath. When update_set and/or insert_set are None, whenMatchedUpdateAll / whenNotMatchedInsertAll are used.
- Parameters:
source_df (DataFrame) – Rows to merge into the target table.
lakehouse_name (str) – Lakehouse display name holding the target table.
relative_path (str) – Path of the Delta table inside the Lakehouse (same rules as write_lakehouse(), including schema.table with spaces).
merge_condition (str) – SQL predicate joining source and target (e.g. "src.id = tgt.id").
update_set (dict | None) – {target_col: source_expr} for matched rows, or None to update all columns.
insert_set (dict | None) – {target_col: source_expr} for new rows, or None to insert all columns.
spark (SparkSession | None) – Optional SparkSession; when omitted the active session is used.
Example
>>> merge_lakehouse(
...     new_df,
...     "SilverLakehouse",
...     "sales_clean",
...     merge_condition="src.id = tgt.id",
... )
- fabrictools.io.lakehouse.read_lakehouse(lakehouse_name: str, relative_path: str, spark: pyspark.sql.SparkSession | None = None) -> pyspark.sql.DataFrame
Read a dataset from a Fabric Lakehouse.
Tries formats in order: Delta → Parquet → CSV. The first format that succeeds is used; the detected format is logged with the resulting shape.
- Parameters:
lakehouse_name (str) – Display name of the Lakehouse (e.g. "BronzeLakehouse").
relative_path (str) – Path inside the Lakehouse root, relative to the ABFS base (e.g. "sales/2024", "Tables/customers", or SQL-style "dbo.MyTable" / "dbo.PdC Extraction" with spaces in the table name).
spark (SparkSession | None) – Optional SparkSession; when omitted the active session is used.
- Returns:
Loaded dataframe.
- Return type:
DataFrame
- Raises:
RuntimeError – When none of the supported formats can be read from the path.
Example
>>> df = read_lakehouse("BronzeLakehouse", "sales/2024")
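The Delta → Parquet → CSV fallback is a try-in-order pattern; a generic sketch of it (helper name hypothetical, not part of the fabrictools API):

```python
def read_first_format(loaders):
    """Try ``(format_name, loader)`` pairs in order; return the first success.

    Raises RuntimeError when every loader fails, mirroring the
    read_lakehouse behaviour described above.
    """
    errors = []
    for fmt, load in loaders:
        try:
            return fmt, load()
        except Exception as exc:  # a real reader would catch narrower errors
            errors.append(f"{fmt}: {exc}")
    raise RuntimeError("No supported format could be read: " + "; ".join(errors))
```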
- fabrictools.io.lakehouse.resolve_lakehouse_read_candidate(lakehouse_name: str, relative_path: str, spark: pyspark.sql.SparkSession | None = None) -> str
Resolve the best candidate relative path for a Lakehouse read.
If candidate generation yields a single path, return it directly. If multiple candidates exist, try each path and return the first readable one.
- Parameters:
lakehouse_name (str) – Display name of the Lakehouse.
relative_path (str) – Requested path inside the Lakehouse root (same rules as read_lakehouse()).
spark (SparkSession | None) – Optional SparkSession; when omitted the active session is used.
- Returns:
Relative path string that was verified readable.
- Return type:
str
- Raises:
RuntimeError – When no candidate path can be read.
Example
>>> resolved = resolve_lakehouse_read_candidate(
...     "BronzeLakehouse", "dbo.SalesOrders"
... )
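Candidate generation is internal to fabrictools, but for SQL-style inputs it plausibly expands "dbo.MyTable" into Tables/dbo/MyTable-style paths. A purely illustrative sketch, under that assumption (the actual rules may differ):

```python
def candidate_paths_sketch(relative_path):
    """Expand a requested path into plausible on-disk candidates."""
    candidates = [relative_path]
    if "/" not in relative_path and "." in relative_path:
        # SQL-style "schema.table" -> filesystem "Tables/schema/table".
        schema, table = relative_path.split(".", 1)
        candidates.append(f"Tables/{schema}/{table}")
    if not relative_path.startswith("Tables/"):
        candidates.append(f"Tables/{relative_path}")
    return candidates
```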
- fabrictools.io.lakehouse.write_lakehouse(df: pyspark.sql.DataFrame, lakehouse_name: str, relative_path: str, mode: str = 'overwrite', partition_by: List[str] | None = None, format: str = 'delta', spark: pyspark.sql.SparkSession | None = None, *, normalize_column_names: bool = True, enable_column_mapping: bool = False, auto_partition: bool = True, auto_partition_threshold_bytes: int = 1073741824) -> None
Write a DataFrame to a Fabric Lakehouse (default format: Delta).
- Parameters:
df (DataFrame) – DataFrame to persist.
lakehouse_name (str) – Display name of the target Lakehouse.
relative_path (str) – Destination path inside the Lakehouse (e.g. "sales_clean", "Tables/sales_clean", or "dbo.PdC Extraction").
mode (str) – Spark write mode: "overwrite" (default), "append", "ignore", or "error".
partition_by (list[str] | None) – Optional column names to partition by. Each name is resolved like fabrictools.clean_data() / fabrictools.merge_dataframes() (physical name, normalized unique label, or snake_case). Auto-detected date partitions are appended when present on df.
format (str) – "delta" (default), "parquet", or "csv".
spark (SparkSession | None) – Optional SparkSession; when omitted the active session is used.
normalize_column_names (bool) – If True (default), run fabrictools.rename_columns_normalized() before resolving partition_by and writing. If False, keep physical column names unchanged.
enable_column_mapping (bool) – If True and format="delta", writes table properties required for Delta column mapping (mode name), allowing column names with spaces or special characters.
auto_partition (bool) – If True (default), automatically partition the data by detected date columns if they exist.
Example
>>> write_lakehouse(
...     df, "SilverLakehouse", "sales_clean", mode="overwrite", partition_by=["year"]
... )
Warehouse I/O facade module.
- fabrictools.io.warehouse.read_warehouse(warehouse_name: str, query: str, spark: pyspark.sql.SparkSession | None = None) -> pyspark.sql.DataFrame
Run a SQL query on a Fabric Warehouse and return the result as a DataFrame.
The JDBC URL is resolved from the warehouse display name via notebookutils. Authentication uses the signed-in Fabric user token.
- Parameters:
warehouse_name (str) – Warehouse display name (e.g. "MyWarehouse").
query (str) – SQL text (e.g. "SELECT * FROM dbo.sales"). Wrap subqueries in parentheses when needed, e.g. "(SELECT id, name FROM dbo.sales WHERE year = 2024) t".
spark (SparkSession | None) – Optional SparkSession; when omitted the active session is used.
- Returns:
Query result.
- Return type:
DataFrame
Example
>>> df = read_warehouse("MyWarehouse", "SELECT * FROM dbo.sales")
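Spark's JDBC reader accepts either a table name or a parenthesised subquery with an alias as its dbtable target, which is why the parameter note above asks you to wrap subqueries. A hypothetical helper sketching that wrapping rule (not part of fabrictools):

```python
def as_jdbc_dbtable(query):
    """Wrap a bare SELECT as a parenthesised derived table with an alias."""
    q = query.strip()
    if q.startswith("("):
        return q  # already wrapped by the caller
    return f"({q}) q"
```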
- fabrictools.io.warehouse.write_warehouse(df: pyspark.sql.DataFrame, warehouse_name: str, table: str, mode: str = 'overwrite', batch_size: int = 10000, spark: pyspark.sql.SparkSession | None = None) -> None
Write a DataFrame to a Fabric Warehouse table via JDBC.
- Parameters:
df (DataFrame) – DataFrame to persist.
warehouse_name (str) – Target Warehouse display name.
table (str) – Fully-qualified table name (e.g. "dbo.sales_clean").
mode (str) – Spark write mode: "overwrite" (default), "append", "ignore", or "error".
batch_size (int) – Rows per JDBC batch (default 10000).
spark (SparkSession | None) – Optional SparkSession; when omitted the active session is used.
Example
>>> write_warehouse(df, "MyWarehouse", "dbo.sales_clean", mode="append")
I/O adapters for Fabric Lakehouse and Warehouse (see fabrictools.io.lakehouse, fabrictools.io.warehouse, fabrictools.io.discovery).