Opsio - Cloud and AI Solutions

MLflow on Databricks: Experiment Tracking, Model Registry, and Production Deployment

Reviewed by Opsio Engineering Team
Vaishnavi Shree

Director & MLOps Lead

Predictive maintenance specialist, industrial data analysis, vibration-based condition monitoring, applied AI for manufacturing and automotive operations

MLflow is the open-source ML lifecycle platform created at Databricks in 2018 and now the de facto standard for experiment tracking, model packaging, and model registry across the industry. On Databricks, MLflow is hosted natively — no setup, no managed-service add-on — and integrated with Unity Catalog, jobs, and Model Serving so that the path from a notebook experiment to a production endpoint is a few hundred lines of code rather than a quarter of platform engineering.

This article walks through the three MLflow components that matter day to day, the operating patterns we use across customer engagements, and the gotchas that show up when teams move beyond a single notebook into team-scale and production-scale ML.

The Three MLflow Components

MLflow on Databricks is four projects in one platform. Three matter for almost every workload:

  1. Tracking — log parameters, metrics, code version, and artifacts for every run. The system of record for experiments.
  2. Models — a packaging format that wraps a trained artifact with its dependencies and signature, so it can be loaded back consistently across runtimes.
  3. Model Registry — versioned, governed catalog of models, with stage transitions and access control. On Databricks, this lives inside Unity Catalog.

The fourth, MLflow Projects, sees less production use; on Databricks, jobs and Workflows fill the same role with better integration.

Tracking: The Minimal Working Example

An MLflow run is started as a context manager. Everything logged inside the context belongs to that run.

import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

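# Assumes X_train, y_train, X_val, y_val already exist as pandas objects (e.g., from an earlier cell).
mlflow.set_experiment("/Users/jane@acme.com/fraud-detection")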

with mlflow.start_run(run_name="gb-baseline") as run:
    params = {"n_estimators": 200, "max_depth": 4, "learning_rate": 0.05}
    mlflow.log_params(params)

    clf = GradientBoostingClassifier(**params)
    clf.fit(X_train, y_train)

    auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
    mlflow.log_metric("val_auc", auc)
    mlflow.log_metric("train_size", len(X_train))

    # Sklearn flavor handles env capture, signature, input example
    mlflow.sklearn.log_model(
        sk_model=clf,
        artifact_path="model",
        signature=mlflow.models.infer_signature(X_train, clf.predict(X_train)),
        input_example=X_train.head(),
    )

    print(f"Run {run.info.run_id}: AUC={auc:.4f}")

Two details matter for production readiness. The signature describes input and output schemas, which Model Serving uses to validate inference requests. The input_example goes into the model artifact and shows downstream consumers what a valid input row looks like.
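
To make the signature's role concrete, here is a minimal sketch of the kind of check a serving layer performs before scoring. It is not MLflow's implementation: the SCHEMA dict, the column names, and validate_payload are hypothetical stand-ins for a logged signature. The dataframe_split request shape is one of the JSON formats MLflow scoring servers accept.

```python
import json

# Hypothetical schema standing in for a logged model signature:
# column name -> accepted Python type(s). Column names are made up.
SCHEMA = {"amount": (int, float), "merchant_id": (str,), "hour_of_day": (int,)}


def validate_payload(payload, schema):
    """Return a list of human-readable problems; an empty list means the payload passes."""
    split = payload.get("dataframe_split")
    if split is None:
        return ["missing 'dataframe_split' key"]
    problems = []
    columns = split.get("columns", [])
    missing = [c for c in schema if c not in columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for row_idx, row in enumerate(split.get("data", [])):
        if len(row) != len(columns):
            problems.append(f"row {row_idx}: expected {len(columns)} values, got {len(row)}")
            continue
        for col, value in zip(columns, row):
            expected = schema.get(col)
            # bool is a subclass of int in Python, so reject it explicitly
            if expected and (isinstance(value, bool) or not isinstance(value, expected)):
                problems.append(f"row {row_idx}: column '{col}' has type {type(value).__name__}")
    return problems


good = {"dataframe_split": {"columns": ["amount", "merchant_id", "hour_of_day"],
                            "data": [[129.5, "m-001", 14]]}}
bad = {"dataframe_split": {"columns": ["amount", "merchant_id"],
                           "data": [["129.5", "m-001"]]}}

print(json.dumps(good))               # the body a client would POST to the endpoint
print(validate_payload(good, SCHEMA)) # no problems
print(validate_payload(bad, SCHEMA))  # missing column plus a string where a number is expected
```

Running a check like this in client-side tests, against the same JSON shape the endpoint will receive, catches schema mismatches before they surface as rejected requests.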

Tracking URI Patterns

The MLflow tracking URI tells the client where to send runs. On Databricks the patterns to know:

URI | When to use
databricks (default in workspace) | Notebook attached to a workspace cluster; runs flow to the workspace tracking server
databricks-uc | Registry URI for registering models to Unity Catalog (modern, recommended)
databricks://<profile> | From outside Databricks (local dev, GitHub Actions) using a CLI profile
http://<mlflow-server> | Self-hosted MLflow tracking server; rare on Databricks but supported

import mlflow
mlflow.set_tracking_uri("databricks")
mlflow.set_registry_uri("databricks-uc")  # register models into Unity Catalog

The split between tracking URI and registry URI lets runs flow to the workspace tracking server while models register into Unity Catalog. This is the modern setup; do not register new models into the legacy workspace registry.
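
One way to keep the same training script working in a notebook, on a laptop, and in CI is to resolve the tracking URI from the environment. The sketch below is an illustration under assumptions: resolve_tracking_uri is a hypothetical helper, it relies on Databricks clusters setting DATABRICKS_RUNTIME_VERSION, and it treats MLFLOW_TRACKING_URI as the place a self-hosted server URL would be configured.

```python
import os
from typing import Mapping, Optional


def resolve_tracking_uri(env: Mapping[str, str], profile: Optional[str] = None) -> str:
    """Pick a tracking URI based on where the code is running."""
    # Assumption: this variable is present on Databricks clusters.
    if env.get("DATABRICKS_RUNTIME_VERSION"):
        return "databricks"
    # Outside Databricks with a named CLI profile (local dev, CI).
    if profile:
        return f"databricks://{profile}"
    # Otherwise fall back to an explicitly configured server, if any.
    return env.get("MLFLOW_TRACKING_URI", "databricks")


uri = resolve_tracking_uri(os.environ, profile=None)
print(uri)
# In real code this would feed mlflow.set_tracking_uri(uri),
# followed by mlflow.set_registry_uri("databricks-uc").
```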

Unity Catalog Model Registry

Models in Unity Catalog have fully qualified names (catalog.schema.model) and follow the same GRANT / REVOKE access model as tables.

# Register a logged model into Unity Catalog
result = mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="prod.ml.fraud_score",
)
print(f"Version {result.version} registered as prod.ml.fraud_score")

Stage management uses aliases (e.g., champion, challenger, archived) rather than the legacy fixed stages (None / Staging / Production). Aliases are semantic and assignable, which makes blue-green and shadow patterns cleaner:

from mlflow import MlflowClient
client = MlflowClient()

# Promote version 7 to champion
client.set_registered_model_alias(
    name="prod.ml.fraud_score",
    alias="champion",
    version=7,
)

# Production code references the alias, not a fixed version
import mlflow.pyfunc
model = mlflow.pyfunc.load_model("models:/prod.ml.fraud_score@champion")
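
The promotion and rollback flow that aliases enable can be sketched with a toy in-memory table. This is not the MLflow client: AliasTable and the previous_champion alias name are hypothetical, and against a real registry each step would map to set_registered_model_alias or delete_registered_model_alias calls.

```python
class AliasTable:
    """Toy in-memory stand-in for one registered model's alias -> version mapping."""

    def __init__(self):
        self._aliases = {}

    def set_alias(self, alias, version):
        self._aliases[alias] = version

    def resolve(self, alias):
        return self._aliases[alias]

    def promote_challenger(self):
        """Move challenger to champion, remembering the old champion for rollback."""
        old_champion = self._aliases.get("champion")
        self._aliases["champion"] = self._aliases.pop("challenger")
        if old_champion is not None:
            self._aliases["previous_champion"] = old_champion

    def rollback(self):
        """Restore the champion that was in place before the last promotion."""
        self._aliases["champion"] = self._aliases.pop("previous_champion")


table = AliasTable()
table.set_alias("champion", 7)
table.set_alias("challenger", 8)

table.promote_challenger()
print(table.resolve("champion"))  # 8

table.rollback()
print(table.resolve("champion"))  # 7
```

Because consumers only ever resolve the champion alias, neither the promotion nor the rollback requires a code change on their side.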

Model Serving: From Registry to REST Endpoint

Databricks Model Serving spins up a managed endpoint from a registered model in roughly 5-10 minutes. The endpoint scales 0-N replicas, supports A/B traffic split between model versions, and integrates with Unity Catalog for access control.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput, ServedEntityInput,
)

w = WorkspaceClient()
w.serving_endpoints.create(
    name="fraud-score",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="prod.ml.fraud_score",
                entity_version="7",
                workload_size="Small",  # 4 concurrent requests
                scale_to_zero_enabled=True,
            ),
        ],
    ),
)

Scale-to-zero is the cost-saver: idle endpoints drop to zero replicas after the configured timeout and cold-start in roughly 10-30 seconds on the next request. Latency-sensitive endpoints serving real-time inference should disable scale-to-zero and run a warm minimum replica count.
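
For the A/B traffic split mentioned earlier, a small guard that percentages add up is worth having before any config reaches the endpoint. The sketch below builds a plain dict; build_traffic_config and the served entity names are hypothetical, and the route field names follow the Databricks serving API as we understand it, so check the current reference before relying on them.

```python
def build_traffic_config(split):
    """Build a traffic_config-style dict from {served_entity_name: percentage}."""
    if any(pct < 0 for pct in split.values()):
        raise ValueError("percentages must be non-negative")
    total = sum(split.values())
    if total != 100:
        raise ValueError(f"traffic percentages must sum to 100, got {total}")
    return {
        "routes": [
            {"served_model_name": name, "traffic_percentage": pct}
            for name, pct in split.items()
        ]
    }


# Hypothetical served entity names for versions 7 and 8 of the model.
config = build_traffic_config({"fraud_score-7": 90, "fraud_score-8": 10})
print(config)
```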

Tracking Inference: Inference Tables and Lakehouse Monitoring

Production ML needs the loop closed: predictions and the features that produced them must land in a Delta table so you can detect drift, debug bad scores, and retrain on real data. Databricks does this automatically via inference tables:

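from databricks.sdk.service.serving import AutoCaptureConfigInput

# When creating the endpoint, pass this as auto_capture_config=
# on EndpointCoreConfigInput alongside served_entities
auto_capture_config = AutoCaptureConfigInput(
    enabled=True,
    catalog_name="prod",
    schema_name="ml",
    table_name_prefix="fraud_score_inference",
)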

Each inference request and response lands as a row in prod.ml.fraud_score_inference_payload. Pair it with Lakehouse Monitoring to detect feature drift, label drift, and prediction distribution drift on a schedule. Without this loop, your "production model" is essentially flying blind.
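
To illustrate what prediction distribution drift means in practice, here is a minimal Population Stability Index (PSI) sketch over two samples of scores. It is a stand-in for illustration only: Lakehouse Monitoring computes its own metrics, and the psi helper, the synthetic score lists, and the ALERT_THRESHOLD value are all assumptions.

```python
import math

ALERT_THRESHOLD = 0.25  # assumed alerting cutoff, tune for your own model


def psi(baseline, recent, bins=10):
    """Population Stability Index between two non-empty samples of scores in [0, 1]."""
    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = min(int(v * bins), bins - 1)  # clamp 1.0 into the last bin
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    expected, actual = histogram(baseline), histogram(recent)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Synthetic data: a roughly uniform baseline and a copy shifted upward by 0.3
baseline_scores = [i / 100 for i in range(100)]
shifted_scores = [min(s + 0.3, 1.0) for s in baseline_scores]

print(psi(baseline_scores, baseline_scores))                   # 0.0
print(psi(baseline_scores, shifted_scores) > ALERT_THRESHOLD)  # True
```

In a real setup the two samples would come from the inference table, for example last month's predictions versus the most recent week's.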

Operating Patterns Across Customer Engagements

Five patterns separate ML platforms that work from those that just exist:

  1. Catalog hierarchy mirrors environments: dev.ml, staging.ml, prod.ml as separate UC catalogs with hard ACL boundaries. Dev models do not get loaded into prod by accident.
  2. Aliases over versions in code — production code references @champion, never a fixed version number. Promotion is one alias swap, not a redeploy.
  3. Inference tables on by default — every prod endpoint logs to a Delta table from day one, even before monitoring is wired up. You cannot retrofit observability after a customer incident.
  4. Jobs cluster training, not all-purpose — training runs scheduled via Workflows on jobs clusters or serverless GPU. The DBU cost difference for a single weekly training run is meaningful at scale.
  5. Feature engineering in Delta tables, not in notebooks — feature pipelines are jobs producing curated Delta tables. Models read from those tables. This makes feature reuse, lineage, and drift detection tractable.
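
The first two patterns combine naturally: if the environment picks the catalog and code only ever asks for the champion alias, one helper can produce every model URI. A minimal sketch, assuming the dev/staging/prod catalogs and ml schema named above; champion_uri is a hypothetical helper, not an MLflow function.

```python
ENVIRONMENTS = ("dev", "staging", "prod")


def champion_uri(environment, model, schema="ml"):
    """Build the models:/ URI for the champion alias in a given environment's catalog."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return f"models:/{environment}.{schema}.{model}@champion"


print(champion_uri("prod", "fraud_score"))
# models:/prod.ml.fraud_score@champion
```

Rejecting unknown environment names up front keeps a typo from silently pointing a job at the wrong catalog.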

Gotchas

  • Mismatched signature and input: the endpoint returns 400 because the request JSON's columns or types don't match the signature. Always log a representative input_example and test with the same JSON shape.
  • Library version drift: a model trained on scikit-learn 1.4 fails to load on 1.5. MLflow captures the env, but the serving runtime must match; pin the runtime image explicitly.
  • Run pollution: thousands of throwaway runs from interactive experimentation make the experiment unusable. Use separate experiments for ad-hoc vs. tracked work, or use mlflow.delete_run as part of the cleanup loop.
  • Scale-to-zero on latency-critical endpoints: a cold start of 10-30 seconds breaks SLAs. Disable scale-to-zero and run min-replicas above zero for these endpoints.
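
For the run-pollution cleanup loop, the selection step can be sketched independently of MLflow. Everything here is illustrative: select_runs_to_delete, the record shape, the 14-day cutoff, and the keep tag are assumptions; in practice the records would come from a run search and each selected ID would be passed to mlflow.delete_run.

```python
from datetime import datetime, timedelta, timezone


def select_runs_to_delete(runs, now, max_age_days=14, keep_tag="keep"):
    """Return IDs of runs older than max_age_days that are not tagged to keep."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        r["run_id"]
        for r in runs
        if r["start_time"] < cutoff and r.get("tags", {}).get(keep_tag) != "true"
    ]


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
runs = [
    {"run_id": "a1", "start_time": now - timedelta(days=30), "tags": {}},
    {"run_id": "b2", "start_time": now - timedelta(days=30), "tags": {"keep": "true"}},
    {"run_id": "c3", "start_time": now - timedelta(days=2), "tags": {}},
]

stale = select_runs_to_delete(runs, now)
print(stale)  # ['a1']
```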

How Opsio Helps

Opsio's MLOps services practice designs and operates MLflow-on-Databricks platforms for industrial, automotive, and financial-services customers. We integrate MLflow into workspaces delivered through our Databricks implementation services, build the catalog hierarchy and CI/CD around the model registry, wire inference tables into Lakehouse Monitoring, and operate the resulting platform under SLAs. For broader use cases like demand forecasting and anomaly detection, see also our predictive analytics delivery.

About the Author

Vaishnavi Shree

Director & MLOps Lead

Vaishnavi leads machine learning operations initiatives at Opsio, enabling ML and predictive capabilities for industrial and automotive operations. Her expertise spans predictive maintenance, industrial data analysis, vibration-based condition monitoring, and applied AI — with a focus on practical, experiment-driven solutions designed for real operational environments.

Editorial standards: This article was written by a certified practitioner and peer-reviewed by our engineering team. We update content quarterly to ensure technical accuracy. Opsio maintains editorial independence — we recommend solutions based on technical merit, not commercial relationships.