In June, our managing partner Greg Umstead attended the Databricks Data + AI Summit in San Francisco, CA.
While he was there, our team back home received real-time updates and observations in our Databricks Slack channel. There was a clear trend in his sentiment, and it didn't take AI to spot it.
Upon his return, Greg presented the event's key learnings to our team. Here are the top 10 in TL;DR format:
- Databricks is becoming a Data Intelligence Platform. And a powerful one at that!
- Dizzying change in 2024, with 20+ launches! Including Unity Catalog, Delta Lake 4.0, AI/BI, and GenAI.
- New compound AI apps and AI/BI Genie, for complete solutions from understanding to action.
- Separate Storage from Processing. Store your data in cheap cloud storage, process it and bring it to life in Databricks.
- Unity Catalog is pretty awesome! Supports views, models, and volumes across many platforms.
- Databricks liquid clustering and serverless compute are now available. Serverless everything!
- Databricks Mosaic AI tools are probably our easiest path to productionizing AI workloads, thanks to numerous LLMs, AutoML, MLflow, and model publishing formats.
- “Clean Rooms” for safe data exchange with customers, partners, and others. Delta Sharing lets you compute on existing tables in any language, share without replication, and easily scale to multiple collaborators.
- Upcoming Lakeflow combines ingestion, transformation, and orchestration, challenging SSIS, ADF, and other ETL tools.
- We need to start now! And we're already working on our next steps…
As Greg said, the summit was like drinking from a fire hose. Databricks has so much coming in 2024. Right now, we're most excited about Unity Catalog, Delta Lake and Delta Live Tables, and Mosaic AI. Here's what we learned about them at the summit:
Unity Catalog:
Unity Catalog (UC) is a centralized data governance solution for all of your data and AI assets. In its object model, the catalog sits between the metastore and your schemas, so every asset gets a three-level name: catalog.schema.table. Because it unifies all data assets, UC ensures that policies and management rules are enforced consistently across different data sources and types.
The data lineage tracking feature is crucial for auditing and debugging, offering visibility into data transformations and movement every step of the way. Integrating notebooks and workflows with fine-grained access control makes collaboration simple and secure. Plus, multi-cloud support provides a consistent governance model, regardless of how many cloud environments are in play.
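To make the three-level model concrete, here's a minimal sketch of creating assets and granting access from a UC-enabled Databricks notebook (where `spark` is predefined). The catalog, schema, table, and group names are hypothetical:

```python
# Hypothetical Unity Catalog assets: the three-level namespace is
# catalog.schema.table, sitting under the workspace's metastore.
spark.sql("CREATE CATALOG IF NOT EXISTS sales")
spark.sql("CREATE SCHEMA IF NOT EXISTS sales.crm")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.crm.accounts (
        account_id BIGINT,
        region     STRING
    )
""")

# One governance layer for everything: the same GRANT syntax applies
# to tables, views, models, and volumes registered in the catalog.
spark.sql("GRANT SELECT ON TABLE sales.crm.accounts TO `analysts`")
```

Because the grants live in the catalog rather than in each engine, the same rule follows the data wherever it's queried.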
Delta Live Tables and Delta Lake:
Delta Live Tables (DLT) simplify the development of data pipelines. They are built on Databricks' foundational technologies, Delta Lake and the Delta file format. Live Tables operate in conjunction with both, but whereas those two cover the at-rest portions of the data lifecycle, Live Tables focus on the transformation portion. The DLT framework lets data engineers declare how data should be transformed between tables in the environment, and those declarative transformations accelerate a pipeline's construction, deployment, and monitoring.
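To show what "declarative" means in practice, here's a minimal sketch of a DLT pipeline. Note that the `dlt` module only resolves inside a Databricks DLT pipeline, and the source path, columns, and data-quality rule here are hypothetical:

```python
import dlt
from pyspark.sql import functions as F

# Declare the raw table: DLT materializes it and tracks its lineage.
@dlt.table(comment="Raw orders ingested from cloud storage.")
def orders_raw():
    return spark.read.format("json").load("/mnt/landing/orders/")

# Declare the cleaned table; DLT infers the dependency on orders_raw
# and handles orchestration, retries, and monitoring for us.
@dlt.table(comment="Typed, deduplicated, validated orders.")
@dlt.expect_or_drop("valid_amount", "amount > 0")  # data-quality rule
def orders_clean():
    return (
        dlt.read("orders_raw")
        .withColumn("order_date", F.to_date("order_ts"))
        .dropDuplicates(["order_id"])
    )
```

We describe what each table should contain; the framework works out execution order and surfaces data-quality metrics.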
Delta Lake builds on data lakes, storing data in cheap cloud storage (or Hadoop's HDFS) in a columnar, Parquet-based format. It ensures data integrity and consistency with ACID transactions, and it enforces schemas to prevent corrupt or incorrect data from being written into a table. Delta Lake allows multiple types of processing to happen in parallel, including batch and real-time streaming against the same table, and it supports in-place optimization and updates.
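Here's a brief PySpark sketch of those behaviors, assuming a cluster with Delta Lake available (the default on Databricks); the table name and toy rows are hypothetical:

```python
from delta.tables import DeltaTable

# Writes are ACID, and the schema is enforced: appending a DataFrame
# with mismatched columns fails instead of corrupting the table.
df = spark.createDataFrame([(1, "active")], ["account_id", "status"])
df.write.format("delta").mode("append").saveAsTable("sales.crm.status")

# Tables stay updatable: MERGE performs an in-place upsert.
updates = spark.createDataFrame([(1, "closed")], ["account_id", "status"])
(DeltaTable.forName(spark, "sales.crm.status").alias("t")
    .merge(updates.alias("u"), "t.account_id = u.account_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# The same table can simultaneously feed real-time consumers.
stream = spark.readStream.table("sales.crm.status")
```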
Mosaic AI:
Mosaic AI is the newest AI playground. It features many large language models (LLMs), so the possibilities are endless. In addition to LLMs, the tools include:
- MLflow for multi-model evaluation
- AutoML for wizard-based ML investigation
- Models in the catalog
- Model publishing format
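As a small example of how these pieces connect, here's a sketch of logging a scikit-learn model with MLflow and registering it into Unity Catalog, so it lives alongside tables and volumes. The model name and toy training data are hypothetical:

```python
import mlflow
from mlflow.models import infer_signature
from sklearn.linear_model import LogisticRegression

mlflow.set_registry_uri("databricks-uc")  # register into Unity Catalog

# Toy training data, purely for illustration.
X, y = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        "model",
        signature=infer_signature(X, model.predict(X)),
        registered_model_name="sales.crm.churn_classifier",
    )
```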
Greg brought all these learnings and more back to our team, and we’ve already begun implementing some of the new features and software.
We have started enabling Unity Catalog and using it in customer projects where we have existing tables. A key feature for us is the AI integration that helps describe columns; we're hoping to experiment with it by loading our data dictionaries to see how the generated descriptions compare.
Over the past few months, we've been working on optimizing and scaling our Data and Technology Roadmap offering using AI. A limitation of alternative AI tools has been their LLMs and the inability to pre-seed a model with our key terms and industry jargon. We plan to experiment with Mosaic AI's capabilities to see how they can enhance our Roadmap consulting service even more.
Everest Customer Solutions has also become a Databricks Partner! We are excited to continue learning about the many forthcoming launches and new capabilities as our team engages in partner-based training. As Greg said upon his return from San Francisco, we need to start now!