Developer Blog | dbt Developer Hub

Snowflake feature store and dbt: A bridge between data pipelines and ML

October 8, 2024 · 14 min read

Senior Partner Sales Engineer @ Snowflake

Partner Solutions Architect @ dbt Labs

Flying home into Detroit this past week working on this blog post on a plane and saw for the first time, the newly connected deck of the Gordie Howe International bridge spanning the Detroit River and connecting the U.S. and Canada. The image stuck out because, in one sense, a feature store is a bridge between the clean, consistent datasets and the machine learning models that rely upon this data. But, more interesting than the bridge itself is the massive process of coordination needed to build it. This construction effort — I think — can teach us more about processes and the need for feature stores in machine learning (ML).

Think of the manufacturing materials needed as our data and the building of the bridge as the building of our ML models. There are thousands of engineers and construction workers taking materials from all over the world, pulling only the specific pieces needed for each part of the project. However, to make this project truly work at this scale, we need the warehousing and logistics to ensure that each load of concrete rebar and steel meets the standards for quality and safety needed and is available to the right people at the right time — as even a single fault can have catastrophic consequences or cause serious delays in project success. This warehouse and the associated logistics play the role of the feature store, ensuring that data is delivered consistently where and when it is needed to train and run ML models.

What is a feature?

A feature is a transformed or enriched data that serves as an input into a machine learning model to make predictions. In machine learning, a data scientist derives features from various data sources to build a model that makes predictions based on historical data. To capture the value from this model, the enterprise must operationalize the data pipeline, ensuring that the features being used in production at inference time match those being used in training and development.

What role does dbt play in getting data ready for ML models?

dbt is the standard for data transformation in the enterprise. Organizations leverage dbt at scale to deliver clean and well-governed datasets wherever and whenever they are needed. Using dbt to manage the data transformation processes to cleanse and prepare datasets used in feature development will ensure consistent datasets of guaranteed data quality — meaning that feature development will be consistent and reliable.

Who is going to use this and what benefits will they see?

Snowflake and dbt are already a well-established and trusted combination for delivering data excellence across the enterprise. The ability to register dbt pipelines in the Snowflake Feature Store further extends this combination for ML and AI workloads, while fitting naturally into the data engineering and feature pipelines already present in dbt.

Some of the key benefits are:

Feature collaboration — Data scientists, data analysts, data engineers, and machine learning engineers collaborate on features used in machine learning models in both Python and SQL, enabling teams to share and reuse features. As a result, teams can improve the time to value of models while improving the understanding of their components. This is all backed by Snowflake’s role-based access control (RBAC) and governance.
Feature consistency — Teams are assured that features generated for training sets and those served for model inference are consistent. This can especially be a concern for large organizations where multiple versions of the truth might persist. Much like how dbt and Snowflake help enterprises have a single source of data truth, now they can have a single source of truth for features.
Feature visibility and use — The Snowflake Feature Store provides an intuitive SDK to work with ML features and their associated metadata. In addition, users can browse and search for features in the Snowflake UI, providing an easy way to identify features
Point-in-time correctness — Snowflake retrieves point-in-time correct features using ASOF Joins, removing the significant complexity in generating the right feature value for a given time period whether for training or batch prediction retrieval.
Integration with data pipelines — Teams that have already built data pipelines in dbt can continue to use these with the Snowflake Feature Store. No additional migration or feature re-creation is necessary as teams plug into the same pipelines.

Why did we integrate/build this with Snowflake?

How does dbt help with ML workloads today? dbt plays a pivotal role in preparing data for ML models by transforming raw data into a format suitable for feature engineering. It helps orchestrate and automate these transformations, ensuring that data is clean, consistent, and ready for ML applications. The combination of Snowflake’s powerful AI Data Cloud and dbt’s transformation prowess makes it an unbeatable pair for organizations aiming to scale their ML operations efficiently.

Making it easier for ML/Data Engineers to both build & deploy ML data & models

dbt is a perfect tool to promote collaboration between data engineers, ML engineers, and data scientists. dbt is designed to support collaboration and quality of data pipelines through features including version control, environments and development life cycles, as well as built-in data and pipeline testing. Leveraging dbt means that data engineers and data scientists can collaborate and develop new models and features while maintaining the rigorous governance and high quality that's needed.

Additionally, dbt Mesh makes maintaining domain ownership extremely easy by breaking up portions of our data projects and pipelines into connected projects where critical models can be published for consumption by others with strict data contracts enforcing quality and governance. This paradigm supports rapid development as each project can be kept to a maintainable size for its contributors and developers. Contracting on published models used between these projects ensures the consistency of the integration points between them.

Finally, dbt Cloud also provides dbt Explorer — a perfect tool to catalog and share knowledge about organizational data across disparate teams. dbt Explorer provides a central place for information on data pipelines, including lineage information, data freshness, and quality. Best of all, dbt Explorer updates every time dbt jobs run, ensuring this information is always up-to-date and relevant.

What tech is at play?

Here’s what you need from dbt. dbt should be used to manage data transformation pipelines and generate the datasets needed by ML engineers and data scientists maintaining the Snowflake Feature Store. dbt Cloud Enterprise users should leverage dbt Mesh to create different projects with clear owners for these different domains of data pipelines. This Mesh design will promote easier collaboration by keeping each dbt project smaller and more manageable for the people building and maintaining it. dbt also supports both SQL and Python-based transformations making it an ideal fit for AI/ML workflows, which commonly leverage both languages.

Using dbt for the data transformation pipelines will also ensure the quality and consistency of data products, which is critical for ensuring successful AI/ML efforts.

Snowflake ML overview

The Feature Store is one component of Snowflake ML’s integrated suite of machine learning features that powers end-to-end machine learning within a single platform. Data scientists and ML engineers leverage ready-to-use ML functions or build custom ML workflows all without any data movement or without sacrificing governance. Snowflake ML includes scalable feature engineering and model training capabilities. Meanwhile, the Feature Store and Model Registry allow teams to store and use features and models in production, providing an end-to-end suite for operating ML workloads at scale.

What do you need to do to make it all work?

dbt Cloud offers the fastest and easiest way to run dbt. It offers a Cloud-based IDE, Cloud-attached CLI, and even a low-code visual editor option (currently in beta), meaning it’s perfect for connecting users across different teams with different workflows and tooling preferences, which is very common in AI/ML workflows. This is the tool you will use to prepare and manage data for AI/ML, promote collaboration across the different teams needed for a successful AI/ML workflow, and ensure the quality and consistency of the underlying data that will be used to create features and train models.

Organizations interested in AI/ML workflows through Snowflake should also look at the new dbt Snowflake Native App — a Snowflake Native Application that extends the functionality of dbt Cloud into Snowflake. Of particular interest is Ask dbt — a chatbot that integrates directly with Snowflake Cortex and the dbt Semantic Layer to allow natural language questions of Snowflake data.

How to power ML pipelines with dbt and Snowflake’s Feature Store

Let’s provide a brief example of what this workflow looks like in dbt and Snowflake to build and use the powerful capabilities of a Feature Store. For this example, consider that we have a data pipeline in dbt to process customer transaction data. Various data science teams in the organization need to derive features from these transactions to use in various models, including to predict fraud and perform customer segmentation and personalization. These different use cases all benefit from having related features, such as the count of transactions or purchased amounts over different periods of time (for example, the last day, 7 days, or 30 days) for a given customer.

Instead of the data scientists building out their own workflows to derive these features, let’s look at the flow of using dbt to manage the feature pipeline and Snowflake’s Feature Store to solve this problem. The following subsections describe the workflow step by step.

Create feature tables as dbt models

The first step consists of building out a feature table as a dbt model. Data scientists and data engineers plug in to existing dbt pipelines and derive a table that includes the underlying entity (for example, customer id, timestamp and feature values). The feature table aggregates the needed features at the appropriate timestamp for a given entity. Note that Snowflake provides various common feature and query patterns available here. So, in our example, we would see a given customer, timestamp, and features representing transaction counts and sums over various periods. Data scientists can use SQL or Python directly in dbt to build this table, which will push down the logic into Snowflake, allowing data scientists to use their existing skill set.

Window aggregations play an important role in the creation of features. Because the logic for these aggregations is often complex, let’s see how Snowflake and dbt make this process easier by leveraging Don’t Repeat Yourself (DRY) principles. We’ll create a macro that will allow us to use Snowflake’s range between syntax in a repeatable way:

{% macro rolling_agg(column, partition_by, order_by, interval='30 days', agg_function='sum') %}
   {{ agg_function }}({{ column }}) over (
       partition by {{ partition_by }}
       order by {{ order_by }}
       range between interval '{{ interval }}' preceding and current row
   )
{% endmacro %}

Now, we use this macro in our feature table to build out various aggregations of customer transactions over the last day, 7 days, and 30 days. Snowflake has just taken significant complexity away in generating appropriate feature values and dbt has just made the code even more readable and repeatable. While the following example is built in SQL, teams can also build these pipelines using Python directly.

select
   tx_datetime,
   customer_id,
   tx_amount,
   {{ rolling_agg("TX_AMOUNT", "CUSTOMER_ID", "TX_DATETIME", "1 days", "sum") }}
   as tx_amount_1d,
   {{ rolling_agg("TX_AMOUNT", "CUSTOMER_ID", "TX_DATETIME", "7 days", "sum") }}
   as tx_amount_7d,
   {{ rolling_agg("TX_AMOUNT", "CUSTOMER_ID", "TX_DATETIME", "30 days", "sum") }}
   as tx_amount_30d,
   {{ rolling_agg("TX_AMOUNT", "CUSTOMER_ID", "TX_DATETIME", "1 days", "avg") }}
   as tx_amount_avg_1d,
   {{ rolling_agg("TX_AMOUNT", "CUSTOMER_ID", "TX_DATETIME", "7 days", "avg") }}
   as tx_amount_avg_7d,
   {{ rolling_agg("TX_AMOUNT", "CUSTOMER_ID", "TX_DATETIME", "30 days", "avg") }}
   as tx_amount_avg_30d,
   {{ rolling_agg("*", "CUSTOMER_ID", "TX_DATETIME", "1 days", "count") }}
   as tx_cnt_1d,
   {{ rolling_agg("*", "CUSTOMER_ID", "TX_DATETIME", "7 days", "count") }}
   as tx_cnt_7d,
   {{ rolling_agg("*", "CUSTOMER_ID", "TX_DATETIME", "30 days", "count") }}
   as tx_cnt_30d
from {{ ref("stg_transactions") }}

Create or connect to a Snowflake Feature Store

Once a feature table is built in dbt, data scientists use Snowflake’s snowflake-ml-python package to create or connect to an existing Feature Store in Snowflake. Data scientists can do this all in Python, including in Jupyter Notebooks or directly in Snowflake using Snowflake Notebooks.

Let’s go ahead and create the Feature Store in Snowflake:

from snowflake.ml.feature_store import (
    FeatureStore,
    FeatureView,
    Entity,
    CreationMode
)

fs = FeatureStore(
    session=session, 
    database=fs_db, 
    name=fs_schema, 
    default_warehouse='WH_DBT',
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)

Create and register feature entities

The next step consists of creating and registering entities. These represent the underlying objects that features are associated with, forming the join keys used for feature lookups. In our example, the data scientist can register various entities, including for the customer, a transaction id, or other necessary attributes.

Let’s create some example entities.

customer = Entity(name="CUSTOMER", join_keys=["CUSTOMER_ID"])
transaction = Entity(name="TRANSACTION", join_keys=["TRANSACTION_ID"])
fs.register_entity(customer)
fs.register_entity(transaction)

Register feature tables as feature views

After registering entities, the next step is to register a feature view. This represents a group of related features that stem from the features tables created in the dbt model. In this case, note that the feature logic, refresh, and consistency is managed by the dbt pipeline. The feature view in Snowflake enables versioning of the features while providing discoverability among teams.

# Create a dataframe from our feature table produced in dbt
customers_transactions_df = session.sql(f"""
    SELECT 
        CUSTOMER_ID,
        TX_DATETIME,
        TX_AMOUNT_1D,
        TX_AMOUNT_7D,
        TX_AMOUNT_30D,
        TX_AMOUNT_AVG_1D,
        TX_AMOUNT_AVG_7D,
        TX_AMOUNT_AVG_30D,
        TX_CNT_1D,
        TX_CNT_7D,
        TX_CNT_30D     
    FROM {fs_db}.{fs_data_schema}.ft_customer_transactions
    """)

# Create a feature view on top of these features
customer_transactions_fv = FeatureView(
    name="customer_transactions_fv", 
    entities=[customer],
    feature_df=customers_transactions_df,
    timestamp_col="TX_DATETIME",
    refresh_freq=None,
    desc="Customer transaction features with window aggregates")

# Register the feature view for use beyond the session
customer_transactions_fv = fs.register_feature_view(
    feature_view=customer_transactions_fv,
    version="1",
    #overwrite=True,
    block=True)

Search and discover features in the Snowflake UI

Now, with features created, teams can view their features directly in the Snowflake UI, as shown below. This enables teams to easily search and browse features, all governed through Snowflake’s role-based access control (RBAC).

Example of Snowflake UI

Generate training dataset

Now that the feature view is created, data scientists produce a training dataset that uses the feature view. In our example, whether the data scientist is building a fraud or segmentation model, they will retrieve point-in-time correct features for a customer at a specific point in time using the Feature Store’s generate_training_set method.

To generate the training set, we need to supply a spine dataframe, representing the entities and timestamp values that we will need to retrieve features for. The following example shows this using a few records, although teams can leverage other tables to produce this spine.

spine_df = session.create_dataframe(
    [
        ('1', '3937', "2019-05-01 00:00"), 
        ('2', '2', "2019-05-01 00:00"),
        ('3', '927', "2019-05-01 00:00"),
    ], 
    schema=["INSTANCE_ID", "CUSTOMER_ID", "EVENT_TIMESTAMP"])

train_dataset = fs.generate_dataset(
    name= "customers_fv",
    version= "1_0",
    spine_df=spine_df,
    features=[customer_transactions_fv],
    spine_timestamp_col= "EVENT_TIMESTAMP",
    spine_label_cols = []
)

Now that we have produced the training dataset, let’s see what it looks like.

Example of training dataset

Train and deploy a model

Now with this training set, data scientists can use Snowflake Snowpark and Snowpark ML Modeling to use familiar Python frameworks for additional preprocessing, feature engineering, and model training all within Snowflake. The model can be registered in the Snowflake Model Registry for secure model management. Note that we will leave the model training for you as part of this exercise.

Retrieve features for predictions

For inference, data pipelines retrieve feature values using the retrieve_feature_values method. These retrieved values can be fed directly to a model’s predict capability in your Python session using a developed model or by invoking a model’s predict method from Snowflake’s Model Registry. For batch scoring purposes, teams can build this entire pipeline using Snowflake ML. The following code demonstrates how the features are retrieved using this method.

infernce_spine = session.create_dataframe(
    [
        ('1', '3937', "2019-07-01 00:00"), 
        ('2', '2', "2019-07-01 00:00"),
        ('3', '927', "2019-07-01 00:00"),
    ], 
    schema=["INSTANCE_ID", "CUSTOMER_ID", "EVENT_TIMESTAMP"])

inference_dataset = fs.retrieve_feature_values(
    spine_df=infernce_spine,
    features=[customer_transactions_fv],
    spine_timestamp_col="EVENT_TIMESTAMP",
)

inference_dataset.to_pandas()

Here’s an example view of our features produced for model inferencing.

Example of training data set

Conclusion

We’ve just seen how quickly and easily you can begin to develop features through dbt and leverage the Snowflake Feature Store to deliver predictive modeling as part of your data pipelines. The ability to build and deploy ML models, including integrating feature storage, data transformation, and ML logic within a single platform, simplifies the entire ML life cycle. Combining this new power with the well-established partnership of dbt and Snowflake unlocks even more potential for organizations to safely build and explore new AI/ML use cases and drive further collaboration in the organization.

The code used in the examples above is publicly available on GitHub. Also, you can run a full example yourself in this quickstart guide from the Snowflake docs.

Iceberg Is An Implementation Detail

October 4, 2024 · 6 min read

Amy Chen

Product Manager @ dbt Labs

If you haven’t paid attention to the data industry news cycle, you might have missed the recent excitement centered around an open table format called Apache Iceberg™. It’s one of many open table formats like Delta Lake, Hudi, and Hive. These formats are changing the way data is stored and metadata accessed. They are groundbreaking in many ways.

But I have to be honest: I don’t care. But not for the reasons you think.

How Hybrid Mesh unlocks dbt collaboration at scale

September 30, 2024 · 7 min read

Jason Ganz

Developer Experience @ dbt Labs

One of the most important things that dbt does is unlock the ability for teams to collaborate on creating and disseminating organizational knowledge.

In the past, this primarily looked like a team working in one dbt Project to create a set of transformed objects in their data platform.

As dbt was adopted by larger organizations and began to drive workloads at a global scale, it became clear that we needed mechanisms to allow teams to operate independently from each other, creating and sharing data models across teams — dbt Mesh.

How to build a Semantic Layer in pieces: step-by-step for busy analytics engineers

July 10, 2024 · 10 min read

Gwen Windflower

The dbt Semantic Layer is founded on the idea that data transformation should be both flexible, allowing for on-the-fly aggregations grouped and filtered by definable dimensions and version-controlled and tested. Like any other codebase, you should have confidence that your transformations express your organization’s business logic correctly. Historically, you had to choose between these options, but the dbt Semantic Layer brings them together. This has required new paradigms for how you express your transformations though.

Putting Your DAG on the internet

June 14, 2024 · 5 min read

Ernesto Ongaro

Senior Solutions Architect @ dbt Labs

Sebastian Stan

Data Engineer @ EQT Group

Filip Byrén

VP and Software Architect @ EQT Group

New in dbt: allow Snowflake Python models to access the internet

With dbt 1.8, dbt released support for Snowflake’s external access integrations further enabling the use of dbt + AI to enrich your data. This allows querying of external APIs within dbt Python models, a functionality that was required for dbt Cloud customer, EQT AB. Learn about why they needed it and how they helped build the feature and get it shipped!

Up and Running with Azure Synapse on dbt Cloud

May 17, 2024 · 11 min read

Anders Swanson

Senior Developer Experience Advocate @ dbt Labs

At dbt Labs, we’ve always believed in meeting analytics engineers where they are. That’s why we’re so excited to announce that today, analytics engineers within the Microsoft Ecosystem can use dbt Cloud with not only Microsoft Fabric but also Azure Synapse Analytics Dedicated SQL Pools (ASADSP).

Since the early days of dbt, folks have been interested having MSFT data platforms. Huge shoutout to Mikael Ene and Jacob Mastel for their efforts back in 2019 on the original SQL Server adapters (dbt-sqlserver and dbt-mssql, respectively)

The journey for the Azure Synapse dbt adapter, dbt-synapse, is closely tied to my journey with dbt. I was the one who forked dbt-sqlserver into dbt-synapse in April of 2020. I had first learned of dbt only a month earlier and knew immediately that my team needed the tool. With a great deal of assistance from Jeremy and experts at Microsoft, my team and I got it off the ground and started using it. When I left my team at Avanade in early 2022 to join dbt Labs, I joked that I wasn’t actually leaving the team; I was just temporarily embedding at dbt Labs to expedite dbt Labs getting into Cloud. Two years later, I can tell my team that the mission has been accomplished! Kudos to all the folks who have contributed to the TSQL adapters either directly in GitHub or in the community Slack channels. The integration would not exist if not for you!

Unit testing in dbt for test-driven development

May 7, 2024 · 9 min read

Doug Beatty

Senior Developer Experience Advocate @ dbt Labs

Do you ever have "bad data" dreams? Or am I the only one that has recurring nightmares? 😱

Here's the one I had last night:

It began with a midnight bug hunt. A menacing insect creature has locked my colleagues in a dungeon, and they are pleading for my help to escape . Finding the key is elusive and always seems just beyond my grasp. The stress is palpable, a physical weight on my chest, as I raced against time to unlock them.

Of course I wake up without actually having saved them, but I am relieved nonetheless. And I've had similar nightmares involving a heroic code refactor or the launch of a new model or feature.

Good news: beginning in dbt v1.8, we're introducing a first-class unit testing framework that can handle each of the scenarios from my data nightmares.

Before we dive into the details, let's take a quick look at how we got here.

Conversational Analytics: A Natural Language Interface to your Snowflake Data

May 2, 2024 · 12 min read

Doug Guthrie

Senior Solutions Architect @ dbt Labs

Introduction

As a solutions architect at dbt Labs, my role is to help our customers and prospects understand how to best utilize the dbt Cloud platform to solve their unique data challenges. That uniqueness presents itself in different ways - organizational maturity, data stack, team size and composition, technical capability, use case, or some combination of those. With all those differences though, there has been one common thread throughout most of my engagements: Generative AI and Large Language Models (LLMs). Data teams are either 1) proactively thinking about applications for it in the context of their work or 2) being pushed to think about it by their stakeholders. It has become the elephant in every single (zoom) room I find myself in.

How we're making sure you can confidently go "Versionless" in dbt Cloud

May 2, 2024 · 10 min read

Michelle Ark

Staff Software Engineer @ dbt Labs

Chenyu Li

Staff Software Engineer @ dbt Labs

Colin Rogers

Senior Software Engineer @ dbt Labs

As long as dbt Cloud has existed, it has required users to select a version of dbt Core to use under the hood in their jobs and environments. This made sense in the earliest days, when dbt Core minor versions often included breaking changes. It provided a clear way for everyone to know which version of the underlying runtime they were getting.

However, this came at a cost. While bumping a project's dbt version appeared as simple as selecting from a dropdown, there was real effort required to test the compatibility of the new version against existing projects, package dependencies, and adapters. On the other hand, putting this off meant foregoing access to new features and bug fixes in dbt.

But no more. Today, we're ready to announce the general availability of a new option in dbt Cloud: "Versionless."

Maximum override: Configuring unique connections in dbt Cloud

April 22, 2024 · 6 min read

Gwen Windflower

dbt Cloud now includes a suite of new features that enable configuring precise and unique connections to data platforms at the environment and user level. These enable more sophisticated setups, like connecting a project to multiple warehouse accounts, first-class support for staging environments, and user-level overrides for specific dbt versions. This gives dbt Cloud developers the features they need to tackle more complex tasks, like Write-Audit-Publish (WAP) workflows and safely testing dbt version upgrades. While you still configure a default connection at the project level and per-developer, you now have tools to get more advanced in a secure way. Soon, dbt Cloud will take this even further allowing multiple connections to be set globally and reused with global connections.

LLM-powered Analytics Engineering: How we're using AI inside of our dbt project, today, with no new tools.

March 19, 2024 · 10 min read

Joel Labes

Senior Developer Experience Advocate @ dbt Labs

Cloud Data Platforms make new things possible; dbt helps you put them into production

The original paradigm shift that enabled dbt to exist and be useful was databases going to the cloud.

All of a sudden it was possible for more people to do better data work as huge blockers became huge opportunities:

We could now dynamically scale compute on-demand, without upgrading to a larger on-prem database.
We could now store and query enormous datasets like clickstream data, without pre-aggregating and transforming it.

Today, the next wave of innovation is happening in AI and LLMs, and it's coming to the cloud data platforms dbt practitioners are already using every day. For one example, Snowflake have just released their Cortex functions to access LLM-powered tools tuned for running common tasks against your existing datasets. In doing so, there are a new set of opportunities available to us:

Column-Level Lineage, Model Performance, and Recommendations: ship trusted data products with dbt Explorer

February 13, 2024 · 9 min read

Dave Connors

Senior Developer Experience Advocate @ dbt Labs

What’s in a data platform?

Raising a dbt project is hard work. We, as data professionals, have poured ourselves into raising happy healthy data products, and we should be proud of the insights they’ve driven. It certainly wasn’t without its challenges though — we remember the terrible twos, where we worked hard to just get the platform to walk straight. We remember the angsty teenage years where tests kept failing, seemingly just to spite us. A lot of blood, sweat, and tears are shed in the service of clean data!

Once the project could dress and feed itself, we also worked hard to get buy-in from our colleagues who put their trust in our little project. Without deep trust and understanding of what we built, our colleagues who depend on your data (or even those involved in developing it with you — it takes a village after all!) are more likely to be in your DMs with questions than in their BI tools, generating insights.

When our teammates ask about where the data in their reports come from, how fresh it is, or about the right calculation for a metric, what a joy! This means they want to put what we’ve built to good use — the challenge is that, historically, it hasn’t been all that easy to answer these questions well. That has often meant a manual, painstaking process of cross checking run logs and your dbt documentation site to get the stakeholder the information they need.

Enter dbt Explorer! dbt Explorer centralizes documentation, lineage, and execution metadata to reduce the work required to ship trusted data products faster.

Serverless, free-tier data stack with dlt + dbt core.

January 15, 2024 · 8 min read

Euan Johnston

The problem, the builder and tooling

The problem: My partner and I are considering buying a property in Portugal. There is no reference data for the real estate market here - how many houses are being sold, for what price? Nobody knows except the property office and maybe the banks, and they don’t readily divulge this information. The only data source we have is Idealista, which is a portal where real estate agencies post ads.

Unfortunately, there are significantly fewer properties than ads - it seems many real estate companies re-post the same ad that others do, with intentionally different data and often misleading bits of info. The real estate agencies do this so the interested parties reach out to them for clarification, and from there they can start a sales process. At the same time, the website with the ads is incentivised to allow this to continue as they get paid per ad, not per property.

The builder: I’m a data freelancer who deploys end to end solutions, so when I have a data problem, I cannot just let it go.

The tools: I want to be able to run my project on Google Cloud Functions due to the generous free tier. dlt is a new Python library for declarative data ingestion which I have wanted to test for some time. Finally, I will use dbt Core for transformation.

Deprecation of dbt Server

January 15, 2024 · 2 min read

Roxi Dahlke

Product Manager @ dbt Labs

Summary

We’re announcing that dbt Server is officially deprecated and will no longer be maintained by dbt Labs going forward. You can continue to use the repository and fork it for your needs. We’re also looking for a maintainer of the repository from our community! If you’re interested, please reach out by opening an issue in the repository.

Why are we deprecating dbt Server?

At dbt Labs, we are continually working to build rich experiences that help our users scale collaboration around data. To achieve this vision, we need to take moments to think about which products are contributing to this goal, and sometimes make hard decisions about the ones that are not, so that we can focus on enhancing the most impactful ones.

dbt Server previously supported our legacy Semantic Layer, which was fully deprecated in December 2023. In October 2023, we introduced the GA of the revamped dbt Semantic Layer with significant improvements, made possible by the acquisition of Transform and the integration of MetricFlow into dbt.

The dbt Semantic Layer is now fully independent of dbt Server and operates on MetricFlow Server, a powerful new proprietary technology designed for enhanced scalability. We’re incredibly excited about the new updates and encourage you to check out our documentation, as well as this blog on how the product works.

The deprecation of dbt Server and updates to the Semantic Layer signify the evolution of the dbt ecosystem towards more focus on in product and out-of-the-box experiences around connectivity, scale, and flexibility. We are excited that you are along with us on this journey.

More time coding, less time waiting: Mastering defer in dbt

January 9, 2024 · 9 min read

Dave Connors

Senior Developer Experience Advocate @ dbt Labs

Picture this — you’ve got a massive dbt project, thousands of models chugging along, creating actionable insights for your stakeholders. A ticket comes your way — a model needs to be refactored! "No problem," you think to yourself, "I will simply make that change and test it locally!" You look at your lineage, and realize this model is many layers deep, buried underneath a long chain of tables and views.

“OK,” you think further, “I’ll just run a dbt build -s +my_changed_model to make sure I have everything I need built into my dev schema and I can test my changes”. You run the command. You wait. You wait some more. You get some coffee, and completely take yourself out of your dbt development flow state. A lot of time and money down the drain to get to a point where you can start your work. That’s no good!

Luckily, dbt’s defer functionality allow you to only build what you care about when you need it, and nothing more. This feature helps developers spend less time and money in development, helping ship trusted data products faster. dbt Cloud offers native support for this workflow in development, so you can start deferring without any additional overhead!

How to integrate with dbt

December 20, 2023 · 9 min read

Amy Chen

Product Manager @ dbt Labs

Overview

Over the course of my three years running the Partner Engineering team at dbt Labs, the most common question I've been asked is, How do we integrate with dbt? Because those conversations often start out at the same place, I decided to create this guide so I’m no longer the blocker to fundamental information. This also allows us to skip the intro and get to the fun conversations so much faster, like what a joint solution for our customers would look like.

This guide doesn't include how to integrate with dbt Core. If you’re interested in creating a dbt adapter, please check out the adapter development guide instead.

Instead, we're going to focus on integrating with dbt Cloud. Integrating with dbt Cloud is a key requirement to become a dbt Labs technology partner, opening the door to a variety of collaborative commercial opportunities.

Here I'll cover how to get started, potential use cases you want to solve for, and points of integrations to do so.

How we built consistent product launch metrics with the dbt Semantic Layer

December 12, 2023 · 9 min read

Jordan Stein

Product Manager @ dbt Labs

There’s nothing quite like the feeling of launching a new product. On launch day emotions can range from excitement, to fear, to accomplishment all in the same hour. Once the dust settles and the product is in the wild, the next thing the team needs to do is track how the product is doing. How many users do we have? How is performance looking? What features are customers using? How often? Answering these questions is vital to understanding the success of any product launch.

At dbt we recently made the Semantic Layer Generally Available. The Semantic Layer lets teams define business metrics centrally, in dbt, and access them in multiple analytics tools through our semantic layer APIs. I’m a Product Manager on the Semantic Layer team, and the launch of the Semantic Layer put our team in an interesting, somewhat “meta,” position: we need to understand how a product launch is doing, and the product we just launched is designed to make defining and consuming metrics much more efficient. It’s the perfect opportunity to put the semantic layer through its paces for product analytics. This blog post walks through the end-to-end process we used to set up product analytics for the dbt Semantic Layer using the dbt Semantic Layer.

Why you should specify a production environment in dbt Cloud

November 14, 2023 · 5 min read

Joel Labes

Senior Developer Experience Advocate @ dbt Labs

You can now use a Staging environment!

This blog post was written before Staging environments. You can now use dbt Cloud can to support the patterns discussed here. Read more about Staging environments.

The Bottom Line:

You should split your Jobs across Environments in dbt Cloud based on their purposes (e.g. Production and Staging/CI) and set one environment as Production. This will improve your CI experience and enable you to use dbt Explorer.

Environmental segmentation has always been an important part of the analytics engineering workflow:

When developing new models you can process a smaller subset of your data by using target.name or an environment variable.
By building your production-grade models into a different schema and database, you can experiment in peace without being worried that your changes will accidentally impact downstream users.
Using dedicated credentials for production runs, instead of an analytics engineer's individual dev credentials, ensures that things don't break when that long-tenured employee finally hangs up their IDE.

Historically, dbt Cloud required a separate environment for Development, but was otherwise unopinionated in how you configured your account. This mostly just worked – as long as you didn't have anything more complex than a CI job mixed in with a couple of production jobs – because important constructs like deferral in CI and documentation were only ever tied to a single job.

But as companies' dbt deployments have grown more complex, it doesn't make sense to assume that a single job is enough anymore. We need to exchange a job-oriented strategy for a more mature and scalable environment-centric view of the world. To support this, a recent change in dbt Cloud enables project administrators to mark one of their environments as the Production environment, just as has long been possible for the Development environment.

Explicitly separating your Production workloads lets dbt Cloud be smarter with the metadata it creates, and is particularly important for two new features: dbt Explorer and the revised CI workflows.

To defer or to clone, that is the question

October 31, 2023 · 6 min read

Kshitij Aranke

Senior Software Engineer @ dbt Labs

Doug Beatty

Senior Developer Experience Advocate @ dbt Labs

Hi all, I’m Kshitij, a senior software engineer on the Core team at dbt Labs. One of the coolest moments of my career here thus far has been shipping the new dbt clone command as part of the dbt-core v1.6 release.

However, one of the questions I’ve received most frequently is guidance around “when” to clone that goes beyond the documentation on “how” to clone. In this blog post, I’ll attempt to provide this guidance by answering these FAQs:

What is dbt clone?
How is it different from deferral?
Should I defer or should I clone?

Optimizing Materialized Views with dbt

August 3, 2023 · 11 min read

Amy Chen

Product Manager @ dbt Labs

note

This blog post was updated on December 18, 2023 to cover the support of MVs on dbt-bigquery and updates on how to test MVs.

Introduction

The year was 2020. I was a kitten-only household, and dbt Labs was still Fishtown Analytics. A enterprise customer I was working with, Jetblue, asked me for help running their dbt models every 2 minutes to meet a 5 minute SLA.

After getting over the initial terror, we talked through the use case and soon realized there was a better option. Together with my team, I created lambda views to meet the need.

Flash forward to 2023. I’m writing this as my giant dog snores next to me (don’t worry the cats have multiplied as well). Jetblue has outgrown lambda views due to performance constraints (a view can only be so performant) and we are at another milestone in dbt’s journey to support streaming. What. a. time.

Today we are announcing that we now support Materialized Views in dbt. So, what does that mean?

Developer Blog | dbt Developer Hub

What is a feature?​

What role does dbt play in getting data ready for ML models?​

Who is going to use this and what benefits will they see?​

Why did we integrate/build this with Snowflake?​

Making it easier for ML/Data Engineers to both build & deploy ML data & models​

What tech is at play?​

Snowflake ML overview​

What do you need to do to make it all work?​

How to power ML pipelines with dbt and Snowflake’s Feature Store​

Create feature tables as dbt models​

Create or connect to a Snowflake Feature Store​

Create and register feature entities​

Register feature tables as feature views​

Search and discover features in the Snowflake UI​

Generate training dataset​

Train and deploy a model​

Retrieve features for predictions​

Conclusion​

Introduction​

Cloud Data Platforms make new things possible; dbt helps you put them into production​

What’s in a data platform?​

The problem, the builder and tooling​

Summary​

Why are we deprecating dbt Server?​

Overview​

Introduction​

What is a feature?

What role does dbt play in getting data ready for ML models?

Who is going to use this and what benefits will they see?

Why did we integrate/build this with Snowflake?

Making it easier for ML/Data Engineers to both build & deploy ML data & models

What tech is at play?

Snowflake ML overview

What do you need to do to make it all work?

How to power ML pipelines with dbt and Snowflake’s Feature Store

Create feature tables as dbt models

Create or connect to a Snowflake Feature Store

Create and register feature entities

Register feature tables as feature views

Search and discover features in the Snowflake UI

Generate training dataset

Train and deploy a model

Retrieve features for predictions

Conclusion

Introduction

Cloud Data Platforms make new things possible; dbt helps you put them into production

What’s in a data platform?

The problem, the builder and tooling

Summary

Why are we deprecating dbt Server?

Overview

Introduction