dbt Data Transformation Best Practices for E-Commerce: A Step-by-Step Guide

Key Takeaways
- Use `updated_at` not `created_at` for incremental e-commerce models
- Enforce schema tests and freshness checks on every staging source
- Aggregate refunds to order grain before joining to prevent fan-out
- Push dbt mart data back into Salesforce via reverse ETL for CX activation
- Standardise multi-currency conversion in reusable dbt macros
Quick Answer: The core dbt best practices for e-commerce data transformation include using a three-layer model architecture (staging, intermediate, marts), incremental materialisation for high-volume order tables filtered on updated_at, mandatory schema and freshness tests on every source, and reusable macros for multi-currency conversion across markets.
Most e-commerce data teams I talk to across Asia-Pacific are drowning in raw event data but starving for reliable metrics. They have Shopify Plus or Adobe Commerce pumping out millions of order rows, returns cascading across multiple warehouse systems in Hong Kong, Singapore, and Australia, and loyalty programmes spanning markets with different currencies and tax rules. The fix isn't more dashboards—it's disciplined transformation logic. That's where dbt data transformation best practices for e-commerce become the difference between a data team that ships insights and one that spends 80% of its time debugging SQL.
Related reading: Snowflake vs Databricks for APAC Retail Data Teams: A Practical Guide
Related reading: How to Build an AI-Ready Data Foundation for Retail in Asia-Pacific
Related reading: Slackbot Salesforce Integration CRM Strategy for Distributed APAC Teams
Related reading: Top 6 AI Automation Wins for E-Commerce Ops Teams in 2025
After implementing dbt-based analytics stacks for enterprise retail clients including a 90+ store jewellery brand and a multi-market F&B group, I've distilled what actually works into a repeatable playbook. This guide walks through the exact steps, naming conventions, testing layers, and incremental strategies we use at Branch8—shaped specifically for e-commerce datasets like orders, returns, inventory, and loyalty.
Prerequisites: What You Need Before Starting
A Clear Data Warehouse Foundation
Before writing a single dbt model, confirm your warehouse layer is stable. We typically work with BigQuery for APAC clients (favourable data residency options in Tokyo, Singapore, and Sydney regions) or Snowflake for clients with US-based parent companies. Your raw data should be landing reliably via an ingestion tool—Fivetran, Airbyte, or custom connectors.
Minimum requirements:
- dbt Core 1.7+ or a dbt Cloud account (Team tier minimum for production scheduling)
- A warehouse with raw schemas populated by your e-commerce platform (Shopify Plus, Adobe Commerce, or SHOPLINE)
- Source tables for at minimum: orders, order line items, customers, products, and refunds
- Git repository access (GitHub or GitLab) with branch protection on `main`
- Python 3.9+ if running dbt Core locally
Familiarity with Your E-Commerce Data Model
Every platform structures data differently. Shopify Plus nests line items within orders and splits discount allocations across rows. Adobe Commerce (Magento 2) uses a quote-to-order pipeline with separate sales_order and sales_order_item tables. SHOPLINE, popular across Hong Kong and Taiwan, has its own API schema. Map your source tables before modelling.
We maintain a shared Notion doc per client that lists every source table, its grain (one row per what?), update frequency, and known quirks. This takes about two hours and saves weeks of debugging downstream.
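The same inventory can also live inside the dbt project itself, alongside (or instead of) the Notion doc, using source `meta` keys. This is a sketch — the `grain`, `update_frequency`, and `known_quirks` field names are our own convention, not a dbt standard:

```yaml
# _shopify__sources.yml (excerpt) — grain and quirks tracked in meta
sources:
  - name: shopify
    tables:
      - name: orders
        meta:
          grain: "one row per order"
          update_frequency: "near real-time via webhook"
          known_quirks: "soft-deleted orders remain, with cancelled_at populated"
```

Keeping this metadata next to the source definition means it shows up in `dbt docs` and survives team turnover better than an external wiki page.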
Step 1: Establish Naming Conventions and Project Structure
Why Naming Conventions Matter More Than You Think
According to dbt Labs' own style guide, inconsistent naming is the number-one cause of model sprawl in analytics engineering projects (dbt Labs, 2023). For e-commerce specifically, the problem multiplies because you're dealing with overlapping entity types—an "order" means different things in fulfilment versus finance versus marketing.
Our convention at Branch8:
- Staging models (`stg_`): One-to-one with source tables. Rename columns, cast types, filter soft-deleted rows. Nothing else.
- Intermediate models (`int_`): Business logic joins and aggregations. This is where you combine `stg_orders` with `stg_order_items` and `stg_refunds`.
- Mart models (`mart_` or `fct_`/`dim_`): Final consumption-layer tables. `fct_orders`, `dim_customers`, `dim_products`.
Project Directory Layout for Multi-Market E-Commerce
Here's the directory structure we use for clients operating across multiple APAC markets:
```text
models/
├── staging/
│   ├── shopify/
│   │   ├── stg_shopify__orders.sql
│   │   ├── stg_shopify__order_items.sql
│   │   ├── stg_shopify__refunds.sql
│   │   ├── stg_shopify__customers.sql
│   │   └── _shopify__sources.yml
│   └── loyalty/
│       ├── stg_loyalty__members.sql
│       ├── stg_loyalty__transactions.sql
│       └── _loyalty__sources.yml
├── intermediate/
│   ├── int_orders__joined_items_refunds.sql
│   ├── int_customers__enriched.sql
│   └── int_loyalty__points_balance.sql
└── marts/
    ├── finance/
    │   ├── fct_daily_revenue.sql
    │   └── dim_currency_rates.sql
    ├── marketing/
    │   ├── fct_customer_cohorts.sql
    │   └── fct_campaign_attribution.sql
    └── operations/
        ├── fct_fulfilment_sla.sql
        └── dim_warehouses.sql
```
Notice the double-underscore convention (stg_shopify__orders). The first segment identifies the source system; the second identifies the entity. This becomes critical when you have orders from Shopify Plus in one market and SHOPLINE in another—both feeding into the same intermediate layer.
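To illustrate how the two source systems converge, an intermediate model can union them into a single order stream. This is a sketch — it assumes a hypothetical `stg_shopline__orders` staging model whose columns have already been renamed to match the Shopify staging model:

```sql
-- int_orders__unioned.sql (illustrative sketch)
SELECT
    order_id,
    customer_id,
    order_date,
    'shopify_hk' AS source_system
FROM {{ ref('stg_shopify__orders') }}

UNION ALL

SELECT
    order_id,
    customer_id,
    order_date,
    'shopline_tw' AS source_system
FROM {{ ref('stg_shopline__orders') }}
```

Because each staging layer already normalised column names and types, the union is trivial — that is the payoff of keeping staging models strictly one-to-one with their sources.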
Configuring dbt_project.yml for E-Commerce
```yaml
models:
  your_project:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
      +schema: intermediate
    marts:
      +materialized: table
      +schema: analytics
      finance:
        +schema: finance
        +tags: ['daily', 'finance']
      marketing:
        +materialized: incremental
        +schema: marketing
```
Staging as views keeps storage costs near zero. Intermediate models as ephemeral means they compile into CTEs within the downstream query—no extra tables cluttering your warehouse. Marts get materialised as tables (or incremental, which we'll cover in Step 4).
Ready to Transform Your Ecommerce Operations?
Branch8 specializes in ecommerce platform implementation and AI-powered automation solutions. Contact us today to discuss your ecommerce automation strategy.
Step 2: Build a Testing Strategy That Catches E-Commerce Data Issues
Schema Tests for Every Source
Data quality in e-commerce is notoriously inconsistent. We've seen Shopify Plus webhooks drop refund records during high-traffic sales events (think Singles' Day across Greater China or 12.12 sales in Southeast Asia). Your testing layer is your safety net.
At minimum, every staging model needs:
```yaml
# _shopify__sources.yml
version: 2

sources:
  - name: shopify
    database: raw
    schema: shopify_hk
    tables:
      - name: orders
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 24, period: hour}
        columns:
          - name: id
            tests:
              - unique
              - not_null
          - name: financial_status
            tests:
              - accepted_values:
                  values: ['pending', 'paid', 'partially_refunded', 'refunded', 'voided']
```
The freshness check is non-negotiable for e-commerce. According to a 2023 Monte Carlo survey, 77% of data engineers reported experiencing data freshness issues that impacted downstream reporting (Monte Carlo, 2023). In e-commerce, stale order data means wrong revenue numbers on executive dashboards.
Custom Tests for Business Logic
Generic tests won't catch domain-specific problems. We write custom singular tests in tests/ for scenarios like:
```sql
-- tests/assert_refund_not_exceeding_order_total.sql
SELECT
    order_id,
    order_total_amount,
    total_refunded_amount
FROM {{ ref('int_orders__joined_items_refunds') }}
WHERE total_refunded_amount > order_total_amount * 1.01  -- 1% tolerance for rounding
```
This test catches a real problem we discovered at a Hong Kong fashion retailer: partial refunds were being double-counted because Shopify Plus records both refund line items and refund transactions, and a naive join duplicated the amounts.
Data Contracts with dbt_expectations
For enterprise clients, we layer on the dbt_expectations package (Calogica, v0.10+) to enforce distribution-based tests:
```yaml
models:
  - name: fct_daily_revenue
    columns:
      - name: gross_revenue
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 10000000
              row_condition: "currency = 'HKD'"
```
This prevents absurd values from propagating. A $100 million HKD daily revenue figure for a mid-market retailer would flag immediately.
Step 3: Model E-Commerce Entities with Multi-Currency and Multi-Market Logic
Handling Currency Conversion Consistently
Any e-commerce operation spanning APAC—HKD, SGD, TWD, AUD, MYR—needs a single source of truth for exchange rates. We maintain a dim_currency_rates model seeded from the European Central Bank's daily feed (ingested via Airbyte) or the client's treasury rates.
```sql
-- models/marts/finance/fct_daily_revenue.sql
WITH orders AS (
    SELECT * FROM {{ ref('int_orders__joined_items_refunds') }}
),

rates AS (
    SELECT * FROM {{ ref('dim_currency_rates') }}
)

SELECT
    o.order_date,
    o.market_code,
    o.local_currency,
    SUM(o.net_revenue_local) AS net_revenue_local,
    SUM(o.net_revenue_local * r.rate_to_usd) AS net_revenue_usd
FROM orders o
LEFT JOIN rates r
    ON o.local_currency = r.currency_code
    AND o.order_date = r.rate_date
GROUP BY 1, 2, 3
```
Key decision: do you convert at order-date rate or reporting-date rate? For operational reporting, order-date. For consolidated financial reporting, month-end closing rate. Document this in a README.md inside models/marts/finance/.
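For the consolidated-reporting variant, the join swaps the daily rate for the month-end closing rate. This is a sketch — it assumes a hypothetical `dim_currency_rates_monthly` model keyed on closing month, which is not one of the models defined above:

```sql
-- Illustrative: convert at month-end closing rate instead of order-date rate
SELECT
    DATE_TRUNC('month', o.order_date) AS revenue_month,
    SUM(o.net_revenue_local * r.closing_rate_to_usd) AS net_revenue_usd
FROM {{ ref('int_orders__joined_items_refunds') }} o
LEFT JOIN {{ ref('dim_currency_rates_monthly') }} r
    ON o.local_currency = r.currency_code
    AND DATE_TRUNC('month', o.order_date) = r.closing_month
GROUP BY 1
```

Whichever convention you pick, keep both variants as separate, clearly named models rather than toggling one model's behaviour — finance will eventually ask to reconcile the two.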
Building the Customer 360 Dimension
E-commerce customer dimensions are deceptively complex. A single customer might have:
- A Shopify Plus account in Hong Kong
- A separate SHOPLINE account in Taiwan
- A loyalty programme membership with a different ID
- Salesforce CRM records tied to their email
We build dim_customers through a deterministic matching layer in int_customers__enriched.sql:
```sql
-- Matching priority: loyalty_id > email > phone
WITH unified AS (
    SELECT
        COALESCE(l.loyalty_id, s.customer_id) AS customer_key,
        s.email,
        s.first_order_date,
        l.tier AS loyalty_tier,
        l.lifetime_points
    FROM {{ ref('stg_shopify__customers') }} s
    LEFT JOIN {{ ref('stg_loyalty__members') }} l
        ON LOWER(TRIM(s.email)) = LOWER(TRIM(l.email))
)

SELECT * FROM unified
```
This identity resolution step is where many dbt e-commerce projects fail. According to Segment's 2023 State of Personalization report, 65% of retailers cannot unify customer identity across channels (Segment, 2023)—and the problem compounds across APAC markets with different character sets (Chinese names in Traditional vs Simplified, Japanese katakana for foreign names).
Integrating CX Data from Salesforce into the Analytics Layer
For clients running Salesforce Service Cloud alongside their e-commerce stack, we bring CX interaction data into the dbt project. This is where Salesforce AI agent CX workflow integration becomes relevant—Salesforce's Agentforce and Einstein AI features generate case classifications and sentiment scores that should flow into your unified customer model.
We ingest Salesforce case data via Fivetran's Salesforce connector into a stg_salesforce__cases staging model, then join it to the customer dimension:
```sql
-- int_customers__cx_enriched.sql
SELECT
    c.*,
    cx.total_support_cases,
    cx.avg_resolution_hours,
    cx.ai_predicted_churn_score,  -- from Einstein AI
    cx.last_case_sentiment        -- from Agentforce classification
FROM {{ ref('int_customers__enriched') }} c
LEFT JOIN {{ ref('int_salesforce__customer_cx_summary') }} cx
    ON c.customer_key = cx.customer_key
```
This unified view lets marketing teams see not just purchase behaviour but service interaction quality. When a VIP customer has had three unresolved cases, that context should suppress promotional campaigns—something we implemented for a Hong Kong retail group running Shopify Plus alongside Salesforce Service Cloud, reducing customer complaint escalations by 34% in the first quarter.
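The suppression rule itself can be expressed downstream of that join. This is an illustrative sketch only — the `open_case_count` column, the VIP tier label, and the three-case threshold are hypothetical placeholders, not the production logic:

```sql
-- Illustrative: flag high-value customers with unresolved cases
-- so marketing tools can exclude them from promotional sends
SELECT
    customer_key,
    loyalty_tier,
    open_case_count,
    (loyalty_tier = 'VIP' AND open_case_count >= 3) AS suppress_promotions
FROM {{ ref('int_customers__cx_enriched') }}
```

The flag then travels to the activation layer (reverse ETL, covered in Step 7) instead of being re-derived in each marketing tool.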
Step 4: Implement Incremental Models for High-Volume Order Data
When to Go Incremental
A mid-size Shopify Plus store processes 5,000–20,000 orders per day. At 365 days, that's up to 7.3 million rows just in the orders table—before line items, events, and refunds multiply the volume. Full table scans become expensive. According to Snowflake's cost documentation, a full refresh of a 50M-row fact table on an XS warehouse costs approximately $0.50–$1.00 per run (Snowflake, 2024). Run that hourly and you're burning $4,380–$8,760 annually on a single model.
Incremental models are the dbt best practice for any fact table exceeding 1 million rows.
Configuring Incremental Models for Orders
```sql
-- models/marts/finance/fct_orders.sql
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        incremental_strategy='merge',
        on_schema_change='append_new_columns'
    )
}}

WITH source AS (
    SELECT * FROM {{ ref('int_orders__joined_items_refunds') }}
    {% if is_incremental() %}
    WHERE updated_at > (
        SELECT MAX(updated_at) FROM {{ this }}
    )
    {% endif %}
)

SELECT
    order_id,
    customer_key,
    order_date,
    market_code,
    net_revenue_local,
    net_revenue_usd,
    refund_amount_usd,
    fulfilment_status,
    updated_at
FROM source
```
Critical detail: use updated_at, not created_at, as your incremental filter. Orders get updated when they're fulfilled, partially refunded, or cancelled. If you filter on created_at, you'll miss those state changes—a mistake we've seen at least four times in inherited dbt projects.
Late-Arriving Data and the Look-Back Pattern
E-commerce data arrives late. Payment gateway confirmations, logistics tracking updates, and marketplace settlement reports (especially from Lazada or Shopee in Southeast Asia) can lag by 24–72 hours.
We use a look-back window pattern:
```sql
{% if is_incremental() %}
    WHERE updated_at > (
        SELECT DATEADD('day', -3, MAX(updated_at)) FROM {{ this }}
    )
{% endif %}
```
The three-day overlap reprocesses recent records to catch late updates. Yes, this means slightly higher compute cost, but it eliminates data accuracy issues that otherwise require manual intervention. The trade-off is worth it for any dataset with known latency.
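The window length can also be parameterised with a dbt variable, so an ops team can widen it temporarily during settlement-heavy periods (marketplace reconciliation after a sale event, for instance) without editing the model — a sketch:

```sql
{% if is_incremental() %}
    WHERE updated_at > (
        SELECT DATEADD('day', -{{ var('lookback_days', 3) }}, MAX(updated_at))
        FROM {{ this }}
    )
{% endif %}
```

A one-off wider reprocess then becomes `dbt run --select fct_orders --vars '{lookback_days: 7}'`, with the default of three days applying to scheduled runs.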
Step 5: Implement dbt Macros and Packages to Reduce Repetition
Essential Packages for E-Commerce dbt Projects
Don't rebuild what already exists. Our standard packages.yml:
```yaml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.1.0", "<2.0.0"]
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
  - package: fivetran/shopify
    version: [">=0.12.0", "<1.0.0"]
  - package: dbt-labs/codegen
    version: [">=0.12.0", "<1.0.0"]
```
The fivetran/shopify package provides pre-built staging models for Shopify Plus data ingested via Fivetran—it handles the nested JSON extraction for line items, discount allocations, and refund line items. We estimate this saves about 40 hours of initial development per project.
Writing Custom Macros for Multi-Market Logic
Currency formatting, tax calculation, and fiscal calendar logic vary by market. We encapsulate these in macros:
```sql
-- macros/convert_to_reporting_currency.sql
{% macro convert_to_reporting_currency(amount_column, currency_column, date_column, target_currency='USD') %}
    {{ amount_column }} * (
        SELECT rate
        FROM {{ ref('dim_currency_rates') }} r
        WHERE r.currency_code = {{ currency_column }}
          AND r.rate_date = {{ date_column }}
          AND r.target_currency = '{{ target_currency }}'
    )
{% endmacro %}
```
This macro gets called in any mart model that needs currency conversion, ensuring consistent logic everywhere. One formula, one place to update when the treasury team changes the rate source.
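A call site looks like this — a sketch reusing column names from the order models earlier in this guide:

```sql
-- Usage inside a mart model: column names are passed as strings
-- and rendered into the compiled SQL by the macro
SELECT
    order_id,
    {{ convert_to_reporting_currency('net_revenue_local', 'local_currency', 'order_date') }} AS net_revenue_usd,
    {{ convert_to_reporting_currency('net_revenue_local', 'local_currency', 'order_date', 'HKD') }} AS net_revenue_hkd
FROM {{ ref('int_orders__joined_items_refunds') }}
```

Note that the correlated subquery the macro emits runs per row; on very large fact tables a join against `dim_currency_rates` (as in the `fct_daily_revenue` example) is usually cheaper, so treat the macro as a convenience for smaller marts.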
Step 6: Orchestrate, Document, and Monitor
Scheduling with Tags and Selectors
Not all models need to run at the same frequency. Finance models updating daily revenue must run at least twice daily. Marketing attribution models might only need nightly runs. Product catalogue dimensions rarely change and can run weekly.
Use dbt's tag-based selection:
```shell
# Hourly run for critical order data
dbt run --select tag:hourly --target prod

# Nightly full marketing refresh
dbt run --select tag:nightly --full-refresh --target prod

# Run tests after every model run
dbt test --select tag:hourly --target prod
```
We orchestrate via Dagster (preferred for its asset-based approach) or Airflow for clients with existing infrastructure. The key is that tests run after every model execution—not in a separate nightly job.
Auto-Generated Documentation
dbt's built-in documentation generator (dbt docs generate) creates a browsable DAG and column-level descriptions. For e-commerce teams, this is how you onboard new analysts who need to understand the difference between gross_revenue, net_revenue, and gross_merchandise_value.
We enforce documentation coverage using dbt_meta_testing:
```yaml
models:
  - name: fct_orders
    config:
      required_docs: true
    description: >
      Grain: one row per order.
      Includes net revenue after refunds,
      converted to USD at order-date rate.
    columns:
      - name: net_revenue_usd
        description: "Order total minus refunds, converted to USD using order-date exchange rate."
```
According to Atlan's 2024 State of Data Culture report, teams with documented data models resolve data questions 3.2x faster than those without (Atlan, 2024).
Step 7: Connect dbt Outputs to CX and Marketing Workflows
Feeding Mart Data into Salesforce for AI-Driven CX
The analytical models you build in dbt shouldn't exist in isolation. For enterprise e-commerce clients, the highest-value activation is pushing customer-level metrics back into Salesforce where service and sales teams operate daily.
We use Census or Hightouch as reverse ETL tools to sync dim_customers (with lifetime value, purchase frequency, and predicted churn scores) back into Salesforce custom fields. This supports Salesforce AI agent CX workflow integration by giving Agentforce and Einstein the contextual data they need to route cases intelligently.
A practical example: for a multi-brand retailer operating across Hong Kong and Singapore, we built a mart_crm__customer_health_score model in dbt that combined purchase recency, support case volume, and loyalty tier. This model was synced hourly to Salesforce via Census, where Agentforce used it to prioritise high-value-at-risk customers for proactive outreach. The result was a 22% improvement in retention for the flagged segment over six months.
Exposing Metrics via the dbt Semantic Layer
dbt's semantic layer (available in dbt Cloud) lets you define metrics once and expose them to any BI tool—Looker, Metabase, Hex, or Tableau. For e-commerce, define core metrics centrally:
```yaml
metrics:
  - name: gross_merchandise_value
    label: GMV
    type: simple
    type_params:
      measure: total_gmv
    filter:
      - order_status != 'cancelled'
```
This eliminates the "we have five different revenue numbers" problem that plagues every e-commerce analytics team I've worked with.
Troubleshooting: Common Mistakes and How to Fix Them
Mistake 1: Using created_at Instead of updated_at for Incrementals
As mentioned in Step 4, this is the single most common dbt data transformation mistake in e-commerce. Orders, refunds, and fulfilment records are mutable. If your incremental model filters on created_at, you'll miss updates to existing records. Your refund totals will be wrong, and finance will lose trust in the data.
Fix: Always use updated_at with a look-back window. Verify your source system populates updated_at reliably—some Shopify Plus custom apps don't.
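A quick singular test verifies that assumption before you build incrementals on top of it — a sketch, using the `shopify.orders` source defined in Step 2:

```sql
-- tests/assert_orders_updated_at_is_reliable.sql
-- Fails if any order is missing updated_at, or has an
-- updated_at earlier than created_at (a sign the field
-- is not being maintained by the source app)
SELECT
    id,
    created_at,
    updated_at
FROM {{ source('shopify', 'orders') }}
WHERE updated_at IS NULL
   OR updated_at < created_at
```

If this test ever fails, fix the source pipeline before trusting any incremental model that filters on `updated_at`.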
Mistake 2: Joining Orders and Refunds at the Wrong Grain
Shopify Plus stores refunds at the line-item level, but many teams join refunds to orders at the order level. This creates fan-out: one order with three refunded line items appears as three rows, tripling the refund amount.
Fix: Aggregate refunds to the order grain in an intermediate model before joining to orders:
```sql
-- int_refunds__aggregated_to_order.sql
SELECT
    order_id,
    SUM(refund_amount) AS total_refund_amount,
    COUNT(*) AS refund_line_count
FROM {{ ref('stg_shopify__refund_line_items') }}
GROUP BY order_id
```
Mistake 3: Ignoring Timezone Differences Across Markets
A Hong Kong store (UTC+8), a Sydney store (UTC+10/+11), and a Singapore store (UTC+8) all report "daily" revenue—but "daily" means different things in UTC. We've seen reports where Singapore's Monday revenue included Sunday night UTC orders.
Fix: Store all timestamps in UTC in staging models, then apply market-specific timezone conversion in mart models:
```sql
CONVERT_TIMEZONE('UTC', 'Asia/Hong_Kong', order_created_at) AS order_created_at_local
```
Mistake 4: Not Running dbt build for Combined Test-and-Run
Teams that run dbt run and dbt test as separate steps risk deploying models that fail tests. Since dbt 1.0, dbt build runs models and tests in DAG order—a model's tests execute immediately after that model builds.
Fix: Replace all dbt run && dbt test sequences with dbt build. This is a one-line change in your CI/CD pipeline that prevents broken models from propagating.
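In a GitHub Actions pipeline, the change amounts to a single step. This is a sketch — the workflow name, adapter package, and credential handling are placeholders for your own setup:

```yaml
# .github/workflows/dbt.yml (excerpt)
jobs:
  dbt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-bigquery
      # dbt build executes each model's tests immediately after
      # the model itself, in DAG order — replacing run && test
      - run: dbt deps && dbt build --target prod
```

The same substitution applies verbatim in Airflow or Dagster task definitions: one `dbt build` invocation instead of two chained commands.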
Mistake 5: Over-Materialising Staging Models
Materialising staging models as tables doubles your storage for zero analytical value. Staging models should be views (or ephemeral if they're only consumed by one downstream model).
Fix: Set +materialized: view at the staging folder level in dbt_project.yml.
What's Next for dbt in E-Commerce Data Transformation
The dbt data transformation best practices for e-commerce outlined here represent what works today—but the tooling is evolving fast. dbt Labs' push toward the semantic layer and metrics-as-code will increasingly blur the line between transformation and BI. Meanwhile, the integration of AI agents across the data stack—Salesforce Agentforce for CX, Databricks' AI functions for predictive features, and dbt Cloud's AI-assisted development—means that your dbt project isn't just feeding dashboards anymore. It's feeding autonomous systems that make decisions.
For APAC e-commerce operations specifically, the multi-market, multi-currency complexity will only grow as cross-border commerce accelerates. The brands that invest in disciplined transformation layers now will be the ones capable of scaling into new markets without rebuilding their analytics from scratch each time.
If your team is running dbt on e-commerce data across multiple APAC markets and hitting the scaling challenges described here, reach out to Branch8. We've built these stacks for enterprise retailers and can typically get a production-grade dbt project operational within six to eight weeks.
Sources
- dbt Labs. "Best practice guides." https://docs.getdbt.com/best-practices
- Monte Carlo. "2023 State of Data Quality." https://www.montecarlodata.com/state-of-data-quality-2023
- Segment. "State of Personalization 2023." https://segment.com/state-of-personalization-report/
- Snowflake. "Understanding Compute Costs." https://docs.snowflake.com/en/user-guide/cost-understanding-compute
- Atlan. "State of Data Culture Report 2024." https://atlan.com/state-of-data-culture/
- Fivetran. "dbt Shopify Package." https://hub.getdbt.com/fivetran/shopify/latest/
- dbt Labs. "Semantic Layer Overview." https://docs.getdbt.com/docs/build/about-metricflow
FAQ
How should I structure a dbt project for e-commerce data?
Use a three-layer architecture: staging models (one-to-one with source tables, materialised as views), intermediate models (business logic joins, often ephemeral), and mart models (final consumption tables, materialised as tables or incremental). Organise marts by business domain—finance, marketing, operations—and use double-underscore naming to separate source system from entity name.
About the Author
Matt Li
Co-Founder & CEO, Branch8 & Second Talent
Matt Li is Co-Founder and CEO of Branch8, a Y Combinator-backed (S15) Adobe Solution Partner and e-commerce consultancy headquartered in Hong Kong, and Co-Founder of Second Talent, a global tech hiring platform ranked #1 in Global Hiring on G2. With 12 years of experience in e-commerce strategy, platform implementation, and digital operations, he has led delivery of Adobe Commerce Cloud projects for enterprise clients including Chow Sang Sang, HomePlus (HKBN), Maxim's, Hong Kong International Airport, Hotai/Toyota, and Evisu. Prior to founding Branch8, Matt served as Vice President of Mid-Market Enterprises at HSBC. He serves as Vice Chairman of the Hong Kong E-Commerce Business Association (HKEBA). A self-taught software engineer, Matt graduated from the University of Toronto with a Bachelor of Commerce in Finance and Economics.