Blog
Analytics2025-10-1612 min

How to Design a Data Warehouse Architecture for Marketing and Product Analytics

A well-designed warehouse architecture unifies marketing, product, and revenue data. Here's the reference architecture for SaaS companies.Practical guide with data architecture, attribution models,...

Your marketing team uses GA4, your product team uses Mixpanel, your sales team uses HubSpot, your customer success team uses Gainsight, and your finance team uses Stripe. Each tool has its own version of the truth. Monthly active users in Mixpanel does not match unique visitors in GA4. Revenue in HubSpot does not match revenue in Stripe. Customer count in Gainsight does not match account count in your product database. Nobody agrees on the numbers, so nobody trusts any of them. Every meeting starts with a 20-minute debate about which data source is correct. This is not a tool problem. It is an architecture problem.

A data warehouse solves this by becoming the single source of truth: a centralized repository where data from every tool is collected, cleaned, modeled, and made available for analysis. When marketing wants to know how a campaign performed from first touch through closed deal, the data is in one place. When product wants to correlate feature usage with retention, the data is in one place. When finance wants to reconcile revenue across systems, the data is in one place. Designing this warehouse correctly from the start prevents the most expensive kind of technical debt: data debt, where bad architecture decisions compound into years of unreliable metrics and wasted analysis time.

TL;DR
  • A data warehouse centralizes marketing, product, and revenue data into a single queryable source of truth. It eliminates the 'which number is right?' problem.
  • Use the ELT pattern: Extract data from source systems, Load it into the warehouse raw, then Transform it with dbt or similar tools inside the warehouse.
  • Design your warehouse in three layers: raw (exact copies of source data), staging (cleaned and standardized), and marts (business-ready models for specific teams).
  • Start with five core models: users, accounts, events, subscriptions, and campaigns. Every analysis you will ever need builds on these foundations.

Why You Need a Data Warehouse

The argument for a data warehouse becomes obvious once you try to answer a cross-functional question without one. "What is the ROI of our content marketing program?" To answer this, you need: content performance data from your CMS and GA4, lead data from HubSpot, opportunity and revenue data from your CRM, and product engagement data from your analytics tool. Without a warehouse, answering this question requires manual export from four systems, custom joining in a spreadsheet, and hours of data cleaning. The answer is stale by the time you produce it. With a warehouse, it is a SQL query that runs in seconds and refreshes daily.

The Cost of Not Having One

The cost of operating without a centralized warehouse is paid in three currencies. First, analyst time: your most expensive analytical resource spends 60-80% of their time extracting, cleaning, and joining data rather than analyzing it. Second, decision latency: questions that should take minutes take days because of the manual data assembly process. Third, inconsistency: different teams calculate the same metric differently because they use different data sources with different definitions. Customer count, revenue, conversion rate, and churn rate are all metrics that commonly have three or more conflicting definitions across an organization. Each conflicting definition erodes trust in data-driven decision making.

60-80%
of analyst time
spent on data extraction and cleaning without a warehouse
3-5x
faster time-to-insight
with a properly structured data warehouse
$50-200K
annual cost range
for a mid-market warehouse implementation (tools + headcount)

Source: Fivetran Data Maturity Report, dbt Labs State of Analytics Engineering

Choosing Your Warehouse Platform

Three platforms dominate the modern data warehouse landscape. Each has distinct strengths, and the right choice depends on your existing cloud infrastructure, team skills, and budget model.

Snowflake

Snowflake separates compute and storage, which means you pay for storage at rest (cheap) and compute only when queries run (variable). This makes costs predictable for storage-heavy, query-light workloads and cost-efficient for intermittent analysis patterns. Snowflake is cloud-agnostic (runs on AWS, Azure, and GCP), which avoids lock-in. Its data sharing capabilities are excellent for companies that need to share data with partners or customers. The primary downside is cost management: without careful configuration of warehouse sizes and auto-suspend rules, compute costs can escalate quickly. Snowflake is the best choice for companies that want cloud flexibility and whose analysis patterns are bursty rather than continuous.

Google BigQuery

BigQuery is serverless, meaning there is no infrastructure to manage. You load data in and run queries. Pricing is based on the amount of data scanned per query (on-demand) or flat-rate reservations for predictable workloads. BigQuery integrates natively with GA4, Google Ads, and the broader Google Cloud ecosystem, making it the natural choice for companies that rely heavily on Google tools. Its ML capabilities (BigQuery ML) allow you to build machine learning models directly in SQL, which is valuable for teams without dedicated data scientists. The downside is vendor lock-in to Google Cloud and less granular control over compute resources compared to Snowflake.

Amazon Redshift

Redshift is the original cloud data warehouse and remains the default for companies already invested in AWS. Redshift Serverless offers a pay-per-query model similar to BigQuery, while provisioned clusters give more control and predictable costs for steady workloads. Redshift's strength is its deep integration with the AWS ecosystem: S3 for staging, Glue for ETL, Lambda for event processing, and SageMaker for ML. The downside is that Redshift requires more operational overhead than BigQuery and less flexibility than Snowflake.

The Budget Reality Check
For companies under $10M ARR, the warehouse platform cost is often less significant than the ETL and transformation tooling costs. Snowflake, BigQuery, and Redshift all offer free tiers or credits that cover basic usage. The real cost drivers are Fivetran or equivalent ($1,000-$5,000/month for mid-market), dbt Cloud ($100-$500/month), and the data engineering headcount to maintain the pipeline ($120,000-$180,000/year). If you cannot justify a full-time data engineer, consider managed solutions like Census, Hightouch, or Polytomic that reduce the engineering burden at the cost of flexibility.

The ELT Architecture

Modern data warehouses use ELT (Extract, Load, Transform) rather than the traditional ETL (Extract, Transform, Load). The difference is where the transformation happens. In ETL, data is transformed before it enters the warehouse, which means you must decide upfront what the data should look like. In ELT, raw data is loaded first and transformed inside the warehouse, which preserves the original data and allows transformations to evolve as business needs change. ELT is the dominant pattern because cloud warehouses have enough compute power to handle transformation at query time, and preserving raw data provides an audit trail and the flexibility to reprocess data when models change.

ELT Pipeline Architecture

1
Extract

Pull data from source systems via APIs, webhooks, database replication, or file exports. Tools like Fivetran, Airbyte, and Stitch automate extraction with pre-built connectors for hundreds of SaaS tools. For custom sources, build extraction scripts that write to S3/GCS as an intermediate staging layer.

2
Load

Land the raw extracted data into your warehouse in a dedicated 'raw' schema. Each source system gets its own schema (raw_hubspot, raw_stripe, raw_mixpanel). Data arrives in the same structure as the source: same column names, same data types, same grain. No transformation at this stage.

3
Transform

Use dbt (data build tool) to transform raw data into analytics-ready models. Transformations include: renaming columns to a consistent convention, casting data types, joining tables across sources, calculating derived metrics, and building business-level models (users, accounts, events) from raw operational tables.

4
Serve

Expose the transformed models to downstream consumers: BI tools (Looker, Metabase, Tableau), reverse ETL tools (Census, Hightouch) that push data back to operational tools, and direct SQL access for analysts. Each consumer accesses the same transformed data, ensuring consistency.

The Three-Layer Data Model

The three-layer model is the standard architecture for organizing data inside the warehouse. Each layer serves a distinct purpose, and maintaining clear boundaries between them prevents the spaghetti dependencies that make warehouses unmaintainable.

Layer 1: Raw (Sources)

The raw layer contains exact replicas of source data. One schema per source system: raw_hubspot, raw_stripe, raw_ga4, raw_mixpanel. Tables mirror the source system's structure. No transformations, no renames, no joins. The raw layer is append-only and serves as your data audit trail. If a transformation introduces a bug, you can always reprocess from raw. If a source system changes its API schema, the raw layer preserves the historical structure alongside the new one.

Retention policy for the raw layer should be generous: keep at least 2 years of raw data. Storage costs in modern warehouses are low enough that preserving raw data is almost always worth the cost. The value of being able to reprocess historical data when you improve your models is substantial.

Layer 2: Staging (Cleaned)

The staging layer contains cleaned, standardized versions of raw tables. Each staging model maps to one raw table and applies a consistent set of transformations: column renaming to your naming convention (snake_case), data type casting (strings to dates, strings to numbers), null handling, deduplication, and basic filtering (removing test data, internal users, deleted records). Staging models do not join across sources. Each staging model is self-contained and references only one raw table. This constraint is critical: it means staging models can be tested and validated independently. If stg_hubspot__contacts has a bug, it does not affect stg_stripe__customers.

Layer 3: Marts (Business Models)

The marts layer contains business-ready data models that join across sources and calculate derived metrics. Marts are organized by business domain: marketing, product, finance, core. The core mart contains models used across domains: dim_users, dim_accounts, fct_events, fct_subscriptions. Domain-specific marts extend the core: marketing.fct_campaign_performance, product.fct_feature_usage, finance.fct_mrr_movements.

LayerNaming ConventionExampleJoins Across Sources?Who Queries It?
Rawraw_source__tableraw_hubspot__contactsNoNobody (input only)
Stagingstg_source__entitystg_hubspot__contactsNoData engineers (debugging)
Martsdomain.dim/fct_entitycore.dim_usersYesAnalysts, BI tools, reverse ETL

The Five Core Models

Every analytics warehouse, regardless of industry or company size, needs five foundational models. These models represent the entities and activities that every analysis depends on. Build these first, get them right, and every subsequent model becomes an extension rather than a rebuilding exercise.

1. dim_users

The user dimension table is a single row per user with all known attributes. It joins data from your product database (user ID, creation date, plan type), your CRM (lead source, lifecycle stage, owner), your analytics tool (first touch, last seen, session count), and enrichment sources (company, industry, title). The canonical user ID is the primary key. All other identifiers (email, anonymous IDs, CRM contact ID) are secondary keys for joining. The user dimension answers: who are our users, when did they join, what plan are they on, how did they find us, and what company do they belong to?

2. dim_accounts

The account dimension table is a single row per company/account with all known attributes. It aggregates user-level data to the account level: number of users, plan type, MRR, creation date, industry, company size. For B2B SaaS, the account is often the unit of analysis for revenue, retention, and expansion metrics. The account dimension answers: how many accounts do we have, what is their ARR distribution, which industries are represented, and how does account-level engagement correlate with retention?

3. fct_events

The events fact table is the core behavioral data model. One row per event, with columns for user ID, account ID, event name, event timestamp, event properties (stored as JSON or flattened into typed columns for high-value events), and session ID. This table powers funnel analysis, feature usage tracking, engagement scoring, and behavioral cohort analysis. It is typically the largest table in the warehouse and requires partitioning by date and clustering by user ID for query performance. The events model consolidates behavioral data from multiple sources: your product's analytics events, marketing site events, and email engagement events.

4. fct_subscriptions

The subscription fact table tracks the financial relationship over time. One row per subscription state change: created, upgraded, downgraded, paused, canceled, reactivated. Columns include account ID, plan name, MRR at the time of the change, change type, change date, and the reason for change (if captured). This table powers MRR waterfall analysis, revenue retention, expansion and contraction tracking, and churn analysis. The subscription model typically joins Stripe data (payment amounts, plan IDs) with product data (plan names, feature flags) to create a complete picture of the financial lifecycle.

5. fct_campaigns

The campaign fact table tracks marketing spend, impressions, clicks, and conversions across all channels. One row per campaign per day, with columns for channel (Google Ads, LinkedIn, Facebook, email), campaign name, ad group or ad set, spend, impressions, clicks, and attributed conversions at each stage (lead, MQL, opportunity, closed-won). This table joins data from ad platforms (spend, impressions, clicks) with CRM data (conversions, revenue) to calculate true CAC and ROAS by channel and campaign. Without this model, marketing attribution is guesswork.

Data Ingestion: Getting Data Into the Warehouse

Data ingestion is the first and most mechanical part of the pipeline. The goal is to get data from every source system into the raw layer of the warehouse reliably and on a predictable schedule.

Managed ETL Tools

Fivetran, Airbyte, and Stitch are the three primary managed ETL tools. Fivetran is the market leader with the most pre-built connectors (300+) and the most reliable sync schedules. It is also the most expensive ($1-$5/MAR, where MAR is monthly active rows). Airbyte is open-source with a cloud-managed option, offering comparable connectors at lower cost but with more operational overhead. Stitch (owned by Talend) is the budget option with good coverage of common sources but fewer connectors for niche tools. For most mid-market companies, Fivetran is the right choice because the time saved on connector maintenance exceeds the cost premium. For budget-constrained teams, Airbyte self-hosted provides the same capability at the cost of engineering time to manage the infrastructure.

Custom Ingestion

Some data sources require custom ingestion: your product database (direct replication), internal tools without API connectors, file-based data sources (CSV uploads, partner data feeds), and real-time event streams. For database replication, use change data capture (CDC) via Debezium or your database's native replication to stream changes to the warehouse with minimal latency. For file-based sources, land files in cloud storage (S3 or GCS) and use the warehouse's native file loading capability. For real-time events, stream through Kafka or Kinesis into the warehouse's streaming ingestion endpoint.

Sync Frequency

Not all data needs to sync at the same frequency. Marketing campaign data changes daily: daily sync is sufficient. Product behavioral events should sync every 1-6 hours for timely analysis. Financial data (Stripe, billing) should sync every 1-4 hours to keep revenue metrics current. CRM data (HubSpot, Salesforce) should sync every 1-2 hours because sales teams expect near-real-time pipeline visibility. Real-time event data for alerting (churn risk triggers, anomaly detection) requires streaming ingestion with sub-minute latency. Match sync frequency to the staleness tolerance of each use case.

Transformation with dbt

dbt (data build tool) is the industry standard for transformation inside the warehouse. It lets you write transformations as SQL SELECT statements, version control them in git, test them with assertions, and document them with built-in schema descriptions. If you are building a data warehouse in 2026, you are using dbt or something very similar.

Model Organization

Organize dbt models to mirror the three-layer architecture. The models/staging/ directory contains one subdirectory per source system, each with staging models that clean and standardize raw data. The models/marts/ directory contains subdirectories for each business domain (core, marketing, product, finance), each with dimension and fact models that join across staging models. The models/intermediate/ directory (optional) contains helper models that simplify complex mart logic without being exposed to end users.

Testing and Documentation

dbt's testing framework is your data quality safety net. At minimum, every model should have tests for: primary key uniqueness (no duplicate rows), not-null constraints on critical columns, referential integrity (foreign keys exist in the referenced table), and accepted values for categorical columns. Add custom tests for business logic: MRR should never be negative, event timestamps should never be in the future, user creation dates should never precede the company's founding date. Run tests on every dbt build. Failed tests should block deployment to production, just like failed unit tests in application code.

Insight
The most underrated feature of dbt is its documentation layer. Every model, column, and test can have a description that is automatically compiled into a searchable documentation site. This documentation becomes the data catalog: when a marketing manager asks "where can I find campaign spend by channel?" the dbt docs site answers the question. Invest time in writing clear, business-friendly column descriptions. "The monthly recurring revenue at the time of the subscription change, in USD, calculated as the annualized plan price divided by 12" is infinitely more useful than "mrr_usd."

Reverse ETL: Activating Warehouse Data

A warehouse that only feeds dashboards is using half its potential. Reverse ETL pushes warehouse data back into operational tools, making your analytical models actionable. This closes the loop between insight and action.

Common Reverse ETL Use Cases

Push lead scores from the warehouse into HubSpot so sales reps see AI-calculated scores alongside CRM data. Push product usage segments into your email platform so marketing can send targeted campaigns to users based on behavioral patterns. Push churn risk scores into Gainsight so customer success teams prioritize outreach to at-risk accounts. Push attribution data into ad platforms so campaign managers can see true cost per acquisition that includes the full sales cycle, not just the click.

Reverse ETL Tools

Census, Hightouch, and Polytomic are the three primary reverse ETL tools. They connect to your warehouse, let you define sync rules (which model, which fields, which destination), and push data on a schedule. Census and Hightouch are the most mature, with pre-built destinations for 100+ tools. The key differentiator between them is the sync engine: how they handle incremental updates, conflict resolution, and rate limiting for destination APIs. For most use cases, either works well. Choose based on destination coverage and pricing for your volume.

Performance Optimization

As your warehouse grows, query performance becomes critical. Slow queries mean slow dashboards, which mean low adoption, which means the warehouse investment fails to deliver. Optimize proactively.

Partitioning and Clustering

Partition large tables by date. The events table should be partitioned by event_date so that queries for "last 30 days" only scan 30 partitions, not the entire table. Cluster within partitions by the most common filter columns: user_id, event_name, and account_id are typical clustering keys. In BigQuery, clustering is automatic and requires only specifying the cluster columns. In Snowflake, set cluster keys on tables that exceed 1TB. In Redshift, use sort keys and distribution keys.

Materialization Strategy

dbt supports multiple materialization strategies: views (SQL runs on each query), tables (SQL runs once and results are stored), incremental (only new/changed rows are processed), and ephemeral (inline CTEs, not materialized). Staging models should be views (they add no data, just rename and cast). Core mart models used in dashboards should be tables (pre-computed for fast queries). Large event models should be incremental (processing only new events since the last run). The incremental strategy is the most impactful for performance: instead of rebuilding 500 million event rows every run, you append only the 100,000 new events since the last build.

Governance and Access Control

A warehouse without governance becomes a liability. Sensitive data (PII, financial records, health information) requires access controls. Metric definitions require standardization. Data quality requires monitoring.

Role-Based Access

Define roles that match your organization: analysts get read access to mart models, data engineers get read/write access to all layers, business users get read access to specific mart schemas relevant to their function, and BI tools get service account access to production mart schemas only. The raw layer should be accessible only to data engineers. It contains unprocessed PII and operational data that is not appropriate for direct business consumption. The staging layer is for debugging. The marts layer is for analysis.

PII Management

PII (email, name, phone, IP address) should be handled carefully in the warehouse. Hash or pseudonymize PII in the staging layer so that mart models do not contain raw PII unless specifically required. Create a separate PII-accessible role for users who need unhashed data (customer support, legal). Tag columns that contain PII in dbt's schema documentation so that access controls can be applied systematically. For GDPR deletion requests, the PII mapping in the staging layer provides the key to identify and remove all data associated with a specific user.

Metric Definitions

Define every business metric in one place. MRR should have one definition, one calculation, one source model. If marketing calculates MRR from HubSpot and finance calculates MRR from Stripe, they will get different numbers. The warehouse model fct_subscriptions should be the single source for MRR, and both marketing and finance dashboards should reference it. Use a metrics layer (dbt metrics, Looker LookML, or a dedicated tool like Transform) to codify metric definitions so they are consistent across every dashboard, report, and query.

Key Takeaways

  • 1Use ELT, not ETL. Load raw data first, transform inside the warehouse with dbt. This preserves source data and allows models to evolve.
  • 2Organize data in three layers: raw (exact source replicas), staging (cleaned and standardized), and marts (business-ready models that join across sources).
  • 3Build five core models first: dim_users, dim_accounts, fct_events, fct_subscriptions, and fct_campaigns. Everything else extends these foundations.
  • 4Choose your warehouse platform based on your cloud ecosystem: BigQuery for Google-heavy, Redshift for AWS-heavy, Snowflake for cloud-agnostic.
  • 5Use managed ETL (Fivetran or Airbyte) for data ingestion. The engineering time saved exceeds the tool cost for most companies.
  • 6Test every dbt model for uniqueness, not-null, referential integrity, and business logic. Failed tests should block production deployment.
  • 7Close the loop with reverse ETL: push lead scores, segments, and attribution data from the warehouse back into operational tools.
  • 8Define every business metric once in the warehouse. Conflicting metric definitions across tools are the number one cause of data distrust.

Data architecture that compounds

Warehouse design, dbt patterns, and analytics engineering practices for teams building their data foundation. Weekly.

A data warehouse is not a technology project. It is an organizational decision to have one version of the truth. The technology (Snowflake, dbt, Fivetran) is the implementation, but the value comes from the discipline of centralizing data, standardizing definitions, and making every team query the same models. The warehouse that gets built and abandoned is the one designed as a data engineering project. The warehouse that becomes indispensable is the one designed around business questions: "How do we calculate CAC?" "What is our retention by cohort?" "Which features predict expansion?" Start with the questions. Design models that answer them. Build the pipeline that feeds those models. Then expand from there. The architecture that starts with five well-designed models and grows intentionally will always outperform the architecture that tries to model everything on day one.

Prove what's working and cut what isn't

Oscom connects GA4, Kissmetrics, and your CRM so you can tie every marketing activity to revenue in one dashboard.