RAG System Implementation for E-Commerce AI Workflows: A Step-by-Step Guide

Matt Li
April 2, 2026
14 mins read

Key Takeaways

  • Chunk per product variant, not fixed token size
  • Use hybrid search combining vector similarity and BM25 keyword matching
  • Route order queries to live APIs, not the vector store
  • Evaluate weekly with RAGAS on 100+ real customer queries
  • Multilingual embeddings are essential for APAC markets

Quick Answer: Implement e-commerce RAG by chunking product data per variant, embedding with multilingual models, using hybrid vector + keyword search for retrieval, and grounding LLM generation against retrieved context with low temperature settings. Use webhook-driven updates to keep catalogue data current.


RAG system implementation for e-commerce AI workflows lets online retailers serve accurate, context-aware answers to product questions, order enquiries, and support tickets — without hallucinating details about inventory, pricing, or policies. This tutorial walks through the full pipeline architecture, from data ingestion to production deployment, using patterns we've refined across APAC retail clients running Shopify Plus, Adobe Commerce, and SHOPLINE.

Related reading: Shopify Plus Marketplace Sync for Shopee Lazada Amazon: APAC Guide

Related reading: Multi-Market CDP Activation Playbook for Retail in APAC

Related reading: Meta Layoffs and Tech Hiring: Why APAC Strategy Shifts to Digital Agencies

Related reading: Shopify Plus Multi-Currency Checkout APAC Setup: A Complete Guide

Retrieval-Augmented Generation (RAG) is gaining traction fast. According to Gartner's 2024 Hype Cycle for AI, RAG is one of the most adopted architectural patterns for enterprise generative AI, with over 60% of organisations exploring or piloting it. For e-commerce specifically, the appeal is obvious: your product catalogue, FAQ content, and order data change constantly, and fine-tuning an LLM every time a SKU updates is neither practical nor cost-effective.

Related reading: Quantization LLM Inference Cost Optimization: Cut Costs 60–80%

This guide covers the exact stack, code, and architecture decisions you need to build a production-grade RAG pipeline for an e-commerce operation.

What problem does RAG solve for e-commerce that fine-tuning cannot?

Fine-tuning bakes knowledge into model weights. That works for static domains, but e-commerce catalogues are anything but static. A mid-size APAC fashion retailer might update 2,000–5,000 SKUs per week across seasonal rotations, flash sales, and regional pricing. Fine-tuning on that cadence is expensive and slow.

RAG decouples the knowledge layer from the reasoning layer. Your LLM handles language understanding and generation; your vector database handles the current state of product data, policies, and order information. When a customer asks "Is the Nike Air Max 90 available in size 42 in Singapore?", the retrieval step fetches the live inventory record, and the LLM composes a natural-language answer grounded in that data.

This architecture also gives you auditability. Every response traces back to specific source documents, which matters when you're operating across jurisdictions like Hong Kong, Australia, and Taiwan where consumer protection regulations differ.

What does the production architecture look like?

Here's the component stack we've deployed for APAC e-commerce clients. Each layer has specific tool choices with reasoning.

Data Sources

  • Product catalogue — Pulled via Shopify Admin API (GraphQL), Adobe Commerce REST API, or SHOPLINE's API depending on the platform
  • FAQ / Help centre content — Markdown or HTML from CMS (typically headless Contentful or Strapi)
  • Order data — Real-time via webhooks; historical via database queries
  • Policy documents — Returns, shipping, warranty PDFs parsed into structured text

Ingestion and Chunking Pipeline

  • LangChain v0.2 or LlamaIndex v0.10 for orchestration
  • Unstructured.io for parsing PDFs, HTML, and mixed-format docs
  • Custom chunking logic (more on this below)

Embedding and Vector Storage

  • OpenAI text-embedding-3-small (1536 dimensions, $0.02/1M tokens as of Q1 2025) or Cohere embed-multilingual-v3.0 for CJK language support
  • Pinecone Serverless or Qdrant (self-hosted on AWS ap-southeast-1 for Singapore-based deployments)

Retrieval and Generation

  • Hybrid search: vector similarity + BM25 keyword matching via Pinecone's sparse-dense index or Qdrant's built-in hybrid mode
  • GPT-4o-mini or Claude 3.5 Haiku for generation (cost-optimised for high-volume customer queries)
  • Guardrails AI v0.4 for output validation

Serving Layer

  • FastAPI backend, deployed on AWS ECS Fargate or Google Cloud Run
  • WebSocket integration with Shopify Plus storefront or SHOPLINE chat widget

Ready to Transform Your Ecommerce Operations?

Branch8 specializes in ecommerce platform implementation and AI-powered automation solutions. Contact us today to discuss your ecommerce automation strategy.

How should you chunk e-commerce product data?

Chunking strategy is where most RAG implementations succeed or fail. Generic 500-token fixed-size chunks destroy the relational structure of product data. A product has a name, description, specifications, variants, pricing, and availability — splitting these across chunks means the retriever often fetches incomplete context.

Product-aware chunking

For catalogue data, we chunk per product variant, not per text block. Each chunk contains the full context for one purchasable item:

from langchain.schema import Document

def chunk_shopify_product(product: dict) -> list[Document]:
    """Create one chunk per variant with full product context."""
    chunks = []
    for variant in product.get("variants", []):
        content = f"""
Product: {product['title']}
Brand: {product.get('vendor', 'N/A')}
Category: {product.get('product_type', 'N/A')}
Description: {product.get('body_html_stripped', '')}

Variant: {variant.get('title', 'Default')}
SKU: {variant.get('sku', 'N/A')}
Price: {variant.get('price')} {product.get('currency', 'HKD')}
Compare-at Price: {variant.get('compare_at_price', 'N/A')}
Available: {variant.get('inventory_quantity', 0) > 0}
Inventory: {variant.get('inventory_quantity', 0)} units
""".strip()

        metadata = {
            "source": "shopify_catalogue",
            "product_id": str(product["id"]),
            "variant_id": str(variant["id"]),
            "product_type": product.get("product_type", ""),
            "vendor": product.get("vendor", ""),
            "tags": product.get("tags", ""),
            "updated_at": product.get("updated_at", ""),
            "region": product.get("market_region", "APAC"),
        }

        chunks.append(Document(page_content=content, metadata=metadata))
    return chunks

The metadata fields are critical — they enable filtered retrieval. When a customer in Taiwan asks about pricing, you apply a region = "TW" metadata filter before running similarity search, avoiding irrelevant results from other markets.

FAQ and policy chunking

For help centre content, use semantic chunking based on heading structure rather than fixed token counts. LlamaIndex's SemanticSplitterNodeParser handles this well:

from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=85,
    embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
)

nodes = splitter.get_nodes_from_documents(faq_documents)

This groups semantically related sentences together, so a returns policy section stays intact rather than getting split mid-paragraph.

Why do you need hybrid search for retrieval?

Pure vector search misses exact matches that matter in e-commerce. A customer searching for SKU "NKE-AM90-BLK-42" needs lexical matching, not semantic similarity. Hybrid search combines both.

Here's a Pinecone Serverless implementation with sparse-dense vectors:

from pinecone import Pinecone, ServerlessSpec
from pinecone_text.sparse import BM25Encoder
from openai import OpenAI

# Initialise clients
pc = Pinecone(api_key="YOUR_API_KEY")
openai_client = OpenAI()

# Create index with dotproduct metric for hybrid search
pc.create_index(
    name="ecommerce-rag",
    dimension=1536,
    metric="dotproduct",
    spec=ServerlessSpec(cloud="aws", region="ap-southeast-1"),
)

index = pc.Index("ecommerce-rag")

# Fit BM25 on your corpus
bm25 = BM25Encoder()
bm25.fit([doc.page_content for doc in all_chunks])

def hybrid_search(query: str, top_k: int = 5, alpha: float = 0.7,
                  filters: dict = None):
    """alpha: weight toward dense (semantic). 0.7 = 70% semantic, 30% keyword."""

    # Dense embedding
    dense_vec = openai_client.embeddings.create(
        input=query, model="text-embedding-3-small"
    ).data[0].embedding

    # Sparse BM25 encoding
    sparse_vec = bm25.encode_queries(query)

    # Scale dense and sparse components by alpha
    results = index.query(
        vector=[v * alpha for v in dense_vec],
        sparse_vector={
            "indices": sparse_vec["indices"],
            "values": [v * (1 - alpha) for v in sparse_vec["values"]],
        },
        top_k=top_k,
        filter=filters,
        include_metadata=True,
    )

    return results.matches

The alpha parameter is key. Through testing across three APAC retail deployments, we've found that 0.7 (70% semantic, 30% keyword) works well for general product queries, while order-status and SKU-lookup queries perform better at 0.3 (more keyword weight). You can route dynamically based on query classification.
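That dynamic routing can be sketched with simple regex heuristics. The patterns and values below are illustrative assumptions, not the deployed classifier; production systems often use a small trained classifier or an LLM call instead:

```python
import re

# Hypothetical heuristic router: pick the hybrid-search alpha from the
# query shape. SKU codes and order numbers are lexical lookups (low alpha);
# everything else defaults to semantic-heavy retrieval.
SKU_PATTERN = re.compile(r"\b[A-Z]{2,}(?:-[A-Z0-9]+){2,}\b")
ORDER_PATTERN = re.compile(r"(?:\border\s*|#)\d{4,}", re.IGNORECASE)

def choose_alpha(query: str) -> float:
    if SKU_PATTERN.search(query) or ORDER_PATTERN.search(query):
        return 0.3  # keyword-heavy: exact identifiers must match
    return 0.7      # semantic-heavy: natural-language product questions

print(choose_alpha("Is NKE-AM90-BLK-42 in stock?"))             # 0.3
print(choose_alpha("comfortable running shoes for wide feet"))  # 0.7
```

The chosen value then feeds straight into the `alpha` parameter of `hybrid_search` above.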


How do you build the generation layer with grounding?

Retrieval alone isn't enough — you need to ensure the LLM actually uses the retrieved context and doesn't hallucinate details. This is where prompt engineering and output validation matter.

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """
You are a helpful e-commerce assistant for {store_name}.

Rules:
1. Answer ONLY based on the provided context documents.
2. If the context doesn't contain enough information, say "I don't have that
   information. Let me connect you with our support team."
3. Always include the specific product name, price, and availability when
   discussing products.
4. For pricing, always specify the currency.
5. Never invent specifications, dimensions, or compatibility information.
6. For order queries, reference the order number and current status.

Context documents:
{context}
"""

def generate_response(query: str, retrieved_docs: list,
                      store_name: str = "Store") -> dict:
    context = "\n---\n".join([
        f"Source: {doc.metadata.get('source', 'unknown')}\n{doc.page_content}"
        for doc in retrieved_docs
    ])

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(
                store_name=store_name, context=context
            )},
            {"role": "user", "content": query},
        ],
        temperature=0.1,  # Low temperature for factual accuracy
        max_tokens=500,
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": [doc.metadata for doc in retrieved_docs],
        "model": "gpt-4o-mini",
        "tokens_used": response.usage.total_tokens,
    }

Note the temperature=0.1. According to OpenAI's own documentation, lower temperature values produce more deterministic outputs. For e-commerce, where stating the wrong price or availability can erode trust (or create legal liability in markets like Australia under Australian Consumer Law), you want minimal creativity in the generation.

How did Branch8 deploy this for a regional fashion retailer?

In Q4 2024, we built a RAG-powered customer assistant for a Hong Kong–based fashion brand operating Shopify Plus stores across Hong Kong, Singapore, and Taiwan. The catalogue had roughly 12,000 active SKUs with three regional price lists and localised product descriptions in English, Traditional Chinese, and Simplified Chinese.

The key challenges were multilingual retrieval and region-specific accuracy. We used Cohere's embed-multilingual-v3.0 model instead of OpenAI's embedding model because it handles CJK character mixing (common when Hong Kong customers code-switch between English and Cantonese) significantly better in our benchmarks — retrieval accuracy at k=5 improved from 72% to 89% on our test set of 500 real customer queries.

The ingestion pipeline ran on AWS Lambda, triggered by Shopify webhooks on products/update and products/create events. This kept the vector index within 30 seconds of the live catalogue — critical during flash sales when inventory changes rapidly.
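The handler pattern behind that pipeline can be sketched as follows. The payload shape and names here are assumptions (verify against your own webhook configuration), and the embedding/upsert work is injected so the skeleton stays platform-agnostic:

```python
import json

def handle_product_webhook(event: dict, upsert_chunk) -> list[str]:
    """Sketch of a Lambda handler for Shopify products/create|update webhooks.

    `event` is the API Gateway event wrapping the webhook JSON; `upsert_chunk`
    is injected and, in production, would embed the variant-level chunk and
    write it to the vector index. Returns the chunk IDs that were refreshed.
    """
    product = json.loads(event["body"])
    refreshed = []
    for variant in product.get("variants", []):
        chunk_id = f"{product['id']}:{variant['id']}"  # stable per-variant ID
        upsert_chunk(chunk_id, product, variant)
        refreshed.append(chunk_id)
    return refreshed
```

Keying vectors by a product:variant ID makes the upsert idempotent, so a replayed webhook overwrites the existing chunk rather than duplicating it.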

For the generation layer, we used Claude 3.5 Haiku via Amazon Bedrock (available in the ap-southeast-1 region) to keep data within APAC infrastructure. Total deployment took six weeks from architecture sign-off to production, with two engineers. Monthly running cost settled at approximately USD $420 for a volume of around 45,000 queries per month — substantially cheaper than the three full-time support agents it partially replaced for routine product and order queries.


How do you handle real-time order data in the RAG pipeline?

Product catalogue data changes frequently, but not per query in real time. Order data is different: an order might show "processing" at 2:00 PM and "shipped" at 2:05 PM. You can't embed order data into a vector store and expect accuracy.

The solution is a tool-use pattern where the LLM decides when to call a live API:

import json

from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_order_status",
            "description": "Look up the current status of an order by order number or email.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_number": {"type": "string"},
                    "email": {"type": "string"},
                },
                "required": ["order_number"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_product_catalogue",
            "description": "Search products by description, name, category, or specification.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "region": {"type": "string", "enum": ["HK", "SG", "TW", "AU"]},
                },
                "required": ["query"],
            },
        },
    },
]

def route_query(user_message: str, store_name: str):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"You are a helpful assistant for {store_name}."},
            {"role": "user", "content": user_message},
        ],
        tools=tools,
        tool_choice="auto",
    )

    if response.choices[0].message.tool_calls:
        tool_call = response.choices[0].message.tool_calls[0]
        func_name = tool_call.function.name
        args = json.loads(tool_call.function.arguments)

        if func_name == "lookup_order_status":
            # Call Shopify/Adobe Commerce API directly
            return fetch_live_order(args)
        elif func_name == "search_product_catalogue":
            # Route to RAG pipeline
            return hybrid_search(args["query"], filters={"region": args.get("region")})

    return response.choices[0].message.content

This hybrid approach — RAG for catalogue and policy questions, live API calls for order data — keeps response accuracy above 95% across query types. McKinsey's 2024 report on AI in retail found that AI assistants with access to real-time order data reduce "where is my order" support tickets by 35–45%.

What are the key evaluation metrics for e-commerce RAG?

You can't improve what you don't measure. Here are the metrics that matter, with target benchmarks from our deployments:

Retrieval Quality

  • Recall@5: Does the correct source document appear in the top 5 retrieved chunks? Target: >85%
  • Mean Reciprocal Rank (MRR): How high does the correct document rank? Target: >0.75

Generation Quality

  • Faithfulness: Does the answer contain only information from the retrieved context? Measured via LLM-as-judge evaluation using RAGAS framework v0.1. Target: >0.9
  • Answer relevancy: Does the answer actually address the user's question? Target: >0.85

Business Metrics

  • Deflection rate: Percentage of queries resolved without human handoff. Realistic target for e-commerce: 60–75%
  • Customer satisfaction (CSAT): Post-interaction survey score. A Zendesk 2024 CX Trends report found that AI assistants grounded in company data achieve CSAT scores within 5% of human agents

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Evaluate on your test dataset
result = evaluate(
    dataset=test_dataset,  # HuggingFace Dataset with question, answer, contexts, ground_truth
    metrics=[faithfulness, answer_relevancy, context_precision],
)

print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.87, 'context_precision': 0.84}

Run this evaluation weekly against a curated test set of 100–200 real customer queries. When scores drop, it usually means your chunking strategy needs updating for new product categories or your embedding model is struggling with new terminology.


What are the common pitfalls to avoid?

Stale embeddings

If your vector index updates on a daily batch job but your catalogue changes hourly during sale events, customers get wrong availability and pricing information. Use webhook-driven incremental updates, not batch jobs, for catalogue data.

Over-retrieving context

Stuffing 20 chunks into the LLM context window dilutes relevant information. According to research from Stanford's "Lost in the Middle" paper (Liu et al., 2023), LLMs perform worse when relevant information is buried among many retrieved documents. Keep retrieval to 3–5 highly relevant chunks.

Ignoring multilingual realities

APAC e-commerce means multilingual queries. A customer in Hong Kong might type "呢對鞋有冇size 42" (mixing Cantonese and English). Test your embedding model on real mixed-language queries from your support logs before committing to a provider.

No fallback path

When confidence is low, the system should hand off to a human agent — not guess. Implement a confidence threshold based on the top retrieval score. If the highest similarity score is below 0.65, route to human support.
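A minimal sketch of that gate, assuming retrieval hits arrive as score-sorted dicts (Pinecone's client returns match objects with a `score` attribute; adapt the access accordingly):

```python
def route_with_confidence(matches: list[dict], threshold: float = 0.65) -> dict:
    """Hand off to a human agent when the best retrieval score is weak.

    `matches` are retrieval hits sorted by descending similarity score.
    The 0.65 default mirrors the threshold discussed above -- tune it
    against your own query logs rather than treating it as universal.
    """
    if not matches or matches[0]["score"] < threshold:
        return {"action": "handoff_to_human",
                "reason": "low retrieval confidence"}
    return {"action": "generate", "context": matches}

print(route_with_confidence([{"score": 0.42}]))
# {'action': 'handoff_to_human', 'reason': 'low retrieval confidence'}
```

The same gate is a natural place to log low-confidence queries, since they reveal catalogue gaps or chunking problems.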

How do you scale RAG system implementation across e-commerce AI workflows in multiple markets?

Scaling from one market to multiple APAC markets introduces three complexities: data residency, language, and platform fragmentation.

For data residency, use vector database deployments regional to your customer base. Pinecone Serverless supports AWS ap-southeast-1 (Singapore), which covers most Southeast Asian compliance requirements. For Australian operations, consider ap-southeast-2 (Sydney) to comply with the Australian Privacy Principles.

For language, maintain separate embedding spaces per language if your product descriptions are fully translated, or use a single multilingual embedding model if descriptions are mixed. The latter is simpler but approximately 8–12% less accurate in our testing.
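If you run separate per-language embedding spaces, the query side needs a matching router. A rough sketch using CJK character detection as the heuristic (the model names mirror those discussed earlier; each model must only ever query an index built with that same model):

```python
def contains_cjk(text: str) -> bool:
    """True if the text contains CJK Unified Ideographs (basic block
    U+4E00..U+9FFF or Extension A U+3400..U+4DBF)."""
    return any("\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"
               for ch in text)

def pick_embedding_model(query: str) -> str:
    """Illustrative router between embedding spaces by query language."""
    if contains_cjk(query):
        return "embed-multilingual-v3.0"  # Cohere: handles CJK/English mixing
    return "text-embedding-3-small"       # OpenAI: English-only queries

print(pick_embedding_model("呢對鞋有冇size 42"))  # embed-multilingual-v3.0
```

This is deliberately simplistic; mixed-language stores are usually better served by embedding everything with the single multilingual model, as noted above.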

For platform fragmentation — say Shopify Plus in Hong Kong, SHOPLINE in Taiwan, Adobe Commerce in Australia — abstract your data ingestion layer behind a common interface. Each platform adapter normalises product data into a shared schema before chunking and embedding. This lets you maintain one RAG pipeline instead of three.
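That adapter pattern can be sketched as a shared schema plus one normaliser per platform. The `CanonicalVariant` field names below are illustrative, not a fixed standard:

```python
from dataclasses import dataclass

@dataclass
class CanonicalVariant:
    """Shared schema every platform adapter normalises into before
    chunking and embedding."""
    product_id: str
    variant_id: str
    title: str
    variant_title: str
    price: str
    currency: str
    inventory_quantity: int
    region: str

def from_shopify(product: dict) -> list[CanonicalVariant]:
    """Shopify adapter: one canonical record per purchasable variant.
    SHOPLINE and Adobe Commerce adapters would map their own payloads
    into the same dataclass."""
    return [
        CanonicalVariant(
            product_id=str(product["id"]),
            variant_id=str(v["id"]),
            title=product["title"],
            variant_title=v.get("title", "Default"),
            price=v["price"],
            currency=product.get("currency", "HKD"),
            inventory_quantity=v.get("inventory_quantity", 0),
            region=product.get("market_region", "APAC"),
        )
        for v in product.get("variants", [])
    ]
```

Downstream chunking and embedding then only ever see `CanonicalVariant`, so adding a fourth platform means writing one adapter, not a fourth pipeline.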


Building a production-grade RAG system for e-commerce requires disciplined engineering at every layer: chunking strategies that respect product data structure, hybrid retrieval that handles both semantic and exact-match queries, grounded generation with low temperature, and continuous evaluation against real customer queries. The patterns described here are directly applicable whether you're running a 5,000-SKU Shopify Plus store or a 200,000-SKU Adobe Commerce deployment across multiple APAC markets.

Branch8 designs and builds RAG pipelines and AI-powered customer workflows for e-commerce brands operating across Asia-Pacific. Contact our engineering team to discuss your catalogue scale, platform stack, and market requirements.


Sources

  • Gartner Hype Cycle for Artificial Intelligence 2024: https://www.gartner.com/en/articles/what-s-new-in-artificial-intelligence-from-the-2024-gartner-hype-cycle
  • OpenAI Embeddings Documentation and Pricing: https://platform.openai.com/docs/guides/embeddings
  • Cohere Embed Multilingual v3 Model Card: https://docs.cohere.com/docs/embed
  • Stanford "Lost in the Middle" Paper (Liu et al., 2023): https://arxiv.org/abs/2307.03172
  • RAGAS Evaluation Framework Documentation: https://docs.ragas.io/en/latest/
  • Zendesk CX Trends Report 2024: https://www.zendesk.com/cx-trends-report/
  • McKinsey "The State of AI in Retail" 2024: https://www.mckinsey.com/industries/retail/our-insights
  • Pinecone Serverless Documentation: https://docs.pinecone.io/guides/getting-started/overview

FAQ

How much does an e-commerce RAG system cost to run?

For a mid-size retailer handling around 45,000 queries per month with 12,000 active SKUs, expect approximately USD $400–600 monthly for embedding generation, vector database hosting, and LLM inference. The primary cost drivers are query volume and the generation model chosen — GPT-4o-mini and Claude 3.5 Haiku are the most cost-effective options currently.

About the Author

Matt Li

Co-Founder & CEO, Branch8 & Second Talent

Matt Li is Co-Founder and CEO of Branch8, a Y Combinator-backed (S15) Adobe Solution Partner and e-commerce consultancy headquartered in Hong Kong, and Co-Founder of Second Talent, a global tech hiring platform ranked #1 in Global Hiring on G2. With 12 years of experience in e-commerce strategy, platform implementation, and digital operations, he has led delivery of Adobe Commerce Cloud projects for enterprise clients including Chow Sang Sang, HomePlus (HKBN), Maxim's, Hong Kong International Airport, Hotai/Toyota, and Evisu. Prior to founding Branch8, Matt served as Vice President of Mid-Market Enterprises at HSBC. He serves as Vice Chairman of the Hong Kong E-Commerce Business Association (HKEBA). A self-taught software engineer, Matt graduated from the University of Toronto with a Bachelor of Commerce in Finance and Economics.