RAG System Implementation for E-Commerce AI Workflows: A Step-by-Step Guide

Key Takeaways
- Chunk per product variant, not fixed token size
- Use hybrid search combining vector similarity and BM25 keyword matching
- Route order queries to live APIs, not the vector store
- Evaluate weekly with RAGAS on 100+ real customer queries
- Multilingual embeddings are essential for APAC markets
Quick Answer: Implement e-commerce RAG by chunking product data per variant, embedding with multilingual models, using hybrid vector + keyword search for retrieval, and grounding LLM generation against retrieved context with low temperature settings. Use webhook-driven updates to keep catalogue data current.
RAG system implementation for e-commerce AI workflows lets online retailers serve accurate, context-aware answers to product questions, order enquiries, and support tickets — without hallucinating details about inventory, pricing, or policies. This tutorial walks through the full pipeline architecture, from data ingestion to production deployment, using patterns we've refined across APAC retail clients running Shopify Plus, Adobe Commerce, and SHOPLINE.
Related reading: Shopify Plus Marketplace Sync for Shopee Lazada Amazon: APAC Guide
Related reading: Multi-Market CDP Activation Playbook for Retail in APAC
Related reading: Meta Layoffs and Tech Hiring: Why APAC Strategy Shifts to Digital Agencies
Related reading: Shopify Plus Multi-Currency Checkout APAC Setup: A Complete Guide
Retrieval-Augmented Generation (RAG) is gaining traction fast. According to Gartner's 2024 Hype Cycle for AI, RAG is one of the most adopted architectural patterns for enterprise generative AI, with over 60% of organisations exploring or piloting it. For e-commerce specifically, the appeal is obvious: your product catalogue, FAQ content, and order data change constantly, and fine-tuning an LLM every time a SKU updates is neither practical nor cost-effective.
Related reading: Quantization LLM Inference Cost Optimization: Cut Costs 60–80%
This guide covers the exact stack, code, and architecture decisions you need to build a production-grade RAG pipeline for an e-commerce operation.
What problem does RAG solve for e-commerce that fine-tuning cannot?
Fine-tuning bakes knowledge into model weights. That works for static domains, but e-commerce catalogues are anything but static. A mid-size APAC fashion retailer might update 2,000–5,000 SKUs per week across seasonal rotations, flash sales, and regional pricing. Fine-tuning on that cadence is expensive and slow.
RAG decouples the knowledge layer from the reasoning layer. Your LLM handles language understanding and generation; your vector database handles the current state of product data, policies, and order information. When a customer asks "Is the Nike Air Max 90 available in size 42 in Singapore?", the retrieval step fetches the live inventory record, and the LLM composes a natural-language answer grounded in that data.
This architecture also gives you auditability. Every response traces back to specific source documents, which matters when you're operating across jurisdictions like Hong Kong, Australia, and Taiwan where consumer protection regulations differ.
What does the production architecture look like?
Here's the component stack we've deployed for APAC e-commerce clients. Each layer has specific tool choices with reasoning.
Data Sources
- Product catalogue — Pulled via Shopify Admin API (GraphQL), Adobe Commerce REST API, or SHOPLINE's API depending on the platform
- FAQ / Help centre content — Markdown or HTML from CMS (typically headless Contentful or Strapi)
- Order data — Real-time via webhooks; historical via database queries
- Policy documents — Returns, shipping, warranty PDFs parsed into structured text
Ingestion and Chunking Pipeline
- LangChain v0.2 or LlamaIndex v0.10 for orchestration
- Unstructured.io for parsing PDFs, HTML, and mixed-format docs
- Custom chunking logic (more on this below)
Embedding and Vector Storage
- OpenAI text-embedding-3-small (1536 dimensions, $0.02/1M tokens as of Q1 2025) or Cohere embed-multilingual-v3.0 for CJK language support
- Pinecone Serverless or Qdrant (self-hosted on AWS ap-southeast-1 for Singapore-based deployments)
Retrieval and Generation
- Hybrid search: vector similarity + BM25 keyword matching via Pinecone's sparse-dense index or Qdrant's built-in hybrid mode
- GPT-4o-mini or Claude 3.5 Haiku for generation (cost-optimised for high-volume customer queries)
- Guardrails AI v0.4 for output validation
Serving Layer
- FastAPI backend, deployed on AWS ECS Fargate or Google Cloud Run
- WebSocket integration with Shopify Plus storefront or SHOPLINE chat widget
Ready to Transform Your Ecommerce Operations?
Branch8 specializes in ecommerce platform implementation and AI-powered automation solutions. Contact us today to discuss your ecommerce automation strategy.
How should you chunk e-commerce product data?
Chunking strategy is where most RAG implementations succeed or fail. Generic 500-token fixed-size chunks destroy the relational structure of product data. A product has a name, description, specifications, variants, pricing, and availability — splitting these across chunks means the retriever often fetches incomplete context.
Product-aware chunking
For catalogue data, we chunk per product variant, not per text block. Each chunk contains the full context for one purchasable item:
```python
from langchain.schema import Document

def chunk_shopify_product(product: dict) -> list[Document]:
    """Create one chunk per variant with full product context."""
    chunks = []
    for variant in product.get("variants", []):
        content = f"""
Product: {product['title']}
Brand: {product.get('vendor', 'N/A')}
Category: {product.get('product_type', 'N/A')}
Description: {product.get('body_html_stripped', '')}

Variant: {variant.get('title', 'Default')}
SKU: {variant.get('sku', 'N/A')}
Price: {variant.get('price')} {product.get('currency', 'HKD')}
Compare-at Price: {variant.get('compare_at_price', 'N/A')}
Available: {variant.get('inventory_quantity', 0) > 0}
Inventory: {variant.get('inventory_quantity', 0)} units
""".strip()

        metadata = {
            "source": "shopify_catalogue",
            "product_id": str(product["id"]),
            "variant_id": str(variant["id"]),
            "product_type": product.get("product_type", ""),
            "vendor": product.get("vendor", ""),
            "tags": product.get("tags", ""),
            "updated_at": product.get("updated_at", ""),
            "region": product.get("market_region", "APAC"),
        }

        chunks.append(Document(page_content=content, metadata=metadata))
    return chunks
```
The metadata fields are critical — they enable filtered retrieval. When a customer in Taiwan asks about pricing, you filter by region: TW before running similarity search, avoiding irrelevant results from other markets.
FAQ and policy chunking
For help centre content, use semantic chunking based on heading structure rather than fixed token counts. LlamaIndex's SemanticSplitterNodeParser handles this well:
```python
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=85,
    embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
)

nodes = splitter.get_nodes_from_documents(faq_documents)
```
This groups semantically related sentences together, so a returns policy section stays intact rather than getting split mid-paragraph.
How do you set up the retrieval pipeline with hybrid search?
Pure vector search misses exact matches that matter in e-commerce. A customer searching for SKU "NKE-AM90-BLK-42" needs lexical matching, not semantic similarity. Hybrid search combines both.
Here's a Pinecone Serverless implementation with sparse-dense vectors:
```python
from pinecone import Pinecone, ServerlessSpec
from pinecone_text.sparse import BM25Encoder
from openai import OpenAI

# Initialise clients
pc = Pinecone(api_key="YOUR_API_KEY")
openai_client = OpenAI()

# Create index with dotproduct metric for hybrid search
pc.create_index(
    name="ecommerce-rag",
    dimension=1536,
    metric="dotproduct",
    spec=ServerlessSpec(cloud="aws", region="ap-southeast-1"),
)

index = pc.Index("ecommerce-rag")

# Fit BM25 on your corpus
bm25 = BM25Encoder()
bm25.fit([doc.page_content for doc in all_chunks])

def hybrid_search(query: str, top_k: int = 5, alpha: float = 0.7,
                  filters: dict | None = None):
    """alpha weights toward dense (semantic): 0.7 = 70% semantic, 30% keyword."""
    # Dense embedding
    dense_vec = openai_client.embeddings.create(
        input=query, model="text-embedding-3-small"
    ).data[0].embedding

    # Sparse BM25 encoding
    sparse_vec = bm25.encode_queries(query)

    # Scale dense and sparse components by alpha
    results = index.query(
        vector=[v * alpha for v in dense_vec],
        sparse_vector={
            "indices": sparse_vec["indices"],
            "values": [v * (1 - alpha) for v in sparse_vec["values"]],
        },
        top_k=top_k,
        filter=filters,
        include_metadata=True,
    )

    return results.matches
```
The alpha parameter is key. Through testing across three APAC retail deployments, we've found that 0.7 (70% semantic, 30% keyword) works well for general product queries, while order-status and SKU-lookup queries perform better at 0.3 (more keyword weight). You can route dynamically based on query classification.
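That dynamic routing can start as something very simple. A sketch assuming a cheap regex-based classifier (the patterns and the `choose_alpha` helper are illustrative; a production system would tune these against real query logs):

```python
import re

# Illustrative heuristics: SKU codes and order numbers need exact lexical
# matching, so they get a keyword-heavy alpha; everything else defaults to
# the semantic-leaning 0.7 that worked for general product queries.
SKU_PATTERN = re.compile(r"\b[A-Z]{2,}(?:-[A-Z0-9]+){2,}\b")
ORDER_PATTERN = re.compile(r"\border\s*#?\d{4,}\b", re.IGNORECASE)

def choose_alpha(query: str) -> float:
    """Pick the hybrid-search weighting from a cheap query classification."""
    if SKU_PATTERN.search(query) or ORDER_PATTERN.search(query):
        return 0.3  # favour BM25 keyword matching
    return 0.7      # favour semantic similarity
```

The result feeds straight into the `alpha` argument of the hybrid search above, so adding new routing rules never touches the retrieval code itself.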
How do you build the generation layer with grounding?
Retrieval alone isn't enough — you need to ensure the LLM actually uses the retrieved context and doesn't hallucinate details. This is where prompt engineering and output validation matter.
```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """
You are a helpful e-commerce assistant for {store_name}.

Rules:
1. Answer ONLY based on the provided context documents.
2. If the context doesn't contain enough information, say "I don't have that
   information. Let me connect you with our support team."
3. Always include the specific product name, price, and availability when
   discussing products.
4. For pricing, always specify the currency.
5. Never invent specifications, dimensions, or compatibility information.
6. For order queries, reference the order number and current status.

Context documents:
{context}
"""

def generate_response(query: str, retrieved_docs: list,
                      store_name: str = "Store") -> dict:
    context = "\n---\n".join([
        f"Source: {doc.metadata.get('source', 'unknown')}\n{doc.page_content}"
        for doc in retrieved_docs
    ])

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(
                store_name=store_name, context=context
            )},
            {"role": "user", "content": query},
        ],
        temperature=0.1,  # Low temperature for factual accuracy
        max_tokens=500,
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": [doc.metadata for doc in retrieved_docs],
        "model": "gpt-4o-mini",
        "tokens_used": response.usage.total_tokens,
    }
```
Note the temperature=0.1. According to OpenAI's own documentation, lower temperature values produce more deterministic outputs. For e-commerce, where stating the wrong price or availability can erode trust (or create legal liability in markets like Australia under Australian Consumer Law), you want minimal creativity in the generation.
How did Branch8 deploy this for a regional fashion retailer?
In Q4 2024, we built a RAG-powered customer assistant for a Hong Kong–based fashion brand operating Shopify Plus stores across Hong Kong, Singapore, and Taiwan. The catalogue had roughly 12,000 active SKUs with three regional price lists and localised product descriptions in English, Traditional Chinese, and Simplified Chinese.
The key challenges were multilingual retrieval and region-specific accuracy. We used Cohere's embed-multilingual-v3.0 model instead of OpenAI's embedding model because it handles CJK character mixing (common when Hong Kong customers code-switch between English and Cantonese) significantly better in our benchmarks — retrieval accuracy at k=5 improved from 72% to 89% on our test set of 500 real customer queries.
The ingestion pipeline ran on AWS Lambda, triggered by Shopify webhooks on products/update and products/create events. This kept the vector index within 30 seconds of the live catalogue — critical during flash sales when inventory changes rapidly.
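If you replicate this webhook-driven pattern, verify each webhook's signature before writing to the index: Shopify signs the raw request body with your app secret and sends the result, base64-encoded, in the X-Shopify-Hmac-Sha256 header. A minimal verification helper using only the standard library (the function name is ours):

```python
import base64
import hashlib
import hmac

def verify_shopify_webhook(raw_body: bytes, hmac_header: str, secret: str) -> bool:
    """Check the X-Shopify-Hmac-Sha256 header before trusting a webhook payload."""
    digest = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).digest()
    expected = base64.b64encode(digest).decode("utf-8")
    # Constant-time comparison to avoid timing attacks
    return hmac.compare_digest(expected, hmac_header)
```

Reject anything that fails this check; otherwise an attacker who discovers your Lambda endpoint can poison the vector index with fabricated product data.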
For the generation layer, we used Claude 3.5 Haiku via Amazon Bedrock (available in the ap-southeast-1 region) to keep data within APAC infrastructure. Total deployment took six weeks from architecture sign-off to production, with two engineers. Monthly running cost settled at approximately USD $420 for a volume of around 45,000 queries per month — substantially cheaper than the three full-time support agents it partially replaced for routine product and order queries.
How do you handle real-time order data in the RAG pipeline?
Product catalogue data changes frequently, but not in real time on a per-query basis. Order data is different: an order that reads "processing" at 2:00 PM might be "shipped" by 2:05 PM. You can't embed order data into a vector store and expect accuracy.
The solution is a tool-use pattern where the LLM decides when to call a live API:
```python
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_order_status",
            "description": "Look up the current status of an order by order number or email.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_number": {"type": "string"},
                    "email": {"type": "string"},
                },
                "required": ["order_number"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_product_catalogue",
            "description": "Search products by description, name, category, or specification.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "region": {"type": "string", "enum": ["HK", "SG", "TW", "AU"]},
                },
                "required": ["query"],
            },
        },
    },
]

def route_query(user_message: str, store_name: str):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"You are a helpful assistant for {store_name}."},
            {"role": "user", "content": user_message},
        ],
        tools=tools,
        tool_choice="auto",
    )

    message = response.choices[0].message
    if message.tool_calls:
        tool_call = message.tool_calls[0]
        func_name = tool_call.function.name
        args = json.loads(tool_call.function.arguments)

        if func_name == "lookup_order_status":
            # Call the Shopify/Adobe Commerce order API directly
            return fetch_live_order(args)
        elif func_name == "search_product_catalogue":
            # Route to the RAG pipeline
            return hybrid_search(args["query"], filters={"region": args.get("region")})

    return message.content
```
This hybrid approach — RAG for catalogue and policy questions, live API calls for order data — keeps response accuracy above 95% across query types. McKinsey's 2024 report on AI in retail found that AI assistants with access to real-time order data reduce "where is my order" support tickets by 35–45%.
What are the key evaluation metrics for e-commerce RAG?
You can't improve what you don't measure. Here are the metrics that matter, with target benchmarks from our deployments:
Retrieval Quality
- Recall@5: Does the correct source document appear in the top 5 retrieved chunks? Target: >85%
- Mean Reciprocal Rank (MRR): How high does the correct document rank? Target: >0.75
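Both retrieval metrics are simple enough to compute without a framework when you want a quick sanity check. A self-contained sketch (`recall_at_k` and `mean_reciprocal_rank` are our own helper names):

```python
def recall_at_k(ranked_ids: list[str], relevant_id: str, k: int = 5) -> float:
    """1.0 if the correct document appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mean_reciprocal_rank(results: list[tuple[list[str], str]]) -> float:
    """Average of 1/rank of the correct document across queries (0 if absent).

    Each item pairs the retriever's ranked document IDs with the known
    correct ID for that query.
    """
    total = 0.0
    for ranked_ids, relevant_id in results:
        if relevant_id in ranked_ids:
            total += 1.0 / (ranked_ids.index(relevant_id) + 1)
    return total / len(results)
```

Run these over your labelled test queries and compare against the >85% and >0.75 targets above.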
Generation Quality
- Faithfulness: Does the answer contain only information from the retrieved context? Measured via LLM-as-judge evaluation using RAGAS framework v0.1. Target: >0.9
- Answer relevancy: Does the answer actually address the user's question? Target: >0.85
Business Metrics
- Deflection rate: Percentage of queries resolved without human handoff. Realistic target for e-commerce: 60–75%
- Customer satisfaction (CSAT): Post-interaction survey score. A Zendesk 2024 CX Trends report found that AI assistants grounded in company data achieve CSAT scores within 5% of human agents
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Evaluate on your test dataset (a HuggingFace Dataset with
# question, answer, contexts, and ground_truth columns)
result = evaluate(
    dataset=test_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)

print(result)
# e.g. {'faithfulness': 0.92, 'answer_relevancy': 0.87, 'context_precision': 0.84}
```
Run this evaluation weekly against a curated test set of 100–200 real customer queries. When scores drop, it usually means your chunking strategy needs updating for new product categories or your embedding model is struggling with new terminology.
What are the common pitfalls to avoid?
Stale embeddings
If your vector index updates on a daily batch job but your catalogue changes hourly during sale events, customers get wrong availability and pricing information. Use webhook-driven incremental updates, not batch jobs, for catalogue data.
Over-retrieving context
Stuffing 20 chunks into the LLM context window dilutes relevant information. According to research from Stanford's "Lost in the Middle" paper (Liu et al., 2023), LLMs perform worse when relevant information is buried among many retrieved documents. Keep retrieval to 3–5 highly relevant chunks.
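One pragmatic way to enforce that cap is to trim retrieval results before they reach the prompt. A sketch with illustrative defaults (the 0.8 relative-score cutoff and the `trim_context` helper are assumptions, not benchmarked values):

```python
def trim_context(matches: list[dict], max_chunks: int = 5,
                 min_relative_score: float = 0.8) -> list[dict]:
    """Keep at most max_chunks results, dropping any that score well below the best."""
    if not matches:
        return []
    ranked = sorted(matches, key=lambda m: m["score"], reverse=True)
    best = ranked[0]["score"]
    # A chunk scoring far below the top hit is more likely noise than signal
    return [m for m in ranked[:max_chunks] if m["score"] >= best * min_relative_score]
```

The relative cutoff matters during long-tail queries: when even the best match is weak, passing four more equally weak chunks only gives the model more material to hallucinate from.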
Ignoring multilingual realities
APAC e-commerce means multilingual queries. A customer in Hong Kong might type "呢對鞋有冇size 42" (mixing Cantonese and English). Test your embedding model on real mixed-language queries from your support logs before committing to a provider.
No fallback path
When confidence is low, the system should hand off to a human agent — not guess. Implement a confidence threshold based on the top retrieval score. If the highest similarity score is below 0.65, route to human support.
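That handoff logic can live in a few lines between retrieval and generation. A sketch using the 0.65 threshold above (`answer_or_escalate` and the match format are illustrative):

```python
HANDOFF_THRESHOLD = 0.65  # below this top similarity score, escalate to a human

def answer_or_escalate(query: str, matches: list[dict]) -> dict:
    """Route to generation only when retrieval confidence clears the threshold.

    `matches` is assumed to be similarity-sorted, each with a `score` field.
    """
    if not matches or matches[0]["score"] < HANDOFF_THRESHOLD:
        return {
            "action": "handoff",
            "message": "Let me connect you with our support team.",
        }
    return {"action": "generate", "context": matches[:5]}
```

Log every handoff alongside the query and top score; those records are exactly the cases to review when tuning the threshold or expanding the knowledge base.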
How do you scale RAG system implementation across e-commerce AI workflows in multiple markets?
Scaling from one market to multiple APAC markets introduces three complexities: data residency, language, and platform fragmentation.
For data residency, use vector database deployments regional to your customer base. Pinecone Serverless supports AWS ap-southeast-1 (Singapore), which covers most Southeast Asian compliance requirements. For Australian operations, consider ap-southeast-2 (Sydney) to comply with the Australian Privacy Principles.
For language, maintain separate embedding spaces per language if your product descriptions are fully translated, or use a single multilingual embedding model if descriptions are mixed. The latter is simpler but approximately 8–12% less accurate in our testing.
For platform fragmentation — say Shopify Plus in Hong Kong, SHOPLINE in Taiwan, Adobe Commerce in Australia — abstract your data ingestion layer behind a common interface. Each platform adapter normalises product data into a shared schema before chunking and embedding. This lets you maintain one RAG pipeline instead of three.
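The adapter layer can be as simple as one dataclass plus one mapping function per platform. A sketch with a deliberately minimal shared schema (the field names pulled from the raw payloads are assumptions based on typical Shopify and Adobe Commerce responses, not a complete mapping):

```python
from dataclasses import dataclass

@dataclass
class NormalisedProduct:
    """Shared schema every platform adapter maps into before chunking."""
    product_id: str
    title: str
    price: str
    currency: str
    region: str

def from_shopify(raw: dict, region: str) -> NormalisedProduct:
    # Shopify keeps price on the variant; take the first variant as default
    variant = raw["variants"][0]
    return NormalisedProduct(
        product_id=str(raw["id"]),
        title=raw["title"],
        price=str(variant["price"]),
        currency=raw.get("currency", "HKD"),
        region=region,
    )

def from_adobe_commerce(raw: dict, region: str) -> NormalisedProduct:
    # Adobe Commerce exposes price at the product level
    return NormalisedProduct(
        product_id=str(raw["id"]),
        title=raw["name"],
        price=str(raw["price"]),
        currency=raw.get("currency", "AUD"),
        region=region,
    )
```

Everything downstream (chunking, embedding, retrieval) then consumes only `NormalisedProduct`, so adding a fourth platform means writing one adapter, not touching the pipeline.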
Building a production-grade RAG system for e-commerce requires disciplined engineering at every layer: chunking strategies that respect product data structure, hybrid retrieval that handles both semantic and exact-match queries, grounded generation with low temperature, and continuous evaluation against real customer queries. The patterns described here are directly applicable whether you're running a 5,000-SKU Shopify Plus store or a 200,000-SKU Adobe Commerce deployment across multiple APAC markets.
Branch8 designs and builds RAG pipelines and AI-powered customer workflows for e-commerce brands operating across Asia-Pacific. Contact our engineering team to discuss your catalogue scale, platform stack, and market requirements.
Sources
- Gartner Hype Cycle for Artificial Intelligence 2024: https://www.gartner.com/en/articles/what-s-new-in-artificial-intelligence-from-the-2024-gartner-hype-cycle
- OpenAI Embeddings Documentation and Pricing: https://platform.openai.com/docs/guides/embeddings
- Cohere Embed Multilingual v3 Model Card: https://docs.cohere.com/docs/embed
- Stanford "Lost in the Middle" Paper (Liu et al., 2023): https://arxiv.org/abs/2307.03172
- RAGAS Evaluation Framework Documentation: https://docs.ragas.io/en/latest/
- Zendesk CX Trends Report 2024: https://www.zendesk.com/cx-trends-report/
- McKinsey "The State of AI in Retail" 2024: https://www.mckinsey.com/industries/retail/our-insights
- Pinecone Serverless Documentation: https://docs.pinecone.io/guides/getting-started/overview
FAQ
What does it cost to run an e-commerce RAG system in production?
For a mid-size retailer handling around 45,000 queries per month with 12,000 active SKUs, expect approximately USD $400–600 monthly for embedding generation, vector database hosting, and LLM inference. The primary cost drivers are query volume and the generation model chosen — GPT-4o-mini and Claude 3.5 Haiku are the most cost-effective options currently.
About the Author
Matt Li
Co-Founder & CEO, Branch8 & Second Talent
Matt Li is Co-Founder and CEO of Branch8, a Y Combinator-backed (S15) Adobe Solution Partner and e-commerce consultancy headquartered in Hong Kong, and Co-Founder of Second Talent, a global tech hiring platform ranked #1 in Global Hiring on G2. With 12 years of experience in e-commerce strategy, platform implementation, and digital operations, he has led delivery of Adobe Commerce Cloud projects for enterprise clients including Chow Sang Sang, HomePlus (HKBN), Maxim's, Hong Kong International Airport, Hotai/Toyota, and Evisu. Prior to founding Branch8, Matt served as Vice President of Mid-Market Enterprises at HSBC. He serves as Vice Chairman of the Hong Kong E-Commerce Business Association (HKEBA). A self-taught software engineer, Matt graduated from the University of Toronto with a Bachelor of Commerce in Finance and Economics.