Grindbyte

Data model for an retail & ecommerce analytics data warehouse

As we have already established in other articles in this series, there are plenty of data science and machine learning opportunities out there to improve your ecommerce or retail business, and to support a more data driven strategy.

To unlock these opportunities, the first step is to democratize your data to your team, so that the team have the data at their fingertips, ready to be used for analysis and insights. This is done by creating a data warehouse, and transforming the format of your data from your ecommerce system into a more analytic friendly structure.

At Grindbyte, we have worked with a wide range of retailers and ecommerce businesses ranging from some of the largest ones in the US with over 9000 physical stores and 1M articles in their assortment, to smaller ecommerce businesses in single markets. What we see over and over again is that the same data model support all of these businesses with minor difference depending on if its brick and mortar or online, and the industry or type of products that the business is selling.

If you want to accelerate your data warehouse and analytical capabilities, feel free to reach out to the Grindbyte team to get a chance to review our services.

In this blog post I will share with you a standard data model for your data warehouse that will most likely work for your retail business, you can of course tweak this model to fit your needs, but this should be a great starting point to enable your business with analytics.

Introduction

This article is structured by first laying out the full data model, and then walking through each section of it to describe the thinking behind it and key things for you to consider and watch out for.

A few things on terminology:

id refers to the internal ID of the table that is generated by the system or by the data engineer. It is only used to join data together and to provide a unique key per table.
reference refers to the ID that the business uses to identify the data. For example it could be your article ID, your category ID etc. In this data model, the reference is only used for presentation and labeling. We do not rely on the reference ID's for joins or uniqueness.
This data model should work both for brick and mortar and online businesses. It has been used for many omnichannel companies. You may want to do smaller tweaks if you are only in a single channel with your business.

Categroy Hierarchy - Departments, Categories and Subcategories

Plenty of ecommerce and retail systems allow for a dynamic depth of your category hierarchy. You could have 1 level or 10 levels, it is all defined by the person who is setting up the product catalogue, and the dynamic depth is usually enabled by a recursive relationship where a Category may have a relationship to a parent Category.

For analytical purposes, this kind of infinite recursive relationship loop could be quite challenging to work with. Some articles may be placed in level 3, others may be placed in level 5. You do not know it and you are forced to always join in the deepest depth of categories in every query where you want to include the category hierarchy of the data. This will add additional complexity to your analytical queries.

Our advice is to attempt to turn your category hierarchy into a fixed depth. In this data model it is defined as 3 levels - Department, Category and Subcategory. For some retailers we see a fourth level ("Product Category"), but it is often too granular to be useful for analytical use cases, so 3 levels are usually sufficient.

This transformation can be done by first joining in all the levels of your product hierarchies for each product all the way to the root category, ensuring each product is assigned to the top 3 levels, and then dropping the rest. As you will later see on the Product data entity, we then store relationships to these product levels from the product data, so that we can always select and group products per category level.

By structuring our product hierarchy into 3 levels (Departments, Categories, Subcategories) and defining relationships between them (e.g. Category have a relationship to a Department), and then defining relationship for each Article to each level, it allow us to do the following:

Easily query any dataset with relationship to product by its hierarchy. E.g. we can easily get the sales, stock or purchase frequency by Department or Category.
By embedding the product hierarchy into training datasets we can use transfer learning to understand behavior of products within a category, in situations where each individual product may have too little data.
Align your organization into what levels you provide analytics and dashboards. By limiting it to a fixed set of levels you provide an opinion in your dataset on what levels visualizations and reporting should be done.

If your organization do not use at least 3 levels as part of the product hierarchy, I would suggest that this is something that should change within the organization. If it is not granular enough, not only does it make it challenging for customers to navigate your assortment, but it also makes it challenging to get insights from the data since essentially all products get thrown into very large buckets and there is no logical grouping of the product data.

Brands and Vendor datasets

Brand here refers to the customer facing brand of the product. E.g. in hygiene products you may have "Gilette" as a product brand. Vendor (also sometimes called "Supplier" or "Manufacturer") refers to the wholeseller of the product where you as a business is getting supplied from. In the case of "Gilette" it would be "Procter & Gamble".

This data is rarely used to train a model, however it is very useful for reporting purposes. E.g. the category managers of your organization may want to see sales and performance by Vendor as they go into vendor negotiations for things such as vendor funding, or when they plan a promotion they may want to plan it by Brand.

As you can see, the tables representing both Brand and Vendor is very simple, and the key value here is that they have a relationship from Article, so we can very easily get each article's sales performance, stock, cost etc and group it by either Brand or Vendor.

One important callout is the concept of "National Brand" vs "Own Brand" products which is represented by the is_own_brand flag in our data model. This concept refers to the idea that some products are supplied by national vendors that your competitor also have access to (e.g. a grocery store selling Coca Cola, Coca Cola is a national brand) versus products that you have developed in-house and labeled yourself (e.g. a grocery store may produce their own Cola products).

This distinction is often very helpful to have in the dataset to allow for slicing of the data by products that you develop yourself versus products that you purchase from third party suppliers.

Customer data

Customer data is generally a bit of a sensitive topic due to GDPR and other data laws that is sensitive to PII (Personal Identifiable Information). The good news is that for most analytical problems, you can usually get away with dropping all of the PII information. A user's name, address or email address is rarely useful to predict demand, do customer clustering, or other reporting.

Generally the questions you want to ask from your customer data would be:

What is my customer lifetime value?
Is this a returning customer or not?
Is this a high paying customer or not?
What product segment does this customer fall under?
What other customers are similar to this customer?
What is the purchase frequency of this customer?

Most of these questions can be answered by simply having relationship between the customer data and order history. This will allow you to create financial or purchase behavior based features of your customer dataset. Other interesting attributes to keep on your customer dataset may be Age and Gender which can be used for customer clustering and segmentation.

In this particular datamodel, you see that we keep the customer data model very minimal. If you have additional attributes such as Age or Gender, I suggest you add that to the Customer model.

Site / Website / Store data

Depending on if you are a retail business with only an ecommerce presence, if you have physical stores, or if you have multiple websites in different markets, you will have a different need to keep track on orders and performance per "Site" or Store.

To avoid having to have separate data models for ecommerce transactions vs physical store transactions, our suggestion is that you instead combine the concept of a "Website" with a "Store" into "Sites", and then you combine your online vs store transactions into a single "transaction" dataset that in turn refers to a specific site, which then indicates if its an online transaction or a physical store transaction.

By doing this, you will easily have a holistic view of the performance of products across your business rather than having completely separate data models for different channels.

If you only have ecommerce presence, you can get away with a simpler Site table that would contain fields such as:

Reference, your internal way to refering to this website. Could be a domain name or country code.
Region, which region this website covers. For ecommerce a region would generally be group of countries.
Country, which country or language this website covers.

If you have an omnichannel presence that also includes physical stores, the recommendation is that you also add fields such as:

Address
Longitude/Latitude
Store Type (is it a "megastore" or a smaller store?)
Size of store
Link to distribution centre or warehouse that is supplying the store

These additional data points will help with clustering the stores and slicing data by different store types.

Price and Cost datasets

Obviously one of the more critical pieces of information from a retail business is what the price and the cost of a product is. This will be fundamental in being able to calculate things such as Profit/Margin, Revenue and to understand how your products are priced versus competitors.

For price information, the best would be to get it straight from a source that stores the price per date, article and store over time. You want to ensure that you have price history for the full training date range for your models, which usually would be 2+ years to ensure you capture seasonality.

If your business do not store a full history of prices and when each price was active, you can usually infer this information from your transaction or order history, where you should be able to get a view of full price of items purchased on different days.

When it comes to cost, there are usually two types of costs that you will find within a retailer. There is the "cost of inventory", which essentially is a moving average cost that represent the cost over time that you paid to build up your inventory or stock of products. This cost is "backwards looking". The second type of cost is the "latest vendor cost" which is the "forward looking" cost used to calculate forward looking profit.

For purposes of forecasting, the forward looking cost is critical to be able to give an accurate view of profits. If you want to assess value of inventory, then the backwards looking moving average cost is the one you want to be using.

Finally, the pricing strategy of the retailer will have an impact of the granularity of the prices. For example, some retailers have the same price across all stores/sites, while others have different prices in different regions. So please ask yourself the following questions to understand how you must extend this data model to fit your business:

What is the lowest granularity that we price products? Is it per store? Region? Country?
Does my business do pricing in multiple currencies?
Does all stores have the same cost for a product, or is there situations where different stores may use different vendors for the same product, and therefore have different cost and margin?

Products and Article Data Model

One of the central tables of our datamart/warehouse and data model is the "products" table. This is the table that link all the product attributes together into a single entity.

One of the challenges with the product table is to define the granularity of a product. Generally you would have a few different levels of products:

Product Family - Products that should be priced together, e.g. Yoghurt Strawberry and Yoghurt Banana are grouped together into a single product family.
Product Variant - A single product that holds price, inventory, cost etc. For SAP users this would generally be the "Article + Unit of Measure" granularity. For ecommerce users, this is generally what the ecommerce platform calls "Product". On some platforms, all product variants are bundled into a single "Product". E.g. Red/Yellow/Green versions of the same T-Shirt could be given the same ID. If that's the case for your business, the recommendation is to break it apart so you have separate entries in your datalake for the "Red" vs the "Yellow" one as distinct variants.
Production Version - A single product could have different versions of it over time. This is usually defined with an EAN/UPC code. For example, a Coca Cola 330ML can may have one version with Santa Clause printed on it, and one with a marketing campaign printed on it. Both of them are the same product ("Coca Cola 330ML"), but in this case they would likeley have separate UPC/EAN codes since they are different "versions". This is generally too granular to do analysis, if you work on this level it most likeley means that you cannot track products over long periods of history since the EAN/UPC code frequently change.

Our recommendation is that you work on a "Product Variant" level. So a single product could have many different UPC/EAN codes throughout history, but you can still track them