# E-commerce Data Generation
Generate synthetic e-commerce data with realistic user sessions, shopping carts, orders, and customer RFM metrics.
## Overview
The e-commerce generators create data suitable for:
- Conversion funnel analysis
- Customer segmentation (RFM)
- Cart abandonment studies
- Session behavior analytics
- A/B testing simulations
## Quick Start
```python
from superstore import ecommerce_data, EcommerceConfig
# Generate complete e-commerce dataset
data = ecommerce_data()
# Access individual tables
sessions_df = data["sessions"]
products_df = data["products"]
cart_events_df = data["cart_events"]
orders_df = data["orders"]
customers_df = data["customers"]
```
---
## Session Data
The `ecommerce_sessions()` function generates user session records with MarkovChain-based page navigation.
### Basic Usage
```python
from superstore import ecommerce_sessions
# Generate 1000 sessions
df = ecommerce_sessions(count=1000, seed=42)
```
### Output Schema
| Column | Type | Description |
|--------|------|-------------|
| `session_id` | str | Unique session identifier |
| `user_id` | str | User identifier |
| `start_time` | datetime | Session start timestamp |
| `end_time` | datetime | Session end timestamp |
| `duration_seconds` | int | Total session duration |
| `device_type` | str | Device type (desktop, mobile, tablet) |
| `browser` | str | Browser name |
| `traffic_source` | str | Traffic source (organic, paid_search, direct, social, email, referral, affiliate) |
| `landing_page` | str | First page viewed |
| `pages_viewed` | int | Number of pages viewed |
| `bounced` | bool | Whether session was a bounce (single page) |
| `converted` | bool | Whether session resulted in purchase |
| `total_value` | float | Total purchase value (0 if not converted) |
### MarkovChain Session States
Sessions navigate through states using a configurable transition matrix:
```
landing → browse → view_product → add_to_cart → view_cart → checkout_start → checkout_payment → purchase
↘ ↘ ↘ ↘ ↘
exit exit exit exit exit
```
---
## Product Catalog
The `ecommerce_products()` function generates a product catalog with realistic pricing.
### Basic Usage
```python
from superstore import ecommerce_products
# Generate 500 products
df = ecommerce_products(count=500, seed=42)
```
### Output Schema
| Column | Type | Description |
|--------|------|-------------|
| `product_id` | str | Unique product identifier |
| `name` | str | Product name |
| `category` | str | Product category |
| `subcategory` | str | Product subcategory |
| `price` | float | Product price (log-normal distribution) |
| `rating` | float | Average rating (1.0-5.0) |
| `review_count` | int | Number of reviews |
| `in_stock` | bool | Stock availability |
---
## Cart Events
Cart events track user interactions with shopping carts.
### Output Schema
| Column | Type | Description |
|--------|------|-------------|
| `event_id` | str | Unique event identifier |
| `session_id` | str | Associated session |
| `user_id` | str | User identifier |
| `timestamp` | datetime | Event timestamp |
| `event_type` | str | Event type (add, remove, update_quantity) |
| `product_id` | str | Product identifier |
| `quantity` | int | Item quantity |
| `unit_price` | float | Price per unit |
| `total_price` | float | Total line price |
---
## Orders
Completed purchase orders.
### Output Schema
| Column | Type | Description |
|--------|------|-------------|
| `order_id` | str | Unique order identifier |
| `user_id` | str | Customer identifier |
| `session_id` | str | Originating session |
| `order_time` | datetime | Order timestamp |
| `total_items` | int | Number of items |
| `subtotal` | float | Subtotal before tax/shipping |
| `discount` | float | Discount amount |
| `tax` | float | Tax amount |
| `shipping` | float | Shipping cost |
| `total` | float | Final order total |
| `payment_method` | str | Payment method (credit_card, paypal, apple_pay, etc.) |
| `status` | str | Order status (completed, processing, shipped) |
---
## Customers with RFM Metrics
Customer records include RFM (Recency, Frequency, Monetary) segmentation.
### Output Schema
| Column | Type | Description |
|--------|------|-------------|
| `customer_id` | str | Unique customer identifier |
| `email` | str | Customer email |
| `first_order_date` | date | First purchase date |
| `last_order_date` | date | Most recent purchase date |
| `total_orders` | int | Lifetime order count |
| `total_spent` | float | Lifetime spend |
| `avg_order_value` | float | Average order value |
| `rfm_recency` | int | Recency score (1-5) |
| `rfm_frequency` | int | Frequency score (1-5) |
| `rfm_monetary` | float | Monetary value |
| `rfm_score` | str | Combined RFM score (e.g., "544") |
| `rfm_segment` | str | Customer segment label |
### RFM Segments
| Segment | Description |
|---------|-------------|
| Champions | High recency, frequency, and monetary |
| Loyal Customers | High frequency and monetary |
| Potential Loyalists | Recent customers with medium frequency |
| New Customers | Very recent, low frequency |
| At Risk | Previously good customers, declining |
| Need Attention | Below average across metrics |
| Hibernating | Low activity, long time since purchase |
| Lost | No recent activity, low value |
---
## Configuration
Use `EcommerceConfig` for detailed control:
```python
from superstore import ecommerce_data, EcommerceConfig
config = EcommerceConfig(
sessions=10000, # Number of sessions
customers=2000, # Number of unique customers
days=30, # Time span in days
seed=42, # Reproducibility
)
data = ecommerce_data(config=config.model_dump())
```
### Session Configuration
Control user session behavior:
```python
config = EcommerceConfig(
sessions=5000,
session={
"avg_pages_per_session": 5.0,
"cart_add_probability": 0.15,
"checkout_start_probability": 0.40,
"purchase_completion_probability": 0.65,
"avg_session_duration_seconds": 300,
"enable_bounces": True,
"bounce_rate": 0.35,
}
)
```
| Parameter | Default | Description |
|-----------|---------|-------------|
| `avg_pages_per_session` | `5.0` | Average pages viewed |
| `cart_add_probability` | `0.15` | P(add to cart \| view product) |
| `checkout_start_probability` | `0.40` | P(start checkout \| view cart) |
| `purchase_completion_probability` | `0.65` | P(purchase \| checkout start) |
| `avg_session_duration_seconds` | `300` | Average session length |
| `enable_bounces` | `True` | Enable single-page bounces |
| `bounce_rate` | `0.35` | Bounce probability |
### Cart Configuration
Configure cart behavior and abandonment:
```python
config = EcommerceConfig(
cart={
"avg_items_per_cart": 2.5,
"remove_probability": 0.10,
"quantity_update_probability": 0.05,
"max_items": 20,
"enable_abandonment": True,
"abandonment_rate": 0.70,
}
)
```
| Parameter | Default | Description |
|-----------|---------|-------------|
| `avg_items_per_cart` | `2.5` | Average items added |
| `remove_probability` | `0.10` | P(remove item) |
| `quantity_update_probability` | `0.05` | P(update quantity) |
| `max_items` | `20` | Maximum cart size |
| `enable_abandonment` | `True` | Enable cart abandonment |
| `abandonment_rate` | `0.70` | Cart abandonment rate |
### Catalog Configuration
Configure the product catalog:
```python
config = EcommerceConfig(
catalog={
"num_products": 500,
"min_price": 5.0,
"max_price": 1000.0,
"lognormal_prices": True,
"categories": ["Electronics", "Clothing", "Home", "Sports"],
}
)
```
| Parameter | Default | Description |
|-----------|---------|-------------|
| `num_products` | `500` | Number of products |
| `min_price` | `5.0` | Minimum product price |
| `max_price` | `1000.0` | Maximum product price |
| `lognormal_prices` | `True` | Use log-normal price distribution |
| `categories` | `[...]` | Product categories |
### RFM Configuration
Configure RFM analysis parameters:
```python
config = EcommerceConfig(
rfm={
"enable": True,
"recency_window_days": 365,
"num_buckets": 5,
"pareto_shape": 1.5,
}
)
```
| Parameter | Default | Description |
|-----------|---------|-------------|
| `enable` | `True` | Calculate RFM metrics |
| `recency_window_days` | `365` | Recency lookback period |
| `num_buckets` | `5` | Number of RFM score buckets |
| `pareto_shape` | `1.5` | Shape parameter for 80/20 distribution |
### Funnel Configuration
Configure conversion funnel tracking:
```python
config = EcommerceConfig(
funnel={
"enable": True,
"stages": ["visit", "view_product", "add_to_cart", "checkout", "purchase"],
"time_of_day_effects": True,
"day_of_week_effects": True,
}
)
```
---
## Complete Example
```python
from superstore import ecommerce_data, EcommerceConfig
config = EcommerceConfig(
sessions=20000,
customers=5000,
days=90,
seed=42,
session={
"avg_pages_per_session": 6.0,
"bounce_rate": 0.30,
"purchase_completion_probability": 0.70,
},
cart={
"abandonment_rate": 0.65,
"avg_items_per_cart": 3.0,
},
catalog={
"num_products": 1000,
"categories": ["Electronics", "Fashion", "Home", "Beauty", "Sports"],
},
rfm={
"num_buckets": 5,
"pareto_shape": 1.8,
},
)
data = ecommerce_data(config=config.model_dump())
# Analyze conversion rates
sessions = data["sessions"]
conversion_rate = sessions["converted"].mean()
print(f"Conversion rate: {conversion_rate:.2%}")
# Segment customers by RFM
customers = data["customers"]
segment_counts = customers["rfm_segment"].value_counts()
print(segment_counts)
```
---
## API Reference
See the full API documentation:
- [ecommerce_sessions()](api.md)
- [ecommerce_products()](api.md)
- [ecommerce_data()](api.md)
- [EcommerceConfig](api.md)