Retail Data Generation¶
Generate synthetic retail transaction and employee data with realistic business patterns.
Overview¶
The retail generators create data suitable for:
- Sales analytics dashboards
- Business intelligence demos
- Retail forecasting models
- Customer segmentation analysis
- HR analytics and demographics
Superstore Transactions¶
The superstore() function generates retail transaction records with realistic correlations between sales, quantity, discount, and profit.
Basic Usage¶
from superstore import superstore
# Generate 1000 transactions as a pandas DataFrame
df = superstore(count=1000)
# Generate as polars or dict
df = superstore(count=1000, output="polars")
data = superstore(count=1000, output="dict")
Output Schema¶
| Column | Type | Description |
|---|---|---|
| | int | Unique row identifier |
| | str | Order identifier (format: CA-2024-XXXXXX) |
| | date | Date of order |
| | date | Date of shipment |
| | str | Shipping method (Standard, Express, etc.) |
| | str | Customer identifier |
| | str | Customer full name |
| | str | Customer segment (Consumer, Corporate, Home Office) |
| | str | Delivery city |
| | str | Delivery state |
| | str | Postal code |
| | str | Region (East, West, Central, South) |
| | str | Product identifier |
| | str | Product category |
| | str | Product sub-category |
| | str | Product name |
| | float | Transaction sales amount |
| | int | Quantity ordered |
| | float | Discount applied (0.0 - 0.5) |
| | float | Transaction profit |
Configuration¶
Use SuperstoreConfig for fine-grained control over data generation:
from superstore import superstore, SuperstoreConfig
config = SuperstoreConfig(
    count=10000,
    seed=42,  # Reproducible output
    # Correlation settings
    sales_quantity_correlation=0.8,  # Higher quantities = higher sales
    sales_profit_correlation=0.9,  # Higher sales = higher profit
    discount_profit_correlation=-0.6,  # Higher discounts = lower profit
    # Price formatting
    enable_price_points=True,  # Round to $X.99 values
)
df = superstore(config=config)
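To build intuition for what a correlation setting such as sales_quantity_correlation=0.8 means, here is a minimal standard-library sketch that draws pairs of values with a chosen target correlation. It is not how superstore generates data internally; the helper names correlated_pairs and pearson are hypothetical and exist only for this illustration.

```python
import math
import random


def correlated_pairs(n, rho, seed=42):
    """Return n (x, y) pairs of standard normals with target correlation rho.

    Hypothetical helper for illustration; not part of the superstore API.
    """
    rng = random.Random(seed)  # fixed seed gives reproducible output
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)  # noise independent of x
        # Mixing x with independent noise yields correlation rho with x
        y = rho * x + math.sqrt(1.0 - rho * rho) * z
        pairs.append((x, y))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


pairs = correlated_pairs(20_000, rho=0.8)
r = pearson(pairs)
print(f"sample correlation: {r:.3f}")
```

The sample correlation lands close to the target but not exactly on it, which is also what you should expect when checking generated data against the configured values.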
Seasonality Configuration¶
Model seasonal sales patterns:
config = SuperstoreConfig(
    count=10000,
    seasonality={
        "enable": True,
        "q4_multiplier": 1.8,  # 80% more sales in Q4 (holidays)
        "summer_multiplier": 0.85,  # 15% fewer sales in summer
        "back_to_school_multiplier": 1.3,  # 30% more in Aug/Sep
    }
)
| Parameter | Default | Description |
|---|---|---|
| enable | | Enable seasonal effects |
| q4_multiplier | | Q4 (holiday) sales multiplier |
| summer_multiplier | | Summer sales multiplier |
| back_to_school_multiplier | | August/September multiplier |
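The multipliers in the configuration above can be read as scaling factors on a baseline sales level. The sketch below shows one plausible month-to-multiplier mapping. The function seasonal_multiplier is hypothetical, and treating June and July as "summer" is an assumption; the library's actual month boundaries and precedence rules are not documented here.

```python
# Values copied from the configuration example above.
SEASONALITY = {
    "enable": True,
    "q4_multiplier": 1.8,
    "summer_multiplier": 0.85,
    "back_to_school_multiplier": 1.3,
}


def seasonal_multiplier(month, cfg=SEASONALITY):
    """Return a sales multiplier for a calendar month (1-12).

    Hypothetical mapping for illustration only.
    """
    if not cfg["enable"]:
        return 1.0
    if month in (10, 11, 12):  # Q4 (holidays)
        return cfg["q4_multiplier"]
    if month in (8, 9):  # August/September
        return cfg["back_to_school_multiplier"]
    if month in (6, 7):  # assumed summer months
        return cfg["summer_multiplier"]
    return 1.0


baseline = 1000.0  # arbitrary baseline monthly sales
monthly = {m: baseline * seasonal_multiplier(m) for m in range(1, 13)}
for month, sales in monthly.items():
    print(month, round(sales, 2))
```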
Promotional Configuration¶
Model promotion and discount effects:
config = SuperstoreConfig(
    count=10000,
    promotions={
        "enable": True,
        "discount_quantity_correlation": 0.6,  # Discounts increase quantity
        "price_elasticity": -1.0,  # Price sensitivity
    }
)
| Parameter | Default | Description |
|---|---|---|
| enable | | Enable promotional patterns |
| discount_quantity_correlation | | How much discounts increase quantity |
| price_elasticity | | Price elasticity of demand (-2 to 0) |
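Price elasticity of demand is conventionally the percentage change in quantity divided by the percentage change in price. The sketch below applies that textbook definition as a linear approximation, so an elasticity of -1.0 turns a 20% discount into roughly 20% more units. The function expected_quantity is a hypothetical helper, and the library may apply elasticity differently.

```python
def expected_quantity(base_quantity, discount, elasticity=-1.0):
    """Linear approximation: %change in quantity = elasticity * %change in price.

    Hypothetical helper for illustration; not part of the superstore API.
    """
    price_change = -discount  # a 20% discount is a -0.20 relative price change
    return base_quantity * (1.0 + elasticity * price_change)


for discount in (0.0, 0.1, 0.2, 0.5):
    print(discount, expected_quantity(10, discount, elasticity=-1.0))
```

With an elasticity of 0 the quantity does not react to discounts at all, and at -2 the reaction is twice as strong as in the example.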
Customer Configuration¶
Model customer behavior and segmentation:
config = SuperstoreConfig(
    count=10000,
    customers={
        "enable_cohorts": True,
        "repeat_customer_rate": 0.75,  # 75% repeat customers
        "vip_segment_rate": 0.15,  # 15% VIP customers
        "vip_order_multiplier": 2.5,  # VIPs spend 2.5x more
    }
)
| Parameter | Default | Description |
|---|---|---|
| enable_cohorts | | Enable customer cohort modeling |
| repeat_customer_rate | | Fraction of repeat customers |
| vip_segment_rate | | Fraction of VIP customers |
| vip_order_multiplier | | VIP order value multiplier |
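As a rough sanity check on the VIP settings, the arithmetic below estimates how much the average order value rises when 15% of orders come from VIPs who spend 2.5x as much. It assumes VIP and regular customers place orders at the same rate, which the library does not necessarily guarantee; expected_order_multiplier is a hypothetical helper.

```python
def expected_order_multiplier(vip_segment_rate, vip_order_multiplier):
    """Average order-value multiplier across regular (1.0x) and VIP orders.

    Assumes VIPs and regular customers order equally often.
    """
    regular_share = 1.0 - vip_segment_rate
    return regular_share * 1.0 + vip_segment_rate * vip_order_multiplier


uplift = expected_order_multiplier(0.15, 2.5)  # 0.85 * 1.0 + 0.15 * 2.5
vip_revenue_share = (0.15 * 2.5) / uplift      # fraction of revenue from VIPs
print(f"average order multiplier: {uplift:.3f}")
print(f"VIP share of revenue: {vip_revenue_share:.1%}")
```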
Large Dataset Generation¶
For datasets too large to fit in memory, use streaming generation, which yields one chunk at a time. To generate data faster, use parallel generation:
from superstore import superstoreStream, superstoreParallel
# Streaming: process chunks one at a time
for chunk in superstoreStream(count=10_000_000, chunk_size=100_000):
    process_and_save(chunk)  # placeholder for your own handler
# Parallel: use all CPU cores for faster generation
df = superstoreParallel(count=1_000_000)
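The value of streaming is that aggregates can be computed while only one chunk is held in memory at a time. The sketch below demonstrates that pattern with a stand-in generator, fake_stream, so it runs without the library. The chunk type (a list of dicts) and the "sales" key are assumptions made for this example; the chunks yielded by superstoreStream may have a different type and different column names.

```python
import random


def fake_stream(count, chunk_size, seed=0):
    """Stand-in for a chunked generator: yields lists of row dicts.

    Hypothetical; it only mimics the chunking behaviour for this sketch.
    """
    rng = random.Random(seed)
    remaining = count
    while remaining > 0:
        n = min(chunk_size, remaining)  # the last chunk may be smaller
        yield [{"sales": round(rng.uniform(1.0, 500.0), 2)} for _ in range(n)]
        remaining -= n


rows_seen = 0
chunks_seen = 0
total_sales = 0.0
for chunk in fake_stream(count=25_000, chunk_size=10_000):
    # Only the current chunk is in memory; running totals carry the state.
    rows_seen += len(chunk)
    chunks_seen += 1
    total_sales += sum(row["sales"] for row in chunk)

mean_sales = total_sales / rows_seen
print(rows_seen, chunks_seen, round(mean_sales, 2))
```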
Direct File Export¶
Export directly to files without loading into memory:
from superstore import superstoreToCsv, superstoreToParquet, superstoreArrowIpc
# Export to different formats
superstoreToCsv("sales.csv", count=1_000_000)
superstoreToParquet("sales.parquet", count=1_000_000)
superstoreArrowIpc("sales.arrow", count=1_000_000)
Employee Records¶
The employees() function generates realistic employee records with personal information.
Basic Usage¶
from superstore import employees
# Generate 500 employees
df = employees(count=500)
# Different output formats
df = employees(count=500, output="polars")
data = employees(count=500, output="dict")
Output Schema¶
| Column | Type | Description |
|---|---|---|
| | int | Unique employee identifier |
| | str | First name |
| | str | Last name |
| | str | Email address |
| | str | Phone number |
| | date | Date of hire |
| | str | Department name |
| | str | Job title |
| | float | Annual salary |
| | int | Manager’s employee ID |
| | str | Office location |
Large Dataset Generation¶
from superstore import employeesStream, employeesParallel
# Streaming generation
for chunk in employeesStream(count=100_000, chunk_size=10_000):
    process(chunk)  # placeholder for your own handler
# Parallel generation
df = employeesParallel(count=50_000)
Direct File Export¶
from superstore import employeesToCsv, employeesToParquet, employeesArrowIpc
employeesToCsv("employees.csv", count=10_000)
employeesToParquet("employees.parquet", count=10_000)
employeesArrowIpc("employees.arrow", count=10_000)
API Reference¶
See the full API documentation: