Retail Data Generation

Generate synthetic retail transaction and employee data with realistic business patterns.

Overview

The retail generators create data suitable for:

Sales analytics dashboards
Business intelligence demos
Retail forecasting models
Customer segmentation analysis
HR analytics and demographics

Superstore Transactions

The superstore() function generates retail transaction records with realistic correlations between sales, quantity, discount, and profit.

Basic Usage

from superstore import superstore

# Generate 1000 transactions as a pandas DataFrame
df = superstore(count=1000)

# Generate as a polars DataFrame or a dict
df = superstore(count=1000, output="polars")
data = superstore(count=1000, output="dict")

Output Schema

Column	Type	Description
`row_id`	int	Unique row identifier
`order_id`	str	Order identifier (format: CA-2024-XXXXXX)
`order_date`	date	Date of order
`ship_date`	date	Date of shipment
`ship_mode`	str	Shipping method (Standard, Express, etc.)
`customer_id`	str	Customer identifier
`customer_name`	str	Customer full name
`segment`	str	Customer segment (Consumer, Corporate, Home Office)
`city`	str	Delivery city
`state`	str	Delivery state
`postal_code`	str	Postal code
`region`	str	Region (East, West, Central, South)
`product_id`	str	Product identifier
`category`	str	Product category
`sub_category`	str	Product sub-category
`product_name`	str	Product name
`sales`	float	Transaction sales amount
`quantity`	int	Quantity ordered
`discount`	float	Discount applied (0.0 - 0.5)
`profit`	float	Transaction profit
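
As a rough illustration of how the schema above might be checked downstream, the sketch below validates one hand-written record against a subset of the documented column types. The `check_row` helper and the sample record are hypothetical and do not come from the library. The exact shape of the `dict` output is not specified here, so a single row dictionary is assumed, and the rule that `ship_date` should not precede `order_date` is likewise an assumption.

```python
from datetime import date

# Subset of the documented columns mapped to the Python types listed in the table.
EXPECTED_TYPES = {
    "row_id": int,
    "order_id": str,
    "order_date": date,
    "ship_date": date,
    "segment": str,
    "sales": float,
    "quantity": int,
    "discount": float,
    "profit": float,
}


def check_row(row):
    """Return a list of human-readable problems found in one row (empty if none)."""
    problems = []
    for column, expected in EXPECTED_TYPES.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected):
            problems.append(f"{column}: expected {expected.__name__}")
    # Discount range taken from the column description.
    if isinstance(row.get("discount"), float) and not 0.0 <= row["discount"] <= 0.5:
        problems.append("discount outside 0.0 to 0.5")
    # Assumed rule: an order cannot ship before it is placed.
    if isinstance(row.get("order_date"), date) and isinstance(row.get("ship_date"), date):
        if row["ship_date"] < row["order_date"]:
            problems.append("ship_date precedes order_date")
    return problems


# Hand-written example row (hypothetical values, not library output).
sample_row = {
    "row_id": 1,
    "order_id": "CA-2024-000001",
    "order_date": date(2024, 3, 4),
    "ship_date": date(2024, 3, 7),
    "segment": "Consumer",
    "sales": 129.99,
    "quantity": 3,
    "discount": 0.2,
    "profit": 18.5,
}
print(check_row(sample_row))
```

A check like this could run on a small sample of generated rows before they feed a dashboard or model.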

Configuration

Use SuperstoreConfig for fine-grained control over data generation:

from superstore import superstore, SuperstoreConfig

config = SuperstoreConfig(
    count=10000,
    seed=42,  # Reproducible output

    # Correlation settings
    sales_quantity_correlation=0.8,   # Higher quantities = higher sales
    sales_profit_correlation=0.9,     # Higher sales = higher profit
    discount_profit_correlation=-0.6, # Higher discounts = lower profit

    # Price formatting
    enable_price_points=True,  # Round to $X.99 values
)
df = superstore(config=config)
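
To make the correlation settings above more concrete, here is a minimal, library-independent sketch of what a target correlation of 0.8 between two quantities means. It does not show how `superstore` produces correlated columns internally; the helper names `correlated_pairs` and `pearson` are made up for illustration, and the mixing formula is the standard construction for two correlated normal variables.

```python
import math
import random


def correlated_pairs(n, rho, seed):
    """Draw n (x, y) pairs of standard normals whose correlation is about rho."""
    rng = random.Random(seed)  # seeded generator, so the output is reproducible
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Mixing x with independent noise gives corr(x, y) = rho in expectation.
        y = rho * x + math.sqrt(1.0 - rho * rho) * noise
        pairs.append((x, y))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


pairs = correlated_pairs(n=10_000, rho=0.8, seed=42)
print(round(pearson(pairs), 2))  # close to the requested 0.8
```

Measuring the sample correlation in the same way on generated `sales` and `quantity` columns is one possible check that a configured value took effect.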

Seasonality Configuration

Model seasonal sales patterns:

config = SuperstoreConfig(
    count=10000,
    seasonality={
        "enable": True,
        "q4_multiplier": 1.8,         # 80% more sales in Q4 (holidays)
        "summer_multiplier": 0.85,     # 15% fewer sales in summer
        "back_to_school_multiplier": 1.3,  # 30% more in Aug/Sep
    }
)

Parameter	Default	Description
`enable`	`True`	Enable seasonal effects
`q4_multiplier`	`1.5`	Q4 (holiday) sales multiplier
`summer_multiplier`	`0.9`	Summer sales multiplier
`back_to_school_multiplier`	`1.2`	August/September multiplier
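
The sketch below shows one way the multipliers in the table could map onto calendar months. It illustrates the idea only and is not the library's implementation: the `seasonal_multiplier` function is hypothetical, and the month boundaries (Q4 as October to December, summer as June and July, with August and September handled by the back-to-school multiplier) are assumptions.

```python
# Defaults copied from the parameter table above.
DEFAULT_SEASONALITY = {
    "enable": True,
    "q4_multiplier": 1.5,
    "summer_multiplier": 0.9,
    "back_to_school_multiplier": 1.2,
}


def seasonal_multiplier(month, cfg=DEFAULT_SEASONALITY):
    """Return the sales multiplier for a calendar month (1-12)."""
    if not cfg["enable"]:
        return 1.0
    if month in (10, 11, 12):  # Q4 / holiday season
        return cfg["q4_multiplier"]
    if month in (8, 9):        # back-to-school window named in the table
        return cfg["back_to_school_multiplier"]
    if month in (6, 7):        # assumed summer months (August goes to back-to-school)
        return cfg["summer_multiplier"]
    return 1.0                 # all other months are unaffected


for month in range(1, 13):
    print(month, seasonal_multiplier(month))
```

Under these assumptions, a baseline of 1,000 sales per month would become 1,500 in November and 900 in July with the default settings.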

Promotional Configuration

Model promotion and discount effects:

config = SuperstoreConfig(
    count=10000,
    promotions={
        "enable": True,
        "discount_quantity_correlation": 0.6,  # Discounts increase quantity
        "price_elasticity": -1.0,  # Elasticity of demand: a 1% price cut lifts quantity by about 1%
    }
)

Parameter	Default	Description
`enable`	`True`	Enable promotional patterns
`discount_quantity_correlation`	`0.5`	How much discounts increase quantity
`price_elasticity`	`-0.8`	Price elasticity of demand (-2 to 0)
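
As a worked example of what `price_elasticity` expresses, the sketch below applies the textbook linear approximation: the percentage change in quantity equals the elasticity times the percentage change in price. The `expected_quantity` function is hypothetical, and the library may combine elasticity with other effects in a different way.

```python
def expected_quantity(base_quantity, discount, price_elasticity=-0.8):
    """Expected quantity after a fractional discount, using a linear elasticity model.

    A discount of 0.2 is a -20% price change; quantity changes by
    price_elasticity * price_change.
    """
    price_change = -discount
    return base_quantity * (1.0 + price_elasticity * price_change)


# 20% discount at the default elasticity of -0.8: quantity rises by about 16%.
print(expected_quantity(base_quantity=10, discount=0.2))

# An elasticity of 0 means demand does not react to price at all.
print(expected_quantity(base_quantity=10, discount=0.2, price_elasticity=0.0))
```

Values closer to -2 make demand react more strongly to discounts, while values near 0 make it nearly insensitive.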

Customer Configuration

Model customer behavior and segmentation:

config = SuperstoreConfig(
    count=10000,
    customers={
        "enable_cohorts": True,
        "repeat_customer_rate": 0.75,  # 75% repeat customers
        "vip_segment_rate": 0.15,      # 15% VIP customers
        "vip_order_multiplier": 2.5,   # VIP orders are worth 2.5x a regular order
    }
)

Parameter	Default	Description
`enable_cohorts`	`True`	Enable customer cohort modeling
`repeat_customer_rate`	`0.7`	Fraction of repeat customers
`vip_segment_rate`	`0.1`	Fraction of VIP customers
`vip_order_multiplier`	`2.0`	VIP order value multiplier
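
To show how the VIP parameters interact, the following sketch estimates the average order value when a fraction of orders is scaled by the VIP multiplier. It is a simplified model and not the library's logic: the function names are hypothetical, the base order values are drawn from an arbitrary uniform range, and VIP status is assigned per order here, whereas the table describes `vip_segment_rate` as a fraction of customers.

```python
import random


def simulate_order_values(n, vip_segment_rate=0.1, vip_order_multiplier=2.0, seed=42):
    """Return n order values where a fraction of orders come from VIP customers."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        base = rng.uniform(20.0, 200.0)  # hypothetical base order value, mean 110
        is_vip = rng.random() < vip_segment_rate
        values.append(base * vip_order_multiplier if is_vip else base)
    return values


def expected_uplift(vip_segment_rate, vip_order_multiplier):
    """Average order value relative to a population with no VIPs."""
    return 1.0 + vip_segment_rate * (vip_order_multiplier - 1.0)


values = simulate_order_values(100_000)
print(sum(values) / len(values))          # simulated mean order value
print(110.0 * expected_uplift(0.1, 2.0))  # analytic mean: base mean times uplift
```

With the defaults, the analytic uplift is 1 + 0.1 × (2.0 − 1.0) = 1.1, so average order value rises by about 10%.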

Large Dataset Generation

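For datasets too large to fit in memory, use streaming generation, which yields one chunk at a time. To speed up generation of a dataset that does fit in memory, use parallel generation: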

from superstore import superstoreStream, superstoreParallel

# Streaming: process chunks one at a time
for chunk in superstoreStream(count=10_000_000, chunk_size=100_000):
    process_and_save(chunk)

# Parallel: use all CPU cores for faster generation
df = superstoreParallel(count=1_000_000)
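
The streaming pattern above pays off when each chunk is reduced to a small summary before the next one arrives. The sketch below demonstrates that pattern with a stand-in generator: `fake_stream` is hypothetical and yields lists of row dictionaries, whereas the chunk type returned by `superstoreStream` is not specified here.

```python
import random


def fake_stream(count, chunk_size, seed=0):
    """Hypothetical stand-in for a chunked generator: yields lists of row dicts."""
    rng = random.Random(seed)
    produced = 0
    while produced < count:
        size = min(chunk_size, count - produced)  # last chunk may be smaller
        yield [
            {"row_id": produced + i, "sales": round(rng.uniform(5.0, 500.0), 2)}
            for i in range(size)
        ]
        produced += size


# Keep only running totals; each chunk can be garbage-collected after use.
total_rows = 0
total_sales = 0.0
chunk_count = 0
for chunk in fake_stream(count=25_000, chunk_size=10_000):
    chunk_count += 1
    total_rows += len(chunk)
    total_sales += sum(row["sales"] for row in chunk)

print(chunk_count, total_rows, round(total_sales / total_rows, 2))
```

Peak memory is then bounded by `chunk_size` rather than by `count`, which is the point of streaming.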

Direct File Export

Write generated data straight to a file without first loading the full dataset into memory:

from superstore import superstoreToCsv, superstoreToParquet, superstoreArrowIpc

# Export to different formats
superstoreToCsv("sales.csv", count=1_000_000)
superstoreToParquet("sales.parquet", count=1_000_000)
superstoreArrowIpc("sales.arrow", count=1_000_000)

Employee Records

The employees() function generates synthetic employee records with personal details, job information, and a manager reference.

Basic Usage

from superstore import employees

# Generate 500 employees
df = employees(count=500)

# Different output formats
df = employees(count=500, output="polars")
data = employees(count=500, output="dict")

Output Schema

Column	Type	Description
`employee_id`	int	Unique employee identifier
`first_name`	str	First name
`last_name`	str	Last name
`email`	str	Email address
`phone`	str	Phone number
`hire_date`	date	Date of hire
`department`	str	Department name
`job_title`	str	Job title
`salary`	float	Annual salary
`manager_id`	int	Manager’s employee ID
`location`	str	Office location
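
Because `manager_id` refers back to `employee_id`, a common downstream check is that every manager reference resolves to a real employee. The sketch below runs that check on hand-written rows. The `find_orphan_managers` helper is hypothetical, and treating a `manager_id` of `None` as "no manager" is an assumption about how top-level employees are represented.

```python
def find_orphan_managers(rows):
    """Return employee_ids whose manager_id does not match any employee_id."""
    known_ids = {row["employee_id"] for row in rows}
    return [
        row["employee_id"]
        for row in rows
        # A manager_id of None is treated as "no manager" (an assumption).
        if row["manager_id"] is not None and row["manager_id"] not in known_ids
    ]


# Hand-written rows for illustration only, not library output.
rows = [
    {"employee_id": 1, "manager_id": None},
    {"employee_id": 2, "manager_id": 1},
    {"employee_id": 3, "manager_id": 2},
    {"employee_id": 4, "manager_id": 99},  # dangling reference
]
print(find_orphan_managers(rows))
```

An empty result means the reporting structure is self-contained and can be joined against itself safely.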

Large Dataset Generation

from superstore import employeesStream, employeesParallel

# Streaming generation
for chunk in employeesStream(count=100_000, chunk_size=10_000):
    process(chunk)

# Parallel generation
df = employeesParallel(count=50_000)

Direct File Export

from superstore import employeesToCsv, employeesToParquet, employeesArrowIpc

employeesToCsv("employees.csv", count=10_000)
employeesToParquet("employees.parquet", count=10_000)
employeesArrowIpc("employees.arrow", count=10_000)

API Reference

See the full API documentation.