Statistical Distributions
Sample from various probability distributions for statistical modeling and simulation.
Overview
The distribution functions provide:
Standard probability distributions
Mixture distributions
Noise injection utilities
Missing data simulation
Continuous Distributions
Uniform Distribution
Sample from a uniform distribution over [low, high):
from superstore import sampleUniform
# 1000 samples between 0 and 1
values = sampleUniform(n=1000)
# Custom range
values = sampleUniform(n=1000, low=10.0, high=50.0)
Normal (Gaussian) Distribution
Sample from a normal distribution:
from superstore import sampleNormal
# Standard normal (mean=0, std=1)
values = sampleNormal(n=1000)
# Custom mean and standard deviation
values = sampleNormal(n=1000, mean=100.0, std=15.0)
Log-Normal Distribution
Sample from a log-normal distribution (useful for prices, sizes, durations):
from superstore import sampleLogNormal
# Default parameters
values = sampleLogNormal(n=1000)
# Custom parameters (mean and sigma apply to log(values), i.e. the underlying normal)
values = sampleLogNormal(n=1000, mean=2.0, sigma=0.5)
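If `mean` and `sigma` parameterize the underlying normal (the usual convention, and the one assumed here), the median of the samples is exp(mean) and their mean is exp(mean + sigma^2 / 2). The sketch below checks this with `random.lognormvariate` from Python's standard library as a stand-in; it is an illustration, not superstore's implementation.

```python
import math
import random
import statistics

rng = random.Random(0)  # seeded so the run is reproducible
mu, sigma = 2.0, 0.5    # parameters of log(X), not of X itself

# Standard-library log-normal sampler, used here as a stand-in.
samples = [rng.lognormvariate(mu, sigma) for _ in range(100_000)]

theoretical_median = math.exp(mu)                 # exp(mu)
theoretical_mean = math.exp(mu + sigma**2 / 2)    # exp(mu + sigma^2 / 2)

print(f"median: {statistics.median(samples):.2f} (theory {theoretical_median:.2f})")
print(f"mean:   {statistics.fmean(samples):.2f} (theory {theoretical_mean:.2f})")
```

The mean sits above the median because the right tail is long, which is why log-normal data is often summarized by its median.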
Exponential Distribution
Sample from an exponential distribution (useful for waiting times):
from superstore import sampleExponential
# Rate parameter lambda=1.0
values = sampleExponential(n=1000, rate=1.0)
# Mean = 1/rate, so rate=0.1 gives mean=10
values = sampleExponential(n=1000, rate=0.1)
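One common way to draw exponential samples is inverse-transform sampling: if U is uniform on [0, 1), then -ln(1 - U) / rate is exponential with mean 1/rate. The helper below is a hypothetical standard-library sketch of that idea and makes no claim about how superstore generates its samples.

```python
import math
import random
import statistics

rng = random.Random(1)  # seeded for reproducibility


def exponential_inverse_cdf(n, rate):
    """Hypothetical helper: n exponential draws via the inverse CDF."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    # 1 - U lies in (0, 1], so the logarithm is always defined.
    return [-math.log(1.0 - rng.random()) / rate for _ in range(n)]


samples = exponential_inverse_cdf(100_000, rate=0.1)
sample_mean = statistics.fmean(samples)
print(f"sample mean: {sample_mean:.2f} (theory: {1 / 0.1:.2f})")
```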
Beta Distribution
Sample from a beta distribution (useful for probabilities, proportions):
from superstore import sampleBeta
# Uniform on [0,1]
values = sampleBeta(n=1000, alpha=1.0, beta=1.0)
# Skewed toward 1
values = sampleBeta(n=1000, alpha=5.0, beta=1.0)
# Bell-shaped around 0.5
values = sampleBeta(n=1000, alpha=5.0, beta=5.0)
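To predict where a given `alpha`/`beta` pair concentrates its mass, the closed-form mean and mode are handy. The helpers below are hypothetical and use only the standard formulas; `random.betavariate` from the standard library serves as a stand-in sampler to sanity-check the mean.

```python
import random
import statistics


def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta): alpha / (alpha + beta)."""
    return alpha / (alpha + beta)


def beta_mode(alpha, beta):
    """Mode of Beta(alpha, beta); this formula holds only for alpha > 1 and beta > 1."""
    if alpha <= 1 or beta <= 1:
        raise ValueError("mode formula requires alpha > 1 and beta > 1")
    return (alpha - 1) / (alpha + beta - 2)


rng = random.Random(2)  # seeded for reproducibility
samples = [rng.betavariate(5.0, 5.0) for _ in range(50_000)]

print(f"Beta(5, 5): mean={beta_mean(5.0, 5.0):.3f}, mode={beta_mode(5.0, 5.0):.3f}")
print(f"Beta(2, 5): mean={beta_mean(2.0, 5.0):.3f}, mode={beta_mode(2.0, 5.0):.3f}")
print(f"sample mean for Beta(5, 5): {statistics.fmean(samples):.3f}")
```

For symmetric parameters the mean and mode coincide at 0.5; for asymmetric ones they differ, with the mode lying further out on the side where the mass concentrates.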
Gamma Distribution
Sample from a gamma distribution (useful for waiting times, amounts):
from superstore import sampleGamma
values = sampleGamma(n=1000, shape=2.0, scale=1.0)
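The waiting-time interpretation can be made concrete: for an integer `shape`, a gamma variable is the sum of `shape` independent exponential waiting times, each with mean `scale`. The sketch below compares the two constructions using the standard library (`random.gammavariate` takes shape and scale in that order); it is illustrative and independent of superstore's internals.

```python
import random
import statistics

rng = random.Random(3)  # seeded for reproducibility
n = 50_000
shape, scale = 2, 1.0

# Direct gamma draws from the standard library.
gamma_draws = [rng.gammavariate(shape, scale) for _ in range(n)]

# Sum of `shape` exponential waiting times, each with mean `scale`
# (expovariate takes the rate, i.e. 1 / mean).
summed_waits = [
    sum(rng.expovariate(1.0 / scale) for _ in range(shape)) for _ in range(n)
]

gamma_mean = statistics.fmean(gamma_draws)
summed_mean = statistics.fmean(summed_waits)
print(f"gamma mean:        {gamma_mean:.3f}")
print(f"summed-waits mean: {summed_mean:.3f}")
print(f"theory:            {shape * scale:.3f}")
```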
Weibull Distribution
Sample from a Weibull distribution (useful for failure times, survival analysis):
from superstore import sampleWeibull
values = sampleWeibull(n=1000, shape=1.5, scale=1.0)
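In failure-time modeling the `shape` parameter controls how the failure rate evolves: below 1 it falls over time (early failures), at 1 it is constant (the exponential case), and above 1 it rises (wear-out). The hypothetical helper below evaluates the standard Weibull hazard function to show this; it does not call superstore.

```python
def weibull_hazard(t, shape, scale=1.0):
    """Instantaneous failure rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape, and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


times = (0.5, 1.0, 2.0)
for shape in (0.5, 1.0, 1.5):
    rates = [round(weibull_hazard(t, shape), 3) for t in times]
    print(f"shape={shape}: hazard at t={times} -> {rates}")
```

With `shape=1.5`, as in the call above, the hazard grows with time, which suits components that become more likely to fail as they age.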
Pareto Distribution
Sample from a Pareto distribution (useful for heavy-tailed phenomena):
from superstore import samplePareto
# alpha ≈ 1.16 gives the classic 80/20 split (the top 20% holds about 80% of the total)
values = samplePareto(n=1000, alpha=1.16, x_min=1.0)
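The link between `alpha` and the 80/20 rule follows from a standard result: for a Pareto distribution with alpha > 1, the top fraction p of the population holds a share p ** (1 - 1/alpha) of the total. The hypothetical helper below evaluates that formula; it is a worked calculation and does not depend on superstore.

```python
import math


def top_share(p, alpha):
    """Share of the total held by the top fraction p under Pareto(alpha), alpha > 1."""
    if alpha <= 1:
        raise ValueError("the total is infinite for alpha <= 1")
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return p ** (1.0 - 1.0 / alpha)


# The exact 80/20 exponent is log(5) / log(4), which rounds to 1.16.
exact_alpha = math.log(5) / math.log(4)

print(f"alpha=1.16:            top 20% holds {top_share(0.2, 1.16):.1%}")
print(f"alpha=log(5)/log(4):   top 20% holds {top_share(0.2, exact_alpha):.1%}")
print(f"alpha=2.0:             top 20% holds {top_share(0.2, 2.0):.1%}")
```

Larger `alpha` values thin the tail, so the top 20% holds a smaller share of the total.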
Discrete Distributions
Poisson Distribution
Sample from a Poisson distribution (useful for count data):
from superstore import samplePoisson
# Mean = 5 events
values = samplePoisson(n=1000, rate=5.0)
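A defining property of Poisson counts is that the mean and the variance both equal `rate`, which makes a quick sanity check for generated data. Python's standard library has no Poisson sampler, so the sketch below uses Knuth's multiplication method as a stand-in; the helper name is hypothetical and superstore may use a different algorithm.

```python
import math
import random
import statistics


def poisson_knuth(rate, rng):
    """One Poisson draw via Knuth's method; suitable for small rates only."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    # Count how many uniforms can be multiplied before the product
    # drops to or below exp(-rate).
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(5)  # seeded for reproducibility
counts = [poisson_knuth(5.0, rng) for _ in range(50_000)]

count_mean = statistics.fmean(counts)
count_var = statistics.pvariance(counts)
print(f"mean: {count_mean:.2f}, variance: {count_var:.2f} (both should be near 5)")
```

If real count data shows a variance well above its mean, it is overdispersed and a plain Poisson model is likely a poor fit.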
Categorical Distribution
Sample from a categorical distribution with specified probabilities:
from superstore import sampleCategorical
# Equal probabilities
categories = ["A", "B", "C", "D"]
values = sampleCategorical(n=1000, categories=categories)
# Custom probabilities
probs = [0.5, 0.3, 0.15, 0.05]
values = sampleCategorical(n=1000, categories=categories, probabilities=probs)
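When supplying custom probabilities it helps to confirm that they sum to 1 and that the observed frequencies land near the requested values. The sketch below does both with `random.choices` from the standard library as a stand-in sampler; whether superstore validates or normalizes the probabilities itself is not stated here, so the explicit check is an assumption about good practice.

```python
import math
import random
from collections import Counter

rng = random.Random(6)  # seeded for reproducibility
categories = ["A", "B", "C", "D"]
probs = [0.5, 0.3, 0.15, 0.05]

# Guard against typos in the probability list before sampling.
if len(probs) != len(categories) or not math.isclose(sum(probs), 1.0):
    raise ValueError("need one probability per category, summing to 1")

draws = rng.choices(categories, weights=probs, k=100_000)
tally = Counter(draws)
frequencies = {c: tally[c] / len(draws) for c in categories}

for category, target in zip(categories, probs):
    print(f"{category}: target {target:.2f}, observed {frequencies[category]:.3f}")
```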
Mixture Distributions
Gaussian Mixture
Sample from a mixture of Gaussian distributions:
from superstore import sampleMixture
# Two-component mixture
values = sampleMixture(
    n=1000,
    means=[0.0, 5.0],
    stds=[1.0, 0.5],
    weights=[0.7, 0.3],  # 70% from first, 30% from second
)
# Three-component mixture (bimodal with outliers)
values = sampleMixture(
    n=1000,
    means=[-2.0, 2.0, 10.0],
    stds=[0.5, 0.5, 0.3],
    weights=[0.45, 0.45, 0.10],
)
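Conceptually, a Gaussian mixture is sampled in two stages: first pick a component according to the weights, then draw from that component's normal distribution. The hypothetical helper below sketches this with the standard library; it mirrors the parameter names used above but is not superstore's implementation.

```python
import random
import statistics


def sample_gaussian_mixture(n, means, stds, weights, rng):
    """Hypothetical two-stage sampler: choose a component, then draw from it."""
    if not (len(means) == len(stds) == len(weights)):
        raise ValueError("means, stds, and weights must have the same length")
    # Stage 1: component index for each sample, chosen by weight.
    components = rng.choices(range(len(means)), weights=weights, k=n)
    # Stage 2: a normal draw from the chosen component.
    return [rng.gauss(means[i], stds[i]) for i in components]


rng = random.Random(7)  # seeded for reproducibility
draws = sample_gaussian_mixture(
    100_000, means=[0.0, 5.0], stds=[1.0, 0.5], weights=[0.7, 0.3], rng=rng
)

# The mixture mean is the weighted average of the component means.
expected_mean = 0.7 * 0.0 + 0.3 * 5.0
print(f"sample mean: {statistics.fmean(draws):.3f} (theory {expected_mean:.3f})")
```

The overall mean of 1.5 falls between the two modes, in a region with relatively little mass, so summary statistics alone can hide the bimodal shape.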
Noise & Missing Data
Gaussian Noise
Add Gaussian noise to existing data:
from superstore import addGaussianNoise
# Add noise with std=0.1
noisy_values = addGaussianNoise(values, std=0.1)
# Add proportional noise (10% of value)
noisy_values = addGaussianNoise(values, std=0.1, proportional=True)
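The difference between the two modes can be sketched as follows, under the assumption that proportional noise means the noise standard deviation is `std` times the magnitude of each value. That reading matches the "10% of value" comment above, but the helper is hypothetical and superstore's exact behavior may differ.

```python
import random
import statistics


def add_gaussian_noise(values, std, proportional=False, rng=None):
    """Hypothetical sketch: additive noise, or noise scaled by |value| (assumed)."""
    rng = rng or random.Random()
    if proportional:
        # Assumption: per-value noise std is std * |value|.
        return [v + rng.gauss(0.0, std * abs(v)) for v in values]
    return [v + rng.gauss(0.0, std) for v in values]


rng = random.Random(8)  # seeded for reproducibility
baseline = [100.0] * 20_000

additive = add_gaussian_noise(baseline, std=0.1, rng=rng)
proportional = add_gaussian_noise(baseline, std=0.1, proportional=True, rng=rng)

print(f"additive spread:     {statistics.pstdev(additive):.3f}")      # near 0.1
print(f"proportional spread: {statistics.pstdev(proportional):.3f}")  # near 10
```

For values around 100, additive noise with `std=0.1` is barely visible, while proportional noise at the same setting perturbs each value by roughly 10 units.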
Missing Data
Randomly replace values with NaN to simulate missing data:
from superstore import applyMissing
# 5% missing values
data_with_missing = applyMissing(values, missing_rate=0.05)
# Per-column missing rates (df is an existing DataFrame)
df = applyMissing(df, missing_rate={"col1": 0.1, "col2": 0.05})
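The simplest missingness model, missing completely at random, masks each value independently with probability `missing_rate`. The hypothetical helper below sketches that model on a plain list with the standard library; it is an assumption that superstore follows the same scheme.

```python
import math
import random


def apply_missing(values, missing_rate, rng):
    """Hypothetical sketch: replace each value with NaN with probability missing_rate."""
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be between 0 and 1")
    return [math.nan if rng.random() < missing_rate else v for v in values]


rng = random.Random(9)  # seeded for reproducibility
original = [float(i) for i in range(100_000)]
masked = apply_missing(original, missing_rate=0.05, rng=rng)

observed_rate = sum(math.isnan(v) for v in masked) / len(masked)
print(f"observed missing rate: {observed_rate:.3f}")
```

Because each value is masked independently, the realized rate fluctuates around the target, and the fluctuation is larger for short columns.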
Examples
Quality Score Distribution
from superstore import sampleBeta
# Quality scores skewed toward high values
scores = sampleBeta(n=10000, alpha=8.0, beta=2.0)
# Mean = 0.8, mode = 0.875; most values fall between 0.6 and 1.0
Customer Lifetime Value
from superstore import sampleLogNormal
# CLV with long tail
clv = sampleLogNormal(n=10000, mean=5.0, sigma=1.0)
# Median = exp(5) ≈ 148; the long right tail pulls the mean up to exp(5.5) ≈ 245
Service Response Times
from superstore import sampleMixture
# Bimodal: fast responses + slow (queued) responses
response_ms = sampleMixture(
    n=10000,
    means=[50.0, 500.0],
    stds=[20.0, 100.0],
    weights=[0.9, 0.1],
)
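One caveat with Gaussian components is that they can produce negative values, which are not meaningful for response times. The sketch below estimates how often that would happen for these parameters, using `statistics.NormalDist` from the standard library; it is a back-of-the-envelope check and does not call superstore.

```python
from statistics import NormalDist

means = [50.0, 500.0]
stds = [20.0, 100.0]
weights = [0.9, 0.1]

# P(X < 0) for a mixture is the weighted sum of each component's P(X < 0).
p_negative = sum(
    w * NormalDist(mu=m, sigma=s).cdf(0.0)
    for m, s, w in zip(means, stds, weights)
)
print(f"expected share of negative draws: {p_negative:.2%}")
```

Roughly 0.6% of draws fall below zero here, almost all from the fast component. If that matters downstream, clipping with `max(0.0, x)` or switching to a strictly positive distribution such as the log-normal are two possible remedies.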
Event Counts with Noise
from superstore import samplePoisson, addGaussianNoise
# Poisson counts with measurement noise
counts = samplePoisson(n=1000, rate=20.0)
noisy_counts = addGaussianNoise(counts, std=2.0)
API Reference
See the full API Reference for all distribution functions.