I built this because I kept running into the same bottleneck on data projects: staging environments are always either empty or dangerous. Production dumps put you at constant risk of PII leaks, but the Python tools I tried for generating meaningful test data (Faker, SDV) hit OOM errors or took hours once I simulated anything complex.
I spent the last week writing replica_db to solve this. It's a CLI tool written in Rust that reverse-engineers your existing Postgres schema and foreign-key topology, then builds a "statistical genome" of your data using reservoir sampling.
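(For anyone who hasn't seen it: reservoir sampling keeps a fixed-size uniform sample of a stream without knowing its length up front, so the scan stays O(k) in memory no matter how big the table is. Here's a minimal sketch of the classic Algorithm R to show the idea; it's illustrative, not the exact code in the repo:)

```rust
use rand::Rng;

/// Algorithm R: keep a uniform random sample of `k` items from a stream
/// of unknown length, using O(k) memory. (Illustrative sketch only;
/// the sampler in replica_db may differ in the details.)
fn reservoir_sample<T, I: IntoIterator<Item = T>>(stream: I, k: usize) -> Vec<T> {
    let mut rng = rand::thread_rng();
    let mut reservoir: Vec<T> = Vec::with_capacity(k);

    for (i, item) in stream.into_iter().enumerate() {
        if i < k {
            // Fill the reservoir with the first k items.
            reservoir.push(item);
        } else {
            // Replace an existing entry with probability k / (i + 1),
            // which keeps every item equally likely to survive.
            let j = rng.gen_range(0..=i);
            if j < k {
                reservoir[j] = item;
            }
        }
    }
    reservoir
}

fn main() {
    // Sample 5 "rows" from a stream of a million without buffering it.
    let sample = reservoir_sample(0..1_000_000, 5);
    println!("{:?}", sample);
}
```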
The cool part (for me) was implementing Gaussian copulas to handle correlations. Most generators treat columns independently, which produces uncorrelated, implausible data (like a user aged 5 earning $200k). I used nalgebra to compute the covariance matrix of the numeric columns, so the engine actually learns the shape of the data.
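(The core trick is that if cov = L·Lᵀ and z is a vector of independent standard normals, then L·z has exactly that covariance; the copula step then maps those correlated normals back through each column's marginal distribution. A toy sketch of the correlated-sampling half using nalgebra; the function name and covariance values here are mine, not the repo's:)

```rust
use nalgebra::{DMatrix, DVector};
use rand::Rng;
use rand_distr::StandardNormal;

/// Draw one correlated sample from N(0, cov) via Cholesky: with
/// cov = L * L^T and z a vector of independent standard normals,
/// L * z has covariance `cov`. (Sketch only; the full copula also
/// pushes the result through each column's empirical marginal.)
fn correlated_normals(cov: &DMatrix<f64>, rng: &mut impl Rng) -> DVector<f64> {
    let chol = cov
        .clone()
        .cholesky()
        .expect("covariance matrix must be positive definite");
    let z = DVector::from_fn(cov.nrows(), |_, _| rng.sample(StandardNormal));
    chol.l() * z
}

fn main() {
    // Toy covariance "learned" from two columns, e.g. age and income:
    // the positive off-diagonal means the columns move together.
    let cov = DMatrix::from_row_slice(2, 2, &[1.0, 0.8, 0.8, 1.0]);
    let mut rng = rand::thread_rng();
    for _ in 0..3 {
        let s = correlated_normals(&cov, &mut rng);
        println!("z1 = {:+.3}, z2 = {:+.3}", s[0], s[1]);
    }
}
```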
I tested this on the Uber NYC trips dataset, and it automatically detected the correlation between latitude and longitude. When I generated 5 million fake trips, they respected the actual geography of NYC instead of placing points randomly in the ocean.
Benchmarks on my laptop have been encouraging: scanning 564k real-world rows takes about 2.2 seconds, and generating 10 million synthetic rows takes under five minutes (~49k rows/sec) with constant memory usage. The output streams standard COPY format directly to stdout, so you can pipe it straight into psql.
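(The streaming side is conceptually just text-format COPY written through a buffered stdout: tab-separated columns, newline-terminated rows, a `\.` terminator. A simplified sketch of that shape below; the table name, columns, and placeholder values are made up, and the real tool obviously emits rows from the fitted model:)

```rust
use std::io::{self, BufWriter, Write};

/// Stream rows as Postgres text-format COPY. Writing through a BufWriter
/// keeps memory flat no matter how many rows are generated.
/// (Simplified sketch; `trips` and its columns are hypothetical.)
fn main() -> io::Result<()> {
    let stdout = io::stdout();
    let mut out = BufWriter::new(stdout.lock());

    writeln!(out, "COPY trips (trip_id, lat, lon) FROM stdin;")?;
    for i in 0..10_000_000u64 {
        // Placeholder values; imagine these come from the copula sampler.
        let (lat, lon) = (40.7128 + (i % 100) as f64 * 1e-4, -74.0060);
        writeln!(out, "{}\t{:.6}\t{:.6}", i, lat, lon)?;
    }
    writeln!(out, "\\.")?; // end-of-data marker for COPY
    out.flush()
}
```

Piping that straight into `psql` loads the rows with no intermediate file, which is what keeps the memory usage constant.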
The repo isn't licensed yet. This is my first project involving this level of systems programming and statistical math in Rust, so I'd appreciate any feedback on the implementation or the math strategy!
https://github.com/Pragadeesh-19/replica_db