One of our customers has most of their traffic wrapped in the shrouds of SSL encryption, which by design makes external observability impossible. They came to us asking for a way to replay actual production traffic against a staging environment whose data is a snapshot of prod.
The idea is:
- take a DB snapshot from prod
- record prod traffic
- much later, restore the prod DB snapshot in staging
- replay traffic
While this isn't perfect in many respects, it's simple and ought to serve various purposes, such as evaluating how much impact a change would actually have had in production on a given day. That change may be a tuning, a configuration change, or a change in hardware (CPU upgrade, memory increase, a different IO subsystem such as a move to SSD or NAS, etc.), or you may want to explore a migration scenario with actual traffic. For example, you can imagine replaying the traffic of day x WHILE UPGRADING THE INFRASTRUCTURE and comparing the result with the production performance of that day, to make sure that the upgrade scenario would not impact the client applications, which in this case form a healthy 4,000-strong ecosystem.
So we did. We wrote a logger capable of dumping live traffic to a dedicated log, taking extreme care to reduce the overhead. On replay, the tool can process traffic in "actual time" (at the same pace it happened in prod when recorded), and it even honors the same number of connections used by clients. This ensures that the reproduced conditions are as close as possible to what happened in real life at the time of the recording.
In a later post, I will describe how that works in more detail.
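Until then, here is a rough sketch of the "actual time" pacing idea. To be clear, this is not the tool's code: it is a toy illustration in Python, and the record format (offset in seconds since the start of the recording, connection id, payload), the `replay` function and the `send` callback are all assumptions made up for the example. Each recorded connection gets its own thread, and each event is held back until its recorded offset has elapsed.

```python
import threading
import time
from collections import defaultdict


def replay(records, send, speed=1.0):
    """Replay recorded events at their original pace.

    records: iterable of (offset_seconds, conn_id, payload) tuples, where
             offset_seconds is relative to the start of the recording
             (hypothetical format, chosen for this sketch).
    send:    callable(conn_id, payload) that delivers one event
             (hypothetical stand-in for writing to a real connection).
    speed:   1.0 replays in "actual time"; 2.0 would replay twice as fast.
    """
    # Group events per recorded connection so that the replay uses
    # the same number of connections as the original clients did.
    by_conn = defaultdict(list)
    for offset, conn_id, payload in records:
        by_conn[conn_id].append((offset, payload))

    start = time.monotonic()

    def worker(conn_id, events):
        # Events of one connection are replayed in recorded order.
        for offset, payload in sorted(events, key=lambda e: e[0]):
            # Wait until the event's recorded offset has elapsed.
            delay = start + offset / speed - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            send(conn_id, payload)

    threads = [
        threading.Thread(target=worker, args=(conn_id, events))
        for conn_id, events in by_conn.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Scheduling against a single start time, rather than sleeping for the gap between consecutive events, keeps small delays from accumulating over a long replay. One thread per connection is the simplest way to show the idea; with thousands of client connections an event loop would likely be a better fit.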