Joe Auty

How To Deliver Timely, Safe Production Data to Your Engineering Teams

This guide showcases a brand new tool (free for developers to use) called Redactics, a managed appliance for powering a growing number of data management workflows using your own infrastructure. The team would love to hear from you if you appreciate what they are building. This guide is one of a series of recipe-based instructions for solving specific problems and accomplishing specific tasks with Redactics, and its focus is on powering datasets for demo environments with the following features/characteristics:

  • A growing collection of “data feeds”: data delivery and replication options that include uploading data to an Amazon S3 bucket used to back a new or existing data lake, or a “digital twin” database/data warehouse that clones new production data (minus sensitive information and PII). A quick way to inspect the S3 feed’s output is sketched after this list. If you do not see a data feed appropriate for your use case, please let us know and we will tell you when it will be added based on our product roadmap.
  • Options to delta-update this data so that freshness approaches real-time, allowing you to schedule updates every few minutes.
  • Delivery of the original raw data from a selection of tables, with the only transformations being the handling of sensitive information, including PII. Your teams can build out their own views and reports on top of this copy; the focus of this workflow is simply to provide your stakeholders with a clean, constantly updated copy of the original data.
  • The replicated tables include a column called source_primary_key, which preserves the original primary key from the production database should data ever need to be reconciled against its master records.
  • Options to replicate data from a certain time period for specific tables (in addition to full-copy options) and to exclude tables that are not worth replicating, plus support for aggregating data from multiple input data sources within a single workflow.
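
For instance, if you choose the S3 data feed, you can check what has landed in your bucket after a run with a few lines of code. Below is a minimal sketch, assuming Python with boto3 installed and AWS credentials already configured; the bucket name and prefix are placeholders for illustration, not values Redactics prescribes:

```python
import boto3

# Placeholder bucket/prefix: substitute the bucket your S3 data feed writes to
BUCKET = "my-data-lake-bucket"
PREFIX = "redactics/"

s3 = boto3.client("s3")

# List the replicated objects so you can confirm the feed is delivering data
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        print(f"{obj['LastModified']}  {obj['Size']:>10}  {obj['Key']}")
```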

Step 1: Create Your Redactics Account

Have your engineer create your company’s Redactics account (or create it yourself and invite them to it). They will also need to create their first Redactics SMART Agent and workflow. Don’t worry about configuring the workflow just yet: the engineer simply needs to follow the instructions to install the SMART Agent with an empty workflow of type ERL (Extract, Redact, Load). You can give this workflow any name you like, e.g. “ML Experimentation”. They’ll also need to define the master database by clicking “Add Database” in the “Input Settings” section, which requires listing all of the tables you intend to use within this workflow (don’t worry, you can always come back and change these later). Once the SMART Agent has been installed it will report back to the Redactics Dashboard, and you’ll see a green checkmark in the SMART Agent section:

A successful SMART Agent installation (based on its heartbeat tracking)

With this step complete, if you later decide to update a working workflow, the SMART Agent will automatically recognize the changes without requiring a re-installation.
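
Since Step 1 asks you to list every table you intend to include in the workflow, it can help to enumerate them straight from the master database first. Here is a minimal sketch, assuming a PostgreSQL source and the psycopg2 driver; the connection settings are placeholders:

```python
import psycopg2

# Placeholder connection settings for the master (production or replica) database
conn = psycopg2.connect(
    host="prod-db.internal", port=5432,
    dbname="appdb", user="readonly", password="***"
)

with conn, conn.cursor() as cur:
    # List user tables so you can decide which ones to register in Input Settings
    cur.execute("""
        SELECT table_schema, table_name
        FROM information_schema.tables
        WHERE table_type = 'BASE TABLE'
          AND table_schema NOT IN ('pg_catalog', 'information_schema')
        ORDER BY table_schema, table_name
    """)
    for schema, table in cur.fetchall():
        print(f"{schema}.{table}")
```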

Step 2: Configure Your Redactics Workflow

Return to your workflow configuration. Your Input Settings should already be completed, but if you wish to aggregate data from multiple input sources you can define additional databases here. Some notes on this:

  • The SMART Agent requires network access to each database (a quick connectivity check is sketched after this list).
  • Please let us know if you require support for input sources other than the available options. It is not difficult for us to add support for additional input sources; we are prioritizing based on customer feedback!
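
Because the SMART Agent needs network access to each input source, a quick connectivity test run from wherever the agent is installed can save a failed first run. A minimal sketch, assuming Python is available there; the hostnames and ports are placeholders:

```python
import socket

# Placeholder host/port pairs for each input database the workflow will read from
DATABASES = [
    ("prod-db.internal", 5432),
    ("analytics-db.internal", 5432),
]

for host, port in DATABASES:
    try:
        # Attempt a plain TCP connection with a short timeout
        with socket.create_connection((host, port), timeout=5):
            print(f"OK      {host}:{port}")
    except OSError as err:
        print(f"FAILED  {host}:{port} ({err})")
```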

Once you’ve defined all of your input sources, proceed to do the following:

  1. In “Processing/Transformation Settings” define all of your tables and fields containing sensitive information, and select a ruleset for handling this.
  2. If you are unsure that you’ve identified all of your PII, your engineer can install the SMART Agent CLI and kick off an automated scan using the PII Scanner tool. Results are reported back to the “PII Scanner” section of the Dashboard, where you can automatically add redaction rules for these new fields to your configuration. A simple first-pass check you can run yourself is sketched after this list.
  3. In the “Workflow Schedule Options” you can decide to put your workflow on a schedule. Please note that these times are in the UTC timezone (also known as Greenwich Mean Time or GMT), and custom times are expressed in crontab format. You can use this guide to format a custom time if you wish. You might want to start by running these jobs overnight (e.g. to run at midnight UTC the custom time would be 0 0 * * *). For testing purposes your engineer can run these jobs manually whenever needed, and you can change this schedule whenever you want and have it recognized within minutes.
  4. In the “Output Settings” section, you can specify time periods for each table. Since a lot of databases include relationships to things like users and companies, it is advisable to include all of those tables so as not to break any relationships, but for data such as individual transactions made by users you can usually safely omit historic data. You can always return to this setting and make adjustments later. Note that these time periods can be based on creation times, update times, or both, and you’ll need to note which fields record these respective timestamps.
  5. Select as many data feeds as you wish for your workflow. For example, the “Create a PII-free Digital Twin/Clone” will create the aforementioned clone, and the “Upload/Sync Data to an Amazon S3 Bucket” will add data to an S3 bucket which could back a data lake.
  6. If you’ve selected the Digital Twin data feed option you will be provided with instructions for adding the connection info for this database to your SMART Agent configuration (which will require a one-time re-install to inject and save this information).
  7. Click “Update”, and then “Save Changes”.
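
As mentioned in step 2, before (or alongside) the PII Scanner, a simple first pass is to look for column names that commonly hold personal data; it will not catch everything, but it gives you a starting checklist for your redaction rules. A minimal sketch against a PostgreSQL source using psycopg2; the connection settings and name patterns are placeholders, not part of Redactics itself:

```python
import psycopg2

# Column-name patterns that often indicate PII; extend to match your schema
PII_PATTERNS = ["%email%", "%phone%", "%name%", "%address%", "%ssn%", "%birth%"]

# Placeholder connection settings for the master database
conn = psycopg2.connect(host="prod-db.internal", dbname="appdb",
                        user="readonly", password="***")

with conn, conn.cursor() as cur:
    # Find columns whose names match any of the PII patterns
    cur.execute("""
        SELECT table_schema, table_name, column_name
        FROM information_schema.columns
        WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
          AND column_name ILIKE ANY(%s)
        ORDER BY table_schema, table_name, column_name
    """, (PII_PATTERNS,))
    for schema, table, column in cur.fetchall():
        print(f"{schema}.{table}.{column}")
```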

Step 3: Running Your Workflow

Congratulations, your data feeds are ready to deliver production data to their destinations! To confirm that you are good to go, you can either bump up your schedule or have an engineer invoke the workflow manually via the Redactics SMART Agent CLI. Any issues with the workflow will be reported to the Redactics Dashboard, and, of course, you are welcome to contact Redactics Support if you require any assistance!

As workflows run, progress is reported to the Workflows -> Jobs page, and when the work has completed a report is provided detailing what was copied. If you’ve selected the Digital Twin data feed option with delta updates enabled, on subsequent runs you’ll see visual feedback like the following:

Delta update feedback in the Workflows -> Jobs report

Note that if the schema of one of these tables changes, this will be automatically detected and the table will be fully copied on its next run instead. This way, the table’s new schema is applied in addition to any backfilled data.
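
If you want an independent sanity check that the digital twin is keeping up, comparing per-table row counts against production is a quick way to spot gaps (bear in mind that tables you limited to a time period will intentionally contain fewer rows). A minimal sketch, assuming both sides are PostgreSQL and psycopg2 is installed; connection settings and the table list are placeholders:

```python
import psycopg2

TABLES = ["users", "companies", "transactions"]  # placeholder table names

# Placeholder connection settings for production and the digital twin
prod = psycopg2.connect(host="prod-db.internal", dbname="appdb",
                        user="readonly", password="***")
twin = psycopg2.connect(host="twin-db.internal", dbname="appdb_twin",
                        user="readonly", password="***")

def count(conn, table):
    with conn.cursor() as cur:
        # Table names come from our own fixed list above, not user input
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        return cur.fetchone()[0]

for table in TABLES:
    p, t = count(prod, table), count(twin, table)
    status = "OK" if p == t else "DIFFERS"
    print(f"{table:<15} production={p:<10} twin={t:<10} {status}")
```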

Relating Data Back To Its Master

This is worth reiterating: when you delta update, a new column called source_primary_key is created containing the primary key of the master record. If you need to search for specific records in your digital twin, you’ll need to adjust your queries to use this field instead. One reason for this design is to establish a single source of truth, that truth being the master record in your production database. Whenever a master record is updated, the update will automatically be applied to your test data on its next run, so long as the updated_at fields in the original tables have been updated.
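
For example, to reconcile a single record in the digital twin against its master, you would look it up by source_primary_key in the twin and then fetch the production row with that key. A minimal sketch, assuming PostgreSQL and psycopg2; the table name, column names, and connection settings are placeholders for illustration:

```python
import psycopg2

# Placeholder connections: the digital twin and the production master
twin = psycopg2.connect(host="twin-db.internal", dbname="appdb_twin",
                        user="readonly", password="***")
prod = psycopg2.connect(host="prod-db.internal", dbname="appdb",
                        user="readonly", password="***")

source_pk = 12345  # the master record's primary key you want to reconcile

with twin.cursor() as cur:
    # In the twin, query by source_primary_key rather than the local primary key
    cur.execute("SELECT * FROM transactions WHERE source_primary_key = %s", (source_pk,))
    twin_row = cur.fetchone()

with prod.cursor() as cur:
    # In production, the same value is the table's own primary key ("id" is a placeholder)
    cur.execute("SELECT * FROM transactions WHERE id = %s", (source_pk,))
    master_row = cur.fetchone()

print("twin:  ", twin_row)
print("master:", master_row)
```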

Data Privacy By Design

By setting up all of your stakeholders with access to the data they need, delivered by an appropriate data feed, you can sever any access they had to your production database. Doing so establishes a new paradigm of safe data by default, your own “No PII Zone”, which in turn provides data privacy by design and means you no longer have to account for production data access in your compliance audits.
