How do you build a data warehouse with OOP? I suggest designing by delineating the classes that you'll need to both create and maintain the data warehouse. The first time that you build these classes, you will probably build them wrong. This is expected as testing helps improve the design over time. A general overview of the classes that you'll create:
Your configuration or connection classes should be able to access the appropriate properties to the source, staging and final destinations. Related libraries should be imported at the highest scope here, if these libraries cannot be accessed or referenced at more detailed levels of your library. For example, you can import pandas wherever you need (generally transformations), so you wouldn't need to import it here at this step unless you'll use it.
Consider how you separate the dimension and fact classes. One technique that I tend to use involves fact and dimension classes inheriting from a base object. This base object has the standard columns that we tend to use frequently in ETL, such as timestamps of when records were added, updated and whether they're active. Because dimensions can be different types whereas facts cannot, facts and dimensions have different properties and methods. For instance, a fact class will have a foreign key reference method that ensures that a key column ties out with a dimension. A dimension doesn't need this because a dimension can have a key that isn't found in a fact, but the inverse cannot be true.
Build a transformation class for repeatable functionality. Generally, transformations occur on a single column level, though there are some cases in which they occur on multiple columns or even the entire table (much rarer). One of the most expensive steps in building and maintaining a data warehouse (and ETL in general) is the transformation step. Over time, you may find it convenient to further break this tranformation class into smaller classes, such as transformation classes for machine learning, analysis, etc. For instance, one of my repetitive methods involves pivot tables. Over time, I've added this functionality into a one hot encoding method. Keep in mind that you can spend decades never doing certain transformations and you will generally find 20-30% of your transformations are the most repeated in use.
Finally, consider scale. Scale can impact all the other steps, but it can also cause people to never finish the other steps. I've seen more waste in planning for scale than actually solving it. When you know the data volume will grow, you want to consider it. Make sure that this doesn't become the procrastination point that results in unnecessary planning meetings where nothing gets done. Horizontally scaling your data may require change to your other code. Accept that so that you can still create repetitive functionality. In general, scale is not always an issue that people in data face and this is why I've seen worries about scale cost companies more in planning than if they approached the problem without the concern.
This provides an overall picture of how you'd approach a data warehouse with OOP. This is not the actual solution that you'd use as details matter.
Top comments (0)