Table of contents
- What are packages
- Why use it
- Local packages
- dbt Hub packages
- Verify packages are installed
- Macros usage
- Models usage
- dbt_modules under the hood
- Disclaimer
- Contact
What are packages
dbt packages are collections of macros, models, and other resources that extend the functionality of dbt. Packages let you share common code and resources across multiple dbt projects; they can be published and installed from the dbt Hub, pulled from GitHub, or stored locally and installed by specifying the path to the project. In dbt, such libraries are simply called packages.
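The three install methods mentioned above can be sketched in a single `packages.yml`; the versions and paths here are illustrative, not prescriptive:

```yaml
# packages.yml -- the three ways to pull in a package (illustrative values)
packages:
  # from the dbt Hub
  - package: dbt-labs/dbt_utils
    version: 0.7.6
  # from a GitHub repository
  - git: "https://github.com/dbt-labs/dbt-utils.git"
    revision: 0.7.6
  # from a local path
  - local: /opt/dbt/packages/local_utils/
```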
Why use it
dbt packages are so powerful because many of the analytic problems we encounter are shared across organizations.
There are a few general benefits to using packages in dbt:
- Reusability: packages allow you to reuse code across multiple projects and models. This can save you a lot of time and effort, as you don't have to copy and paste the same code into multiple places.
- Collaboration: packaging your models allows multiple people to work on them at the same time. You can use version control systems like git to manage changes to the models, and tools like the `dbt test` command to ensure that the models are correct and reliable.
- Sharing: packaging your models or macros allows you to share them with others. You can publish your package on the dbt Hub or on GitHub, and others can install and use your models in their own dbt projects.
- Managing: packages make it easier to manage your codebase. You can use version control to track changes to your package, and you can easily install and update packages in your dbt project.
- Modularity: packaging your models allows you to break your data pipeline into smaller, more manageable pieces that are easier to understand and maintain. This streamlines development and upkeep of your dbt project over time, which is especially useful if you are working on a large project with many different models and transformations.
Overall, using packages can help you to build more efficient, maintainable, and scalable data pipelines with dbt.
For example, if your aim is to extract the day of the week, there is no sense in reinventing the wheel and developing this macro on your own. Rather, find the right package and make use of it: `{{ dbt_date.day_of_week(column_name) }}`.
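As an illustration, a hypothetical model using that `dbt_date` macro might look like the following (the `orders` model and its columns are made up for this example):

```sql
-- day_of_week comes from the dbt_date package;
-- orders / order_date are hypothetical names
select
    order_id,
    {{ dbt_date.day_of_week("order_date") }} as order_day_of_week
from {{ ref("orders") }}
```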
Local packages
In dbt, you can use local packages to organize and reuse code within a single dbt project. Local packages are stored within your project directory and are only available to the models in that project. The best use case for a local package is a module that you want to live in the same repository, alongside the main project.
To create a reusable local package, do the following:
- Consider you have the following dbt project dir structure
>>> tree -L 1 .
.
├── data
├── dbt_project.yml
├── macros
├── models
├── packages
├── packages.yml
├── profiles.yml
└── target
- Create a `packages` dir; this is where we will put our first local package.
mkdir packages
- Create a `packages.yml` file; this is where we will link our first local package.
touch packages.yml
- Before moving on, please verify that you have the following dir structure
>>> tree -L 1 .
.
├── data
├── dbt_modules
├── dbt_project.yml
├── macros
├── models
├── packages
├── packages.yml
├── profiles.yml
├── snapshots
└── target
- Jump into the `packages` dir and init your package with the name `local_utils`. The name of the package is arbitrary.
cd packages
dbt init local_utils
It will create a package with the following structure:
>>> tree local_utils
local_utils
├── README.md
├── analysis
├── data
├── dbt_project.yml
├── macros
├── models
│ └── example
│ ├── my_first_dbt_model.sql
│ ├── my_second_dbt_model.sql
│ └── schema.yml
├── snapshots
└── tests
- Next, you need to change the project `name` in `dbt_project.yml` from `my_new_project` to a meaningful and self-explanatory name. This name will be used later in macro and model references. Let's call it `local_utils`.
- Specify our local package in the aforementioned `packages.yml` as follows:
packages:
- local: /opt/dbt/packages/local_utils/
Make sure that you provide the absolute path to the package. Otherwise, it will not work.
Save the `packages.yml` file and run the `dbt deps` command to install the package. This will link the package and make it available to your dbt models.
Here is an example of what the `dbt deps` output might look like when you install your local package:
>>> dbt deps
Running with dbt=0.21.1
Installing /opt/dbt/packages/local_utils/
Installed from <local @ /opt/dbt/packages/local_utils/>
Now you can observe a newly created `dbt_modules` dir that contains a symlink to `local_utils`. This means our local package is ready to be used.
>>> tree dbt_modules
dbt_modules
└── utils -> /opt/dbt/packages/local_utils/
dbt Hub packages
In dbt, you can use packages from the dbt Hub to share your code with others and to reuse code from other users in your own projects. The dbt Hub is a community-driven library of packages that you can use to extend the functionality of dbt.
Probably the best examples of community-driven third-party packages are `dbt_utils` and `dbt_date`, both of which appear in this article.
There are a few benefits to using dbt Hub packages:
- Reusable code: dbt Hub packages allow you to reuse code that has been shared by other users, teams, and companies. This can save you a lot of time and effort, as you don't have to write the same logic from scratch, then test and maintain it.
Plus the usual advantages of any open-source project:
- Community support: When you use packages from the dbt Hub, you can benefit from the support and expertise of the dbt community. If you have questions or run into issues with a package, you can ask for help on the dbt community forums or Slack channel.
- Collaboration: By sharing your own packages on the dbt Hub, you can make your code available to other users. This can help to foster collaboration and improve the overall quality of your code.
Overall, using dbt Hub packages can help you to build more efficient, maintainable, and scalable data pipelines with dbt, and to collaborate with others in the dbt community.
To install a package from the dbt Hub in your dbt project, you will need to add the package to your packages.yml file.
Here is the basic process:
- Go to the dbt Hub and search for the package you want to install.
- Click on the package to view its details.
- Copy the package name and version from the installation instructions.
- Open your `packages.yml` file and add the package name and version to the packages list. It should look something like this:
packages:
  - package: dbt-labs/dbt_utils
    version: 0.7.6
Save the `packages.yml` file and run the `dbt deps` command to install the package. This will download the package and make it available to your dbt models.
Here is an example of what the `dbt deps` output might look like when you install a dbt Hub package:
>>> dbt deps
Installing dbt-labs/dbt_utils@0.7.6
Installed from version 0.7.6
Unlike a local package, a hub package is physically downloaded into the `dbt_modules` dir.
>>> tree -L 2 dbt_modules
dbt_modules
└── dbt_utils
├── CHANGELOG.md
├── LICENSE
├── README.md
├── RELEASE.md
├── dbt_project.yml
├── docker-compose.yml
├── etc
├── integration_tests
├── macros
└── run_test.sh
Verify packages are installed
To verify that a package is installed in your dbt project, you can check the `packages.yml` file and run the `dbt deps` command.
- Check the `packages.yml` file: this file lists all of the packages that are installed in your dbt project. Look for the name of the package you want to verify. If it is listed in the packages list, then it is installed.
- Run the `dbt deps` command: this command will show you a list of all of the packages that are installed in your dbt project. Look for the name of the package you want to verify. If it is listed, then it is installed.
- In the root dbt project dir, you will observe a new dir `dbt_modules/`, which contains the compiled packages that are ready to be used. NOTE: the `dbt_modules/` dir has to be added to `.gitignore`.
>>> tree -L 1 .
.
├── data
├── dbt_modules
├── dbt_project.yml
├── macros
├── models
├── packages
├── packages.yml
├── profiles.yml
├── snapshots
└── target
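The `.gitignore` entry mentioned above could look like this minimal sketch (the `target/` entry is my own addition, reflecting common dbt practice for build artifacts):

```gitignore
# keep installed packages and build artifacts out of version control
dbt_modules/
target/
```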
If your `packages.yml` file contains a package that is not installed, you will not be able to run any dbt command:
>>> dbt list
Encountered an error:
Compilation Error
dbt found 1 package(s) specified in packages.yml, but only 0 package(s) installed in dbt_modules. Run dbt deps to install package dependencies.
This is our guarantee that we will not hit any package-installation issues at runtime.
Macros usage
In dbt, you can use packages to define custom macros that can be called from your dbt models. Here are a few examples of how you might use macros to perform common data transformations.
For instance, let's create the following macro in our local package under `local_utils/macros/cents_to_dollars.sql`:
{% macro cents_to_dollars(column_name, precision=2) %}
({{ column_name }} / 100)::numeric(16, {{ precision }})
{% endmacro %}
Next we can call our macro as `{{ local_utils.cents_to_dollars(your_column_name) }}`. The `local_utils` package name comes from the `name` in our package's `dbt_project.yml` file.
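For example, a hypothetical model calling this macro could look like the following (the `raw_payments` model and `amount_cents` column are made up for illustration):

```sql
-- cents_to_dollars comes from our local_utils package;
-- raw_payments / amount_cents are hypothetical names
select
    payment_id,
    {{ local_utils.cents_to_dollars("amount_cents") }} as amount_dollars
from {{ ref("raw_payments") }}
```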
Using macros from dbt Hub packages is pretty much the same. Imagine we want to generate a surrogate key based on a few columns. This is functionality the `dbt_utils` package we previously installed provides: `{{ dbt_utils.generate_surrogate_key([field_a, field_b[,...]]) }}`. So the macro usage pattern for a third-party package is `{{ package_name.macro_name() }}`.
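As a sketch, in a model this could look like the following (the table and column names are assumptions for illustration):

```sql
-- generate_surrogate_key hashes the listed columns into a single key;
-- orders / customer_id / order_date are hypothetical names
select
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'order_date']) }} as order_key,
    customer_id,
    order_date
from {{ ref("orders") }}
```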
Models usage
As a prerequisite, we created our own local package `local_utils`, which has the following structure:
>>> tree packages/local_utils
local_utils
├── README.md
├── analysis
├── data
├── dbt_project.yml
├── macros
├── models
│ └── example
│ ├── my_first_dbt_model.sql
│ ├── my_second_dbt_model.sql
│ └── schema.yml
├── snapshots
└── tests
The most important function in dbt is `ref()`; it's impossible to build even moderately complex models without it. `ref()` is how you reference one model within another inside your package. Here is how this looks in practice:
select
*
from
{{ ref("model_a") }}
So if we would like to reference `my_first_dbt_model` from `my_second_dbt_model` within the `local_utils` package, then we do the following:
select
*
from
{{ ref("my_first_dbt_model") }}
If we want to reference `my_first_dbt_model` from our main project, then we need to slightly change the way we call it. There is a two-argument variant of the `ref` function. With this variant, you can pass both a package name and a model name to `ref` to avoid ambiguity:
select
*
from
{{ ref("package_name", "model_name") }}
Our particular case would look like this:
select
*
from
{{ ref("local_utils", "my_first_dbt_model") }}
Note: the `package_name` should only include the name of the package, not the maintainer. For example, if we use the `dbt-labs/dbt_utils` package, type `dbt_utils` in that argument, not `dbt-labs/dbt_utils`.
dbt_modules under the hood
The `dbt_modules` directory is used by dbt to store packages and their models. When you install a package using the `dbt deps` command, the package and its models are downloaded and stored in the `dbt_modules` directory.
The `dbt_modules` directory is located in the root directory of your dbt project. It contains a subdirectory for each installed package, and each package directory contains that package's models, macros, and other resources.
The way dbt installs `local` and `dbt hub` packages is different. Consider the following `packages.yml` content:
packages:
  - package: dbt-labs/dbt_utils
    version: 0.7.6
  - local: /opt/dbt/packages/local_utils/
You would get the following modules under the generated `dbt_modules` dir:
>>> tree -L 2 dbt_modules
dbt_modules
├── dbt_utils
│ ├── CHANGELOG.md
│ ├── LICENSE
│ ├── README.md
│ ├── RELEASE.md
│ ├── dbt_project.yml
│ ├── docker-compose.yml
│ ├── etc
│ ├── integration_tests
│ ├── macros
│ └── run_test.sh
└── utils -> /opt/dbt/packages/local_utils/
As I mentioned before, dbt creates a symlink for a local package, while for a dbt Hub package it simply copies all the needed files into a folder with the same name.
We use Google Cloud Composer to orchestrate all the transformation jobs. We basically copy our project to a GCP bucket with `gsutil -m rsync`. Unfortunately, it does not support symbolic links:
Since gsutil rsync is intended to support data operations (like moving a data set to the cloud for computational processing) and it needs to be compatible both in the cloud and across common operating systems, there are no plans for gsutil rsync to support operating system-specific file types like symlinks.
Taken from the `gsutil rsync` documentation, Be Careful When Synchronizing Over OS-Specific File Types (Symlinks, Devices, Etc.).
The possible solution is to compress everything locally into an archive, copy it to the bucket, and then unpack it on Composer (since we now upload a single file rather than a directory tree, `gsutil cp` is the right tool instead of `rsync`):
tar -czvf dbt-project.tar.gz dbt-project
gsutil -m cp dbt-project.tar.gz gs://$BUCKET/prefix
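One refinement worth considering (my own suggestion, not part of the original setup): tar's `-h`/`--dereference` flag archives the files a symlink points to rather than the link itself, so the unpacked copy also works in environments where the symlink target does not exist. A self-contained sketch with throwaway paths standing in for the real project:

```shell
# recreate the symlinked local-package layout with hypothetical paths
mkdir -p dbt-project/dbt_modules /tmp/local_utils
touch /tmp/local_utils/dbt_project.yml
ln -sfn /tmp/local_utils dbt-project/dbt_modules/local_utils

# -c create, -z gzip, -h follow symlinks, -f output file
tar -czhf dbt-project.tar.gz dbt-project

# upload the single archive; plain cp is enough for one file
# gsutil -m cp dbt-project.tar.gz gs://$BUCKET/prefix/
```

After unpacking, `dbt_modules/local_utils` is a real directory with the package files inside, no symlink resolution required on the Composer side.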
Here’s what those switches actually mean:
-c: Create an archive.
-z: Compress the archive with gzip.
-v: Display progress in the terminal while creating the archive, also known as “verbose” mode. The v is always optional in these commands, but it’s helpful.
-f: Allows you to specify the filename of the archive.
These are the pitfalls we ran into when working with dbt packages.
Disclaimer
- All this experience applies to dbt v0.21.1
- I am aware that since v1.0 the default value changed to `dbt_packages` instead of `dbt_modules`
- I like to think that most of this guide still applies to the latest dbt version
Contact
If you found this article helpful, I invite you to connect with me on LinkedIn. I am always looking to expand my network and connect with like-minded individuals in the data industry. Additionally, you can also reach out to me for any questions or feedback on the article. I'd be more than happy to engage in a conversation and help out in any way I can. So don’t hesitate to contact me, and let’s connect and learn together.