Whitepaper

Moving your data warehouse in 7 steps

Moving boxes data migration concept

More and more organizations are questioning the reliability of their current cloud provider. Think of geopolitical risks, potential government interventions, or concerns about unwanted access to data. Logical concerns, and precisely why a thoughtful exit strategy is essential. In our whitepaper Why a Data Warehouse exit strategy is essential, we discussed the why. In this whitepaper, we discuss the how.

Why can't you move your data warehouse right now?

The technology you currently use to populate your data warehouse and store data is not suitable to take with you and host with another cloud provider. If you currently use Microsoft Azure with Data Factory, you cannot run that in the Google Cloud Platform (GCP) or on Amazon (AWS). The same applies to other technologies. You are experiencing vendor lock-in.

"The technology you currently use to populate your data warehouse is not suitable to take with you. You are experiencing vendor lock-in."

It's good that you are reading this whitepaper so you know how to break the vendor lock-in with your data warehouse.

Moving your data warehouse is not complex and risky

When moving your data warehouse to another cloud provider, you probably think of quite a few challenges:

Suppose all those risks are clearly mapped out by Datap and you see that all risks can be well managed, what would stop you from taking the step? Below you can read our approach to moving your data warehouse. Curious what that looks like in practice for you? Schedule an appointment with us quickly and easily.

Moving your data warehouse in 7 steps

What steps do you need to take to move your data warehouse?

  1. Inventory
  2. Choosing new technology
  3. Setting up the platform
  4. Setting up data extraction
  5. Migration: Copying existing data to the new database
  6. Validation and data quality
  7. Connecting reports and other systems to the new database

1. Inventory

It is important to start with an inventory first. During the inventory, you create an overview of all components that will be affected by the migration. That overview is crucial because it allows you to roughly determine the lead time and also set up the test plan containing all components that need to be tested.

2. Making choices

Fundamentally, you will have to make two choices:

You want the freedom to host your data warehouse and data platform at the cloud provider of your preference. The database technology and software we advise can run at almost all cloud providers, or on your own server or even on your laptop. You can therefore choose to focus only on replacing the 'technology' now and not immediately replace your cloud provider.

Based on the inventory, it is wise to choose between either PostgreSQL as technology, or a Lakehouse solution. For most companies, PostgreSQL is an excellent choice.

3. Setting up the platform

Once you have decided which (cloud) provider you choose, you can start setting up the database and corresponding software.

If you roll it out at an existing cloud provider, it is a matter of clicking the necessary components in the portal so you have the required software available. A European cloud provider is of course also an excellent option. You will first have to create an account, after which you can turn on the necessary components.

When we set up data platforms, we do this largely automated. We apply 'Infrastructure as Code' (IaC). IaC means that we set up the necessary components fully automated based on scripts. The code to do that is largely generic, making our solution deployable in your current cloud, in another cloud, as well as on your own servers.

We advise using a solution that is cloud-agnostic. For us, this comes down to technology that can run in a containerized environment.

4. Processing data

Datap chooses dlt (data load tool) as a framework to extract data from source systems. It is based on Python. For ETL software of the 3 major cloud providers, specialized knowledge is required. By working with Python, anyone with Python experience can get started with dlt. This means there is a large group of professionals who can support working with it.

dlt ensures that you can connect source systems very quickly, easily extract many tables, and that data extraction is therefore highly automated.

For the further processing of data to your dashboards and reports, business logic will often have to be added. Then it is a matter of programming the Transformations and writing the data where necessary. Certainly when you have chosen PostgreSQL as technology, most SQL functions you use in your current solution can also be used there.

5. Migration

During the migration, the existing data from your current data warehouse is migrated to the new environment. The easiest way is to also use dlt for this so that you can automate the import into the new database as much as possible.

We advise working incrementally as much as possible. So don't finish the entire ETL first, but when a piece of ETL is completed, migrate the necessary data so you can test the piece from front to back. After that, it is also possible to run in parallel (shadow run). The testing of what has already been delivered then runs parallel to the components still in development. This significantly shortens the lead time.

6. Validation and data quality

Ultimately, you want to know whether the data in the new data warehouse corresponds 1-to-1 with data from the old data warehouse.

You will have to determine the validation and check on data quality per table. We create a script for this in which the complete content of all columns of a table is aggregated, so you can compare that with the old and new table. In the event of differences, we ensure that row by row is compared. If there are differences, there are two possibilities. The new situation deviates, then apparently something has to be adjusted in the loading and processing of the data. Or the new situation is actually better in contrast to the current situation. Then this analysis ensures the recording of that conclusion. Below are the points of attention to watch out for during validation:

Ideally, you work with hashes / checksums. This functionality must be supported by your database.

7. Reports

Your new data warehouse now corresponds 1-to-1 with your old data warehouse. You now have the freedom to move your data warehouse to another cloud provider, to your own servers, or possibly even your own laptop. The only thing you still need to do is change the connections in your reports. It is wise to check every report to see if the situation before and after is identical there too and if the report functions the same. Because you have already carried out a thorough check on the underlying tables in your data warehouse, this check will need to be carried out less thoroughly.

Conclusion

"As you have read, transferring your data warehouse to technology that gives you freedom can be done in 7 steps."

We have over 15 years of experience in deploying and developing data platforms. As a result, we have experienced and executed the necessary migrations.

We can therefore take the entire migration off your hands and perform it largely fixed price. The only variable component is the inventory. Usually, an inventory takes 2 – 7 days. The output of this inventory determines the scope for the assignment. This allows us to easily check off all delivered items during the process and keep a grip on the status and lead time together.

Optionally, we can also perform a fallback test. In a fallback test, we set up an environment at another cloud provider and ensure that a number of tables and source systems are transferred. This way you know for sure that you are actually free to change cloud provider at a later moment. Setting up and executing a fallback test takes on average 5 days. This depends on the exact scope we agree on with each other.