More and more organizations are questioning the reliability of their current cloud provider. Think of geopolitical risks, potential government interventions, or concerns about unwanted access to data. Logical concerns, and precisely why a thoughtful exit strategy is essential. In our whitepaper Why a Data Warehouse exit strategy is essential, we discussed the why. In this whitepaper, we discuss the how.
Why can't you move your data warehouse right now?
The technology you currently use to populate your data warehouse and store data is not suitable to take with you and host with another cloud provider. If you currently use Microsoft Azure with Data Factory, you cannot run that in the Google Cloud Platform (GCP) or on Amazon (AWS). The same applies to other technologies. You are experiencing vendor lock-in.
It's good that you are reading this whitepaper so you know how to break the vendor lock-in with your data warehouse.
Moving your data warehouse is not complex and risky
When moving your data warehouse to another cloud provider, you probably think of quite a few challenges:
- Downtime: will I be unable to access my data for a while?
- Lead time: how long will this project take?
- Costs: how much money will the migration and the new monthly cloud spend cost?
- Scope: do I have to rebuild all reports as well?
- Security: how safe is my data in that other cloud?
Suppose all those risks are clearly mapped out by Datap and you see that all risks can be well managed, what would stop you from taking the step? Below you can read our approach to moving your data warehouse. Curious what that looks like in practice for you? Schedule an appointment with us quickly and easily.
Moving your data warehouse in 7 steps
What steps do you need to take to move your data warehouse?
- Inventory
- Choosing new technology
- Setting up the platform
- Setting up data extraction
- Migration: Copying existing data to the new database
- Validation and data quality
- Connecting reports and other systems to the new database
1. Inventory
It is important to start with an inventory first. During the inventory, you create an overview of all components that will be affected by the migration. That overview is crucial because it allows you to roughly determine the lead time and also set up the test plan containing all components that need to be tested.
2. Making choices
Fundamentally, you will have to make two choices:
- Which location: do you choose a cloud provider, or do you want to host yourself?
- Which database technology?
You want the freedom to host your data warehouse and data platform at the cloud provider of your preference. The database technology and software we advise can run at almost all cloud providers, or on your own server or even on your laptop. You can therefore choose to focus only on replacing the 'technology' now and not immediately replace your cloud provider.
Based on the inventory, it is wise to choose between either PostgreSQL as technology, or a Lakehouse solution. For most companies, PostgreSQL is an excellent choice.
3. Setting up the platform
Once you have decided which (cloud) provider you choose, you can start setting up the database and corresponding software.
If you roll it out at an existing cloud provider, it is a matter of clicking the necessary components in the portal so you have the required software available. A European cloud provider is of course also an excellent option. You will first have to create an account, after which you can turn on the necessary components.
When we set up data platforms, we do this largely automated. We apply 'Infrastructure as Code' (IaC). IaC means that we set up the necessary components fully automated based on scripts. The code to do that is largely generic, making our solution deployable in your current cloud, in another cloud, as well as on your own servers.
We advise using a solution that is cloud-agnostic. For us, this comes down to technology that can run in a containerized environment.
4. Processing data
Datap chooses dlt (data load tool) as a framework to extract data from source systems. It is based on Python. For ETL software of the 3 major cloud providers, specialized knowledge is required. By working with Python, anyone with Python experience can get started with dlt. This means there is a large group of professionals who can support working with it.
dlt ensures that you can connect source systems very quickly, easily extract many tables, and that data extraction is therefore highly automated.
For the further processing of data to your dashboards and reports, business logic will often have to be added. Then it is a matter of programming the Transformations and writing the data where necessary. Certainly when you have chosen PostgreSQL as technology, most SQL functions you use in your current solution can also be used there.
5. Migration
During the migration, the existing data from your current data warehouse is migrated to the new environment. The easiest way is to also use dlt for this so that you can automate the import into the new database as much as possible.
We advise working incrementally as much as possible. So don't finish the entire ETL first, but when a piece of ETL is completed, migrate the necessary data so you can test the piece from front to back. After that, it is also possible to run in parallel (shadow run). The testing of what has already been delivered then runs parallel to the components still in development. This significantly shortens the lead time.
6. Validation and data quality
Ultimately, you want to know whether the data in the new data warehouse corresponds 1-to-1 with data from the old data warehouse.
You will have to determine the validation and check on data quality per table. We create a script for this in which the complete content of all columns of a table is aggregated, so you can compare that with the old and new table. In the event of differences, we ensure that row by row is compared. If there are differences, there are two possibilities. The new situation deviates, then apparently something has to be adjusted in the loading and processing of the data. Or the new situation is actually better in contrast to the current situation. Then this analysis ensures the recording of that conclusion. Below are the points of attention to watch out for during validation:
- Total number of rows in the table.
- Number of unique values per column.
- Number of NULL values per column.
- Minimum and maximum date per column for date/time.
- Minimum and maximum length of a column for a textual field.
- Minimum and maximum value and sum of the values for whole or decimal numbers.
Ideally, you work with hashes / checksums. This functionality must be supported by your database.
7. Reports
Your new data warehouse now corresponds 1-to-1 with your old data warehouse. You now have the freedom to move your data warehouse to another cloud provider, to your own servers, or possibly even your own laptop. The only thing you still need to do is change the connections in your reports. It is wise to check every report to see if the situation before and after is identical there too and if the report functions the same. Because you have already carried out a thorough check on the underlying tables in your data warehouse, this check will need to be carried out less thoroughly.
Conclusion
We have over 15 years of experience in deploying and developing data platforms. As a result, we have experienced and executed the necessary migrations.
We can therefore take the entire migration off your hands and perform it largely fixed price. The only variable component is the inventory. Usually, an inventory takes 2 – 7 days. The output of this inventory determines the scope for the assignment. This allows us to easily check off all delivered items during the process and keep a grip on the status and lead time together.
Optionally, we can also perform a fallback test. In a fallback test, we set up an environment at another cloud provider and ensure that a number of tables and source systems are transferred. This way you know for sure that you are actually free to change cloud provider at a later moment. Setting up and executing a fallback test takes on average 5 days. This depends on the exact scope we agree on with each other.