Not a day goes by without a new data-related term coming up. Why? Because harnessing the power of data is the most pressing issue on the agenda nowadays. We keep accumulating data through various tools just because we can. Then, management gurus tell us to use that data to become more data-driven.
But how? How are we supposed to bring together data scattered over numerous data warehouses, data lakes, SaaS tools, and repositories in the first place so that data scientists and analysts can work their magic? As if that were not challenging enough, we must do it with a handful of experts or find a way to democratize the whole process.
If you are done studying terms like data mesh, data fabric, and data virtualization, meet your new homework: Zero-ETL. In this blog post, we find out what it is and whether it can live up to the promise of its name.
Because the extract-transform-load (ETL) process is like calories, sugar, or fat: You want less of it. But can you get away with no ETL at all?
To understand what zero-ETL is trying to achieve, we should start by analyzing the ETL process it aims to replace. ETL involves extracting data into a staging layer, transforming it into the desired format, and then loading it into the target storage location, where data users can consume it. This process is the main reason data integration poses such a challenge.
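The three steps can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline; the source rows, field names, and target schema below are invented for the example, with SQLite standing in for a data warehouse.

```python
import sqlite3

# Extract: pull raw rows from a source system (hard-coded here for illustration).
raw_rows = [
    {"name": "  Alice ", "signup": "2023/01/15", "plan": "PRO"},
    {"name": "Bob",      "signup": "2023-02-03", "plan": "free"},
]

# Transform: standardize formats in a staging step before loading.
def transform(row):
    return {
        "name": row["name"].strip(),
        "signup": row["signup"].replace("/", "-"),  # normalize date separators
        "plan": row["plan"].lower(),                # normalize plan labels
    }

staged = [transform(r) for r in raw_rows]

# Load: write the cleaned rows into the analytics store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, signup TEXT, plan TEXT)")
conn.executemany("INSERT INTO customers VALUES (:name, :signup, :plan)", staged)

print(conn.execute("SELECT * FROM customers").fetchall())
# → [('Alice', '2023-01-15', 'pro'), ('Bob', '2023-02-03', 'free')]
```

Notice that the transform function encodes assumptions about the source's formats; that hard-coding is exactly what makes a pipeline break when those formats change.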
There must be something ETL is doing really well, though, considering how widely it is used in the industry. It is a fast and effective data integration method that works well with legacy systems. Additionally, it introduces security checks and addresses privacy concerns while bringing the data together.
However, ETL has its downsides, too. First of all, it is a labor-intensive and inflexible method by design. Each ETL pipeline is built for a specific type of transformation, so it has to be reconfigured every time data formats change. It is also high-maintenance: you need to supervise the existing pipelines and ensure they keep working properly. These factors put a strain on IT teams and hurt scalability.
Zero-ETL is the outcome of efforts to create a reliable data integration method without the shortcomings of ETL. It is for those who want to have their cake and eat it, too. Zero-ETL indeed gets rid of the pesky transformation phase, speeding up data integration. In doing so, however, it also sacrifices the standardization and cleansing that come with the transformation step.
Transformation is not a needless chore you should look to dump at all costs, though. It prepares the data so that it can be consumed by the data user. You can eliminate the transformation as a task, but the need to cleanse and standardize the data remains. It just has to be dealt with differently.
The zero-ETL approach tries to get around this problem by keeping the data in a single ecosystem and automating its movement. Amazon's implementation of this method between its Aurora relational database and its Redshift data warehouse is a good case in point. By letting users move their data without building ETL pipelines, Amazon claims to offer near-real-time analytics and machine learning (ML) capabilities. In exchange, however, zero-ETL gives up the flexibility to work across different data sources, cloud environments, and platforms.
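The single-ecosystem idea can be illustrated with a toy analogy (SQLite stands in for the shared ecosystem here; the table, view, and column names are invented for the example). Because the operational data and the analytics query live in the same engine, new rows become queryable the moment they are written, with no pipeline in between. But the cleansing has not disappeared; it has simply moved to query time.

```python
import sqlite3

# One engine plays both roles: operational store and analytics store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount_cents INTEGER)")

# Cleansing has not vanished; it has moved into a view applied at read time.
conn.execute("""
    CREATE VIEW order_totals AS
    SELECT TRIM(customer) AS customer, SUM(amount_cents) / 100.0 AS total
    FROM orders GROUP BY TRIM(customer)
""")

# The application writes a row...
conn.execute("INSERT INTO orders VALUES (' alice ', 1250)")

# ...and analytics sees it immediately: no extract, no staging, no load step.
print(conn.execute("SELECT * FROM order_totals").fetchall())
# → [('alice', 12.5)]
```

The trade-off shows even in this toy: everything works because both sides share one engine. Point the analytics at a second database and the pipeline-free convenience evaporates.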
Now, let's take a tally and summarize the pros and cons of zero-ETL. First, the pros:
Speed: The zero-ETL approach is faster than the conventional ETL as the former does not involve the time-consuming data transformation phase.
Efficiency: The zero-ETL approach requires less custom programming (hence less manpower) and less duplicate storage, which together bring down costs.
Ease of use: The speed and the elimination of data pipelines make the technology accessible to a broader range of users.
Now, the cons:

Lack of flexibility: Despite aiming to replace the inflexible ETL method, zero-ETL comes up short on that front. Its flexibility, speed, and ease of use apply only to a small number of use cases where data moves within the same ecosystem.
Not enterprise-friendly: Due to vendor lock-in and its inability to work with different systems, zero-ETL does not lend itself to use by enterprises, which tend to employ diverse data sources, legacy systems among them. It is more suitable for organizations like startups and SMBs, where data integration needs are well-defined and predictable.
Limited scope: Unfortunately, it is not possible to abstract away all the possible transformations, which undermines zero-ETL's capability to address different needs. This method is not suitable for building, managing, and maintaining sophisticated data pipelines. Zero-ETL trades versatility for efficiency and speed in a limited number of situations.
The merits of a genuine zero-ETL data integration are obvious. The question is whether the technology is there yet and how much the zero-ETL claim corresponds to reality. Once again, it is a matter of a process living up to its name, much like the "self-service" data integration we touched upon in our previous blog post.
As is usually the case, there is no one-size-fits-all solution in data integration. Currently, zero-ETL is a quick and easy-to-implement data integration method that only covers select use cases in narrowly defined ecosystems. It functions like a very specialized form of the data virtualization technique. For use cases where data virtualization is not ideal, it makes more sense to improve the good ol' ETL. Reconciling data inconsistencies, standardizing naming and type conventions, and building systems that are integrated at the application layer would go a long way toward making the process more efficient.
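As a small illustration of that last point, standardizing names and type conventions can live in a reusable mapping instead of a bespoke pipeline per source. The column aliases and type rules below are invented for the example; the point is the pattern, not the specifics.

```python
# Map each source's column names onto one canonical schema, with type coercions.
CANONICAL = {
    "cust_name": ("customer", str),
    "customerName": ("customer", str),
    "amt": ("amount", float),
    "amount_usd": ("amount", float),
}

def standardize(record):
    """Rename known columns and coerce values to the canonical types."""
    out = {}
    for key, value in record.items():
        if key in CANONICAL:
            name, cast = CANONICAL[key]
            out[name] = cast(value)
    return out

# Two sources with different conventions converge on the same shape.
print(standardize({"cust_name": "Alice", "amt": "12.5"}))   # → {'customer': 'Alice', 'amount': 12.5}
print(standardize({"customerName": "Bob", "amount_usd": 3}))  # → {'customer': 'Bob', 'amount': 3.0}
```

Adding a new source then means adding entries to the mapping, not writing and babysitting another pipeline.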