A startup’s tech stack begins to shape up when it sets up its CRM, payment platform, and various tools to monitor product usage. This setup serves its purpose well with minor adjustments and additions until the startup reaches a headcount of around 20-25 people.
At this point, the ‘flat organization’ principles, which discourage middle management levels and promote collaboration between individuals with overlapping responsibilities, become more difficult to implement. A more hierarchical structure begins to appear, along with departments and new chains of command.
This increase in headcount is usually good news—possibly a consequence of nailing the product-market fit, which brings about an explosion in data. The tech stack at hand fails to handle this new tide of data, so data silos start to form, making it challenging to build a single view of truth. It is around this time that startups get acquainted with the “modern data stack.”
The modern data stack is a buzz phrase describing a set of tools and processes centered around data warehouses and ETLs and widely used as data infrastructure by organizations today.
Despite the hype, the modern data stack is not a good fit for startups due to the high costs and complexity involved.
Zero-ETL is a viable solution for startups striving to unify and query their data, handle high-velocity event data to track product usage, and conduct real-time analytics.
The three components of zero-ETL—query federation, streaming ingestion, and change data capture—give a startup everything they need from a data stack.
The term “modern data stack” refers to the one-size-fits-all solution we have had in the data integration realm for quite some time. It consists of
- a data warehouse to store your data,
- ETL pipelines to extract, transform, and load data from different sources into the data warehouse,
- a data transformation tool,
- a BI tool for drawing insights from the data you brought together and
- various other tools for data orchestration, monitoring, and governance.
However, not much is modern about the modern data stack, especially the data warehouse and the ETL pipelines that make up its backbone. These concepts have been around for thirty years. Justin Borgman, Starburst co-founder and CEO, is right when he says the modern data stack is “just the data stack of decades ago moved to the cloud.” In fact, there has been so little innovation in the segment in the last few decades that even the introduction of ELT was touted as a breakthrough.
Another sign of the obsolescence of the modern data stack is its lack of solutions for new problems. The modern data stack has specialized in handling enterprise data. It has become so good at this job that it has solved the same problem many times over, coining a fancy term for every incremental improvement it unlocked.
However, the modern data stack has failed spectacularly to provide purpose-built solutions for the data needs of non-enterprise users. The choices available to a startup looking to integrate its data are fundamentally the same as those available to an enterprise with tens of thousands of data consumers. They are basically miniaturized or souped-up versions of enterprise solutions that sell startups on features they neither need nor can use. As a result, the money these organizations spend on the modern data stack yields a low ROI because the stack is
- prohibitively expensive to set up unless you are running an enterprise,
- costly to maintain, and
- difficult to use for organizations without sizable data teams.
These challenges are the direct results of the ETL processes involved in the modern data stack.
The “T” in ETL refers to transformation, which involves changing the data format so that it maps to the target schema used by the destination platform. However, each type of transformation requires a new ETL pipeline, and these pipelines need to be updated as data formats change. This constant need for monitoring and maintaining pipelines renders ETL a labor-intensive process, which organizations without data teams can ill afford.
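To make the maintenance burden concrete, here is a minimal sketch of a single transformation step. The field names, the target columns, and the `transform_crm_record` function are all hypothetical and chosen only for illustration. The sketch shows how a transform tied to one source format stops working as soon as that source renames a field.

```python
from typing import Any

# Hypothetical target schema expected by the destination table.
TARGET_COLUMNS = ("customer_id", "email", "signup_date")


def transform_crm_record(record: dict[str, Any]) -> dict[str, Any]:
    """Map one raw source record onto the target schema (the "T" in ETL)."""
    return {
        "customer_id": int(record["id"]),
        "email": record["email_address"].strip().lower(),
        "signup_date": record["created_at"][:10],  # keep YYYY-MM-DD only
    }


# The format the pipeline was written for.
old_format = {
    "id": "42",
    "email_address": " Ada@Example.com ",
    "created_at": "2024-03-01T09:30:00Z",
}
row = transform_crm_record(old_format)

# The source renames "email_address" to "email": the same pipeline now
# fails until someone notices and updates the mapping.
new_format = {
    "id": "42",
    "email": "ada@example.com",
    "created_at": "2024-03-01T09:30:00Z",
}
try:
    transform_crm_record(new_format)
    broke = False
except KeyError:
    broke = True
```

Multiply this by every source and every downstream format, and the monitoring effort described above follows naturally.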
So, the ideal data integration solution for a startup is one without ETL processes. What it should include is a whole other story.
As a startup founder, what’s the “job” you want to get done when you “hire” a modern data stack? You want to consolidate your scattered data, create a single view of truth for everyone on your team (in real time, if possible), and leverage that data to draw insights that can inform your decisions. In short, you want to query your data wherever it resides.
Do you care if you can do this with a sophisticated stack also used by some Fortune 500 company? Probably not. Does it make any difference to you that the cloud data platform you are using raised X gazillions of dollars last month? Not really. You want to bring your data together and query it in the most efficient way possible. Period.
This is a bit like the Internal Combustion Engine (ICE) car vs. Electric Vehicle (EV) debate we have been witnessing for the last few years. You may like the design of an ICE car, the heritage of the brand, and the so-called character it oozes when you rev up its V8 engine. But, purely on the basis of daily utility, your bespoke Bentley cannot hold a candle to a run-of-the-mill Tesla. With its plethora of moving parts, hoses, and fluids, your Bentley is a breakdown waiting to happen. Even keeping it on the road can be a challenge, with those parts getting replaced one by one and at a significant cost.
Tesla, on the other hand, represents a different paradigm: lean, easy to build (hence, affordable), cheap to run and maintain, and almost failproof. If you are buying a car for your daily commute or want to get from A to B in the most efficient way possible, a Tesla is hands down a better option than a gas-guzzling Bentley with hand-crafted wood trim in its dashboard.
The same applies to a startup’s data integration needs: You want to transition from a state of scattered data to one of unified data in the quickest and most efficient way possible. In the meantime, your “non-modern” data stack should allow you to
- connect and query every data source,
- track product usage data so you can understand how your customers are interacting with your product,
- conduct real-time analytics while offloading reporting duties from the source database so it does not get overwhelmed.
Thankfully, we finally have an alternative that can provide all three. It is called zero-ETL, but we might as well call it the real modern data stack.
AWS defines zero-ETL as “a set of integrations that eliminates or minimizes the need to build ETL pipelines.” This technique offers an innovative and practical way of integrating your data.
The fact that zero-ETL does away with data transformation makes it instantly more suitable for handling a wider range of data sources, as there is no need to engineer the data to fit the schema of the destination. Omitting this intermediary stage translates into cost savings since it reduces the need for data storage and frees up engineering hours that can be used for other value-creating activities.
Zero-ETL consists of three main components: query federation, streaming ingestion, and change data capture.
Having all the data you will ever need in a single location would make your life easier. However, it is neither doable nor necessary. Your data needs are not set in stone, and they will vary over time. What you need is a flexible, easy-to-use, and low-maintenance system that lets you pull in the data you need at any time.
One way of doing that is through query federation. Query federation (or federated query) is a method for fetching data from multiple disparate data sources. It creates temporary virtual tables from the data it pulls in from other systems and allows users to query data stored in remote systems without using any physical ETL process.
The temporary virtual databases created by query federation do not contain the actual remote data but rather the metadata about the remote data. The metadata serves as a digital catalog of where remote data is stored so that it can be pulled in when needed, eliminating the need to copy and move the data from one location to another at predetermined intervals.
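The idea can be sketched on a very small scale with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: two local SQLite files stand in for remote systems, and the table names, column names, and the `make_source` helper are hypothetical. A real federation engine would talk to remote databases and APIs through connectors, but the principle is the same: the querying layer holds references to the sources rather than copies of their data.

```python
import os
import sqlite3
import tempfile


def make_source(path, ddl, insert_sql, rows):
    """Create a standalone SQLite file standing in for one remote system."""
    conn = sqlite3.connect(path)
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    conn.commit()
    conn.close()


workdir = tempfile.mkdtemp()
crm_path = os.path.join(workdir, "crm.db")
billing_path = os.path.join(workdir, "billing.db")

make_source(
    crm_path,
    "CREATE TABLE customers (id INTEGER, email TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "ada@example.com"), (2, "alan@example.com")],
)
make_source(
    billing_path,
    "CREATE TABLE invoices (customer_id INTEGER, amount REAL)",
    "INSERT INTO invoices VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (2, 50.0)],
)

# The "federation layer": an empty in-memory database that stores no rows
# of its own and only knows where the two sources live.
hub = sqlite3.connect(":memory:")
hub.execute("ATTACH DATABASE ? AS crm", (crm_path,))
hub.execute("ATTACH DATABASE ? AS billing", (billing_path,))

# One query joins data that still resides in two separate systems.
totals = hub.execute(
    """
    SELECT c.email, SUM(i.amount)
    FROM crm.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.email
    ORDER BY c.email
    """
).fetchall()
hub.close()
```

No pipeline copies rows between the two files here; the join is resolved at query time against the sources themselves.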
Query federation offers some significant advantages over the conventional data integration methods centered around data warehouses:
- Query federation does away with the prohibitive setup costs of data warehouses as it acts like middleware while bringing data in from remote systems without requiring investment in new hardware.
- Query federation enables users to work with virtual databases without actually moving the data. This helps with data management bills that can quickly get out of hand while using data warehouses.
- The absence of ETL processes in query federation translates into cost savings as maintenance costs tend to be much lower compared to ETL-based data integration methods.
- While data warehouses use batch ingestion and update the data in long intervals, query federation offers real-time or near-real-time data, facilitating decision-making in dynamic environments.
One of the core tenets of data analytics today is the ability to handle and make use of event data. Events are changes in state, such as a transaction or any action a user takes on a website or application. An ordered sequence of these events forms an event stream that can reveal a trove of information for a business.
Being able to ingest event data gives decision-makers visibility into how users use a product or interact with a system. Therefore, it is central to value-creating activities like product development, UX design, and growth frameworks such as product-led growth.
However, ingesting event data poses a serious challenge for organizations. It requires the capability to deal with a continuous stream of high-volume, high-velocity data and simultaneously join it with data from other sources if need be. This stream can easily overwhelm a conventional database, and manual data entry cannot keep up with it if you prefer to store the data in a big data table format.
This is where streaming ingestion comes into play. Streaming ingestion is a low-latency process for continuously ingesting high-velocity data from streaming sources such as IoT sensors, various apps, networked devices, or online transactions. While batch processing involves large amounts of data being processed at regular, predetermined intervals, streaming ingestion allows for real-time or near-real-time data processing.
Streaming ingestion is what enables digital marketing teams to listen to social media platforms for mentions, reviews, and feedback and continuously gauge how public opinion views a brand and a product. This technique also allows a product team to track usage data so they can remove the friction points from the SaaS app they are building.
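One common way to keep a continuous stream from overwhelming a database is micro-batching: buffer incoming events in memory and write them in bulk. The sketch below is a minimal illustration of that pattern, not a description of how any particular product implements it. The `event_stream` generator, the `BufferedWriter` class, the `events` table, and the batch size are all assumptions made for the example, with an in-memory SQLite database standing in for the destination.

```python
import sqlite3
from typing import Iterator


def event_stream(n: int) -> Iterator[tuple]:
    """Stand-in for a live source; yields (user_id, event_name, timestamp)."""
    for i in range(n):
        yield (f"user-{i % 3}", "page_view", 1_700_000_000 + i)


class BufferedWriter:
    """Collect events in memory and bulk-insert them once a batch is full."""

    def __init__(self, conn: sqlite3.Connection, batch_size: int = 100):
        self.conn = conn
        self.batch_size = batch_size
        self.buffer: list = []
        self.flushes = 0

    def write(self, event: tuple) -> None:
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        # One bulk insert and one commit per batch instead of one per event.
        self.conn.executemany("INSERT INTO events VALUES (?, ?, ?)", self.buffer)
        self.conn.commit()
        self.buffer.clear()
        self.flushes += 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, name TEXT, ts INTEGER)")

writer = BufferedWriter(conn, batch_size=100)
for event in event_stream(250):
    writer.write(event)
writer.flush()  # write whatever is left in the buffer

stored = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

A production setup would typically also flush on a timer, so that a quiet period does not leave events sitting in the buffer, and would need a strategy for events lost if the process crashes before a flush.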
Actioner, one of the first customers of Peaka, was tested by the challenges of handling event data early on. Its stack consisted of HubSpot (for customer info and emails), Segment (for event data), and DynamoDB (for the app IDs and flow IDs that tie events to the customer emails). There were a few options for handling the high-volume event data Segment pumped into Actioner’s databases, but none of them were ideal:
- PostgreSQL did not work after a certain threshold as storage and querying became problematic.
- Keeping the Segment data on the file system could be an option but that would render querying the data difficult.
- Storing that data in one of the big data table formats would require manual data entry, which would not be practical.
Actioner needed to invest in a whole new data stack to deal with this challenge. Instead, they chose to go with Peaka and were able to set up a stack in minutes and handle the Segment data. Peaka made a difference here by buffering that data and loading a big portion of it via bulk-insert.
The Actioner team can now identify individual customers in the huge pile of data being sent by Segment and spot where they are in their respective journeys. This gives them deeper insights into how their customers convert and upgrade, helping them adjust messaging accordingly.
Change data capture (CDC) is the process of spotting and capturing changes made in a database. By targeting and capturing only the incremental changes made to data, CDC reduces your workload as you no longer have to scan whole databases for changes. Having spotted the changes in a database, CDC then transfers them to downstream processes and systems in real time as new database events occur. This process helps keep data in sync and is perfect for data environments where high-volume data is changed frequently.
CDC lends itself to use in a cloud-based infrastructure because continuous real-time updates to data keep it fresh, minimizing the possible downtime needed for data migration.
CDC also helps with optimizing resource allocation and upholding system performance. The data replication carried out by CDC creates an analytics database that can be used for resource-intensive querying. This helps offload main database systems, shielding them from excess load.
Another use case that CDC fits perfectly is real-time data analytics. Nightly updates to data render it stale, and yesterday’s data does not hold much value for the decisions you need to make today if you are operating in a dynamic environment. CDC enables data consumers to access the most up-to-date data at all times, making data-driven decision-making possible. The log-based version of CDC, particularly, achieves this without creating any extra computational stress on source databases. Data consumers enjoy real-time data replication in this method while accessing only the changes in transaction logs rather than the source databases themselves.
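The log-based approach can be illustrated with a toy model. In the sketch below, a plain Python list stands in for a database transaction log and a dictionary stands in for the analytics replica. The `Change` record, its fields, and the `apply_changes` function are hypothetical names chosen for this example, and the log is assumed to be ordered by sequence number. The consumer remembers the last log position it processed, so each run applies only the new changes and never rescans the source tables.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    lsn: int             # position of the entry in the log
    op: str              # "insert", "update", or "delete"
    key: int             # primary key of the affected row
    row: Optional[dict]  # new row contents, or None for a delete


def apply_changes(replica: dict, log: list, last_lsn: int) -> int:
    """Apply log entries newer than last_lsn and return the new position."""
    for change in log:
        if change.lsn <= last_lsn:
            continue  # already replicated in an earlier run
        if change.op == "delete":
            replica.pop(change.key, None)
        else:
            replica[change.key] = change.row
        last_lsn = change.lsn
    return last_lsn


log = [
    Change(1, "insert", 1, {"plan": "free"}),
    Change(2, "insert", 2, {"plan": "free"}),
    Change(3, "update", 1, {"plan": "pro"}),
]

replica: dict = {}
position = apply_changes(replica, log, last_lsn=0)

# A new change arrives later; only that entry is processed on the next run.
log.append(Change(4, "delete", 2, None))
position = apply_changes(replica, log, last_lsn=position)
```

The source tables are never queried in this model. All the replica needs is the log and its own bookmark, which is why log-based CDC puts so little load on the source database.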
The modern data stack is based on a three-decades-old method of data integration. It is an invasive method that meddles with how companies are organized. Its gravitational pull forces other processes to change so they can fit with the modern data stack.
With zero-ETL, startups get a lightweight approach to data integration that won’t pull resources away from other functions—something they can plug-and-play at will without a thorough reorganization of the way they operate. Zero-ETL allows startups to be startups, without having to operate like enterprises when they lack the means to.