When To Build vs. Buy Data Pipelines
Deciding whether to build or buy new software is a challenge every engineer has to deal with. In the world of data engineering, building data pipelines in-house used to be a common choice because it only required a few data ingestion scripts to pipe your data into your data store, whether that was a data warehouse or a data lake. But in the world of big data, this is changing rapidly.
As data engineers, we now have to handle high volumes of data from dozens of constantly changing data sources, and with the rise of real-time data use cases, latency matters more than ever. There are many approaches we can take in this new world to develop our data pipeline architecture. If we choose to build our own data pipelines, we often end up with data integration systems that are handcrafted by multiple engineers over a long period of time, each adding their own special spin to the code base. In the end, most of these data pipeline systems end up looking very similar to a framework that already exists, like Airflow.
This is because, at the end of the day, most pipeline systems require several key components:
- A scheduler
- A meta-database
- Tasks
- Jobs
- A web UI
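To make the list above concrete, here is a minimal sketch of how those components tend to fit together in a home-grown system. Everything in it (the `Task`, `Job`, `MetaDatabase`, and `Scheduler` names, the `task_runs` table, the one-hour default interval) is a hypothetical illustration rather than Airflow's API or any particular team's design, and the web UI is left out entirely.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Task:
    """A single unit of work, such as one extract or load step."""
    name: str
    run: Callable[[], None]


@dataclass
class Job:
    """An ordered group of tasks that the scheduler runs together."""
    name: str
    tasks: List[Task] = field(default_factory=list)
    interval_seconds: int = 3600  # hypothetical default: run hourly


class MetaDatabase:
    """Stores the outcome of every task run (the 'meta-database')."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS task_runs "
            "(job TEXT, task TEXT, status TEXT, ran_at REAL)"
        )

    def record(self, job: str, task: str, status: str, ran_at: float) -> None:
        self.conn.execute(
            "INSERT INTO task_runs VALUES (?, ?, ?, ?)",
            (job, task, status, ran_at),
        )
        self.conn.commit()

    def last_run(self, job: str) -> Optional[float]:
        row = self.conn.execute(
            "SELECT MAX(ran_at) FROM task_runs WHERE job = ?", (job,)
        ).fetchone()
        return row[0]  # None when the job has never run

    def history(self) -> List[Tuple[str, str, str]]:
        return self.conn.execute(
            "SELECT job, task, status FROM task_runs ORDER BY rowid"
        ).fetchall()


class Scheduler:
    """Decides which jobs are due and runs their tasks in order."""

    def __init__(self, db: MetaDatabase) -> None:
        self.db = db
        self.jobs: List[Job] = []

    def add_job(self, job: Job) -> None:
        self.jobs.append(job)

    def tick(self, now: float) -> None:
        """Run every job whose interval has elapsed at time `now`."""
        for job in self.jobs:
            last = self.db.last_run(job.name)
            if last is not None and now - last < job.interval_seconds:
                continue  # not due yet
            for task in job.tasks:
                try:
                    task.run()
                    self.db.record(job.name, task.name, "success", now)
                except Exception:
                    self.db.record(job.name, task.name, "failed", now)
                    break  # skip downstream tasks after a failure
```

Even this toy version leaves out retries, dependencies between jobs, alerting, and backfills, which is roughly where in-house systems start to grow toward an existing framework.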
As engineers, we have a tendency to approach most of our problems as build vs. buy. However, we don’t always weigh the opportunity costs, and sometimes building is not the best option. The right answer depends on overall company goals and where your company is in its data maturity journey. In this article, we will discuss the build vs. buy decision for both real-time data streams and batch processing pipelines (ETL/ELT) to help your team make the right choice for its next data infrastructure component.
The challenges of building and maintaining data pipelines
Building data infrastructure is a long process, and maintaining it is time-consuming. Even small requests can become arduous to take on. This is amplified if your company works with different types of data and dozens of data sources with different schemas, requiring you to maintain all the connectors as the underlying APIs and sources change.
In addition, we are often bombarded with ad hoc requests for new data sets from other teams while maintaining the current code base. You know the feeling: it’s like death by 1,000 cuts. It keeps your urgent-and-important quadrant full of redundant and uninspiring work, and it keeps you from fitting other, more strategically important priorities into your workflow.
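As a rough illustration of what that connector maintenance involves, the sketch below checks an incoming record against the schema a hand-built connector expects. The field names, the `EXPECTED_SCHEMA` mapping, and the `detect_schema_drift` helper are all made up for this example; a real connector would also have to handle nested fields, nullability, pagination, and authentication changes.

```python
from typing import Any, Dict, List

# Hypothetical schema that a hand-built connector was written against.
EXPECTED_SCHEMA: Dict[str, type] = {
    "id": int,
    "email": str,
    "created_at": str,
}


def detect_schema_drift(
    record: Dict[str, Any], expected: Dict[str, type]
) -> List[str]:
    """Return a list of human-readable differences from the expected schema."""
    problems: List[str] = []
    for name, expected_type in expected.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(
                f"type changed: {name} is {actual}, expected {expected_type.__name__}"
            )
    for name in record:
        if name not in expected:
            problems.append(f"unexpected field: {name}")
    return problems


# Example: the source API started sending ids as strings, dropped
# created_at, and added a phone field.
drift = detect_schema_drift(
    {"id": "42", "email": "a@example.com", "phone": "555-0100"},
    EXPECTED_SCHEMA,
)
for problem in drift:
    print(problem)
```

Multiply a check like this, plus the fixes it triggers, by dozens of sources and you get a sense of how much time connector upkeep can absorb.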
The key takeaway here is that constant maintenance and ad hoc requests significantly slow down real business impact and introduce scalability challenges, so buying solutions or using managed services can be a good choice for many teams.
Benefits of buying
There are always trade-offs between build and buy. Let’s start by talking about some benefits of buying solutions.
Quick turnaround - Bought solutions often meet the majority of a company’s use cases quickly. After the sales cycle, the only time required is implementation, so your team can start implementing the new tooling as soon as it is purchased. Often, you’ll have a head start because you’ve already tested out the tool via an open source version, free offering, or trial.
Less maintenance - Maintenance cost is an open secret. All solutions, built or bought, have maintenance costs. The difference is who pays this cost. When you buy a solution, the solution provider shoulders the burden of all maintenance and any technical debt, distributing these costs over its whole customer base. This frees your team to spend time working on ways to add value rather than running on the hamster wheel of maintenance.
You don’t need to keep up with APIs (in the case of connectors) - Keeping up with connector changes is a big (and really annoying) time suck as a data engineer. This is somewhat connected to maintenance. However, rebuilding connectors is such a staple of many data engineers’ work that it deserves its own point. Many tools provide connectors out of the box, shifting the work of keeping up with connectors from your company to the solution provider.
New features don’t need to be built by you - Buying a solution removes the need for your company to keep improving the tool. Instead, all optimization and new feature development is in the hands of the solution provider. In this modern era where there are a dozen solutions for just about every problem, competition keeps providers motivated to develop a better product. So, when you buy, you won’t need to find the funds and the time to develop, maintain, and improve a custom-built solution. Instead, you can push the solution provider to constantly improve their offering.
The challenges that come with buying
Of course, buying is far from a perfect solution. For every benefit you get from buying a great solution, there will be trade-offs. Here are a few:
Less flexibility - Most bought tools limit how much of their functionality you can edit or modify. So, if your company has very specific use cases or requirements that the tool doesn’t support, you will need to use some form of workaround.
Less control - Let’s say the solution you purchase has all of the functionality that you require now. In the future, if you have new use cases or want to make small edits that may just be personal preferences, these may not be possible. You can file tickets with the vendor but, since you didn’t build the solution, it might be a while until these tickets are addressed. As referenced above, your team isn’t responsible for building new features. This is great when you don’t have an actual development team or the time, but if you have the team, the time, and the need, building gives your team control.
Vendor lock-in - Whenever you pick a tool, built or bought, there is an inevitable amount of lock-in. With a bought solution, this can be an even bigger deterrent because you’re paying a monthly invoice to the vendor, and you may have a multi-year agreement keeping you tied to the vendor for a longer period of time.
Lots of different tools lead to multiple learning curves - Every tool has an unavoidable learning curve, no matter how easy or low code it is. Here’s another way to look at this. There is a broad range of users who know Python and SQL. These are general skills that most programmers have. Picking a more unique tool with a smaller user base, even if it’s lower code, involves a learning curve, and hiring people who don’t know the tool you work with makes for slow initial development.
Buy or build considerations
Talking through pros and cons doesn’t make it that easy to check off whether to build or buy, so how should you think about these decisions? Here are a few points to consider when trying to answer the build vs. buy question for technology.
When to buy
✅ Your team’s main focus is not building software, and it doesn’t have a track record of delivering large-scale solutions.
✅ Your team has budget limitations, and there are tools that fit that budget.
✅ You have a tight timeline and need to turn around value quickly.
✅ Your team has limited resources or technical knowledge for the specific solution they would need to build. For example, if you need to build a machine learning model to detect fraud, but no one on your team has done it before, it might be time to look for a solution.
When to build
✅ Your executive team needs a unique function or ability that no existing solution currently offers.
✅ You have a bigger scope and vision for the solution and plan to sell it externally.
✅ You don’t have a tight timeline (yeah, right).
✅ Your team is proficient in delivering large-scale projects.
So Which Is Right For You? Build Or Buy?
In a fast-moving world where data volumes and engineering talent costs are both increasing, it’s important to balance build vs. buy. Yes, there are pros and cons to both building and buying. However, in the age of the modern data stack, the cloud is supplying us with a host of pre-built tools with sensible pricing models that we can test out with a free trial to help us answer this question.
Truthfully, most companies are too busy with other operational needs to fully commit to internally building tools for data flow automation. In my experience in the data management world, even when tools get built, they start to degrade once the original developer leaves. Not to mention, with average salaries rising at all kinds of tech companies, not just the MANGAs, it’s hard to keep talent for long. Thus, if your company doesn’t sell software or your team doesn’t have the time, it’s worth signing up for that free trial to test-drive a “buy” solution, because the total cost of building is probably far higher than the cost of a bought solution. More and more, I’m talking to data engineering teams and solutions architects across the data sphere who are able to do more with less by buying the right data pipeline solutions, especially as modern SaaS tools offer increased flexibility for developers.
Benjamin Rogojan
Seattle Data Guy, Data Science and Data Engineering Consultant