Why Accelerating Data Engineering Across Public Clouds and Private Data Centers is a Game Changer

After the first six months, 2021 was shaping up to be a disappointing year for Netflix. Customers were leaving the streaming service, weary of pandemic-induced binge-watching. However, by the time it released its third quarter earnings report, Netflix had a comeback story to share.

Netflix released “Squid Game” on September 17 with little fanfare, but the South Korean series drew in viewers across the globe at an astonishing rate. In its Q3 report published just a month later, Netflix called “Squid Game” its “biggest TV show ever,” with 142 million households watching. The news gave Netflix a boost on the stock market and the information it needed to prepare its 2022 content plans.

A surprise hit like “Squid Game” can shake up any business, whether it’s a streaming service, an online retailer or any other business sensitive to customer preferences. But without the solutions in place to quickly gather the right data, store it and make sense out of it, that success will go to waste.

Getting useful insights out of a surprise scenario like this requires fast-paced data engineering. You’ve got to have your data engineering team ready to design a system for collecting, storing and analyzing new data at scale.

If you happen to run a global streaming service, you’d probably want to build a data cloud to ingest events from viewers all over the world, using all kinds of devices: viewers who are skipping sections, fast-forwarding through other scenes, or repeatedly watching scenes with their favorite characters. Once you’ve ingested those events in the data cloud, you have to move the data to storage (typically a long-term storage pool in the cloud), then use a large-scale data processing framework to make sense of it all.

Data engineers typically know what they need to get done. The problem is that their environment doesn’t always make it easy. If you’re working on premise, it can be hard to get data-intensive solutions off the ground quickly. However, cloud solutions come with lock-in and unpredictable pricing.

The game-changer in this scenario is a hybrid solution that will allow you to accelerate data engineering. Having a solution that works across both public and private clouds lets you keep your business-critical systems running in the data center. At the same time, you can spin up research environments for your data analysts in the cloud as needed. If you have solutions that work both in the cloud and on premise, then there’s no real lock-in to speak of. You have full runway to move from quick iteration in the cloud to robust, well-governed production environments on premise.

Let’s break down the problem. If you’re working on premise, you’ll have to contend with long procurement times and big upfront costs. Be ready to buy servers — for a data-intensive solution, you’re likely going to need hundreds of servers. Every step of the process will take time: getting your project approved, placing your server order with a supplier, waiting for the delivery, and unpacking and racking your servers. By the time you actually deploy your system, you could easily be two years down the road.

On the other hand, working in the cloud comes with its own set of problems. The pay-as-you-go model is great for short-term projects. If you have a team conducting ad hoc research in basic cloud environments, they may start up a cluster of 50 servers and use it for a few hours. However, if you’re running a long-term, 50-node cluster, 24 hours a day, seven days a week, the costs quickly start to add up.
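To put rough numbers on that contrast, here is a minimal sketch of the pay-as-you-go arithmetic. The hourly rate, the four-hour session and the 30-day month are assumptions chosen for illustration, not any provider’s actual pricing, and `cluster_cost` is a hypothetical helper.

```python
# Hypothetical on-demand price per node-hour; not a real provider's rate.
HOURLY_RATE_USD = 0.50


def cluster_cost(nodes: int, hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Pay-as-you-go cost of running `nodes` servers for `hours` hours."""
    return nodes * hours * rate


# Ad hoc research: 50 nodes for an assumed 4-hour session.
adhoc = cluster_cost(nodes=50, hours=4)

# Always-on: the same 50 nodes, 24 hours a day for an assumed 30-day month.
always_on = cluster_cost(nodes=50, hours=24 * 30)

print(f"ad hoc session:  ${adhoc:,.2f}")
print(f"always-on month: ${always_on:,.2f}")
```

Under these assumed figures, the always-on cluster costs 180 times as much per month as a single research session, because the bill scales linearly with node-hours.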

Meanwhile, performance on the cloud simply can’t measure up to on-premise environments. If you’re leveraging deep learning and want to use GPUs in combination with your data-intensive infrastructure, the cloud is not the best solution.

By taking a hybrid approach, you get the best of both worlds. You can quickly spin up a proof of concept in the cloud and simultaneously begin procuring the performant equipment you need for an on-premise deployment. By the time your application is ready to run in production, your on-premise environment should be ready to go.

A hybrid model offers more than just a solution for ad hoc research environments. For a business that needs to run an intense data-processing job intermittently, cloud bursting is the right solution. For instance, if you have a data-processing job that requires 100 computers running in parallel once a week, it wouldn’t be worth investing in the data center infrastructure to make that happen.
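The reasoning behind that example can be sketched as a utilization calculation. The six-hour job duration is an assumption made for illustration (the scenario above only says the job runs once a week), and `weekly_utilization` is a hypothetical helper.

```python
HOURS_PER_WEEK = 24 * 7


def weekly_utilization(job_hours: float) -> float:
    """Fraction of the week that hardware bought for the job is busy."""
    return job_hours / HOURS_PER_WEEK


MACHINES = 100   # from the scenario above
JOB_HOURS = 6    # assumed duration of the weekly job

used_node_hours = MACHINES * JOB_HOURS        # what bursting would rent
owned_node_hours = MACHINES * HOURS_PER_WEEK  # what owned hardware provides
busy = weekly_utilization(JOB_HOURS)

print(f"node-hours used per week:  {used_node_hours}")
print(f"node-hours owned per week: {owned_node_hours}")
print(f"utilization of owned hardware: {busy:.1%}")
```

With these assumptions, dedicated hardware would sit idle for more than 96 percent of the week, which is why renting the capacity only when the job runs is the better fit.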

A hybrid environment also offers the ability to set up a self-service environment for users. Instead of having to wait for an IT manager or service desk to provision a big data cluster — which can take days — you can use an enterprise self-service provisioning system to get access to a cluster and start working.

By giving data engineering teams options like self-servicing, as well as access to popular tools, you’re not just benefitting the business — you’re also investing in your data engineering team. Empowering the teams you have is key to getting the most out of your data — after all, you can’t accelerate data engineering without data engineers. Thriving in the data economy is hard, but doable — with the right environment, the right tools and the right teams.

About the Author

Rob Gibbon, Product Manager at Canonical, the publisher of Ubuntu, has 20+ years’ industry experience building, scaling, managing and serving the teams, technology and environments behind 50+ commercial web properties and data hubs across all major industries in varied roles.
