These days you can hardly go an hour without reading about generative artificial intelligence. While we are still in the embryonic stage of what some have called the “steam engine” of the fourth industrial revolution, there is no doubt that “GenAI” is shaping up to transform nearly every industry — from finance and healthcare to law and beyond.
Slick user-facing apps may attract most of the fanfare, but the companies powering this revolution are currently benefiting the most. Just this month, chipmaker Nvidia briefly became the world’s most valuable company, a $3.3 trillion juggernaut driven substantially by demand for AI computing power.
But in addition to GPUs (graphics processing units), businesses also need infrastructure to manage the flow of data — to store, process, train, analyze, and ultimately unlock the full potential of AI.
One company that wants to take advantage of this is Onehouse, a three-year-old Californian startup founded by Vinoth Chandar, who created the open source Apache Hudi project while serving as a data architect at Uber. Hudi brings the benefits of data warehouses to data lakes, creating what has come to be known as a “data lakehouse,” enabling support for actions such as indexing and real-time querying of large data sets, be they structured, unstructured or semi-structured.
For example, an e-commerce company that continuously collects customer data spanning orders, comments and related digital interactions will need a system to ingest all that data and ensure it’s up-to-date, which could help it recommend products based on activity. Hudi enables ingesting data from various sources with minimal latency, with support for delete, update and insert (“upsert”), which is crucial for such real-time data use cases.
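To make the upsert idea concrete, here is a minimal sketch of what writing such a batch of order records into a Hudi table can look like with PySpark. The table name, columns and storage path are hypothetical, and it assumes the Apache Hudi Spark bundle is on the classpath:

```python
from pyspark.sql import SparkSession

# Spark session; the Apache Hudi Spark bundle must be available on the classpath.
spark = (SparkSession.builder
    .appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate())

# A hypothetical batch of freshly ingested order events (names are illustrative).
orders = spark.createDataFrame(
    [("order-1001", "cust-42", 59.99, "2024-06-01T10:15:00Z")],
    ["order_id", "customer_id", "total", "updated_at"],
)

# An "upsert" updates rows whose record key already exists and inserts the rest,
# so the table always reflects the latest state of each order.
(orders.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.recordkey.field", "order_id")     # record key
    .option("hoodie.datasource.write.precombine.field", "updated_at")  # latest version wins
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3a://example-bucket/lakehouse/orders"))  # hypothetical storage path
```

Re-running the same write with a newer `updated_at` for `order-1001` would replace the stale row rather than duplicate it, which is the property that keeps a continuously ingested table query-ready.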
Onehouse builds on this with a fully managed data lakehouse that helps companies deploy Hudi. Or, as Chandar puts it, it “starts ingesting and standardizing data into open data formats” that can be used with nearly all major tools in the data science, artificial intelligence, and machine learning ecosystems.
“Onehouse abstracts away the build-out of low-level data infrastructure, helping AI companies focus on their models,” Chandar told TechCrunch.
Today, Onehouse announced that it has raised $35 million in a Series B funding round as it brings two new products to market to improve Hudi’s performance and reduce cloud storage and processing costs.
Down at the (data) lakehouse
Chandar created Hudi as an internal project at Uber in 2016, and since the company donated the project to the Apache Software Foundation in 2019, Hudi has been adopted by the likes of Amazon, Disney and Walmart.
Chandar left Uber in 2019 and, after a brief stint at Confluent, founded Onehouse. The startup emerged from stealth in 2022 with $8 million in seed funding, and followed soon after with a $25 million Series A. Both rounds were co-led by Greylock Partners and Addition.
These VC firms have joined forces again for a follow-up Series B, though this time, David Sacks’ Craft Ventures is leading the round.
“The data lakehouse is quickly becoming the standard architecture for organizations looking to pool their data to power new services like real-time analytics, predictive ML and GenAI,” said Craft Ventures partner Michael Robinson.
For context, data warehouses and data lakes are similar in that both serve as a central repository for aggregating data. But they do so in different ways: A data warehouse is ideal for processing and querying historical, structured data, while data lakes have emerged as a more flexible alternative for storing vast amounts of raw data in its original form, with support for many types of data and high-performance queries.
This makes data lakes ideal for AI and machine learning workloads, as it is cheaper to store raw, untransformed data while still supporting more complex queries, since the data is retained in its original form.
However, the trade-off is a whole new set of data management complexities, which risks degrading data quality given the vast variety of data types and formats. This is partly what Hudi aims to solve by bringing some key features of data warehouses to data lakes, such as ACID transactions to support data integrity and reliability, as well as improved metadata management for more diverse data sets.
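One practical payoff of those warehouse-style guarantees: every Hudi write lands as an atomic commit on the table’s timeline, so readers can ask for only the records that changed after a given commit rather than rescanning everything. A rough sketch, reusing the hypothetical orders table and Hudi-enabled Spark session from above:

```python
# Incremental query: read only records committed after a known commit instant
# on the Hudi timeline. `spark` is the Hudi-enabled SparkSession from earlier.
begin_time = "20240601000000"  # commit timestamp to read from (exclusive)

changed_orders = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", begin_time)
    .load("s3a://example-bucket/lakehouse/orders"))  # same hypothetical path

changed_orders.show()
```

This kind of incremental pull is what lets downstream pipelines stay fresh without repeatedly reprocessing the full data set.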
Since it is an open source project, any company can deploy Hudi. A quick look at the logos on Onehouse’s website reveals some impressive users: AWS, Google, Tencent, Disney, Walmart, ByteDance, Uber and Huawei, to name a handful. But the fact that such large companies leverage Hudi internally is indicative of the effort and resources required to build it out as part of an internal data lakehouse.
“While Hudi provides rich functionality for ingesting, managing and transforming data, companies still need to integrate about half a dozen open source tools to achieve their goals of a production-grade data lakehouse,” Chandar said.
That’s why Onehouse offers a fully managed, cloud-native platform that ingests, transforms and optimizes data in a fraction of the time.
“Users can get an open data lakehouse up and running in less than an hour, with broad interoperability with all major services, warehouses and data lake engines,” Chandar said.
The company was reluctant to name its commercial customers, apart from the couple mentioned in its case studies, such as the Indian unicorn Apna.
“As a new company, we are not publicly sharing Onehouse’s entire commercial customer list at this time,” Chandar said.
With a fresh $35 million in the bank, Onehouse is now expanding its platform with a free tool called Onehouse LakeView, which provides observability into lakehouse functionality, with insights on table stats, trends, file sizes, timeline history, and more. This builds on the existing observability metrics provided by the core Hudi project, adding extra context on workloads.
“Without LakeView, users have to spend a lot of time interpreting metrics and deeply understanding the entire stack to root-cause key performance issues or inefficiencies in pipeline configuration,” Chandar said. “LakeView automates this and provides email alerts on good or bad trends, highlighting data management needs to improve query performance.”
In addition, Onehouse is also introducing a new product called Table Optimizer, a managed cloud service that optimizes existing tables to speed up data ingestion and transformation.
“Open and Interoperable”
We shouldn’t ignore the myriad other big players in the space. Databricks and Snowflake are increasingly embracing the lakehouse paradigm: Earlier this month, Databricks reportedly doled out $1 billion to acquire a company called Tabular, in an effort to create a common lakehouse standard.
Onehouse has certainly entered a hot space, but it hopes its focus on an “open and interoperable” system that makes it easier to avoid vendor lock-in will help it stand the test of time. It essentially promises the ability to make a single copy of data universally accessible from almost anywhere, including the native services of Databricks, Snowflake, Cloudera and AWS, without having to build separate data silos in each.
As with Nvidia in the GPU realm, the opportunities that await companies in the data management space are hard to ignore. Data is the cornerstone of AI development, and a lack of sufficient good-quality data is a major reason why many AI projects fail. But even when the data exists by the bucketload, companies still need the infrastructure to ingest, transform and standardize it to make it useful. That bodes well for Onehouse and its ilk.
“From the data management and processing side, I believe that quality data delivered by a solid data infrastructure foundation will play a critical role in getting these AI projects into real-world production use cases — to avoid garbage-in/garbage-out data problems,” Chandar said. “We’re starting to see such demand from data lakehouse users as they struggle to scale their data processing and querying needs to build these newer AI applications on enterprise-scale data.”