50 of the world’s top data startups at a glance

More than a decade after the concept of “big data” was born, data remains one of the most important and fastest-growing drivers of innovation among large corporations and start-ups. From providing the pulse check that underlies business operations to intelligently automating routine tasks through machine learning, data has become the central nervous system for decision-making in organizations of all sizes. Furthermore, the use of data has gone far beyond data scientists, data analysts, and data engineers—everyone is a data producer and consumer.

As a result of this increased focus on data, the data management business has become one of the fastest-growing areas of infrastructure, estimated to be worth more than $70 billion and accounting for more than one-fifth of all enterprise infrastructure spending in 2021. The market is growing this quickly because it combines software engineering, analytics, and artificial intelligence while riding the shift to cloud computing. (For more information on the architectural evolutions and drivers behind this huge trend, see Emerging Architectures for Modern Data Infrastructures.)

The evolution of the data industry over the past few years has also spawned some exciting and influential enterprise software companies. More recently, public giants like Snowflake and Confluent have transformed the way thousands of businesses operate and millions of products are produced. Most people, however, are less familiar with the next generation of impactful, category-defining companies.

2021 saw tens of billions of dollars in venture capital flow into data companies, breaking records, and 2022 is already off to a strong start. We have compiled the inaugural Data50, a list of the leading companies across the most exciting data categories. Collectively, these 50 companies are worth more than $100 billion and have raised about $14.5 billion in total capital, with 20 of them reaching unicorn status by 2021.

The Data50 companies fall into seven subcategories:

AI/ML (artificial intelligence and machine learning), BI & Notebooks (business intelligence and notebooks), Customer Data Analytics, Data Governance & Security, Data Observability, ELT & Orchestration, and Query & Processing.


1. Query and processing technology is the core engine for accessing, aggregating and computing data. It involves two broad categories: batch processing (like Databricks and Starburst) and real-time processing (like ClickHouse and Imply). The latter has received more and more attention over the past few years due to the increasing demand for real-time applications.

2. AI/ML (artificial intelligence and machine learning) includes software that applies algorithmic modeling and machine learning to process large-scale data. Judging by the number of companies on the list, the field is maturing and thriving. Some players focus on one specific type of data (e.g. Rasa and Hugging Face for natural language), while others focus on the productization of AI (e.g. Scale, Tecton and Weights and Biases) or serve as the “compute layer” that runs AI workloads (e.g. Anyscale).

3. ELT and orchestration tools support the movement of data. They form the transport layer that ensures data arrives at its destination accurately and on time. This category evolved from traditional ETL vendors. Unlike those vendors, the new players are mostly cloud-native (e.g. Fivetran and dbt), developer-friendly (e.g. Astronomer and Prefect), and able to handle more complex dependencies between different data environments.

4. As the data stack becomes more complex and involves more stakeholders, data governance and security are becoming key issues. Governance tools are needed, especially in highly regulated industries, to keep data secure and consistent across the data lifecycle (e.g. OneTrust and Collibra). This category is relatively new and typically serves large enterprises in regulated industries.

5. Traditionally, customer data analytics has been the responsibility of the marketing team. However, as customer data has grown in importance, data teams have become more involved in integrating it with central data platforms. Companies in this category focus on capturing customer data (such as Rudderstack and ActionIQ) or on manipulating that data to serve front-line business use cases (such as Census and Hightouch).

6. BI & notebooks cover the consumption layer of data. Although it is an established category, new players like Preset or Metabase are taking an open source-first approach and attracting technical data engineers as well as business intelligence teams. The rapidly changing nature of data requirements also creates more demand for iterative and interactive notebooks (e.g. Hex) and automated insight generation (e.g. Sisu).

7. Data observability draws inspiration from the best practices of the software engineering stack. As the data stack becomes more reliant on upstream and downstream tools, and the accuracy of the data has broader implications, observability is the latest category to provide monitoring and diagnostic capabilities across data streams.

While the main driver of market adoption is the increase in data volume and usage, the underlying drivers are different for each category. For example, advances in query and processing have been driven primarily by the separation of compute and storage, migration to the cloud, and cheaper computing power. At the same time, the adoption of operational tools in data governance and data observability is largely driven by the growing complexity of operational use cases and data workflows.

Below is the list of Data50 companies (name, type, location, valuation range and website):






A look at the Data50 by segment (funding distribution, number of companies, and location):

Query and processing companies raised the largest share of capital

The query and processing category accounts for only one-fifth of the companies in the Data50, but it has attracted a staggering share of the money: almost 50% of all funding. Although this figure is inflated by Databricks’ recent $1.6 billion funding round, even without it the category would still account for 37% of all funding, more than double the next category.


In terms of the number of companies, the distribution is more balanced. AI/ML is the largest category in terms of number of companies, mainly because the field is still evolving and requires a new set of standalone tools to train, measure and produce models. (For more on how the field is evolving, read Emerging Architectures for Modern Data Infrastructures.)


Data50 is concentrated in the San Francisco Bay Area

Of the 50 companies, 47 (94%) are located in the United States and 3 are based internationally. Thirty-three of these companies are located in the San Francisco Bay Area and nine are located along the I-95 corridor (Washington, D.C., Philadelphia, New York and Boston). Two are in Seattle, one in Cincinnati and one in Atlanta.

This distribution is heavily influenced by the historical location of large-scale data ecosystems (for example, Oracle and Teradata are both established in the Bay Area). However, we are seeing more and more data companies (such as Firebolt and Matillion) pop up across the globe, as data engineering talent and demand for data tools span nearly every continent.


AI/ML category drives surge in new data companies in 2019

Most Data50 companies were founded after 2014, with company formation peaking around 2019, driven by the explosion of AI/ML tools. In fact, even more data companies have been formed since 2019, but because we are focusing on companies that have reached a certain size, most of these newer companies are not yet on this list.


Investments in every category are growing

Looking at investment in each category, the most notable trend is that AI/ML companies are gaining more investor interest than ever before, mostly in the early stages. The same goes for ELT and orchestration, where the growth is mostly driven by giant rounds from Fivetran and dbt. Query and processing companies continue to attract large sums of money, although these companies tend to be at a later stage.


The authors of this article are Jennifer Li, Sarah Wang, and Jamie Sullivan. Jennifer Li is a partner at a16z where she focuses on enterprise companies. Sarah Wang is a general partner at a16z, focusing on growth stage investments. Jamie Sullivan is a partner in the a16z Growth investment team, focusing on late-stage companies in the consumer, enterprise and fintech sectors.

The authors close by stating that they firmly believe the next 10 years will be the decade of data, including infrastructure, applications, and everything in between. As a result, they expect to continue seeing record growth, funding, and market capitalization, which they will track annually on this list.

Posted by: CoinYuppie. Reprinted with attribution to: https://coinyuppie.com/50-of-the-worlds-top-data-startups-at-a-glance/
