Resilience and Vibrancy: The 2020 Data & AI Landscape
In a year like no other in recent memory, the data ecosystem is showing not just remarkable resilience but exciting vibrancy.
When COVID hit the world a few months ago, an extended period of gloom seemed all but inevitable. Yet, as per Satya Nadella, “two years of digital transformation [occurred] in two months”. Cloud and data technologies (data infrastructure, machine learning / artificial intelligence, data driven applications) are at the heart of digital transformation. As a result, many companies in the data ecosystem have not just survived, but in fact thrived, in an otherwise overall challenging political and economic context.
Perhaps most emblematic of this is the blockbuster IPO of Snowflake, a data warehouse provider, which took place a couple of weeks ago and catapulted Snowflake to a $69B market cap, at the time of writing – the biggest software IPO ever (see our S-1 teardown). And Palantir, an often controversial data analytics platform focused on the financial and government sectors, became a public company via direct listing, reaching a market cap of $22B, at the time of writing (see our S-1 teardown).
Meanwhile, other recently IPO’ed data companies are performing very well in public markets. Datadog, for example, went public almost exactly a year ago (an interesting IPO in many ways, see my blog post here). When I hosted CEO Olivier Pomel at my monthly Data Driven NYC event at the end of January 2020, Datadog was a $12B market cap company. A mere eight months later, at the time of writing, it’s a… $31B market cap company.
While there are many economic factors at play, ultimately financial markets are rewarding an increasingly clear reality, long in the making: to succeed, every modern company will need to be not just a software company, but also a data company. There is of course some overlap between software and data, but data technologies have their own requirements, tools and expertise. And some data technologies involve an altogether different approach and mindset – machine learning, for all the discussion about commoditization, is still a very technical area where success often comes in the form of 90-95% prediction accuracy, rather than 100%. This has deep implications for how to build AI products and companies.
Of course, this fundamental evolution is a secular trend that started in earnest perhaps ten years ago, and will continue to play out over many more years. To keep track of this evolution, this is our seventh annual landscape and “state of the union” of the data and AI ecosystem – this year with great help from Avery Klemmer. For anyone interested in tracking the evolution, here are the prior versions: 2012, 2014, 2016, 2017, 2018 and 2019 (Part I and Part II).
Who’s in, who’s out – noteworthy IPOs, M&A and additions
Let’s dig in.
KEY TRENDS IN DATA INFRASTRUCTURE
There’s plenty going on in data infrastructure in 2020. As companies start reaping the benefits of the data/AI initiatives they started over the last few years, they want to do more: can we process more data, faster and cheaper? Deploy more ML models in production? Do more in real-time? Etc.
This in turn raises the bar on data infrastructure (and the teams building/maintaining it), and offers plenty of room for innovation, particularly in a context where the landscape keeps shifting (multi-cloud, etc.)
A third wave? From Hadoop to cloud services to Kubernetes + Snowflake
Data governance, cataloging, lineage: the increasing importance of data management
The rise of an AI-specific infrastructure stack (“MLOps”, “AIOps”)
While those trends are still very much accelerating, here are a few more that are very much top of mind in 2020:
The modern data stack goes mainstream
The concept of “modern data stack” (a set of tools and technologies that enable analytics, particularly for transactional data) is many years in the making — it started appearing as far back as 2012, with the launch of Redshift, Amazon’s cloud data warehouse.
But over the last couple of years, and perhaps even more so in the last 12 months, the popularity of cloud warehouses has grown explosively, and so has a whole ecosystem of tools and companies around them, going from leading edge to mainstream.
The general idea behind the modern stack is the same as with older technologies: building a data pipeline where you first extract data from a bunch of different sources, store it in a centralized data warehouse, and then analyze and visualize it.
But the big shift has been the enormous scalability and elasticity of cloud data warehouses (Amazon Redshift, Snowflake, Google BigQuery, Microsoft Synapse, in particular). They have become the cornerstone of the modern, cloud-first data stack and pipeline.
While there are all sorts of data pipelines (more on this later), the industry has been normalizing around a stack that looks something like this, at least for transactional data:
ETL vs ELT
Data warehouses used to be expensive and inelastic, so you had to heavily curate the data before loading it into the warehouse: first extract data from sources, then transform it into the desired format, and finally load it into the warehouse (Extract, Transform, Load, or ETL).
In the modern data pipeline, you can extract large amounts of data from multiple data sources, dump it all in the data warehouse without worrying about scale or format, and *then* transform the data directly inside the data warehouse – in other words, extract, load and transform (“ELT”).
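To make the ordering concrete, here is a minimal ELT sketch. It is only an illustration under loud assumptions: an in-memory SQLite database stands in for the cloud data warehouse, and the `raw_orders` records and the `daily_revenue` table are made up for the example. The point is simply that the raw data lands first, untyped, and the reshaping happens afterwards in SQL inside the warehouse.

```python
import sqlite3

# Hypothetical raw records pulled from a source system (the "extract" step).
raw_orders = [
    ("1001", "2020-09-01", "19.99", "web"),
    ("1002", "2020-09-01", "5.00", "mobile"),
    ("1003", "2020-09-02", "42.50", "web"),
]

# Assumption: an in-memory SQLite database stands in for the cloud warehouse.
warehouse = sqlite3.connect(":memory:")

# Load: dump the records as-is, every column stored as untyped text.
warehouse.execute(
    "CREATE TABLE raw_orders (order_id TEXT, order_date TEXT, amount TEXT, channel TEXT)"
)
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_orders)

# Transform: cast and aggregate inside the warehouse, in SQL, after loading.
warehouse.execute(
    """
    CREATE TABLE daily_revenue AS
    SELECT order_date, ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
    FROM raw_orders
    GROUP BY order_date
    """
)

daily_revenue = warehouse.execute(
    "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
).fetchall()
print(daily_revenue)
```

In a classic ETL flow, the casting and aggregation would instead run in a separate engine before anything touched the warehouse.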
A new generation of tools has emerged to enable this evolution from ETL to ELT. For example, DBT is an increasingly popular command line tool that enables data analysts and engineers to transform data in their warehouse more effectively. The company behind the DBT open source project, Fishtown Analytics, raised a couple of venture capital rounds in rapid succession in 2020. (Their CEO Tristan Handy will be our guest for a fireside chat at Data Driven NYC in November, stay tuned). The space is very vibrant with other companies, as well as some tooling provided by the cloud data warehouses themselves.
This ELT area is still nascent and rapidly evolving. There are some open questions in particular around how to handle sensitive, regulated data (PII, PHI) as part of the load, which has led to a discussion about the need to do light transformation before the load – or ETLT (see XPlenty, What is ETLT?). People are also talking about adding a governance layer, leading to one more acronym, ELTG.
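As a rough illustration of that "light transformation before the load" idea, the sketch below masks sensitive fields before a record is handed to the loader. Everything in it is an assumption made for the example: the `mask_pii` helper, the field names and the choice of a plain SHA-256 digest are hypothetical and not taken from any particular tool.

```python
import hashlib

def mask_pii(record, pii_fields=("email", "phone")):
    """Return a copy of the record with sensitive fields replaced by a digest."""
    masked = dict(record)  # copy, so the source record is left untouched
    for field in pii_fields:
        if masked.get(field) is not None:
            value = str(masked[field]).encode("utf-8")
            masked[field] = hashlib.sha256(value).hexdigest()
    return masked

# Hypothetical record coming out of the extract step.
raw = {"user_id": 7, "email": "jane@example.com", "plan": "pro"}

# Light transformation first; only the masked record would then be loaded.
safe = mask_pii(raw)
print(safe)
```

An unsalted hash is used here only to keep the example short. It still allows joins on the masked column, but it is not strong anonymization, so a real pipeline would likely need salting, tokenization or encryption depending on the regulation involved.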
Automation of Data Engineering?
ETL has traditionally been a highly technical area, and largely gave rise to data engineering as a separate discipline. This is still very much the case today with modern tools like Spark that require real technical expertise.
However, in a cloud data warehouse centric paradigm, where the main goal is “just” to extract and load data, without having to transform it as much, there is an opportunity to automate a lot more of the engineering task.
This opportunity has given rise to companies like Segment, Stitch (acquired by Talend), Fivetran, and others. For example, Fivetran offers a large library of prebuilt connectors to extract data from many of the more popular sources and load it into the data warehouse. This is done in an automated, fully managed and zero-maintenance manner. As further evidence of the modern data stack going mainstream, Fivetran, which started in 2012 and spent several years in building mode, experienced a strong acceleration in the last couple of years, and raised several rounds of financing in a short period of time (most recently at a $1.2B valuation). For more, here’s a great chat I did with them a few weeks ago: In Conversation with George Fraser, CEO, Fivetran (video + full transcript).
Rise of the Data Analyst
An interesting consequence of the above is that data analysts are taking on a much more prominent role in data management and analytics.
Data analysts are typically not engineers, but they are proficient in SQL, a language used for managing data held in databases, and they may also know some Python. Sometimes they are a centralized team, sometimes they are embedded in various departments and business units.
Traditionally, data analysts would only handle the last mile of the data pipeline – analytics, business intelligence, visualization.
Now, because cloud data warehouses are big relational databases (forgive the simplification), data analysts are able to go much deeper into the territory that was traditionally handled by data engineers, and for example handle transformations, leveraging their SQL skills (DBT, Dataform and others being SQL based frameworks).
This is good news as data engineers continue to be rare and expensive. There are many more (10x more?) data analysts and they are much easier to train.
In addition, there’s a whole wave of new companies building modern, analyst-centric tools to extract insights and intelligence from data, in a data warehouse centric paradigm.
For example, there is a new generation of startups building “KPI tools” to sift through the data warehouse and extract insights around specific business metrics, or detect anomalies, including Sisu, Outlier or Anodot (which started in the observability data world).
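For a feel of what "detecting anomalies" on a business metric means at its very simplest, here is a toy z-score check. It is a sketch under stated assumptions, not a description of how Sisu, Outlier or Anodot work: the `flag_anomalies` helper, the threshold and the `daily_signups` series are all made up for illustration.

```python
from statistics import mean, stdev

def flag_anomalies(values, threshold=2.0):
    """Return indices of points more than `threshold` sample std devs from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # a perfectly flat series has nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical daily KPI pulled from the warehouse; one day clearly stands out.
daily_signups = [102, 98, 105, 99, 101, 97, 103, 100, 240, 104]
anomalies = flag_anomalies(daily_signups)
print(anomalies)
```

A single global mean and standard deviation ignore seasonality and trend, which is a large part of what dedicated products would need to handle.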
There are also emerging tools to embed data and analytics directly into business applications. Census is one such example of a company creating pipes from the data warehouse into applications.
Finally, despite (or perhaps thanks to) the big wave of consolidation in the BI industry which we highlighted in our 2019 landscape, there is a lot of activity around tools that will promote a much broader adoption of BI across the enterprise. To this day, business intelligence in the enterprise is still the province of a handful of analysts trained specifically on a given tool, and has not been broadly democratized.
Data lakes and data warehouses merging?
Another trend towards simplification of the data stack is the unification of data lakes and data warehouses. Some (like Databricks) call this trend the “data lakehouse”, others call it the “Unified Analytics Warehouse”.
Historically, you’ve had data lakes on one side (big repositories for raw data, in a variety of formats, that are low-cost, very scalable but don’t support transactions, data quality, etc.) and then data warehouses on the other side (a lot more structured, with transactional capabilities and more data governance features).
Data lakes have had a lot of use cases for machine learning, whereas data warehouses have supported more transactional analytics and business intelligence.
The net result is that, in many companies, the data stack includes a data lake and sometimes several data warehouses, with many parallel data pipelines.
Companies in the space are now trying to merge both sides, with a “best of both worlds” goal, and a unified experience for all types of data analytics, including both BI and machine learning.
For example, Snowflake pitches itself as a complement to, or potential replacement for, a data lake (here). Microsoft’s cloud data warehouse, Synapse, has integrated data lake capabilities. Databricks has made a big push to position itself as a full lakehouse (here).
A lot of the above points towards greater simplicity and approachability of the data stack in the enterprise.
However, this trend is counterbalanced by an even faster increase in complexity.
The overall volume of data flowing through the enterprise continues to grow at an explosive pace. The number of sources of data keeps increasing as well, with ever more SaaS tools.
There is not one but many data pipelines operating in parallel in the enterprise. The modern data stack mentioned above is largely focused on the world of transactional data and BI style analytics. Many machine learning pipelines are altogether different.
There’s also an increasing need for real time streaming technologies, which the modern stack mentioned above is in the very early stages of addressing (it’s very much a batch processing paradigm for now).
For this reason, the more complex tools, including those for micro-batching (Spark) and streaming (Kafka and, increasingly, Pulsar) continue to have a bright future ahead of them. The demand for data engineers who can deploy those technologies at scale is going to continue to increase.
There are several increasingly important categories of tools that are rapidly emerging to handle this complexity, and add layers of governance and control to it.
Orchestration engines are seeing a lot of activity. Beyond early entrants like Airflow and Luigi, a second generation of engines has emerged, including Prefect and Dagster, as well as Kedro and Metaflow. Those products are open source workflow management systems, using modern languages (Python) and designed for modern infrastructure that create abstractions to enable automated data processing (scheduling jobs, etc.), and visualize data flows through DAGs (directed acyclic graphs).
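The core abstraction these engines share, a DAG of tasks that run only after their dependencies, can be sketched with the Python standard library alone. This is not the API of Airflow, Prefect, Dagster or any other engine named above: the task names and the `run_pipeline` helper are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "load_warehouse": {"extract_orders", "extract_customers"},
    "transform_revenue": {"load_warehouse"},
    "refresh_dashboard": {"transform_revenue"},
}

def run_pipeline(dag, run_task):
    """Run every task after all of its dependencies; return the execution order."""
    executed = []
    for task in TopologicalSorter(dag).static_order():
        run_task(task)
        executed.append(task)
    return executed

# The task body here is just a print; a real engine would launch actual jobs.
order = run_pipeline(pipeline, run_task=lambda name: print(f"running {name}"))
```

Real orchestration engines layer scheduling, retries, parallel execution and monitoring on top of this ordering logic.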
Pipeline complexity (as well as other considerations, such as bias mitigation in machine learning) also creates a huge need for DataOps solutions, in particular around data lineage (metadata search and discovery), as highlighted last year, to understand the flow of data and monitor failure points. This is still an emerging area, with so far mostly homegrown (open source) tools built in-house by the big tech leaders: LinkedIn (Datahub), WeWork (Marquez), Lyft (Amundsen) or Uber (Databook). Some promising startups are emerging.
There is a related need for data quality solutions, and we’ve created a new category in this year’s landscape for new companies emerging in the space (see chart).
Overall, data governance continues to be a key requirement for enterprises, whether across the modern data stack mentioned above (ELTG, as mentioned above) or machine learning pipelines.
TRENDS IN ANALYTICS & ENTERPRISE ML/AI
Boom time for data science and machine learning platforms (DSML)
DSML platforms are the cornerstone of the deployment of machine learning and AI in the enterprise. The top companies in the space have experienced considerable market traction in the last couple of years and are reaching large scale.
While they came at the opportunity from different starting points, the top platforms have been gradually expanding their offering to serve more constituencies and address more use cases in the enterprise, whether through organic product expansion or M&A:
Dataiku (in which we are proud investors) started with a mission to democratize enterprise AI and promote collaboration between data scientists, data analysts, data engineers and leaders of data teams, across the lifecycle of AI (from data prep to deployment in production). With their most recent release, they added non-technical business users to the mix through a series of re-usable AI apps.
Databricks has been pushing further down into infrastructure through their lakehouse effort mentioned above, which interestingly puts it in a more competitive relationship with two of its key historical partners, Snowflake and Microsoft. They also added to their unified analytics capabilities by acquiring Redash, the company behind the popular open source visualization engine of the same name.
DataRobot acquired Paxata, which enables it to cover the data prep phase of the data lifecycle, expanding from its core autoML roots.
ML getting deployed and embedded:
A few years into the resurgence of ML/AI as a major enterprise technology, there is a wide spectrum of levels of maturity across enterprises – not surprisingly for a trend that’s mid-cycle.
At one end of the spectrum, the big tech companies (GAFAA, Uber, Lyft, LinkedIn etc) continue to show the way. They have become full-fledged AI companies, with AI permeating all their products. This is certainly the case at Facebook, for example: see my really interesting conversation with Jerome Pesenti, Head of AI at Facebook. Worth noting that big tech companies contribute a tremendous amount to the AI space, directly through fundamental/applied research and open sourcing, and indirectly as employees leave to start new companies (as a recent example, Tecton.ai, started by the Uber Michelangelo team).
At the other end of the spectrum, there is a large group of non-tech companies that are just starting to dip their toe in earnest into the world of data science, predictive analytics and ML/AI. Some are just launching their initiatives, while others have been stuck in “AI purgatory” for the last couple of years, as early pilots haven’t been given enough attention or resources to produce meaningful results yet.
Somewhere in the middle, a number of large corporations are starting to see the results of their efforts. They typically embarked years ago on a journey that started with Big Data infrastructure, and has evolved along the way to include data science and ML/AI.
Those companies are now in the ML/AI deployment phase, reaching a level of maturity where ML/AI gets deployed in production, and increasingly embedded into a variety of business applications.
For these companies, the multi-year journey has looked something like this:
As ML/AI gets deployed in production, several market segments are seeing a lot of activity:
There’s plenty happening in the MLOps world, as teams grapple with the reality of deploying and maintaining predictive models – while the DSML platforms provide that capability, many specialized startups are emerging at the intersection of ML and devops
The issues of AI governance and AI fairness are more important than ever, and this will continue to be an area ripe for innovation over the next few years
Another area with rising activity is the world of decision science (optimization, simulation), which is very complementary with data science. For example, in a production system for a food delivery company, a machine learning model would predict demand in a certain area, and then an optimization algorithm would allocate delivery staff to that area in a way that optimizes for revenue maximization across the entire system. Decision science takes a probabilistic outcome (“90% likelihood of increased demand here”) and turns it into a 100% executable software-driven action.
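The food delivery example can be sketched in a few lines. This is a toy under stated assumptions: the demand forecast is hard-coded where a trained model would sit, the `allocate_couriers` helper and the capacity of five orders per courier are hypothetical, and served orders are used as a stand-in for revenue.

```python
import heapq

def allocate_couriers(predicted_orders, couriers, orders_per_courier=5):
    """Greedily send each courier to the zone with the most unserved predicted orders."""
    allocation = {zone: 0 for zone in predicted_orders}
    # heapq is a min-heap, so store negated unserved demand to pop the largest first.
    heap = [(-demand, zone) for zone, demand in predicted_orders.items()]
    heapq.heapify(heap)
    for _ in range(couriers):
        if not heap or heap[0][0] >= 0:
            break  # no unserved demand left anywhere
        neg_unserved, zone = heapq.heappop(heap)
        allocation[zone] += 1
        remaining = min(neg_unserved + orders_per_courier, 0)
        heapq.heappush(heap, (remaining, zone))
    return allocation

# Hypothetical output of a demand model: expected orders per zone for the next hour.
predicted_orders = {"downtown": 23.0, "midtown": 12.0, "uptown": 4.0}
plan = allocate_couriers(predicted_orders, couriers=7)
print(plan)
```

The prediction step is probabilistic, while the plan that comes out is a concrete instruction the system can execute. A production system would more likely use a proper optimization solver with travel times, shifts and costs as constraints.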
While it will take several more years, ML/AI will ultimately get embedded behind the scenes into most applications, whether provided by a vendor, or built within the enterprise. Your CRM, HR software, ERP, etc will all have parts of them running on AI technologies.
Just like “Big Data” before it, ML/AI, at least in its current form, will disappear as a noteworthy and differentiating concept, because it will be everywhere. In other words, it will no longer be spoken of, not because it failed, but because it succeeded. It’s the fundamental irony of successful technologies that they eventually are taken for granted and disappear in the background.
The Year of NLP
It’s been a particularly great last 12 months (or 24 months) for NLP, a branch of artificial intelligence focused on understanding natural language.
Transformers, which have been around for some time, and pre-trained language models continue to gain popularity. These are the models of choice for NLP as they permit much higher rates of parallelization and thus larger training data sets.
Google rolled out BERT, the NLP system underpinning Google Search, to 70 new languages.
Google also released ELECTRA, which performs similarly on benchmarks to language models such as GPT and masked language models such as BERT, while being much more compute efficient.
We are also seeing adoption of NLP products that make training models more accessible. For example, MonkeyLearn allows non-technical folks to train models on proprietary corpuses of text.
And, of course, the GPT-3 release was greeted with much fanfare. This is a 175B parameter model out of OpenAI, more than two orders of magnitude larger than GPT-2.
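A quick back-of-the-envelope check of that size comparison. The GPT-3 figure comes from the text above; the 1.5B figure for GPT-2 is an assumption added here (the size of the largest publicly released GPT-2 model), not something stated in this post.

```python
import math

gpt3_params = 175e9  # from the text above
gpt2_params = 1.5e9  # assumption: largest publicly released GPT-2 model

ratio = gpt3_params / gpt2_params
orders_of_magnitude = math.log10(ratio)
print(f"{ratio:.0f}x larger, about {orders_of_magnitude:.2f} orders of magnitude")
```

Under that assumption the ratio lands just above 100x, which is where "more than two orders of magnitude" comes from.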
This year, we took more of an opinionated angle to the landscape – we removed a number of companies (particularly in the applications section) to create a bit of room, and we selectively added some small startups that struck us as doing particularly interesting work.
Underlying list: despite how busy the landscape is, we cannot possibly fit in every interesting company on the chart itself. As a result, we have a whole spreadsheet that not only lists all the companies in the landscape, but also hundreds more – click here.
A few additional comments:
Yes, you can zoom! The image and all logos are very high-res, so you can navigate the landscape in detail by zooming. Works very well on mobile, too!
We’ve detailed some of our methodology in the notes at the end of this post.
MAJOR FINANCING & EXITS & IPOS // Who’s in who’s out?
This has been a blockbuster last 12 months for IPOs:
We mentioned at the beginning of this post some of the large IPOs: Snowflake, Palantir, Datadog
There have been several other impressive IPOs in the data space: Sumo Logic, Dynatrace and Cloudflare
Who will be the exciting IPOs of the next 12 months, assuming the IPO window remains open? UiPath? Databricks?
The M&A market was slightly less active than in the prior year, but it continues to produce large outcomes, including Arm (acquired by NVIDIA for $40B), SignalFx (acquired by Splunk for $1.05B (blog)), Habana Labs (acquired by Intel for $2B (blog)), Fitbit (acquired by Google for $2.1B (blog)) and Moovit (acquired by Intel for $900M (blog)). There has been a notable trend of consolidation within autonomous driving, with Amazon reportedly buying Zoox, Apple acquiring Drive.ai, and Uber acquiring Mighty AI.
Smaller but interesting acquisitions:
End of an era: after Cloudera’s acquisition of Hortonworks, struggling MapR was acquired by HPE in August 2019 (here)
Arcadia Data acquired by Cloudera
Redash acquired by Databricks
Custora acquired by Amperity
Sense360 acquired by Medallia
The VC market has been extremely active for data and AI companies
Noteworthy financings of companies new to the 2020 landscape:
OneTrust (data privacy assessment) raised $400M over the course of a year
Anduril (defense) raised a $200M Series C (July 2020) after a $120M Series B (Sept 2019)
Berkshire Grey (robotics) raised $263M Series B
Rigetti (quantum) raised $71M
Fivetran – $1.2B valuation
Starburst Data (enterprise grade distribution of Presto) came out of stealth with a $22M Series A led by Index in November 2019, followed by a $42M Series B in June 2020.
Fishtown / DBT – one round announced (a16z), another round unannounced
Chronosphere (layer built on top of Uber’s M3 metrics database) came out of stealth with an $11M Series A from Greylock (source).
Some noteworthy financings of companies already on the landscape:
Snowflake $479M Series G before their IPO
Databricks – $400M Series F (October 2019)
UiPath becomes a decacorn with $225M Series E at a $10.2B post money
Dataiku $100M Series D (August 2020), with most recent reported valuation at $1.4B