SVB Capital’s Dave Mullen: ‘Built Data’ Will Enable the Next Era of AI Adoption

ABSTRACT

While significant hype and investment in Artificial Intelligence (“AI”) have accrued to the infrastructure and modeling layers of the value chain, the real bottleneck hindering AI from reaching its full potential is data: the need for better, cleaner, and more organized data to train models enabling 'ChatGPT'-like applications across industries and functions. Dave Mullen, Partner at SVB Capital, details why ‘Built Data’ – a coinage for Big Data in its next-generation form – will be the critical mechanism that unlocks the power of AI for global enterprises. The concept of ‘Built Data’ is centered on data’s empowerment of AI to achieve unprecedented levels of accuracy and adaptability, driving substantial value for the category and many future multi-billion-dollar outcomes.

KEY POINTS FROM DAVE MULLEN'S POV:

Why will data be such an important enabler for enterprise AI adoption, as opposed to infrastructure and AI model advancements?

A chief bottleneck for AI adoption lies in an enterprise’s ability to harness the power of its own data. As organizations of all shapes and sizes map their AI strategies, there is an increasing need to address shortcomings in the quality, comprehensiveness, and applicability of data, both data held within the organization and external data it might have access to. “In many cases, the data an organization has access to is disparate, unstructured, unorganized, and unable to be centrally ingested due to compliance requirements,” says Mullen. “The opportunity in ‘Built Data’ then is to give the power back to the organization. In a world where proprietary data is as valuable a competitive advantage as an organization can have, not being able to utilize this to drive forward AI strategies is a pretty big problem.”

The complexity of the data journey, from raw data to AI application, presents opportunities for a new generation of billion-dollar companies. With many CEOs and CIOs uncertain about where to start, there's a clear demand for solutions bridging the gap between data and AI. “The success of companies like Snowflake demonstrates how much value has been created in solving the data needs of enterprises: storing, processing, sharing, and analyzing data. As enterprises increasingly aim to utilize their data to power highly customized AI use cases, it’s clear the market opportunity here can support several billion-dollar outcomes across the value chain.”

What is ‘Built Data’, and which areas of the AI value chain will it have the most impact on in terms of value creation?

‘Built Data’ is a nod to Big Data but in its next-generation form - bigger, stronger, more powerful: an enabling technology for more performant AI applications. This next evolution of Big Data will let organizations curate hyperlocal, well-structured datasets that enable efficient AI adoption, bridging the gap between raw data and impactful commercial AI applications. Notably, ‘Built Data’ is a central tenet of a much broader ‘barbell strategy’ towards Artificial Intelligence.
Opportunities for value creation within Artificial Intelligence can be assessed through what Mullen calls a ‘barbell strategy’: a framework that distinguishes which areas within the AI technology stack are most interesting from an investment perspective:
1. ‘Built Data’: On one end of the barbell is data as a mechanism for enabling larger and more performant AI models - this includes functions for cleansing, organizing, categorizing, augmenting, and labeling data, as well as managing governance, compliance, and security for AI-specific threat vectors such as data poisoning and malicious prompt injection.
2. Commercial applications: On the other end of the barbell are commercial AI applications that interface with ‘Built Data’ to build transformative AI solutions that will unlock massive enterprise value. Anecdotally, enterprises are prioritizing commercial AI applications ranging from onboarding and risk management to decisioning and customer support.
3. The infrastructure and model building layer: In between the areas above are the infrastructure and LLM provision layers, where an opportunity for new players within the ‘Built Data’ framework is far less evident. “We recognize the value of the infrastructure layer,” says Mullen, “but view significant alpha in ‘Built Data’ and commercial applications given the quantum of funding towards infrastructure, its inherent capital intensity, as well as what many might view as overvaluation in this part of the technology stack.”

What are some of the most promising business models that are emerging to enable or harness ‘Built Data’ for enterprises?

On the ‘Built Data’ end of the barbell, a substantial opportunity lies in the security and governance of data as it flows through the value chain, notably as threat actors emerge to dilute models with data poisoning and malicious prompt injection.
While there is an enormous opportunity in commercial applications solving sector-specific challenges, such as those in banking or healthcare, the largest outcomes will be those that enable enterprises across sectors to harness their data. By leveraging insights from one sector, these companies can apply them to other industries, enhancing the effectiveness and applicability of AI-driven solutions across various domains. Examples include:
1. Hyperplane, a data intelligence platform that aims to unlock the potential of first-party data, starting with financial institutions. By leveraging centralized intelligence, Hyperplane helps banks model, understand, and interact with their customers more effectively. One of its key features is hyper-personalization, which enables banks to tailor their services and experiences to individual customers' financial needs. This, in turn, can lead to increased customer satisfaction, engagement, and ultimately, more profitable digital relationships for the banks. “With the Hyperplane data intelligence layer, they might have the potential to extrapolate insights from banking products and apply them to insurance or e-commerce products,” adds Mullen.
2. Chalk, an infrastructure provider for real-time data querying across segments. It enables seamless integration of fresh internal and external data, ensuring protection while building models. Chalk’s pre-built libraries can address 80% of data needs across any sector, and the company then offers further customization for the remaining 20%: specific nuances such as varying privacy regulations and compliance requirements across industries like financial services and agriculture.
3. Sixfold, a generative AI co-pilot that aids in managing high-volume policy onboarding for insurance. Most noteworthy is the company’s ability to ingest internal risk frameworks and apply them in real-time applications for policies. Thus, there is an opportunity for horizontal expansion by applying the solution to the onboarding process in other sectors.
4. Ketch, a platform specializing in online privacy and data compliance. It addresses the complex regulatory landscape by assisting organizations in determining the appropriate usage of data within different operational units and geographical regions. Whether it's navigating data governance requirements in the US or ensuring compliance with regulations in UK branches, Ketch provides comprehensive solutions to mitigate legal risks and safeguard sensitive information.
5. Unstructured, a solution that manages LLM integration with unstructured data, which often accounts for as much as 80% of the data within an enterprise. Unstructured extracts and transforms complex data stored in difficult-to-use formats (e.g., HTML, PDF, CSV, PNG, PPTX) for use with every major vector database and LLM framework.

Commercial applications will utilize ‘Built Data’ to support transformative AI applications across a host of sectors including financial services, supply chain, climate tech and environmental assets, and insurance:
- In financial services, the sheer volume of data and the potential to build on this data dwarf many other categories. In many cases, enterprises in financial services are building on proprietary models from providers such as OpenAI and Anthropic to achieve AI objectives, establishing partnerships with these organizations. Others are exploring open-source alternatives, driven by privacy and security considerations. While most companies aren't gearing up to construct their own LLMs, a handful are emphasizing their own in-house development of specialized models for specific tasks. The prevailing sentiment is to diversify model usage to avoid dependency on any single ecosystem. As they tackle the category's biggest challenges, leaders in financial services are prioritizing code cleanup, customer support, onboarding, personalization, and fraud detection.
- Data platforms will be foundational in informing the future of climate tech and environmental asset management. The emergence of geospatial data offers potential applications in climate monitoring, ranging from real-time weather pattern analysis to assessing companies' greenhouse gas footprints. Treefera—a data platform for environmental asset management—demonstrates how ‘Built Data’ can be leveraged as an evolving mechanism to apply AI within numerous segments. Treefera integrates machine learning algorithms with an array of data sources to provide auditable and transparent data for supply chain custody and compliance, biofuel and bioenergy compliance, and carbon credit origination and underwriting.
- Supply chain logistics represents another significant opportunity to enhance operational efficiency through the utilization of AI. AI-driven insights can provide real-time visibility into inventory levels, demand fluctuations, and logistical challenges, enabling proactive decision-making and mitigating potential bottlenecks. Moreover, AI applications can facilitate the implementation of predictive analytics and machine learning algorithms, enabling companies to forecast demand, optimize routing, and minimize transportation costs.

What are some of the potential roadblocks for these emerging companies?

- Global data privacy regulations will present an ongoing challenge for both enterprises and providers of new data solutions. “Regulations around privacy and data safety will only increase, especially for anything that directly touches the consumer,” says Mullen. “At the same time, many enterprises still have a long way to go in terms of getting up to speed on these compliance and data governance requirements before they can implement their AI strategies.”
- Open questions around the onus of liability for data security create additional headwinds in heavily regulated sectors. With the emergence of open banking, for example, a challenge that arises is the implied onus on banks to take on liability for any data they have touched, regardless of whether it still rests within the bank. At the same time, banks are forced to share data with startups and other constituents who may not share this same onus, leaving unanswered questions as to the scope of a bank’s liability in the event of a data breach. These issues are unlikely to be solved in the near term, and will present an ongoing challenge in this category.
- Solutions must accommodate both on-premises and cloud data environments. There is an ongoing push-pull dynamic between on-prem and cloud storage. Solutions will have to cater to both in order to capture significant shares of their respective markets while weighing the nuances of organizational priorities, sectors, and geographies along the way.

IN THE INVESTOR’S OWN WORDS

Dave Mullen

As a Series A/B investor, I am looking for opportunities that have entered the mainstream and have commercial buyers ready to adopt them. From conversations with buyers of technology, the need for data to feed models disintermediating everything from fraud to chatbot technology is recognized as a significant problem by organizations of all sizes. This is why I perceive ‘Built Data’ as such a compelling opportunity – buyers understand there is a problem and are ready to buy, but need the tools to get started and fill this gap.

There are many indicators of the accelerating momentum of this trend. In addition to massive funding rounds, one of the central tenets of Reddit's IPO success is its monetization of data for training models, which is likely to become the industry standard. Reddit's corpus, representing 19 years of human experience organized by topic with moderation and relevance, is crucial both for building chat capability and for ensuring the freshness of information. The ability of other organizations to replicate this approach will be a crucial inflection point in the rise of ‘Built Data’.

MORE Q&A

Q: Does the opportunity for ‘Built Data’ extend to the corpus of publicly available data, or is the scope of impact limited to the proprietary datasets held by various organizations?

A: Wherever high-quality data is available, it should be utilized. The challenge is less in where the data is coming from and more in the process of preparing it for ingestion by models - be that ensuring compliance, or organizing, cleansing, and categorizing data. While not all internal data is necessarily structured and clean, the process becomes more complex when dealing with scraped data, often from external sources, because it requires an added layer of cleansing and organization. I see an opportunity for next-gen data companies to address this challenge, particularly in synthesizing or augmenting a specific organization's dataset.

STARTUPS MENTIONED IN THIS BRIEF

Chalk, Hyperplane, Ketch, Sixfold, Treefera, Unstructured

Produced by the Emerging Venture Capitalists Association (EVCA), the 2023 EVC List honors the top 25 rising stars in venture capital.

Terra Nova's Thesis Brief Series showcases each investor's insights and category expertise.