Value Capture and Appropriability Challenges in AI Data Strategy: The Case of OpenAI’s ChatGPT

Data strategy for AI is fundamentally about access to and control over different sources of data.

This paper discusses value capture and appropriability in AI data strategy, focusing on OpenAI and ChatGPT. We explore how data is used strategically to capture value, centring on the concept of appropriability: a firm’s ability to secure profits from its innovations.

OpenAI’s launch of the generative AI service ChatGPT highlights the significance of accessing and utilising diverse data sources, including public, user-generated, and partner data, in its large language models (LLMs). How does OpenAI capture value from data in services like ChatGPT, and what key challenges does it face in appropriating public, user-generated, and partner data?

What is OpenAI? Current operations

OpenAI was founded in 2015 as a non-profit artificial intelligence research organisation. It aimed to prevent potential misuse and problems from AI being used ”in the wild”, with the stated goal of ”benefit[ing] all of humanity”. Presumably, it then carried values of ”free access to research results” and to data, as mentioned in our lectures by Long (2025). Value creation was seen as deriving from open access and sharing of data, as described by Mayer-Schönberger and Ramge (2022, p.9). However, in November 2022, OpenAI’s LLM-based service ChatGPT made a generative AI breakthrough and grew its user base faster than ever previously seen. Users could suddenly produce content (text, and later more advanced texts, images, videos, and code) using prompts in a chat-like interface. ChatGPT moved in a fast, complex, and constantly changing landscape, where rules and regulations on data and access were not yet set (Svenskarna och AI, 2024).

From non-profit research towards controlled access

Data access is vital for innovation. Volume, variety, and velocity of data are at the core of data management (Gandomi & Haider, 2014). As Bjurgren and Long (2022) describe, raw data is non-rival: it can be consumed by many people at the same time without reduced value. Innovative entrepreneurs need a constant data flow, according to Mayer-Schönberger and Ramge (2022, p.5). OpenAI accessed vast amounts of data by scraping the internet across many different source types, such as text, code, and image data from social media, blogs, digitised books, online reviews, and Wikipedia pages. The data providers were often neither informed nor asked.

Stanford (2023) writes that OpenAI’s model GPT-4 represents a transparency backlash, quoting its technical report: ”Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.” (Stanford, 2023). Mapping the data completely is not possible, since OpenAI keeps secret exactly what its models are trained on. Teece (1986, p.3) describes trade secrets as an alternative to patents, viable ”if a firm can put its product before the public and still keep the underlying technology secret”. Soon, many publications suspected that ChatGPT had been fed their texts without consent (even those behind paywalls). The New York Times and others sued OpenAI for copyright infringement.

Teece’s discussion from 1986 is once again relevant: ”The property rights environment within which a firm operates can thus be classified according to the nature of the technology and the efficacy of the legal system to assign and protect intellectual property.” (Teece, 1986, p.3). The European Commission wrote in February 2023 that it is ”important to note that content created by ChatGPT is derived from content that has been previously generated by others. Therefore it is not clear what are the implications in terms of copyright for reusing this content: when is the output ’inspired’ from existing works and when is it actually infringing them?” (Shreblog, 2023; Intellectual Property Helpdesk, 2023). OpenAI, for its part, argues that its AI models were trained on publicly available internet content under fair use (Module 2, lectures, Long, 2025). When OpenAI started profiting from the freely scraped content, some providers, such as The New York Times and GitHub, blocked OpenAI from accessing their content without permission (Mashable, 2024; CNBC, 2025).

Stakeholders involved and the AI Ecosystem

The stakeholders involved in OpenAI’s service ChatGPT are a diverse mix of developers, AI researchers, social scientists, investors, auditors, policymakers, end users and consumers, and more. Hubs within the ecosystem function as nodes with high connectivity and opportunity: ”For asset developers, hubs indicate their assets are high-impact. For economists, hubs communicate emergent market structure and potential consolidation of power. For investors, hubs signal opportunities to further support or acquire. For policymakers and auditors, hubs identify targets to scrutinize to ensure their security and safety.” (Stanford, 2023). This ecosystem of stakeholders and nodes contributes to a range of downstream products across several industry sectors using the ChatGPT API (Stanford, 2023).

Partnership for data and model access

OpenAI’s evolving partnership strategy with Microsoft illustrates how appropriability strategies shift across different phases of innovation (Teece, 1986; Lecture: Long, 2025). During upstream innovation, the focus was on building capabilities and driving breakthroughs. In the downstream phase, the challenge becomes how to capture value from those innovations while maintaining strategic control. In January 2023, Microsoft deepened its commitment to OpenAI through a multiyear, multibillion-dollar investment aimed at accelerating AI development and scaling commercial applications. The partnership promised developers and organisations across industries access to ”the best AI infrastructure, models, and toolchain with Azure to build and run their applications” (Microsoft, 2023). More recently, in May 2025, OpenAI and Microsoft negotiated a new deal in preparation for OpenAI’s planned IPO, which will transform the company from a nonprofit into a fully commercial, for-profit enterprise (Reuters, 2025). Microsoft is reportedly seeking guarantees to secure access to OpenAI’s future cutting-edge models. This evolution reflects what Varian (2014) describes as the impact of changing cost structures: as AI training becomes capital-intensive, control over data and model access tightens (Teece, 1986, p.5). The company’s challenge is to balance the need to share information to gain value against the risk of losing control of that value. Bjurgren and Long (2022) highlight that AI value chains involve multiple actors that not only produce data but also rely on data access. To make this work, collaboration between the different actors and partners is a must. This benefits big players in collaborative networks, where the value of the aggregated data is greater than that of the single data points in themselves.

This growing emphasis on controlling access to critical inputs and outputs has also shaped how OpenAI manages external data partnerships, moving from general web scraping to carefully negotiated licensing deals with selected content providers. The data economy is built on complex networks (Zech, 2017).

Strategic partnership and ecosystem positioning enables appropriability

OpenAI’s ecosystem strategy relies heavily on monetising access to its models through APIs and layered applications. Its appropriability strategy reflects a broader shift away from the original ideals of the open Web. In contrast to the decentralised structure of, for example, Berners-Lee’s Web (Mayer-Schönberger and Ramge, 2022, p.11), OpenAI has positioned itself as a central node in the emerging AI ecosystem. It powers a growing number of layered applications, from Microsoft 365 Copilot to third-party startups like Lovable and Cursor, which pay to access OpenAI’s models. This strategy helps OpenAI expand its influence and revenue streams across the AI market by controlling access to its data and models (Pipeline Capital Group; Stanford, 2023). Today we see a new type of organisation where companies integrate vertically to form larger data networks, or layers on which additional products and services can be built, with interfaces adapted and tailored for certain target groups (Wiebe, 2017). The AI economy in which OpenAI positions itself forms a layered cake of services, from infrastructure control to platform integration, licensing deals, and partnerships.

Partnerships and data deals

Beyond API revenues, OpenAI is increasingly closing exclusive data partnerships, such as recent licensing deals with Axel Springer and News Corp, to strengthen its competitive position and settle copyright battles (Fast Company, 2024; Reuters, 2023). The web was originally designed as an open, neutral infrastructure to combine and share information globally, built as a scalable and resilient architecture that democratised access to knowledge (Mayer-Schönberger & Ramge, 2022, p.11). However, services such as OpenAI’s ChatGPT are built as a layer on top of the free web, which may lead to more curated and commercialised access to information. Reports suggest that Axel Springer’s content may even be surfaced more prominently in ChatGPT search results to drive traffic and subscriptions to its brands. This illustrates how OpenAI’s appropriability strategy relies not only on technical controls but also on selective data partnerships that change how information flows, moving away from the Web’s original idea of decentralisation and open access (Reuters, 2023; Fast Company, 2024). Teece (1986, p.9) claims that the amount and diversity of assets and competences that need to be accessed are huge, especially for complex technologies. Even users of ChatGPT enter a form of partnership when using the service. As Varian suggests, most people are willing to share quite personal information if they get something in return, and today we are used to trading our data for valuable transactions.

Evolving Appropriability Data Strategy: How OpenAI is Consolidating Control

As OpenAI’s ecosystem has matured, its appropriability strategy has evolved, moving from open innovation toward tighter commercial control. As OpenAI tightens control over its commercial models, tensions over data ownership and IP protection have become more visible. The company built much of its AI capability by training on publicly available data, often asking for forgiveness rather than permission. However, it now defends its models. In response to ”suspected unauthorised distillation” of its models by DeepSeek, OpenAI stated: ”We take aggressive, proactive countermeasures to protect our technology and will continue working closely with the US government to protect the most capable models being built here” (Guardian, 2025). This reflects Teece’s (1986, p.6) observation that innovators must turn to business strategy to stay ahead of imitators. As industries move out of the pre-paradigmatic phase, in which, under weak appropriability regimes, user needs are closely coupled to the market and shape designs, rivalry focuses on setting the dominant design (Teece, 1986, p.7). In today’s rapidly evolving AI market, OpenAI faces intense competition from models such as Anthropic’s Claude, Google’s Gemini, and Perplexity, which offer look-alike services to ChatGPT. I therefore argue that ChatGPT has now entered the paradigmatic phase, where ”volumes increase and opportunities for economies of scale” open up for more specialised services and distribution (Teece, 1986, p.7). OpenAI continues to seek strategic partnerships to strengthen its control over key complementary assets. In 2024, a partnership with Apple enabled deep integration of ChatGPT functionality across iOS, iPadOS, and macOS platforms (OpenAI, 2024; CNBC, 2025). These moves reflect a deliberate strategy of consolidating control over both data inputs and the infrastructure required to deliver AI services, deepening OpenAI’s appropriability advantage beyond financial partnerships.
OpenAI secures appropriability by controlling its models, infrastructure, and platforms. According to Teece (1986, p.11), securing control of complementary capacities is a better success factor when innovation is not tightly protected and, ”once ’out’, is easy to imitate”.

Drawing on Mayer-Schönberger and Ramge’s (2022) discussion, it is access to data, not ownership of data, that creates value. OpenAI creates value by offering subscriptions to its models in ChatGPT, and access to its APIs at different price segments. By becoming a platform from which services are enabled, it positions itself as part of the underlying infrastructure of other companies’ innovation. The type of data access OpenAI commands is impossible for most startups and companies to achieve. This creates a ”kill zone” problem, where startups cannot compete if they lack access to comparable data or compute. Control is exercised through contracts and infrastructure (Mayer-Schönberger & Ramge, 2022). When I previously worked on digital literacy at the Swedish Internet Foundation, ”Internet for all” was the organisation’s main goal.

This makes me reflect on what this means for the infrastructure: how a potential new AI layer created on top of the internet might affect us (Internetstiftelsen). As Teece (1986, p.7) observes, as the terms of competition shift and prices become less important, access to complementary assets becomes absolutely critical. Firms controlling cospecialised assets such as distribution channels and infrastructure are advantageously positioned relative to competitors (Teece, 1986, p.8). OpenAI’s partnerships with Apple and CoreWeave reflect this dynamic: by securing these relationships, OpenAI strengthens its long-term competitive position even as LLM technologies become increasingly commodified.

The information paradox — media companies’ data strategy dilemma

However, these partnerships highlight a deeper information paradox. Burstein (2022) describes it thus: ”the buyer of information must be able to place a value on the information. But once the seller discloses the information, the buyer can take it without paying.” While media companies gain short-term revenue, they risk undermining the long-term value of their content. The information paradox lies at the heart of OpenAI’s recent media licensing partnerships.

While content creators seek to protect and monetise their work, they must also feed the very AI models that may ultimately undermine their value. This tension reflects a deeper structural issue in the economics of data. Data is fundamentally nonrival and can be used simultaneously by multiple actors with no loss of value (Jones & Tonetti, 2020). Yet, as Teece (1986) highlights, the information paradox means that firms hesitate to share data openly, fearing that its value will be appropriated by others once accessed. In OpenAI’s partnerships with media companies, this paradox is evident: while AI models depend on broad access to high-quality data, content producers fear losing both control and future revenue streams. As media companies strike deals with OpenAI, such as those involving News Corp, Axel Springer, and Reddit, they gain some control over how their content is used (Fast Company, 2024; Mashable, 2024). Yet concerns remain about copyright, fair compensation, and the long-term impact on competition. As Mashable observes, ”when allowing AI to train on their material, it may undermine their value by enabling AI-generated content to replicate their style and authority” (Mashable, 2024).

Moreover, some critics argue that publishers who license their content are ”trading their own hard-earned credibility for a little cash from the companies that are simultaneously undervaluing them and building products quite clearly intended to replace them” (Mashable, 2024). Today, many researchers understand that the raw material of their publications must be accessible, reproducible, and verifiable (Mayer-Schönberger and Ramge, 2022, p.9). Yet in the current AI ecosystem, contributors to model training have no such guarantees.

The resulting black-box AI systems not only obscure the learning process but also raise ethical questions about transparency and accountability in data-driven innovation. As Bjurgren and Long (2022) conclude, we lack a clear framework for AI data. This tension exemplifies the classic information paradox: AI models depend on high-quality, human-created content for functionality and fairness, yet using that content risks eroding its future value. Mayer-Schönberger and Ramge (2022, pp.18–19) argue that ”open government data is primarily an opportunity to ensure transparency and create value for society. But if done well, it can also turn into a huge donation of data to the economy and stimulate innovation.” Burstein (2022) argues that ”intellectual property should be the preferred solution to the disclosure paradox only when it is the best among alternatives.”

How does this affect the data strategy?

OpenAI captures value by offering not only access to vast data resources, but also by providing its AI models as a service through ChatGPT’s subscription-based business model. In doing so, it positions itself at the centre of an evolving AI ecosystem, becoming an underlying infrastructure layer for other companies’ innovation.

However, by protecting its models through secretive training processes and securing tight control over key technologies and partnerships, OpenAI is simultaneously moving away from the principles of an open web, building a more controlled and closed AI ecosystem. This shift carries growing risks. Companies that build on ChatGPT’s outputs may increasingly rely on AI-generated content whose origins, biases, and limitations remain opaque, making their own value creation strategies more fragile. For end users, this opacity raises critical concerns around security, fairness, and wellbeing. With little transparency into how answers are generated, it becomes harder to ensure that AI services are trustworthy and equitable.

Finally, the broader vision of an open and accessible internet for all is at stake: if key knowledge infrastructures become closed and proprietary, the web’s original democratic potential risks being undermined. As AI ecosystem strategies continue to evolve, policymakers and stakeholders must confront these risks. A key challenge going forward will be to ensure that data governance and AI regulation frameworks promote not only innovation and economic value but also openness, transparency, fairness, and public trust.

References

Lectures
Long, V. (2025). Lectures during the course AI and Data Strategy. Halmstad University.

Course literature
Bjurgren & Long, V. (2022). Artificiell intelligens data – att dela eller inte dela? [Artificial intelligence data – to share or not to share?]
Mayer-Schönberger, V. & Ramge, T. (2022). Schumpeter’s Nightmare. In: Access Rules: Freeing Data from Big Tech for a Better Future. University of California Press. Available at: https://www.jstor.org/stable/j.ctv2kx88cp.5
Gandomi, A. & Haider, M. (2014). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), pp.137–144. Available at: https://www.sciencedirect.com/science/article/pii/S0268401214001066
Teece, D.J. (1986). Profiting from technological innovation: Implications for integration, collaboration, licensing and public policy. Research Policy, 15(6), pp.285–305.
Wiebe, A. (2017). Journal of Intellectual Property Law & Practice, 12(1). Available at: https://academic.oup.com/jiplp/article/12/1/62/2593608

Internet reports
Svenskarna och AI, 2024
https://svenskarnaochinternet.se/utvalt/svenskarna-och-ai/
Internetstiftelsen
https://internetstiftelsen.se/om-oss/mer-om-oss/organisation/urkund-och-stadgar/

Internet articles
Coursera – What is OpenAI, 2025
https://www.coursera.org/articles/what-is-openai
Stanford, 2023
https://hai.stanford.edu/news/ecosystem-graphs-social-footprint-foundation-models
Shreblog, 2023
https://srheblog.com/2023/09/18/fair-use-or-copyright-infringement-what-academic-researchers-need-to-know-about-chatgpt-prompts/

Intellectual property helpdesk, 2023
https://intellectual-property-helpdesk.ec.europa.eu/news-events/news/intellectual-property-chatgpt-2023-02-20en

Mashable, 2024
https://mashable.com/article/all-the-media-companies-that-have-licensing-deals-with-openai-so-far

Microsoft, 2023
https://blogs.microsoft.com/blog/2023/01/23/microsoftandopenaiextendpartnership/
Reuters, 2025
https://www.reuters.com/business/openai-negotiates-with-microsoft-unlock-new-funding-future-ipo-ft-reports-2025-05-11/

Pipeline Capital Group
https://pipeline.capital/openai-and-investment-strategyhow-the-creator-of-chatgpt-is-building-a-generative-ai-ecosystem/

Reuters, 2023
https://www.reuters.com/business/media-telecom/global-news-publisher-axel-springer-partners-with-openai-landmark-deal-2023-12-13/

Fastcompany, 2024
https://www.fastcompany.com/91130785/companies-reddit-news-corp-deals-openai-train-chatgpt-partnerships

Guardian, 2025
https://www.theguardian.com/technology/2025/jan/29/openai-chatgpt-deepseek-china-us-ai-models

OpenAI, 2024
https://openai.com/index/openai-and-apple-announce-partnership/
CNBC, 2025
https://www.cnbc.com/2025/03/10/openai-to-pay-coreweave-11point9-billion-over-five-years-for-ai-tech.html

During the preparation of this essay, I have partly utilised an AI tool, specifically OpenAI’s ChatGPT. The AI tool has served as a dialogue partner with whom I discussed the overall structure, received text feedback and improved and refined my English writing. The objective of employing AI was to test and reflect upon how this technology can serve as support in academic research and writing. The text reflects my own thinking.

This text was produced as part of the course AI Data Strategy, completed June 2025. Data strategy is a new skill added to my portfolio.
