The Bleeding Edge: Planning for the New Data Ecosystem

Author: Ed Moyle, CISSP
Date Published: 1 March 2024

When it comes to the data in our organizations, looks can be deceiving. On the surface, it may seem as if things have not changed that much. After all, the same technologies and architectures—those that facilitate and govern data collection, data aggregation, data consumption, data spoliation, and so forth—are still running along and are, more or less, in the same state as they have been for years now.

This is a true statement from certain points of view. Data lakes, data warehouses, and business intelligence activities operate today in much the same vein as they have historically. Those of us in the risk-aware and digital-trust-enabling professions (e.g., information security, assurance, privacy, governance) know this is true because initiatives such as these tend to be highly visible in the organization. As such, they are usually an area of focus for us and a topic directly discussed and negotiated with our peers in the realms of technology and business.

That level of understanding, though, belies something that is going on in parallel, where the waters are much murkier. There are other viewpoints where the rules of what we think we know about data (and, in fact, technology more generally) are being rewritten. Specifically, we are smack in the middle of several converging trends that represent a seismic shift in our understanding of the technology ecosystem.

It is important for us to pay attention to transformations like this. They do not happen often, but when they do, they can represent a major shift in risk planning. This is true for a couple of reasons. First, certain adoption dynamics can obscure new sources of risk. Specifically, when it is easy for individuals within the enterprise to make use of a new technology, they may start using it without direct oversight from technology organizations (i.e., shadow IT). This can put us behind the eight ball: The risk exposure might not be fully understood by the user, and since we as risk professionals do not have a direct line of sight into it, the risk may not be fully understood by us, either.

For example, consider how many organizations struggled historically (and, honestly, continue to today) with shadow adoption of software as a service (SaaS). We all know that situations arise where a SaaS technology scratches a particular itch that someone in the organization has—for example, if it performs a given task that would have been more difficult to do another way. That person may decide to start using the SaaS application without realizing that this can, in the wrong circumstances, bring about risk.

In addition, new and emerging technologies impact the risk equation in organizations because they can carry with them new risk that might take some time and planning to address thoroughly. At the same time, they can offset risk that has existed historically. Recall the previous example and consider how moving to a SaaS solution for existing business applications can reduce some risk: Before, we might have needed to ensure that applications were continually updated, while now, hygiene-related tasks (such as patching) are the service provider’s problem instead of ours. On the other hand, new risk that was not there before can come to light, such as supply chain risk, availability concerns, potentially reduced transparency (e.g., into the underlying ecosystem), and vendor lock-in.

AI and ML Are Transformative

Right now, the data ecosystem is at a crossroads. Another reason why? Artificial intelligence (AI) and machine learning (ML).

Unless you have been living under a rock, you know that the past two years have been all about AI (in particular, generative AI and large language models [LLMs]) and ML. These technologies are quite literally transforming organizations. And while often we refer to the two in the same breath (as I have just done), the truth is that they are not the same, they have different risk dynamics, and adoption of each varies significantly.

Let’s start with what has received the largest share of attention in the press and trade media: generative AI, particularly LLMs. This technology has gone from being relatively niche to achieving near-universal awareness in a very short period of time. In fact, usage is so widespread that even if you do not think you are using LLMs in your organization, I would encourage you to look again. According to data from a McKinsey survey,1 usage is already near-ubiquitous. The vertical industry with the lowest adoption among those surveyed (retail and consumer goods) still had a strong majority of respondents (70 percent) cite at least some usage. In industries where adoption was more widespread (such as technology, media, and telecom), adoption was much higher, reaching as high as 88 percent.

The point being, AI usage is on fire and growing rapidly. Use cases include integration of LLMs into development tools, customer support tools, office and productivity applications, direct integration into business applications, and numerous others. The reason it is so important to keep LLMs front of mind is that the area is rife with shadow IT. It is easy to integrate LLMs into existing products (major search engines and business applications are directly incorporating LLMs into their default functionality), it is easy for individuals to adopt on their own should they so desire, and it is hard to separate the resulting traffic from normal web activity. All of this means that there are likely to be numerous interaction points where your users are already interacting with generative AI.
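To make the visibility problem concrete, consider the kind of first pass many teams reach for: checking web proxy logs against a watchlist of known generative AI endpoints. The Python sketch below is my illustration, not anything prescribed in this column; it assumes a simple CSV log format ("timestamp,user,domain"), a hypothetical file name (proxy.log), and a deliberately small, incomplete domain list.

# Illustrative sketch (an assumption-laden example, not a prescription):
# flag possible generative AI traffic in web proxy logs via a domain watchlist.
# Assumed log format: CSV lines of "timestamp,user,domain".

import csv

# Hypothetical watchlist; extend (and verify) to match your environment.
AI_DOMAINS = {
    "api.openai.com",
    "chat.openai.com",
    "gemini.google.com",
    "claude.ai",
}

def flag_ai_traffic(log_path: str) -> dict[str, int]:
    """Count requests per user to domains on the AI watchlist."""
    hits: dict[str, int] = {}
    with open(log_path, newline="") as f:
        for row in csv.reader(f):
            if len(row) < 3:
                continue  # skip malformed lines
            user, domain = row[1].strip(), row[2].strip().lower()
            if domain in AI_DOMAINS:
                hits[user] = hits.get(user, 0) + 1
    return hits

if __name__ == "__main__":
    for user, count in sorted(flag_ai_traffic("proxy.log").items()):
        print(f"{user}: {count} request(s) to known generative AI endpoints")

Note what a watchlist like this cannot see: LLM features embedded inside sanctioned search engines and business applications look like ordinary traffic to those applications, which is exactly why this problem is hard.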

This matters from a data perspective for a couple of reasons. First, we know from the responses of Apple2 and Samsung3 (and the country of Italy)4 that many organizations are concerned about users’ willingness to submit proprietary company data to a generative AI service if they believe it will help them do their jobs better. Second, there has already been at least one data breach in which details of prompts submitted by users were exposed publicly.5 Taken together, these two things ought to encourage us to remain alert for users submitting data to AI tools in ways that we do not expect.
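One mitigation pattern worth knowing about, shown below as my own minimal sketch rather than any vendor's approach, is screening prompts for obviously sensitive content before they leave the organization. The patterns here are examples only (a real data loss prevention policy would be far broader), and the "Project <Name>" naming convention is a made-up assumption.

# Illustrative sketch (assumptions labeled below): redact obvious sensitive
# tokens from text before it is sent to an external generative AI service.
# The patterns are examples only; real DLP policies are far broader.

import re

PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # Assumption: internal projects follow a "Project <Name>" convention.
    "INTERNAL_PROJECT": re.compile(r"\bProject [A-Z][a-z]+\b"),
}

def redact(prompt: str) -> str:
    """Replace matches of each pattern with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[REDACTED-{label}]", prompt)
    return prompt

if __name__ == "__main__":
    sample = "Summarize the Project Falcon roadmap and email jane.doe@example.com."
    print(redact(sample))
    # -> "Summarize the [REDACTED-INTERNAL_PROJECT] roadmap and email [REDACTED-EMAIL]."

A filter like this is a speed bump, not a guarantee; its real value is forcing the organization to decide, explicitly, what data should never reach an external AI service.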

A different challenge on the horizon (and one also potentially impacting our data ecosystems) stems from the lack of visibility that can exist within segments of the organization working on data analytics—in other words, ML efforts. Unlike LLMs, though, ML efforts are inseparable from data. Quite literally, data is the substrate upon which ML operates.

Historically, from a risk control perspective, one of the areas that has been challenging for practitioners is application security. It is an area in which many security teams do not invest heavily and one to which the skill base of many practitioners does not extend. ML is similar in many regards: It leverages skills that many in the security community do not have, it is often siloed within the organization, and it is often not directly visible to either technology teams or the broader enterprise. Entire community ecosystems exist (e.g., MLSecOps6) for the specific purpose of operationalizing ML and integrating risk and digital trust disciplines (e.g., information security and assurance) into ML efforts.

My point in raising these ideas is not to scare anybody, generate fear, uncertainty, and doubt (FUD), or even to point out new risk areas. Instead, it is to generate awareness. Understanding that these new trends may be making their way into your organization is a good and prudent first step on the path to building out architectural and business strategies for how to address them. Likewise, it can help you locate areas within the organization where folks might be doing things in a new way that could require additional scrutiny.

Endnotes

1 McKinsey, The State of AI in 2023: Generative AI’s Breakout Year, 1 August 2023, http://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year
2 Vincent, J.; “Apple Restricts Employees From Using ChatGPT Over Fear of Data Leaks,” The Verge, 19 May 2023, http://www.theverge.com/2023/5/19/23729619/apple-bans-chatgpt-openai-fears-data-leak
3 Sforza, L.; “Samsung Bans Employee Use of ChatGPT After Reported Data Leak: Report,” The Hill, 2 May 2023, http://thehill.com/business/3983581-samsung-bans-employee-use-of-chatgpt-after-reported-data-leak-report/
4 Milmo, D.; “Italy’s Privacy Watchdog Bans ChatGPT Over Data Breach Concerns,” The Guardian, 1 April 2023, http://www.theguardian.com/technology/2023/mar/31/italy-privacy-watchdog-bans-chatgpt-over-data-breach-concerns
5 Powell, O.; “OpenAI Confirms ChatGPT Data Breach,” CSHUB, 3 May 2023, http://www.cshub.com/data/news/openai-confirms-chatgpt-data-breach
6 MLSecOps, http://mlsecops.com/

Ed Moyle, CISSP

Is currently chief information security officer for Drake Software. In his years in information security, Moyle has held numerous positions including director of thought leadership and research for ISACA®, application security principal for Adaptive Biotechnologies, senior security strategist with Savvis, senior manager with CTG, and vice president and information security officer for Merrill Lynch Investment Managers. Moyle is co-author of Cryptographic Libraries for Developers and Practical Cybersecurity Architecture and a frequent contributor to the information security industry as an author, public speaker, and analyst.