It may sound ominous, but dark data is not an emissary of evil. It’s a scary name for a common problem: the accumulation of ROT (redundant, obsolete, and trivial information) that clogs up an organization’s servers. Corporations and law firms that have defaulted to a “keep everything” culture in the face of complex laws and client requirements are especially vulnerable.

But while dark data might be mundane, it isn’t harmless. Not only can it create privacy and cybersecurity problems if an AI tool inadvertently exposes sensitive material or misuses intellectual property, it can also sabotage AI implementation from the start by feeding these tools biased or unreliable information.

Here’s what to know about dark data—and how to ensure it won’t become a problem for your organization’s AI initiatives.

AI Heightens Dark Data Risks

Enterprises eager to leverage the efficiencies of AI can open themselves up to significant risks if they don’t effectively manage their dark data.

Why? Well, when AI platforms are set loose to index large internal data sets that include outdated, redundant, or improperly classified information, they can end up producing incorrect answers or revealing confidential materials. Some organizations or clients, for example, may mandate that none of their information be fed into an AI model, meaning all of their files would need to be off limits to any integrated AI tools.

Additionally, if users take these outputs at face value, even as an experiment or ideation exercise, biased or even blatantly wrong information could make its way into financial forecasts, legal filings, or other work products—with serious professional consequences.

Shedding Light on Dark Data

Many organizations have already put policies in place to prohibit employees from inputting client information of any kind—especially anything sensitive—into public AI tools like ChatGPT. While this is an important step, it isn’t enough.

Those looking to adopt AI tools and improve their cybersecurity should start with a readiness assessment to determine if they have policies in place to prevent the accumulation of dark data. For instance:

  • Are there retention schedules that account for the organization’s or clients’ data management rules and legal best practices?
  • Are auto-generated files like meeting transcripts or recordings set to auto-delete on a schedule? (A minimal retention-sweep sketch follows this list.)
  • Can you identify and delete duplicates? (See the duplicate-finder sketch below.) This alone could free up massive amounts of storage, shrink the attack surface for hackers, reduce IT costs, and prevent AI tools from drawing on outdated information.
  • Do you have guidelines for when and why to delete old files? Are they communicated across your firm or company? Most importantly, do employees follow them?
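To make the auto-delete point concrete, here is a minimal Python sketch of a scheduled retention sweep. The /shared/transcripts path and the one-year window are illustrative assumptions, as is using file modification time as a proxy for age; a real sweep would also have to honor legal holds and client-specific retention rules before removing anything.

    import time
    from pathlib import Path

    def retention_sweep(root: Path, retention_days: int, dry_run: bool = True) -> None:
        """List (or delete) files whose last modification falls outside the retention window."""
        cutoff = time.time() - retention_days * 86_400  # 86,400 seconds per day
        for path in root.rglob("*"):
            if path.is_file() and path.stat().st_mtime < cutoff:
                if dry_run:
                    print(f"would delete: {path}")
                else:
                    path.unlink()

    # Placeholder path and one-year window; run as a dry run first and review
    # the output against legal holds before deleting for real.
    retention_sweep(Path("/shared/transcripts"), retention_days=365)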
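And here is a companion sketch for the duplicate question: it groups files by SHA-256 content hash and reports any hash shared by more than one file. The /shared/drive path is a placeholder; a production deduplication tool would add size pre-filtering, error handling, and a human review step before anything is deleted.

    import hashlib
    from collections import defaultdict
    from pathlib import Path

    def file_digest(path: Path) -> str:
        """Hash a file's contents in 1 MB chunks so large files don't exhaust memory."""
        sha = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                sha.update(chunk)
        return sha.hexdigest()

    def find_duplicates(root: Path) -> dict[str, list[Path]]:
        """Group every file under `root` by content hash; keep groups with 2+ members."""
        groups: dict[str, list[Path]] = defaultdict(list)
        for path in root.rglob("*"):
            if path.is_file():
                groups[file_digest(path)].append(path)
        return {h: paths for h, paths in groups.items() if len(paths) > 1}

    if __name__ == "__main__":
        # "/shared/drive" is a placeholder; point this at the share you want to audit.
        for digest, paths in find_duplicates(Path("/shared/drive")).items():
            print(f"{len(paths)} identical copies ({digest[:12]}):")
            for p in paths:
                print(f"  {p}")

Hashing full contents avoids the false positives you get from matching names or sizes alone; the tradeoff is one full read per file, which is usually acceptable for a periodic audit.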

At the same time, organizations should treat AI like any other technology product by running it through IT and compliance checks before implementation. They should consider:

  • The role AI would play internally, the goals it is meant to achieve, and the data it needs to accomplish these goals.
  • Whether client/outside counsel guidelines or regulations regarding data management and protection, including retention requirements and ethical walls, may factor into AI’s effectiveness.
  • How AI technologies integrate with internal systems, and what security procedures the vendor follows.
  • Which third parties might have access to information inputs or outputs, and what they would do with that data.

Creating an organization-wide AI governance framework that addresses questions like these and establishes processes to follow is critical. Done right, it can make dark data a thing of the past—and empower organizations to get the most out of their AI.