The enterprise software industry has convinced itself that "AI-ready data" is achievable through enough ETL pipelines, data lakes, and schema standardization. Vendors promise that with the right tools, your messy organizational data can be transformed into pristine, semantically rich datasets that AI systems will consume effortlessly. This is fundamentally a myth—and a dangerous one.
The irony of "AI-ready data" is that making data truly usable by AI systems requires first making it properly human-readable, just in a formalized shape. But here's the deeper truth: that formalization process cannot be automated. It requires domain experts, deep contextual understanding, and an appreciation for the subtle semantics that only humans embedded in a problem space can provide.
The Seductive Promise of Automated Data Preparation
The myth goes something like this: hire data engineers, implement robust ETL processes, adopt a modern data lakehouse architecture, maybe sprinkle in some auto-ML for feature engineering, and voilà—your data becomes "AI-ready." The AI can then extract insights, make predictions, and drive business value.
This narrative ignores a fundamental reality: data doesn't carry its own meaning. Meaning emerges from context, domain knowledge, and human interpretation. A column labeled status_code in your system might mean something entirely different from status_code in mine, even if both contain integers between 200 and 500. The semantics are implicit, embedded in application logic, tribal knowledge, and the historical evolution of your systems.
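To make that concrete, here is a minimal Python sketch with invented systems and values: the same status_code column, over the same numeric range, is interpreted by two systems in incompatible ways, and nothing in the data itself reveals the mismatch.

```python
# Hypothetical illustration: two systems expose a status_code column with the
# same integer range, but the values mean entirely different things.

# System A: status_code is an HTTP response code captured by a proxy.
def interpret_status_a(status_code: int) -> str:
    return "success" if 200 <= status_code < 300 else "failure"

# System B: status_code is an internal order-state enum that happens to
# overlap the same numeric range.
ORDER_STATES_B = {200: "draft", 310: "awaiting_payment", 420: "shipped", 500: "returned"}

def interpret_status_b(status_code: int) -> str:
    return ORDER_STATES_B.get(status_code, "unknown")

# The value 200 means "success" in one system and merely "draft" in the other.
# No ETL pipeline can discover this; someone who knows both systems must say so.
print(interpret_status_a(200), interpret_status_b(200))
```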
Why Domain Experts Are Irreplaceable
Consider a real-world scenario from CDN infrastructure monitoring. You have event streams from thousands of edge servers, each emitting metrics: latency, error rates, cache hit ratios, bandwidth utilization. An automated system can normalize these into a consistent schema easily enough. But understanding what these metrics mean in context requires a domain expert.
When latency spikes on European nodes at 3 AM, is that:
A DDoS attack requiring immediate mitigation?
Expected behavior during content publication windows?
A symptom of upstream provider issues that will self-resolve?
Early warning signs of hardware degradation?
The raw metrics don't tell you. The domain expert knows that at 3 AM Central European Time, major content providers push updates. They know which upstream providers have reliability issues. They understand the difference between transient spikes and systemic problems. This contextual knowledge cannot be extracted from the data itself—it must be encoded into the data by humans who understand the domain.
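As one illustration of what "encoding context into the data" might look like, here is a sketch with assumed values (the publication window and the flaky upstream are invented): the expert's situational knowledge is written down as explicit, reviewable rules that travel with the metrics, rather than living only in the on-call engineer's head.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical sketch: expert knowledge about when a latency spike is expected
# is recorded as explicit rules instead of staying tribal knowledge.

@dataclass
class LatencySpike:
    region: str
    timestamp: datetime
    p99_ms: float

# Encoded by the on-call experts: content providers push updates around
# 3 AM CET, i.e. roughly 01:00-03:00 UTC (assumed window).
PUBLICATION_WINDOW_UTC = (1, 3)
FLAKY_UPSTREAMS = {"upstream-eu-2"}   # assumed provider known to self-resolve

def classify(spike: LatencySpike, upstream: str) -> str:
    """Return an expert-informed interpretation of a raw latency spike."""
    hour = spike.timestamp.astimezone(timezone.utc).hour
    if PUBLICATION_WINDOW_UTC[0] <= hour <= PUBLICATION_WINDOW_UTC[1]:
        return "expected: content publication window"
    if upstream in FLAKY_UPSTREAMS:
        return "likely transient: known upstream issue"
    return "escalate: no known benign explanation"

spike = LatencySpike("eu-west", datetime(2024, 3, 5, 2, 10, tzinfo=timezone.utc), 840.0)
print(classify(spike, "upstream-eu-2"))
```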
The Fallacy of Schema as Semantics
The linked data movement, particularly RDF and semantic web technologies, attempted to solve this by creating formal ontologies. The idea was compelling: if we define our concepts precisely using URIs and formal relationships, machines can reason about the data without human interpretation.
In practice, ontology creation becomes an exercise in formalized confusion. Different domain experts create subtly incompatible ontologies for the same domain. The W3C has ontologies for time, space, organizations, and provenance—each internally consistent but challenging to compose. And critically, the hardest semantic questions remain unresolved:
When are two entities the same thing? (The identity problem)
When does a concept's meaning change? (The temporal semantics problem)
How do we handle vagueness and ambiguity? (The boundary problem)
Whose interpretation is authoritative? (The perspective problem)
These aren't technical problems to be solved with better tooling. They're fundamental epistemic challenges that require human judgment calls.
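The identity problem in particular resists tooling. The sketch below, using rdflib and invented URIs, shows why: linking a CRM contact to a billing account with owl:sameAs is a single triple, but deciding whether to assert it at all is a human judgment with far-reaching logical consequences.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDFS

# Hypothetical sketch of the identity problem: two ontologies each define a
# "customer" resource. Whether they denote the same real-world entity is not
# in the data; a human has to assert (or refuse to assert) owl:sameAs.

CRM = Namespace("https://example.org/crm#")
BILLING = Namespace("https://example.org/billing#")

g = Graph()
g.bind("crm", CRM)
g.bind("billing", BILLING)

g.add((CRM.customer_42, RDFS.label, Literal("Acme GmbH (CRM contact)")))
g.add((BILLING.account_9001, RDFS.label, Literal("Acme GmbH (billing account)")))

# This single triple is a judgment call, not a computation. Once asserted,
# every property of one resource is implied to hold for the other.
g.add((CRM.customer_42, OWL.sameAs, BILLING.account_9001))

print(g.serialize(format="turtle"))
```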
Human Understanding Creates the Context Graph
What actually makes data valuable to AI systems isn't standardization or automation—it's the rich contextual knowledge graph that domain experts build around the data. This knowledge graph exists partially in formal systems (schemas, ontologies, documentation) but primarily in human understanding.
When you work in advertising identity graphs, as I did at LiveIntent, the data itself is straightforward: email addresses, device identifiers, timestamps, interaction events. But the meaning of that data lives in understanding:
Privacy regulations and their nuances across jurisdictions
The difference between explicit and inferred associations
Temporal decay of identity signals
The reliability hierarchy of different identifier types
Business rules about consent and data usage
A domain expert doesn't just know that email_hash links to device_id—they understand when that link is reliable, what it implies about user behavior, and what assumptions it's safe to make downstream. This contextual understanding is what makes the data graph useful. Without it, you have nodes and edges that are syntactically correct but semantically hollow.
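Here is a sketch of what that contextual understanding can look like once an expert writes it down, with assumed reliability weights and decay rates: the edge between email_hash and device_id carries not just the link, but how much to trust it and how quickly that trust fades.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical sketch: an identity-graph edge is only useful if it carries the
# expert-defined reliability of the link type and how quickly it goes stale.

# Assumed weights and half-lives, as a domain expert might set them.
LINK_RELIABILITY = {"authenticated_login": 0.95, "shared_cookie": 0.6, "ip_colocation": 0.2}
LINK_HALF_LIFE_DAYS = {"authenticated_login": 180, "shared_cookie": 30, "ip_colocation": 7}

@dataclass
class IdentityEdge:
    email_hash: str
    device_id: str
    link_type: str
    observed_at: datetime

    def current_confidence(self, now: datetime) -> float:
        """Base reliability of the link type, decayed by the age of the observation."""
        base = LINK_RELIABILITY[self.link_type]
        half_life = timedelta(days=LINK_HALF_LIFE_DAYS[self.link_type])
        return base * 0.5 ** ((now - self.observed_at) / half_life)

edge = IdentityEdge("a1b2c3d4", "device-123", "shared_cookie",
                    datetime(2024, 1, 1, tzinfo=timezone.utc))
print(round(edge.current_confidence(datetime(2024, 2, 1, tzinfo=timezone.utc)), 3))
```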
Why Linked Data Principles Actually Matter
Here's the paradox: linked data and semantic web principles are crucial, but not for the reasons their proponents typically claim. They don't eliminate the need for human expertise—they provide a formalism for experts to express their knowledge.
The value of linked data isn't that it makes data "machine-understandable" in some autonomous sense. The value is that it gives domain experts a rigorous language to:
Make implicit knowledge explicit - Forcing you to write down "customer in CRM system" versus "customer in billing system" versus "customer in support ticketing" reveals semantic distinctions that matter
Expose contradictions - When you try to formalize your ontology, conflicting assumptions become visible
Enable composition - Well-designed knowledge graphs let you combine insights from different domains in principled ways
Preserve provenance - Tracking who asserted what, when, and under what assumptions becomes first-class
But notice: every single one of these requires human intelligence and domain expertise to do well.
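"Expose contradictions", for instance, can be as unglamorous as writing two teams' informal definitions down as predicates and checking them against each other. The sketch below uses invented thresholds; the point is that the disagreement only becomes visible once both rules are formalized.

```python
from datetime import date

# Hypothetical sketch: two teams' informal definitions of an "active customer",
# once written down as predicates, can be compared instead of silently diverging.

def finance_active(last_invoice: date, today: date) -> bool:
    # Finance (assumed rule): invoiced within the last 90 days.
    return (today - last_invoice).days <= 90

def product_active(last_login_days_ago: int) -> bool:
    # Product (assumed rule): logged in within the last 30 days.
    return last_login_days_ago <= 30

# A customer invoiced 60 days ago who has not logged in for 45 days is
# "active" to finance and "inactive" to product. Formalizing both rules
# makes the disagreement visible and discussable.
today = date(2024, 6, 1)
print(finance_active(date(2024, 4, 2), today), product_active(45))
```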
The Self-Sovereign Knowledge Model
The future of "AI-ready" data isn't in centralized data lakes where automated processes cleanse and prepare information. It's in distributed knowledge graphs where domain experts maintain authority over their semantic domains, similar to Self-Sovereign Identity principles.
In SSI systems, individuals control their own identity claims. Others can verify and trust these claims, but the authority remains decentralized. Apply this model to knowledge:
Domain experts own their semantic spaces
They formalize their knowledge using appropriate formalisms (ontologies, type systems, schemas)
They expose verifiable claims about their data's meaning
Downstream AI systems can compose these semantic spaces, but they inherit the experts' interpretations
This isn't "AI-ready data" in the conventional sense. It's expert-curated semantic knowledge that AI systems can leverage because humans did the hard work of understanding and formalization.
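A rough sketch of what such a verifiable semantic claim could look like, using only Python's standard library. The identifier and the shared-secret HMAC are stand-ins for the decentralized identifiers and asymmetric signatures a real SSI stack would use; the point is that the claim about meaning is issued and signed by the expert team that owns that semantic space.

```python
import hashlib
import hmac
import json

# Hypothetical sketch, borrowing the shape of Self-Sovereign Identity: a domain
# expert publishes a claim about what a field means, signed with key material
# they control, so downstream consumers can verify who vouches for the semantics.

EXPERT_KEY = b"assumed-signing-secret"   # stand-in for a real signing key

claim = {
    "issuer": "did:example:cdn-ops-team",   # hypothetical identifier
    "subject": "metrics.edge_latency_ms",
    "assertion": "p99 latency measured at the edge, excluding origin fetches",
    "valid_from": "2024-01-01",
}

payload = json.dumps(claim, sort_keys=True).encode()
signature = hmac.new(EXPERT_KEY, payload, hashlib.sha256).hexdigest()

def verify(claim_payload: bytes, sig: str, key: bytes) -> bool:
    """Anyone holding the key material can check who stands behind the claim."""
    expected = hmac.new(key, claim_payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)

print(verify(payload, signature, EXPERT_KEY))
```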
Temporal Causality and Living Knowledge
One aspect that gets completely ignored in traditional "AI-ready data" discussions is that knowledge isn't static. The meaning of your data changes as your domain evolves, your business changes, and your understanding deepens.
In CDN infrastructure, the semantics of "outage" evolved significantly as we built better prediction systems. Initially, an outage was a binary state: service unavailable or not. Then we came to distinguish different outage types: partial degradation, geographic isolation, upstream provider issues. Later we developed predictive signatures, patterns in metrics that reliably preceded outages by minutes.
Each evolution required domain experts to reconceptualize the problem space and update our knowledge representations. No automated system could have discovered these semantic shifts. They emerged from humans analyzing incidents, discussing patterns, and refining mental models.
This is why temporal semantics matter in knowledge graphs. It's not enough to say "this entity has this property." You need: "as of this date, according to this expert, in this context, we believed this property held, with this confidence level." The knowledge graph needs to be temporally grounded because our understanding evolves.
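One possible shape for such a temporally grounded assertion, with invented dates and confidence values that mirror the outage example above: superseded beliefs are retained rather than overwritten, so the graph records how our understanding evolved.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical sketch of a temporally grounded assertion: the claim itself is
# only one field; who believed it, when, in what context, and how strongly are
# recorded alongside it so later revisions do not silently overwrite history.

@dataclass(frozen=True)
class Assertion:
    subject: str        # e.g. "outage"
    statement: str      # the property we believed held
    asserted_by: str    # which expert or team
    asserted_on: date   # when the belief was recorded
    context: str        # scope or conditions of the belief
    confidence: float   # 0.0 - 1.0, as judged at the time

history = [
    Assertion("outage", "an outage is a binary unavailable/available state",
              "sre-team", date(2019, 3, 1), "initial monitoring rollout", 0.9),
    Assertion("outage", "partial degradation and geographic isolation are distinct outage types",
              "sre-team", date(2021, 7, 15), "post-incident review program", 0.8),
    Assertion("outage", "certain metric signatures reliably precede outages by minutes",
              "sre-team", date(2023, 2, 10), "predictive alerting project", 0.7),
]

# "What did we believe about outages at the end of 2020?" stays answerable
# because superseded beliefs are kept, not deleted.
print([a.statement for a in history if a.asserted_on <= date(2020, 12, 31)])
```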
What Actually Works: The Practitioner's Path
Given all this, what does a realistic approach to data preparation for AI look like?
Start with domain experts mapping their mental models. Don't begin with the data. Begin with the experts who understand what the data means. Have them draw out their conceptual models, identify key entities and relationships, and articulate the implicit assumptions they use when reasoning about the domain.
Formalize incrementally, not comprehensively. Don't try to build a complete ontology upfront. Pick a specific use case, formalize just enough semantics to support it, validate with real problems, then expand. Each iteration teaches you about the domain and refines your understanding.
Embrace multiple perspectives, not universal truth. Different experts will have different mental models, and that's valuable. The finance team's understanding of "customer" differs from the product team's understanding, and both are valid. Your knowledge representation should accommodate multiple viewpoints, not force artificial consensus.
Make provenance and context first-class. Every assertion in your knowledge graph should track: who made it, when, based on what evidence, with what confidence, under what assumptions. This metadata is as important as the assertion itself.
Build feedback loops with actual use. The only reliable way to know if your semantic formalization works is to use it for real AI/ML tasks and see where it breaks. When your model makes wrong predictions or fails to capture important patterns, investigate the semantic gaps.
Invest in tooling for experts, not automation for data engineers. The bottleneck isn't moving data around or transforming schemas. It's capturing expert knowledge and maintaining semantic coherence. Build tools that help domain experts formalize their understanding, visualize knowledge graphs, and validate semantic consistency.
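As a taste of what "tooling for experts" can mean in practice, here is a minimal sketch with an invented graph structure: a lint pass that flags assertions missing the provenance and context fields described above, so semantic gaps surface before an AI system ever consumes them.

```python
# Hypothetical sketch of expert tooling: a small lint pass over a knowledge
# graph that flags assertions lacking provenance and context metadata.

REQUIRED_METADATA = ("asserted_by", "asserted_on", "evidence", "confidence")

knowledge_graph = [
    {"subject": "customer", "predicate": "linked_to", "object": "billing_account",
     "asserted_by": "billing-team", "asserted_on": "2024-05-01",
     "evidence": "invoice reconciliation audit", "confidence": 0.9},
    {"subject": "customer", "predicate": "active_in", "object": "eu-region"},  # no provenance
]

def lint(graph: list[dict]) -> list[str]:
    """Report assertions that cannot be traced back to a person, time, or evidence."""
    problems = []
    for i, assertion in enumerate(graph):
        missing = [field for field in REQUIRED_METADATA if field not in assertion]
        if missing:
            problems.append(f"assertion {i} missing: {', '.join(missing)}")
    return problems

for problem in lint(knowledge_graph):
    print(problem)
```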
The Knowledge Graph as Shared Understanding
Ultimately, what makes data valuable to AI systems is the same thing that makes it valuable to humans: shared understanding of what things mean and how they relate. A knowledge graph isn't a database with fancier relationships. It's a formalized representation of collective human understanding about a domain.
The Slavic god Veles, associated with wisdom and hidden knowledge, offers an apt metaphor here. Veles dwells in the depths, guarding knowledge that isn't immediately accessible but must be sought through understanding. Your organization's most valuable knowledge similarly exists in depths that automated tools cannot reach—in the minds of domain experts, in the nuanced understanding of context and causality, in the temporal evolution of meaning.
AI systems are powerful tools for pattern recognition and optimization, but they operate on the semantics we provide them. Garbage in, garbage out—except the "garbage" isn't bad data quality in the traditional sense. It's semantic poverty: data without adequate context, relationships without clear meaning, entities without proper grounding.
Conclusion: Expertise Over Automation
The myth of "AI-ready data" suggests that with enough processing, automation, and tooling, we can prepare data for AI consumption without deep human involvement. This is backwards. The preparation that matters—semantic enrichment, contextual grounding, temporal understanding, domain expertise—is fundamentally a human endeavor.
Linked data principles and knowledge graphs are valuable precisely because they give us rigorous formalisms for expressing human understanding. But they don't eliminate the need for that understanding. If anything, they make the requirement more explicit.
The path forward isn't better automated data preparation. It's better tools for domain experts to formalize their knowledge, better frameworks for composing semantic spaces, and better practices for maintaining living knowledge graphs that evolve as our understanding deepens.
Stop chasing the myth of AI-ready data. Start building systems where domain experts can express what they know, and AI systems can leverage that hard-won human understanding. That's the only approach that actually works.