Governing Claude Code in the Enterprise: What Law Firms and Professional Services Organizations Need to Know Before They Deploy

In the typically slow-paced world of enterprise software adoption, a shift has rapidly taken place that most leadership teams have not fully registered. AI-assisted development tools have moved from experiment to infrastructure, and Claude Code, Anthropic’s agentic command-line coding assistant, is now being requested and deployed across professional services organizations including law firms, often before anyone in a position of accountability fully understands what they have approved.

This matters because Claude Code is not a chat interface. It is an agentic system with file system access, shell command execution rights, and direct CI/CD pipeline integration. When a developer runs Claude Code, they are not asking a model a question. They are delegating a sequence of autonomous actions to software that can read files, write files, execute commands, and push changes on a machine that, in a law firm, may sit adjacent to privileged client material.

We write this from two complementary vantage points. One of us (Kavya) approaches this as an AI security governance and risk practitioner, drawing on enterprise agentic AI assessments and contributions to the AARM Specification for securing AI-driven actions at runtime. The other (Ivan) approaches it from Datasaur, which builds private, secure LLMs and agentic AI workflows for regulated enterprises including law firms, with a direct line of sight into how organizations actually consume AI infrastructure, where token costs accumulate invisibly, and what the difference looks like between AI deployed securely behind the firewall and AI deployed through public APIs without adequate controls.

What follows are four governance gaps that organizations deploying Claude Code are discovering too late, along with a practical path that does not require choosing between adoption and accountability.

The Nomenclature Problem: You Did Not Approve What You Think You Approved

Many organizations approving Claude Code deployments believe they are approving something in the same category as Claude Cowork or ChatGPT Enterprise. They are not. Those products are productivity interfaces: a person types, a model responds, and the blast radius of a bad output is a bad paragraph. Claude Code is a developer tool that executes commands on a machine with access to files, credentials, and pipelines. The blast radius of a bad output is a bad action.

This is not a pedantic distinction. It determines the entire threat model. A chat interface’s primary risks are data leakage through prompts and over-reliance on flawed answers. An agentic coding tool’s risks include arbitrary command execution, credential exposure, supply chain contamination, and exfiltration channels that never pass through a human’s eyes. When a procurement committee evaluates “Claude” as a single product category, it produces governance frameworks designed for the wrong threat model and deployment decisions made without the security review an agentic system actually requires.

The fix begins with language. Approval workflows should distinguish conversational AI (chatbots) from agentic AI as separate procurement categories with separate review tracks. If the tool can take actions rather than merely generate text, it belongs in the second track, alongside the access reviews, sandboxing requirements, and runtime monitoring that any other piece of software with shell access would face.

The Cost Problem Nobody Anticipated

Token consumption in agentic coding workflows scales non-linearly, in ways that differ fundamentally from chat-based AI use. A chat session consumes tokens roughly in proportion to the conversation. An agentic coding session consumes tokens in proportion to the work, and every action an agent takes incorporates planning, execution, and verification. Before the agent writes a line of code, it reads files and reasons through a multi-step plan. Then it executes: running commands, ingesting their output, retrying after failed tests. Then it verifies its own work, re-reading what it changed and checking the result. Each of those phases consumes tokens, and each revision of the plan mid-flight repeats the cycle. A single complex task can silently consume what a chat user would generate in a month. Organizations that approved Claude Code on per-seat licensing assumptions are finding actual spend orders of magnitude higher once agentic workflows run at scale. The failure mode is not hypothetical: one enterprise reportedly burned through $500 million in Claude usage in a single month after failing to set usage limits on employee licenses.
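One way to see why the growth is non-linear is a toy model of an agent loop. The sketch below assumes that every step re-sends the whole accumulated context and appends a fixed amount of new material (tool output, reasoning). Both assumptions are simplifications, and the function name and the numbers are illustrative, not measurements of Claude Code.

```python
def total_input_tokens(steps: int, base_context: int, added_per_step: int) -> int:
    """Total input tokens when every step re-reads the full accumulated context.

    Illustrative model only: real agents may cache, truncate, or summarize context.
    """
    total = 0
    context = base_context
    for _ in range(steps):
        total += context            # the model reads the whole context on this step
        context += added_per_step   # tool output and reasoning are appended for the next step
    return total


if __name__ == "__main__":
    # Hypothetical figures: a 2,000-token starting context, 1,500 tokens added per step.
    for steps in (1, 10, 50):
        print(steps, total_input_tokens(steps, base_context=2_000, added_per_step=1_500))
```

Under these assumptions, total input grows roughly with the square of the number of steps, which is why a long-running task can dwarf a chat session of similar apparent length.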

There is a second dynamic that compounds the first: chatbots are opt-in technologies, while agents are opt-out. A chatbot only costs money when an employee remembers to open the shiny new AI tool, which is why chat-based deployments routinely plateau around 15% of the workforce actually using them. Agents invert that. Once wired into pipelines, schedulers, and CI/CD, they run 24/7 whether or not anyone is watching, racking up costs incessantly and pushing effective adoption from 15% to 100%. The budget assumptions built for a tool people might use do not survive contact with a tool that is always running.

The organizational failure modes that allow this are predictable. Finance sees an AI line item that looks like SaaS seats. Engineering sees a developer tool whose consumption is invisible until the invoice. Nobody owns the meter. Costs compound silently because no single function has both the visibility and the authority to intervene.

Datasaur’s experience deploying private AI infrastructure for regulated industries reveals a consistent pattern: organizations that retain control of their AI deployment architecture, running models behind the firewall rather than routing sensitive data through public APIs, have dramatically better visibility into consumption, cost, and data exposure simultaneously. This is not a coincidence. The same architectural choice that keeps privileged data inside the perimeter also puts the metering, logging, and budget controls inside the perimeter. Sovereignty over the deployment is sovereignty over the bill.

Before deployment, not after the first invoice shock, organizations should establish per-team and per-project token budgets with hard alerts, attribute consumption to specific workflows so anomalies are visible within days rather than billing cycles, and decide deliberately which workloads justify frontier-model API pricing and which belong on self-hosted or smaller models.
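A per-team budget with workflow attribution and a hard stop can be sketched in a few lines. This is a minimal illustration, not a billing integration: the `TokenBudget` class, the 80% alert threshold, and the workflow names are all hypothetical, and in practice the usage figures would come from whatever metering the deployment exposes.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TokenBudget:
    limit: int                      # hard cap for the period, in tokens (assumed unit)
    alert_fraction: float = 0.8     # assumed threshold at which someone is notified
    used: int = 0
    by_workflow: dict = field(default_factory=lambda: defaultdict(int))

    def record(self, workflow: str, tokens: int) -> str:
        """Attribute usage to a workflow and return 'ok', 'alert', or 'blocked'."""
        if self.used + tokens > self.limit:
            return "blocked"        # hard stop: the request is refused and not charged
        self.used += tokens
        self.by_workflow[workflow] += tokens
        if self.used >= self.alert_fraction * self.limit:
            return "alert"
        return "ok"


if __name__ == "__main__":
    budget = TokenBudget(limit=1_000_000)
    print(budget.record("ci-review", 500_000))
    print(budget.record("nightly-refactor", 400_000))
    print(budget.record("nightly-refactor", 200_000))
    print(dict(budget.by_workflow))
```

The design choice worth noting is that attribution and enforcement live in the same object: the function that owns the meter can see which workflow is responsible at the moment it decides whether to intervene.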

The Code Review Impossibility Problem

Here is an uncomfortable truth that security leaders at firms deploying Claude Code must internalize: the volume of AI-generated code in an active deployment makes comprehensive human review structurally impossible. A security team that insists on reviewing every AI-generated line will either create a bottleneck that kills adoption or, far more dangerous, find that “review” becomes nominal rather than genuine, a rubber stamp that provides the appearance of control without its substance.

The right response is not more review. It is smarter triage, anchored in two non-negotiable controls.

The first is AI code attribution tagging in version control. Every commit containing AI-generated or AI-assisted code should be tagged as such, at the commit level, automatically. Without attribution, an organization cannot weight its review effort toward the code that carries elevated risk, cannot respond forensically when a vulnerability is discovered (“which of our 400 repositories contain code generated during the window when this attack pattern was active?”), and cannot satisfy emerging disclosure requirements.
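Commit-level attribution can be checked mechanically if the tag lives in a commit-message trailer. The sketch below assumes two conventions that are not a standard: a `Co-Authored-By` trailer naming Claude, and a hypothetical `AI-Assisted: true` trailer. Adjust the patterns to whatever your tooling actually emits.

```python
import re

# Assumed trailer conventions, not a standard. Replace with your organization's own.
AI_TRAILER_PATTERNS = [
    re.compile(r"^co-authored-by:.*\bclaude\b", re.IGNORECASE),
    re.compile(r"^ai-assisted:\s*(true|yes)\s*$", re.IGNORECASE),
]


def is_ai_attributed(commit_message: str) -> bool:
    """Return True if the final paragraph of the message carries an AI trailer."""
    paragraphs = [p for p in commit_message.strip().split("\n\n") if p.strip()]
    if not paragraphs:
        return False
    trailer_lines = paragraphs[-1].splitlines()
    return any(
        pattern.match(line.strip())
        for line in trailer_lines
        for pattern in AI_TRAILER_PATTERNS
    )


def attributed_commits(messages_by_sha: dict[str, str]) -> list[str]:
    """Return the commit identifiers whose messages carry an AI trailer."""
    return sorted(sha for sha, msg in messages_by_sha.items() if is_ai_attributed(msg))
```

Fed with the commit messages from a repository's history, `attributed_commits` gives the starting point for the forensic question above: which commits in a given window were AI-assisted.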

The second is SBOM generation for every AI-assisted build. A software bill of materials is the only mechanism by which an organization can answer, quickly and authoritatively, what is actually inside the software it ships, including dependencies an AI agent introduced that no human deliberately chose.
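To illustrate how an SBOM surfaces dependencies nobody deliberately chose, the sketch below compares two SBOMs and lists components present in the current build but not the previous one. It assumes CycloneDX-style JSON with a top-level `components` array whose entries have `name` and `version` fields; the function names and sample data are hypothetical.

```python
import json


def component_keys(sbom_json: str) -> set[tuple[str, str]]:
    """Extract (name, version) pairs from a CycloneDX-style JSON document."""
    doc = json.loads(sbom_json)
    return {(c["name"], c.get("version", "")) for c in doc.get("components", [])}


def new_components(previous: str, current: str) -> list[tuple[str, str]]:
    """Return components that appear in the current SBOM but not in the previous one."""
    return sorted(component_keys(current) - component_keys(previous))


if __name__ == "__main__":
    before = json.dumps({"components": [{"name": "requests", "version": "2.32.0"}]})
    after = json.dumps({"components": [
        {"name": "requests", "version": "2.32.0"},
        {"name": "left-pad-ish", "version": "0.0.1"},  # made-up package name
    ]})
    print(new_components(before, after))
```

Run on every AI-assisted build, a diff like this turns "what did the agent add?" into a reviewable list rather than an archaeology exercise.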

The documented record already shows what happens when system-level controls are absent. CVE-2025-55284, assigned a CVSS 7.1 High Severity rating, demonstrated that Claude Code could be hijacked via indirect prompt injection to exfiltrate secrets over DNS: a malicious instruction hidden in a source file caused the agent to read sensitive files like .env and silently encode their contents as subdomains in outbound DNS queries, using auto-approved utilities such as ping and dig. No human confirmation. No alert. Anthropic patched it by tightening the command allowlist, but the lesson is not about one CVE. It is that the attack surface of an agentic tool includes every piece of text the agent reads, and that perimeter cannot be policed by human code review.
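Because this class of attack encodes data into DNS query names, one system-level control is to inspect outbound queries for labels that look like encoded payloads. The heuristic below is a rough sketch under stated assumptions: the length and entropy thresholds are illustrative guesses, not tuned values, and a production detector would need allowlists and real traffic baselines.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def looks_like_exfiltration(
    qname: str,
    max_label_len: int = 40,         # assumed threshold; DNS itself allows up to 63
    min_len: int = 20,               # only score labels long enough to carry data
    entropy_threshold: float = 3.5,  # assumed threshold in bits per character
) -> bool:
    """Flag a DNS query name if any label is unusually long or random-looking."""
    for label in qname.rstrip(".").split("."):
        if len(label) > max_label_len:
            return True
        if len(label) >= min_len and shannon_entropy(label) >= entropy_threshold:
            return True
    return False


if __name__ == "__main__":
    print(looks_like_exfiltration("www.example.com"))
    print(looks_like_exfiltration("0123456789abcdef0123456789abcdef.attacker.example"))
```

A check like this runs at the resolver or egress layer, which is the point: it does not depend on a human ever reading the text that carried the injected instruction.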

For law firms specifically, the exposure surface extends further than most security teams realize. A developer whose machine holds an authenticated iManage session and has Claude Code installed has, without any additional configuration, created a potential pathway between the firm’s entire document estate and an agentic tool with shell execution rights. The MCP integration capability that makes Claude Code useful in connected development environments is the same capability that makes credential and session exposure a live risk, not a theoretical one.

The March 2026 “Claudy Day” disclosure affecting Claude.ai reinforced the point at the platform level. Three flaws (invisible prompt injection via URL parameters, an exfiltration channel through the Files API, and an open redirect) were chained into a complete attack pipeline against users’ conversation history and memory. For a law firm, that history may contain matter strategy, client identities, and privileged analysis.

In plain terms: Claude’s execution environment trusts Anthropic’s own file storage infrastructure by design. An attacker who embeds an instruction in a URL parameter can exploit that trusted channel, instructing Claude to collect sensitive information from the user’s conversation history, write it to a file, and silently upload it to an account the attacker controls. The user sees nothing. No alert fires. The data is gone.

The Pressure Problem: The CISO and the General Counsel Are Holding It Together

The business case for Claude Code is real, and the pressure to move fast is legitimate. Engineering leaders are not wrong that agentic coding tools deliver step-change productivity. This is precisely what makes the governance challenge acute: security and legal leaders who respond with blanket caution, without offering a governed path forward, will lose the argument and the tool will be deployed anyway, without controls, often as shadow IT on personal accounts where no controls exist at all.

A CISO’s Playbook: Four Controls That Actually Contain Claude Code

1. Classify it correctly. If Claude Code was approved as a productivity application, the security review it deserves has not happened. Restart that review now.

2. Contain what it can touch. Sandboxed environments only. .claudeignore policy covering credentials and client matter directories. No --dangerously-skip-permissions in CI/CD. For law firms: a developer with an authenticated iManage session and Claude Code installed has already created a potential pathway to the firm’s document estate.

3. Build the forensic trail. AI attribution tagging in every commit. SBOM on every AI-assisted build. Without these, you cannot answer the question that follows a bad deployment: was that code AI-generated?

4. Own the meter. Token budgets with hard alerts before deployment, not after the invoice. Private deployment keeps metering, logging, and budget controls inside the perimeter alongside the privileged data.
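The rule against --dangerously-skip-permissions in CI/CD is one of the few items on this list that can be enforced with a simple automated check. The sketch below scans pipeline definitions, passed in as a mapping of path to file contents, and reports lines that use the flag. The function name and sample paths are hypothetical, and the comment-skipping logic assumes `#`-style comments as used in YAML and shell.

```python
FORBIDDEN_FLAG = "--dangerously-skip-permissions"


def find_violations(files: dict[str, str]) -> list[tuple[str, int]]:
    """Return (path, line_number) for every non-comment line that uses the flag."""
    hits = []
    for path, text in sorted(files.items()):
        for number, line in enumerate(text.splitlines(), start=1):
            stripped = line.strip()
            if stripped.startswith("#"):
                continue  # a comment mentioning the flag is not a use of it
            if FORBIDDEN_FLAG in stripped:
                hits.append((path, number))
    return hits


if __name__ == "__main__":
    pipeline_files = {
        # Hypothetical pipeline definitions; in practice, read these from the repository.
        ".github/workflows/ci.yml": "steps:\n  - run: claude -p 'fix tests' --dangerously-skip-permissions\n",
        "deploy.yml": "# never use --dangerously-skip-permissions here\nrun: make deploy\n",
    }
    for path, line_number in find_violations(pipeline_files):
        print(f"{path}:{line_number}: forbidden flag in pipeline definition")
```

Wired in as a required check on pull requests, a scan like this makes the policy a condition of merging rather than a sentence in a document.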

A General Counsel’s Checklist Before Any AI Tool Touches Client Matter

1. Make the deployment architecture decision as a legal matter. The choice between a public API and private deployment is simultaneously a privilege decision. Routing client matter content through a third-party API without adequate contractual controls invites a waiver argument from opposing counsel. Legal needs to be in that room.

2. Audit your vendor contracts. Zero-data-retention clauses and explicit no-training provisions are required for every AI tool processing matter content. If they are not in the contract, the privilege protection is not there either.

3. Close the competence gap. ABA Model Rule 1.1 requires competence, and Comment 8 to that rule extends it to the benefits and risks of relevant technology. Model Rule 1.6(c) requires reasonable efforts to prevent unauthorized disclosure. Both apply to every AI tool in the firm’s current stack. Attorney training on output verification and privilege protection is a professional responsibility obligation, not an optional extra.

4. Track the regulatory deadlines. California AB 853 covered provider obligations: August 2, 2026. Platform-level provenance obligations: January 1, 2027. EU AI Act high-risk deadline: provisionally extended to December 2027 under the May 2026 Omnibus agreement, though August 2026 remains live until formally adopted. Separately, the EU’s transparency and watermarking obligations for AI-generated content move to December 2, 2026 under the same Omnibus deal, a nearer-term date that directly affects any AI-assisted work product leaving the firm. Generic enterprise governance does not cover any of this.

The practical alternative to the binary of yes or no is a governed deployment architecture, one that satisfies security, privilege protection, cost visibility, and regulatory requirements simultaneously. In our work this looks like: agentic tools running against models deployed privately, behind the firewall or in the firm’s own cloud tenancy, so privileged material never transits a third-party API by default; sandboxed execution environments that constrain what the agent can read and run; centralized logging of agent actions for forensic and supervisory purposes; attribution and SBOM controls in the development pipeline; and metered consumption with budget enforcement. Datasaur’s private LLM deployments for regulated industries provide one concrete reference point for what this architecture looks like in practice, but the principle matters more than the vendor: control of the deployment is the control plane for everything else.

A Governance Framework: Before, During, and Between

Drawing the threads together, professional services organizations deploying Claude Code should structure governance in three phases.

Before the first developer installs it, classify the tool correctly as agentic software with execution rights, and review it accordingly. Decide the deployment architecture deliberately (public API, private cloud, or behind the firewall) with the General Counsel in the room, because that choice is also a privilege decision. Establish token budgets, sandboxing defaults, and the attribution and SBOM requirements as conditions of deployment, not aspirations after it.

Once it is running, monitor continuously: agent action logs, consumption anomalies, dependency drift in AI-assisted builds, and the vulnerability disclosures that now arrive monthly for agentic tooling. Treat each disclosed CVE in an AI coding tool as a trigger to re-examine your own configuration, because the next CVE-2025-55284 will be exploited faster than the last.

Between technology, security, and legal, structure the conversation so that adoption pressure produces accountability rather than exposure. In practice this means the CISO arrives at the adoption discussion with a governed path rather than a refusal. The sandboxing requirements, the attribution controls, and the token budgets are presented as the conditions of a yes, not as objections to a no. Engineering accepts them because they are narrowly scoped to the highest-risk behaviors rather than a blanket restriction on the tool. The General Counsel translates privilege and professional responsibility obligations into technical requirements that procurement and engineering can actually act on. The deployment architecture decision (public API versus private cloud versus behind the firewall) is made explicitly in that room, with legal present, because it is simultaneously a security decision, a cost decision, and a privilege decision. When that conversation happens before deployment rather than after an incident, the firm does not have to choose between innovation and accountability. It gets both.

Why Governance Cannot Be One and Done

There is now mathematical backing for this posture. In June 2026, NIST published a formal proof, building on Gödel’s incompleteness theorems, that no finite set of guardrails on an AI system can be universally robust against adversarial prompts. The implication is not despair; it is a model change. Security for AI systems cannot be “one and done” at procurement. It must be continuous-monitor-and-update: red teams searching for failures before adversaries do, guardrails hardened as fast as weaknesses are found, and operational resilience that limits impact when, not if, an exploit lands. That is exactly the governance model this article describes, applied to the most consequential category of AI now entering professional services firms.

The confluence of agentic capability, invisible cost dynamics, an unreviewable volume of generated code, and a uniquely demanding regulatory surface means law firms and professional services organizations cannot govern Claude Code with frameworks built for chatbots. The firms that get this right will not be the ones that said no. They will be the ones that understood what they were saying yes to, and built the architecture, the controls, and the internal conversation to say it accountably.