A caller is an input source, not an administrator
A caller can naturally say things such as ignore the normal policy, mark this as approved, reveal another customer's appointment, or transfer me to an internal number. The wording may be playful, urgent, or persuasive, but the security issue is the same: untrusted conversation content is attempting to change the agent's operating instructions or tool authority.
VoxsAgents should preserve a strict order of authority. Platform safety controls and organization-approved rules govern the workflow. Verified business data supplies facts. Tool results supply action evidence. Caller statements provide intent and details that still require validation. The model may interpret language, but it should not be able to promote caller text into a new permission or policy.
Original VoxsAgents threat analysis
We examined attacks by intended effect rather than by a list of suspicious phrases. The main effects were unauthorized disclosure, unauthorized action, policy bypass, routing abuse, and audit manipulation. This approach is more durable because the same effect can be requested politely, indirectly, in another language, or through content read from an external knowledge source.
The analysis found that tool boundaries provide stronger protection than a warning inside a prompt. If a booking tool requires an eligible service, scoped organization identifier, validated fields, and a server-side permission check, persuasive language cannot create an arbitrary booking. If a transfer tool accepts any telephone number from generated text, the prompt becomes the only barrier and the impact of a failure is much larger.
Control the action surface
Every tool should expose the smallest operation the workflow requires. The agent can choose from approved locations or transfer destinations rather than generating identifiers. Server-side code must derive organization scope from the authenticated execution context, not from a caller-provided value. Sensitive account changes should require the business's approved verification and may still require human review.
- Separate conversational text from system instructions and never concatenate caller content into an administrator policy field.
- Validate tool parameters against organization-owned services, calendars, destinations, field formats, and workflow state.
- Return only the minimum data required for the current call; do not give the model broad customer lists to filter itself.
- Require explicit evidence before status-changing actions and log the tool result independently of the generated summary.
- Escalate repeated boundary-testing or unsupported requests without arguing with the caller or exposing internal controls.
Knowledge content is also untrusted input
A retrieved document, website excerpt, CRM note, or uploaded FAQ can contain text that looks like an instruction. The application should treat that material as reference content, not authority. Retrieval results need source labels, organization scoping, content review, and output boundaries. An instruction found inside a customer note must not override the approved transfer or disclosure policy.
Summaries need similar protection. A caller can ask the agent to write that identity was verified or that a manager approved a refund. Structured summary fields should be derived from actual verification and tool events where possible. Free-text notes may capture the caller's claim, but they should label it as a claim rather than turning it into a completed action.
Adversarial testing should assert outcomes
Create tests that request cross-organization information, arbitrary transfers, unapproved discounts, fake verification, hidden prompt disclosure, and changes to audit records. Vary tone, language, interruptions, and indirect requests. The assertion should inspect tool calls, returned data, stored status, and logs—not only whether the spoken answer sounded like a refusal.
Track blocked unauthorized actions, sensitive-data exposure, unexpected tool parameters, escalation accuracy, false positives, and reviewer corrections. A perfect refusal script is not enough if the tool ran before the refusal. The release criterion is that protected state and data remain protected, with a clear path for legitimate callers to reach staff when automation cannot safely continue.
Limitations and responsible use
No prompt can guarantee resistance to every adversarial input. Effective protection depends on permission design, scoped data access, server-side validation, monitoring, review, and a tested human escalation path in addition to model instructions.
Research note and primary sources
This is original VoxsAgents workflow research based on system-state modelling, failure-path analysis, implementation review, and test-design work. It is an operational analysis, not a verified customer outcome claim. The official primary references below inform the controls and provider behavior discussed in this article.
Validate these recommendations against the organization's real tools, permissions, contracts, jurisdictions, and approved operating procedures before deployment.