Voice AI Can Be Hijacked by Audio Commands Hidden in Ordinary Sound

What happens when a voice AI receives a command you never gave and never heard? Researchers have shown that it can simply execute it. A recent IEEE Spectrum report covers a class of vulnerability affecting a wide range of deployed voice AI systems: audio commands hidden inside ordinary sounds that trigger real actions.

How Hidden Commands Work

The attack exploits a gap between human hearing and how machine learning models process audio. Human hearing covers roughly 20 Hz to 20,000 Hz and is subject to auditory masking, where a louder sound prevents us from perceiving a quieter one close to it in pitch or time. Voice AI models are not bound by these perceptual limits: they process whatever signal the microphone and audio pipeline deliver.

Two attack approaches have been demonstrated in security research. Ultrasonic injection carries commands on frequencies above 20,000 Hz, which humans cannot hear but many microphones still pick up; in published demonstrations, nonlinearity in the microphone hardware shifts the command back into the audible band, where the speech model treats it as ordinary speech. Psychoacoustic hiding encodes commands within normal audio at levels that human auditory masking filters out, but that pattern-matching models still detect and act on. An attacker can, in theory, play modified audio in a shared space and send a hidden command to a nearby voice-enabled device.
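To make the ultrasonic case concrete, here is a minimal Python sketch, not a production detector, that estimates how much of a short audio frame's energy sits at or above 20 kHz. The function name `ultrasonic_energy_ratio`, the 20 kHz default cutoff, and the choice of a plain DFT over a single frame are assumptions made for illustration. It also assumes the capture chain samples fast enough to represent ultrasound at all; audio sampled at rate R cannot carry content above R/2.

```python
import cmath
import math


def ultrasonic_energy_ratio(samples, sample_rate, cutoff_hz=20_000.0):
    """Return the fraction of spectral energy at or above cutoff_hz.

    `samples` is one short frame of mono audio as a sequence of floats.
    Uses a naive DFT, so keep frames small (a few hundred samples).
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")

    # A Hann window reduces spectral leakage between neighbouring bins.
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
        for i, s in enumerate(samples)
    ]

    total = 0.0
    high = 0.0
    # Skip the DC bin (k = 0) and stop at the Nyquist bin.
    for k in range(1, n // 2 + 1):
        freq = k * sample_rate / n
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        power = abs(coeff) ** 2
        total += power
        if freq >= cutoff_hz:
            high += power

    # A silent frame has no energy in any band.
    return high / total if total > 0 else 0.0
```

Fed a 384-sample frame captured at 96 kHz, a pure 1 kHz tone should give a ratio near zero, while the same tone mixed with an equally loud 25 kHz tone should give roughly one half. A check like this says nothing about psychoacoustic hiding, which stays inside the audible band, and it cannot see a command that the microphone hardware has already shifted down into that band.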

Working implementations have been documented against major voice assistant platforms, and the attack surface has grown as businesses deploy voice AI into customer service automation, meeting transcription, voice authentication, and enterprise software control.

Who's Actually at Risk

The risk depends on what the voice AI system can do with a recognized command. A consumer smart speaker has a limited blast radius. A voice-authenticated banking app, an AI phone agent that takes actions on behalf of a business, or a transcription pipeline feeding into a production database - those have real exposure.
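One way to reason about blast radius is to sort the actions a voice system can take into risk tiers and refuse to run the high-risk ones on voice input alone. The Python sketch below is a hypothetical illustration of that idea, not a description of how any particular product works; the action names, the `ACTION_RISK` table, and the `gate_command` function are all invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    HIGH = "high"


# Hypothetical mapping from recognized actions to risk tiers.
ACTION_RISK = {
    "play_music": Risk.LOW,
    "read_weather": Risk.LOW,
    "transfer_funds": Risk.HIGH,
    "unlock_door": Risk.HIGH,
}


@dataclass(frozen=True)
class Decision:
    execute: bool
    reason: str


def gate_command(action, confirmed_out_of_band=False):
    """Decide whether a recognized voice command may run.

    Unknown actions are rejected. High-risk actions run only if the
    caller reports a confirmation on a second channel (for example a
    button press or an app prompt), so audio alone cannot trigger them.
    """
    risk = ACTION_RISK.get(action)
    if risk is None:
        return Decision(False, "unknown action")
    if risk is Risk.HIGH and not confirmed_out_of_band:
        return Decision(False, "needs confirmation on a second channel")
    return Decision(True, "allowed")
```

Under this sketch, `gate_command("play_music")` is allowed, while `gate_command("transfer_funds")` is refused until `confirmed_out_of_band=True` is passed. The point is only that a hidden command's impact is bounded by whatever the gate lets through.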

For individual users, exploitation requires either controlling an audio stream the target is listening to, or being physically close enough for injected sound to reach their microphone. That's a meaningful constraint on opportunistic attacks.

For businesses deploying voice AI, the calculation is different. A customer service AI that processes phone calls could receive adversarial audio from any caller. A meeting transcription tool could have its output manipulated by someone playing modified audio into a conference room. Voice-controlled access systems can potentially be triggered by commands hidden in ambient sound.

The Gap Between Deployment and Security Testing

Voice AI deployment has moved faster than security auditing of these systems. A company that has run thorough penetration testing on its web application may have deployed an AI phone agent or voice transcription pipeline with no equivalent adversarial testing.

This fits a wider pattern in AI security: models trained to be highly sensitive to patterns can be manipulated by inputs engineered to trigger specific responses. Image classifiers get fooled by imperceptible pixel changes. Large language models - the software behind tools like ChatGPT and Claude - can be manipulated by carefully designed text inputs. Audio models follow the same logic; they just haven't received as much security attention yet.

Mitigations exist: frequency filtering to strip ultrasonic content before it reaches the model, anomaly detection on recognized commands, and physical microphone security in sensitive environments. None of these are complicated to implement. The harder problem is that most businesses deploying voice AI haven't asked whether they need them.
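As an illustration of the frequency-filtering mitigation, the following Python sketch designs a small windowed-sinc low-pass filter and applies it to a block of samples. It is a teaching example under stated assumptions rather than a recommended implementation: the function names `design_lowpass` and `apply_fir`, the 101-tap length, and the Hamming window are choices made here, and a real pipeline would more likely use an optimized DSP library.

```python
import math


def design_lowpass(cutoff_hz, sample_rate, num_taps=101):
    """Design a Hamming-windowed sinc low-pass FIR filter.

    Returns a list of `num_taps` coefficients with unity gain at DC.
    """
    if num_taps < 3 or num_taps % 2 == 0:
        raise ValueError("num_taps must be odd and at least 3")
    if not 0 < cutoff_hz < sample_rate / 2:
        raise ValueError("cutoff must lie between 0 and Nyquist")

    fc = cutoff_hz / sample_rate  # cutoff in cycles per sample
    mid = (num_taps - 1) // 2
    taps = []
    for i in range(num_taps):
        m = i - mid
        # Ideal low-pass impulse response (a sinc), centred on `mid`.
        ideal = 2 * fc if m == 0 else math.sin(2 * math.pi * fc * m) / (math.pi * m)
        # Hamming window tapers the truncated sinc to limit ripple.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(ideal * window)

    gain = sum(taps)
    return [t / gain for t in taps]


def apply_fir(samples, taps):
    """Convolve `samples` with `taps`, returning len(samples) outputs."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = n - k
            if j < 0:
                break  # no earlier input samples are available
            acc += t * samples[j]
        out.append(acc)
    return out
```

With a 96 kHz input and `design_lowpass(18_000, 96_000)`, a 1 kHz tone should pass essentially unchanged while a 25 kHz tone is strongly attenuated once the filter has filled. One caveat, offered as an assumption about typical hardware rather than a claim about any specific device: a digital filter can only remove content that is still ultrasonic when it reaches the filter, so it does not help if the microphone has already shifted the command into the audible band.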