Build Your Own AI Gaming Mate Part 3: Audio Pipeline

Introduction

This post covers the voice and control layer around the gaming mate. The main distinction here is not whether audio exists at all, but whether you self-host it. You can use built-in TTS or an online provider if that is good enough. I use a local Qwen3-TTS setup when I want tighter control over the voice itself, and I use macOS dictation plus a foot pedal to make the interaction less awkward during play.

For an overview of the full system, see Part 0: Overview.

What This Part Does

At the end of this post, you should have:

A low-friction way to speak naturally during gameplay
A clear picture of your TTS options, from built-in voices to a self-hosted local server
Hands-free shortcut control using a 3-key deck pedal
A clear understanding of when local Qwen3-TTS is worth the extra setup

TTS Options

There are a few reasonable ways to do voice output:

Discord’s built-in TTS if you want the fastest path with zero configuration
An online TTS provider if you want better voice quality with less local setup
mlx-audio running Qwen3-TTS locally if you want a self-hosted voice on Apple Silicon

The reason to run mlx-audio locally is not that the other options are unusable. It is that local hosting gives you lower latency and more control over the voice without any recurring cost or privacy tradeoff.

Why This Setup

The combination of macOS dictation, Discord, and a foot pedal looks unconventional, but each piece earns its place.

macOS dictation is built in. It requires no additional installation and already handles a wide vocabulary reasonably well. The downside is that it needs a keyboard shortcut to activate, which is why the foot pedal matters.

Discord is the interaction transport. It gives me a typed message thread where I can see the exact words the agent received, which makes it easy to spot bad transcriptions. It also means the input path is the same whether I am typing or speaking: a Discord message goes to the agent either way. Keeping the transport uniform makes the rest of the agent logic simpler.

The deck pedal removes the last awkward moment in the loop. Without it, activating dictation means taking a hand off the controller or keyboard. With it, I press a pedal with my foot, speak, and press again to send — without touching anything else.

That combination is why this setup feels usable during real gameplay rather than just during a demo. None of the three pieces alone is enough. Together, they make the voice input path feel as fast and low-friction as typing.

Architecture for This Step

Audio input
  -> press pedal key 1 (double-press left Command to activate macOS dictation)
  -> speak
  -> press pedal key 2 (Return to send Discord message)
  -> Discord message
  -> local agent

Audio output
  -> local agent text
  -> Discord built-in TTS, online TTS, or Qwen3-TTS via mlx-audio
  -> press pedal key 3 (mouse click via cliclick to play Discord TTS audio)
  -> spoken reply through 3.5mm cable into Windows line-in

Prerequisites

macOS dictation:

macOS Ventura or later
Enable dictation in System Settings → Keyboard → Dictation
Set the shortcut to pressing the left Command key twice
Enable “Auto-punctuation” if you want cleaner transcriptions
Microphone access must be granted to the system
macOS handles only Discord, browser, and OpenClaw; the game itself runs on Windows

Discord:

A Discord server and channel dedicated to the gaming mate session
The agent connected to that channel (covered in previous posts)
Discord desktop app, not the browser version — focus behavior is more predictable
Discord’s built-in TTS enabled (via the /tts command or a bot that sends TTS messages)

TTS (online provider path):

An API key from your chosen provider (ElevenLabs, OpenAI TTS, or similar)
Network access during play

TTS (local Qwen3-TTS path):

An Apple Silicon Mac with enough memory to run the model
mlx-audio installed via pip — the upstream package already exposes an OpenAI-compatible TTS API

pip install mlx-audio

Qwen3-TTS weights downloaded through mlx-audio
(Optional) A fork of mlx-audio with voice upload and voice-prompt caching for better performance. It works like the vLLM Qwen3-TTS serving example: the reference audio is uploaded once and reused across all requests instead of being re-encoded on every call

pip install git+https://github.com/hemslo/mlx-audio.git@save-voice

Foot pedal:

A 3-key USB deck pedal (any programmable HID pedal will work)
The pedal driver or macOS keyboard shortcut mapping tool
cliclick installed on macOS (used via an Automator application for the click action)

brew install cliclick

Audio routing:

A 3.5mm audio cable connecting the macOS headphone output to the Windows PC line-in port
Windows built-in volume mixer to balance game audio and the incoming macOS TTS audio

Input Path

The voice input side is a three-step loop: activate dictation, speak, send.

Why Discord Is Still in the Loop

Even with voice input enabled, Discord stays as the transport. The agent receives a typed message regardless of whether you dictated or typed it. That means you get a visible transcript of everything you said, which is useful for two reasons.

First, you can see exactly what the agent received. If it gives an odd response, you can check whether the transcript is correct before assuming the agent is wrong. Second, you can scroll back through the session and see the full conversation, which is easier to review than an audio log.

Push-to-Activate Behavior

macOS dictation activates when the left Command key is pressed twice in quick succession. macOS runs only Discord, the browser, and OpenClaw, so there is very little risk of a shortcut collision.

The activation shortcut puts the dictation overlay in whatever app currently has focus. If Discord is in the background, the shortcut will activate dictation in the wrong app. Solve this by keeping Discord always in focus on macOS, or by including a click on the Discord window as the first step of the pedal macro.
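
The second option, bringing Discord to the front before dictation starts, can be sketched as a short script. This is a minimal sketch, assuming the desktop app is registered under the application name "Discord"; the function names are hypothetical, and the script only changes focus. It does not trigger dictation itself.

```python
import subprocess

# AppleScript one-liner that brings the Discord desktop app to the foreground.
# Assumption: the app is installed under the name "Discord".
ACTIVATE_DISCORD = 'tell application "Discord" to activate'


def focus_discord_command() -> list[str]:
    """Return the osascript command that gives Discord keyboard focus."""
    return ["osascript", "-e", ACTIVATE_DISCORD]


def focus_discord() -> None:
    """Run the command. Only meaningful on macOS with Discord installed."""
    subprocess.run(focus_discord_command(), check=True)
```

Calling focus_discord() as the first step of the left-pedal macro means the dictation overlay should open in Discord even if another window was in front.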

Message Length

Dictation handles both short and long inputs well. For quick remarks, one sentence is enough. For longer questions or complex situations, dictate the full message — there is no need to switch to typing. The foot pedal workflow supports whatever length you need.

Recovering From Bad Transcription

When dictation produces something garbled, the simplest fix is to say “ignore that” and repeat the message. The agent handles it cleanly because it treats the conversation as a dialogue, not a command sequence. You can also type a correction directly into Discord if the error is significant.

Game-specific terms cause the most trouble. Hero names, ability names, and place names that are not common English words will often come out wrong. Build up a short list of corrections for the terms you use most and fix them in the message box before you send.
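
If the same terms keep coming out wrong, the correction list can also be applied in code. The sketch below is not a feature of macOS dictation, Discord, or OpenClaw. It assumes you have some hook where the message text passes through before the agent sees it, and the sample entries and function name are hypothetical.

```python
import re

# Hypothetical examples: misheard phrase -> intended game term.
CORRECTIONS = {
    "row shan": "Roshan",
    "kings row": "King's Row",
}


def apply_corrections(text: str, corrections: dict[str, str]) -> str:
    """Replace each misheard phrase with its intended term, ignoring case."""
    for wrong, right in corrections.items():
        # \b keeps the match on whole words so substrings are left alone.
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement avoids treating backslashes in `right` as escapes.
        text = pattern.sub(lambda _match, r=right: r, text)
    return text
```

Keeping the list short matters: every entry is a phrase that will be rewritten wherever it appears, so common English phrases make poor keys.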

Staying Usable During Intense Gameplay

The pedal workflow interrupts only for the moment you are speaking. One press on the left pedal starts dictation, and one press on the middle pedal sends the message. After that, you are back in the game before the agent starts replying.

During intense moments, it is fine to say nothing. The agent does not interrupt unless you speak first. Save the conversation for breaks between fights, loading screens, and menu navigation.

Output Path

The voice output side is where the tradeoffs between convenience and control are most visible.

When Built-in TTS Is Enough

Discord has a built-in TTS feature. Any message sent with the /tts command, or sent by a bot that uses TTS output, will be read aloud by Discord using the system voice. This requires zero additional setup: it works out of the box once Discord TTS is enabled in user settings under Notifications.

The voice quality is functional but generic. It sounds like a system utility rather than a specific character. For validating the end-to-end loop and getting a feel for the interaction rhythm, Discord TTS is more than sufficient.

The main limitation is control. You cannot change the voice style, pace, or persona through Discord TTS. For longer play sessions, the generic voice becomes noticeable. That is when moving to an online provider or local Qwen3-TTS makes sense.

When an Online Provider Is a Better Tradeoff

An online TTS provider gives you better voice quality with a small amount of latency and a recurring cost. ElevenLabs and OpenAI TTS are both easy to integrate. The voice library is wide, and you can find one that suits the companion personality you want.

The tradeoff is privacy and latency. Every reply text is sent to the provider’s API before it is spoken. Network latency adds roughly 300–800 ms depending on the provider and region. For casual conversation this is not noticeable, but for short reactive replies it can feel slow.

If the extra latency is acceptable and you do not want the local setup cost, an online provider is a reasonable default.

Why Local Qwen3-TTS Is Worth the Extra Setup

mlx-audio runs Qwen3-TTS on Apple Silicon and already exposes an OpenAI-compatible TTS API out of the box. That means you can point OpenClaw — or any other client that speaks the OpenAI TTS API — straight at the local server without any adapter code.

Start the local TTS server:

python -m mlx_audio.server --host 0.0.0.0 --port 8000

Binding to 0.0.0.0 lets other machines on the same network reach the server, which is useful if OpenClaw runs on a different host than the Mac running mlx-audio. The OpenAI-compatible API is served under http://<host>:8000/v1, with speech requests going to /v1/audio/speech. Configure OpenClaw (or your agent) to use that base URL.

If you want voice cloning with better performance, the optional fork adds voice upload and caches the loaded voice prompt across requests. This works the same way as the vLLM Qwen3-TTS serving example: upload a reference audio file once, and every subsequent TTS request reuses the cached voice prompt instead of re-encoding it from scratch. The result is lower first-token latency for short replies.

To test the fork locally, start the server the same way, then upload a reference audio file under a name of your choice and send a speech request using that name as the voice.

Upload a reference audio file as a named voice (replace reference.wav with your actual file):

curl http://localhost:8000/v1/audio/voices \
  -F "audio_sample=@reference.wav" \
  -F "consent=positive" \
  -F "name=ai"

The server stores the file and caches the voice prompt the first time it is used. Subsequent requests with "voice": "ai" skip re-encoding and return faster.

Generate speech using the saved voice:

curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit",
    "input": "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you.",
    "voice": "ai"
  }' \
  --output speech-en.mp3
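
The same request can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; the function names are hypothetical, and it assumes the server is running on localhost:8000 and that a voice named "ai" has already been uploaded.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL = "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit"


def build_speech_request(text: str, voice: str = "ai",
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for the /audio/speech endpoint."""
    payload = {"model": MODEL, "input": text, "voice": voice}
    return urllib.request.Request(
        f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str, timeout: float = 120.0) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

For example, synthesize("Nice shot.", "reply.mp3") writes the spoken reply to reply.mp3 once the server is up.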

For short replies, response latency on M-series Apple Silicon is lower than with most online providers. Long replies take more time to generate, which is one reason to keep agent replies short.

Configuring OpenClaw to Use the Local TTS Server

OpenClaw’s TTS settings live under messages.tts in openclaw.json. To point it at the local mlx-audio server, use the openai provider and set baseUrl to http://localhost:8000/v1:

{
  "messages": {
    "tts": {
      "auto": "tagged",
      "provider": "openai",
      "providers": {
        "openai": {
          "apiKey": "ai",
          "baseUrl": "http://localhost:8000/v1",
          "model": "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit",
          "voice": "ai"
        }
      },
      "timeoutMs": 120000
    }
  }
}

Set auto to "tagged" to speak only replies that carry a TTS tag — useful for limiting which responses are read aloud during gameplay. Use the /voice command in Discord to toggle TTS on or off on the fly without editing the config file. Set auto to "always" if you want every reply spoken automatically.
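
If OpenClaw runs on a different machine than the mlx-audio server, only baseUrl needs to change. The sketch below generates the same messages.tts block for a given host. The function name is hypothetical and 192.168.1.20 is a placeholder address; merge the output into your own openclaw.json by hand rather than overwriting the file.

```python
import json

MODEL = "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit"


def tts_settings(host: str = "localhost", port: int = 8000,
                 auto: str = "tagged", voice: str = "ai") -> dict:
    """Build the messages.tts block for a given mlx-audio server address."""
    return {
        "messages": {
            "tts": {
                "auto": auto,  # "tagged" or "always", as described above
                "provider": "openai",
                "providers": {
                    "openai": {
                        "apiKey": "ai",
                        "baseUrl": f"http://{host}:{port}/v1",
                        "model": MODEL,
                        "voice": voice,
                    }
                },
                "timeoutMs": 120000,
            }
        }
    }


# Placeholder address for a Mac elsewhere on the local network.
print(json.dumps(tts_settings(host="192.168.1.20"), indent=2))
```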

Response Length Limits

Set a maximum reply length in the agent prompt. Two to four sentences is the right range for voice output during gameplay. Longer replies take more time to generate and more time to speak. By the time a ten-sentence reply finishes, the moment in the game has already passed.

If the agent tends to give longer answers, add an explicit instruction: Keep replies to three sentences or fewer. Shorter is better.
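
A prompt instruction is a request, not a guarantee, so a hard cap before the text reaches TTS can act as a safety net. This is a minimal sketch, assuming you have a place to post-process the reply text; the function is hypothetical and not part of OpenClaw, and the sentence splitting is deliberately naive (abbreviations such as "e.g." will be counted as sentence ends).

```python
import re


def limit_sentences(reply: str, max_sentences: int = 3) -> str:
    """Keep only the first max_sentences sentences of a reply."""
    # Split on whitespace that follows ., !, or ? and keep the punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", reply.strip())
    return " ".join(sentences[:max_sentences])
```

With the default of three, a five-sentence reply is cut after its third sentence and shorter replies pass through unchanged.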

Audio Routing While the Game Is Already Producing Sound

The macOS machine handles Discord, browser, and OpenClaw. The Windows machine runs the game. A 3.5mm audio cable connects the macOS headphone output to the Windows PC line-in port. This merges macOS audio — including Discord TTS and local TTS playback — with the game audio directly in hardware.

Windows sees the macOS audio as a line-in source. Use the Windows built-in volume mixer to balance the two: raise or lower the line-in level relative to the game audio until the companion voice is comfortably audible over the game sound. No virtual audio devices or complex routing software are required.

Foot Pedal Workflow

The deck pedal turns a three-step keyboard sequence into three single presses with the left foot.

Pedal Mapping

Left pedal: Activate macOS dictation (press the left Command key twice)
Middle pedal: Send the Discord message (Return)
Right pedal: Play the TTS audio reply (mouse left click via cliclick c:. wrapped in an Automator application)

Workflow

Press the left pedal to start dictation
Speak your message
Press the middle pedal to send it as a Discord message
Wait for the agent reply to appear
Hover the mouse over the Discord TTS play button, then press the right pedal to hear the reply

The right pedal runs cliclick c:., which clicks at the current cursor position without moving it. Position the cursor over the play button once and use the pedal for every subsequent click.
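
A minimal sketch of the right-pedal action as a script, assuming cliclick was installed with Homebrew. One thing to watch for: Automator's Run Shell Script action starts with a minimal PATH that does not include Homebrew's bin directory, so the sketch falls back to the usual Homebrew locations when a plain lookup fails. The function names are hypothetical.

```python
import os
import shutil
import subprocess
from typing import Optional

# Usual Homebrew install locations (Apple Silicon first, then Intel).
FALLBACK_PATHS = ("/opt/homebrew/bin/cliclick", "/usr/local/bin/cliclick")


def find_cliclick() -> Optional[str]:
    """Locate the cliclick binary, even when PATH is minimal."""
    found = shutil.which("cliclick")
    if found:
        return found
    for path in FALLBACK_PATHS:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


def click_command(binary: str) -> list[str]:
    """Build the command that left-clicks at the current cursor position."""
    return [binary, "c:."]


def click_at_cursor() -> None:
    """Click where the cursor already is, without moving it."""
    binary = find_cliclick()
    if binary is None:
        raise RuntimeError("cliclick not found; install it with: brew install cliclick")
    subprocess.run(click_command(binary), check=True)
```

Because the click lands wherever the cursor is, macOS may also ask you to grant Accessibility permission to whichever app runs the script before the click is delivered.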