M5Stick S3 · Volume 5

M5Stack M5StickS3 Volume 5 — Audio Subsystem Deep Dive (the standout feature)

ES8311 codec + MEMS mic + AW8737 amp + 8 Ω 1 W speaker — voice recording, FFT, wake-word, walkie-talkie, internet radio

Contents

Section | Topic
1 | About this volume — why audio matters
2 | The audio chain (block diagram + components)
3 | ES8311 codec — control + sample-rate matrix
4 | Voice recording workflow
5 | Real-time audio FFT visualization
6 | Wake-word detection with esp-skainet
7 | ESP-NOW walkie-talkie protocol
8 | Internet radio receiver
9 | Audio FX / DAW prototyping
10 | The “covert audio recorder” use case (with legal caveat)
11 | Audio power profile
12 | Resources

1. About this volume — why audio matters

Vol 5 is the audio subsystem deep dive — the unique-value-proposition volume for the M5StickS3. The audio chain is the single feature that differentiates the M5StickS3 from every other stick in M5Stack’s lineup and from most pentest handhelds generally:

  • M5StickC Plus 2 (classic ESP32): passive buzzer only — can produce tones, not voice
  • Cardputer ADV (ESP32-S3): has the same ES8311 + speaker chain — but the larger form factor sacrifices the wearable/covert use case
  • AWOK Dual Touch V3: no audio subsystem
  • Game Over: no audio subsystem
  • Flipper Zero: passive buzzer (PWM tones) — no microphone, no speaker for voice

The M5StickS3 is the only wearable-form-factor device in tjscientist’s lineup with voice-quality recording and playback. That’s the operational niche this volume covers.

What the audio chain enables (with sections in this volume):

  1. Voice recording (§ 4) — MEMS mic → ES8311 codec → 8 MB flash or PSRAM. Audible-quality voice memos.
  2. Real-time audio FFT (§ 5) — visualization of frequency content.
  3. Wake-word detection (§ 6) — voice-activated features via esp-skainet’s WakeNet (wake words) and MultiNet5 (speech commands).
  4. ESP-NOW walkie-talkie (§ 7) — push-to-talk between two M5StickS3s.
  5. Internet radio receiver (§ 8) — Shoutcast / Icecast / direct MP3 URLs.
  6. Audio FX / DAW prototyping (§ 9) — limited but possible.
  7. Covert audio recording (§ 10) — operationally hazardous; legal caveats apply.

Power profile under sustained audio (§ 11) is the load-bearing operational constraint — the 250 mAh battery + 1 W speaker is a tight envelope.


2. The audio chain (block diagram + components)

           ┌──────────────┐
           │  MEMS mic    │ 65 dB SNR omnidirectional
           │              │ (bottom-firing typical)
           └──────┬───────┘
                  │ analog
                  ▼
           ┌──────────────┐         ┌────────────────────┐
           │   ES8311     │ ←────── │   ESP32-S3 LX7     │
           │   24-bit I²S │   I²C   │   I²S peripheral   │
           │   codec      │ (0x18)  │   FreeRTOS audio   │
           │              │ ──────→ │   task             │
           └──────┬───────┘   I²S   └────────────────────┘
                  │ analog out
                  ▼
           ┌──────────────┐
           │   AW8737     │ Class-D amp, 1 W into 8 Ω
           │   PD pin     │ (firmware can mute via PD)
           └──────┬───────┘
                  │ amplified
                  ▼
           ┌──────────────┐
           │   8 Ω 1 W    │ Small mylar cone driver
           │   speaker    │ Audible in quiet rooms
           └──────────────┘

Components:

Component | Part | Function | Notes
MEMS microphone | Generic 65 dB SNR | Voice capture, ambient audio recording | Specific part not vendor-documented; likely Knowles / InvenSense class
ES8311 codec | Everest Semiconductor ES8311 | I²S 24-bit codec + I²C config (0x18). Sample rates 8 kHz - 96 kHz | Same chip as Cardputer ADV; M5Unified handles both transparently
Audio amplifier | Awinic AW8737 | Class-D, 1 W into 8 Ω, lower noise floor than budget class-D | Has PD (power-down) pin for firmware-controlled mute
Speaker | 8 Ω 1 W mylar cone | Audible in quiet rooms (~60 dB SPL @ 30 cm), not loud enough for noisy environments | Small driver; bass response limited
(Future) 3.5 mm jack via Hat2 | TBD | Headphone output | Not on M5StickS3 base unit

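Optional: a future Hat2 audio-jack accessory would route ES8311 output to a 3.5 mm TRRS jack. Not currently available. If discreet audio is needed, the alternative is an external Bluetooth speaker, which uses a different audio path: the M5StickS3 streams over its Bluetooth radio instead of driving the on-board amp and speaker (less covert, but a lower visible footprint on the stick itself). Note that the ESP32-S3 radio is Bluetooth LE only, with no Bluetooth Classic and therefore no A2DP, so this path depends on firmware and a speaker that both support an LE-based audio transport. Verify that before relying on it.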


3. ES8311 codec — control + sample-rate matrix

The ES8311 is configured via I²C (address 0x18) and streams audio via I²S.

Sample rate / bit depth matrix:

Sample rate (kHz) | Bit depth | Bytes/sec (mono) | Typical use case | M5Unified default?
8 | 16-bit | 16 KB/s | Telephony-quality voice; ESP-NOW walkie-talkie (μ-law) | Sometimes
16 | 16-bit | 32 KB/s | Voice memos (good intelligibility, compact) | Yes (default for M5.Mic)
22.05 | 16-bit | 44 KB/s | “Voice quality” plus some music headroom | Optional
44.1 | 16-bit | 88 KB/s | CD-rate audio (mono; the ES8311 is a mono codec) | Optional
48 | 16-bit | 96 KB/s | DVD-rate audio (mono) | Optional
48 | 24-bit | 144 KB/s | Studio-quality | Power-heavy
96 | 24-bit | 288 KB/s | High-resolution studio | Max ceiling; rarely used

Choosing the rate:

  • Voice memo recording → 16 kHz / 16-bit mono. Intelligible voice, ~32 KB/sec storage. As raw ceilings: 8 MB flash → ~4 minutes; 8 MB PSRAM → another ~4 minutes; ~16 MB combined → ~8 minutes recording. Usable space is lower in practice (see § 4).
  • ESP-NOW walkie-talkie → 8 kHz mono, 16-bit PCM companded to 8-bit μ-law. Halves the payload, which suits ESP-NOW's small frames.
  • Audio FFT → 16 kHz / 16-bit mono. Adequate for 16-band visualization.
  • Internet radio receiver → match the source stream rate (typically 44.1 or 48 kHz). Audio decoded by libhelix-mp3 or similar.

M5Unified API:

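#include <M5Unified.h>

static constexpr uint32_t SAMPLE_RATE = 16000;   // M5.Mic default rate
static int16_t buf[256];   // static: record/play are asynchronous, so the buffer must outlive the calls

void setup() {
    auto cfg = M5.config();
    M5.begin(cfg);                // M5Unified handles ES8311 + AW8737 init

    // Play tone
    M5.Speaker.tone(440, 100);    // 440 Hz for 100 ms
    delay(150);
}

void loop() {
    M5.update();

    // M5Unified's own examples stop the speaker before starting the mic,
    // and vice versa; this sketch follows that pattern.
    M5.Speaker.end();
    M5.Mic.begin();

    // Record buffer: record() queues the capture, so wait for it to complete
    M5.Mic.record(buf, 256, SAMPLE_RATE);    // 16-bit mono samples
    while (M5.Mic.isRecording()) delay(1);
    M5.Mic.end();

    // Play raw audio buffer. Pass the rate explicitly: playRaw() defaults to 44.1 kHz
    M5.Speaker.begin();
    M5.Speaker.playRaw(buf, 256, SAMPLE_RATE);
    while (M5.Speaker.isPlaying()) delay(1);
}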

Lower-level API via ESP-IDF esp_codec_dev:

For applications that need finer control (e.g., simultaneously stream mic input + speaker output for a walkie-talkie pattern), use esp_codec_dev directly:

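#include "esp_codec_dev.h"
#include "esp_codec_dev_defaults.h"

const audio_codec_data_if_t *data_if;
const audio_codec_ctrl_if_t *ctrl_if;
esp_codec_dev_handle_t codec_dev;

// Init codec (arguments elided here; they depend on the board's I²S / I²C wiring)
data_if = audio_codec_new_i2s_data(...);
ctrl_if = audio_codec_new_i2c_ctrl(...);
codec_dev = esp_codec_dev_new(...);

esp_codec_dev_set_in_gain(codec_dev, 30.0);  // 30 dB mic gain
esp_codec_dev_set_out_vol(codec_dev, 70);    // 70% volume

// Stream format: 16 kHz, 16-bit, mono
esp_codec_dev_sample_info_t fs_in = {
    .bits_per_sample = 16,
    .channel = 1,
    .sample_rate = 16000,
};
int16_t mic_buf[256];
int16_t speaker_buf[256];

// Open for input + output
esp_codec_dev_open(codec_dev, &fs_in);
esp_codec_dev_read(codec_dev, mic_buf, sizeof(mic_buf));
esp_codec_dev_write(codec_dev, speaker_buf, sizeof(speaker_buf));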

This is more verbose but enables full-duplex audio for walkie-talkie patterns.


4. Voice recording workflow

End-to-end recipe for recording voice to flash:

Step-by-step

  1. Initialize codec at desired sample rate (16 kHz mono is standard).
  2. Allocate audio buffer in PSRAM (8 MB available — plenty).
  3. Start I²S input task — ES8311 streams samples into circular buffer.
  4. Main loop drains buffer — writes to flash, SD-via-Hat2, or RAM.
  5. Stop button → flush buffer, close file, return to menu.
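Steps 3 and 4 describe a producer/consumer hand-off through a circular buffer. The following is a minimal sketch of such a buffer, not the structure any particular firmware uses; the class name SampleRing is hypothetical, and as written it is not thread-safe, so a real I²S task would need a lock or a FreeRTOS stream buffer around it:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-capacity ring buffer of 16-bit samples (single producer, single consumer).
class SampleRing {
public:
    explicit SampleRing(size_t capacity)
        : buf_(capacity), head_(0), tail_(0), count_(0) {}

    // Producer side: copies up to n samples in; returns how many were accepted.
    size_t push(const int16_t *src, size_t n) {
        size_t written = 0;
        while (written < n && count_ < buf_.size()) {
            buf_[head_] = src[written++];
            head_ = (head_ + 1) % buf_.size();
            ++count_;
        }
        return written;
    }

    // Consumer side: copies up to n samples out; returns how many were read.
    size_t pop(int16_t *dst, size_t n) {
        size_t read = 0;
        while (read < n && count_ > 0) {
            dst[read++] = buf_[tail_];
            tail_ = (tail_ + 1) % buf_.size();
            --count_;
        }
        return read;
    }

    size_t size() const { return count_; }

private:
    std::vector<int16_t> buf_;
    size_t head_, tail_, count_;
};
```

A short push() return value tells the producer that the consumer is falling behind, which is the point at which samples would otherwise be dropped silently.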

Code skeleton

#include <M5Unified.h>
#include <SPIFFS.h>

const uint32_t SAMPLE_RATE = 16000;
const size_t BUFFER_SIZE = 256;    // samples per capture chunk (16 ms at 16 kHz)
// 4 MB (~2 minutes at 16 kHz mono) fits both the usable PSRAM and the flash filesystem
const size_t MAX_RECORDING_BYTES = 4 * 1024 * 1024;

int16_t *audio_buffer = nullptr;
size_t buffer_offset = 0;          // bytes captured so far
bool recording = false;
File output_file;

void setup() {
    auto cfg = M5.config();
    M5.begin(cfg);
    if (!SPIFFS.begin(true)) {     // true = format the partition if mounting fails
        M5.Display.println("SPIFFS mount failed!");
        while (1) delay(100);
    }

    // Allocate large recording buffer in PSRAM
    audio_buffer = (int16_t *)ps_malloc(MAX_RECORDING_BYTES);
    if (!audio_buffer) {
        M5.Display.println("PSRAM alloc failed!");
        while (1) delay(100);
    }

    M5.Speaker.end();              // M5Unified's examples release the speaker before using the mic
    M5.Mic.begin();
    M5.Display.println("Hold A to record");
}

void loop() {
    M5.update();

    if (M5.BtnA.wasPressed() && !recording) {
        // Start recording
        recording = true;
        buffer_offset = 0;
        output_file = SPIFFS.open("/voice_memo.wav", "w");
        write_wav_header(output_file, SAMPLE_RATE);  // helper function that writes the RIFF header
        M5.Display.println("Recording...");
    }

    if (recording) {
        size_t chunk_bytes = BUFFER_SIZE * sizeof(int16_t);
        if (buffer_offset + chunk_bytes <= MAX_RECORDING_BYTES) {
            // record() is asynchronous: it queues a capture straight into PSRAM.
            // Advance only when the request was accepted. No delay() here, or samples are lost.
            int16_t *dst = audio_buffer + (buffer_offset / sizeof(int16_t));
            if (M5.Mic.record(dst, BUFFER_SIZE, SAMPLE_RATE)) {
                buffer_offset += chunk_bytes;
            }
        }
    }

    if (M5.BtnA.wasReleased() && recording) {
        // Stop recording
        recording = false;
        while (M5.Mic.isRecording()) delay(1);       // let queued captures finish
        output_file.write((uint8_t *)audio_buffer, buffer_offset);
        finalize_wav_header(output_file, buffer_offset);
        output_file.close();
        M5.Display.printf("Saved %u bytes\n", (unsigned)buffer_offset);
    }
}

Storage paths:

  • SPIFFS / LittleFS (on-chip flash): ~5 MB usable after partitions. Limit ~3 minutes at 16 kHz mono 16-bit.
  • PSRAM (in-RAM buffer): ~7 MB usable. Limit ~3.5 minutes. Lost on reboot — for short captures only.
  • Hat2 microSD (if accessory present): essentially unlimited (limited by card capacity). Best for sustained recording.

File format: WAV (RIFF) is the standard. Helper function write_wav_header() writes a 44-byte RIFF header at the start of the file; finalize_wav_header() patches the byte-count field at end-of-recording.
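The 44-byte header mentioned above has a fixed layout for PCM audio. The following is one possible way to build it, written as plain C++ so it can be checked off-device; make_wav_header and put_le are hypothetical names, not the helpers the recorder sketch actually ships with:

```cpp
#include <array>
#include <cstdint>

// Writes value as little-endian into dst, using the given number of bytes.
static void put_le(uint8_t *dst, uint32_t value, int bytes) {
    for (int i = 0; i < bytes; ++i) {
        dst[i] = static_cast<uint8_t>(value >> (8 * i));
    }
}

// Builds the canonical 44-byte RIFF/WAVE header for uncompressed PCM.
std::array<uint8_t, 44> make_wav_header(uint32_t sample_rate, uint16_t channels,
                                        uint16_t bits_per_sample, uint32_t data_bytes) {
    std::array<uint8_t, 44> h{};
    const uint16_t block_align = static_cast<uint16_t>(channels * (bits_per_sample / 8));
    const uint32_t byte_rate = sample_rate * block_align;
    const char riff[] = "RIFF", wave[] = "WAVE", fmt[] = "fmt ", data[] = "data";
    for (int i = 0; i < 4; ++i) {
        h[i] = riff[i];
        h[8 + i] = wave[i];
        h[12 + i] = fmt[i];
        h[36 + i] = data[i];
    }
    put_le(&h[4], 36 + data_bytes, 4);   // RIFF chunk size = file size minus 8
    put_le(&h[16], 16, 4);               // fmt chunk size for PCM
    put_le(&h[20], 1, 2);                // audio format 1 = PCM
    put_le(&h[22], channels, 2);
    put_le(&h[24], sample_rate, 4);
    put_le(&h[28], byte_rate, 4);
    put_le(&h[32], block_align, 2);
    put_le(&h[34], bits_per_sample, 2);
    put_le(&h[40], data_bytes, 4);       // data chunk size
    return h;
}
```

Two fields depend on the final length: the RIFF size at offset 4 and the data size at offset 40. A finalize step therefore needs to seek back and patch both, or simply rewrite the whole header with the final byte count.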


5. Real-time audio FFT visualization

The LX7 dual-core handles 16-band FFT at 16 kHz comfortably (<10% CPU). Pattern:

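#include <M5Unified.h>
#include <arduinoFFT.h>   // arduinoFFT 1.x API; the 2.x releases rename the class and its methods

const int SAMPLES = 256;
const int SAMPLING_FREQUENCY = 16000;

double vReal[SAMPLES];
double vImag[SAMPLES];
arduinoFFT FFT = arduinoFFT(vReal, vImag, SAMPLES, SAMPLING_FREQUENCY);

void setup() {
    M5.begin(M5.config());
    M5.Speaker.end();     // M5Unified's examples release the speaker before using the mic
    M5.Mic.begin();
}

void loop() {
    M5.update();

    // Read SAMPLES samples (record() is asynchronous, so wait for completion)
    static int16_t buf[SAMPLES];
    M5.Mic.record(buf, SAMPLES, SAMPLING_FREQUENCY);
    while (M5.Mic.isRecording()) delay(1);

    // Copy to FFT buffers
    for (int i = 0; i < SAMPLES; i++) {
        vReal[i] = (double)buf[i];
        vImag[i] = 0.0;
    }

    // Run FFT
    FFT.Windowing(FFT_WIN_TYP_HAMMING, FFT_FORWARD);
    FFT.Compute(FFT_FORWARD);
    FFT.ComplexToMagnitude();

    // Render 16-band bar graph
    M5.Display.fillScreen(BLACK);
    const int screen_w = M5.Display.width();
    const int screen_h = M5.Display.height();
    int bin_width = SAMPLES / 32;  // 8 bins per band: 16 bands cover the first half of the FFT (up to Nyquist)
    int bar_width = screen_w / 16;
    for (int i = 0; i < 16; i++) {
        double magnitude = 0;
        for (int j = 0; j < bin_width; j++) {
            magnitude += vReal[i * bin_width + j];
        }
        magnitude /= bin_width;
        int bar_height = (int)(magnitude * 0.05);   // empirical display scaling
        if (bar_height > screen_h) bar_height = screen_h;
        M5.Display.fillRect(i * bar_width, screen_h - bar_height,
                            bar_width - 1, bar_height, RED);
    }

    delay(50);  // ~20 fps update
}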

The m5Cardputer_audiospectrum community port works on M5StickS3 with portrait-orientation re-flow. 16-band bar graph fills the screen cleanly.
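To label the bars, it helps to know which frequencies each of the 16 bands covers. A small sketch of that mapping, assuming the first N/2 FFT bins are split evenly across the bands as in the loop above; band_range and BandRange are illustrative names:

```cpp
// Frequency span covered by one display band.
struct BandRange {
    float low_hz;
    float high_hz;
};

// Span of band index `band` when the bins below Nyquist are divided
// evenly across num_bands bars.
BandRange band_range(int band, int num_bands, int fft_size, float sample_rate_hz) {
    const float bin_hz = sample_rate_hz / fft_size;         // width of one FFT bin
    const int bins_per_band = (fft_size / 2) / num_bands;   // only bins below Nyquist
    return { band * bins_per_band * bin_hz,
             (band + 1) * bins_per_band * bin_hz };
}
```

With 256 samples at 16 kHz each bin is 62.5 Hz wide, so every band spans 500 Hz and the last band ends at the 8 kHz Nyquist limit. Band 0 includes the DC bin, so a microphone offset shows up as a permanently raised first bar unless it is removed beforehand.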

Use cases:

  • Audio spectrum visualization (party trick)
  • Frequency analysis for tone detection
  • Educational tool for understanding audio frequency content

6. Wake-word detection with esp-skainet

Espressif’s esp-skainet framework, built on the esp-sr library, provides on-device wake-word detection and speech-command recognition:

  • MultiNet5 model: ~800 KB. Recognizes ~40 English speech commands (wake words are handled separately by WakeNet).
  • WakeNet8 / 9 models: smaller, wake-word only.

The M5StickS3’s 8 MB PSRAM is plenty for these models: they are stored in flash and loaded into PSRAM at boot, leaving internal SRAM free for application code.

Workflow:

   MEMS mic ─→ ES8311 ─→ I²S ─→ esp-skainet wake-word detector
                                        │
                                        ▼
                          Detects "Hi Jarvis" or similar
                                        │
                                        ▼
                          Trigger application action
                          (e.g., "scan Wi-Fi")

The dual-core LX7 runs the wake-word detector at <5% CPU, so detection can run continuously without materially impacting other tasks or battery life.

Integration:

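// Illustrative pseudocode. Treat "esp_skainet.h" and the esp_skainet_* identifiers
// below as placeholders for the esp-sr AFE / WakeNet / MultiNet calls that
// esp-skainet applications use; check the esp-sr documentation for the real API.
#include <M5Unified.h>
#include "esp_skainet.h"

esp_skainet_handle_t skainet;

void setup() {
    M5.begin(M5.config());
    M5.Mic.begin();

    esp_skainet_config_t cfg = ESP_SKAINET_DEFAULT_CONFIG();
    cfg.wake_word = "hi jarvis";    // Use built-in wake words list
    skainet = esp_skainet_create(&cfg);
}

void loop() {
    M5.update();

    // Capture one chunk; record() is asynchronous, so wait for it to complete
    static int16_t audio_chunk[ESP_SKAINET_AUDIO_CHUNK_SIZE];
    M5.Mic.record(audio_chunk, ESP_SKAINET_AUDIO_CHUNK_SIZE);
    while (M5.Mic.isRecording()) delay(1);

    int command_id = esp_skainet_detect(skainet, audio_chunk);
    if (command_id == ESP_SKAINET_WAKE_WORD_DETECTED) {
        M5.Display.println("Hi Jarvis detected!");
        // Trigger app action
    } else if (command_id >= 0) {
        M5.Display.printf("Command: %d\n", command_id);
        // Execute the recognized command
    }
}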

Available wake words (provided by the WakeNet models, not by the MultiNet5 vocabulary): “Hi Lexin”, “Hi ESP”, “Alexa”, “Hi Jarvis”, etc. The exact set depends on the esp-sr release. Custom wake-word training is possible via Espressif’s ESP-Skainet training pipeline (offline process, requires audio dataset).

Common command vocabulary: ~40 English words covering basic device control — “turn on”, “turn off”, “stop”, “play”, “volume up”, “volume down”, “next”, “previous”, etc. Useful for hands-free menu navigation.

Use cases for M5StickS3:

  • “Hi Jarvis, scan Wi-Fi” → triggers Marauder / Bruce / Evil-M5 scan
  • “Hi Jarvis, record” → starts voice memo recorder
  • “Hi Jarvis, time” → device speaks the current time (M5.Speaker.tone(...) only produces tones, so spoken output needs an external TTS library or pre-recorded clips)
  • Voice navigation of NEMO/Bruce menus when buttons are inconvenient (e.g., during a presentation)

7. ESP-NOW walkie-talkie protocol

ESP-NOW is Espressif’s broadcast-style Wi-Fi raw-frames protocol — no AP, no router, no protocol stack overhead beyond raw 802.11 frames. Two M5StickS3s on the same channel + same broadcast MAC = a real walkie-talkie.

Protocol overview

Frame format:

ESP-NOW frame:
  Channel: User-set (1-13)
  Source MAC: Sender's MAC
  Dest MAC: Broadcast FF:FF:FF:FF:FF:FF
  Payload: Up to 250 bytes per frame
    - Sequence number (1 byte)
    - Audio data (μ-law encoded, ~249 bytes per frame)
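The 1-byte sequence number in the payload lets a receiver notice dropped frames, since ESP-NOW broadcast is not acknowledged. A minimal sketch of that check; the function name frames_lost is hypothetical and not taken from any existing firmware:

```cpp
#include <cstdint>

// Number of frames missed between two consecutively received sequence numbers.
// The counter wraps at 256, so the subtraction is reduced modulo 256.
uint8_t frames_lost(uint8_t previous_seq, uint8_t current_seq) {
    return static_cast<uint8_t>(current_seq - previous_seq - 1);
}
```

A receiver could use a non-zero result to insert the same number of silent chunks, which keeps playback timing steady at the cost of brief dropouts.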

Audio encoding:

  • Sample rate: 8 kHz mono (telephony quality)
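  • Bit depth: 16-bit PCM → companded to 8-bit μ-law (2:1 compression)
  • 8,000 samples/sec × 8 bits = 64 kbps = 8 KB/sec of audio data → one 250-byte ESP-NOW frame (1 sequence byte + 249 audio bytes) roughly every 31 ms, about 32 frames/sec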
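The μ-law step maps each 16-bit sample to one byte with logarithmic step sizes. Below is a sketch of the widely used G.711-style encode/decode pair; it is offered as one reasonable implementation of helpers named mulaw_encode and mulaw_decode, not necessarily the one a given firmware uses:

```cpp
#include <cstdint>

// Compresses one 16-bit PCM sample to 8-bit mu-law.
uint8_t mulaw_encode(int16_t pcm) {
    const int BIAS = 0x84;    // 132, added so every segment starts on a power of two
    const int CLIP = 32635;   // largest magnitude that survives the bias
    int sample = pcm;
    const int sign = (sample < 0) ? 0x80 : 0x00;
    if (sample < 0) sample = -sample;
    if (sample > CLIP) sample = CLIP;
    sample += BIAS;

    // Find the segment (exponent): position of the highest set bit, 0..7.
    int exponent = 7;
    for (int mask = 0x4000; (sample & mask) == 0 && exponent > 0; mask >>= 1) {
        --exponent;
    }
    const int mantissa = (sample >> (exponent + 3)) & 0x0F;
    return static_cast<uint8_t>(~(sign | (exponent << 4) | mantissa));
}

// Expands one 8-bit mu-law byte back to 16-bit PCM.
int16_t mulaw_decode(uint8_t code) {
    code = static_cast<uint8_t>(~code);
    const int sign = code & 0x80;
    const int exponent = (code >> 4) & 0x07;
    const int mantissa = code & 0x0F;
    int sample = (((mantissa << 3) + 0x84) << exponent) - 0x84;
    return static_cast<int16_t>(sign ? -sample : sample);
}
```

The round trip is lossy by design: a sample of 1000 comes back as 988, because quantization steps grow with amplitude. Silence encodes to 0xFF.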

Push-to-talk workflow:

  1. Button A pressed: M5StickS3 enters TX mode
    • MEMS mic → ES8311 captures audio at 8 kHz
    • μ-law encodes each sample
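    • Buffers 249-sample chunks (250 bytes including the sequence number)
    • Each chunk transmitted as one ESP-NOW frame (each frame carries ~31 ms of audio; end-to-end latency adds capture, radio and playback buffering on top)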
  2. Button A released: M5StickS3 enters RX mode
    • Listens for ESP-NOW frames
    • μ-law decodes incoming audio
    • Plays to speaker via ES8311 + AW8737

Pair / multi-cast: ESP-NOW frames are broadcast, so every M5StickS3 on the same channel receives them. Up to ~10 devices in radio range can hear the transmission.

Range

Environment | Range (typical)
Indoor, walls between devices | ~30-50 m
Indoor, line-of-sight | ~100 m
Outdoor, line-of-sight (open field) | ~150-200 m
Outdoor, with obstacles | ~50-100 m

(Standard Wi-Fi range at 2.4 GHz, +20 dBm TX power.)

Audio quality

8 kHz / 8-bit μ-law mono sounds like a 1990s walkie-talkie: voice intelligible, no hi-fi, slight compression artifacts on plosives. Not for music; perfectly fine for tactical comms.

Implementation

esp-now-talkie is the community firmware. Cross-compatible with Cardputer ADV (same codec) — M5StickS3 ↔ Cardputer ADV walkie-talkie pairs work.

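#include <M5Unified.h>
#include <esp_now.h>
#include <WiFi.h>

static constexpr uint32_t RATE = 8000;
static constexpr size_t CHUNK = 249;   // audio bytes per frame (+1 sequence byte = 250)
static const uint8_t BROADCAST_MAC[6] = {0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};

// mulaw_encode() / mulaw_decode() are helper functions defined elsewhere in the sketch.

// Arduino-ESP32 2.x callback signature; 3.x passes a const esp_now_recv_info_t * first instead.
void onDataRecv(const uint8_t *mac, const uint8_t *data, int len) {
    // playRaw() is asynchronous and does not copy, so decode into static
    // buffers that outlive the callback (rotating between three).
    static int16_t pcm[3][CHUNK];
    static uint8_t slot = 0;
    if (len < 2 || len > (int)CHUNK + 1) return;
    const int n = len - 1;             // data[0] is the sequence number
    for (int i = 0; i < n; i++) {
        pcm[slot][i] = mulaw_decode(data[i + 1]);
    }
    // Fixed channel 0 so consecutive chunks queue up instead of mixing
    M5.Speaker.playRaw(pcm[slot], n, RATE, false, 1, 0);
    slot = (slot + 1) % 3;
}

void setup() {
    M5.begin(M5.config());
    WiFi.mode(WIFI_STA);
    if (esp_now_init() != ESP_OK) {
        M5.Display.println("ESP-NOW init failed");
        return;
    }
    esp_now_peer_info_t peer = {};
    memcpy(peer.peer_addr, BROADCAST_MAC, 6);   // broadcast must be registered as a peer before sending
    esp_now_add_peer(&peer);
    esp_now_register_recv_cb(onDataRecv);
}

void loop() {
    static int16_t pcm[CHUNK];
    static uint8_t frame[CHUNK + 1];
    static uint8_t seq = 0;

    M5.update();
    if (M5.BtnA.wasPressed())  { M5.Speaker.end(); M5.Mic.begin(); }   // TX: mic takes over the audio path
    if (M5.BtnA.wasReleased()) { M5.Mic.end(); M5.Speaker.begin(); }   // RX: hand back to the speaker

    if (M5.BtnA.isPressed()) {
        // Push-to-talk: record + broadcast
        M5.Mic.record(pcm, CHUNK, RATE);
        while (M5.Mic.isRecording()) delay(1);  // one chunk is ~31 ms of audio

        frame[0] = seq++;
        for (size_t i = 0; i < CHUNK; i++) {
            frame[i + 1] = mulaw_encode(pcm[i]);
        }
        esp_now_send(BROADCAST_MAC, frame, sizeof(frame));   // ~32 frames/sec = 64 kbps audio
    }
}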

Operational use

  • Tactical comms: small group on same channel, simple push-to-talk
  • Conference / event coordination: M5StickS3s on lanyards
  • Multi-device demo synchronization: trigger demos across multiple M5StickS3s

Privacy: traffic is unencrypted by default. For sensitive comms, add software-layer AES (pre-shared key).


8. Internet radio receiver

Workflow:

  1. Wi-Fi station connect to home network.
  2. Open HTTP/HTTPS stream from Shoutcast / Icecast / direct MP3 URL.
  3. Stream → libhelix-mp3 decoder → ES8311 → speaker.
  4. Display station name + track metadata (Shoutcast metadata embedded in stream).
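For step 4, Shoutcast-style servers interleave metadata blocks with the audio when the client requests them, and the track title arrives in a field of the form StreamTitle='...';. A minimal sketch of extracting that field from one already-isolated metadata block; the function name icy_stream_title is hypothetical, and locating the block within the stream (via the server's advertised metadata interval) is assumed to happen elsewhere:

```cpp
#include <cstddef>
#include <string>

// Extracts the StreamTitle value from one ICY metadata block, for example
// "StreamTitle='Artist - Song';StreamUrl='';". Returns "" if the field is absent.
std::string icy_stream_title(const std::string &meta) {
    const std::string key = "StreamTitle='";
    const std::size_t start = meta.find(key);
    if (start == std::string::npos) return "";
    const std::size_t value_start = start + key.size();
    // The value ends at the quote-semicolon pair, so apostrophes inside titles survive.
    const std::size_t end = meta.find("';", value_start);
    if (end == std::string::npos) return "";
    return meta.substr(value_start, end - value_start);
}
```

Searching for the closing "';" pair rather than a bare quote is a deliberate choice, since track titles frequently contain apostrophes.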

Bandwidth: 128 kbps MP3 stream is typical. M5StickS3’s 2.4 GHz Wi-Fi handles this easily.

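Audio quality: through the 8 Ω 1 W speaker, acceptable for casual listening in quiet rooms. An external Bluetooth speaker would sound better, but the ESP32-S3 radio is Bluetooth LE only (no Bluetooth Classic, so no A2DP); streaming to one requires firmware and a speaker that both support an LE-based audio transport.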

Community firmware: RHesus-RAdio (DrRhesus) is the canonical implementation. Originally for Cardputer ADV; M5StickS3 port likely requires minor UI re-flow.

Architecture:

[Wi-Fi station]

       ↓ HTTPS GET on stream URL
[ESP32-S3 HTTP client]

       ↓ MP3 frame buffer (in PSRAM — 64 KB typical)
[libhelix-mp3 decoder]

       ↓ Decoded PCM samples (44.1 kHz 16-bit stereo)
[Audio mixer (downmix to mono if needed)]

       ↓ I²S
[ES8311 codec]

       ↓ Analog
[AW8737 amp]

       ↓ Amplified
[Speaker (or 3.5 mm jack if Hat2 accessory present)]
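The downmix stage in the diagram can be as simple as averaging the two channels. A minimal sketch, assuming interleaved left/right 16-bit samples as MP3 decoders commonly produce; downmix_to_mono is an illustrative name:

```cpp
#include <cstddef>
#include <cstdint>

// Averages interleaved L/R 16-bit samples into mono. `frames` is the number
// of stereo frames, so `stereo` holds 2 * frames samples.
void downmix_to_mono(const int16_t *stereo, int16_t *mono, std::size_t frames) {
    for (std::size_t i = 0; i < frames; ++i) {
        // Sum in 32 bits so two full-scale samples cannot overflow.
        const int32_t sum = static_cast<int32_t>(stereo[2 * i]) + stereo[2 * i + 1];
        mono[i] = static_cast<int16_t>(sum / 2);
    }
}
```

Averaging rather than summing keeps the result inside the 16-bit range without a separate clipping step.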

Stream sources:

  • Public Shoutcast / Icecast directories: shoutcast.com, internet-radio.com, radio-browser.info
  • Specific public streams: NPR, BBC, college radio, themed playlists
  • Self-hosted: any HTTP-served MP3/AAC file or Icecast server

The M5StickS3 becomes a tiny “kitchen radio” / “shower radio” / “office music” device — battery-powered (250 mAh → ~1-2 hours playback) or USB-tethered for unlimited duration.


9. Audio FX / DAW prototyping

The LX7 dual-core handles modest real-time audio processing. Possible (but limited):

Effect | Feasibility | Notes
Low-pass / high-pass filter | ✓ trivial | Simple IIR filter on incoming samples
Echo / delay | ✓ easy | Circular buffer + mix
Reverb (basic) | ✓ moderate | Multi-tap delay with feedback
Pitch shift | ⚠ heavy | CPU-bound; expect 50-70% CPU at 16 kHz
Vocoder | ⚠ very heavy | Multi-band filter bank; pushes CPU limits
Real-time mixing 4+ tracks | ⚠ borderline | PSRAM helps with buffers
Hardware-class synth | ✗ no | LX7 lacks DSP-specific instructions; dedicated DSP chips outperform
Studio-class DAW | ✗ no | Cardputer ADV with QWERTY is better for any text-driven workflow
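The two easiest rows of the table can be sketched in a few lines each. The following is an illustrative example in portable C++, not code from any existing M5StickS3 firmware; the struct names OnePoleLowPass and Echo are assumptions, and the low-pass is a basic one-pole IIR design:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// One-pole IIR low-pass: y[n] = y[n-1] + a * (x[n] - y[n-1]).
struct OnePoleLowPass {
    float a;         // smoothing coefficient in (0, 1]
    float y = 0.0f;  // previous output

    OnePoleLowPass(float cutoff_hz, float sample_rate_hz)
        : a(1.0f - std::exp(-2.0f * 3.14159265f * cutoff_hz / sample_rate_hz)) {}

    int16_t process(int16_t x) {
        y += a * (static_cast<float>(x) - y);
        return static_cast<int16_t>(std::lround(y));
    }
};

// Single-tap echo: output = input + mix * input delayed by delay_samples.
struct Echo {
    std::vector<int16_t> line;  // circular delay line
    std::size_t pos = 0;
    float mix;

    Echo(std::size_t delay_samples, float mix_level)
        : line(delay_samples, 0), mix(mix_level) {}

    int16_t process(int16_t x) {
        const int16_t delayed = line[pos];
        line[pos] = x;
        pos = (pos + 1) % line.size();
        float out = static_cast<float>(x) + mix * static_cast<float>(delayed);
        if (out > 32767.0f) out = 32767.0f;    // clip instead of wrapping
        if (out < -32768.0f) out = -32768.0f;
        return static_cast<int16_t>(out);
    }
};
```

At 16 kHz, a 250 ms echo needs a 4000-sample delay line (8 KB), which is why the table rates it as easy; reverb multiplies that by the number of taps.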

For real audio prototyping: use a Cardputer ADV (same codec + better UI), or a dedicated platform (Teensy 4.x + Audio Library, Bela board, modular synth modules).

The M5StickS3’s audio chain is good for proof-of-concept and niche audio recording — not for serious music production.


10. The “covert audio recorder” use case (with legal caveat)

The M5StickS3’s wearable form factor + magnetic back + 20 g weight + voice-quality recording + 1 W audio playback = the physical and capability profile of a covert audio recorder.

The operator must understand the legal landscape before deploying this capability.

US federal law

Federal Wiretap Act (Title III, 18 USC §§ 2510-2522): prohibits interception of “oral communications” — defined as private communications spoken with reasonable expectation of privacy.

Federal interpretation: “one-party consent” — if one party to the conversation (the operator) consents to recording, recording is legal under federal law. Most federal investigations proceed under this.

Federal exceptions: recording where no party consents is always illegal under federal law, regardless of state law.

US state law (more restrictive than federal)

Two-party / all-party consent states: 11 US states (as of 2026-05-13) require all parties to consent before recording. Operating without all-party consent in these states is a criminal offense, typically a felony:

  • California — CA Penal Code §§ 631, 632
  • Florida — FL Stat. § 934.03
  • Illinois — 720 ILCS 5/14-2
  • Maryland — MD Cts. Jud. Proc. § 10-402
  • Massachusetts — Mass. Gen. Laws ch. 272 § 99
  • Montana — Mont. Code Ann. § 45-8-213
  • Nevada — NRS 200.620
  • New Hampshire — NH Rev. Stat. § 570-A:2
  • Pennsylvania — 18 Pa. Cons. Stat. § 5704
  • Vermont — Vermont Supreme Court rulings
  • Washington — RCW 9.73.030

The other 39 US states use one-party consent. Operating in these is legal as long as the operator is a party to the conversation.

EU + UK

GDPR (Regulation 2016/679): voice is personal data. Recording voice without lawful basis (consent / legitimate interest / legal obligation) is a regulatory violation. Penalties up to 4% of global revenue for organizations; criminal exposure under national laws for individuals.

UK Investigatory Powers Act 2016: regulates electronic interception. Strict; criminal penalties for unauthorized interception.

National variations: each EU member state has slightly different rules. Germany (StGB § 201) is particularly strict on speech without consent.

Other jurisdictions

  • Australia: state-by-state. Generally restrictive (similar to two-party consent).
  • Canada: Criminal Code § 184 — one-party consent.
  • Japan: Wiretap Act prohibits private-conversation recording without consent.
  • Russia / China / restrictive regions: assume strict prohibitions and severe consequences.

Operational rule for M5StickS3 covert-audio use

The M5StickS3 can technically be deployed as a covert audio recorder. The operator must not do so except under explicit authorization.

Practical operational discipline:

  1. Know your jurisdiction: in the US, the state of operation sets the rule; in the EU, the member state's national law does.
  2. Document authorization — written, signed, scope-specified, before engagement.
  3. Time-box — shorter is safer.
  4. Don’t deploy in spaces where third parties might be present without all-party consent: schools, hospitals, courthouses, public accommodations have additional rules.
  5. Sanitize recordings post-engagement — chain-of-custody discipline (Vol 11 § 11).
  6. For tjscientist’s own bench: recording yourself / your own equipment / your own private spaces = legal everywhere. This is the safe operating envelope.

For personal use (voice memos, audio note-taking, sound recording of your own activities): no legal exposure.

For engagement work: get authorization, document, time-box, sanitize.

For public-space deployment (magnetic-back-stick on the side of a server rack with audio recording): only with explicit authorization from the venue + all parties present. Otherwise: don’t.


11. Audio power profile

Per-mode current draw + battery-life math:

Audio mode | Current (mA) | 250 mAh battery life
Idle (no audio) | ~80 mA | ~3 hours
Recording at 16 kHz 16-bit mono | ~95-105 mA | ~2.4 hours
Recording at 96 kHz 24-bit stereo | ~140 mA | ~1.8 hours
Playback at low speaker volume | ~150-200 mA | ~1.3-1.7 hours
Playback at full 1 W speaker output | ~280-320 mA peak | ~45-50 minutes
ESP-NOW walkie-talkie active | ~200-250 mA average | ~1.0-1.2 hours
Wake-word detection continuously | ~85 mA | ~2.9 hours (almost free)
Internet radio reception | ~150-180 mA | ~1.4-1.6 hours
FFT audio analysis | ~120 mA | ~2 hours
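The battery-life column is capacity divided by average current. A one-function sketch of that estimate; battery_hours is an illustrative name, and the result is an idealized figure that ignores regulator losses, cutoff voltage and battery aging:

```cpp
// Estimated runtime in hours for a given capacity and average current draw.
constexpr double battery_hours(double capacity_mah, double current_ma) {
    return capacity_mah / current_ma;
}
```

For example, battery_hours(250, 80) gives 3.125 hours, matching the ~3 hour idle figure, and battery_hours(250, 100) gives 2.5 hours for 16 kHz recording.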

Key insights:

  1. Wake-word detection has minimal power overhead — can run continuously without materially shortening battery life.
  2. Sustained playback at full volume drains the battery in <1 hour — much faster than Cardputer ADV (1750 mAh battery + same audio chain → ~5 hours playback).
  3. Recording-only is power-efficient — ~2-3 hours.
  4. Walkie-talkie sustained is borderline — under 1.5 hours at 250 mAh.

Thermal consideration: continuous high-volume playback for >10-15 minutes makes the M5StickS3 case palpably warm. ESP32-S3 die temp climbs but doesn’t throttle until ~125°C. Speaker driver cone displacement starts to drift after sustained high SPL.

Operational guidance:

  • For sustained audio work (long internet radio listening, extended walkie-talkie session), plug in USB-C rather than relying on battery.
  • For voice-memo workflows (record-as-needed), battery is fine.
  • For wake-word-activated workflows, battery is fine for many hours.

12. Resources

Library / driver references

Hardware datasheets

Use-case firmwares

  • RHesus-RAdio (internet radio): GitHub search
  • m5Cardputer_audiospectrum (FFT visualizer): GitHub search
  • esp-now-talkie (walkie-talkie): GitHub search

Legal references

Forward references


This is Volume 5 of a twelve-volume series. Next: Vol 6 covers the firmware ecosystem — Evil-M5Project family, Bruce-for-stick, Marauder ports, MicroHydra, UiFlow 2, ESPHome, retro emulators, audio-niche firmwares.