M5Stick S3 · Volume 5
M5Stack M5StickS3 Volume 5 — Audio Subsystem Deep Dive (the standout feature)
ES8311 codec + MEMS mic + AW8737 amp + 8 Ω 1 W speaker — voice recording, FFT, wake-word, walkie-talkie, internet radio
1. About this volume — why audio matters
Vol 5 is the audio subsystem deep dive — the unique-value-proposition volume for the M5StickS3. The audio chain is the single feature that differentiates the M5StickS3 from every other stick in M5Stack’s lineup and from most pentest handhelds generally:
- M5StickC Plus 2 (classic ESP32): passive buzzer only — can produce tones, not voice
- Cardputer ADV (ESP32-S3): has the same ES8311 + speaker chain — but the larger form factor sacrifices the wearable/covert use case
- AWOK Dual Touch V3: no audio subsystem
- Game Over: no audio subsystem
- Flipper Zero: passive buzzer (PWM tones) — no microphone, no speaker for voice
The M5StickS3 is the only wearable-form-factor device in tjscientist’s lineup with voice-quality recording and playback. That’s the operational niche this volume covers.
What the audio chain enables (with sections in this volume):
- Voice recording (§ 4) — MEMS mic → ES8311 codec → 8 MB flash or PSRAM. Audible-quality voice memos.
- Real-time audio FFT (§ 5) — visualization of frequency content.
- Wake-word detection (§ 6) — voice-activated features via esp-skainet’s Multinet5 / WakeNet.
- ESP-NOW walkie-talkie (§ 7) — push-to-talk between two M5StickS3s.
- Internet radio receiver (§ 8) — Shoutcast / Icecast / direct MP3 URLs.
- Audio FX / DAW prototyping (§ 9) — limited but possible.
- Covert audio recording (§ 10) — operationally hazardous; legal caveats apply.
Power profile under sustained audio (§ 11) is the load-bearing operational constraint — the 250 mAh battery + 1 W speaker is a tight envelope.
2. The audio chain (block diagram + components)
┌──────────────┐
│  MEMS mic    │  65 dB SNR omnidirectional
│              │  (bottom-firing typical)
└──────┬───────┘
       │ analog
       ↓
┌──────────────┐         ┌────────────────────┐
│   ES8311     │ ←───── │  ESP32-S3 LX7      │
│  24-bit I²S  │  I²C    │  I²S peripheral    │
│   codec      │ (0x18)  │  FreeRTOS audio    │
│              │ ─────→ │  task              │
└──────┬───────┘  I²S    └────────────────────┘
       │ analog out
       ↓
┌──────────────┐
│   AW8737     │  Class-D amp, 1 W into 8 Ω
│   PD pin     │  (firmware can mute via PD)
└──────┬───────┘
       │ amplified
       ↓
┌──────────────┐
│  8 Ω 1 W     │  Small mylar cone driver
│  speaker     │  Audible in quiet rooms
└──────────────┘
Components:
| Component | Part | Function | Notes |
|---|---|---|---|
| MEMS microphone | Generic 65 dB SNR | Voice capture, ambient audio recording | Specific part not vendor-documented; likely Knowles / InvenSense class |
| ES8311 codec | Everest Semiconductor ES8311 | I²S 24-bit codec + I²C config (0x18). Sample rates 8 kHz - 96 kHz | Same chip as Cardputer ADV; M5Unified handles both transparently |
| Audio amplifier | Awinic AW8737 | Class-D, 1 W into 8 Ω, lower noise floor than budget class-D | Has PD (power-down) pin for firmware-controlled mute |
| Speaker | 8 Ω 1 W mylar cone | Audible in quiet rooms (~60 dB SPL @ 30 cm), not loud enough for noisy environments | Small driver; bass response limited |
| (Future) 3.5 mm jack via Hat2 | TBD | Headphone output | Not on M5StickS3 base unit |
Optional: a future Hat2 audio-jack accessory would route ES8311 output to a 3.5 mm TRRS jack. Not currently available. A Bluetooth speaker is not a practical substitute: the ESP32-S3 radio is Bluetooth LE only (no Bluetooth Classic, so no A2DP), and ordinary Bluetooth speakers expect A2DP.
3. ES8311 codec — control + sample-rate matrix
The ES8311 is configured via I²C (address 0x18) and streams audio via I²S.
Sample rate / bit depth matrix:
| Sample rate (kHz) | Bit depth | Bytes / sec mono | Typical use case | M5Unified default? |
|---|---|---|---|---|
| 8 | 16-bit | 16 KB/s | Telephony-quality voice; ESP-NOW walkie-talkie μ-law | Sometimes |
| 16 | 16-bit | 32 KB/s | Voice memos (good intelligibility, compact) | Yes — default for M5.Mic |
| 22.05 | 16-bit | 44 KB/s | ”Voice quality” plus some music headroom | Optional |
| 44.1 | 16-bit | 88 KB/s | CD-quality sample rate (figure is per channel; the ES8311 is a mono codec) | Optional |
| 48 | 16-bit | 96 KB/s | DVD-quality sample rate (per channel) | Optional |
| 48 | 24-bit | 144 KB/s | Studio-quality | Power-heavy |
| 96 | 24-bit | 288 KB/s | High-resolution studio | Max ceiling; rarely used |
Choosing the rate:
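- Voice memo recording → 16 kHz / 16-bit mono. Intelligible voice, ~32 KB/sec storage. 8 MB flash → ~4 minutes; 8 MB PSRAM → another ~4 minutes; 16 MB combined → ~8 minutes of recording. These are raw-capacity ceilings; usable space is lower once firmware partitions and heap overhead are subtracted (see § 4).
- ESP-NOW walkie-talkie → 8 kHz mono, captured as 16-bit PCM and companded to 8-bit μ-law (8 KB/sec on the air). Small enough to fit ESP-NOW's raw Wi-Fi frames.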
- Audio FFT → 16 kHz / 16-bit mono. Adequate for 16-band visualization.
- Internet radio receiver → match the source stream rate (typically 44.1 or 48 kHz). Audio decoded by libhelix-mp3 or similar.
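The storage figures in the matrix and in the list above follow from one multiplication. A minimal sketch of that arithmetic, using hypothetical helper names (`pcm_bytes_per_second`, `recording_seconds`) that are not part of M5Unified or ESP-IDF:

```cpp
#include <cassert>
#include <cstdint>

// Raw PCM data rate in bytes per second.
constexpr uint32_t pcm_bytes_per_second(uint32_t sample_rate_hz,
                                        uint32_t bits_per_sample,
                                        uint32_t channels) {
    return sample_rate_hz * (bits_per_sample / 8) * channels;
}

// Whole seconds of uncompressed audio that fit in capacity_bytes.
inline uint32_t recording_seconds(uint64_t capacity_bytes,
                                  uint32_t sample_rate_hz,
                                  uint32_t bits_per_sample,
                                  uint32_t channels) {
    const uint32_t rate =
        pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels);
    assert(rate > 0);  // rejects a zero sample rate, bit depth, or channel count
    return static_cast<uint32_t>(capacity_bytes / rate);
}
```

For example, `recording_seconds(8ull * 1024 * 1024, 16000, 16, 1)` gives 262 seconds, which is where the "~4 minutes per 8 MB" figure comes from. Real capacity is lower once partitions and file-system overhead are subtracted.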
M5Unified API:
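#include <M5Unified.h>

void setup() {
  auto cfg = M5.config();
  M5.begin(cfg);
  // Mic: 16 kHz default
  M5.Mic.begin();
  // Speaker: M5Unified handles ES8311 + AW8737 init
  // (on boards where mic and speaker share one I²S port, M5Unified's own examples
  //  call M5.Speaker.end() before M5.Mic.begin(); check whether that applies here)
  M5.Speaker.begin();
}

void loop() {
  M5.update();
  // Record buffer. static: playRaw() below keeps reading it after loop() returns.
  static int16_t buf[256];
  // record() only queues the request, so wait until the buffer is actually filled
  if (M5.Mic.record(buf, 256)) {   // fills buffer with 16-bit mono samples
    while (M5.Mic.isRecording()) { delay(1); }
  }
  // Or play tone
  M5.Speaker.tone(440, 100);       // 440 Hz for 100 ms
  // Play raw audio buffer. Pass the capture rate: playRaw() defaults to 44.1 kHz.
  M5.Speaker.playRaw(buf, 256, 16000);
}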
Lower-level API via ESP-IDF esp_codec_dev:
For applications that need finer control (e.g., simultaneously stream mic input + speaker output for a walkie-talkie pattern), use esp_codec_dev directly:
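#include "esp_codec_dev.h"
#include "esp_codec_dev_defaults.h"  // audio_codec_new_i2s_data / audio_codec_new_i2c_ctrl

const audio_codec_data_if_t *data_if;
const audio_codec_ctrl_if_t *ctrl_if;
esp_codec_dev_handle_t codec_dev;
int16_t mic_buf[256];
int16_t speaker_buf[256];

// Init codec (configuration structs elided)
data_if = audio_codec_new_i2s_data(...);
ctrl_if = audio_codec_new_i2c_ctrl(...);
codec_dev = esp_codec_dev_new(...);
esp_codec_dev_set_in_gain(codec_dev, 30.0);  // 30 dB mic gain
esp_codec_dev_set_out_vol(codec_dev, 70);    // 70% volume

// Stream format used for both directions
esp_codec_dev_sample_info_t fs_in = {
  .bits_per_sample = 16,
  .channel = 1,
  .sample_rate = 16000,
};

// Open for input + output
esp_codec_dev_open(codec_dev, &fs_in);
esp_codec_dev_read(codec_dev, mic_buf, sizeof(mic_buf));
esp_codec_dev_write(codec_dev, speaker_buf, sizeof(speaker_buf));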
This is more verbose but enables full-duplex audio for walkie-talkie patterns.
4. Voice recording workflow
End-to-end recipe for recording voice to flash:
Step-by-step
- Initialize codec at desired sample rate (16 kHz mono is standard).
- Allocate audio buffer in PSRAM (8 MB available — plenty).
- Start I²S input task — ES8311 streams samples into circular buffer.
- Main loop drains buffer — writes to flash, SD-via-Hat2, or RAM.
- Stop button → flush buffer, close file, return to menu.
Code skeleton
#include <M5Unified.h>
#include <SPIFFS.h>

const int SAMPLE_RATE = 16000;
const int BUFFER_SIZE = 256;  // samples per capture chunk (16 ms at 16 kHz)
// 4 MB (~2 minutes at 16 kHz mono). The buffer must fit both the free PSRAM heap
// and the SPIFFS partition; a full 8 MB allocation cannot succeed.
const size_t MAX_RECORDING_BYTES = 4 * 1024 * 1024;

int16_t *audio_buffer = nullptr;
size_t buffer_offset = 0;  // bytes captured so far
bool recording = false;
File output_file;

void setup() {
  auto cfg = M5.config();
  M5.begin(cfg);
  SPIFFS.begin();
  // Allocate large recording buffer in PSRAM
  audio_buffer = (int16_t *)ps_malloc(MAX_RECORDING_BYTES);
  if (!audio_buffer) {
    M5.Display.println("PSRAM alloc failed!");
    while (1) delay(100);
  }
  M5.Mic.begin();
  M5.Display.println("Hold A to record");
}

void loop() {
  M5.update();
  if (M5.BtnA.wasPressed() && !recording) {
    // Start recording
    output_file = SPIFFS.open("/voice_memo.wav", "w");
    if (output_file) {
      write_wav_header(output_file, SAMPLE_RATE);  // helper function: RIFF header
      recording = true;
      buffer_offset = 0;
      M5.Display.println("Recording...");
    }
  }
  if (recording) {
    const size_t chunk_bytes = BUFFER_SIZE * sizeof(int16_t);
    if (buffer_offset + chunk_bytes <= MAX_RECORDING_BYTES) {
      // Capture straight into PSRAM. record() is asynchronous, so wait for the
      // chunk to complete before advancing the write offset.
      int16_t *dst = audio_buffer + (buffer_offset / sizeof(int16_t));
      if (M5.Mic.record(dst, BUFFER_SIZE, SAMPLE_RATE)) {
        while (M5.Mic.isRecording()) { delay(1); }
        buffer_offset += chunk_bytes;
      }
    }
  }
  if (M5.BtnA.wasReleased() && recording) {
    // Stop recording
    recording = false;
    output_file.write((uint8_t *)audio_buffer, buffer_offset);
    finalize_wav_header(output_file, buffer_offset);
    output_file.close();
    M5.Display.printf("Saved %u bytes\n", (unsigned)buffer_offset);
  }
  if (!recording) delay(10);  // idle pacing only; a delay while recording drops samples
}
Storage paths:
- SPIFFS / LittleFS (on-chip flash): ~5 MB usable after partitions. Limit ~3 minutes at 16 kHz mono 16-bit.
- PSRAM (in-RAM buffer): ~7 MB usable. Limit ~3.5 minutes. Lost on reboot — for short captures only.
- Hat2 microSD (if accessory present): essentially unlimited (limited by card capacity). Best for sustained recording.
File format: WAV (RIFF) is the standard. Helper function write_wav_header() writes a 44-byte RIFF header at the start of the file; finalize_wav_header() patches the byte-count field at end-of-recording.
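The two helpers named above are not shown in this volume. As a rough sketch of what they have to produce, the function below builds the canonical 44-byte PCM header in memory. The name `make_wav_header` and its parameter list are illustrative assumptions, not the signatures used by the recorder skeleton:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Store `value` at dst as a little-endian integer of `bytes` width.
static void put_le(uint8_t *dst, uint32_t value, int bytes) {
    for (int i = 0; i < bytes; ++i) {
        dst[i] = static_cast<uint8_t>(value >> (8 * i));
    }
}

// Build the 44-byte RIFF/WAVE header for uncompressed PCM.
std::array<uint8_t, 44> make_wav_header(uint32_t sample_rate, uint16_t channels,
                                        uint16_t bits_per_sample,
                                        uint32_t data_bytes) {
    assert(channels > 0 && bits_per_sample > 0 && bits_per_sample % 8 == 0);
    const uint32_t block_align = channels * (bits_per_sample / 8u);
    std::array<uint8_t, 44> h{};
    std::memcpy(&h[0], "RIFF", 4);
    put_le(&h[4], 36 + data_bytes, 4);             // RIFF chunk size
    std::memcpy(&h[8], "WAVE", 4);
    std::memcpy(&h[12], "fmt ", 4);
    put_le(&h[16], 16, 4);                         // fmt chunk size for PCM
    put_le(&h[20], 1, 2);                          // audio format 1 = PCM
    put_le(&h[22], channels, 2);
    put_le(&h[24], sample_rate, 4);
    put_le(&h[28], sample_rate * block_align, 4);  // byte rate
    put_le(&h[32], block_align, 2);
    put_le(&h[34], bits_per_sample, 2);
    std::memcpy(&h[36], "data", 4);
    put_le(&h[40], data_bytes, 4);                 // payload size
    return h;
}
```

A recorder would typically write this header with `data_bytes = 0` when the file is opened, then seek back to offset 0 after the last sample and write it again with the real byte count. Only the size fields at offsets 4 and 40 change between the two writes.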
5. Real-time audio FFT visualization
The LX7 dual-core handles 16-band FFT at 16 kHz comfortably (<10% CPU). Pattern:
#include <M5Unified.h>
#include <arduinoFFT.h>  // written against the arduinoFFT 1.x API

const int SAMPLES = 256;  // must be a power of two
const int SAMPLING_FREQUENCY = 16000;
double vReal[SAMPLES];
double vImag[SAMPLES];
arduinoFFT FFT = arduinoFFT(vReal, vImag, SAMPLES, SAMPLING_FREQUENCY);

void setup() {
  M5.begin(M5.config());
  M5.Mic.begin();
}

void loop() {
  M5.update();
  // Read SAMPLES samples (record() is asynchronous, so wait for completion)
  static int16_t buf[SAMPLES];
  if (!M5.Mic.record(buf, SAMPLES, SAMPLING_FREQUENCY)) return;
  while (M5.Mic.isRecording()) { delay(1); }
  // Copy to FFT buffers
  for (int i = 0; i < SAMPLES; i++) {
    vReal[i] = (double)buf[i];
    vImag[i] = 0.0;
  }
  // Run FFT
  FFT.Windowing(FFT_WIN_TYP_HAMMING, FFT_FORWARD);
  FFT.Compute(FFT_FORWARD);
  FFT.ComplexToMagnitude();
  // Render 16-band bar graph
  M5.Display.fillScreen(BLACK);
  int bin_width = SAMPLES / 32;  // 128 usable bins (up to Nyquist) / 16 bands = 8 bins each
  int bar_width = 135 / 16;
  for (int i = 0; i < 16; i++) {
    double magnitude = 0;
    for (int j = 0; j < bin_width; j++) {
      magnitude += vReal[i * bin_width + j];
    }
    magnitude /= bin_width;
    int bar_height = (int)(magnitude * 0.05);
    if (bar_height > 240) bar_height = 240;
    M5.Display.fillRect(i * bar_width, 240 - bar_height,
                        bar_width - 1, bar_height, RED);
  }
  delay(50);  // ~20 fps update
}
The m5Cardputer_audiospectrum community port works on M5StickS3 with portrait-orientation re-flow. 16-band bar graph fills the screen cleanly.
Use cases:
- Audio spectrum visualization (party trick)
- Frequency analysis for tone detection
- Educational tool for understanding audio frequency content
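For tone detection it helps to know which frequencies each bar covers. The sketch below maps FFT bins and display bands to hertz for the 256-point, 16 kHz, 16-band layout used in this section. `bin_to_hz` and `band_range` are hypothetical helper names, not arduinoFFT functions:

```cpp
#include <cassert>

struct BandRange {
    double low_hz;
    double high_hz;
};

// Frequency of FFT bin `bin` for an fft_size-point transform.
constexpr double bin_to_hz(int bin, int fft_size, double sample_rate_hz) {
    return bin * sample_rate_hz / fft_size;
}

// Span of display band `band` when the first fft_size/2 bins
// (0 Hz up to Nyquist) are split into `bands` equal groups.
inline BandRange band_range(int band, int bands, int fft_size,
                            double sample_rate_hz) {
    assert(bands > 0 && band >= 0 && band < bands);
    const int bins_per_band = (fft_size / 2) / bands;
    return {bin_to_hz(band * bins_per_band, fft_size, sample_rate_hz),
            bin_to_hz((band + 1) * bins_per_band, fft_size, sample_rate_hz)};
}
```

With 256 samples at 16 kHz each bin is 62.5 Hz wide, so each of the 16 bars spans 500 Hz and the last bar ends at the 8 kHz Nyquist limit. A 1 kHz test tone should therefore light the third bar (1000 to 1500 Hz).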
6. Wake-word detection with esp-skainet
Espressif’s esp-skainet library provides on-device wake-word detection and speech-command recognition:
- Multinet5 model: ~800 KB. Speech-command recognition; recognizes ~40 English commands.
- WakeNet8 / 9 models: smaller, wake-word detection only (built-in and custom wake words).
The M5StickS3’s 8 MB PSRAM is plenty for these models — they load into PSRAM at boot, leaving flash free for application code.
Workflow:
MEMS mic ─→ ES8311 ─→ I²S ─→ esp-skainet wake-word detector
                                      │
                        Detects "Hi Jarvis" or similar
                                      ↓
                        Trigger application action
                        (e.g., "scan Wi-Fi")
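The wake-word detector runs at <5% CPU on the dual-core LX7, so it can run continuously without materially impacting other tasks or battery life.
Integration (illustrative pseudocode: the esp_skainet_* identifiers below are placeholders rather than a published API; ESP-Skainet applications are built on the ESP-SR audio front end and its WakeNet / MultiNet model interfaces):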
#include "esp_skainet.h"
esp_skainet_handle_t skainet;
void setup() {
M5.begin(M5.config());
M5.Mic.begin();
esp_skainet_config_t cfg = ESP_SKAINET_DEFAULT_CONFIG();
cfg.wake_word = "hi jarvis"; // Use built-in wake words list
skainet = esp_skainet_create(&cfg);
}
void loop() {
M5.update();
int16_t audio_chunk[ESP_SKAINET_AUDIO_CHUNK_SIZE];
M5.Mic.record(audio_chunk, ESP_SKAINET_AUDIO_CHUNK_SIZE);
int command_id = esp_skainet_detect(skainet, audio_chunk);
if (command_id == ESP_SKAINET_WAKE_WORD_DETECTED) {
M5.Display.println("Hi Jarvis detected!");
// Trigger app action
} else if (command_id >= 0) {
M5.Display.printf("Command: %d\n", command_id);
// Execute the recognized command
}
delay(10);
}
Available wake-words (provided by the WakeNet models, not by Multinet5): “Hi Lexin”, “Hi ESP”, “Alexa”, “Hi Jarvis”, “Hey Siri”, etc. Custom wake-word training is possible via Espressif’s ESP-Skainet training pipeline (offline process, requires audio dataset).
Common command vocabulary: ~40 English words covering basic device control — “turn on”, “turn off”, “stop”, “play”, “volume up”, “volume down”, “next”, “previous”, etc. Useful for hands-free menu navigation.
Use cases for M5StickS3:
- “Hi Jarvis, scan Wi-Fi” → triggers Marauder / Bruce / Evil-M5 scan
- “Hi Jarvis, record” → starts voice memo recorder
- “Hi Jarvis, time” → device speaks the current time (via pre-recorded samples or an external TTS library; M5.Speaker.tone(...) sequences give tonal feedback only)
- Voice navigation of NEMO/Bruce menus when buttons are inconvenient (e.g., during a presentation)
7. ESP-NOW walkie-talkie protocol
ESP-NOW is Espressif’s broadcast-style Wi-Fi raw-frames protocol — no AP, no router, no protocol stack overhead beyond raw 802.11 frames. Two M5StickS3s on the same channel + same broadcast MAC = a real walkie-talkie.
Protocol overview
Frame format:
ESP-NOW frame:
Channel: User-set (1-13)
Source MAC: Sender's MAC (or broadcast)
Dest MAC: Broadcast FF:FF:FF:FF:FF:FF
Payload: Up to 250 bytes per frame
- Sequence number (1 byte)
- Audio data (μ-law encoded, ~249 bytes per frame)
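The payload layout above can be written down as a packed struct. This is a minimal sketch under the stated 250-byte limit. The names `AudioFrame`, `make_frame` and `frames_lost` are invented for illustration and are not taken from esp-now-talkie:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr size_t kEspNowMaxPayload = 250;              // payload limit stated above
constexpr size_t kAudioBytes = kEspNowMaxPayload - 1;  // 249 after the sequence byte

struct AudioFrame {
    uint8_t seq;                 // wraps 255 -> 0
    uint8_t audio[kAudioBytes];  // μ-law samples, one byte each
};
static_assert(sizeof(AudioFrame) == kEspNowMaxPayload,
              "frame must fill exactly one payload");

// Build a frame; a short final chunk is padded with 0xFF (μ-law silence).
inline AudioFrame make_frame(uint8_t seq, const uint8_t *mulaw, size_t n) {
    assert(n <= kAudioBytes);
    AudioFrame f;
    f.seq = seq;
    std::memset(f.audio, 0xFF, kAudioBytes);
    std::memcpy(f.audio, mulaw, n);
    return f;
}

// Frames missing between two consecutively received sequence numbers.
inline uint8_t frames_lost(uint8_t prev_seq, uint8_t new_seq) {
    return static_cast<uint8_t>(new_seq - prev_seq - 1);  // modulo-256 arithmetic
}
```

The receiver can use `frames_lost` to insert the right amount of silence when broadcast frames go missing. ESP-NOW broadcasts are not acknowledged, so some loss is expected.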
Audio encoding:
- Sample rate: 8 kHz mono (telephony quality)
- Bit depth: 16-bit → compressed to 8-bit μ-law (2:1 compression)
- 8,000 samples/sec × 8 bits = 64 kbps → 8 KB/sec of audio data → one ~250-byte ESP-NOW frame roughly every 31 ms (~32 frames/sec)
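The data-rate arithmetic for this encoding can be checked in a few lines. The function names are hypothetical, and the figures assume one μ-law byte per sample and a 249-byte audio payload per frame:

```cpp
#include <cassert>
#include <cstdint>

// Bit rate of an uncompressed or companded mono stream.
constexpr uint32_t stream_bitrate_bps(uint32_t sample_rate_hz,
                                      uint32_t bits_per_sample) {
    return sample_rate_hz * bits_per_sample;
}

// Audio time carried by one frame, in microseconds.
inline uint32_t frame_duration_us(uint32_t samples_per_frame,
                                  uint32_t sample_rate_hz) {
    assert(sample_rate_hz > 0);
    return static_cast<uint32_t>(
        static_cast<uint64_t>(samples_per_frame) * 1000000u / sample_rate_hz);
}
```

At 8 kHz with 8-bit μ-law samples the stream is 64,000 bit/s, i.e. 8,000 bytes/s. A 249-sample frame carries 31,125 µs of audio, so the sender emits about 32 frames per second.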
Push-to-talk workflow:
- Button A pressed: M5StickS3 enters TX mode
  - MEMS mic → ES8311 captures audio at 8 kHz
  - μ-law encodes each sample
  - Buffers 250-byte chunks
  - Each chunk transmitted as one ESP-NOW frame (a frame carries ~31 ms of audio, so end-to-end latency is a few tens of milliseconds plus receiver buffering)
- Button A released: M5StickS3 enters RX mode
  - Listens for ESP-NOW frames
  - μ-law decodes incoming audio
  - Plays to speaker via ES8311 + AW8737
Pair / multi-cast: ESP-NOW frames are broadcast, so every M5StickS3 on the same channel receives them. Up to ~10 devices within radio range can hear the transmission.
Range
| Environment | Range (typical) |
|---|---|
| Indoor, walls between devices | ~30-50 m |
| Indoor, line-of-sight | ~100 m |
| Outdoor, line-of-sight (open field) | ~150-200 m |
| Outdoor, with obstacles | ~50-100 m |
(Standard Wi-Fi range at 2.4 GHz, +20 dBm TX power.)
Audio quality
8 kHz μ-law mono (16-bit PCM companded to 8 bits) sounds like a 1990s walkie-talkie: voice intelligible, no hi-fi, slight compression artifacts on plosives. Not for music; perfectly fine for tactical comms.
Implementation
esp-now-talkie is the community firmware. Cross-compatible with Cardputer ADV (same codec) — M5StickS3 ↔ Cardputer ADV walkie-talkie pairs work.
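#include <M5Unified.h>
#include <esp_now.h>
#include <WiFi.h>

const uint32_t SAMPLE_RATE = 8000;
const size_t FRAME_SAMPLES = 250;  // one μ-law byte per sample = one ESP-NOW payload
uint8_t broadcast_mac[6] = {0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};

// Callback signature for arduino-esp32 2.x; 3.x passes
// (const esp_now_recv_info_t *, const uint8_t *, int) instead.
void onDataRecv(const uint8_t *mac, const uint8_t *data, int len) {
  // Decode μ-law, play via M5.Speaker.
  // Static double buffer: playRaw() keeps reading after this callback returns.
  static int16_t pcm[2][FRAME_SAMPLES];
  static uint8_t slot = 0;
  if (len <= 0 || len > (int)FRAME_SAMPLES) return;
  for (int i = 0; i < len; i++) {
    pcm[slot][i] = mulaw_decode(data[i]);
  }
  M5.Speaker.playRaw(pcm[slot], len, SAMPLE_RATE);  // default rate would be 44.1 kHz
  slot ^= 1;
}

void setup() {
  M5.begin(M5.config());
  M5.Speaker.begin();  // start in RX mode
  WiFi.mode(WIFI_STA);
  esp_now_init();
  esp_now_register_recv_cb(onDataRecv);
  // The broadcast address must be registered as a peer before esp_now_send() accepts it
  esp_now_peer_info_t peer = {};
  memcpy(peer.peer_addr, broadcast_mac, 6);
  peer.channel = 0;  // 0 = current Wi-Fi channel
  peer.encrypt = false;
  esp_now_add_peer(&peer);
}

void loop() {
  M5.update();
  // Assumption: mic and speaker share one I²S port under M5Unified, so hand it over
  // on each push-to-talk edge (harmless if the board does not require it).
  if (M5.BtnA.wasPressed())  { M5.Speaker.end(); M5.Mic.begin(); }
  if (M5.BtnA.wasReleased()) { M5.Mic.end(); M5.Speaker.begin(); }
  if (M5.BtnA.isPressed()) {
    // Push-to-talk: record + broadcast
    static int16_t pcm[FRAME_SAMPLES];
    if (M5.Mic.record(pcm, FRAME_SAMPLES, SAMPLE_RATE)) {
      while (M5.Mic.isRecording()) { delay(1); }  // 250 samples at 8 kHz = 31.25 ms
      uint8_t mulaw_data[FRAME_SAMPLES];
      for (size_t i = 0; i < FRAME_SAMPLES; i++) {
        mulaw_data[i] = mulaw_encode(pcm[i]);
      }
      esp_now_send(broadcast_mac, mulaw_data, FRAME_SAMPLES);  // ~32 frames/sec = 64 kbps
    }
  } else {
    delay(5);
  }
}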
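The `mulaw_encode` / `mulaw_decode` helpers are not defined in the listing. One common way to write them is the classic G.711 μ-law bias-and-segment scheme, sketched below. This is an assumption about what the helpers do, not code taken from esp-now-talkie:

```cpp
#include <cassert>
#include <cstdint>

// Compress one 16-bit PCM sample to an 8-bit μ-law code (G.711 style).
inline uint8_t mulaw_encode(int16_t pcm) {
    const int kBias = 0x84;   // 132, added before segment search
    const int kClip = 32635;  // keeps sample + bias inside 15 bits
    int sample = pcm;
    const int sign = (sample < 0) ? 0x80 : 0x00;
    if (sample < 0) sample = -sample;
    if (sample > kClip) sample = kClip;
    sample += kBias;
    // Segment number = position of the highest set bit, counted from bit 7.
    int exponent = 7;
    for (int mask = 0x4000; (sample & mask) == 0 && exponent > 0; mask >>= 1) {
        --exponent;
    }
    const int mantissa = (sample >> (exponent + 3)) & 0x0F;
    return static_cast<uint8_t>(~(sign | (exponent << 4) | mantissa));
}

// Expand an 8-bit μ-law code back to 16-bit PCM.
inline int16_t mulaw_decode(uint8_t code) {
    const int inverted = static_cast<uint8_t>(~code);
    const int exponent = (inverted >> 4) & 0x07;
    const int mantissa = inverted & 0x0F;
    const int magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
    assert(magnitude >= 0 && magnitude <= 32124);  // largest decodable magnitude
    return static_cast<int16_t>((inverted & 0x80) ? -magnitude : magnitude);
}
```

The codec is lossy by design: quiet samples are reproduced almost exactly, while loud samples are rounded to the centre of a coarser step. A round trip of 1000 comes back as 988, and digital silence encodes as 0xFF.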
Operational use
- Tactical comms: small group on same channel, simple push-to-talk
- Conference / event coordination: M5StickS3s on lanyards
- Multi-device demo synchronization: trigger demos across multiple M5StickS3s
Privacy: traffic is unencrypted by default. For sensitive comms, add software-layer AES (pre-shared key).
8. Internet radio receiver
Workflow:
- Wi-Fi station connect to home network.
- Open HTTP/HTTPS stream from Shoutcast / Icecast / direct MP3 URL.
- Stream → libhelix-mp3 decoder → ES8311 → speaker.
- Display station name + track metadata (Shoutcast metadata embedded in stream).
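The metadata step relies on the widely used ICY convention: the client sends the request header `Icy-MetaData: 1`, the server answers with an `icy-metaint` header giving the number of audio bytes between metadata blocks, and each block starts with one length byte counted in 16-byte units. A minimal parsing sketch follows; the function names are hypothetical and this has not been checked against RHesus-RAdio:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Metadata block length in bytes: the length byte counts 16-byte units.
inline size_t icy_metadata_length(uint8_t length_byte) {
    return static_cast<size_t>(length_byte) * 16;
}

// Extract the StreamTitle value from one metadata block, or "" if absent.
inline std::string icy_stream_title(const std::string &metadata) {
    static const std::string key = "StreamTitle='";
    const size_t start = metadata.find(key);
    if (start == std::string::npos) return "";
    const size_t value_start = start + key.size();
    // Search for the closing "';" so apostrophes inside titles survive.
    const size_t end = metadata.find("';", value_start);
    if (end == std::string::npos) return "";
    assert(end >= value_start);
    return metadata.substr(value_start, end - value_start);
}
```

A length byte of 0 means "no change since the last block", so the display only needs updating when a non-empty block arrives.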
Bandwidth: 128 kbps MP3 stream is typical. M5StickS3’s 2.4 GHz Wi-Fi handles this easily.
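Audio quality: at the 8 Ω 1 W speaker, acceptable for casual listening in quiet rooms. A Bluetooth speaker is not a practical upgrade path, because the ESP32-S3 is Bluetooth LE only and ordinary Bluetooth speakers expect Classic A2DP; better output needs a wired path such as the planned Hat2 audio jack.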
Community firmware: RHesus-RAdio (DrRhesus) is the canonical implementation. Originally for Cardputer ADV; M5StickS3 port likely requires minor UI re-flow.
Architecture:
[Wi-Fi station]
│
↓ HTTPS GET on stream URL
[ESP32-S3 HTTP client]
│
↓ MP3 frame buffer (in PSRAM — 64 KB typical)
[libhelix-mp3 decoder]
│
↓ Decoded PCM samples (44.1 kHz 16-bit stereo)
[Audio mixer (downmix to mono if needed)]
│
↓ I²S
[ES8311 codec]
│
↓ Analog
[AW8737 amp]
│
↓
[Speaker (or 3.5 mm jack if Hat2 accessory present)]
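The mixer stage in the diagram is small, since the speaker path is mono. A sketch of a stereo-to-mono downmix over interleaved 16-bit samples, with a hypothetical function name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Average interleaved L/R pairs into mono. `frames` is the number of
// stereo sample pairs; `mono` must have room for `frames` samples.
inline void downmix_to_mono(const int16_t *interleaved, size_t frames,
                            int16_t *mono) {
    assert(interleaved != nullptr && mono != nullptr);
    for (size_t i = 0; i < frames; ++i) {
        // Sum in 32 bits so two full-scale samples cannot overflow.
        const int32_t sum = static_cast<int32_t>(interleaved[2 * i]) +
                            static_cast<int32_t>(interleaved[2 * i + 1]);
        mono[i] = static_cast<int16_t>(sum / 2);
    }
}
```

Averaging rather than summing keeps the result inside the 16-bit range without a separate clipping step.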
Stream sources:
- Public Shoutcast / Icecast directories: shoutcast.com, internet-radio.com, radio-browser.info
- Specific public streams: NPR, BBC, college radio, themed playlists
- Self-hosted: any HTTP-served MP3/AAC file or Icecast server
The M5StickS3 becomes a tiny “kitchen radio” / “shower radio” / “office music” device — battery-powered (250 mAh → ~1-2 hours playback) or USB-tethered for unlimited duration.
9. Audio FX / DAW prototyping
The LX7 dual-core handles modest real-time audio processing. Possible (but limited):
| Effect | Feasibility | Notes |
|---|---|---|
| Low-pass / high-pass filter | ✓ trivial | Simple IIR filter on incoming samples |
| Echo / delay | ✓ easy | Circular buffer + mix |
| Reverb (basic) | ✓ moderate | Multi-tap delay with feedback |
| Pitch shift | ⚠ heavy | CPU-bound; expect 50-70% CPU at 16 kHz |
| Vocoder | ⚠ very heavy | Multi-band filter bank; pushes CPU limits |
| Real-time mixing 4+ tracks | ⚠ borderline | PSRAM helps with buffers |
| Hardware-class synth | ✗ no | The S3's vector (SIMD) extensions help, but dedicated DSP chips still outperform it |
| Studio-class DAW | ✗ no | Cardputer ADV with QWERTY is better for any text-driven workflow |
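The two cheapest rows of the table fit in a few lines each. The classes below are a generic sketch of a one-pole IIR low-pass and a feedback echo on a circular buffer, operating on float samples. The class names and coefficients are illustrative and not tuned for this hardware:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One-pole IIR low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1]).
class OnePoleLowPass {
public:
    explicit OnePoleLowPass(float alpha) : alpha_(alpha) {
        assert(alpha > 0.0f && alpha <= 1.0f);
    }
    float process(float x) {
        y_ += alpha_ * (x - y_);
        return y_;
    }
private:
    float alpha_;
    float y_ = 0.0f;
};

// Feedback echo: out[n] = x[n] + feedback * out[n - delay].
class Echo {
public:
    Echo(size_t delay_samples, float feedback)
        : buf_(delay_samples, 0.0f), feedback_(feedback) {
        assert(delay_samples > 0 && feedback >= 0.0f && feedback < 1.0f);
    }
    float process(float x) {
        const float out = x + feedback_ * buf_[pos_];  // oldest sample in the ring
        buf_[pos_] = out;                              // overwrite it with the newest
        pos_ = (pos_ + 1) % buf_.size();
        return out;
    }
private:
    std::vector<float> buf_;
    size_t pos_ = 0;
    float feedback_;
};
```

At 16 kHz a 250 ms echo needs a 4,000-sample buffer (16 KB as float), which fits in internal RAM; longer delays are where PSRAM becomes useful.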
For real audio prototyping: use a Cardputer ADV (same codec + better UI), or a dedicated platform (Teensy 4.x + Audio Library, Bela board, modular synth modules).
The M5StickS3’s audio chain is good for proof-of-concept and niche audio recording — not for serious music production.
10. The “covert audio recorder” use case (with legal caveat)
The M5StickS3’s wearable form factor + magnetic back + 20 g weight + voice-quality recording + 1 W audio playback = the physical and capability profile of a covert audio recorder.
The operator must understand the legal landscape before deploying this capability.
US federal law
Federal Wiretap Act (Title III, 18 USC §§ 2510-2522): prohibits interception of “oral communications” — defined as private communications spoken with reasonable expectation of privacy.
Federal interpretation: “one-party consent” — if one party to the conversation (the operator) consents to recording, recording is legal under federal law. Most federal investigations proceed under this.
Federal exceptions: recording where no party consents is always illegal under federal law, regardless of state law.
US state law (more restrictive than federal)
Two-party / all-party consent states: 11 US states (as of 2026-05-13) require all parties to consent before recording. Operating without all-party consent in these states is a criminal offense, typically a felony:
- California — CA Penal Code §§ 631, 632
- Florida — FL Stat. § 934.03
- Illinois — 720 ILCS 5/14-2
- Maryland — MD Cts. Jud. Proc. § 10-402
- Massachusetts — Mass. Gen. Laws ch. 272 § 99
- Montana — Mont. Code Ann. § 45-8-213
- Nevada — NRS 200.620
- New Hampshire — NH Rev. Stat. § 570-A:2
- Pennsylvania — 18 Pa. Cons. Stat. § 5704
- Vermont — Vermont Supreme Court rulings
- Washington — RCW 9.73.030
The other 39 US states use one-party consent. Operating in these is legal as long as the operator is a party to the conversation.
EU + UK
GDPR (Regulation 2016/679): voice is personal data. Recording voice without lawful basis (consent / legitimate interest / legal obligation) is a regulatory violation. Penalties up to 4% of global revenue for organizations; criminal exposure under national laws for individuals.
UK Investigatory Powers Act 2016: regulates electronic interception. Strict; criminal penalties for unauthorized interception.
National variations: each EU member state has slightly different rules. Germany (StGB § 201) is particularly strict on speech without consent.
Other jurisdictions
- Australia: state-by-state. Generally restrictive (similar to two-party consent).
- Canada: Criminal Code § 184 — one-party consent.
- Japan: Wiretap Act prohibits private-conversation recording without consent.
- Russia / China / restrictive regions: assume strict prohibitions and severe consequences.
Operational rule for M5StickS3 covert-audio use
The M5StickS3 can technically be deployed as a covert audio recorder. The operator must not do so except under explicit authorization.
Practical operational discipline:
- Know your jurisdiction: in the US, the state of operation determines the consent rule; in the EU, the member state's national law does.
- Document authorization — written, signed, scope-specified, before engagement.
- Time-box — shorter is safer.
- Don’t deploy in spaces where third parties might be present without all-party consent: schools, hospitals, courthouses, public accommodations have additional rules.
- Sanitize recordings post-engagement — chain-of-custody discipline (Vol 11 § 11).
- For tjscientist’s own bench: recording yourself / your own equipment / your own private spaces = legal everywhere. This is the safe operating envelope.
For personal use (voice memos, audio note-taking, sound recording of your own activities): no legal exposure.
For engagement work: get authorization, document, time-box, sanitize.
For public-space deployment (magnetic-back-stick on the side of a server rack with audio recording): only with explicit authorization from the venue + all parties present. Otherwise: don’t.
11. Audio power profile
Per-mode current draw + battery-life math:
| Audio mode | Current (mA) | 250 mAh battery life |
|---|---|---|
| Idle (no audio) | ~80 mA | ~3 hours |
| Recording at 16 kHz 16-bit mono | ~95-105 mA | ~2.4 hours |
| Recording at 96 kHz 24-bit | ~140 mA | ~1.8 hours |
| Playback at low speaker volume | ~150-200 mA | ~1.3-1.7 hours |
| Playback at full 1 W speaker output | ~280-320 mA peak | ~45-50 minutes |
| ESP-NOW walkie-talkie active | ~200-250 mA average | ~1.0-1.2 hours |
| Wake-word detection continuously | ~85 mA | ~2.9 hours (almost free) |
| Internet radio reception | ~150-180 mA | ~1.4-1.6 hours |
| FFT audio analysis | ~120 mA | ~2 hours |
Key insights:
- Wake-word detection has minimal power overhead — can run continuously without materially shortening battery life.
- Sustained playback at full volume drains the battery in <1 hour — much faster than Cardputer ADV (1750 mAh battery + same audio chain → ~5 hours playback).
- Recording-only is power-efficient — ~2-3 hours.
- Walkie-talkie sustained is borderline — under 1.5 hours at 250 mAh.
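The battery-life column is capacity divided by average current. A small sketch of that estimate, with a hypothetical function name; it ignores regulator losses, battery ageing and the low-voltage cutoff, so it gives an upper bound:

```cpp
#include <cassert>

// Ideal runtime in minutes for a battery of capacity_mah at a steady current_ma.
inline double battery_minutes(double capacity_mah, double current_ma) {
    assert(current_ma > 0.0);
    return capacity_mah * 60.0 / current_ma;
}
```

For the 250 mAh cell, 100 mA (recording) gives 150 minutes and 300 mA (full-volume playback) gives 50 minutes, matching the table rows above.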
Thermal consideration: continuous high-volume playback for >10-15 minutes makes the M5StickS3 case palpably warm. ESP32-S3 die temp climbs but doesn’t throttle until ~125°C. Speaker driver cone displacement starts to drift after sustained high SPL.
Operational guidance:
- For sustained audio work (long internet radio listening, extended walkie-talkie session), plug in USB-C rather than relying on battery.
- For voice-memo workflows (record-as-needed), battery is fine.
- For wake-word-activated workflows, battery is fine for many hours.
12. Resources
Library / driver references
- M5Unified M5.Mic / M5.Speaker APIs: https://github.com/m5stack/M5Unified
- esp_codec_dev framework: https://github.com/espressif/esp-adf
- esp-skainet wake word + audio AI: https://github.com/espressif/esp-skainet
- Arduino-ESP32 I²S API: https://docs.espressif.com/projects/arduino-esp32/en/latest/api/i2s.html
- arduinoFFT library: https://github.com/kosme/arduinoFFT
- libhelix-mp3: https://github.com/pschatzmann/arduino-libhelix
Hardware datasheets
- ES8311 codec: https://www.everest-semi.com/
- AW8737 class-D amp: Awinic
Use-case firmwares
- RHesus-RAdio (internet radio): GitHub search
- m5Cardputer_audiospectrum (FFT visualizer): GitHub search
- esp-now-talkie (walkie-talkie): GitHub search
Legal references
- US Federal Wiretap Act: https://www.law.cornell.edu/uscode/text/18/2510
- US two-party consent state laws: cited individually in § 10
- EU GDPR Article 4 (definition of personal data): https://gdpr.eu/article-4-definitions/
- UK Investigatory Powers Act 2016: https://www.legislation.gov.uk/ukpga/2016/25
Forward references
- Recipe workflows: Vol 9 § 4 (audio-specific recipes)
- Operational posture / battery realism: Vol 11 § 4
- Custom firmware patterns for audio: Vol 10 § 7
This is Volume 5 of a twelve-volume series. Next: Vol 6 covers the firmware ecosystem — Evil-M5Project family, Bruce-for-stick, Marauder ports, MicroHydra, UiFlow 2, ESPHome, retro emulators, audio-niche firmwares.