Testing Whisper‑Tiny

Sometimes you don’t need fancy gear to have big fun. In this post, I’ll share my experience running OpenAI’s Whisper‑Tiny speech recognition model locally with the ultimate test setup: the cheapest $5‑special microphone I could find. Will it understand me? Will it confuse “coffee with milk” with “coughing in silk”? Let’s find out.

Why Whisper‑Tiny?

  • Lightweight: Tiny model → runs on CPU, even on modest laptops.
  • Fast: It trades some accuracy for speed, which is a fair deal for experimenting.
  • Self‑contained: Works offline—your audio data never leaves your machine.

The Test Dataset

I prepared a set of 50 short phrases covering everyday commands, food, numbers, locations, tongue twisters, and “ummmm” sounds. The idea is to stress‑test the model with a mix of easy wins and likely failure points.

Here is the full list, grouped by category:

Everyday Commands & Queries

  1. Open the door
  2. Turn off the lights
  3. Call my mom
  4. Send a text message
  5. What’s the weather?
  6. Play the next song
  7. Stop the music
  8. Start the timer
  9. Set an alarm for seven
  10. Cancel reminder

Food & Drink

  1. Coffee with milk
  2. Pizza delivery tonight
  3. Two bottles of water
  4. Hot chocolate please
  5. Fresh apple juice

Numbers & Dates

  1. One two three four five
  2. Eleven twelve thirteen
  3. Twenty twenty-four
  4. March third, nineteen ninety-nine
  5. Ten o’clock sharp

Locations & Travel

  1. Take me home
  2. Nearest gas station
  3. Central train station
  4. Go to the airport
  5. Map of New York

Random Words for Clarity

  1. Red blue green yellow
  2. Cat dog bird fish
  3. Alpha beta gamma delta
  4. Yes no maybe later
  5. Up down left right

Conversational Snippets

  1. How are you today?
  2. I’m feeling great
  3. That was funny
  4. I don’t know
  5. See you tomorrow

Tech Stuff

  1. Open the settings
  2. Restart computer now
  3. Wi‑Fi disconnected
  4. Bluetooth headphones paired
  5. Battery is low

Tricky / Fun Utterances

  1. She sells seashells
  2. Peter picked a pepper
  3. Unique New York
  4. Toy boat toy boat toy boat
  5. The quick brown fox jumps

Edge Cases

  1. Zero point zero one percent
  2. Nine nine nine nine
  3. Zzzzz sound
  4. Hmmm, let me think
  5. Okay okay okay

Setup

1. Hardware

  • Microphone: the cheapest USB mic I could find online (cost less than a sandwich).
  • Computer: a PC with an NVIDIA GeForce GTX 1660 SUPER GPU.

2. Software
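
The exact software stack isn't listed here, so what follows is only a minimal sketch of one possible setup. It assumes the open-source `openai-whisper` Python package (installed with `pip install openai-whisper`) and `ffmpeg` on the system path. The function name `transcribe_folder` and the idea of keeping one `.wav` file per phrase in a folder are made up for illustration.

```python
from pathlib import Path


def transcribe_folder(folder, model_name="tiny", language=None):
    """Transcribe every .wav file in `folder`; returns {file stem: text}.

    Assumes the `openai-whisper` package. The folder layout is hypothetical.
    """
    wav_files = sorted(Path(folder).glob("*.wav"))
    if not wav_files:
        return {}  # nothing to transcribe, so skip loading the model

    import whisper  # imported lazily because loading it is slow

    model = whisper.load_model(model_name)  # downloads the weights on first use
    results = {}
    for wav in wav_files:
        # language=None lets Whisper auto-detect; "en" forces English decoding
        output = model.transcribe(str(wav), language=language)
        results[wav.stem] = output["text"].strip()
    return results
```

A call might look like `transcribe_folder("recordings", language="en")`. A few outputs in the tables below look non-English (phrases 5, 35, and 39), which may be language auto-detection guessing wrong on short clips, so forcing English is one thing worth trying.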

Recording the Audio

I recited each of the 50 test phrases into the bargain‑bin mic, making sure to include:

  • Clear speaking (to test best case)
  • Mumbling and background noise (to test worst case)

Transcription Process

  1. Record and transcribe audio.
  2. Collect the outputs in a simple table with columns:
    Phrase ID | Original Phrase | Whisper Output | Error/Match
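
The Error/Match column can be filled in by hand, but a small helper makes the first pass quicker. The sketch below is one possible approach, not necessarily how the tables in this post were scored: it lowercases both strings, strips punctuation, and compares what is left. The names `normalize` and `is_match` are illustrative.

```python
import re


def normalize(text):
    """Lowercase, remove punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def is_match(reference, hypothesis):
    """True when the two strings are equal after normalization."""
    return normalize(reference) == normalize(hypothesis)


print(is_match("Pizza delivery tonight", "Pizza delivery tonight."))
print(is_match("Call my mom", "Go, my mom."))
print(is_match("Set an alarm for seven", "Set an alarm for 7"))
```

Note the limitation in the last call: this helper treats "seven" and "7" as different, while the tables in this post count digit renderings such as "7" or "0.01%" as matches. Rows with numbers therefore need either a number-word mapping or a manual check.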

Clear speaking (to test best case)

Phrase ID | Original Phrase | Whisper Output | Error/Match
1 | Open the door | Open the door | match
2 | Turn off the lights | Turn off the lights | match
3 | Call my mom | Go, my mom. | error
4 | Send a text message | Send a text message | match
5 | What’s the weather? | Wchodz ze wezor | error
6 | Play the next song | Play the next song | match
7 | Stop the music | Stop the music | match
8 | Start the timer | Start the timer | match
9 | Set an alarm for seven | Set an alarm for 7 | match
10 | Cancel reminder | Council Reminder | error
11 | Coffee with milk | coffee, vis milk | error
12 | Pizza delivery tonight | Pizza delivery tonight. | match
13 | Two bottles of water | Two bottles of water | match
14 | Hot chocolate please | hot chocolate please | match
15 | Fresh apple juice | Fresh Apple Juice | match
16 | One two three four five | 1, 2, 3, 4, 5 | match
17 | Eleven twelve thirteen | 11 12 13 | match
18 | Twenty twenty-four | 20 24 | match
19 | March third, nineteen ninety-nine | March, sod, 1999 | error
20 | Ten o’clock sharp | Then, Ocklock Sharp | error
21 | Take me home | Take me home | match
22 | Nearest gas station | nearest gas station | match
23 | Central train station | Central train station | match
24 | Go to the airport | Go to the airport | match
25 | Map of New York | map of New York | match
26 | Red blue green yellow | Red, blue, green, yellow | match
27 | Cat dog bird fish | cat, dog, bird, fish | match
28 | Alpha beta gamma delta | Alpha, beta, gamma, delta | match
29 | Yes no maybe later | Yes, no, maybe later | match
30 | Up down left right | Up, down, left, right | match
31 | How are you today? | How are you today? | match
32 | I’m feeling great | I’m feeling great | match
33 | That was funny | That was funny | match
34 | I don’t know | I don’t know | match
35 | See you tomorrow | See you tomorrow | match
36 | Open the settings | Open the settings | match
37 | Restart computer now | Restart Computer Now | match
38 | Wi-Fi disconnected | Wi-Fi disconnected | match
39 | Bluetooth headphones paired | Bluetooth headphones paired | match
40 | Battery is low | Battery is low | match
41 | She sells seashells | She sells seashells | match
42 | Peter picked a pepper | Peter, pickid, and pepper | error
43 | Unique New York | unique New York | match
44 | Toy boat toy boat toy boat | to a Buddha, to a Buddh | error
45 | The quick brown fox jumps | The quick brown fox jumps | match
46 | Zero point zero one percent | 0.01% | match
47 | Nine nine nine nine | 9 9 9 9 | match
48 | Zzzzz sound | This sound | error
49 | Hmmm, let me think | Hmm. Let me sink | error
50 | Okay okay okay | Okay okay okay | match

Mumbling and background noise (to test worst case)

Phrase ID | Original Phrase | Whisper Output | Error/Match
1 | Open the door | Open the door | match
2 | Turn off the lights | It removes the lads | error
3 | Call my mom | Goal my mom | error
4 | Send a text message | Send it to the message | error
5 | What’s the weather? | what is the VZR? | error
6 | Play the next song | Blaze the next song | error
7 | Stop the music | Stubbs in music | error
8 | Start the timer | Starts with time on | error
9 | Set an alarm for seven | Set an alarm for seven | match
10 | Cancel reminder | Consular Mander | error
11 | Coffee with milk | Coffee is milk | error
12 | Pizza delivery tonight | Pizza delivery tonet | error
13 | Two bottles of water | Two bottles of water | match
14 | Hot chocolate please | Hot chocolate please | match
15 | Fresh apple juice | Fresh apple juice | match
16 | One two three four five | 1, 2, 3, 4, 5 | match
17 | Eleven twelve thirteen | 11 12 13 | match
18 | Twenty twenty-four | 20, 24 | match
19 | March third, nineteen ninety-nine | Mart, soat, 1999 | error
20 | Ten o’clock sharp | Then a clog sharp | error
21 | Take me home | Take me home | match
22 | Nearest gas station | nearest gas station | match
23 | Central train station | Central Drain Station | error
24 | Go to the airport | go to the airport | match
25 | Map of New York | Map of the New York | error
26 | Red blue green yellow | Red, blue, green, yellow | match
27 | Cat dog bird fish | Cat dog bird fish | match
28 | Alpha beta gamma delta | alpha, beta, gamma, delta | match
29 | Yes no maybe later | Yes, no, maybe later | match
30 | Up down left right | up down left right | match
31 | How are you today? | How are you today? | match
32 | I’m feeling great | I’m feeling great | match
33 | That was funny | That was funny | match
34 | I don’t know | I don’t know | match
35 | See you tomorrow | Siv till morgon | error
36 | Open the settings | Open the settings | match
37 | Restart computer now | We’re start computer now | error
38 | Wi-Fi disconnected | Why do I disconnect? | error
39 | Bluetooth headphones paired | Bluetooth et de prendre sprueste | error
40 | Battery is low | battery law | error
41 | She sells seashells | She sells, she sells | error
42 | Peter picked a pepper | Bit of a big, a bit better | error
43 | Unique New York | You need milk | error
44 | Toy boat toy boat toy boat | Toi bort, toi bort, toi bort | error
45 | The quick brown fox jumps | and the quick brown folks jumps | error
46 | Zero point zero one percent | 0.01% | match
47 | Nine nine nine nine | No, no, no, no | error
48 | Zzzzz sound | Sound | error
49 | Hmmm, let me think | Hmm, let me sink | error
50 | Okay okay okay | Okay, okay, okay | match

Summary results

Test type | Match | Error | Match % | Error %
Clear speaking | 40 | 10 | 80% | 20%
Mumbling and background noise | 23 | 27 | 46% | 54%

Raw Performance Stats

  • Clear speaking:
    • Matches: 40/50
    • Accuracy: 80%
  • Mumbling + noise:
    • Matches: 23/50
    • Accuracy: 46%

That’s a huge gap: a 34 percentage-point drop when conditions get difficult.
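
For anyone who wants to check the arithmetic, here is a short sketch using the match counts from the summary table. It also shows the difference between a drop in percentage points and a relative drop. The helper name `accuracy` is just for illustration.

```python
def accuracy(matches, total):
    """Share of exact matches, as a percentage."""
    return 100.0 * matches / total


clear = accuracy(40, 50)   # clear speaking
noisy = accuracy(23, 50)   # mumbling and background noise

drop_points = clear - noisy                  # difference in percentage points
relative_drop = 100.0 * drop_points / clear  # drop relative to the clear-speech score

print(f"Clear: {clear:.0f}%  Noisy: {noisy:.0f}%")
print(f"Drop: {drop_points:.0f} percentage points ({relative_drop:.1f}% relative)")
```

Put differently, the noisy run loses a bit over two fifths of the phrases that the clear run got right.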


What This Means

  • Tiny model strengths:
    • Relatively solid under ideal conditions—80% isn’t bad for a very small model.
    • Fast and resource-efficient, works on lower-powered devices.
  • Tiny model weaknesses:
    • Struggles significantly with noisy, “imperfect” speech.
    • This is expected: whisper-tiny has far fewer parameters than the larger Whisper models (roughly 39 million), so its “ear” for dealing with accents, mumbling, and background sounds is limited.

How to Interpret

Think of whisper-tiny as the bicycle of ASR models: lightweight, efficient, easy to deploy—but not the champion for carrying heavy loads (like messy audio).
Larger models such as whisper-base, whisper-small, or whisper-medium are more like scooters or cars: heavier and more resource-hungry, but able to handle more complicated journeys.


Next Steps You Might Explore

  • Comparison with bigger models: Run the same 50-phrase test with whisper-base or whisper-small to see whether accuracy in the noisy case jumps (with larger models it usually does).
  • Preprocessing tricks:
    • Noise reduction (e.g. with a spectral-gating tool such as noisereduce, or even simple filters).
    • Volume normalization.
  • Data augmentation for robustness: If you are deploying on a custom task, fine-tuning with examples that include your own tricky noise situations can help considerably.
  • Error analysis: What types of words were consistently misrecognized—short function words, numbers, names? This often reveals model limits.
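
As a starting point for the error analysis, the sketch below counts errors per phrase category. The phrase-ID ranges and the error IDs are read straight off the two result tables above. The names `CATEGORIES` and `errors_by_category` are illustrative.

```python
# Phrase-ID ranges per category, in the order of the phrase list
CATEGORIES = {
    "Everyday Commands & Queries": range(1, 11),
    "Food & Drink": range(11, 16),
    "Numbers & Dates": range(16, 21),
    "Locations & Travel": range(21, 26),
    "Random Words for Clarity": range(26, 31),
    "Conversational Snippets": range(31, 36),
    "Tech Stuff": range(36, 41),
    "Tricky / Fun Utterances": range(41, 46),
    "Edge Cases": range(46, 51),
}

# Phrase IDs marked "error" in each results table
CLEAR_ERRORS = {3, 5, 10, 11, 19, 20, 42, 44, 48, 49}
NOISY_ERRORS = {
    2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 19, 20, 23, 25, 35,
    37, 38, 39, 40, 41, 42, 43, 44, 45, 47, 48, 49,
}


def errors_by_category(error_ids):
    """Count how many phrase IDs in each category appear in error_ids."""
    return {
        name: sum(1 for phrase_id in ids if phrase_id in error_ids)
        for name, ids in CATEGORIES.items()
    }


clear_counts = errors_by_category(CLEAR_ERRORS)
noisy_counts = errors_by_category(NOISY_ERRORS)

for name, ids in CATEGORIES.items():
    print(f"{name}: clear {clear_counts[name]}/{len(ids)}, noisy {noisy_counts[name]}/{len(ids)}")
```

On this data, the noisy run fails all five tongue twisters and eight of the ten everyday commands, while the "Random Words for Clarity" group comes through without a single error in either run.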

A Tiny Bit of Humor

In short: whisper-tiny can handle a calm podcast session, but if you invite it to a crowded party, it sort of nods and smiles while guessing what you said.

Conclusion

The evaluation of the Whisper-tiny model demonstrates that it performs reasonably well under clear speaking conditions, achieving 80% accuracy across the test set. However, its performance drops sharply in the presence of mumbling and background noise, where accuracy falls to 46%. This contrast highlights both the efficiency and limitations of the model: while Whisper-tiny is lightweight and suitable for resource-constrained environments, it is not robust enough to handle real-world scenarios where speech may be unclear or disturbed by noise. For applications requiring higher reliability in noisy conditions, using a larger Whisper model or applying audio preprocessing techniques would be advisable.

To obtain more reliable and generalizable results, the evaluation should be expanded with a larger set of phrases and a wider range of speakers featuring different accents, languages, and environmental conditions. This broader testing will provide a clearer picture of the model’s strengths and weaknesses across realistic usage scenarios.
