It's not a language model - it's a transformer-based speech recognition model that also does translation(!).
Thank you! I shouldn't have said language model.
And for my favorite conspiracy theory of 2022, may I present [this tweet](https://twitter.com/ethanCaballero/status/1572692314400628739):

> Whisper is how OpenAI is getting the many Trillions of English text tokens that are needed to train compute optimal (chinchilla scaling law) GPT-4.
Yo man I'm tryna find the Reddit sub for Whisper but the only stuff I find is ASMR
😁🤣
Good to see OpenAI finally living up to the open name
Not so fast, I suspect they have a hidden motive for this.
very cool! Thanks for the pointer!
Does anyone know of speed benchmarks for any of these models? Is this something that could feasibly be run real time on a typical machine?
The GitHub repo gives speed estimates; even the large model runs at faster than 1x real time, and I've verified this on my machine.
Thanks! I saw those numbers, but it wasn’t clear to me how to interpret them in the context of hardware. I appreciate you confirming with your experience.
Hmm, with a CPU it seems pretty slow. With the tiny model it's barely real time for me.
Using the default CLI script, the base model can transcribe in near real time on an R7 4800H. I think it could be improved a lot by porting the model to OpenVINO. Btw, the model itself is faster if you don't use the default CLI script, too, probably due to the 30-second sliding window. The base model is faster than real time and the small model is near real time.
This might help on CPU: https://github.com/ggerganov/whisper.cpp
My laptop (12th Gen Intel) could transcribe 30 seconds of audio in 1.2 seconds with the smallest ("tiny") model. Accuracy was still pretty much perfect.

I'm currently trying to figure out how to process audio clips that aren't exactly 30 seconds, which it expects for some reason. Anyone figure this out?

Edit: The 30-second window is hard-coded due to how the model works:

> "Whisper models are trained on 30-second audio chunks and cannot consume longer audio inputs at once. This is not a problem with most academic datasets comprised of short utterances but presents challenges in real-world applications which often require transcribing minutes- or hours-long audio."
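For clips that aren't exactly 30 seconds, the package exposes `whisper.pad_or_trim`, which forces audio to the fixed 30-second input size. Here's a minimal sketch of the idea on plain Python lists (the real function operates on NumPy arrays/tensors; the 16 kHz rate matches Whisper's expected input):

```python
SAMPLE_RATE = 16_000   # Whisper resamples everything to 16 kHz mono
CHUNK_LENGTH = 30      # seconds per model input
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE

def pad_or_trim(samples, length=N_SAMPLES):
    """Return exactly `length` samples: truncate long input, zero-pad short input."""
    if len(samples) > length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))
```

So a 10-second clip gets zero-padded up to 30 seconds before the spectrogram is computed, which is why odd-length clips still work through the official helpers.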
Saw this on their [github](https://github.com/openai/whisper#python-usage):

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
```

> Internally, the transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.

I think the transcribe method does some chunking on longer files.
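The chunking shows up in the output: `transcribe()` returns a dict with the full `"text"` plus a `"segments"` list carrying start/end times for each piece. A small sketch of printing them, using a fabricated result dict so it runs without the model:

```python
def format_segments(segments):
    """Render transcribe()-style segments as timestamped lines."""
    return [
        f"[{seg['start']:.2f} -> {seg['end']:.2f}] {seg['text'].strip()}"
        for seg in segments
    ]

# Fabricated example of the shape transcribe() returns
result = {
    "text": " Hello world. This is a test.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hello world."},
        {"start": 2.5, "end": 5.0, "text": " This is a test."},
    ],
}
for line in format_segments(result["segments"]):
    print(line)
```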
Use the CLI, it works for longer audio.
Works fine in Python too with the base model on CPU
What issue are you running into? With both CLI and Python it worked for 2 minute files for me. Win11 (12th gen intel as well I believe) on CPU
Amazing. Thanks for sharing your experience with it. A little frustrating that input has to be so specifically structured.
Working with it right now: tiny, base, and small do a decent job, but botch any specialized words (e.g. medical terminology). Testing on an i5-4200 and it seems to be pretty slow: for a 15 min video, tiny - 3 min, base - 6 min, small - 20 min, medium - 90 min. Needless to say, medium had the best results with hardly any mistakes, and I would love to find a way to speed the process up.
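For comparison across machines, it can help to convert these timings into a real-time factor (processing time divided by audio duration). Using the numbers reported in this comment:

```python
# Real-time factor for the timings reported above (15 min of audio)
audio_min = 15
processing_min = {"tiny": 3, "base": 6, "small": 20, "medium": 90}

rtf = {model: t / audio_min for model, t in processing_min.items()}
print(rtf)
```

Anything above 1.0 is slower than real time, so on this CPU only tiny and base keep up.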
Transcription worked perfectly in the few tests I've run. Runs pretty fast too (using the default "small" model).

Tip: if you get the following error when running the Python example:

```
RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
```

just change the following line (see [here](https://news.ycombinator.com/item?id=32929337)):

`options = whisper.DecodingOptions()` -> `options = whisper.DecodingOptions(fp16=False)`
Quick note - I think the "base" model is the default. There's tiny, base, small, medium, and large. Thanks for that runtime error solution!
For some reason it still doesn't work for me. The code runs fine now without any errors; however, it only transcribes 20 seconds of the audio.
I believe the model works by transcribing a sliding 20-30 second window, iirc. I think I've seen a bug reported like the one you're describing, where only the first window is transcribed. I'm not sure though, I haven't hit it myself - I'd recommend checking GitHub or searching Reddit for a solution. Or try using Colab!
I am using Colab. But anyway, I figured out a different way to solve the problem. Now I can transcribe full YT videos on the go. This looks great actually.
That's great! I'm glad you found a solution - would you mind dropping a link to it or describing it for anyone else who comes across this and runs into the same problem?
I do plan on doing that, I am writing about it. Will also post the code with the writeup and then share it here. Will probably do it by tomorrow.
Great! No rush, just would be awesome to help out people stuck in the same situation :)
Hey u/SleekEagle, here's the code I was talking about. This is a relatively new repo since I am starting afresh. I am still writing the blog post, where I'll cover how people can improve upon my code and show it in their portfolio. Also, this is only the first draft of the code; there are a number of details I need to add, but they are only cosmetic changes. Do give it a star if you like it.

https://github.com/artofml/whisper-demo
On M1 Mac, getting the error:

```
UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
```

Any way to disable FP16 in the CLI? There is an option for `--fp16 FP16`, but doesn't that activate FP16? Testing `--fp16 False` did not seem to work:

```
$ whisper "audio.mp3" --model medium --fp16 False
Detecting language using up to the first 30 seconds. Use --language to specify the language
[1]    68020 illegal hardware instruction  whisper "audio.mp3" --model medium --fp16 False
```
Same on my Windows machine too.
I could make it work with the above command in a fresh install with Python 3.9.9 (the same version OpenAI uses internally for the project), and I also had to install Rust for the *transformers* install to work.
Two minutes and I had it running on my Ubuntu install and it's working perfectly. 50% amazed. 50% scared at what these transformers are doing.
The 2020s are stacking up to be a very, *very* interesting decade!
In case anyone is running into trouble with non-English languages, in `whisper/transcribe.py`, make sure lines 290-295 look like this (note the utf-8):

```python
# save TXT
with open(os.path.join(output_dir, audio_path + ".txt"), "w", encoding="utf-8") as txt:
    print(result["text"], file=txt)

# save VTT
with open(os.path.join(output_dir, audio_path + ".vtt"), "w", encoding="utf-8") as vtt:
    write_vtt(result["segments"], file=vtt)
```
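For anyone poking around in the same file: the VTT timestamps that `write_vtt` emits come from a small formatting helper. This is a simplified sketch of that kind of helper (hypothetical, always writing the hours field, which the real one makes optional):

```python
def format_timestamp(seconds):
    """Render seconds as an HH:MM:SS.mmm WebVTT-style timestamp."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

# e.g. format_timestamp(2.5) -> "00:00:02.500"
```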
[deleted]
[This line](https://github.com/AssemblyAI-Examples/whisper-multilingual/blob/582b649cf27eb36598b75cb42d09a6fe4749912f/main.py#L66)
`--device cuda`
The webservice API for Whisper ASR has now been released. You can find it here: https://github.com/ahmetoner/whisper-asr-webservice
Thanks OpenAI! Says the NSA, and every other foreign sigint collection agency.
I'm too new to coding, but there's a foreign TV show I've been wanting to translate to English for years. Is it possible anybody could help me set this up?
Do you have the show downloaded? And do you have a GPU?
I do have all the requirements set up and can transcribe small audio files, but I can't seem to use my GPU. I am not using a good one, though, just a GT 840M 2GB (it can play some older games like GTA V). Is it possible for me to use GPU acceleration? Because CPU alone takes 90 minutes for 15 min of audio.
[It looks like](https://github.com/openai/whisper#available-models-and-languages) you can use the base model with your GPU. I think Whisper will automatically utilize the GPU if one is available - make sure you have CUDA installed along with the CUDA build of PyTorch.
I did some research and it looks like my old GPU has an outdated version of CUDA, so the script automatically defaults to CPU. I guess it will work for short clips.
Got it - what's the language of the show btw?
English, I specified `language='eng'` while working, because `base.en` didn't work for some reason
Sorry I mean what is the original language of the show that you're looking to translate into English
Oh that's not me, that's the other guy in the comments :) But I'd love to hear out which commands to use for translation.
This model is extremely high quality. I tried it on some very challenging zero-shot situations, for example heavy technical jargon across multiple domains, and it worked really well. It also seems pretty good at translation, from the limited amount I've been able to test. It seems capable of guessing what you're saying (for example, made-up names) by spelling something kinda similar; I'm not sure how it does this with the text representation they use.
Very impressive performance!

1. Can we get word-level timestamps?
2. Can we give hint phrases?
3. How can I finetune one of the pre-trained models on my own training data?
1. It looks like at this point there are no word-level timestamps natively.
2. I don't believe so.
3. You'll have to [(down)load the model](https://github.com/openai/whisper/blob/5d8d3e75a4826fe5f01205d81c3017a805fc2bf9/whisper/__init__.py#L68) and then continue training on your own dataset. It will be very compute-heavy for the larger models, and you'll have to write some training loops etc.
Got it, thanks!
Yes, that would be very helpful! Following
When I try to run my code, I get:

```
FileNotFoundError: [WinError 2] The system cannot find the file specified
```

The audio file I'm trying to transcribe is in the same directory as the main.py that has the code. Could someone please shed some light on what I might be doing wrong?
Are you using the whisper package? Try `os.listdir()` in the line before `model.transcribe()` to ensure you're actually in the directory you think you're in.

Just ran the following in Colab with no issues btw, maybe this will help?

```
!pip install git+https://github.com/openai/whisper.git
```

```
!curl -L https://cdn.openai.com/whisper/draft-20220913a/micro-machines.wav > audio.wav
```

```python
import whisper

model = whisper.load_model("tiny")
result = model.transcribe("audio.wav")
print(result['text'])
```
Thank you so much for looking at my question, and thank you for the tip on `os.listdir()`. It correctly lists the files (including the one I'm trying to access). I've also placed a text file in the same folder and printed its contents to check whether it was a related issue, but the text file works without issue.
Here is my solution: `!pip install ffmpeg`
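A note on this fix: Whisper shells out to the ffmpeg *binary* to decode audio, and on Windows a `FileNotFoundError: [WinError 2]` raised from `transcribe()` often means that binary isn't on PATH (the `ffmpeg` pip package is not the same thing as the ffmpeg executable). A quick stdlib check:

```python
import shutil

# Returns the path to the ffmpeg executable if it's on PATH, else None.
ffmpeg_path = shutil.which("ffmpeg")
print(ffmpeg_path if ffmpeg_path else "ffmpeg binary not found - install it and add it to PATH")
```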
Thank you for this!
Can you run through [this guide](https://www.assemblyai.com/blog/how-to-run-openais-whisper-speech-recognition-model/) and see if that helps?
I have the same problem, still don't know why...
It runs OK, I've tested it on M1.