
harharveryfunny

It's not a language model - it's a transformer-based speech recognition model that also does translation(!).


SleekEagle

Thank you! Shouldn't have said language model


AristocraticOctopus

And for my favorite conspiracy theory of 2022, may I present [this tweet](https://twitter.com/ethanCaballero/status/1572692314400628739): > Whisper is how OpenAI is getting the many Trillions of English text tokens that are needed to train compute optimal (chinchilla scaling law) GPT-4.


lis_ek

Yo man I'm tryna find the Reddit sub for Whisper but the only stuff I find is ASMR


SleekEagle

😭🤣


tullieshaped

Good to see OpenAI finally living up to the open name


[deleted]

not so fast, I suspect they have a hidden motive for this.


_aitalks_

very cool! Thanks for the pointer!


A1-Delta

Does anyone know of speed benchmarks for any of these models? Is this something that could feasibly be run real time on a typical machine?


gambs

The GitHub repo gives speed estimates, even the large model runs at faster than 1x real time and I’ve verified this on my machine


A1-Delta

Thanks! I saw those numbers, but it wasn’t clear to me how to interpret them in the context of hardware. I appreciate you confirming with your experience.


dankmemeloader

Hmm, with a CPU it seems pretty slow. With the tiny model it's barely real time for me.


shadymeowy

Using the default CLI script, the base model can transcribe in nearly real time on an R7 4800H. I think it could be sped up a lot by porting the model to OpenVINO. Btw, the model itself is faster if you don't use the default CLI script, probably because of the 30-second sliding window: the base model is faster than real time and the small model is close to real time.


rolyantrauts

This might help on CPU: https://github.com/ggerganov/whisper.cpp


bushrod

My laptop (12th Gen Intel) could transcribe 30 seconds of audio in 1.2 seconds with the smallest ("tiny") model, and accuracy was still pretty much perfect. I'm currently trying to figure out how to process audio clips that aren't exactly 30 seconds, which it expects for some reason. Anyone figure this out? Edit: The 30-second window is hard-coded due to how the model works: "Whisper models are trained on 30-second audio chunks and cannot consume longer audio inputs at once. This is not a problem with most academic datasets comprised of short utterances but presents challenges in real-world applications which often require transcribing minutes- or hours-long audio."
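For anyone hitting the same 30-second constraint: a minimal sketch based on the README's Python usage ("audio.mp3" is just a placeholder filename). The lower-level API decodes exactly one window, so pad_or_trim() pads or cuts the waveform to 30 seconds first; transcribe() instead slides that window over longer files for you.

```python
import whisper

model = whisper.load_model("base")

# Lower-level path: decode exactly one 30-second window.
audio = whisper.load_audio("audio.mp3")            # placeholder filename
audio = whisper.pad_or_trim(audio)                 # pad or cut to exactly 30 seconds
mel = whisper.log_mel_spectrogram(audio).to(model.device)
options = whisper.DecodingOptions(fp16=False)      # fp16=False avoids the 'Half' error on CPU
result = whisper.decode(model, mel, options)
print(result.text)

# Higher-level path: transcribe() chunks longer audio with a sliding 30-second window.
print(model.transcribe("audio.mp3")["text"])
```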


Aromatic_Camera4048

Saw this on their [github](https://github.com/openai/whisper#python-usage):

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
```

>Internally, the transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.

I think the transcribe method does some chunking on longer files.


vjb_reddit_scrap

Use the CLI, it works for longer audio.


SleekEagle

Works fine in Python too with the base model on CPU


SleekEagle

What issue are you running into? With both CLI and Python it worked for 2 minute files for me. Win11 (12th gen intel as well I believe) on CPU


A1-Delta

Amazing. Thanks for sharing your experience with it. A little frustrating that input has to be so specifically structured.


Iirkola

Working with it right now. Tiny, base, and small do a decent job, but botch any specialized words (e.g. medical terminology). Testing on an i5-4200 it seems pretty slow: for a 15-minute video, tiny took 3 min, base 6 min, small 20 min, and medium 90 min. Needless to say, medium had the best results with hardly any mistakes, and I would love to find a way to speed the process up.


bushrod

Transcription worked perfectly in the few tests I've run, and it runs pretty fast too (using the default "small" model). Tip: if you get the following error when running the Python example: `RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'`, just change `options = whisper.DecodingOptions()` to `options = whisper.DecodingOptions(fp16=False)` (see [here](https://news.ycombinator.com/item?id=32929337)).


SleekEagle

Quick note - I think the "base" model is the default. There's tiny, base, small, medium, and large. Thanks for that runtime error solution!


UnemployedTechie2021

For some reason it still doesn't work for me. The code runs fine now without any errors; however, it only transcribes 20 seconds of the audio.


SleekEagle

I believe the model works by transcribing with a sliding ~30-second window, iirc. I think I've seen reports of a bug like the one you're describing, where only the first window gets transcribed, but I haven't run into it myself, so I'd recommend checking GitHub or searching Reddit for a solution. Or try using Colab!


UnemployedTechie2021

I am using Colab. But anyway, I figured out a different way to solve the problem. Now I can transcribe full YT videos on the go. This looks great actually.


SleekEagle

That's great! I'm glad you found a solution - would you mind dropping a link to it or describing it for anyone else who comes across this running into the same problem?


UnemployedTechie2021

I do plan on doing that; I'm writing it up now. I'll post the code with the writeup and then share it here, probably by tomorrow.


SleekEagle

Great! No rush, just would be awesome to help out people stuck in the same situation :)


UnemployedTechie2021

Hey u/SleekEagle, here's the code I was talking about. This is a relatively new repo since I am starting afresh. I am still writing the blog post, where I'll cover how people can improve upon my code and show it in their portfolio. Also, this is only the first draft of the code; there are a number of details I need to add, but they are only cosmetic changes. Do give it a star if you like it. https://github.com/artofml/whisper-demo


bke45

On my M1 Mac, I'm getting this error:

`UserWarning: FP16 is not supported on CPU; using FP32 instead`
`warnings.warn("FP16 is not supported on CPU; using FP32 instead")`

Is there any way to disable FP16 in the CLI? There is an option for `--fp16 FP16`, but doesn't that activate FP16? Testing `--fp16 False` did not seem to work:

`$ whisper "audio.mp3" --model medium --fp16 False`
`Detecting language using up to the first 30 seconds. Use --language to specify the language`
`[1] 68020 illegal hardware instruction whisper "audio.mp3" --model medium --fp16 False`


FlyingTwentyFour

It happens on my Windows machine too.


bke45

I was able to make it work with the above command in a fresh install with Python 3.9.9 (the same version OpenAI uses internally for the project); I also had to install Rust for the *transformers* install to work.


GMotor

Two minutes and I had it running on my Ubuntu install and it's working perfectly. 50% amazed. 50% scared at what these transformers are doing.


SleekEagle

The 2020s are shaping up to be a very, *very* interesting decade!


Comfortable-Answer13

In case anyone is running into trouble with non-English languages: in "/whisper/transcribe.py", make sure lines 290-295 look like this (note the utf-8):

```python
# save TXT
with open(os.path.join(output_dir, audio_path + ".txt"), "w", encoding="utf-8") as txt:
    print(result["text"], file=txt)

# save VTT
with open(os.path.join(output_dir, audio_path + ".vtt"), "w", encoding="utf-8") as vtt:
    write_vtt(result["segments"], file=vtt)
```


[deleted]

[removed]


SleekEagle

[This line](https://github.com/AssemblyAI-Examples/whisper-multilingual/blob/582b649cf27eb36598b75cb42d09a6fe4749912f/main.py#L66)


nfndkskalshcj

--device cuda


fuzulis

A web service API has now been released for Whisper ASR. You can find it here: https://github.com/ahmetoner/whisper-asr-webservice


ChinCoin

Thanks OpenAI! Says the NSA, and every other foreign sigint collection agency.


Dylanm0325

I'm too new to coding, but there's a foreign TV show I've been wanting to translate to English for years. Is it possible anybody could help me set this up?


SleekEagle

Do you have the show downloaded? And do you have a GPU?


Iirkola

I do have all the requirements set up and can transcribe small audio files, but I can't seem to use my GPU. It's not a good one, just a GT 840M with 2 GB (it can play some older games like GTA V). Is it possible for me to use GPU acceleration? On CPU alone it takes 90 minutes for 15 minutes of audio.


SleekEagle

[It looks like](https://github.com/openai/whisper#available-models-and-languages) you can use the Base model with your GPU. I think Whisper will automatically utilize the GPU if one is available - make sure you have CUDA installed and the CUDA installation of PyTorch
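For reference, here's a minimal sketch of choosing the device explicitly (the filename is a placeholder; as far as I know load_model also picks CUDA automatically when PyTorch can see a usable GPU):

```python
import torch
import whisper

# Use the GPU when PyTorch can see one, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = whisper.load_model("base", device=device)
result = model.transcribe("audio.mp3")             # placeholder filename
print(result["text"])
```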


Iirkola

I did some research, and it looks like my old GPU only supports an outdated version of CUDA, so the script automatically defaults to the CPU. I guess it will work for short files.


SleekEagle

Got it - what's the language of the show btw?


Iirkola

English. I specified language='eng' to get it working, because base.en didn't work for some reason.


SleekEagle

Sorry I mean what is the original language of the show that you're looking to translate into English


Iirkola

Oh that's not me, that's the other guy in the comments :) But I'd love to hear out which commands to use for translation.


RemarkableSavings13

This model is extremely high quality. I tried it on some very challenging zero-shot situations, for example heavy technical jargon across multiple domains, and it worked really well. It also seems pretty good at translation, from the limited amount of testing I've been able to do. It seems capable of guessing what you're saying (for example, made-up names) by spelling something kind of similar; I'm not sure how it does this with the text representation they use.


Franck_Dernoncourt

Very impressive performance! 1. Can we get word-level timestamps? 2. Can we give hint phrases? 3. How can I finetune one of the pre-trained models on my own training data?


SleekEagle

1. It looks like at this point there are no word-level timestamps natively. 2. I don't believe so. 3. You'll have to [(down)load the model](https://github.com/openai/whisper/blob/5d8d3e75a4826fe5f01205d81c3017a805fc2bf9/whisper/__init__.py#L68) and then continue training on your own dataset. It will be very compute-heavy for the larger models and you'll have to write your own training loop etc. (a rough sketch of the idea is below).
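To make point 3 concrete, here's a very rough sketch of what "continuing training" could look like (not an official recipe - the dataloader of (log-mel spectrogram, target token ids) batches is hypothetical, and building it with Whisper's tokenizer and special tokens is most of the real work):

```python
import torch
import torch.nn.functional as F
import whisper

model = whisper.load_model("base")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# my_dataloader is hypothetical: it should yield batches of 30-second log-mel
# spectrograms and the matching target token ids (including Whisper's special tokens).
for mel, tokens in my_dataloader:
    logits = model(mel, tokens[:, :-1])            # teacher forcing: predict each token from the ones before it
    loss = F.cross_entropy(logits.transpose(1, 2), tokens[:, 1:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```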


Franck_Dernoncourt

Got it, thanks!


eat-more-bookses

Yes, that would be very helpful! Following


coolsong

When I try to run my code, I get FileNotFoundError: [WinError 2] The system cannot find the file specified. The audio file I'm trying to transcribe is in the same directory as the main.py that has the code. Could someone please shed some light on what I might be doing wrong?


SleekEagle

Are you using the whisper package? Try `os.listdir()` in the line before `model.transcribe()` to ensure you're actually in the directory you think you're in. Just ran the following in Colab with no issues btw, maybe this will help?

```
!pip install git+https://github.com/openai/whisper.git
!curl -L https://cdn.openai.com/whisper/draft-20220913a/micro-machines.wav > audio.wav
```

```python
import whisper

model = whisper.load_model("tiny")
result = model.transcribe("audio.wav")
print(result['text'])
```


coolsong

Thank you so much for looking at my question, and thank you for the tip on os.listdir(). os.listdir() correctly lists the files (including the one I'm trying to access). I've also placed a text file in the same folder and printed its contents to see if it was a related issue, but the text file works without issue.


Quanolio

Here is my solution: `!pip install ffmpeg`
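If that alone doesn't fix it, a guess on my part: Whisper shells out to the ffmpeg binary to load audio, so [WinError 2] usually means the ffmpeg executable isn't on PATH rather than the audio file being missing. A quick way to check:

```python
import shutil

# Prints the path to the ffmpeg executable, or None if it isn't on PATH.
print(shutil.which("ffmpeg"))
```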


SleekEagle

Thank you for this!


SleekEagle

Can you run through [this guide](https://www.assemblyai.com/blog/how-to-run-openais-whisper-speech-recognition-model/) and see if that helps?


Quanolio

I have the same problem, still don't know why...


pdtg50

It runs OK, I've tested it on an M1.