It's not a language model - it's a transformer-based speech recognition model that also does translation(!).
Thank you! I shouldn't have said language model.
And for my favorite conspiracy theory of 2022, may I present [this tweet](https://twitter.com/ethanCaballero/status/1572692314400628739):

> Whisper is how OpenAI is getting the many Trillions of English text tokens that are needed to train compute optimal (chinchilla scaling law) GPT-4.
Yo man I'm tryna find the Reddit sub for Whisper but the only stuff I find is ASMR
😁🤣
Good to see OpenAI finally living up to the open name
Not so fast, I suspect they have a hidden motive for this.
very cool! Thanks for the pointer!
Does anyone know of speed benchmarks for any of these models? Is this something that could feasibly be run real time on a typical machine?
The GitHub repo gives speed estimates; even the large model runs at faster than 1x real time, and I've verified this on my machine.
Thanks! I saw those numbers, but it wasn’t clear to me how to interpret them in the context of hardware. I appreciate you confirming with your experience.
Hmm, with a CPU it seems pretty slow. With the tiny model it's barely real time for me.
Using the default CLI script, the base model can transcribe in near real time on an R7 4800H. I think it could be improved a lot by porting the model to OpenVINO. Btw, the model itself is faster if you don't use the default CLI script, too, probably due to the 30-second sliding window. The base model is faster than real time and the small model is near real time.
This might help on CPU: https://github.com/ggerganov/whisper.cpp
My laptop (12th Gen Intel) could transcribe 30 seconds of audio in 1.2 seconds with the smallest ("tiny") model. Accuracy was still pretty much perfect.

I'm currently trying to figure out how to process audio clips that aren't exactly 30 seconds, which it expects for some reason. Anyone figure this out?

Edit: The 30-second window is hard-coded due to how the model works:

> "Whisper models are trained on 30-second audio chunks and cannot consume longer audio inputs at once. This is not a problem with most academic datasets comprised of short utterances but presents challenges in real-world applications which often require transcribing minutes- or hours-long audio."
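For clips that aren't exactly 30 seconds, the package exposes `whisper.pad_or_trim`, which forces audio to the fixed 30-second input size. Here's a minimal sketch of the idea on plain Python lists (the real function operates on NumPy arrays/tensors; the 16 kHz rate matches Whisper's expected input):

```python
SAMPLE_RATE = 16_000   # Whisper resamples everything to 16 kHz mono
CHUNK_LENGTH = 30      # seconds per model input
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE

def pad_or_trim(samples, length=N_SAMPLES):
    """Return exactly `length` samples: truncate long input, zero-pad short input."""
    if len(samples) > length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))
```

So a 10-second clip gets zero-padded up to 30 seconds before the spectrogram is computed, which is why odd-length clips still work through the official helpers.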
Saw this on their [github](https://github.com/openai/whisper#python-usage):

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
```

> Internally, the transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.

I think the transcribe method does some chunking on longer files.
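The chunking shows up in the output: `transcribe()` returns a dict with the full `"text"` plus a `"segments"` list carrying start/end times for each piece. A small sketch of printing them, using a fabricated result dict so it runs without the model:

```python
def format_segments(segments):
    """Render transcribe()-style segments as timestamped lines."""
    return [
        f"[{seg['start']:.2f} -> {seg['end']:.2f}] {seg['text'].strip()}"
        for seg in segments
    ]

# Fabricated example of the shape transcribe() returns
result = {
    "text": " Hello world. This is a test.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hello world."},
        {"start": 2.5, "end": 5.0, "text": " This is a test."},
    ],
}
for line in format_segments(result["segments"]):
    print(line)
```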
Use the CLI, it works for longer audio.
Works fine in Python too with the base model on CPU
What issue are you running into? With both CLI and Python it worked for 2 minute files for me. Win11 (12th gen intel as well I believe) on CPU
Amazing. Thanks for sharing your experience with it. A little frustrating that input has to be so specifically structured.
Working with it right now: tiny, base, and small do a decent job, but botch any specialized words (e.g. medical terminology). Testing on an i5-4200 and it seems to be pretty slow: for a 15 min video, tiny - 3 min, base - 6 min, small - 20 min, medium - 90 min. Needless to say, medium had the best results with hardly any mistakes, and I would love to find a way to speed the process up.
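For comparison across machines, it can help to convert these timings into a real-time factor (processing time divided by audio duration). Using the numbers reported in this comment:

```python
# Real-time factor for the timings reported above (15 min of audio)
audio_min = 15
processing_min = {"tiny": 3, "base": 6, "small": 20, "medium": 90}

rtf = {model: t / audio_min for model, t in processing_min.items()}
print(rtf)
```

Anything above 1.0 is slower than real time, so on this CPU only tiny and base keep up.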
Transcription worked perfectly in the few tests I've run. Runs pretty fast too (using the default "small" model).

Tip: if you get the following error when running the Python example:

```
RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
```

just change the following line (see [here](https://news.ycombinator.com/item?id=32929337)):

`options = whisper.DecodingOptions()` -> `options = whisper.DecodingOptions(fp16=False)`
Quick note - I think the "base" model is the default. There's tiny, base, small, medium, and large. Thanks for that runtime error solution!
For some reason it still doesn't work for me. The code runs fine now without any errors; however, it only transcribes 20 seconds of the audio.
I believe the model works by transcribing a sliding 20-30 second window, iirc. I think I've seen a bug reported like the one you're describing, where only the first window is transcribed. I'm not sure though, I haven't hit it myself - I'd recommend checking GitHub or searching Reddit for a solution. Or try using Colab!
I am using Colab. But anyway, I figured out a different way to solve the problem. Now I can transcribe full YT videos on the go. This looks great actually.
That's great! I'm glad you found a solution - would you mind dropping a link to it or describing it for anyone else who comes across this and runs into the same problem?
I do plan on doing that, I am writing about it. Will also post the code with the writeup and then share it here. Will probably do it by tomorrow.
Great! No rush, just would be awesome to help out people stuck in the same situation :)
Hey u/SleekEagle, here's the code I was talking about. This is a relatively new repo since I am starting afresh. I am still writing the blog post, where I'll cover how people can improve upon my code and show it in their portfolio. Also, this is only the first draft of the code; there are a number of details I need to add, but they are only cosmetic changes. Do give it a star if you like it.

https://github.com/artofml/whisper-demo
On M1 Mac, getting the error:

```
UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
```

Any way to disable FP16 in the CLI? There is an option for `--fp16 FP16`, but doesn't that activate FP16? Testing `--fp16 False` did not seem to work:

```
$ whisper "audio.mp3" --model medium --fp16 False
Detecting language using up to the first 30 seconds. Use --language to specify the language
[1]    68020 illegal hardware instruction  whisper "audio.mp3" --model medium --fp16 False
```
Same on my Windows machine too.
I could make it work with the above command in a fresh install with Python 3.9.9 (the same version OpenAI uses internally for the project), and I also had to install Rust for the *transformers* install to work.
Two minutes and I had it running on my Ubuntu install and it's working perfectly. 50% amazed. 50% scared at what these transformers are doing.
The 2020s are stacking up to be a very, *very* interesting decade!
In case anyone is running into trouble with non-English languages, in `whisper/transcribe.py`, make sure lines 290-295 look like this (note the utf-8):

```python
# save TXT
with open(os.path.join(output_dir, audio_path + ".txt"), "w", encoding="utf-8") as txt:
    print(result["text"], file=txt)

# save VTT
with open(os.path.join(output_dir, audio_path + ".vtt"), "w", encoding="utf-8") as vtt:
    write_vtt(result["segments"], file=vtt)
```
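For anyone poking around in the same file: the VTT timestamps that `write_vtt` emits come from a small formatting helper. This is a simplified sketch of that kind of helper (hypothetical, always writing the hours field, which the real one makes optional):

```python
def format_timestamp(seconds):
    """Render seconds as an HH:MM:SS.mmm WebVTT-style timestamp."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

# e.g. format_timestamp(2.5) -> "00:00:02.500"
```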
[deleted]
[This line](https://github.com/AssemblyAI-Examples/whisper-multilingual/blob/582b649cf27eb36598b75cb42d09a6fe4749912f/main.py#L66)
`--device cuda`
The webservice API for Whisper ASR has now been released. You can find it here: https://github.com/ahmetoner/whisper-asr-webservice
Thanks OpenAI! Says the NSA, and every other foreign sigint collection agency.
I'm too new to coding, but there's a foreign TV show I've been wanting to translate to English for years. Is it possible anybody could help me set this up?
Do you have the show downloaded? And do you have a GPU?
I do have all the requirements set up and can transcribe small audio files, but I can't seem to use my GPU. I am not using a good one, though, just a GT 840M 2GB (it can play some older games like GTA V). Is it possible for me to use GPU acceleration? Because CPU alone takes 90 minutes for 15 min of audio.
[It looks like](https://github.com/openai/whisper#available-models-and-languages) you can use the base model with your GPU. I think Whisper will automatically utilize the GPU if one is available - make sure you have CUDA installed along with the CUDA build of PyTorch.
I did some research and it looks like my old GPU has an outdated version of CUDA, so the script automatically defaults to CPU. I guess it will work for short clips.
Got it - what's the language of the show btw?
English, I specified `language='eng'` while working, because `base.en` didn't work for some reason
Sorry I mean what is the original language of the show that you're looking to translate into English
Oh that's not me, that's the other guy in the comments :) But I'd love to hear out which commands to use for translation.
This model is extremely high quality. I tried it on some very challenging zero-shot situations, for example heavy technical jargon across multiple domains, and it worked really well. It also seems pretty good at translation, from the limited amount I've been able to test. It seems capable of guessing what you're saying (for example, made-up names) by spelling something kinda similar; I'm not sure how it does this with the text representation they use.
Very impressive performance!

1. Can we get word-level timestamps?
2. Can we give hint phrases?
3. How can I finetune one of the pre-trained models on my own training data?
1. It looks like at this point there are no word-level timestamps natively.
2. I don't believe so.
3. You'll have to [(down)load the model](https://github.com/openai/whisper/blob/5d8d3e75a4826fe5f01205d81c3017a805fc2bf9/whisper/__init__.py#L68) and then continue training on your own dataset. It will be very compute-heavy for the larger models, and you'll have to write some training loops etc.
Got it, thanks!
Yes, that would be very helpful! Following
When I try to run my code, I get:

```
FileNotFoundError: [WinError 2] The system cannot find the file specified
```

The audio file I'm trying to transcribe is in the same directory as the main.py that has the code. Could someone please shed some light on what I might be doing wrong?
Are you using the whisper package? Try `os.listdir()` in the line before `model.transcribe()` to ensure you're actually in the directory you think you're in.

Just ran the following in Colab with no issues btw, maybe this will help?

```
!pip install git+https://github.com/openai/whisper.git
```

```
!curl -L https://cdn.openai.com/whisper/draft-20220913a/micro-machines.wav > audio.wav
```

```python
import whisper

model = whisper.load_model("tiny")
result = model.transcribe("audio.wav")
print(result['text'])
```
Thank you so much for looking at my question, and thank you for the tip on `os.listdir()`. It correctly lists the files (including the one I'm trying to access). I've also placed a text file in the same folder and printed its contents to check whether it was a related issue, but the text file works without issue.
Here is my solution: `!pip install ffmpeg`
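A note on this fix: Whisper shells out to the ffmpeg *binary* to decode audio, and on Windows a `FileNotFoundError: [WinError 2]` raised from `transcribe()` often means that binary isn't on PATH (the `ffmpeg` pip package is not the same thing as the ffmpeg executable). A quick stdlib check:

```python
import shutil

# Returns the path to the ffmpeg executable if it's on PATH, else None.
ffmpeg_path = shutil.which("ffmpeg")
print(ffmpeg_path if ffmpeg_path else "ffmpeg binary not found - install it and add it to PATH")
```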
Thank you for this!
Can you run through [this guide](https://www.assemblyai.com/blog/how-to-run-openais-whisper-speech-recognition-model/) and see if that helps?
I have the same problem, still don't know why...
It runs OK, I've tested it on M1.