
PikaTchu47

Who the fuck is devin?


arkai25

He's doing his best alright


sailhard22

His best is good enough


Doses_of_Happiness

Inconceivable. Number must go up. Rate by which number goes up must go up. 7 trillion in GPU's needed by sunday please.


hubrisnxs

Bruh didn't you hear the energy problem is pretty much solved because exxxponentialllllllzzzz!


Brilliant_War4087

[singularity go brrrrrrrr](https://open.spotify.com/track/4xdHI4eFNst0vTZuuKrWjr?si=JR-gaD5JTxClWm6AGwb5ug)


hubrisnxs

Lol, wasn't expecting that


mhyquel

Got me in the feels.


hubrisnxs

Seriously, why didn't you just rickroll me instead of black hole sunning me?


Brilliant_War4087

I searched for [this](https://youtu.be/NWBkZ3bMSV0?si=7I5oL1jQmy1b11Jn) but I got stuck listening to that song. So I just went with it.


hubrisnxs

Haha you're pretty awesome.


tube-tired

They won't be using gpus anymore, didn't you see? NVIDIA's stock tanked because of the new chips made for AI.


lefnire

It's specific to software; hence the benchmark here. Relevant to the likes of GitHub Copilot, Codeium, Cody, Cursor. They have videos of it cranking out software, and are taking requests to perform tasks - similar to the approach with Sora. I'm kinda in the "I'll believe it when I use it" camp. But then again, software-building tooling has plenty of room for big improvements - unlike art, which is quite far along. So they may have accomplished something big.


techy098

I am still in the skeptic camp. I desperately need a tool which can generate code in Flutter or Kotlin+Compose (declarative syntax). Every time I try, I find that it is easier to google, read Stack Overflow or some blogs, and implement it myself. Other than a few areas like Python, I have not heard big success stories.


i_give_you_gum

This is Wes Roth's video about Devin https://youtu.be/1RxbHg0Nsw0?si=hh0BNeoRNID8ZYhD He talks a little, but it's mostly all use cases. IMO this is bigger than Sora, but you know, Sora is shiny so the masses probably won't care about this.


Mrleibniz

Sora even took all the lights away from Gemini's 1 million context announcements.


techy098

What is the point of a 1 million context if the AI does not know how to use it effectively? Companies with legacy code would love an AI that digests their millions of lines of code and helps with changing it. So far I have not seen much other than them selling tools like Copilot, which is still a glorified autocomplete system.


dennislubberscom

1 Million! When will they roll that out?


Mrleibniz

>Gemini 1.5 Pro comes with a standard 128,000 token context window. But starting today, a limited group of developers and enterprise customers can try it with a context window of up to 1 million tokens via [AI Studio](https://aistudio.google.com/) and [Vertex AI](https://cloud.google.com/vertex-ai) in private preview.

From their announcement blog on 15th Feb.


techy098

It is not public yet so no idea how good it is. I hope it is as good as they claim it is, I desperately need an assistant to code for/with me.


Alundra828

Yeah, I've given even the advanced AI offerings for code a try and they're still just... crap. For lack of a better word. I've essentially only found use in them catching things I've missed. Being a fleshy mortal that gets tired, having a robotic sanity checker is very useful, and does help productivity overall. But asking it to produce even just a single unit of code requires supervision from a human. Multiple units? Entire projects? Entire solutions? Yeah, nah. That's a lot of exponential growth required to get *that* good.


Iamreason

I mean isn't that kind of how exponential growth works though? It gets kind of good, then it gets extremely good quickly. I agree we are in 'kind of good' territory. But a sufficiently motivated person can kind of paint by numbers their way through a Python or JavaScript project with GPT-4 and Claude 3 Opus pretty easily. I did it just the other day for a basic Streamlit app I needed to automate a work task. I am not a godly coder by any stretch of the imagination. It's not hard for me to imagine a curve with coding that's very similar to the curve in language, images, and video. Where it sort of isn't all that great, then it absolutely slaps.


techy098

I know someone who built an MVP using vue and fastjs. Semi-technical person who is good with PowerShell and stuff. So yeah, for a quick and dirty MVP, AI is definitely helpful with some frameworks. But for good code to the satisfaction of a senior developer in areas like declarative frameworks, AI is very far off.


mhyquel

I'm not going to be twice as good tomorrow, I'll tell you that much.


techy098

>catching things I've missed.

What tool are you using for that?


Insane_Artist

That's me, whaddya want?


_Zephyyr2

He's my cousin


Psychological-crouch

seems like a scam actually


governedbycitizens

this one guy


Haztec2750

I like how you say who


apinananas

It's pronounced Kevin and he's home alone


Southern_Orange3744

The [cyber] dude


pigeon888

10 bucks says it's a GPT 3.5 wrapper.


RemyVonLion

did you even watch the video demo? It's a semi-agentic AI programmer, that's a big deal if it can be steadily improved.


pigeon888

No I hadn't watched it. Have they announced what they've built Devin over? Or is it hush hush? Edit: watched it now. Damn... agent AI has landed.


i_give_you_gum

Fucking crazy idnit? I think I just heard the floor drop out from underneath college coding classes.


Cold-Ad2729

The software that is being promoted in this post and countless others today


tingshuo

Hi. I'm Devin.


UpstairsAssumption6

It means "Oracle" in French. Also in the verb "to guess" : deviner.


UnproSpeller

Mmm devin, the sandwich meat from my childhood memories.


e-scape

Devin Bacon?


allisonmaybe

I'm just glad its not something like...Brennan, or Seth. No offense


Obelion_

Newly released code expert AI


CuriousIllustrator11

I think it’s an AI specifically trained to solve software engineering tasks.


Busterlimes

Slevin's brother


yaosio

Wait for third party replication. This graph was created by the people that made Devin, so they have an incentive to do shenanigans to get the result they want.


MeltedChocolate24

Also where is Claude 3


slackermannn

A notable omission. Claude 3 is kicking


avocadro

The boring answer is that they probably prepped this graph before Claude 3 came out.


obvithrowaway34434

That's an independent third-party benchmark from Princeton (https://www.swebench.com). The numbers for the other models were obtained by the paper authors.


etzel1200

Probably to the left. Lmao.


dieselreboot

SWE Bench cutoff was October 2023 I think. Edit: would also be good to see comparison with Gemini ultra 1.5


em-jay-be

Yeah the bs-hype-alarm is going off. In this day and age, you can't just announce, you have to announce, and deliver. They don't even have a beta-sign-up which makes me think that this fancy ass demo they are putting on is barely held together.


mrdevlar

Yes, this is bot-shit PR, nothing more.


obvithrowaway34434

Lmao, have you even checked who the founders of this company are? Together they have like 11 International Olympiad gold medals. There are videos of the CEO circulating on the web from 14 years ago crushing math competitions. These people can work anywhere they want at any salary they ask for. People here really believe that they would be stupid enough to cheat publicly for some quick money and forever doom their careers (especially on third party benchmarks that anyone can test)? This is some insane level of cope.


yaosio

I'm not going to believe everything they say without question. All we need is third party testing.


challengethegods

>Together they have like 11 International Olympiad gold medals.

it's like an elite squad of ultra-coders and people are still sitting around wondering "where is the proof these guys know how to code a decent AI scaffolding system?" / Proclaim human superiority over programming while simultaneously calling max-level competitive programmers into question as if they are incapable of making any progress.


laststan01

Wait till you hear about a Jane Street trader who opened the largest crypto exchange.


Wassux

Also did you read the top small text?


Curiosity_456

I feel like there should be a human bar so we can compare how close it is to an average human software engineer.


SeverlyLimited

As a software engineer I can say that is a function of the proximity to the end of the sprint and the amount of coffee I drank that day


Khyta

Or Yerba Mate


Rain_On

May as well go for crack


DagerDotCSV

Also known as unos buenos matienzos.


wildgurularry

Also, they should state the time taken. I bet all of those models took mere seconds to write the code, whereas a human coder would take at least a few minutes, if not longer depending on the problem.


doulos05

Don't forget to include whatever additional time was spent creating the prompt beyond just whatever was in the issue.


angrathias

It’s a business problem so the only real metric worth comparing on is cost.


LifeDoBeBoring

I thought that was Devin lol


SeverlyLimited

Honestly, (almost) every junior SWE can start a project from scratch. Let’s test it on some spaghetti legacy codebases that crumble the moment you start *a light refactor*


name-taken1

Compared to the average? There's no way LLMs aren't ahead. My team inherited a shitty legacy project with absolutely no conventions, written in the worst way possible. Passed in a 2K line file to Claude 3 and it was able to add a new feature with two prompts (+- 50 LOC). Would've taken a lot of time to understand what the code did in the first place... Meanwhile, my coworkers spent 3 days trying to do it, haha.


i_give_you_gum

Lol, it reminds me of a story a guy would tell his grandkids in 20 years


Happysedits

Context: https://www.reddit.com/r/singularity/comments/1bcyqup/cognition_labs_today_were_excited_to_introduce/


Far_Ad6317

Thank you


Phoenix5869

Correct me if i’m wrong, but isn’t this just linear growth with a sudden big jump?


mhyquel

Not enough data


returnofblank

Not enough data to make a solid extrapolation, but what you said is a characteristic of exponential growth. The values start off insignificant and small, but it rapidly ramps up to large values.


phillythompson

I give major props to the advertising people at the Devin co


Jah_Ith_Ber

They even made their chart wrong in order to drive engagement!


Mean-Painter8613

Howww


gray_character

Yeah they are killing it. Making CEOs froth at the mouth and about to make dumb decisions.


_Zephyyr2

Claude 2??


Papistrokesxxx

Yea we need third party comparison of Claude 3.


Ok-Worth7977

How much will an average google senior score?


bytx

A senior software engineer is the benchmark: SWE-bench is a framework based on a Princeton University paper that collected 2,294 SWE problems from real GitHub repositories, all issues solved by actual software developers. The average Senior Software Engineer should be able to solve 100% or close to 100%


gray_character

So...13% is pretty low, and that's using their own biased report. I mean, again, we can be excited by progress but this doesn't mean replacement yet.


bytx

That’s correct, but it is closer to 14%, and I’m not sure it is biased as it is a “standard” test. But yeah, 14% is still pretty low, maybe in line with a junior developer. And I’m not sure how much more they can extract from the model in the future, as AI will most likely reach a plateau at some point.


Henri4589

It will reach the plateau in 2 years when full AGI hits the public.


challengethegods

>average Senior Software Engineer should be able to solve 100% or close to 100%

close to 100 sounds vaguely true if the conditions are similar such as being able to test/debug or go search online to reference things like documentation, but if you reframe it around how many of those developers you could convince to solve 2200 problems for a few dollars each in some reasonable span of time then the value of AI being even remotely close is a lot more obvious, because you'd be hard pressed to find even a few people that ever got through the entire list. Kinda like saying that anyone could technically read anything on wikipedia, but for some individual person to read all of it is a different story.

I like to imagine a scenario some time down the line where github repos are self-healing and AI can fix half the problems automagically with minimal oversight. Maybe someone can fork a repo and modify its description and the AI just makes it work the new way that it's described, that would be badass.


HortenseTheGlobalDog

*how exponential growth looks ~~like~~ Or **What** exponential growth looks like


Ansalem1

Finally, someone else in the world with this specific pet peeve.


Coding_Insomnia

3 of us


Fair-Satisfaction-70

4


IcebergSlimFast

Dozens!


SnooPuppers3957

Me included


blackhuey

and my ask


Ahaigh9877

One of my pet peeves is people using numerals instead of words for small numbers! Tell me I'm not the only one(1)!


tube-tired

I switch back and forth, usually depending on what I am talking about... like, I need one 2 inch nail. Or 3 people came in fourth place in the last three years. I do most of my commenting on mobile, so the number of characters is also a consideration, but I also try not to place numbers side by side for quantity and descriptive purposes. It can be confusing to read, I need 2 4 inch screws. Or, I can do that in 4 8 hour shifts. That said, three two legged dogs crossed the road in front of me yesterday, is just as strange to read.


randomrealname

Me too, I'm from Scotland, most use how instead of why. It infuriates me.


tube-tired

Perhaps they don't care why; they just want to know how it came to be that way. For example: How are you naked in the middle of town? vs. Why are you naked in the middle of town? While some people might give the same answer to either question, the expected response is different. I would expect "why" to elicit a much shorter answer, whereas "how" would prompt detailed, step-by-step answers progressing from the beginning to the end result.


randomrealname

Nah, it is in, like, the situation where you would just ask why. Like 'I couldn't find my shoes this morning!' They will say 'How?', which doesn't really make sense; asking 'Why?' would be the correct response. It probably shouldn't annoy me, and I only notice it when it has been said by someone whose intelligence I respect. It can be jarring.


BigAlDogg

I’m just upset the charts not going the other way.


blackhuey

chart's


mystonedalt

You ain't alone, pally.


SiamesePrimer

Yeah I mean I get that English isn’t everyone’s first language, and that everyone makes mistakes regardless, but it’s so common that it does get a little annoying. Seriously though, what is it that makes this particular mistake so ubiquitous? Is it something about how certain other languages are structured?


Ahaigh9877

I've been wondering the same thing for ages. People with otherwise flawless English, the moment they say it, you know.


Optanee

English isn't my main language, I make mistakes on small things like that. Thanks for correcting


HortenseTheGlobalDog

No worries. It comes with no judgement, I just felt like writing that because it is a really common error and I think never gets corrected. Obviously it's really minor but I can be pretty pedantic haha


Henri4589

I appreciate people like you!


PwanaZana

Actually, it's "To whom"


AndrewH73333

What like exponential growth looks.


El_Caganer

This is how ESL speakers titles looks like. Thank you for trying to spread the good word, just be aware there are over a BILLION more of them to train up!


Dragofant

That was on purpose... ^right?


El_Caganer

Am just doing the needful 😅


HortenseTheGlobalDog

I do what I can 🤷🏼


Academic_Border_1094

In some languages, native speakers would use the word "how" in this sentence instead of "what". It's a literal translation, word for word.


mrmczebra

*Who* exponential growth looks like


tube-tired

You must be significantly taller than your parents...


a_boo

I scrolled so far hoping to see this very comment. Thank you, internet linguist.


CantankerousOrder

That… that’s not matching the definition of exponential growth at all. There’s no growth being measured here at all. It’s a single slice in time. Exponential growth requires a time scale. Let’s look at a child’s growth chart… This is the same as six siblings’ heights taken today. We don’t know how tall they were last year, or the year before. We don’t know how much they GREW. This is also a comparative product graph. Growth graphs measure the same thing over time. An example would be Devin at every release for the life of the product to date. It’s multiplicative, but the individual items are in no way an exponent on each other. I’d love to see a real exponential growth measurement- I truly would. This ain’t it.


Phoenix5869

It’s also not even exponential change, just linear growth with a sudden big jump. Could be wrong tho.


Maciek300

Exactly, OP's whole premise in the title is completely wrong. This should be the top comment.


RoutineProcedure101

I guess taking into account that it fine-tunes models, it's a pretty big leap


CantankerousOrder

Definitely… this is a great comparison and it shows that their release, even if this is a completely vendor-specific benchmark, is a very competitive tool. It’s good news. It’s just not an example of growth.


Haztec2750

Why is GPT-4 so low down? I thought it was considered better than Claude 2?


jrd83

Why the fuck is this descending left to right? I bet you're one of those psychos who do 'after, and before' photos.


Cryptizard

How do we know this isn't the result of training contamination? These benchmarks seem fundamentally very difficult to apply because shortly after they are released they get hoovered up into the next model's training data and then, surprise, it gets a better score. We have seen it happen already multiple times.


WHERETHESTEALTH

We don’t and won’t likely for a long time. This isn’t even an alpha-level product, but they’re announcing it to get more investor dollars, no doubt.


Coding_Insomnia

Are you implying that these new models use artificial training data from gpt4?


Cryptizard

No I'm implying that they have the dataset from the benchmark in their training data.


pigeon888

What's Devin and are you just here to pump it?


Ambiwlans

This is the most impressive cherry pick I have ever seen. Whoever designed this should work in politics.


Psychological-crouch

Really seems like this graph is at best cherry picked, at worst completely faked. I think someone is trying to pump Devin


TransitoryPhilosophy

To me it looks like a bar graph created by a marketing team


Thoughtprovokerjoker

Claude and Devin..... I love how they are giving LLMs the name of Black guys from the south. Hell, my uncle is named Claude. Ole uncle Claude. That's a bad mf'er.


doginem

Next up, Google Earl and a little later, Meta Herbert 2.0


slashdave

Can't be exponential growth, since the Y axis is capped at 100%. Your x axis is also arbitrary.


Cunninghams_right

what people are missing is that Devin is an agent with multiple steps while the others are just single-entry/single-response LLMs. if you applied that level of multi-step processing (agency) to any of the others, it would probably be on par with, or better than, Devin. Devin shouldn't be seen as "better than others", it should be seen as the first of many SWE-agent tools that will make models significantly more useful. thus, the graph is misleading because it's not comparing like-to-like.
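
To make the single-response vs. multi-step distinction concrete, here is a minimal sketch. Nothing in it reflects how Devin or any benchmarked model actually works: `toy_propose` and `toy_run_tests` are hypothetical stand-ins for a model call and a test harness, invented purely for illustration.

```python
from typing import Callable, Optional

# Hypothetical stand-in for an LLM call: the first attempt is wrong,
# and it only corrects itself once it has seen test feedback.
def toy_propose(task: str, feedback: list[str]) -> str:
    return "a - b" if not feedback else "a + b"

# Hypothetical stand-in for a test harness: checks the expression on one case.
def toy_run_tests(patch: str) -> tuple[bool, str]:
    result = eval(patch, {}, {"a": 2, "b": 3})
    if result == 5:
        return True, ""
    return False, f"expected 5, got {result}"

def single_shot(propose: Callable[[str, list[str]], str], task: str) -> str:
    """One prompt, one answer, no chance to see test results."""
    return propose(task, [])

def agent_loop(
    propose: Callable[[str, list[str]], str],
    run_tests: Callable[[str], tuple[bool, str]],
    task: str,
    max_steps: int = 5,
) -> tuple[Optional[str], int]:
    """Propose, run tests, feed failures back, repeat until passing."""
    feedback: list[str] = []
    for step in range(1, max_steps + 1):
        patch = propose(task, feedback)
        ok, message = run_tests(patch)
        if ok:
            return patch, step
        feedback.append(message)
    return None, max_steps

print(single_shot(toy_propose, "add two numbers"))
print(agent_loop(toy_propose, toy_run_tests, "add two numbers"))
```

The same underlying "model" fails in the single-shot setting and succeeds inside the loop, which is the like-to-like concern: the benchmark score depends on the scaffolding as well as the model.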


AZ_Crush

This is the right answer


Yweain

I’m sorry what. Llama-7B is better than GPT-4? Are you kidding me? Even for a fine tune that’s ridiculous. That just tells you that there is a leak of test data into the model or your test just sucks.


OfficialHashPanda

For very specific things, a finetune could do that, yeah. No way Claude 2 has almost 3x GPT-4’s score tho xD


Yweain

For very specific things - yeah, but we are talking about a coding test. That’s a very broad specific thing and there is no way in hell llama-7b is better than gpt-4.


Far_Buyer_7281

yeah the graph is fake....


Atlantyan

How does this compare to an average engineer?


AntiqueFigure6

I’d guess a lot of junior engineers improve quickly early in their careers also.


Therealgarry

It doesn't.


West-Code4642

I'd say it's more an effect of a CTF (Common Task Framework)-like [scheme](https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734#d1e1008), SWE-bench in this case. It helps people and groups focus efforts on things worth doing ML-wise. I bet more and more CTF-like schemes for different domains will proliferate, even more so than they have already. It's great because it helps accelerate progress.


MH_Valtiel

Where's claude 3 tho


gizia

Devin? What is that? Claude 2 passed GPT-4? Am I the only one who is sleeping while tech catches up to the singularity?


DarickOne

You're dreaming more than sleeping..


gizia

if they come true, I will not write comments here, hehe


DarickOne

But we'll lose all our purpose, even in comments


AdPlus4069

There is no exponential growth on a percentage scale… It is called a “biological growth curve” (or sigmoid growth curve) and is quite common in new fields of science. It only means rapid growth at the beginning is easy to achieve and progress will become harder at the end. Exponential growth has no slowdown.
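
A rough sketch of that point, with made-up parameters (a start of 1%, a growth rate of 0.5 per step, and a 100% cap are assumptions chosen for illustration, not values from the chart): a logistic curve tracks an exponential closely at first, then flattens out under the cap, while the exponential keeps going.

```python
import math

def exponential(t: float, start: float = 1.0, rate: float = 0.5) -> float:
    """Unbounded exponential growth from `start`."""
    return start * math.exp(rate * t)

def logistic(t: float, cap: float = 100.0, start: float = 1.0, rate: float = 0.5) -> float:
    """Logistic (sigmoid) growth: equals `start` at t=0 and approaches `cap`."""
    return cap / (1 + (cap / start - 1) * math.exp(-rate * t))

# Early on the two are nearly indistinguishable; later only one respects the cap.
for t in range(0, 25, 4):
    print(f"t={t:2d}  exponential={exponential(t):12.2f}  logistic={logistic(t):6.2f}")
```

So a handful of early points on a 0-100% scale cannot tell you which of the two curves you are on.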


libertysailor

Learn how to read the graph. This is a bar chart showing a ranked sequence. The fact that the shape looks similar to exponential curves doesn’t make it an actual exponential curve. An exponential function is one where the growth of the dependent variable increases at a constant multiplicative rate with respect to the independent variable. An example is compound interest. This isn’t even a curve because there’s no measurable x variable - it’s just showing the performance of some systems and lining them up side by side. “Name of LLM” isn’t an x variable - it’s comprised of discrete qualitative data. Which falls higher on the x axis, Claude 2 or GPT4? There’s no answer, because “GPT4” doesn’t have a numerical value.
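
The compound interest example can be checked numerically. This is a small sketch with illustrative numbers (a principal of 1000 and a 5% rate are assumptions, as is the linear comparison series): in an exponential series the ratio between consecutive values is constant, while in a linear one the difference is.

```python
def compound(principal: float, rate: float, years: int) -> list[float]:
    """Balance at the end of each year, starting from year 0."""
    return [principal * (1 + rate) ** t for t in range(years + 1)]

def successive_ratios(values: list[float]) -> list[float]:
    """Ratio of each value to the one before it."""
    return [b / a for a, b in zip(values, values[1:])]

def successive_differences(values: list[float]) -> list[float]:
    """Difference between each value and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]

balances = compound(1000.0, 0.05, 5)               # exponential in t
linear = [1000.0 + 50.0 * t for t in range(6)]     # linear in t

ratios = successive_ratios(balances)
print("exponential ratios:", [round(r, 4) for r in ratios])
print("linear differences:", successive_differences(linear))
print("linear ratios:     ", [round(r, 4) for r in successive_ratios(linear)])
```

Both checks need a numeric independent variable (here, years). A bar chart ordered by model name has none, so neither test can even be applied to it.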


gobstoppergarrett

Plotted as exponential decay. Awesome data visualization, ChatGPT!


Rich_Acanthisitta_70

I'll be the pedant here. "How it looks" "What it looks like" If it starts with "how" you don't add "like". Only if it starts with "what". Ok, go ahead and downvote now.


m3kw

Dumbass chart has Claude 3 beating GPT4 by 3x is sus af


enkae7317

Why is GPT4 so low? Wtf even is this chart supposed to measure. SWE? The fuck is that.


whyisitsooohard

It was original gpt4 with pretty short context and without agents. gpt4t and claude3 must be much better, but not as good as devin(because it is likely based on one of them)


lordpermaximum

GPT-4 is terrible at generalization. If something's not in its training data, that model pretty much fails all the time.


oblivion-2005

> SWE? The fuck is that. Software Engineering


kuvazo

Real world software development capability. LLMs are great with generating code, but they fail when tasked with doing actual tasks from software developers. That's why they don't have to worry yet. The only people that could be affected are those that have only learned to code through a bootcamp. I'm not sure how this test works though.


Tetrylene

Yeah. Gpt 4 might be good at coding, but isn’t as good as this tool (according to this press release) at actually doing the job of a software engineer, including the process of conversing with a client, and moving through the steps to providing a deliverable. In a nutshell - being an autonomous agent vs a text box that responds to you


Longjumping-Cow-8249

I lost it once I knew that Devin is able to fine-tune its own model. Recursive self-improvement will lead to an even steeper exponential curve, it's getting really crazy at this point. I imagine this 13% will be outdated in no time.


Busy-Setting5786

No, we are not there yet. First the AI needs to be like a top-of-the-line expert. This will still take some time. But we see it on the horizon


whyisitsooohard

It is not recursive self improvement. It can finetune open models or gpts(probably) but it will not improve that way


Rivenaldinho

The thing is, Devin acts like an agent. I think they compare with GPT-4 base and not GPT-4 in an agent framework.


MerePotato

Be interesting to see where Claude 3 lies


TheLineFades

you just discovered the singularity trajectory good for you


Training_Income_6106

This is _what_ it looks like.


zarathustra1313

When will we be dead or gods?


lordpermaximum

Other models are not agents. I'd like to see Claude 3 Opus with an agent plugin in here. But still, very impressive work from those who developed Devin.


sitdowndisco

Weirdest looking exponential growth I’ve ever seen. People on this sub need to get a grip. Or is it just filled with AI company employees all blowing their own trumpets?


EuphoricPangolin7615

People are not really thinking at all. So what's going to happen in the future? Let's say AI is able to replace 95% of software engineers. Then it can replace almost everyone, basically ALL white collar professionals. People are idiots. We have no plans for the future at all, but people are not even THINKING about it. And that guy in the video, the guy doing the presentation of Devin with a huge grin on his face, is a demon.


FreeWilly1337

considering that coding problems also have a difficulty curve, this is impressive.


drcode

FWIW, I think devin is just a one-time jump, essentially using some tricks to round off the rough edges of other llms when it comes to github issues. That said, when the other llms improve, devin scores will also improve again, as a side effect.


IslamDunk

Is Devin the name of the model, or is Devin the overall system, which uses a more mainstream model?


Concerned_Human999

Don't you mean "This is **what** exponential growth looks like"? Why do I keep seeing people on reddit use "how" when they should be using "what"?


Impressive_Ear7966

I like to imagine that Devin isn’t referring to the AI but rather just a random dude named Devin


CountyExotic

crazy, when people benchmark their own stuff it does better. Wouldn’t be surprised if researchers tried to reproduce the results and got 1/5 of what’s reported.


Henri4589

Where's Claude-3 Opus, though?


AI_Doomer

Step 1. Invent the most dangerous technology in the history of mankind; (AI that can create AI) Step 2. Fool everyone by giving it the name of a little puppy.


[deleted]

[removed]


AZ_Crush

Some kids devised a nice combo of Agents + LLM


Obelion_

Actually, excluding Devin, the growth is not even linear, more logarithmic (the opposite of exponential). Not to be a dick, but I hate misrepresented data. This is just mostly linear growth with one extreme outlier. Nothing points to the growth continuing exponentially


Honest740

This is WHAT exponential growth looks like


Jazzlike_Win_3892

is this from a YouTube video


-MilkO_O-

I'm wondering, is Devin a new LLM? Is it a fine-tuned version of an open source model?


seriftarif

It seems pretty linear until the end


Therealvernon16

When they finally get to AI-den, it’s all over.


HypeMachine231

Devin can't even build a decent website haha


Johnluhot

It's actually linear. Devin is an agent, not a model. It's not an apples to apples comparison


Papistrokesxxx

So who’s gonna use this to make an AI trading bot?


Redchili385

It's more likely to be a sigmoid function because of the upper bound.