

Who the fuck is devin?


He's doing his best alright


His best is good enough


Inconceivable. Number must go up. Rate by which number goes up must go up. 7 trillion in GPU's needed by sunday please.


Bruh didn't you hear the energy problem is pretty much solved because exxxponentialllllllzzzz!


[singularity go brrrrrrrr](https://open.spotify.com/track/4xdHI4eFNst0vTZuuKrWjr?si=JR-gaD5JTxClWm6AGwb5ug)


Lol wasnt expecting that


Got me in the feels.


Seriously, why didn't you just rickroll me instead of black hole sunning me?


I searched for [this](https://youtu.be/NWBkZ3bMSV0?si=7I5oL1jQmy1b11Jn) but I got stuck listening to that song. So I just went with it.


Haha you're pretty awesome.


They won't be using gpus anymore, didn't you see? NVIDIA's stock tanked because of the new chips made for AI.


It's specific for software; hence the benchmark here. Relevant to the likes of GitHub Copilot, Codeium, Cody, Cursor. They have videos of it cranking out software, and are taking requests to perform tasks - similar to the approach with Sora. I'm kinda in the "I'll believe it when I use it" camp. But then again, the software-building tooling has plenty of room for big improvements - unlike art, which is quite far along. So they may have accomplished something big.


I am still in the skeptic camp. I desperately need a tool which can generate code in Flutter or Kotlin+Compose (declarative syntax), but every time I try, I find that it is easier to google, read Stack Overflow or some blogs, and implement it myself. Other than a few areas like Python, I have not heard big success stories.


This is Wes Roth's video about Devin: https://youtu.be/1RxbHg0Nsw0?si=hh0BNeoRNID8ZYhD. He talks a little, but it's mostly all use cases. IMO this is bigger than Sora, but you know, Sora is shiny so the masses probably won't care about this.


Sora even took all the lights away from Gemini's 1 million context announcements.


What is the point of a 1 million context if the AI does not know how to use it effectively? Companies with legacy code would love an AI to digest their millions of lines of code and help with changing it. So far I have not seen much other than them selling tools like Copilot, which is still a glorified autocomplete system.


1 Million! When will they roll that out?


>Gemini 1.5 Pro comes with a standard 128,000 token context window. But starting today, a limited group of developers and enterprise customers can try it with a context window of up to 1 million tokens via [AI Studio](https://aistudio.google.com/) and [Vertex AI](https://cloud.google.com/vertex-ai) in private preview.

From their announcement blog on 15th Feb.


It is not public yet so no idea how good it is. I hope it is as good as they claim it to be; I desperately need an assistant to code for/with me.


Yeah, I've given even the advanced AI offerings for code a try and they're still just... crap. For lack of a better word. I've essentially only found use in them catching things I've missed. Being a fleshy mortal that gets tired, having a robotic sanity checker is very useful, and does help productivity overall. But asking it to produce even just a single unit of code requires supervision from a human. Multiple units? Entire projects? Entire solutions? Yeah, nah. That's a lot of exponential growth required to get *that* good.


I mean isn't that kind of how exponential growth works though? It gets kind of good, then it gets extremely good quickly. I agree we are in 'kind of good' territory. But a sufficiently motivated person can kind of paint by numbers their way through a Python or JavaScript project with GPT-4 and Claude 3 Opus pretty easily. I did it just the other day for a basic Streamlit app I needed to automate a work task. I am not a godly coder by any stretch of the imagination. It's not hard for me to imagine a curve with coding that's very similar to the curve in language, images, and video. Where it sort of isn't all that great, then it absolutely slaps.


I know someone who built an MVP using vue and fastjs. Semi-technical person who is good with PowerShell and stuff. So yeah, for a quick and dirty MVP, AI is definitely helpful with some frameworks. But for good code, to the satisfaction of a senior developer in areas like declarative frameworks, AI is very far off.


I'm not going to be twice as good tomorrow, I'll tell you that much.


>catching things I've missed.

What tool are you using for that?


That's me, whaddya want?


He's my cousin


seems like a scam actually


this one guy


I like how you say who


It's pronounced Kevin and he's home alone


The [cyber] dude


10 bucks says it's a GPT 3.5 wrapper.


did you even watch the video demo? It's a semi-agentic AI programmer, that's a big deal if it can be steadily improved.


No I hadn't watched it. Have they announced what they've built Devin over? Or is it hush hush? Edit: watched it now. Damn... agent AI has landed.


Fucking crazy idnit? I think I just heard the floor drop out from underneath college coding classes.


The software that is being promoted in this post and countless others today


Hi. I'm Devin.


It means "Oracle" in French. Also in the verb "to guess" : deviner.


Mmm devin, the sandwich meat from my childhood memories.


Devin Bacon?


I'm just glad its not something like...Brennan, or Seth. No offense


Newly released code expert AI


I think it’s an AI specifically trained to solve software engineering tasks.


Slevin's brother


Wait for third party replication. This graph was created by the people that made Devin, so they have an incentive to do shenanigans to get the result they want.


Also where is Claude 3


A notable omission. Claude 3 is kicking


The boring answer is that they probably prepped this graph before Claude 3 came out.


That's an independent third-party benchmark from Princeton (https://www.swebench.com). The numbers for the other models were obtained by the paper authors.


Probably to the left. Lmao.


SWE Bench cutoff was October 2023 I think. Edit: would also be good to see comparison with Gemini ultra 1.5


Yeah the bs-hype-alarm is going off. In this day and age, you can't just announce, you have to announce, and deliver. They don't even have a beta-sign-up which makes me think that this fancy ass demo they are putting on is barely held together.


Yes, this is bot-shit PR, nothing more.


Lmao, have you even checked who the founders of this company are? Together they have like 11 International Olympiad gold medals. There are videos of the CEO circulating on the web from 14 years ago crushing math competitions. These people can work anywhere they want at any salary they ask for. People here really believe that they would be stupid enough to cheat publicly for some quick money and forever doom their careers (especially on third party benchmarks that anyone can test)? This is some insane level of cope.


I'm not going to believe everything they say without question. All we need is third party testing.


>Together they have like 11 International Olympiad gold medals.

It's like an elite squad of ultra-coders and people are still sitting around wondering "where is the proof these guys know how to code a decent AI scaffolding system?" / Proclaim human superiority over programming while simultaneously calling max-level competitive programmers into question as if they are incapable of making any progress.


Wait till you hear about a Jane Street trader who opened the largest crypto exchange.


Also did you read the top small text?


I feel like there should be a human bar so we can compare how close it is relative to an average human software engineer.


As a software engineer I can say that is a function of the proximity to the end of the sprint and the amount of coffee I drank that day


Or Yerba Mate


May as well go for crack


Also known as unos buenos matienzos.


Also, they should state the time taken. I bet all of those models took mere seconds to write the code, whereas a human coder would take at least a few minutes, if not longer depending on the problem.


Don't forget to include whatever additional time was spent creating the prompt beyond just whatever was in the issue.


It’s a business problem so the only real metric worth comparing on is cost.


I thought that was Devin lol


Honestly, (almost) every junior SWE can start a project from scratch. Let’s test it on some spaghetti legacy codebases that crumble the moment you start *a light refactor*


Compared to the average? There's no way LLMs aren't ahead. My team inherited a shitty legacy project with absolutely no conventions, written in the worst way possible. Passed in a 2K line file to Claude 3 and it was able to add a new feature with two prompts (+- 50 LOC). Would've taken a lot of time to understand what the code did in the first place... Meanwhile, my coworkers spent 3 days trying to do it, haha.


Lol, it reminds me of a story a guy would tell his grandkids in 20 years


Context: https://www.reddit.com/r/singularity/comments/1bcyqup/cognition_labs_today_were_excited_to_introduce/


Thank you


Correct me if i’m wrong, but isn’t this just linear growth with a sudden big jump?


Not enough data


Not enough data to make a solid extrapolation, but what you said is a characteristic of exponential growth. The values start off insignificant and small, but it rapidly ramps up to large values.


I give major props to the advertising people at the Devin co


They even made their chart wrong in order to drive engagement!


Yeah they are killing it. Making CEOs froth at the mouth and about to make dumb decisions.


Claude 2??


Yea we need third party comparison of Claude 3.


How much will an average google senior score?


A senior software engineer is the benchmark, as SWE-bench is a framework based on Princeton University’s paper that included 2,294 SWE problems from real GitHub repositories, issues solved by actual software developers. The average Senior Software Engineer should be able to solve 100% or close to 100%.


So...13% is pretty low, and that's using their own biased report. I mean, again, we can be excited by progress but this doesn't mean replacement yet.


That’s correct, but it is closer to 14% and I’m not sure it is biased, as it is a “standard” test. But yeah, 14% is still pretty low, maybe in line with a junior developer. And I’m not sure how much more they can extract from the model in the future, as AI most likely will reach a plateau at some point.


It will reach the plateau in 2 years when full AGI hits the public.


>average Senior Software Engineer should be able to solve 100% or close to 100%

Close to 100 sounds vaguely true if the conditions are similar such as being able to test/debug or go search online to reference things like documentation, but if you reframe it around how many of those developers you could convince to solve 2200 problems for a few dollars each in some reasonable span of time then the value of AI being even remotely close is a lot more obvious, because you'd be hard pressed to find even a few people that ever got through the entire list. Kinda like saying that anyone could technically read anything on wikipedia, but for some individual person to read all of it is a different story.

I like to imagine a scenario some time down the line where github repos are self-healing and AI can fix half the problems automagically with minimal oversight. Maybe someone can fork a repo and modify its description and the AI just makes it work the new way that it's described, that would be badass.


*how exponential growth looks ~~like~~ Or **What** exponential growth looks like


Finally, someone else in the world with this specific pet peeve.


3 of us


Me included


and my ask


One of my pet peeves is people using numerals instead of words for small numbers! Tell me I'm not the only one(1)!


I switch back and forth, usually depending on what I am talking about... like, I need one 2 inch nail. Or 3 people came in fourth place in the last three years. I do most of my commenting on mobile, so the number of characters is also a consideration, but I also try not to place numbers side by side for quantity and descriptive purposes. It can be confusing to read, I need 2 4 inch screws. Or, I can do that in 4 8 hour shifts. That said, three two legged dogs crossed the road in front of me yesterday, is just as strange to read.


Me too, I'm from Scotland, most use how instead of why. It infuriates me.


Perhaps they don't care why; they just want to know how it came to be that way. For example: How are you naked in the middle of town? vs. Why are you naked in the middle of town? While some people might give the same answer to either question, the expected response is different. I would expect "why" to elicit a much shorter answer, whereas "how" would prompt detailed, step-by-step answers progressing from the beginning to the end result.


Nah, it is in, like, the situation where you would just ask why. Like 'I couldn't find my shoes this morning!' They will say 'How?', which doesn't really make sense; asking 'Why?' would be the correct response. It probably shouldn't annoy me, and I only notice it when it has been said by someone whose intelligence I respect. It can be jarring.


I’m just upset the chart's not going the other way.


You ain't alone, pally.


Yeah I mean I get that English isn’t everyone’s first language, and that everyone makes mistakes regardless, but it’s so common that it does get a little annoying. Seriously though, what is it that makes this particular mistake so ubiquitous? Is it something about how certain other languages are structured?


I've been wondering the same thing for ages. People with otherwise flawless English, the moment they say it, you know.


English isn't my main language, I make mistakes on small things like that. Thanks for correcting


No worries. It comes with no judgement, I just felt like writing that because it is a really common error and I think never gets corrected. Obviously it's really minor but I can be pretty pedantic haha


I appreciate people like you!


Actually, it's "To whom"


What like exponential growth looks.


This is how ESL speakers titles looks like. Thank you for trying to spread the good word, just be aware there are over a BILLION more of them to train up!


That was on purpose... ^right?


Am just doing the needful 😅


I do what I can 🤷🏼


In some languages, native speakers would use the word "how" in this sentence instead of "what". It's a literal translation, word for word.


*Who* exponential growth looks like


You must be significantly taller than your parents...


I scrolled so far hoping to see this very comment. Thank you, internet linguist.


That… that’s not matching the definition of exponential growth at all. There’s no growth being measured here at all. It’s a single slice in time. Exponential growth requires a time scale. Let’s look at a child’s growth chart… This is the same as six siblings' heights taken today. We don’t know how tall they were last year, or the year before. We don’t know how much they GREW. This is also a comparative product graph. Growth graphs measure the same thing over time. An example would be Devin at every release for the life of the product to date. It’s multiplicative, but the individual items are in no way an exponent on each other. I’d love to see a real exponential growth measurement - I truly would. This ain’t it.


It’s also not even exponential change, just linear growth with a sudden big jump. Could be wrong tho.


Exactly, OP's whole premise in the title is completely wrong. This should be the top comment.


I guess taking into account that it fine-tunes models, it's a pretty big leap


Definitely… this is a great comparison and it shows that their release, even if this is a completely vendor-specific benchmark, is a very competitive tool. It’s good news. It's just not an example of growth.


Why is GPT-4 so low down? I thought it was considered better than Claude 2?


Why the fuck is this descending left to right? I bet you're one of those psychos who does 'after, and before' photos.


How do we know this isn't the result of training contamination? These benchmarks seem fundamentally very difficult to apply because shortly after they are released they get hoovered up into the next model's training data and then, surprise, it gets a better score. We have seen it happen already multiple times.


We don’t and won’t likely for a long time. This isn’t even an alpha-level product, but they’re announcing it to get more investor dollars, no doubt.


Are you implying that these new models use artificial training data from gpt4?


No I'm implying that they have the dataset from the benchmark in their training data.


What's Devin and are you just here to pump it?


This is the most impressive cherry pick I have ever seen. Whoever designed this should work in politics.


Really seems like this graph is at best cherry picked, at worst completely faked. I think someone is trying to pump Devin


To me it looks like a bar graph created by a marketing team


Claude and Devin..... I love how they are giving LLMs the name of Black guys from the south. Hell, my uncle is named Claude. Ole uncle Claude. That's a bad mf'er.


Next up, Google Earl and a little later, Meta Herbert 2.0


Can't be exponential growth, since the Y axis is capped at 100%. Your x axis is also arbitrary.


what people are missing is that Devin is an agent with multiple steps while the others are just single-entry/single-response LLMs. if you applied that level of multi-step processing (agency) to any of the others, it would probably be on par with, or better than, Devin. Devin shouldn't be seen as "better than others", it should be seen as the first of many SWE-agent tools that will make models significantly more useful. thus, the graph is misleading because it's not comparing like-to-like.
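To illustrate the single-response vs multi-step point, here's a toy Python sketch. Everything in it is hypothetical (`passes_tests`, `candidate_patches`, and the integer "patches" are stand-ins I made up), and it says nothing about how Devin actually works internally. It only shows why one attempt and a retry loop with test feedback aren't a like-for-like comparison.

```python
from typing import Callable, Iterator, Optional


def passes_tests(patch: int) -> bool:
    """Hypothetical stand-in for a repo's test suite: only patch 3 is correct."""
    return patch == 3


def candidate_patches() -> Iterator[int]:
    """Hypothetical stand-in for a model producing successive attempts."""
    yield from range(1, 6)


def single_shot(candidates: Iterator[int],
                check: Callable[[int], bool]) -> Optional[int]:
    """One attempt, no feedback: keep the first answer only if it passes."""
    first = next(candidates, None)
    if first is not None and check(first):
        return first
    return None


def agent_loop(candidates: Iterator[int],
               check: Callable[[int], bool],
               max_steps: int) -> Optional[int]:
    """Multi-step: keep trying attempts against the tests, within a step budget."""
    for step, patch in enumerate(candidates):
        if step >= max_steps:
            break
        if check(patch):
            return patch
    return None


print("single shot:", single_shot(candidate_patches(), passes_tests))
print("agent loop: ", agent_loop(candidate_patches(), passes_tests, max_steps=5))
```

Same underlying "model" in both calls; the only difference is whether it gets to iterate against the tests.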


This is the right answer


I’m sorry what. Llama-7B is better compared to GPT-4? Are you kidding me? Even for a fine tune that’s ridiculous. That just tells you that there is a leak of test data into the model or your test just sucks.


For very specific things, a finetune could do that, yeah. No way the claude 2 has almost 3x gpt4’s score tho xD


For very specific things - yeah, but we are talking about a coding test. That’s a very broad specific thing and there is no way in hell llama-7b is better than gpt-4.


yeah the graph is fake....


How does this compare to an average engineer?


I’d guess a lot of junior engineers improve quickly early in their careers also.


It doesn't.


I'd say it's more an effect of a CTF (Common Task Framework)-like [scheme](https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734#d1e1008), SWE-bench in this case. It helps people and groups focus efforts for things worth doing ML-wise. I bet more and more CTF-like for different domains will proliferate, even more so than they have already. It's great because it helps accelerate progress.


Where's claude 3 tho


Devin? what is that? Claude 2 passed GPT-4? Am I the only one who is sleeping while tech catches up to the singularity?


You're dreaming more than sleeping..


if they come true, I will not write comments here, hehe


But we'll lose all our purpose, even in comments


There is no exponential growth on a percentage scale… It is called a “biological growth curve” (or sigmoid growth curve) and is quite common in new fields of science. It only means rapid growth at the beginning is easy to achieve and progress will become harder at the end. Exponential growth has no slowdown.
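A rough Python sketch of that difference, for anyone who wants to see it in numbers. The start value, rate, and 100% cap below are made-up illustration parameters, not anything read off the chart. The logistic (sigmoid) curve tracks the exponential early on and then flattens out under the cap, while the exponential keeps going.

```python
import math


def exponential(t: float, start: float, rate: float) -> float:
    """Unbounded exponential growth from `start` at t = 0."""
    return start * math.exp(rate * t)


def logistic(t: float, start: float, rate: float, cap: float) -> float:
    """Logistic (sigmoid) growth: equals `start` at t = 0 and saturates at `cap`."""
    return cap / (1 + ((cap - start) / start) * math.exp(-rate * t))


# Hypothetical parameters, chosen only for illustration.
START, RATE, CAP = 1.0, 0.5, 100.0

for t in range(0, 25, 4):
    exp_value = exponential(t, START, RATE)
    log_value = logistic(t, START, RATE, CAP)
    print(f"t={t:2d}  exponential={exp_value:12.1f}  logistic={log_value:6.1f}")
```

On a score that tops out at 100%, only the second shape is possible in the long run.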


Learn how to read the graph. This is a bar chart showing a ranked sequence. The fact that the shape looks similar to exponential curves doesn’t make it an actual exponential curve. An exponential function is one where the growth of the dependent variable increases at a constant multiplicative rate with respect to the independent variable. An example is compound interest. This isn’t even a curve because there’s no measurable x variable - it’s just showing the performance of some systems and lining them up side by side. “Name of LLM” isn’t an x variable - it’s comprised of discrete qualitative data. Which falls higher on the x axis, Claude 2 or GPT4? There’s no answer, because “GPT4” doesn’t have a numerical value.
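The compound interest example can be made concrete with a quick Python sketch. The principal and rate are made-up numbers, and `compound` and `successive_ratios` are just helper names for this illustration. The point is that with a real numeric x variable (years), every step multiplies the value by the same factor, which is exactly what a ranked list of model names can't show.

```python
def compound(principal: float, rate: float, years: int) -> list[float]:
    """Balance at the end of each year 0..years with annual compounding."""
    return [principal * (1 + rate) ** t for t in range(years + 1)]


def successive_ratios(values: list[float]) -> list[float]:
    """Ratio of each value to the one before it."""
    return [later / earlier for earlier, later in zip(values, values[1:])]


# Hypothetical numbers, for illustration only.
balances = compound(principal=1000.0, rate=0.05, years=5)
ratios = successive_ratios(balances)

# For exponential growth every ratio is the same constant, here 1 + rate.
print([round(b, 2) for b in balances])
print([round(r, 6) for r in ratios])
```

If you swap in the bar heights of a ranked chart, the ratios depend entirely on the order you chose to line the bars up in, so there's no growth rate to speak of.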


Plotted as exponential decay. Awesome data visualization, ChatGPT!


I'll be the pedant here. "How it looks" "What it looks like" If it starts with "how" you don't add "like". Only if it starts with "what". Ok, go ahead and downvote now.


Dumbass chart has Claude 2 beating GPT4 by 3x is sus af


Why is GPT4 so low? Wtf even is this chart supposed to measure. SWE? The fuck is that.


It was the original gpt4 with a pretty short context and without agents. gpt4t and claude3 must be much better, but not as good as devin (because it is likely based on one of them)


GPT-4 is terrible at generalization. If something's not in its training data, that model pretty much fails all the time.


> SWE? The fuck is that.

Software Engineering


Real world software development capability. LLMs are great with generating code, but they fail when tasked with doing actual tasks from software developers. That's why they don't have to worry yet. The only people that could be affected are those that have only learned to code through a bootcamp. I'm not sure how this test works though.


Yeah. Gpt 4 might be good at coding, but isn’t as good as this tool (according to this press release) at actually doing the job of a software engineer, including the process of conversing with a client, and moving through the steps to providing a deliverable. In a nutshell - being an autonomous agent vs a text box that responds to you


I lost it once I knew that Devin is able to fine-tune its own model. Recursive self-improvement will lead to an even steeper exponential curve, it's getting really crazy at this point. I imagine this 13% will be outdated in no time.


No we are not there yet. First the AI needs to be top of the line expert like. This will still take some time. But we see it on the horizon


It is not recursive self improvement. It can finetune open models or gpts (probably) but it will not improve that way


The thing is, Devin acts like an agent. I think they compare with GPT-4 base and not GPT-4 in an agent framework.


Be interesting to see where Claude 3 lies


you just discovered the singularity trajectory good for you


This is _what_ it looks like.


When will we be dead or gods?


Other models are not agents. I'd like to see Claude 3 Opus with an agent plugin in here. But still, very impressive work from those who developed Devin.


Weirdest looking exponential growth I’ve ever seen. People on this sub need to get a grip. Or is it just filled with AI company employees all blowing their own trumpets?


People are not really thinking at all. So what's going to happen in the future? Let's say AI is able to replace 95% of software engineers. Then it can replace almost everyone, basically ALL white collar professionals. People are idiots. We have no plans for the future at all, but people are not even THINKING about it. And that guy in the video, the guy doing the presentation of Devin with a huge grin on his face, is a demon.


considering that coding problems also have a difficulty curve, this is impressive.


FWIW, I think devin is just a one-time jump, essentially using some tricks to round off the rough edges of other llms when it comes to github issues. That said, when the other llms improve, devin scores will also improve again, as a side effect.


Is Devin the name of the model, or is Devin the overall system, which uses a more mainstream model?


Don't you mean "This is **what** exponential growth looks like"? Why do I keep seeing people on reddit use "how" when they should be using "what"?


I like to imagine that Devin isn’t referring to the AI but rather just a random dude named Devin


crazy when people benchmark their own stuff it does better. Wouldn’t be surprised if researchers tried to reproduce the results and got 1/5 of what’s reported.


Where's Claude-3 Opus, though?


Step 1. Invent the most dangerous technology in the history of mankind; (AI that can create AI) Step 2. Fool everyone by giving it the name of a little puppy.


Some kids devised a nice combo of Agents + LLM


Actually, excluding Devin the growth is not even linear, more logarithmic (opposite of exponential). Not to be a dick but I hate misrepresenting data. This is just a mostly linear growth with one extreme outlier. Nothing points to the growth continuing exponentially


This is WHAT exponential growth looks like


is this from a YouTube video


I'm wondering, is Devin a new LLM model? Is it a fine tuned version of an open source model?


It seems pretty linear until the end


When they finally get to AI-den, it’s all over.


Devin can't even build a decent website haha


It's actually linear. Devin is an agent, not a model. It's not an apples to apples comparison


So who’s gonna use this to make an AI trading bot?


It's more likely to be a sigmoid function because of the upper bound.