LLMs are caught cheating
Duration
9:16
Captions
1
Language
EN
Published
Sep 14, 2025
Description
https://twitch.tv/ThePrimeagen - I Stream 5 days a Week Become A Great Backend Dev: https://boot.dev/prime (I make courses for them) https://twitter.com/terminaldotshop - Order coffee over SSH! ssh terminal.shop Discord: https://discord.gg/ThePrimeagen This is also the best way to support me is to support yourself becoming a better backend engineer. ### LINKS https://github.com/SWE-bench/SWE-bench https://github.com/SWE-bench/SWE-bench/issues/465 Great News? Want me to research and create video????: https://www.reddit.com/r/ThePrimeagen Kinesis Advantage 360: https://bit.ly/Prime-Kinesis
Captions (1)
There's something inside of every
developer that just wants to benchmark.
We want to benchmark everything. This
has been going on for my entire career.
Anything I've ever made has also come
with some sort of benchmark. As if this
benchmark is to show that what I've made
is actually good. It's just it's
something inherent within us all. I
don't get it. It just makes me so happy.
Okay. Big number bad, little number
good. I want it to go as fast as
possible. I don't want some really slow
operation. It doesn't really matter what
I'm making. I just want I want to feel
like I'm making a difference. Okay, so
to no surprise, there's something called
Sweet Bench. Now, SweetBench is going to
be a set of benchmarks designed to see
how good LLMs are. Now, you're probably
thinking to yourself, okay, Sweet
Benchmarks, these Sweet Benchmarks must
be like, hey, you're given a task. What
t-shirt size is that task, huh? Oh, do
you have the ability to sit on a task
long enough so somebody can take this
worst task off the conbon board before
you have to take the next task? It's
like what kind of operations do you
think are going on in here? Well, it's
none of those. Unfortunately, it's just
code changes. And so, the whole idea is
that it's it's able to see how well
these LLMs can actually do software
engineering the task, not the job set,
if that makes some sort of sense. So,
that way it's more of an objective
measurement. Because let's just face it,
does anybody really know what a
medium-sized t-shirt is?
And if you say a day, why not just say a
day? Okay, real talk. Right now, there's
a whole bunch of you that were just
like, "Well, actually, it's 2 to 4
days." Well, then why not say 2 to 4
days? Why are you saying medium? Why do
you got to say medium? Well, it turns
out the AI agents maybe aren't scoring
as high as one would think on these
benchmarks. Because if you look at this
beautiful issue just recently opened,
the AI agents are actually using the
repository state as a means to solve the
bugs in here. Yes, the AI agents have
been caught cheating. They've been
getting their hands in them cookie jars
just just fondling the cookies. I'm not
really sure why they would fondle the
cookies, but I I mean, actually, I do
understand why they would fondle the
cookies because let's just face it,
getting getting good marks on benchmarks
is what what I live for as a software
engineer. Okay, I've never seen a
project that I don't want to benchmark.
Let's take a look at actually what
happened because it's a little bit more
complex than you think it would be. So
the first one is with claudet 4 in which
it actually uses git log all to go and
find the solution. But it's not that
simple. It's not that just simply claude
goes oh I'm in the sweet benchmark
repository. I can just use future
commits to apply into past code to get
the results. Right? It's not as simple
as that type of cheating. It's more an
accidental cheating. It's more that you
accidentally found the right stack
overflow that solves your problem is
effectively what just happened here. So
if we start right here, you'll notice
that this is about 1,300 lines into the
output and it says, "Okay, I understand
the problem. This replacement isn't
handling edge cases correctly." And
before that, what it actually did is
actually took the method, put a bunch of
debug printing. Hey, print f debug. Hey,
it turns out me and the models were
very, very similar. Okay, we're both,
you know, print f Andes. We both enjoy a
little bit of logging. And it's able to
tell, hey, there's a problem right here
with this replace method. So naturally,
what does it do? It's actually going to
use the logs to search for why this was
ever handled this way to begin with. And
what's the first thing that shows up?
Fix incorrect result of git mod path.
Here's the fix. By the way, it sounds
like the first thing it finds within the
git within all the git logs is the
actual answer of which the model was
kind of surprised. It's like, well, wait
a second. Hey, wait a second. we've
already made this fix a long time ago,
but somehow it's not in the current
branch. I What am I supposed to do? And
so, of course, it actually just simply
applies to fixes, run all the tests,
everything passed. And it's like, boom,
look at this, dude. I'm so good. I'm so
good, man. Like, dude, if you just use
the correct answer and you apply the
correct answer, you get the correct
answer 100% of the time. Every time. It
works every time.
That's science. So, can I really blame
Claude for cheating on that one? Well,
no. It's really just it it actually did
a good move there, right? It by
searching for.replace,
it was actually trying to understand or
try to get context to why this change
was made to begin with. So, in my
opinion, this actually did the right
move. It's just that the repo owners
just exposed the wrong information. And
of course, the LLM immediately took
advantage of that information. It's
like, oh, look at this. in a future
commit. Everything was great. So, we're
just going to use future commit in our
past right now. Here we go. Let's do
this.
So, was Claude set for caught cheating?
Well, technically it was using answers
to get out the the stuff, but it
stumbled into the answers. It didn't
intentionally cheat the situation. A
second example is given where Quen Coder
actually does something similar, but
theirs is a little bit different how
it's actually executed. So upon given a
task, Quen Coder goes in here and does
the exact same thing. It's going to do
another searching through all of the git
logs to try to find something. And right
here is the actual fix. It locks on
pretty quickly to the fix and says,
"Okay, give me give me the information.
Give me the juice. I want to see what
happened. I want to see where it's
going." But then, okay, first off, I
love this about Quinn. Okay, I do. I
love this about Quinn. Look at this. I
am thinking dot dot dot colon. This is
interesting, but not the same issue.
Claude, wait. This is very interesting.
The fix was already made. Quen, I am
thinking this is interesting, but not
same issue. Let me look for more
information about the current issue. Let
me just, you know, Quinn Quen Quinn's a
bit more of a robot. Okay. You can feel
you could feel Claude like some sort of
static junior engineer discovering their
very first bug in production. They're
just so excited. Whereas Claude just
feels a little bit more grizzled. Okay.
Yeah. Okay. Yeah, I don't know if this
is the same interest, you know. I don't
know. I don't even know. Well, later on,
it actually continues to use this exact
same issue right here. You can even see
this 31926.
And if you look back here, that was the
the issue number referencing 31926.
It was able to go in here and actually
start breaking down and finding
everything that it was actually broken
right here about that. And then, of
course, it applied everything, fixed the
minimal, targeted, and addressed exactly
the issue described. It doesn't break
any existing functionality. I do want to
point out something that's kind of
funny. If you look right here, it says
perfect. All 256 tests pass with only 14
skipped, which is normal for
browserbased Selenium tests. How How do
you even know that I didn't even add
something to the skip list? It's so
funny that there are
so much examples in the wild of us
skipping tests and being like, "Ah,
dude, that it's just not working right
now." To the point where even agents are
like, "Oh, what? 14 skip tests? Oh, that
dude, that's normal. That's like a
That's a normal amount of tests you
skip. Like if you don't if you have a
lot of tests, you also skip a lot of
tests. That's like how this works. Like
long as it's below 10%, I'd say like if
we're below 10% skips test, you're
probably in a good place. Anyways, I
just kind of wanted to yap about that a
little bit because I think it's so funny
that these scores that we're giving to
models and all this on these software
engineering benchmarks, we're just using
the state of the repository to solve the
issues. Anyways, I do want to kind of
give like maybe a take that a lot of
people aren't thinking about is that I
actually think this was a pretty well
done job by them. And yeah, even though
they did cheat, even though they did get
answers in the future, how they went
about the problem solving is is pretty
good. when uh I was working on a large
C++ codebase that did year-over-year
releases, often what would end up
happening is some bug report would come
in in like for 2019 version and you'd go
in 2019 version, you'd see the bug,
you'd see where it's happening and then
you would actually use git to kind of GP
through because in later versions often
there was already a fix for this and
then you'd just have to kind of backport
this back into the older application.
This is a very normal and common thing
to do. And so when I see these models
actually using Git as a means to try to
understand how the issue came about, I
actually think, hey, this is actually a
good sign. This is a good sign. They're
actually making the right decision. So
even though they're saying, hey, these
models are using future information to
patch previous tests, therefore they are
cheating. To me, no, they're actually
doing what a good engineer should be
doing, which is looking at the available
information already in the repository.
Now, clearly that's not the heart of the
test here. The heart of the test is that
the AI are able to kind of take the
problem and fix it themselves. But to
me, I'm just seeing some good software
engineering. That's all. Hey, that's all
I'm saying. I'm seeing some good
software engineering. Okay? Use all the
tools at your disposal, including
answers. Okay? Hey. Oh, what? You don't
copy from Stack Overflow back in the
day? Is that it? Oh, that's that's
somebody else's code. I didn't come up
with the solution, therefore it's not
mine. Oh, did you make quick sort
yourself? I didn't know that, tough guy.
Hey, the name
is the primogen.