LLMs are caught cheating

9:16 Watch on YouTube

LLMs are caught cheating

T

ThePrimeTime

View Channel on YouTube

Duration

9:16

Captions

1

Language

EN

Published

Sep 14, 2025

Description

https://twitch.tv/ThePrimeagen - I Stream 5 days a Week Become A Great Backend Dev: https://boot.dev/prime (I make courses for them) https://twitter.com/terminaldotshop - Order coffee over SSH! ssh terminal.shop Discord: https://discord.gg/ThePrimeagen This is also the best way to support me is to support yourself becoming a better backend engineer. ### LINKS https://github.com/SWE-bench/SWE-bench https://github.com/SWE-bench/SWE-bench/issues/465 Great News? Want me to research and create video????: https://www.reddit.com/r/ThePrimeagen Kinesis Advantage 360: https://bit.ly/Prime-Kinesis

Captions (1)

00:00

There's something inside of every

00:02

developer that just wants to benchmark.

00:04

We want to benchmark everything. This

00:07

has been going on for my entire career.

00:09

Anything I've ever made has also come

00:10

with some sort of benchmark. As if this

00:12

benchmark is to show that what I've made

00:14

is actually good. It's just it's

00:16

something inherent within us all. I

00:18

don't get it. It just makes me so happy.

00:20

Okay. Big number bad, little number

00:23

good. I want it to go as fast as

00:24

possible. I don't want some really slow

00:26

operation. It doesn't really matter what

00:28

I'm making. I just want I want to feel

00:30

like I'm making a difference. Okay, so

00:32

to no surprise, there's something called

00:34

Sweet Bench. Now, SweetBench is going to

00:36

be a set of benchmarks designed to see

00:38

how good LLMs are. Now, you're probably

00:41

thinking to yourself, okay, Sweet

00:42

Benchmarks, these Sweet Benchmarks must

00:44

be like, hey, you're given a task. What

00:47

t-shirt size is that task, huh? Oh, do

00:50

you have the ability to sit on a task

00:53

long enough so somebody can take this

00:55

worst task off the conbon board before

00:57

you have to take the next task? It's

00:58

like what kind of operations do you

01:00

think are going on in here? Well, it's

01:02

none of those. Unfortunately, it's just

01:04

code changes. And so, the whole idea is

01:06

that it's it's able to see how well

01:08

these LLMs can actually do software

01:11

engineering the task, not the job set,

01:14

if that makes some sort of sense. So,

01:16

that way it's more of an objective

01:17

measurement. Because let's just face it,

01:19

does anybody really know what a

01:20

medium-sized t-shirt is?

01:23

And if you say a day, why not just say a

01:26

day? Okay, real talk. Right now, there's

01:29

a whole bunch of you that were just

01:30

like, "Well, actually, it's 2 to 4

01:32

days." Well, then why not say 2 to 4

01:34

days? Why are you saying medium? Why do

01:36

you got to say medium? Well, it turns

01:38

out the AI agents maybe aren't scoring

01:40

as high as one would think on these

01:43

benchmarks. Because if you look at this

01:45

beautiful issue just recently opened,

01:47

the AI agents are actually using the

01:50

repository state as a means to solve the

01:53

bugs in here. Yes, the AI agents have

01:56

been caught cheating. They've been

01:57

getting their hands in them cookie jars

01:59

just just fondling the cookies. I'm not

02:01

really sure why they would fondle the

02:03

cookies, but I I mean, actually, I do

02:05

understand why they would fondle the

02:06

cookies because let's just face it,

02:08

getting getting good marks on benchmarks

02:09

is what what I live for as a software

02:12

engineer. Okay, I've never seen a

02:14

project that I don't want to benchmark.

02:15

Let's take a look at actually what

02:17

happened because it's a little bit more

02:19

complex than you think it would be. So

02:21

the first one is with claudet 4 in which

02:23

it actually uses git log all to go and

02:26

find the solution. But it's not that

02:28

simple. It's not that just simply claude

02:30

goes oh I'm in the sweet benchmark

02:32

repository. I can just use future

02:34

commits to apply into past code to get

02:37

the results. Right? It's not as simple

02:39

as that type of cheating. It's more an

02:41

accidental cheating. It's more that you

02:43

accidentally found the right stack

02:45

overflow that solves your problem is

02:46

effectively what just happened here. So

02:48

if we start right here, you'll notice

02:49

that this is about 1,300 lines into the

02:52

output and it says, "Okay, I understand

02:55

the problem. This replacement isn't

02:58

handling edge cases correctly." And

03:00

before that, what it actually did is

03:02

actually took the method, put a bunch of

03:03

debug printing. Hey, print f debug. Hey,

03:05

it turns out me and the models were

03:07

very, very similar. Okay, we're both,

03:09

you know, print f Andes. We both enjoy a

03:11

little bit of logging. And it's able to

03:13

tell, hey, there's a problem right here

03:14

with this replace method. So naturally,

03:17

what does it do? It's actually going to

03:18

use the logs to search for why this was

03:22

ever handled this way to begin with. And

03:24

what's the first thing that shows up?

03:26

Fix incorrect result of git mod path.

03:29

Here's the fix. By the way, it sounds

03:32

like the first thing it finds within the

03:33

git within all the git logs is the

03:35

actual answer of which the model was

03:38

kind of surprised. It's like, well, wait

03:39

a second. Hey, wait a second. we've

03:42

already made this fix a long time ago,

03:45

but somehow it's not in the current

03:47

branch. I What am I supposed to do? And

03:50

so, of course, it actually just simply

03:52

applies to fixes, run all the tests,

03:53

everything passed. And it's like, boom,

03:55

look at this, dude. I'm so good. I'm so

03:57

good, man. Like, dude, if you just use

03:59

the correct answer and you apply the

04:01

correct answer, you get the correct

04:03

answer 100% of the time. Every time. It

04:06

works every time.

04:08

That's science. So, can I really blame

04:10

Claude for cheating on that one? Well,

04:13

no. It's really just it it actually did

04:15

a good move there, right? It by

04:17

searching for.replace,

04:20

it was actually trying to understand or

04:21

try to get context to why this change

04:23

was made to begin with. So, in my

04:25

opinion, this actually did the right

04:27

move. It's just that the repo owners

04:29

just exposed the wrong information. And

04:32

of course, the LLM immediately took

04:34

advantage of that information. It's

04:35

like, oh, look at this. in a future

04:37

commit. Everything was great. So, we're

04:39

just going to use future commit in our

04:41

past right now. Here we go. Let's do

04:43

this.

04:45

So, was Claude set for caught cheating?

04:47

Well, technically it was using answers

04:49

to get out the the stuff, but it

04:51

stumbled into the answers. It didn't

04:53

intentionally cheat the situation. A

04:55

second example is given where Quen Coder

04:57

actually does something similar, but

04:59

theirs is a little bit different how

05:01

it's actually executed. So upon given a

05:03

task, Quen Coder goes in here and does

05:05

the exact same thing. It's going to do

05:07

another searching through all of the git

05:10

logs to try to find something. And right

05:12

here is the actual fix. It locks on

05:15

pretty quickly to the fix and says,

05:16

"Okay, give me give me the information.

05:18

Give me the juice. I want to see what

05:19

happened. I want to see where it's

05:21

going." But then, okay, first off, I

05:23

love this about Quinn. Okay, I do. I

05:24

love this about Quinn. Look at this. I

05:26

am thinking dot dot dot colon. This is

05:29

interesting, but not the same issue.

05:30

Claude, wait. This is very interesting.

05:34

The fix was already made. Quen, I am

05:36

thinking this is interesting, but not

05:39

same issue. Let me look for more

05:40

information about the current issue. Let

05:42

me just, you know, Quinn Quen Quinn's a

05:45

bit more of a robot. Okay. You can feel

05:48

you could feel Claude like some sort of

05:50

static junior engineer discovering their

05:54

very first bug in production. They're

05:56

just so excited. Whereas Claude just

05:58

feels a little bit more grizzled. Okay.

05:59

Yeah. Okay. Yeah, I don't know if this

06:01

is the same interest, you know. I don't

06:02

know. I don't even know. Well, later on,

06:05

it actually continues to use this exact

06:07

same issue right here. You can even see

06:08

this 31926.

06:10

And if you look back here, that was the

06:12

the issue number referencing 31926.

06:16

It was able to go in here and actually

06:17

start breaking down and finding

06:19

everything that it was actually broken

06:20

right here about that. And then, of

06:22

course, it applied everything, fixed the

06:24

minimal, targeted, and addressed exactly

06:27

the issue described. It doesn't break

06:28

any existing functionality. I do want to

06:30

point out something that's kind of

06:31

funny. If you look right here, it says

06:32

perfect. All 256 tests pass with only 14

06:35

skipped, which is normal for

06:36

browserbased Selenium tests. How How do

06:39

you even know that I didn't even add

06:40

something to the skip list? It's so

06:42

funny that there are

06:44

so much examples in the wild of us

06:48

skipping tests and being like, "Ah,

06:50

dude, that it's just not working right

06:52

now." To the point where even agents are

06:54

like, "Oh, what? 14 skip tests? Oh, that

06:57

dude, that's normal. That's like a

06:59

That's a normal amount of tests you

07:01

skip. Like if you don't if you have a

07:03

lot of tests, you also skip a lot of

07:06

tests. That's like how this works. Like

07:09

long as it's below 10%, I'd say like if

07:11

we're below 10% skips test, you're

07:14

probably in a good place. Anyways, I

07:16

just kind of wanted to yap about that a

07:18

little bit because I think it's so funny

07:20

that these scores that we're giving to

07:22

models and all this on these software

07:24

engineering benchmarks, we're just using

07:26

the state of the repository to solve the

07:28

issues. Anyways, I do want to kind of

07:30

give like maybe a take that a lot of

07:32

people aren't thinking about is that I

07:33

actually think this was a pretty well

07:35

done job by them. And yeah, even though

07:37

they did cheat, even though they did get

07:38

answers in the future, how they went

07:40

about the problem solving is is pretty

07:42

good. when uh I was working on a large

07:45

C++ codebase that did year-over-year

07:47

releases, often what would end up

07:49

happening is some bug report would come

07:50

in in like for 2019 version and you'd go

07:53

in 2019 version, you'd see the bug,

07:56

you'd see where it's happening and then

07:57

you would actually use git to kind of GP

08:00

through because in later versions often

08:02

there was already a fix for this and

08:04

then you'd just have to kind of backport

08:05

this back into the older application.

08:07

This is a very normal and common thing

08:10

to do. And so when I see these models

08:12

actually using Git as a means to try to

08:15

understand how the issue came about, I

08:18

actually think, hey, this is actually a

08:19

good sign. This is a good sign. They're

08:21

actually making the right decision. So

08:22

even though they're saying, hey, these

08:24

models are using future information to

08:27

patch previous tests, therefore they are

08:29

cheating. To me, no, they're actually

08:31

doing what a good engineer should be

08:33

doing, which is looking at the available

08:35

information already in the repository.

08:37

Now, clearly that's not the heart of the

08:40

test here. The heart of the test is that

08:41

the AI are able to kind of take the

08:44

problem and fix it themselves. But to

08:47

me, I'm just seeing some good software

08:49

engineering. That's all. Hey, that's all

08:51

I'm saying. I'm seeing some good

08:52

software engineering. Okay? Use all the

08:55

tools at your disposal, including

08:58

answers. Okay? Hey. Oh, what? You don't

09:00

copy from Stack Overflow back in the

09:02

day? Is that it? Oh, that's that's

09:03

somebody else's code. I didn't come up

09:05

with the solution, therefore it's not

09:06

mine. Oh, did you make quick sort

09:08

yourself? I didn't know that, tough guy.

09:12

Hey, the name

09:14

is the primogen.

Video Information

YouTube ID: oZ2LcB3Wlpk

Added: Sep 15, 2025

Last Updated: 5 months ago