Intel just CRUSHED Nvidia & AMD GPU pricing

Duration

11:08

Captions

1

Language

EN

Published

Oct 7, 2025

Description

This is the stack that gets me over 4000 tokens per second locally.

Plaud Web: https://bit.ly/46WTrWD (20% OFF with code "Alex20" on Note & NotePin)
Plaud AMZ: https://amzn.to/4mzJt4f

Gear Links
- Thunderbolt 5 external SSD: https://amzn.to/3XqetZO
- Favorite 15" display with magnet: https://amzn.to/3zD1DhQ
- Great 40Gbps T4 enclosure: https://amzn.to/3JNwBGW
- My NVMe SSD: https://amzn.to/3YLEySo
- My gear: https://www.amazon.com/shop/alexziskind

Related Videos
- Skip M3 Ultra & RTX 5090 for LLMs | NEW 96GB KING - https://youtu.be/bAao58hXo9w
- Smallest RTX Pro 6000 rig | OVERKILL - https://youtu.be/JbnBt_Aytd0
- Cheap mini runs a 70B LLM - https://youtu.be/xyKEQjUzfAk
- RAM torture test on Mac - https://youtu.be/l3zIwPgan7M
- FREE Local LLMs on Apple Silicon | FAST! - https://youtu.be/bp2eev21Qfo
- REALITY vs Apple's Memory Claims | vs RTX4090m - https://youtu.be/fdvzQAWXU7A
- Set up Conda - https://youtu.be/2Acht_5_HTo
- INSANE Machine Learning on Neural Engine - https://youtu.be/Y2FOUg_jo7k
- Developer productivity Playlist - https://www.youtube.com/playlist?list=PLPwbI_iIX3aQCRdFGM7j4TY_7STfv2aXX
- AI for Coding Playlist - https://www.youtube.com/playlist?list=PLPwbI_iIX3aSlUmRtYPfbQHt4n0YaX0qw

Subscribe: https://www.youtube.com/@AZisk?sub_confirmation=1
Join this channel to get access to perks: https://www.youtube.com/channel/UCajiMK_CY9icRhLepS8_3ug/join
Alex on X: https://twitter.com/digitalix

#intel #llm #b50

Captions (1)

00:00

Would you believe it if I told you that

00:01

I'm holding a GPU? Yeah. Is this not the

00:04

cutest little GPU that you've ever seen

00:07

in your life? I love the blue accents.

00:09

So, Intel just released these. These are

00:11

brand new Gen 5 PCIe. It's not just the

00:14

Arc B50 that replaces the A50. This is

00:17

the Pro B50, which means, yeah, it can

00:20

play your games, but it's meant for

00:22

professionals and workstations and home

00:25

servers. So today I'm going to throw it

00:27

into this rig right here, which is

00:29

ridiculous because this thing can fit

00:31

into a tiny little build. Maybe I'll do

00:34

that one day, but for now, this is what

00:36

I have. And it just looks ridiculous

00:37

sitting in this giant box. And that's

00:39

it. I plugged it into the PCIe slot.

00:41

Notice that I did not plug anything else

00:43

in. There's power cables here from my

00:46

power supply that I use in my bench for

00:49

other GPUs. This one doesn't need it. It

00:52

gets all of its power from the PCIe slot.

00:54

No extra power cables needed because it

00:57

only takes up 70 watts of power to run.

00:59

Now, I actually contacted Intel and I

01:01

said, "Can you send me one of these,

01:03

please?" I asked really nicely and they

01:04

did and I'm hoping that they'll do the

01:06

same for me for the B60. That's the big

01:09

brother. This is the baby. The baby is

01:11

here. It's very capable. I already ran a

01:13

couple of LLMs on it. I'm going to show you

01:15

right now. And that's what it's really

01:16

for. It's for machine learning and

01:18

graphics work. But why? We already have

01:21

the B580, right? That's the consumer GPU

01:23

that I tested a few weeks ago. That one

01:25

had 12 gigs of memory. This one has 16

01:30

16 gigs of VRAM inside a tiny package

01:33

like that is kind of unheard of at this

01:36

stage in the game. And the B60 is

01:38

supposed to have 24 and with a dual chip

01:41

48. So what's so special about it having

01:44

16 gigs? Well, the price. Try to find

01:47

another GPU with 16 gigs of VRAM at this

01:51

price. It was announced at Computex at

01:53

$299, but the price has risen to $349,

01:56

which is still the cheapest card out

01:58

there for this amount of VRAM. Here's

02:00

some extra specs you might want to know

02:02

about. The three most important things

02:03

are the price, the memory, and the

02:05

memory bandwidth, which is 224 GB/s. It's not

02:08

the highest memory bandwidth we've seen,

02:09

but it's pretty decent. Now, Intel is

02:11

comparing this one to the RTX A1000.

02:14

They're placing it directly against this

02:16

card. Here's their official graphic on

02:18

it. Nvidia A1000 has less performance,

02:22

less value because it's more expensive.

02:24

$426. That's the graphics. And then the

02:26

inference machine learning stuff is on

02:28

the right. We also have a much higher

02:30

score in performance and value. Even

02:33

better value this time. The A1000, by

02:35

the way, is a very popular professional

02:37

level GPU. This card only has 8 gigs of

02:40

VRAM. So, how can you even compare

02:42

a 16 gig card to an 8 gig card? Well,

02:46

even at 8 gigs, this card is $100 more.

02:49

If you take a card that has 16 gigs, the

02:52

A2000, which I don't have here, but let

02:55

me know in the comments if you do want

02:57

to see that comparison. The A2000 is

02:59

over $700. It's almost $800. So,

03:02

hopefully that sets the stage for you in

03:04

the positioning of this GPU and where

03:07

Intel is hoping to enter this market.

03:09

They're not ahead. They're quite a bit

03:11

behind. AMD already has GPUs out.

03:13

Nvidia, obviously. Intel has had GPUs

03:16

out, but now they're catching up with

03:17

these smaller professional kind of GPUs

03:20

that have a lot to offer at a low price.
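Before the benchmarks, the pricing argument above is easy to quantify as dollars per gigabyte of VRAM. A quick sketch using the prices quoted in this video (the A2000 figure is approximate, since it's only described as "over $700, almost $800"):

```python
# Price per gigabyte of VRAM for the cards discussed.
# Prices are the ones quoted in the video; the A2000 is approximate.
cards = {
    "Arc Pro B50": {"price_usd": 349, "vram_gb": 16},
    "RTX A1000":   {"price_usd": 426, "vram_gb": 8},
    "RTX A2000":   {"price_usd": 750, "vram_gb": 16},  # "over $700, almost $800"
}

per_gb = {name: c["price_usd"] / c["vram_gb"] for name, c in cards.items()}
for name in sorted(per_gb, key=per_gb.get):
    print(f"{name}: ${per_gb[name]:.2f}/GB")
```

At roughly $22 per GB, the B50 undercuts both Nvidia cards on this metric before performance even enters the picture.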

03:23

So, let's take a look at how it does

03:24

with some of these models that we have

03:26

here. Is this a platinum credit card?

03:28

Nope. It's the Plaud Note Pro, my

03:30

pocket-sized meeting assistant. I've

03:32

started with the Plaud Note and Plaud

03:34

NotePin over a year ago. They've been my

03:36

silent assistants, taking down every

03:38

meeting note for me while I stay fully

03:40

engaged. And the new Pro takes it even

03:42

further. All right, which one of these

03:43

people would you rather work with? The

03:45

one buried in their notebook or the one

03:47

listening, present, and leading? Plaud

03:49

Note Pro uses four microphones with

03:51

noise cancellation to record

03:53

crystal-clear audio from up to 5 meters

03:55

away, even when voices overlap. And

03:57

during a meeting, I can pin moments, jot

04:00

quick thoughts, and take photos, all

04:02

synced into one timeline. Afterward, the

04:04

app delivers a structured summary,

04:06

highlights key insights, and organizes

04:08

to-dos for each stakeholder. Then I use

04:11

Ask Plaud: draft a recap email,

04:13

brainstorm next steps. This is all

04:15

powered by Plaud Intelligence. It's

04:17

built on top of ChatGPT and Claude. It

04:20

remembers context, connects dots across

04:22

past meetings, and actually helps me

04:24

think through next steps and the meeting

04:25

notes. And the hardware, AMOLED display,

04:27

auto detects calls versus in-person

04:30

meetings, Apple Find My, 50 hours of

04:33

non-stop recording, 60-day standby, and

04:36

64 gigs of storage. My phone stays free,

04:39

my battery stays safe, and it's

04:41

end-to-end encrypted with global

04:43

compliance. Your data stays yours. This

04:46

isn't just another recorder. It's like a

04:48

second brain, an investment in focus and

04:50

clarity. The subscription simply covers

04:52

the AI that does the heavy lifting.

04:54

Plaud Note Pro: designed for leaders and

04:56

for everyone who wants to become one.

04:58

And right now you can get 20% off the

05:00

original Plaud Note and NotePin if

05:02

you're just getting started. Check the

05:03

links in the description. Right now I

05:05

have here the Qwen 3 4-billion-parameter

05:08

model. This is the new one. It's a Q4

05:10

quant and it's a pretty small one. It

05:12

takes about 3 GB. I like this new

05:15

feature of LM Studio that estimates the

05:17

memory usage once you have everything

05:19

enabled. Let's see, the context length...

05:21

Yes. Once you increase that context

05:24

length, it'll tell you how much GPU is

05:27

going to be needed. And right now it

05:29

says 10 gigabytes is going to be needed

05:31

for this tiny model if you increase the

05:33

context length to about 50,000 tokens. I

05:37

like that feature. Even doing that, we

05:39

still fit within the budget of our

05:42

memory here. Let's take a look at task

05:45

manager before I load that up. And yeah,

05:47

there it is. We got 16 gigs. Let me hit

05:50

load here. Now, just for those folks

05:52

that are curious about the back end

05:54

here, we are using Vulkan for this GPU.
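As an aside on that 10 GB estimate: the jump from a ~3 GB model file to ~10 GB of GPU memory is mostly KV cache, which grows linearly with context length. Here's a back-of-envelope sketch assuming Qwen3-4B-like hyperparameters (36 layers, 8 KV heads with head dimension 128, fp16 cache). These are assumptions, so check the model's actual config before relying on the numbers:

```python
def kv_cache_bytes(context_len, n_layers=36, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size: keys and values (factor of 2), per layer, per KV head,
    per cached token, at the given element width (2 bytes for fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

weights_gb = 3.0                      # ~Q4 quant of a 4B model, per LM Studio
kv_gb = kv_cache_bytes(50_000) / 1e9  # cache at a 50,000-token context
print(f"weights ~{weights_gb:.1f} GB + KV cache ~{kv_gb:.1f} GB "
      f"= ~{weights_gb + kv_gb:.1f} GB")
```

That lands right around the 10 GB LM Studio predicts, which is why a "tiny" model can still fill a 16 GB card once you stretch the context.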

05:57

We're not using Intel specialized

05:59

libraries because LM Studio doesn't have

06:02

that support yet. Dedicated memory is

06:04

12.8 out of 16. That means we can run

06:06

something. I'm going to give it a nice

06:08

decent prompt that I like to use. And by

06:10

decent, I don't mean like there's an

06:12

indecent prompt. I'm sure there's

06:14

indecent prompts, but I don't have any

06:15

of those. Uh, let's go with this one

06:18

right here. It's called design a

06:20

scalable web architecture. All right,

06:23

here we go. And that's going pretty

06:25

nicely. Look at that. We're using 97% of

06:28

the GPU. So, Vulkan is working really,

06:31

really nicely with this. Vulkan is

06:33

great. And look how fast it's going.

06:35

Now, I did notice that the fan here is

06:38

really blowing out a lot of hot air.

06:40

It's not super loud, but it is working

06:43

pretty hard. So, if that camera dies

06:45

while I'm talking, you know, it

06:46

overheated. 51.75

06:49

tokens per second for Qwen 3 4B. Let's do

06:52

another popular model which is this one.

06:54

gpt-oss 20B by OpenAI. And LM Studio is

06:58

telling me that only 23 out of 24 layers

07:01

can be offloaded to the GPU. I don't

07:03

believe it. I think we can do all. Let's

07:05

do all. I'm going to do all 24. Now, I

07:07

did set my guardrails to relaxed.

07:10

Guardrails is what keeps the memory

07:12

in check. If there's too much of a

07:14

demand from a model, it's going to say

07:16

we can't load it. You can turn that

07:18

off completely, but I've had my system

07:20

crash and freeze and burn. Not burn,

07:23

but freeze if I did that before. So, I'm

07:25

going to set it to relaxed. You can see

07:27

up here the GPU estimate is 11.89 GB. And

07:30

I'm at context length 4,000. If I push

07:33

this, I'm afraid things might start

07:36

breaking. So, let's start at

07:39

4,096 and see what happens there. Is it

07:41

going to load? There it goes. And it

07:43

loads. It loads fine. Let's run our

07:47

prompt here and see what happens.

07:48

It is a thinking model, so it's going to

07:50

do that first. There it goes. 39 tokens

07:53

per second. That's pretty good. Now, I

07:54

ran my automation script against this

07:56

system. And I have a bunch of prompts

07:57

here, starting with these really long

07:59

prompts. 17,000 tokens, 44,000 tokens.

08:02

Those two did not run, by the way. And

08:04

then we have a long architecture

08:06

enterprise prompt of 1,200 tokens. We

08:08

have a 2,300 token prompt and a bunch of

08:11

smaller prompts in the 100 range. These

08:13

are all programming related prompts

08:16

except for this one. This one just says

08:18

hi. And I know you love that.
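The automation script itself isn't shown, but the core of such a harness is just timing each generation and dividing tokens by seconds. A minimal sketch: the prompt names mirror the video's categories, and `generate()` is a stub where a real script would call the local LM Studio server:

```python
import time

# Illustrative prompt set; the real script uses much longer prompts.
prompts = {
    "short simple greeting": "hi",
    "short simple math": "What is 17 * 24?",
    "medium programming": "Write a Python function that merges two sorted lists.",
}

def generate(prompt):
    """Stub for a call to the local model server; a real implementation
    would return the completion text and the completion token count."""
    time.sleep(0.01)              # stand-in for inference time
    return "stub output", 100     # (text, tokens generated)

results = {}
for name, prompt in prompts.items():
    start = time.perf_counter()
    _, n_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    results[name] = n_tokens / elapsed   # tokens per second

for name, tps in results.items():
    print(f"{name}: {tps:.1f} tok/s")
```

A real harness would also repeat each prompt several times, since single runs vary a lot on short prompts.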

08:21

Well, the "hi" prompt never does well.

08:23

It's over here. It's the short simple

08:24

greeting and it always has the largest

08:26

amount of variance. I wonder why. But in

08:29

general, you can see that we're pretty

08:33

consistent staying between about 35

08:35

tokens per second for this one, which is

08:37

the long programming project, up to

08:39

about 42 tokens per second for the

08:41

medium programming prompt. By the way,

08:43

this is just a first look at this card.

08:45

I'm probably going to do more videos

08:47

about it cuz it's interesting. And also,

08:49

we'll see what happens when the B60

08:51

comes out. But if you're using this

08:53

professionally, you're probably going to

08:55

be doing something on Linux with it

08:57

using vLLM.
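As a sanity check on all these numbers: single-stream decoding is usually memory-bandwidth-bound, so a rough ceiling on tokens per second is bandwidth divided by the bytes read per generated token (roughly the size of the active weights). A crude sketch using the B50's 224 GB/s and the ~3 GB Qwen quant from earlier; it ignores KV-cache traffic and compute limits, so treat it as an upper bound only:

```python
def decode_tps_ceiling(bandwidth_gb_s, active_bytes_gb):
    """Bandwidth-bound upper limit: each generated token streams the
    active weights through memory once."""
    return bandwidth_gb_s / active_bytes_gb

ceiling = decode_tps_ceiling(224, 3.0)
print(f"ceiling ~{ceiling:.0f} tok/s vs ~52 tok/s measured")
```

The measured ~52 tok/s lands comfortably under that ~75 tok/s ceiling, which is consistent with a bandwidth-bound decode.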

08:59

So yeah, we're going to get slightly

09:01

different numbers later on. So stay

09:04

tuned for those. Now, let's do some

09:05

comparisons. Here's the M1 MacBook Air. It

09:09

does stuff, but it's down here. It's the

09:11

purple one. M4 MacBook Air. I don't know

09:13

if you're interested in my MacBook

09:14

comparisons, but I like to compare a

09:16

wide array of machines. In fact, I

09:18

recently did a video on all the

09:19

different base MacBook Air models. Go

09:22

check that out. So, the M1 and the M4

09:23

are significantly slower. But if we take

09:27

a look at the M4 Max, yeah, it blows it

09:29

away. That's my daily driver right over

09:31

here. What about this new breed of AMD's

09:36

APU? The Ryzen AI Max+ 395. Ah, I

09:41

forgot the name again. Look at that. GMKtec

09:43

has one in it. The Framework

09:45

Desktop has one in it. And yeah, that

09:48

does really well. And that's an APU, not

09:51

even a dedicated GPU. So, does pretty

09:53

well. But here we are. The red one is

09:55

still the Intel one. Should be blue in

09:57

this chart. Come on, chart. Why didn't

10:00

you figure out it should have been blue?

10:02

Let's take for example the short simple

10:03

math prompt. Intel does 39.9 tokens per

10:07

second. Framework desktop 46 and 46.6

10:11

from the GMKtec EVO-X2. There's a

10:13

few here that the framework refused to

10:15

run. Not sure why, but a few of the

10:18

longer prompts. This long programming

10:20

project, the GMKtec did 50 tokens per

10:24

second compared to 35.9 from the Intel

10:28

machine. And here's a medium programming

10:29

prompt: 60 tokens per second from GMKtec,

10:32

56.4 from Framework, and 42 from

10:35

the B50. Like I said, this is just a

10:37

first look. Let me know in the comments

10:39

down below if you want to see more

10:40

videos about the B50 and specifically a

10:44

comparison maybe with the A1000, a

10:46

direct comparison. I haven't had a

10:47

chance to do this one yet. But for now,

10:49

take a look at my B580 video which is

10:52

right over here and my framework desktop

10:54

video which is right over here. Thanks

10:55

for watching and I'll see you next time.

Video Information

YouTube ID: KBbJy-jhsAA
Added: Oct 8, 2025
Last Updated: 5 months ago