I Plugged a DGX Spark and Mac Together... and Didn’t Expect This

Two machines on a desk, each brilliant at exactly half of what running a large language model needs, and each terrible at the other half. The NVIDIA DGX Spark (here a GB10 in MSI's Edge Expert clothing) chews through a long prompt at hundreds of tokens per second, then crawls when it actually has to write the answer. The Mac mini does the reverse: slow to read the prompt, fast to stream the reply. Alex Ziskind spends this video trying to bolt the good half of each onto the other, a trick the industry calls disaggregated prefill and decode, then measuring whether the Frankenstein is actually worth building.

Published May 1, 2026 · 20:11 video · 20 min read · Added Jun 14, 2026

At a glance

The honest answer is "it works, and you probably should not." Across Llama 3.1 8B, Qwen 2.5 32B, and Gemma 2 27B he gets a clean result every time: the disaggregated rig recovers Spark class time to first token and Mac class decode speed, a combination neither machine reaches alone. But network transfer nearly killed the whole idea on the first try (96% of total time was the wire), the gains only matter at larger model sizes, and a single RTX Pro 6000 would, by his own admission, demolish the two machine cluster on both ends. This is the entire experiment rebuilt in order, with every number he put on screen.

The two phases nobody splits at home

Running an LLM is two jobs wearing one trench coat. The first is prefill, also written PP for prompt processing, where the model reads and digests your entire prompt at once. That work is compute heavy, a wall of matrix math, and a GPU eats it for breakfast. The second is decode, where the model writes the answer one token at a time, each new token waiting on the one before it. That work is memory bandwidth heavy, a question of how fast you can stream weights past the chip, and Apple Silicon's unified memory is unusually good at it.

On a single desktop you run both phases on the same box and never think about the split. But you can pull them apart and run each on the hardware it loves. That is disaggregated prefill and decode, and Ziskind is quick to say it is not a toy concept he invented. DeepSeek and ByteDance and plenty of others already run it in production. Splitting the two phases lets each be optimized and scaled independently, and it is one of the quiet reasons inference costs keep falling. The catch is the handoff: after the GPU finishes prefill it has to ship the KV cache, the digested state of the prompt, across to the machine doing decode. Hold that thought, because the handoff is where the experiment almost dies.
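To get a feel for how heavy that handoff is, here is a rough sizing sketch. It assumes the standard KV cache layout (one key and one value vector per layer, per KV head, per token) and uses the attention shape of Llama 3.1 8B at BF16, one of the models tested later. The function names are made up for illustration, and the wire time is an ideal line-rate figure that ignores protocol overhead and any overlap with compute.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for `tokens` prompt tokens (keys + values, every layer)."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token


def wire_seconds(num_bytes, link_gbit_per_s):
    """Ideal transfer time at full line rate, ignoring protocol overhead."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9)


# Llama 3.1 8B attention shape: 32 layers, 8 KV heads, head dimension 128.
cache = kv_cache_bytes(tokens=25_000, layers=32, kv_heads=8, head_dim=128)
print(f"{cache / 1e9:.2f} GB")
for gbit in (2.5, 50):
    print(f"{gbit:>4} Gb/s: {wire_seconds(cache, gbit):.2f} s")
```

Under these assumptions a 25,000 token prompt produces about 3.3 GB of cache, which is roughly 10.5 seconds of wire time at 2.5 Gb/s and about half a second at 50 Gb/s. A different model or cache precision changes the numbers, not the shape of the problem.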

The setup: a GB10 and a Mac, talking

Round one pairs the MSI Edge Expert with a Mac mini. The Edge Expert is a GB10 like the DGX Spark, just from a different vendor: NVIDIA's Blackwell GPU with 128 GB of unified memory. The Mac mini is an M4 Pro with 64 GB of unified memory. The Mac Studio comes later. The plan is dead simple to say and miserable to do: send the prompt to the Mac, have it route prefill to the GB10, let Blackwell crunch it, ship the KV cache back, and let the Mac decode.

[Diagram: Mac (decode; M4 Pro / M3 Ultra, 4-bit MLX, memory bandwidth) and GB10 (prefill; Blackwell GPU, BF16, 128 GB, raw compute) joined by a 50 Gb link. The prompt goes out to the GB10, the KV cache comes back, and tokens stream out of the Mac. Each machine does only the half it is built for.]
Figure 1. The disaggregated pipeline. The prompt enters at the Mac, prefill is routed to the GB10 where Blackwell crunches it in BF16, the KV cache is shipped back over the network, and the Mac decodes in 4-bit and streams the answer. The entire experiment lives or dies on the speed of that middle link.
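The control flow in that pipeline can be sketched as a toy. Everything here is hypothetical: the class and function names are invented for illustration and are not EXO's API, the "cache" is just a token count, and the "tokens" are placeholders. The only thing it shows is the shape of the handoff: one node reads the whole prompt, a second node continues from the state it is handed.

```python
from dataclasses import dataclass


@dataclass
class KVCache:
    """Stand-in for the digested prompt state that crosses the network."""
    prompt_tokens: int


class PrefillNode:
    """Plays the GB10: reads the whole prompt in one pass."""

    def prefill(self, prompt: list[str]) -> KVCache:
        return KVCache(prompt_tokens=len(prompt))


class DecodeNode:
    """Plays the Mac: emits tokens one at a time from a received cache."""

    def decode(self, cache: KVCache, max_new_tokens: int) -> list[str]:
        position = cache.prompt_tokens
        out = []
        for _ in range(max_new_tokens):
            out.append(f"tok{position}")  # placeholder for a sampled token
            position += 1                 # each step extends the sequence by one
        return out


def generate(prompt, prefill_node, decode_node, max_new_tokens):
    cache = prefill_node.prefill(prompt)              # compute heavy half
    return decode_node.decode(cache, max_new_tokens)  # bandwidth heavy half


print(generate("why is the sky blue".split(), PrefillNode(), DecodeNode(), 3))
```

In a real system the `KVCache` object is gigabytes of tensors, and the call between the two nodes is a network transfer, which is why the link speed matters so much.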

The build: days of compiling, then mDNS hell

Ziskind is candid that he could not write this himself. In his words, he is a web developer, not a systems programmer, and he knows nothing about Rust networking code or libp2p multicast protocols. He had watched the EXO project tease consumer hardware disaggregation on Twitter for months, complete with a slick blog post and animations, but they never shipped it. Then the community opened a couple of pull requests adding Blackwell support for disaggregated prefill and decode. Experimental, untested on real hardware, but the code existed. So he pointed Claude Code at the pull request, gave it full SSH access to both machines, and, in his words, said "Make it so, number one."

The agent went to the races. It SSHed into both boxes, installed uv, cloned the EXO repo, built the Rust networking bindings, compiled vLLM from source on ARM Linux (the Blackwell CUDA kernels are not small), built MLX from source on the Mac with all its Metal shader compilations, installed Node.js, and went. "Easy, right? Well, it only took a few days." After about 30 minutes of compilation both EXO instances came up and the dashboards loaded. Then the hard part.

EXO uses mDNS to discover peers on the network, and the two machines simply could not see each other. Hours with Claude trying everything: a direct Ethernet cable, USB adapters, modifying the Rust networking layer, a good quality Thunderbolt cable. Nothing. The breakthrough came from running tcpdump and finding the real culprit: libp2p's mDNS is broken on macOS. The fix turned out to be small. Instead of waiting to be discovered, have the GB10 dial the Mac directly. He set an environment variable on the MSI Edge Expert and the connection came up instantly. Even then, more bugs followed, which is probably why the PR is still open. Prefill routing broke, the Mac mini's runner process could not reach the GB10, and each fix uncovered the next. After a reboot and some creative workarounds, everything connected.
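The "dial instead of wait to be discovered" pattern is easy to sketch in isolation. The video does not name the environment variable he set, so the variable name, address, and port below are all hypothetical, and this is not EXO's code. It only illustrates the idea: if an explicit peer address is configured, use it, and fall back to discovery only when it is absent.

```python
import os

# Hypothetical variable name; the video does not say what the real one is called.
PEER_ENV_VAR = "DEMO_DIAL_PEER"


def peer_to_dial(environ=os.environ):
    """Return (host, port) to dial directly, or None to fall back to discovery."""
    raw = environ.get(PEER_ENV_VAR, "").strip()
    if not raw:
        return None
    host, sep, port = raw.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"{PEER_ENV_VAR} must look like host:port, got {raw!r}")
    return host, int(port)


print(peer_to_dial({PEER_ENV_VAR: "192.168.10.2:52415"}))  # made-up address
print(peer_to_dial({}))
```

The design point is that a static address removes multicast from the critical path entirely, which sidesteps a broken discovery layer rather than repairing it.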

First run: the network eats everything

With the cluster up, he loaded matching models with mismatched precision on purpose: Qwen 3.5 27B in BF16 on the GB10 and the same model in 4-bit MLX on the Mac mini. The reasoning is the whole philosophy of the build. The GB10 has 128 GB so it can hold full precision, and full precision is faster for prefill compute. The Mac runs the smaller 4-bit version, and smaller means faster decode. Each machine runs the code and quantization optimized for its role.

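Then he sent a long prompt and watched it route prefill to the GB10. Blackwell chewed through the tokens, sent the KV cache back, and it worked. The raw prefill numbers were exactly what you would hope for: the GB10 prefilled at 546 to 937 tokens per second depending on prompt length, against 66 tokens per second on the Mac mini locally, up to 14 times faster on the GPU.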

And then he looked at end to end time and the wheels came off. At 25,000 tokens the GB10 computed the KV cache in under a second. Transferring that cache over his 2.5 gigabit USB Ethernet adapter took 25 seconds. In other words, 96% of total time was the network, with the GPU idling while it waited. The computational win was completely real and completely irrelevant, drowned by the wire.
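The 96% figure falls straight out of the two timings. A quick check, with one assumption: "under a second" of compute is rounded up to a full second, so the result is a lower bound on the network's share.

```python
def network_share(compute_s, transfer_s):
    """Fraction of end-to-end prefill time spent moving the KV cache."""
    return transfer_s / (compute_s + transfer_s)


# "Under a second" of compute is rounded up to 1.0 s here, so this is a lower bound.
share = network_share(compute_s=1.0, transfer_s=25.0)
print(f"{share:.0%}")  # 96%
```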

A three way comparison (GB10 alone vs Mac mini alone vs disaggregated) made it worse, because he was still on Qwen 3.5, a thinking model. Thinking models generate hidden reasoning tokens before the visible answer, and those reasoning tokens run at decode speed on every platform, so they dominated time to first token and made all three configs look nearly identical. To see anything real he needed two changes: a faster network and a non thinking model.
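One rough way to see why the thinking model flattened the comparison is to write time to the first visible token as a sum. The symbols are introduced here for illustration and do not come from the video: $N_{\text{prompt}}$ and $N_{\text{think}}$ are the prompt and hidden reasoning token counts, $r_{\text{prefill}}$ and $r_{\text{decode}}$ are the two throughputs, and $t_{\text{transfer}}$ is the KV cache handoff (zero on a single machine).

```latex
\mathrm{TTFT}_{\text{visible}} \approx
  \frac{N_{\text{prompt}}}{r_{\text{prefill}}}
  + t_{\text{transfer}}
  + \frac{N_{\text{think}}}{r_{\text{decode}}}
```

When $N_{\text{think}}$ is large, the last term swamps the first two, so a faster prefill machine barely moves the total. A non thinking model sets $N_{\text{think}} = 0$ and lets the prefill and transfer terms show.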

The drawer of network cards

The network fix is the most "guy with too much hardware" stretch of the video, and it is great. He pulls out a Thunderbolt 5 enclosure with a PCIe slot, bought months ago for exactly this. First card in: an Intel E810 NIC, dual QSFP at 100 gigabit, his most powerful card. macOS said "driver not installed." Dead end. He digs through a drawer of network cards (with the obligatory "don't stick your finger in the fan, rule of thumb, rule of all fingers") and finds a Mellanox ConnectX-4, QSFP, 50 gigabit. Not 100, but plugged in and macOS recognized it immediately, because Apple has shipped built in drivers for Mellanox cards since 2019. On the other end sits a MikroTik CRS812 switch from his DGX Spark cluster video. Getting the switch port to negotiate took work, but once it linked up, KV cache transfer improved by about 30%. He suspects a working 100 gig card would push it further.

Apples to apples on Llama 8B (and 50 gigabit changes everything)

Non thinking model in hand, he switches to Llama 3.1 8B, dense, old, reliable, and runs on everything. He measures with llama-bench over the full HTTP stack, a real API call rather than a synthetic micro benchmark, across three configs and several prompt lengths. The GB10 alone in BF16 peaks at PP 2048 and PP 4096 with throughput up to almost 1,800 tokens per second. The Mac mini alone in 4-bit is much lower on prefill but much faster on token generation. And disaggregated, GB10 prefill plus Mac decode, takes the good corner of each chart: Spark class prefill, Mac class decode.
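The two numbers being read off those charts, time to first token and decode rate, can be measured from any token stream. The sketch below is not the tool he used. It is a minimal illustration with a made-up `measure_stream` helper and a fake generator standing in for a streaming API response.

```python
import time


def measure_stream(tokens, clock=time.perf_counter):
    """Return (time_to_first_token_s, decode_tokens_per_s) for a token iterator."""
    start = clock()
    first = last = None
    count = 0
    for _ in tokens:
        last = clock()
        if first is None:
            first = last
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    # Decode rate counts the gaps between tokens, so the first token is excluded.
    rate = (count - 1) / (last - first) if count > 1 and last > first else 0.0
    return ttft, rate


def fake_stream(n, first_delay, gap):
    """Stand-in for a streaming response: a slow first token, then a steady drip."""
    time.sleep(first_delay)
    yield "t0"
    for i in range(1, n):
        time.sleep(gap)
        yield f"t{i}"


ttft, rate = measure_stream(fake_stream(5, first_delay=0.05, gap=0.01))
print(f"TTFT {ttft:.2f} s, decode {rate:.0f} tok/s")
```

Separating the first token from the rest is the whole point: prefill speed shows up in the first number, decode speed in the second, and a single "tokens per second" average would blur the two.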

The verdict at 8B is honest and a little deflating. On the Mac mini rig, disaggregated time to first token landed at 2.4 seconds against the GB10 alone at 2.3, so he matched the Spark on prefill rather than beating it, and the 50 gigabit link added almost zero overhead. But disaggregated decode came in at 34 tokens per second, actually slower than the Mac mini alone at 52, the price of injecting the remote KV cache. Worse, he admits a single RTX Pro 6000 would probably demolish the whole two machine setup on both ends, with six times the GB10's memory bandwidth and three and a half times its compute.

The saving variable: decode speed tracks memory bandwidth, and there was a faster Mac on the desk. The Mac mini M4 Pro has 273 GB/s. The M3 Ultra Mac Studio has 819 GB/s, three times as much.

[Bar chart: memory bandwidth in GB/s, the spec that sets decode speed. Mac mini M4 Pro: 273. Mac Studio M3 Ultra: 819.]
Figure 2. Decode speed is governed by memory bandwidth, and the Mac Studio M3 Ultra has three times the Mac mini's. That single spec is why swapping the decode machine is the move that finally makes the cluster worth building.
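A back of the envelope ceiling shows why bandwidth is the spec to watch. The sketch assumes single stream decode has to read every weight once per generated token, and it approximates an 8B model at 4-bit as 4 GB of weights, ignoring quantization scales, KV cache reads, and everything else. Both simplifications make the ceiling optimistic.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, params_billion, bits_per_weight):
    """Upper bound on single-stream decode: every weight is read once per token."""
    weight_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weight_gb


for name, bandwidth in (("Mac mini M4 Pro", 273), ("Mac Studio M3 Ultra", 819)):
    ceiling = decode_ceiling_tok_s(bandwidth, params_billion=8, bits_per_weight=4)
    print(f"{name}: at most ~{ceiling:.0f} tok/s")
```

That gives ceilings of roughly 68 and 205 tokens per second. The Mac mini's measured 52 sits under its ceiling, as it should. Real hardware never reaches the bound, but the ratio between the two machines is what the bandwidth argument rests on.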

Swap in the Mac Studio

Same EXO code, same switch, same ConnectX-4 and ConnectX-7 cards, only the silicon changed: a DGX Spark next to an M3 Ultra Mac Studio with 512 GB of unified memory. Setup was the usual pain. The Mac Studio's runner subprocess could not reach the Spark, so he recreated the ConnectX-4 as a proper network service and rebooted, the same ritual as before, but this time it only took a couple of hours.

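Re running Llama 3.1 8B for an apples to apples comparison, the bandwidth hypothesis held beautifully. The Mac mini had decoded at 52 tokens per second; the Mac Studio alone hit 106, about double rather than the full 3x the bandwidth spec suggests, which he guesses is because the extra bandwidth is not saturated at batch size one. Prefill at 4K came in at 1,585 tokens per second on the Spark alone, 1,420 on the Mac Studio alone, and 1,584 disaggregated, with the 50 gigabit link adding about 18 milliseconds. Disaggregated decode landed at 84 tokens per second, down from 106 because of KV cache injection overhead, but still six times the Spark alone at 14 and two and a half times the 34 he got with the Mac mini.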

That is the moment the pitch starts to deliver: Spark class prefill recovered, Mac Studio class decode, in one pipeline.

The 70B wall: quantization that would not build

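Naturally he reaches for Llama 3.1 70B. The Mac Studio runs the 4-bit MLX quant at about 40 GB on disk, plenty of room. The Spark is the constraint: full precision 70B is 140 GB and the Spark has 128, so it will not fit. He needs a quantized variant the Spark will run, and runs into a kernel wall. FP8 failed on the compilation of the CUTLASS matmul kernel, which his vLLM build did not include. A 4-bit AWQ (activation aware quantization) version was auto converted to AWQ Marlin at load time and failed because the Marlin repack kernel is missing. A W4A16 GPTQ Marlin variant hit the same missing kernel.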

The root cause: the vLLM wheel for Spark ships only BF16 and FP16 kernels, with no quantization built for SM121 (the Spark's architecture). Running 70B would mean rebuilding vLLM from source with those kernels, which he tables. So 70B is off the table, and he drops to the 27B and 32B class, which fit comfortably on the Spark at BF16.
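The fit question is simple arithmetic on parameter count and bits per weight. This sketch counts weights only, using nominal parameter counts, so it leaves out the KV cache, activations, and runtime overhead that a real deployment also needs.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight footprint in GB (weights only, no KV cache or runtime)."""
    return params_billion * bits_per_weight / 8


SPARK_MEMORY_GB = 128

for name, params in (("Llama 3.1 70B", 70), ("Qwen 2.5 32B", 32), ("Gemma 2 27B", 27)):
    bf16 = weight_gb(params, 16)
    verdict = "fits" if bf16 < SPARK_MEMORY_GB else "does not fit"
    print(f"{name}: {bf16:.0f} GB at BF16, {verdict}; {weight_gb(params, 4):.1f} GB at 4-bit")
```

The 70B model comes out at 140 GB in BF16, over the Spark's 128, while the 32B and 27B models land at 64 and 54 GB. The naive 4-bit figure for 70B is 35 GB, a little under the roughly 40 GB the MLX quant occupies on disk, which is expected since real quantized files carry scales and some higher precision tensors.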

Qwen 32B and Gemma 27B: the pattern repeats, and a twist

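Qwen 2.5 32B, BF16 on the Spark and 4-bit on the Mac Studio, prefill at 4K: the Spark alone gets 875 tokens per second, the Mac Studio alone 356, and disaggregated 792, better than twice the Mac Studio's prefill speed. On decode the Mac Studio leads 29 to 23, only a 1.3x gap, down from eight times at 8B.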

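Gemma 2 27B, same recipe: the Spark gets 779 on prefill, the Mac Studio 379, and disaggregated 722. On decode the Mac Studio is at 30 and the Spark at 24, a gap of only one and a quarter times.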

A nice honesty moment: the disaggregated decode for Gemma reads 24, identical to the Spark, which looks like the Mac got skipped. It did not. Every model lost about 20% of decode to KV cache injection overhead, and 30 minus 20% is 24, which happens to match the Spark exactly by coincidence.
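The "about 20%" tax can be checked against the decode numbers reported for all three models. The helper name is invented for illustration; the figures are the ones from the video.

```python
# (Mac Studio alone, disaggregated) decode speeds in tokens/sec, from the video.
decode = {
    "Llama 3.1 8B": (106, 84),
    "Qwen 2.5 32B": (29, 23),
    "Gemma 2 27B": (30, 24),
}


def injection_tax(mac_alone, disaggregated):
    """Fraction of Mac-alone decode speed lost in the disaggregated run."""
    return 1 - disaggregated / mac_alone


for model, (mac, disagg) in decode.items():
    print(f"{model}: {injection_tax(mac, disagg):.1%} slower")
```

All three land between 20% and 21%, which supports reading the Gemma match with the Spark as a coincidence of rounding rather than a sign that the Mac was bypassed.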

[Bar chart: decode speed in tokens/sec, shown as Spark alone / Mac Studio alone / disaggregated. Llama 8B: 14 / 106 / 84. Qwen 32B: 23 / 29 / 23. Gemma 27B: 24 / 30 / 24.]
Figure 3. Decode speed across the three models. At 8B the Mac Studio crushes the Spark eight to one (106 vs 14) and disaggregated keeps most of that at 84. By 27B and 32B the Mac's bandwidth lead collapses to roughly 1.25x to 1.3x, and the disaggregated bar (after the ~20% injection tax) lands right on the Spark. The decode advantage of going hybrid shrinks exactly as models grow.

Why the Mac's lead evaporates at scale

The big finding is that the Mac's eightfold decode lead at 8B does not survive to 27B and 32B. Two different mechanisms do the work, and Ziskind names both. For Gemma, sliding window attention caps how much KV cache decode has to read per token, so memory bandwidth stops being the bottleneck and the Mac's bandwidth advantage stops mattering. For Qwen, vLLM kernel fusion and torch compilation on the Spark side dramatically cut the bandwidth demand per decode step. Different routes, same outcome: the Spark's decode gets relatively better and the Mac's bandwidth edge fades.
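The sliding window effect can be shown with a one line rule: a windowed attention layer reads at most the last `sliding_window` tokens of cache per generated token, while a full attention layer reads the whole history. The 4,096 window below is assumed to match Gemma 2's local attention layers, and the sketch models a single layer only. As far as is known, Gemma 2 alternates windowed and full layers, so the real saving is partial.

```python
def kv_tokens_read(context_len, sliding_window=None):
    """Tokens of KV cache one attention layer reads to produce one new token."""
    if sliding_window is None:
        return context_len                   # full attention: the whole history
    return min(context_len, sliding_window)  # windowed: only the recent slice


for context in (1_024, 4_096, 32_768):
    full = kv_tokens_read(context)
    windowed = kv_tokens_read(context, sliding_window=4_096)
    print(f"context {context:>6}: full {full:>6}, windowed {windowed:>6}")
```

Once the context passes the window, the windowed layer's read cost stops growing, which is the sense in which the cap takes pressure off memory bandwidth.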

That means the value of disaggregation flips from what intuition suggests. On prefill, the Spark's advantage over the Mac Studio grows with size: barely ahead at 8B (1,585 vs 1,420), but two to two and a half times faster at 27B and 32B. On decode, the Mac's advantage shrinks with size: about 8x at 8B, down to 1.25x to 1.3x at the larger models. Put together, disaggregation becomes more valuable at larger model sizes, not smaller ones. The prefill gap you are closing gets bigger, and the decode penalty you are paying gets smaller. As he puts it, the decode side of disaggregation gets cheaper at larger sizes, not because the Mac gets worse, but because the Spark gets relatively better.
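Those ratios can be recomputed directly from the reported throughputs. The `leads` helper is a name invented for this sketch; the numbers are the ones from the video.

```python
# (Spark alone, Mac Studio alone) in tokens/sec, from the video's results.
prefill = {"Llama 3.1 8B": (1585, 1420), "Qwen 2.5 32B": (875, 356), "Gemma 2 27B": (779, 379)}
decode = {"Llama 3.1 8B": (14, 106), "Qwen 2.5 32B": (23, 29), "Gemma 2 27B": (24, 30)}


def leads(model):
    """Return (Spark's prefill lead, Mac's decode lead) as plain ratios."""
    spark_pp, mac_pp = prefill[model]
    spark_tg, mac_tg = decode[model]
    return spark_pp / mac_pp, mac_tg / spark_tg


for model in prefill:
    pp_lead, tg_lead = leads(model)
    print(f"{model}: Spark prefill lead {pp_lead:.2f}x, Mac decode lead {tg_lead:.2f}x")
```

The prefill lead climbs from about 1.1x at 8B to between 2x and 2.5x at the larger sizes, while the decode lead falls from about 7.6x to about 1.25x, the two trends moving in opposite directions as the text describes.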

| Model | Prefill @ 4K (tok/s) | Decode (tok/s) | Time to first token |
| --- | --- | --- | --- |
| Llama 3.1 8B | Spark 1,585 · Mac 1,420 · disag 1,584 | Spark 14 · Mac 106 · disag 84 | 2.6 s |
| Qwen 2.5 32B | Spark 875 · Mac 356 · disag 792 | Spark 23 · Mac 29 · disag 23 | 5.2 s |
| Gemma 2 27B | Spark 779 · Mac 379 · disag 722 | Spark 24 · Mac 30 · disag 24 | 5.7 s |

Figure 4. The full results ledger. In every row the disaggregated prefill tracks the Spark and the time to first token is recovered to Spark class. Decode keeps most of the Mac's lead at 8B but, after the ~20% KV injection tax, collapses onto the Spark by 27B and 32B. The hybrid wins biggest exactly where the prefill gap is widest and the decode gap is narrowest.
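The time to first token column is consistent with the prefill column, which is a useful sanity check on the ledger. The sketch assumes the reported times were measured on the same 4,096 token prompt as the prefill rates, and it treats time to first token as nothing more than prompt length divided by disaggregated prefill throughput.

```python
PROMPT_TOKENS = 4096

# Disaggregated prefill rate (tok/s) and reported time to first token (s).
rows = {
    "Llama 3.1 8B": (1584, 2.6),
    "Qwen 2.5 32B": (792, 5.2),
    "Gemma 2 27B": (722, 5.7),
}


def prompt_read_seconds(prefill_tok_s, prompt_tokens=PROMPT_TOKENS):
    """Time to chew through the prompt at a given prefill rate."""
    return prompt_tokens / prefill_tok_s


for model, (rate, reported) in rows.items():
    print(f"{model}: {prompt_read_seconds(rate):.2f} s estimated, {reported} s reported")
```

The estimates come out at about 2.59, 5.17, and 5.67 seconds, each within a few hundredths of the reported value, so the two columns tell the same story.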

So does it work, and should you?

Yes, it works. Two machines doing what they are good at, talking over a 50 gigabit link, spitting out tokens faster than either could alone, with Spark class time to first token recovered every time (2.6 s at 8B, 5.7 s for Gemma 27B, 5.2 s for Qwen 32B). As a proof of concept for heterogeneous inference it is genuinely cool, and if you already own both machines, he says, go squeeze a little more juice out of that orange and apple.

But the closing verdict is unsentimental. The DGX Spark and the Mac Studio are not cheap, and if you are spending that kind of money on new desktop gear, he would rather buy a single, less portable but much more powerful RTX Pro 6000 and build a rig around it, a card with six times the GB10's bandwidth and three and a half times its compute that would likely beat the whole cluster on both ends. The disaggregation idea is real and runs in production at hyperscalers for exactly the cost reasons he describes. On a desk, with two consumer boxes, it is a beautiful trick that a single better card mostly obviates. "Just because you can doesn't mean you should," delivered as a result rather than a warning.

Notable quotes

The DGX Spark is incredible at processing your prompt, but considerably slower at generating tokens. The Mac mini is the opposite, slow to process your prompt, but fast at streaming the response. Alex Ziskind, 0:00

Just because you can doesn't mean you should. Alex Ziskind, 0:25

I'm not a systems programmer, I'm a web developer. I don't know anything about Rust networking code or libp2p multicast protocols. Alex Ziskind, 4:10

I pointed Claude Code at the pull request and gave it full SSH access to both of my machines and said, Make it so, number one. Alex Ziskind, 2:50

96% of the total time was just the network. The GPU was still idling waiting. Alex Ziskind, 6:30

Apple has already shipped with the built-in driver for Mellanox cards since 2019. Alex Ziskind, 7:45

The disaggregated setup gets GB10 class time to first token with Mac class decode. And you can see that neither machine actually achieves this alone. Alex Ziskind, 8:20

A single RTX Pro 6000, the workstation Blackwell card, would probably demolish the entire two-machine setup on both prefill and decode. Alex Ziskind, 8:40

Disaggregation becomes more valuable at larger sizes, not the smaller sizes. Alex Ziskind, 11:50

Two machines doing what they're good at, talking over a 50 gig link and spitting out tokens faster than either one could alone. That's the whole pitch and it finally delivers. Alex Ziskind, 12:05

The one idea to walk away with

Disaggregated inference is not a hack a YouTuber invented; it is how the biggest inference providers cut costs, by letting compute heavy prefill and bandwidth heavy decode each run on the hardware built for it. Rebuilt on a desk with a DGX Spark and a Mac Studio, it genuinely delivers GB10 prefill plus Mac decode in one pipeline, a combination neither box reaches alone. The lesson is that the win lives in the network, grows with model size, and still loses on price to a single better GPU. The trick is real. The reason to do it at home mostly is not.

Full transcript
The DGX Spark is incredible at processing your prompt, but considerably slower at generating tokens. The Mac mini is the opposite, slow to process your prompt, but fast at streaming the response. What if you could combine the best of both worlds? And that's what I tried. But just because you can doesn't mean you should. All right, here's what I got so far. Here's my setup. On one side I have the MSI Edge Expert. Basically, it's a GB10 just like the DGX Spark, just from a different vendor. It's got Nvidia's Blackwell GPU with 128 gigs of unified memory. On the other side, a Mac mini with M4 Pro, 64 gigs of unified memory. And of course, we'll try the Spark in a Mac Studio later on. When you run a large language model, there are two phases. Prefill, that's processing the entire prompt, or PP as I like to call it sometimes. >> [laughter] >> I don't know why I'm pointing. It's it's weird. It's like ah, nobody's there. It's not rude, right? It's just a camera. So, the prompt processing is compute heavy. That means a lot of that number crunching is going on on the GPU, and GPUs are great at that. The next stage is decode, which is token generation one by one. And this is memory bandwidth heavy, and Apple Silicon is great at that. Now, typically we do this on one machine when we're running it on our desk, but when you split it up, it's called disaggregated prefill and decode. And it's not just some academic concept. Companies like Deep Seek and ByteDance and a whole bunch of other ones already do this in production. Splitting these two phases lets you optimize each one independently, and it's one of those reasons inference costs have been going down. I've been watching the Exo project tease disaggregated prefill and decode for consumer hardware on Twitter for months now. They even put together a nice little blog post here describing the thing with nice animations, but they never actually released it. So, obviously I have all these machines sitting around on my desk. 
I tried making it work myself and it drove me nuts for months. I'm not a systems programmer, I'm a web developer. I don't know anything about Rust networking code or libp2p multicast protocols. That's not me. I just happen to have the hardware here and the capacity to test what's already out there. But I'm not smart enough to create this stuff from scratch. Then the community opened a couple of pull requests adding Blackwell support for pre-filled decode disaggregation. Now, it's experimental, untested on real hardware, maybe, I don't know, but the code was there. So, of course, I pointed Claude code at it the pull request and gave it full SSH access to both of my machines and said, "Make it so, number one." But of course, all these tools are really expensive. So, quick sponsor break and then we'll be right back. So, these days I'm always flipping between models. GPT for research, Claude for coding, Nano Banana for image generation, VEO Kling and Runway for video. Six tabs, six bills, and counting. Enter Chat LLM Teams. One dashboard houses every top LLM and route LLM picks the right one. GPT Mini for ultra-fast answers, Claude Sonnet for coding, Gemini Pro for massive context. They recently added Gemini 3 and GPT 5.1 the moment they dropped. Create professional presentations with graphs, charts, and deep research detailed content. Need human-sounding copy? Humanize rewrites text to defeat AI detectors. Need visuals? Pick Frontier or open-source models. Nano Banana, Midjourney, Flux for images, Magnific for upscaling, plus VEO WAN and Sora for video. All built in. You also get Abacus AI Deep Agent to pretty much do anything. Build full-stack apps, websites, reports with just text prompts, and deploy them on the spot. They have Abacus AI Desktop, which is the brand new coding editor and assistant that lets you vibe code and build production-ready apps. And the kicker? It's just $10 a month, less than one premium model. 
Head over to chat.abacus.ai or click the link below to level up with Chat LLM Teams. All right. And off it went to the races. SSH into both machines, starting setting everything up, installed UV, cloned the EXO repo, built the Rust networking bindings, compiled VLLM from source on ARM Linux. That took a while. CUDA kernels for Blackwell aren't small. On the Mac side, it built MLX from source with all the metal shader compilations, installed Node.js, and off we went. Easy, right? Well, it only took a few days. Initially, after about 30 minutes of compilation, both EXO instances came up. The dashboards loaded fine. First signs of life were there. Then came the hard part. See, EXO uses mDNS to discover peers on the network, and these two machines could just not see each other. I spent hours with Claude trying everything. Direct Ethernet cable, USB adapters, modifying the Rust networking layer, a Thunderbolt cable, and a good quality one, and nothing worked. Well, eventually it did, but I had to get some more hardware involved. Got to push it to the limit. Then I ran TCP dump and found the real problem. Apparently, libp2p's mDNS is broken on macOS. And the fix was simple. You just have the GB10 machine, the Spark, dial the Mac mini instead of waiting to be discovered. So, I added an environmental variable, set it on the MSI Edge Expert, and the connection came up instantly. Finally, with the cluster connected, I loaded the models. Started with Qwen 3.5 27B, nice decently sized model for these machines because on the GB10, I ran it in BF16. That's the Blackwell optimized attention backend. And we had to have the exact same model on the Mac mini, but in 4-bit using MLX cuz that's what's optimized on Apple silicon. Different quantizations on purpose. The GB10 has 128 gigs of memory, so it can run full precision, and full precision is faster for prefill compute. The Mac mini has the 4-bit version, and it's smaller, so the smaller model means faster decode. 
Each machine runs its optimized code and for what it's optimized for, the role that it's optimized for. Then, of course, it wasn't that simple. There were more bugs. Maybe that's why this PR is still open. But it was a good start. The prefill routing was broken now. The Mac Mini's runner process couldn't reach the GB10. And on and on like this we went. Each fix uncovered the next problem. But eventually, after a reboot and some creative workarounds, everything connected. So, I sent a long prompt to the Mac Mini and watched it route the prefill to the GB10. The Blackwell GPU chewed through the tokens and sent the KV cache back. It worked. The GB10 prefilled at 546 to 937 tokens per second depending on the prompt length. The Mac Mini locally, 66 tokens per second. Up to 14 times faster on the GPU. But then I looked at the actual end-to-end time and something was off. When I broke down where the time was going, the answer was obvious. Network transfer. You probably could have guessed that, right? At 25,000 tokens, the GB10 computed the KV cache in under a second. But transferring it over my 2.5 gigabit USB Ethernet adapter took 25 seconds. In other words, 96% of the total time was just the network. The GPU was still idling waiting. I also ran a three-way comparison, GB10 alone versus the Mac Mini alone versus the disaggregated. And yeah. Here I was using Qwen 3.5, which was a thinking model. So, it generates hidden reasoning tokens before the visible response. Those reasoning tokens run at decode speed, the same on all platforms. And they dominated the time to the first token. All three configs looked almost the same. To really show the difference, I needed two things. A faster network and a non-thinking model. So, uh a couple months ago, just for this very reason in fact, but I haven't shown this yet on the channel, I bought a Thunderbolt enclosure. It's an external enclosure, Thunderbolt 5 with a PCIe and it looks like this. I got a NIC in there.
Now, first I started off with this network card cuz I thought, "Hey, I'm going to put in my most powerful network card in there." This is an Intel 810 NIC, and it's a dual QSFP port connection at 100 gigabits. So, yeah, this is a nice fast network card. But, when I put this in and plugged into the Mac, macOS said, "Nope. Sorry, driver not installed. Dead end." Then, I dug into my drawer of network cards, and I found ah I found a bunch of these. And Don't stick your finger in the fan. Rule of thumb. Rule of all fingers. >> Stupid jokes. All right. I bought these cards when I was experimenting with my Framework cluster. You might have seen that video. But, this one right here happens to be a Mellanox ConnectX-4. Also QSFP port, so it's convenient. And it happens to be 50 gigs. So, not 100, but still. I plugged it in, and macOS recognized it immediately. See, Apple has already shipped with the built-in driver for Mellanox cards since 2019. I thought about trying this one, but this is only 25. So, if that one worked, hey, that's what we're going with. Now, on the other side with a QSFP connection, I have the MikroTik CRS812 switch, which I showed off in my DGX Spark cluster video. Getting the switch port to negotiate took some work, but once it linked up, KV cache transfer improved by about 30%. I imagine it could probably go even faster if we got a 100 gig card to work. Now, the non-thinking model. I switched to Llama 3.1 8B. Yeah, I know, I know. It's an oldie, but a goodie, and it's dense, and it works on everything. It's a good experiment model, okay? And I used Llama Bench, this tool right here, to measure prompt processing and token generation properly. So, it's going over the whole HTTP stack and responding to an API call. So, we got three configurations. GB10 alone in BF16. By the way, along the bottom, I have different prompt processing lengths. So, you can see that PP 2048 and PP 4096 have the best throughput up to almost 1,800 tokens per second there.
Then, I ran Mac Mini alone in 4-bit. Quite a bit lower there, but token generation down here is much faster, as you can see. And finally, disaggregated GB10 prefill plus Mac decode. Hmm. Okay. Okay. We're getting somewhere. Where are we getting? I I don't know, but we're getting somewhere. Especially when you don't look at these individually, cuz you can say, "Oh, what's the point of this?" If you look at PP 4096, for example, you just say, "Oh, just run it on the Spark, right?" Well, then you take a look at the token generation and see that it's actually faster in disaggregated. We're getting somewhere. We're getting somewhere. Now, time to first token. If we take a look at 4096 tokens, we're almost at 2.4 seconds. So, the 50 gigabit link adds almost zero overhead for comparing GB10 alone versus disaggregated. So, the disaggregated setup gets GB10 class time to first token with Mac class decode. And you can see that neither machine actually achieves this alone. So, is it worth it? And here's my honest answer. My disaggregated time to first token of 2.4 seconds matches the GB10 alone at 2.3. I didn't beat the GB10 on prefill. I matched it. And disaggregated decode of 34 tokens per second is actually a bit slower than the Mac Mini by itself at 52. And that's from the overhead of injecting the remote KV cache. And if I'm being really honest, a single RTX Pro 6000, the workstation Blackwell card, would probably demolish the entire two-machine setup on both prefill and decode. This thing has six times the memory bandwidth of the GB10 and three and a half times the compute. So, was it all just a cool experiment? Well, maybe, but there's one variable I haven't changed yet. See, the Mac mini M4 Pro has 273 GB per second of memory bandwidth. That's what determines the decode speed. But there's a machine on my desk, well, close to be, that has 819 GB per second, three times more. Oops. The M3 Ultra Mac Studio. 
If I swap the Mac mini for the Mac Studio, decode could jump from 34 tokens per second to over 100. And combined with the DGX Spark's 1,700 tokens per second of prefill, that's a setup that might actually be worth building. Excuse me. >> Would you pass the tokens? >> Only if you've like got the cash. So on my desk now, a DGX Spark and an Apple Mac Studio M3 Ultra. This one has 512 gigs of unified memory, same EXO code, same switch, same ConnectX-4 and ConnectX-7 cards. Only the silicon changed. I got it running, but of course, as usual, the setup was a pain. The Mac Studio's runner subprocess couldn't reach the Spark, so I recreated the ConnectX-4 as a proper network service and rebooted. Same ritual I've already done before. This time it only took a couple hours. So I know we have more capacity here, but I thought we'd start with the same Llama 3.1 8B, so we can compare apples to apples. Yes. Yes. That phrase actually works now. I've been waiting for so long for it to work. Remember the Mac mini decoded at 52 tokens per second. The Mac Studio, 106. So it's just about double, not the full 3X that the bandwidth spec would have us believe, but pretty close. The extra bandwidth isn't fully saturated at batch size one, I guess. Prefill at 4K or 4,096 tokens, Spark alone got 1,585 tokens per second. Mac Studio alone got 1,420. Not bad for Mac Studio actually. Disaggregated 1,584. The disagg column matches the Spark almost exactly. And the 50 gigabit link that we have going on here adds about 18 milliseconds of overhead. And disaggregated decode 84 tokens per second here. That's down from 106 on the Mac Studio because of the KV cache injection overhead. But still six times faster than the Spark alone at 14. So remember we got 34 tokens per second here in the first part where we did it with the Mac Mini. So this is two and a half times better. So the bandwidth hypothesis actually held at 8B. But the Mac Studio has a lot more RAM than the Mac Mini.
We shouldn't let that sit idle. Let's kick it up a notch: Llama 3.1 70B. Everybody's favorite 70B model. It's an oldie but a goodie. It's a dense model. The Mac Studio runs the MLX 4-bit quant, which is about 40 gigabytes on disk, so lots of room to spare. Now the Spark is the constraint at this point. The full-precision 70B model is 140 gigs and the Spark only has 128, so it's not going to fit. So I needed to find a quantized variant that the Spark would run.
FP8 is what I tried, and that one failed. The reason? Basically it had to do with the compilation of the CUTLASS matmul kernel, and in the vLLM version that we compiled we didn't use that. So FP8 is out. I tried the AWQ, or activation-aware quantization, INT4 version. That failed also. When it was running it was auto-converted to AWQ Marlin, and the Marlin repack kernel is missing. I also tried this one, W4A16. Same story. This is the GPTQ Marlin. Same missing kernel. So the vLLM wheel for Spark only ships BF16 and FP16 kernels. None of the quantization kernels got built for SM121, which is the Spark. So I'd have to completely rebuild vLLM from source with the kernels, and that would be a pain.
But not to worry. Even though 70B is off the table without a vLLM rebuild, we still have a bunch of models we can use that are larger than 8B and small enough for the Spark and the Mac Studio. We have 32 billion and 27 billion class models, and at BF16 they fit on the Spark comfortably. Qwen 2.5 32B: BF16 on the Spark and 4-bit on the Mac Studio. At this point, I'm running prefill at 4K. Spark alone gets 875 tokens per second on this one. Mac Studio alone gets 356, and disagg 792. So we're kind of seeing the same pattern as Llama 8B here, where disaggregated tracks the Spark, and that's a good sign. That's better than two times the Mac Studio's prefill speed. Then, on the decode side of the story, the Mac Studio, of course, kicks butt here: 29 versus the Spark's 23. So only a 1.3 times gap at this size.
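The sizing argument in this passage (140 gigs at full precision against 128 on the Spark) is plain arithmetic on parameter count and bits per weight. The sketch below counts weights only and ignores KV cache, activations, and runtime overhead, so a model that "fits" here still needs headroom in practice. The helper name `weight_gb` is made up for illustration.

```python
# Weight-only memory estimate. ASSUMPTION: footprint ~= params * bits / 8,
# ignoring KV cache, activations, and quantization metadata.

SPARK_MEMORY_GB = 128  # unified memory on the Spark, as stated in the video


def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the weights in gigabytes."""
    return params_billion * bits_per_weight / 8


MODELS = {"Llama 3.1 70B": 70, "Qwen 2.5 32B": 32, "Gemma 2 27B": 27}

for name, params in MODELS.items():
    bf16 = weight_gb(params, 16)
    four_bit = weight_gb(params, 4)
    verdict = "fits" if bf16 <= SPARK_MEMORY_GB else "does not fit"
    print(f"{name:<14} BF16 {bf16:>5.0f} GB ({verdict} on Spark), 4-bit {four_bit:>4.1f} GB")
```

The 4-bit estimate for 70B comes out near 35 GB; the roughly 40 GB on disk quoted in the video is plausibly explained by quantization scales and layers kept at higher precision, though the video does not break that down.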
And that's down from the eight times gap we saw with the 8 billion parameter model. Now, hold on to that. The next model tells the same story. I went with Gemma 2 27B, again BF16 on the Spark and the 4-bit MLX version on the Mac Studio. Very similar-looking chart, right? But slightly different numbers. The Spark gets 779 on prefill, Mac Studio 379, and disagg 722. Same shape as before. I call this the upside-down middle finger. I probably shouldn't call it that, but that's what it looks like. On the decode side, we've got the Mac Studio at 30 and the Spark at 24. And that's only a one and a quarter times gap here. So, two different architectures, Gemma and Qwen, and we're seeing the same behavior. And I have the same question: what happened to the Mac's eight times decode lead from 8B?
And by the way, I see that there's 24 here and 24 here. Before anyone asks, yes, I checked. Disaggregated looks like it may have skipped the Mac altogether here, but it didn't. Every model I tested lost about 20% of decode speed to KV cache injection overhead. So 30 minus 20% is 24. It just happens to match the Spark number exactly, by coincidence. So in this particular case, if we ran it on the Spark, we probably would have gotten the same numbers.
See, at 8 billion, decode is almost purely bandwidth bound. The Mac's three times bandwidth just wins. At 27 and 32 billion, though, two things shift around. For Gemma, sliding window attention caps how much KV cache decode has to read per token, so bandwidth stops being the bottleneck. For Qwen, vLLM kernel fusion and torch compilation on the Spark side dramatically cut the bandwidth demand per decode step. So we've got slightly different mechanisms, same outcome: the Spark's decode gets relatively better, and the Mac's bandwidth advantage stops mattering as much.
All right. Now, what do these charts look like to you? Huh? We've got three models, three sizes, three architectures. What do they all have in common?
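Before the answer, the "about 20%" KV cache injection figure can be checked against the decode numbers quoted so far. This is a minimal sketch that treats the overhead as a flat fractional loss applied to the Mac Studio's standalone decode rate; that flat-loss model is a simplification of whatever the real overhead looks like, and the function names are hypothetical.

```python
# Checking the "about 20%" KV cache injection loss against quoted numbers.
# ASSUMPTION: the overhead behaves like a flat fractional loss on decode.

KV_INJECTION_LOSS = 0.20  # approximate figure stated in the video


def disagg_decode(mac_alone_tps: float, loss: float = KV_INJECTION_LOSS) -> float:
    """Predicted disaggregated decode rate after the injection loss."""
    return mac_alone_tps * (1 - loss)


def implied_loss(mac_alone_tps: float, disagg_tps: float) -> float:
    """Fractional loss implied by a measured pair of decode rates."""
    return 1 - disagg_tps / mac_alone_tps


gemma_predicted = disagg_decode(30)   # Gemma 2 27B, Mac Studio alone at 30
llama_predicted = disagg_decode(106)  # Llama 3.1 8B, Mac Studio alone at 106
llama_implied = implied_loss(106, 84) # measured 84 in disaggregated mode

print(f"Gemma 27B predicted disagg decode: {gemma_predicted:.1f} tokens/s")
print(f"Llama 8B predicted disagg decode:  {llama_predicted:.1f} tokens/s")
print(f"Llama 8B implied loss:             {llama_implied:.1%}")
```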
Well, in every case, the disaggregated time to first token tracks the Spark. Llama 8B, 2.6 seconds. Gemma 27B, 5.7. And Qwen 32B, 5.2 seconds. So the Spark-class prefill is always recovered. And the prefill advantage over the Mac Studio grows with model size. It's pretty clear here. You can see that at 8B, the Spark is barely ahead, 1,585 versus the Mac Studio's 1,420. At 27 and 32B, the Spark is two to two and a half times faster. So disaggregation becomes more valuable at larger sizes, not the smaller sizes.
Now, decode is where it gets interesting. At 8 billion, the Mac Studio is eight times faster than the Spark on decode. That's huge. That's the bandwidth story playing out clearly. At 27 and 32 billion, the gap shrinks to between one and a quarter and one and a third times. So the decode side of disaggregation gets cheaper at larger sizes. Not because the Mac gets worse, but because the Spark gets relatively better.
So, does it work? Yeah, it works. Two machines doing what they're good at, talking over a 50 gig link and spitting out tokens faster than either one could alone. That's the whole pitch, and it finally delivers. So, look, as a proof of concept for heterogeneous inference, this is really cool. And if you already own both of these machines, great. Go squeeze a little more juice out of that orange and apple. But realistically, the DGX Spark and the Mac Studio are not cheap. If you're spending that kind of money on new desktop gear, I'd honestly rather just get the less portable but much more powerful RTX Pro 6000 and build out a rig around that. In fact, I did a whole video comparing that to the Mac Studio right over here. Thanks for watching and I'll see you next time.
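For reference, the scaling trend in the wrap-up can be recomputed from the standalone numbers quoted in the video. The sketch below only divides the quoted throughputs; the dictionary layout and the key names (`spark_pp`, `mac_pp`, `spark_tg`, `mac_tg`) are made up for illustration.

```python
# Prefill and decode ratios from the standalone numbers quoted in the video.
# pp = prefill tokens/s at 4,096 tokens, tg = decode (token generation) tokens/s.

RESULTS = {
    "Llama 3.1 8B": {"spark_pp": 1585, "mac_pp": 1420, "spark_tg": 14, "mac_tg": 106},
    "Gemma 2 27B":  {"spark_pp": 779,  "mac_pp": 379,  "spark_tg": 24, "mac_tg": 30},
    "Qwen 2.5 32B": {"spark_pp": 875,  "mac_pp": 356,  "spark_tg": 23, "mac_tg": 29},
}


def ratios(row: dict) -> tuple[float, float]:
    """Return (Spark prefill advantage, Mac Studio decode advantage)."""
    return row["spark_pp"] / row["mac_pp"], row["mac_tg"] / row["spark_tg"]


print(f"{'model':<14} {'Spark prefill lead':>20} {'Mac decode lead':>17}")
for model, row in RESULTS.items():
    prefill_lead, decode_lead = ratios(row)
    print(f"{model:<14} {prefill_lead:>19.2f}x {decode_lead:>16.2f}x")
```

The prefill lead grows with model size while the decode lead shrinks, which is the pattern described above.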