When we talk about the cost of AI infrastructure, the focus is usually on Nvidia and GPUs, but memory is an increasingly important part of the picture. As hyperscalers prepare to build out billions of dollars' worth of new data centers, the price of DRAM chips has jumped roughly 7x in the last year.
At the same time, there's a growing discipline in orchestrating all that memory to make sure the right data gets to the right agent at the right time. The companies that master it will be able to make the same queries with fewer tokens, which could be the difference between folding and staying in business.
Semiconductor analyst Dan O’Laughlin has an interesting look at the importance of memory chips on his Substack, where he talks with Val Bercovici, chief AI officer at Weka. They're both semiconductor guys, so the focus is more on the chips than on the broader architecture, but the implications for AI software are pretty significant too.
I was particularly struck by this passage, in which Bercovici looks at the growing complexity of Anthropic's prompt-caching documentation:
The tell is if we go to Anthropic's prompt caching pricing page. It started off as a very simple page six or seven months ago, especially as Claude Code was launching: just "use caching, it's cheaper." Now it's an encyclopedia of advice on exactly how many cache writes to pre-buy. You've got five-minute tiers, which are very common across the industry, or one-hour tiers, and nothing above. That's a really important tell. Then of course you've got all sorts of arbitrage opportunities around the pricing for cache reads based on how many cache writes you've pre-purchased.
The question here is how long Claude holds your prompt in cached memory: you can pay for a five-minute window, or pay more for an hour-long window. It's much cheaper to draw on data that's still in the cache, so if you manage it right, you can save an awful lot. There's a catch, though: every new piece of data you add to the query could bump something else out of the cache window.
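To make the mechanics concrete, here's a minimal sketch of what prompt caching looks like from the developer's side, using Anthropic's Python SDK. The `cache_control` field and the "ephemeral" cache type are part of Anthropic's documented API, but the model ID and file name below are placeholders, and the one-hour option is flagged as an assumption in the comments, so treat this as an illustration rather than a pricing guide.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A large, stable block of context (docs, a codebase, policies) that many
# queries will share. This is the part worth caching.
reference_material = open("project_docs.md").read()  # placeholder file name

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # substitute whatever current model you use
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You answer questions about the project material below.",
        },
        {
            "type": "text",
            "text": reference_material,
            # Mark this block as cacheable. The first request pays the pricier
            # cache-write rate; later requests that reuse the identical prefix
            # within the cache window pay the much cheaper cache-read rate.
            # The default window is roughly five minutes; a one-hour window can
            # reportedly be requested with {"type": "ephemeral", "ttl": "1h"},
            # though you should confirm that option against the current docs.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Summarize the deployment steps."}],
)

# The usage object reports cache writes and cache reads separately, which is
# exactly what the pricing tiers in the quoted passage are carving up.
print(response.usage)
```

The catch described above shows up here too: change anything in the cached prefix and the next request pays the full cache-write rate all over again, which is why deciding what goes into that prefix is an orchestration problem and not just a billing detail.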
This is complex stuff, but the upshot is simple enough: managing memory in AI models is going to be a big part of AI going forward. Companies that do it well are going to rise to the top.
And there's plenty of progress to be made in this new field. Back in October, I covered a startup called TensorMesh that was working on one layer in the stack known as cache optimization.
Opportunities exist in other parts of the stack, too. For instance, lower down the stack, there's the question of how data centers are using the different types of memory they have. (The interview includes a good discussion of when DRAM chips are used instead of HBM, although it's pretty deep in the hardware weeds.) Higher up the stack, end users are figuring out how to structure their model swarms to take advantage of the shared cache.
As companies get better at memory orchestration, they'll use fewer tokens and inference will get cheaper. Meanwhile, models are getting more efficient at processing each token, pushing costs down still further. As server costs drop, a lot of applications that don't seem viable now will start to edge into profitability.
