YouTube:https://www.youtube.com/watch?v=EdPGlXsgkwg
Text:
My name is Torry Steed. I'm a Product Marketing Manager at SMART Modular.
I'm primarily focused on our CXL product lines, and that will be the focus of today's presentation. We've already had some excellent talks about the benefits, so I may move through those quickly, because I think they've been covered very well. But first, a quick introduction to SMART. SMART Modular is one of the divisions of SMART Global Holdings; SGH is the ticker symbol. SMART Modular focuses on advanced memory solutions, and we have been involved in CXL development for the past three years or so. You can see some of the other information about the company there. The only other thing I'll mention is that we have a sister division, Penguin Solutions, which is a system developer. There's actually a lot of good synergy between SMART Modular, which makes the components, and Penguin, which makes the in-system solutions.
Moving right on to CXL. I think we can essentially skip over this; you've seen it several times. The gist is that CXL operates over the physical PCIe lanes and comprises three protocols. Depending on how you combine those protocols, you get different types of devices. The first to hit the market are these memory expansion devices, which everyone has been focusing on. But I do want to say that we expect CXL to be adopted going forward not just for accelerators, but for storage, networking, and everything else. There are advantages there, so do expect to see that, but it's a little ways out.
Right now, we're focusing on memory expansion. Ravi covered the advantages really well earlier, so I'll just do a super high-level version. Increased capacity is obviously a big one: once you fill up your DIMM sockets, you're out of room for that particular CPU, but by connecting memory over CXL you can increase your memory capacity further. That has performance benefits for many different applications. The next one is that you can actually reduce cost. There was a question earlier about the cost difference between a CXL memory module and direct-attached DRAM on a DIMM. The reality is that the CXL module is always going to be a little more expensive, because it includes a controller that converts the serial CXL interface to the parallel interface used to talk to the DRAM, and that controller costs money. But Ravi made a really good point: when you look at the overall system, you can actually bring the total cost of the memory down. I'll cover how that works in more detail coming up. The final one, which has also been covered very well, is increased bandwidth. By adding CXL memory, you essentially get an additional memory lane, a superhighway to talk to memory in a different way, and you get further benefits with the interleaving options Ravi covered. That's pretty exciting. Even though the latency of CXL devices is higher than direct-attached DRAM, you can get an overall boost to the system using CXL.
These are some of the applications driving the need for CXL memory expansion. Artificial intelligence is obviously a huge one that's been covered in great detail. But even in-memory database applications, which have been around for a very long time, can get significant benefits from CXL. The same goes for video processing, a growing field with the widespread deployment of security cameras, and even biotech, where DNA sequencing machines, whose cost has come down dramatically in recent years, generate lots of data that needs to be processed in real time. Having large amounts of memory is highly valuable for those applications as well. So a lot of different things are driving CXL. The real point is that it's not just AI driving the need for more memory; it's really all of these coming together.
This is how we expect CXL to be adopted in the industry. The first method, which we've talked about in great detail, is adding memory to the server itself: CXL direct-attached to the CPUs. There are a lot of benefits to this that we just covered. But in the future we also see CXL expansion becoming disaggregated, through memory pooling and sharing. We see both being adopted going forward, starting with direct-attached memory and moving to pooling and sharing. Even during that time period, we will still see direct-attached CXL memory, and we will still see direct-attached DRAM, as was mentioned earlier this morning. It will become a tiering: multiple tiers of memory that all work together in conjunction to fill in that gap in the memory hierarchy.
When is it all going to come together? The good news is that the spec is largely done. The CXL 3.1 spec was released at the end of last year. It continues to be refined with added features and clarifications, but it's largely done. We also have CXL controllers available now from a number of different companies, which I think we're going to hear more about later today. Most of them support the CXL 2.0 feature set, but that's more than sufficient for both direct-attached use and memory pooling behind a switch. In 2025, we expect the next generation of CXL controllers to emerge, supporting the CXL 3.x standard and adding those additional capabilities. Then, for the major CPU vendors, here are their rough release timelines and which versions of CXL each supports. We see direct-attached memory going into production this year; we ourselves at SMART have a couple of products, which I'm going to talk about, that will be released and in production this year. We see memory pooling taking off more next year as a second phase of CXL adoption, but we believe the two will exist in parallel and go forward together. Let's talk about what SMART is actually doing.
We've got several product lines all looking to be released in 2024. The top row shows our add-in cards, in the traditional PCIe add-in card form factor, similar to a graphics card or a network interface card. We have two in the lab and in bring-up now: an 8-DIMM add-in card and a 4-DIMM add-in card, both DDR5 solutions. We are looking at doing DDR4 add-in cards as well, but those would be introduced late this year. We're also looking at a low-profile card, because for certain systems that is a much friendlier form factor, and we're looking at introducing those as well. We also have the E3 form factor, similar to what Micron showed earlier, with a couple of small differences in capacity and so on that I'll talk about in a minute. Finally, we're introducing a non-volatile version of a CXL memory expansion device. Those of you familiar with NVDIMMs might be aware that they are not moving forward into the DDR5 time frame. Instead, we're going to introduce NV capability on a CXL device, and we're working with the industry to get that standardized. The good news is that the CXL spec already includes a lot of the hooks necessary to make that work, so that's making good progress.
This is the eight-DIMM add-in card. You can see from the block diagram that, in order to get eight DIMMs on this card, we use two CXL controllers. Each is a x16-capable controller, but because they share the same mechanical slot, which is limited to x16, they each operate in x8 mode. You essentially have two ports on this mechanical connector, each a x8 link talking to one of the controllers. Behind that, each controller has two channels with two DIMMs per channel, getting you to the total of eight DIMMs on this card. The important note is that this requires a system that supports bifurcation of the bus: you have to be able to plug this in and have the system identify that there are essentially two devices on that port. That is supported by the latest Intel and AMD systems. It's a full-height, half-length, dual-width card; the DIMMs sticking out horizontally from the PCB require a double-width slot. Also, if you saw my form-factor presentation yesterday, you might remember that the maximum power delivery through the edge connector is typically 75 watts. Each of these DIMMs is probably on the order of 10-plus watts itself, and we have eight of them, so you can see immediately that we've exceeded that 75-watt limit. That's why there's an auxiliary power connector here: a standard PCIe auxiliary power connector, like you might see on a graphics card, which is necessary to get enough power to this device for it to work.
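The power-budget point above can be sanity-checked with some quick arithmetic. This is only a sketch; the 10 W per DIMM and 75 W slot limit are the approximate figures quoted in the talk, not datasheet numbers.

```python
# Rough power-budget check for the 8-DIMM CXL add-in card.
# Figures are the approximate values from the talk, not datasheet numbers.
EDGE_CONNECTOR_LIMIT_W = 75   # typical max power through a PCIe edge connector
DIMM_POWER_W = 10             # ~10+ W per DDR5 RDIMM under load
NUM_DIMMS = 8

dimm_total = NUM_DIMMS * DIMM_POWER_W            # 80 W for the DIMMs alone
shortfall = dimm_total - EDGE_CONNECTOR_LIMIT_W  # already over budget

print(f"DIMMs alone draw ~{dimm_total} W against a {EDGE_CONNECTOR_LIMIT_W} W slot limit")
print(f"Shortfall before even counting the two CXL controllers: {shortfall} W")
```

The DIMMs alone exceed the slot budget before the two controllers are counted, which is why the auxiliary power connector is mandatory on this card.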
Moving on, this is the four-DIMM add-in card. It's a little hard to see, but there are two DIMMs stacked on top of each other facing down towards the connector, and another two facing up. This uses a single CXL controller with two memory channels, also DDR5. The advantage is that this does not require bifurcation: you get the full x16 interface talking to the four DIMMs behind the controller. It also fits in a single-width add-in card space, which has some advantages, since cards are often installed horizontally in the system. Not much more to add there.
This one, like I said, is in feasibility now for us. It's a low-profile card, also DDR5, but because of the low profile we need to use two x8 CXL controllers to get dimensions small enough for it to fit. So this would use a bifurcated bus as well, but otherwise it's very similar to the eight-DIMM card, just half the height and smaller.
Okay, so this is one example of how to look at the overall system cost when you're considering CXL. In this example, you're trying to design a system with one terabyte of memory per socket. To do that in a system that supports eight DIMMs per CPU, which is a pretty typical loadout for DDR5, you would need to populate it with 128-gigabyte DIMMs to get to a terabyte. And because 128-gigabyte DIMMs today require stacked (3DS) DRAM — soon we'll have 32-gigabit DRAM, which will move the bar up a little, but for today you need the stacked 3DS DRAM — there's a significant price increase whenever you introduce that component. If you look at DDR4 and DDR5 module pricing, you'll see the price spike once you hit a certain density, and that's why. If you can instead populate your system with more cost-effective modules, in this example 64-gigabyte modules, you get a significant cost savings. So instead of eight 128-gigabyte modules, you use sixteen 64-gigabyte modules: eight on the motherboard plus eight on an add-in card, in this case our eight-DIMM add-in card. Even with the addition of that card, you get a significant cost savings, just because of that price difference in the memory.
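The structure of that comparison can be sketched numerically. The dollar figures below are illustrative placeholders, not real quotes; only the shape of the comparison (the 3DS density premium versus commodity DIMMs plus a card) comes from the talk.

```python
# Sketch of the 1 TB-per-socket cost comparison.
# Prices are hypothetical placeholders, NOT real market quotes.
TARGET_GB = 1024
DIMM_SLOTS_PER_CPU = 8  # typical DDR5 loadout

def config_cost(dimm_gb, dimm_price, extra_cards=0, card_price=0):
    """Cost of reaching TARGET_GB with identically sized DIMMs."""
    n_dimms = TARGET_GB // dimm_gb
    return n_dimms * dimm_price + extra_cards * card_price

# Option A: 8 x 128 GB 3DS RDIMMs (stacked DRAM carries a steep premium)
cost_a = config_cost(128, dimm_price=1500)
# Option B: 16 x 64 GB RDIMMs (8 on the board + 8 on an 8-DIMM CXL card)
cost_b = config_cost(64, dimm_price=400, extra_cards=1, card_price=800)

print(f"128 GB 3DS config:  ${cost_a}")
print(f"64 GB + card config: ${cost_b}")
```

With these placeholder prices, Option B comes out well ahead even after paying for the add-in card; the conclusion holds whenever the per-gigabyte premium on 3DS DIMMs exceeds the amortized card cost.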
This is the E3.S CMM that SMART introduced last year. It's a 64-gigabyte capacity, DDR5 behind the CXL controller. As Robbie from Micron mentioned, there are different choices you can make there; this example is a DDR5 module behind a x16 interface. It's a little more niche, just because it's rarer today to find a x16 interface at the front of a system to support an E3 module, but it does have bandwidth advantages if the system supports it. We're looking at releasing a new version using 24-gigabit DRAM, which will increase the capacity to 96 gigabytes, and when 32-gigabit DRAM is available, it will support 128 gigabytes.
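The capacity roadmap above follows from module capacity scaling linearly with DRAM die density when the chip count stays fixed, which matches the three figures given:

```python
# Module capacity scales with DRAM die density for a fixed chip count.
BASE_CAPACITY_GB = 64   # shipping E3.S CMM, built from 16 Gb DRAM
BASE_DENSITY_GBIT = 16

for density_gbit in (16, 24, 32):
    capacity = BASE_CAPACITY_GB * density_gbit // BASE_DENSITY_GBIT
    print(f"{density_gbit} Gb DRAM -> {capacity} GB module")
```

This reproduces the 64 → 96 → 128 GB progression the roadmap describes.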
Okay, I want to spend a little time on this one, because there have been some good questions before when I've talked about this device. We are also introducing an NVCMM, a non-volatile CXL memory module, which has both DRAM and NAND on it. However, during runtime the host system only accesses the DRAM.
So it acts just like a standard device during runtime. The host system has access to the full capacity; in this case it's a 32-gigabyte module, so it has access to the full 32 gigabytes. The NAND is only used in the event of a power failure. When system power fails, the host notifies the NVCMM that power is being lost, and the DRAM data is copied into the NAND flash by SMART's NV controller. It uses an onboard energy source module — that's what ESM stands for — made up of supercapacitors that power the module long enough to do that data transfer. When system power comes back, we copy the data back from the NAND into the DRAM and let the host know that everything is back up and running. As far as the host system knows, the data was never gone.
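That save/restore sequence can be sketched as a small state machine. This is a minimal illustration of the flow described above; the class and method names are invented for the example and are not SMART's firmware API.

```python
# Illustrative sketch of the NVCMM power-fail save/restore sequence.
# Names are hypothetical; only the flow mirrors the talk's description.
class NVCMM:
    def __init__(self):
        self.dram = {}           # host-visible memory (full capacity)
        self.nand = None         # backup image, touched only on power loss
        self.esm_charged = True  # supercapacitor energy source module

    def host_write(self, addr, value):
        # During runtime the host only ever touches DRAM,
        # so the NAND adds no latency to the data path.
        self.dram[addr] = value

    def power_fail(self):
        # Host notifies the module that power is being lost;
        # the ESM powers the DRAM -> NAND copy.
        assert self.esm_charged, "ESM must hold enough energy for the save"
        self.nand = dict(self.dram)
        self.dram = {}           # DRAM contents vanish when power drops

    def power_restore(self):
        # NAND is copied back; the host sees its old data intact.
        self.dram = dict(self.nand)

m = NVCMM()
m.host_write(0x1000, 0xAB)
m.power_fail()
m.power_restore()
print(hex(m.dram[0x1000]))  # the data survived the power cycle
```

The key property is that the NAND sits entirely off the runtime data path, which is also why there is no tail-latency penalty (see the Q&A below).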
Hey, Torry, we have a question for you.
Yeah.
Which use cases are driving the CMM-N... the NVCMM? So, the primary — there's a second part to the question: how does the latency impact the workloads and applications that you mentioned?
Okay. For the first part of that question, it's the same use cases you would have seen for NVDIMMs in the past: write caches, in-memory databases, database tables used for logging so that you can rebuild, and checkpointing. Those types of applications are driving NV. And then, sorry, repeat the second part of the question for me.
Yeah. The second part is how does the tail latency impact the workloads and application that you mentioned?
There is no tail latency, because the host is always executing out of DRAM. Like you mentioned, the NAND flash is only used if power suddenly goes away.
The question there is about the CXL x8 controller. Don't you need two DDR4 channels behind it?
Yeah, this block diagram is not that detailed, so it could be.
With the given capacity, a single channel is good enough: 32 gig.
That's right.
No, no, it's not about capacity. Otherwise, you won't saturate the CXL x8 bandwidth, right? One x8 link needs two DDR4 channels behind it, or one DDR5 channel. That's my question.
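The questioner's point checks out on the back of an envelope. The figures below are nominal peak rates, ignoring encoding and protocol overhead, so treat them as approximations:

```python
# Back-of-envelope check: can the DRAM behind a CXL x8 link saturate it?
# Nominal peak figures, ignoring encoding/protocol overhead.
PCIE_GEN5_PER_LANE_GBPS = 4.0          # ~32 GT/s -> roughly 4 GB/s per lane
cxl_x8 = 8 * PCIE_GEN5_PER_LANE_GBPS   # ~32 GB/s per direction

ddr4_3200_channel = 3.2 * 8            # 25.6 GB/s per 64-bit channel
ddr5_4800_channel = 4.8 * 8            # 38.4 GB/s per channel

print(f"CXL x8 link:         ~{cxl_x8:.1f} GB/s")
print(f"One DDR4-3200 chan:   {ddr4_3200_channel:.1f} GB/s (cannot saturate the link)")
print(f"Two DDR4-3200 chans:  {2 * ddr4_3200_channel:.1f} GB/s (can)")
print(f"One DDR5-4800 chan:   {ddr5_4800_channel:.1f} GB/s (can)")
```

One DDR4-3200 channel falls short of a Gen5 x8 link, while two DDR4 channels or a single DDR5 channel exceed it, which is exactly the mismatch the question is raising about the block diagram.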
Why can't we have higher capacities here? Is it because the time available to take the backup limits you to 32 gig only, or...
Yeah, right. That's a very good question, and you're starting to hit on it. The reason we can't have increased capacity is because we're integrating the energy source module, the supercaps. They are quite large, so they take up a portion of the PCB. It's PCB-space limited, and we could go to higher-density DRAM. This is our first offering of the NV device, kind of the first generation. We do have architectural plans to go to DDR5 and to increase capacity, and we have some ideas about how we would do that, but initially it's limited by board space and by the capacity of the ESM. Those are great questions, though, and thank you, Anil, for helping me with the first question. All right.
These are some of the reasons why SMART is well situated to develop these CXL modules. We have excellent partnerships with many of the companies developing controllers, and we've been making complex memory modules for a long time, including NVDIMMs, which have controllers and firmware. We are DRAM and NAND experts, and we have been working on many of the other protocols; as mentioned in this morning's presentation, many of those protocols are converging to CXL, and we've been involved in that for many years. So it's a good fit for us to develop these products.
By way of summary, you can visit our website for white papers and additional information on the CXL products we're developing, or you can reach out to me directly; my contact information is there.