* [LSF/MM/BPF BoF]: A host FTL for zoned block devices using UBLK
From: Hans Holmberg @ 2023-02-06 10:00 UTC
To: linux-block@vger.kernel.org
Cc: ming.lei@redhat.com, Matias Bjørling, Damien Le Moal, Dennis Maisenbacher, Ajay Joshi, Jørgen Hansen, andreas@metaspace.dk, javier@javigon.com, slava@dubeyko.com, kbusch@kernel.org, hans@owltronix.com, mcgrof@kernel.org, guokuankuan@bytedance.com, viacheslav.dubeyko@bytedance.com, hch@lst.de

I think we're missing a flexible way of routing random-ish write workloads onto zoned storage devices. Implementing a UBLK target for this would be a great way to provide zoned storage benefits to a range of use cases. Creating a UBLK target would let us experiment and move fast, and when we arrive at a common, reasonably stable solution we could move it into the kernel.

We do have dm-zoned [3] in the kernel, but it requires a bounce on conventional zones for non-sequential writes, resulting in a write amplification of 2x (which is not optimal for flash).

Fully random workloads make little sense to store on ZBDs, as a host FTL could not be expected to do better than what conventional block devices do today. Fully sequential writes are also well taken care of by conventional block devices.

The interesting stuff is what lies in between those extremes.

I would like to discuss how we could use UBLK to implement a common FTL with the right knobs to cater for a wide range of workloads that utilize raw block devices. We had some knobs in the now-dead pblk, an FTL for open-channel devices, but I think we could do way better than that.

Pblk did not require bouncing writes and had knobs for over-provisioning and workload isolation, which could be implemented here as well. We could also add options for different garbage collection policies. In userspace it would also be easy to support default block indirection sizes, reducing logical-physical translation table memory overhead.

Use cases for such an FTL include SSD caching stores such as Apache Traffic Server [1] and CacheLib [2]. CacheLib's block cache and the Apache Traffic Server storage workloads are *almost* zoned block device compatible and would need little translation overhead to perform very well on e.g. ZNS SSDs.

There are probably more use cases that would benefit.

It would also be a great research vehicle for academia. We've used dm-zap [4] for this purpose for the last couple of years, but it is not production-ready, and it is cumbersome to improve and maintain because it is implemented as an out-of-tree device mapper.

ublk adds a bit of latency overhead, but I think this is acceptable at least until we have a great, proven solution, which could be turned into an in-kernel FTL.

If there is interest in the community for a project like this, let's talk!

cc:ing the folks who participated in the discussions at ALPSS 2021 and last year's Plumbers on this subject.

Thanks,
Hans

[1] https://trafficserver.apache.org/
[2] https://cachelib.org/
[3] https://docs.kernel.org/admin-guide/device-mapper/dm-zoned.html
[4] https://github.com/westerndigitalcorporation/dm-zap
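To make the proposal a bit more concrete, here is a minimal sketch of what a host FTL does at its core: it turns random-ish logical writes into sequential appends to zones, and keeps a logical-to-physical (L2P) table to find the data again. It also shows why a larger indirection size shrinks that table. Everything here is an assumption for illustration: the names `ToyZonedFtl` and `l2p_table_bytes`, the flat table with 4-byte entries, and the absence of garbage collection. It is not how pblk, dm-zap or any ublk target is implemented.

```python
from dataclasses import dataclass, field


def l2p_table_bytes(capacity_bytes: int, indirection_bytes: int,
                    entry_bytes: int = 4) -> int:
    """Memory for a flat L2P table (hypothetical sizing model).

    One entry per indirection unit; entry_bytes=4 is an assumption.
    """
    entries = capacity_bytes // indirection_bytes
    return entries * entry_bytes


@dataclass
class ToyZonedFtl:
    """Append-only toy FTL: every write lands at the open zone's write pointer."""
    zone_blocks: int                          # blocks per zone
    num_zones: int
    l2p: dict = field(default_factory=dict)   # logical block -> (zone, offset)
    open_zone: int = 0
    write_pointer: int = 0                    # next free offset in the open zone

    def write(self, lba: int) -> tuple:
        if self.write_pointer == self.zone_blocks:
            # The open zone is full: move on to the next one.
            self.open_zone += 1
            self.write_pointer = 0
            if self.open_zone == self.num_zones:
                raise RuntimeError("out of zones; a real FTL would garbage collect")
        location = (self.open_zone, self.write_pointer)
        self.l2p[lba] = location              # any older copy becomes stale
        self.write_pointer += 1
        return location

    def read(self, lba: int) -> tuple:
        return self.l2p[lba]


if __name__ == "__main__":
    ftl = ToyZonedFtl(zone_blocks=4, num_zones=2)
    for lba in [7, 3, 7, 100, 1]:             # random-ish logical addresses
        print(lba, "->", ftl.write(lba))      # physical locations are sequential
    tib = 1 << 40
    print(l2p_table_bytes(tib, 4096) >> 20, "MiB of L2P for 4 KiB units")
    print(l2p_table_bytes(tib, 65536) >> 20, "MiB of L2P for 64 KiB units")
```

Under these assumptions a 1 TiB device needs 1024 MiB of table at a 4 KiB indirection size and 64 MiB at 64 KiB, which is the kind of memory saving the proposal refers to.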
* Re: [LSF/MM/BPF BoF]: A host FTL for zoned block devices using UBLK
From: Ming Lei @ 2023-02-06 12:49 UTC
To: Hans Holmberg
Cc: linux-block@vger.kernel.org, Matias Bjørling, Damien Le Moal, Dennis Maisenbacher, Ajay Joshi, Jørgen Hansen, andreas@metaspace.dk, javier@javigon.com, slava@dubeyko.com, kbusch@kernel.org, hans@owltronix.com, mcgrof@kernel.org, guokuankuan@bytedance.com, viacheslav.dubeyko@bytedance.com, hch@lst.de, ming.lei

On Mon, Feb 06, 2023 at 10:00:20AM +0000, Hans Holmberg wrote:
> I think we're missing a flexible way of routing random-ish
> write workloads on to zoned storage devices. Implementing a UBLK
> target for this would be a great way to provide zoned storage
> benefits to a range of use cases. Creating UBLK target would
> enable us experiment and move fast, and when we arrive
> at a common, reasonably stable, solution we could move this into
> the kernel.

Yeah, UBLK provides an easy way to prototype quickly.

>
> We do have dm-zoned [3]in the kernel, but it requires a bounce
> on conventional zones for non-sequential writes, resulting in a write
> amplification of 2x (which is not optimal for flash).
>
> Fully random workloads make little sense to store on ZBDs as a
> host FTL could not be expected to do better than what conventional block
> devices do today. Fully sequential writes are also well taken care of
> by conventional block devices.
>
> The interesting stuff is what lies in between those extremes.
>
> I would like to discuss how we could use UBLK to implement a
> common FTL with the right knobs to cater for a wide range of workloads
> that utilize raw block devices. We had some knobs in the now-dead pblk,
> a FTL for open channel devices, but I think we could do way better than that.
>
> Pblk did not require bouncing writes and had knobs for over-provisioning and
> workload isolation which could be implemented. We could also add options
> for different garbage collection policies. In userspace it would also
> be easy to support default block indirection sizes, reducing logical-physical
> translation table memory overhead.
>
> Use cases for such an FTL includes SSD caching stores such as Apache
> traffic server [1] and CacheLib[2]. CacheLib's block cache and the apache
> traffic server storage workloads are *almost* zone block device compatible
> and would need little translation overhead to perform very well on e.g.
> ZNS SSDs.
>
> There are probably more use cases that would benefit.
>
> It would also be a great research vehicle for academia. We've used dm-zap
> for this [4] purpose the last couple of years, but that is not production-ready
> and cumbersome to improve and maintain as it is implemented as a out-of-tree
> device mapper.

Maybe this is the beginning of a generic open-source userspace SSD FTL, which could be useful for people curious about SSD internals. I have googled several times for such a toolkit to see if one could be ported to UBLK easily. An SSD simulator isn't great, since it isn't a disk and can't handle real data and workloads. With such a project, SSD simulators could become less useful, IMO.

>
> ublk adds a bit of latency overhead, but I think this is acceptable at least
> until we have a great, proven solution, which could be turned into
> an in-kernel FTL.

We will keep improving the ublk I/O path, and I am working on ublk copy. Once that is done, latency for large I/Os could be reduced a lot.

Thanks,
Ming
* Re: [LSF/MM/BPF BoF]: A host FTL for zoned block devices using UBLK 2023-02-06 12:49 ` Ming Lei @ 2023-02-06 12:54 ` Ming Lei 2023-02-06 14:34 ` Matias Bjørling 2023-02-07 10:31 ` Nitesh Shetty 2 siblings, 0 replies; 12+ messages in thread From: Ming Lei @ 2023-02-06 12:54 UTC (permalink / raw) To: Hans Holmberg Cc: linux-block@vger.kernel.org, Matias Bjørling, Damien Le Moal, Dennis Maisenbacher, Ajay Joshi, Jørgen Hansen, andreas@metaspace.dk, javier@javigon.com, slava@dubeyko.com, kbusch@kernel.org, hans@owltronix.com, mcgrof@kernel.org, guokuankuan@bytedance.com, viacheslav.dubeyko@bytedance.com, hch@lst.de On Mon, Feb 06, 2023 at 08:49:15PM +0800, Ming Lei wrote: > > ublk adds a bit of latency overhead, but I think this is acceptable at least > > until we have a great, proven solution, which could be turned into > > an in-kernel FTL. > > We will keep improving ublk io path, and I am working on ublk > copy. Once it is done, big chunk IO latency could be reduced a lot. s/copy/zero copy Thanks, Ming ^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: [LSF/MM/BPF BoF]: A host FTL for zoned block devices using UBLK
From: Matias Bjørling @ 2023-02-06 14:34 UTC
To: Ming Lei, Hans Holmberg
Cc: linux-block@vger.kernel.org, Damien Le Moal, Dennis Maisenbacher, Ajay Joshi, Jørgen Hansen, andreas@metaspace.dk, javier@javigon.com, slava@dubeyko.com, kbusch@kernel.org, hans@owltronix.com, mcgrof@kernel.org, guokuankuan@bytedance.com, viacheslav.dubeyko@bytedance.com, hch@lst.de

> Maybe it is one beginning for generic open-source userspace SSD FTL, which
> could be useful for people curious in SSD internal. I have google several times
> for such toolkit to see if it can be ported to UBLK easily. SSD simulator isn't
> great, which isn't disk and can't handle real data & workloads. With such
> project, SSD simulator could be less useful, IMO.
>

Another possible avenue could be the FTL module that's part of SPDK. It might be worth checking out as well. It has been battle-tested for a couple of years and is used in production (https://www.youtube.com/watch?v=qeNBSjGq0dA).

The module itself could be extracted from SPDK into its own project, or SPDK's ublk extension could be used to instantiate it. In any case, I think it could provide a solid foundation for a host-side FTL implementation.

Best,
Matias
* Re: [LSF/MM/BPF BoF]: A host FTL for zoned block devices using UBLK 2023-02-06 14:34 ` Matias Bjørling @ 2023-02-06 15:32 ` Ming Lei 2023-02-06 18:31 ` Bart Van Assche 2023-02-07 9:32 ` Hans Holmberg 2 siblings, 0 replies; 12+ messages in thread From: Ming Lei @ 2023-02-06 15:32 UTC (permalink / raw) To: Matias Bjørling Cc: Hans Holmberg, linux-block@vger.kernel.org, Damien Le Moal, Dennis Maisenbacher, Ajay Joshi, Jørgen Hansen, andreas@metaspace.dk, javier@javigon.com, slava@dubeyko.com, kbusch@kernel.org, hans@owltronix.com, mcgrof@kernel.org, guokuankuan@bytedance.com, viacheslav.dubeyko@bytedance.com, hch@lst.de Hi Matias, On Mon, Feb 06, 2023 at 02:34:51PM +0000, Matias Bjørling wrote: > > Maybe it is one beginning for generic open-source userspace SSD FTL, which > > could be useful for people curious in SSD internal. I have google several times > > for such toolkit to see if it can be ported to UBLK easily. SSD simulator isn't > > great, which isn't disk and can't handle real data & workloads. With such > > project, SSD simulator could be less useful, IMO. > > > > Another possible avenue could be the FTL module that's part of SPDK. It might be worth checking out as well. It has been battletested for a couple of years and is used in production (https://www.youtube.com/watch?v=qeNBSjGq0dA). > > The module itself could be extracted from SPDK into its own, or SPDK's ublk extension could be used to instantiate it. In any case, I think it could provide a solid foundation for a host-side FTL implementation. Great, I will take a look, and thanks for the sharing! Thanks, Ming ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [LSF/MM/BPF BoF]: A host FTL for zoned block devices using UBLK 2023-02-06 14:34 ` Matias Bjørling 2023-02-06 15:32 ` Ming Lei @ 2023-02-06 18:31 ` Bart Van Assche 2023-02-07 9:40 ` Matias Bjørling 2023-02-07 9:32 ` Hans Holmberg 2 siblings, 1 reply; 12+ messages in thread From: Bart Van Assche @ 2023-02-06 18:31 UTC (permalink / raw) To: Matias Bjørling, Ming Lei, Hans Holmberg Cc: linux-block@vger.kernel.org, Damien Le Moal, Dennis Maisenbacher, Ajay Joshi, Jørgen Hansen, andreas@metaspace.dk, javier@javigon.com, slava@dubeyko.com, kbusch@kernel.org, hans@owltronix.com, mcgrof@kernel.org, guokuankuan@bytedance.com, viacheslav.dubeyko@bytedance.com, hch@lst.de On 2/6/23 06:34, Matias Bjørling wrote: >> Maybe it is one beginning for generic open-source userspace SSD >> FTL, which could be useful for people curious in SSD internal. I >> have google several times for such toolkit to see if it can be >> ported to UBLK easily. SSD simulator isn't great, which isn't disk >> and can't handle real data & workloads. With such project, SSD >> simulator could be less useful, IMO. >> > > Another possible avenue could be the FTL module that's part of SPDK. > It might be worth checking out as well. It has been battletested for > a couple of years and is used in production > (https://www.youtube.com/watch?v=qeNBSjGq0dA). > > The module itself could be extracted from SPDK into its own, or > SPDK's ublk extension could be used to instantiate it. In any case, I > think it could provide a solid foundation for a host-side FTL > implementation. Thanks Matias for the link. I had not yet heard about this project. Although I have not yet had the time to watch the video, on https://spdk.io/doc/ftl.html I found the following: "The Flash Translation Layer library provides efficient 4K block device access on top of devices with >4K write unit size (eg. raid5f bdev) or devices with large indirection units (some capacity-focused NAND drives), which don't handle 4K writes well. 
It handles the logical to physical address mapping and manages the garbage collection process."

To me that sounds like an effort with goals very similar to those of ZNS and ZBC. Does the following advice apply to that project: "Don't stack your log on my log"? (Yang, Jingpei, Ned Plasson, Greg Gillis, Nisha Talagala, and Swaminathan Sundararaman. "Don’t stack your log on my log." In 2nd Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW 14). 2014.)

Thanks,

Bart.
* RE: [LSF/MM/BPF BoF]: A host FTL for zoned block devices using UBLK 2023-02-06 18:31 ` Bart Van Assche @ 2023-02-07 9:40 ` Matias Bjørling 0 siblings, 0 replies; 12+ messages in thread From: Matias Bjørling @ 2023-02-07 9:40 UTC (permalink / raw) To: Bart Van Assche, Ming Lei, Hans Holmberg Cc: linux-block@vger.kernel.org, Damien Le Moal, Dennis Maisenbacher, Ajay Joshi, Jørgen Hansen, andreas@metaspace.dk, javier@javigon.com, slava@dubeyko.com, kbusch@kernel.org, hans@owltronix.com, mcgrof@kernel.org, guokuankuan@bytedance.com, viacheslav.dubeyko@bytedance.com, hch@lst.de, Barczak, Mariusz, Malikowski, Wojciech > > The module itself could be extracted from SPDK into its own, or SPDK's > > ublk extension could be used to instantiate it. In any case, I think > > it could provide a solid foundation for a host-side FTL > > implementation. > > Thanks Matias for the link. I had not yet heard about this project. > Although I have not yet had the time to watch the video, on > https://spdk.io/doc/ftl.html I found the following: "The Flash Translation Layer > library provides efficient 4K block device access on top of devices with >4K > write unit size (eg. raid5f bdev) or devices with large indirection units (some > capacity-focused NAND drives), which don't handle 4K writes well. It handles > the logical to physical address mapping and manages the garbage collection > process." To me that sounds like an effort that has very similar goals as ZNS and > ZBC? Does the following advice apply to that project: "Don't stack your log on > my log"? (Yang, Jingpei, Ned Plasson, Greg Gillis, Nisha Talagala, and > Swaminathan Sundararaman. "Don’t stack your log on my log." In 2nd > Workshop on Interactions of NVM/Flash with Operating Systems and > Workloads ({INFLOW} 14). 2014.) > Hi Bart, Yep, it does. The early incarnation of the ftl module was targeted as an OCSSD-compatible host-side FTL. It was later extended to support large writes and caching devices (e.g., optane). 
Mariusz and Wojciech have had the pleasure of building it, and have also enabled ZNS support that will soon be upstream.

Regards,
Matias
* Re: [LSF/MM/BPF BoF]: A host FTL for zoned block devices using UBLK 2023-02-06 14:34 ` Matias Bjørling 2023-02-06 15:32 ` Ming Lei 2023-02-06 18:31 ` Bart Van Assche @ 2023-02-07 9:32 ` Hans Holmberg 2 siblings, 0 replies; 12+ messages in thread From: Hans Holmberg @ 2023-02-07 9:32 UTC (permalink / raw) To: Matias Bjørling Cc: Ming Lei, Hans Holmberg, linux-block@vger.kernel.org, Damien Le Moal, Dennis Maisenbacher, Ajay Joshi, Jørgen Hansen, andreas@metaspace.dk, javier@javigon.com, slava@dubeyko.com, kbusch@kernel.org, mcgrof@kernel.org, guokuankuan@bytedance.com, viacheslav.dubeyko@bytedance.com, hch@lst.de On Mon, Feb 6, 2023 at 3:35 PM Matias Bjørling <Matias.Bjorling@wdc.com> wrote: > > > Maybe it is one beginning for generic open-source userspace SSD FTL, which > > could be useful for people curious in SSD internal. I have google several times > > for such toolkit to see if it can be ported to UBLK easily. SSD simulator isn't > > great, which isn't disk and can't handle real data & workloads. With such > > project, SSD simulator could be less useful, IMO. > > > > Another possible avenue could be the FTL module that's part of SPDK. It might be worth checking out as well. It has been battletested for a couple of years and is used in production (https://www.youtube.com/watch?v=qeNBSjGq0dA). > > The module itself could be extracted from SPDK into its own, or SPDK's ublk extension could be used to instantiate it. In any case, I think it could provide a solid foundation for a host-side FTL implementation. Thanks for bringing SPDK's CSAL up, I think it's a great example of a well implemented host-ftl. It does require a fast caching device with persistence guarantees (like optane) though, not entirely unlike dm-zoned. It also lives in the spdk universe, which makes it a bit harder to work with than a standalone ftl. 
While a cache in front of the backing storage gives the FTL some time to organize writes in a device-friendly manner before flushing, it adds cost (write amplification, or having to add a fast persistent cache device).

I've seen that SPDK already has the required plumbing for UBLK:
https://spdk.io/doc/ublk.html
I don't know if I/O can be routed to CSAL yet.

That said, it would be great to support the CSAL use case in a common FTL. Not all workloads require a cache, so I think that caching should be optional. RAID and support for multiple tenants from a combined pool of storage are super nice.

Cheers,
Hans
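The cost trade-off between staging writes and writing in place can be put into rough numbers. The sketch below uses two deliberately simple models, both assumptions rather than measurements: a staged (bounced or cached) write hits the media twice, and a log-structured FTL that garbage-collects zones whose valid fraction is `u` writes `1 / (1 - u)` bytes per host byte. The function names are hypothetical.

```python
def wa_staged(staged_fraction: float) -> float:
    """Write amplification when a fraction of host writes is written twice.

    Assumed model: staged data is written once to the staging area
    (cache or conventional zone) and once more to its final location.
    """
    if not 0.0 <= staged_fraction <= 1.0:
        raise ValueError("staged_fraction must be in [0, 1]")
    return 1.0 + staged_fraction


def wa_gc(valid_fraction: float) -> float:
    """Write amplification of log-structured writes with garbage collection.

    Assumed model: every reclaimed zone still holds valid_fraction of
    live data that must be copied before the zone can be reset.
    """
    if not 0.0 <= valid_fraction < 1.0:
        raise ValueError("valid_fraction must be in [0, 1)")
    return 1.0 / (1.0 - valid_fraction)


if __name__ == "__main__":
    print("all writes staged:        ", wa_staged(1.0))
    print("GC, zones 50% valid:      ", wa_gc(0.5))
    print("GC, zones 25% valid:      ", wa_gc(0.25))
    print("GC, zones fully invalid:  ", wa_gc(0.0))
```

In this model, staging every write costs the same as garbage-collecting zones that are still half valid, so a direct log-structured FTL only comes out ahead when the workload leaves zones mostly invalid by the time they are reclaimed.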
* Re: [LSF/MM/BPF BoF]: A host FTL for zoned block devices using UBLK 2023-02-06 12:49 ` Ming Lei 2023-02-06 12:54 ` Ming Lei 2023-02-06 14:34 ` Matias Bjørling @ 2023-02-07 10:31 ` Nitesh Shetty 2023-02-07 12:49 ` Ming Lei 2 siblings, 1 reply; 12+ messages in thread From: Nitesh Shetty @ 2023-02-07 10:31 UTC (permalink / raw) To: Ming Lei Cc: Hans Holmberg, linux-block@vger.kernel.org, Matias Bjørling, Damien Le Moal, Dennis Maisenbacher, Ajay Joshi, Jørgen Hansen, andreas@metaspace.dk, javier@javigon.com, slava@dubeyko.com, kbusch@kernel.org, hans@owltronix.com, mcgrof@kernel.org, guokuankuan@bytedance.com, viacheslav.dubeyko@bytedance.com, hch@lst.de [-- Attachment #1: Type: text/plain, Size: 3313 bytes --] On Mon, Feb 06, 2023 at 08:49:15PM +0800, Ming Lei wrote: > On Mon, Feb 06, 2023 at 10:00:20AM +0000, Hans Holmberg wrote: > > I think we're missing a flexible way of routing random-ish > > write workloads on to zoned storage devices. Implementing a UBLK > > target for this would be a great way to provide zoned storage > > benefits to a range of use cases. Creating UBLK target would > > enable us experiment and move fast, and when we arrive > > at a common, reasonably stable, solution we could move this into > > the kernel. > > Yeah, UBLK provides one easy way for fast prototype. > > > > > We do have dm-zoned [3]in the kernel, but it requires a bounce > > on conventional zones for non-sequential writes, resulting in a write > > amplification of 2x (which is not optimal for flash). > > > > Fully random workloads make little sense to store on ZBDs as a > > host FTL could not be expected to do better than what conventional block > > devices do today. Fully sequential writes are also well taken care of > > by conventional block devices. > > > > The interesting stuff is what lies in between those extremes. 
> > > > I would like to discuss how we could use UBLK to implement a > > common FTL with the right knobs to cater for a wide range of workloads > > that utilize raw block devices. We had some knobs in the now-dead pblk, > > a FTL for open channel devices, but I think we could do way better than that. > > > > Pblk did not require bouncing writes and had knobs for over-provisioning and > > workload isolation which could be implemented. We could also add options > > for different garbage collection policies. In userspace it would also > > be easy to support default block indirection sizes, reducing logical-physical > > translation table memory overhead. > > > > Use cases for such an FTL includes SSD caching stores such as Apache > > traffic server [1] and CacheLib[2]. CacheLib's block cache and the apache > > traffic server storage workloads are *almost* zone block device compatible > > and would need little translation overhead to perform very well on e.g. > > ZNS SSDs. > > > > There are probably more use cases that would benefit. > > > > It would also be a great research vehicle for academia. We've used dm-zap > > for this [4] purpose the last couple of years, but that is not production-ready > > and cumbersome to improve and maintain as it is implemented as a out-of-tree > > device mapper. > > Maybe it is one beginning for generic open-source userspace SSD FTL, > which could be useful for people curious in SSD internal. I have > google several times for such toolkit to see if it can be ported to > UBLK easily. SSD simulator isn't great, which isn't disk and can't handle > real data & workloads. With such project, SSD simulator could be less > useful, IMO. > > > > > ublk adds a bit of latency overhead, but I think this is acceptable at least > > until we have a great, proven solution, which could be turned into > > an in-kernel FTL. > > We will keep improving ublk io path, and I am working on ublk > copy. Once it is done, big chunk IO latency could be reduced a lot. 
>

Just curious, will this also involve running do_splice_direct*() in an async style, like normal async read/write, instead of offloading to the iowq context?

Regards,
Nitesh Shetty
* Re: [LSF/MM/BPF BoF]: A host FTL for zoned block devices using UBLK 2023-02-07 10:31 ` Nitesh Shetty @ 2023-02-07 12:49 ` Ming Lei 0 siblings, 0 replies; 12+ messages in thread From: Ming Lei @ 2023-02-07 12:49 UTC (permalink / raw) To: Nitesh Shetty Cc: Hans Holmberg, linux-block@vger.kernel.org, Matias Bjørling, Damien Le Moal, Dennis Maisenbacher, Ajay Joshi, Jørgen Hansen, andreas@metaspace.dk, javier@javigon.com, slava@dubeyko.com, kbusch@kernel.org, hans@owltronix.com, mcgrof@kernel.org, guokuankuan@bytedance.com, viacheslav.dubeyko@bytedance.com, hch@lst.de, ming.lei On Tue, Feb 07, 2023 at 04:01:41PM +0530, Nitesh Shetty wrote: > On Mon, Feb 06, 2023 at 08:49:15PM +0800, Ming Lei wrote: > > On Mon, Feb 06, 2023 at 10:00:20AM +0000, Hans Holmberg wrote: > > > I think we're missing a flexible way of routing random-ish > > > write workloads on to zoned storage devices. Implementing a UBLK > > > target for this would be a great way to provide zoned storage > > > benefits to a range of use cases. Creating UBLK target would > > > enable us experiment and move fast, and when we arrive > > > at a common, reasonably stable, solution we could move this into > > > the kernel. > > > > Yeah, UBLK provides one easy way for fast prototype. > > > > > > > > We do have dm-zoned [3]in the kernel, but it requires a bounce > > > on conventional zones for non-sequential writes, resulting in a write > > > amplification of 2x (which is not optimal for flash). > > > > > > Fully random workloads make little sense to store on ZBDs as a > > > host FTL could not be expected to do better than what conventional block > > > devices do today. Fully sequential writes are also well taken care of > > > by conventional block devices. > > > > > > The interesting stuff is what lies in between those extremes. > > > > > > I would like to discuss how we could use UBLK to implement a > > > common FTL with the right knobs to cater for a wide range of workloads > > > that utilize raw block devices. 
We had some knobs in the now-dead pblk, > > > a FTL for open channel devices, but I think we could do way better than that. > > > > > > Pblk did not require bouncing writes and had knobs for over-provisioning and > > > workload isolation which could be implemented. We could also add options > > > for different garbage collection policies. In userspace it would also > > > be easy to support default block indirection sizes, reducing logical-physical > > > translation table memory overhead. > > > > > > Use cases for such an FTL includes SSD caching stores such as Apache > > > traffic server [1] and CacheLib[2]. CacheLib's block cache and the apache > > > traffic server storage workloads are *almost* zone block device compatible > > > and would need little translation overhead to perform very well on e.g. > > > ZNS SSDs. > > > > > > There are probably more use cases that would benefit. > > > > > > It would also be a great research vehicle for academia. We've used dm-zap > > > for this [4] purpose the last couple of years, but that is not production-ready > > > and cumbersome to improve and maintain as it is implemented as a out-of-tree > > > device mapper. > > > > Maybe it is one beginning for generic open-source userspace SSD FTL, > > which could be useful for people curious in SSD internal. I have > > google several times for such toolkit to see if it can be ported to > > UBLK easily. SSD simulator isn't great, which isn't disk and can't handle > > real data & workloads. With such project, SSD simulator could be less > > useful, IMO. > > > > > > > > ublk adds a bit of latency overhead, but I think this is acceptable at least > > > until we have a great, proven solution, which could be turned into > > > an in-kernel FTL. > > > > We will keep improving ublk io path, and I am working on ublk > > copy. Once it is done, big chunk IO latency could be reduced a lot. 
> >
>
> Just curious, will this also involve running do_splice_direct*() in async style
> like normal async read/write, instead of offloading to iowq context ?

The idea is as follows:

- Add a new type of buffer (a splice buffer) to io_uring. This buffer is populated into a bvec table (reusing io_mapped_ubuf) by passing (splice_fd, offset, len) in the SQE.

- The buffer is filled from ublk's ->read_splice() with the help of splice_direct_to_actor() over a direct pipe. We can probably add a private splice flag so that ublk's ->read_splice() is only available in the kernel (io_uring) and over a direct pipe.

- The pipe buffer ownership must not be transferred, so nop_pipe_buf_ops is needed for this usage. This approach works fine for ublk and fuse.

- The buffer can be allocated and populated from ->prep() of io_uring rw/net, then handled just like READ[WRITE]_FIXED.

So it behaves like normal async read/write: two page-pinning operations are avoided, and one copy of the I/O data is saved. This approach is also flexible enough to allow read/write over any part of the buffer.

Thanks,
Ming
* Re: [LSF/MM/BPF BoF]: A host FTL for zoned block devices using UBLK 2023-02-06 10:00 [LSF/MM/BPF BoF]: A host FTL for zoned block devices using UBLK Hans Holmberg 2023-02-06 12:49 ` Ming Lei @ 2023-02-06 18:58 ` Bart Van Assche 2023-02-07 12:11 ` Hans Holmberg 1 sibling, 1 reply; 12+ messages in thread From: Bart Van Assche @ 2023-02-06 18:58 UTC (permalink / raw) To: Hans Holmberg, linux-block@vger.kernel.org Cc: ming.lei@redhat.com, Matias Bjørling, Damien Le Moal, Dennis Maisenbacher, Ajay Joshi, Jørgen Hansen, andreas@metaspace.dk, javier@javigon.com, slava@dubeyko.com, kbusch@kernel.org, hans@owltronix.com, mcgrof@kernel.org, guokuankuan@bytedance.com, viacheslav.dubeyko@bytedance.com, hch@lst.de On 2/6/23 02:00, Hans Holmberg wrote: > I think we're missing a flexible way of routing random-ish > write workloads on to zoned storage devices. Implementing a UBLK > target for this would be a great way to provide zoned storage > benefits to a range of use cases. Creating UBLK target would > enable us experiment and move fast, and when we arrive > at a common, reasonably stable, solution we could move this into > the kernel. > > We do have dm-zoned [3]in the kernel, but it requires a bounce > on conventional zones for non-sequential writes, resulting in a write > amplification of 2x (which is not optimal for flash). > > Fully random workloads make little sense to store on ZBDs as a > host FTL could not be expected to do better than what conventional block > devices do today. Fully sequential writes are also well taken care of > by conventional block devices. > > The interesting stuff is what lies in between those extremes. > > I would like to discuss how we could use UBLK to implement a > common FTL with the right knobs to cater for a wide range of workloads > that utilize raw block devices. We had some knobs in the now-dead pblk, > a FTL for open channel devices, but I think we could do way better than that. 
> > Pblk did not require bouncing writes and had knobs for over-provisioning and > workload isolation which could be implemented. We could also add options > for different garbage collection policies. In userspace it would also > be easy to support default block indirection sizes, reducing logical-physical > translation table memory overhead. > > Use cases for such an FTL includes SSD caching stores such as Apache > traffic server [1] and CacheLib[2]. CacheLib's block cache and the apache > traffic server storage workloads are *almost* zone block device compatible > and would need little translation overhead to perform very well on e.g. > ZNS SSDs. > > There are probably more use cases that would benefit. > > It would also be a great research vehicle for academia. We've used dm-zap > for this [4] purpose the last couple of years, but that is not production-ready > and cumbersome to improve and maintain as it is implemented as a out-of-tree > device mapper. > > ublk adds a bit of latency overhead, but I think this is acceptable at least > until we have a great, proven solution, which could be turned into > an in-kernel FTL. > > If there is interest in the community for a project like this, let's talk! > > cc:ing the folks who participated in the discussions at ALPSS 2021 and last > years' plumbers on this subject. > > Thanks, > Hans > > [1] https://trafficserver.apache.org/ > [2] https://cachelib.org/ > [3] https://docs.kernel.org/admin-guide/device-mapper/dm-zoned.html > [4] https://github.com/westerndigitalcorporation/dm-zap Hi Hans, Which functionality would such a user space target provide that is not yet provided by BTRFS, F2FS or any other log-structured filesystem that supports zoned block devices? Thanks, Bart. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [LSF/MM/BPF BoF]: A host FTL for zoned block devices using UBLK
  2023-02-06 18:58 ` Bart Van Assche
@ 2023-02-07 12:11   ` Hans Holmberg
  0 siblings, 0 replies; 12+ messages in thread
From: Hans Holmberg @ 2023-02-07 12:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Hans Holmberg, linux-block@vger.kernel.org, ming.lei@redhat.com,
	Matias Bjørling, Damien Le Moal, Dennis Maisenbacher, Ajay Joshi,
	Jørgen Hansen, andreas@metaspace.dk, javier@javigon.com,
	slava@dubeyko.com, kbusch@kernel.org, mcgrof@kernel.org,
	guokuankuan@bytedance.com, viacheslav.dubeyko@bytedance.com,
	hch@lst.de

On Mon, Feb 6, 2023 at 7:58 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2/6/23 02:00, Hans Holmberg wrote:
> > I think we're missing a flexible way of routing random-ish
> > write workloads on to zoned storage devices. Implementing a UBLK
> > target for this would be a great way to provide zoned storage
> > benefits to a range of use cases. Creating UBLK target would
> > enable us experiment and move fast, and when we arrive
> > at a common, reasonably stable, solution we could move this into
> > the kernel.
> >
> > We do have dm-zoned [3]in the kernel, but it requires a bounce
> > on conventional zones for non-sequential writes, resulting in a write
> > amplification of 2x (which is not optimal for flash).
> >
> > Fully random workloads make little sense to store on ZBDs as a
> > host FTL could not be expected to do better than what conventional block
> > devices do today. Fully sequential writes are also well taken care of
> > by conventional block devices.
> >
> > The interesting stuff is what lies in between those extremes.
> >
> > I would like to discuss how we could use UBLK to implement a
> > common FTL with the right knobs to cater for a wide range of workloads
> > that utilize raw block devices. We had some knobs in the now-dead pblk,
> > a FTL for open channel devices, but I think we could do way better than that.
> >
> > Pblk did not require bouncing writes and had knobs for over-provisioning and
> > workload isolation which could be implemented. We could also add options
> > for different garbage collection policies. In userspace it would also
> > be easy to support default block indirection sizes, reducing logical-physical
> > translation table memory overhead.
> >
> > Use cases for such an FTL includes SSD caching stores such as Apache
> > traffic server [1] and CacheLib[2]. CacheLib's block cache and the apache
> > traffic server storage workloads are *almost* zone block device compatible
> > and would need little translation overhead to perform very well on e.g.
> > ZNS SSDs.
> >
> > There are probably more use cases that would benefit.
> >
> > It would also be a great research vehicle for academia. We've used dm-zap
> > for this [4] purpose the last couple of years, but that is not production-ready
> > and cumbersome to improve and maintain as it is implemented as a out-of-tree
> > device mapper.
> >
> > ublk adds a bit of latency overhead, but I think this is acceptable at least
> > until we have a great, proven solution, which could be turned into
> > an in-kernel FTL.
> >
> > If there is interest in the community for a project like this, let's talk!
> >
> > cc:ing the folks who participated in the discussions at ALPSS 2021 and last
> > years' plumbers on this subject.
> >
> > Thanks,
> > Hans
> >
> > [1] https://trafficserver.apache.org/
> > [2] https://cachelib.org/
> > [3] https://docs.kernel.org/admin-guide/device-mapper/dm-zoned.html
> > [4] https://github.com/westerndigitalcorporation/dm-zap
>
> Hi Hans,
>
> Which functionality would such a user space target provide that is not
> yet provided by BTRFS, F2FS or any other log-structured filesystem that
> supports zoned block devices?
>

Hi Bart,

The use cases I'm primarily thinking of are applications and services
that work on top of raw block interfaces, like Apache Traffic Server and
CacheLib, both mentioned in my proposal. These workloads benefit from not
using a file system.

The file system overhead is just too big for storing millions of larger
(> 2 KiB) objects and billions of tiny (< 2 KiB) objects. For the larger
objects, the write pattern is log-structured and almost fully sequential.

Zoned storage would provide a benefit if multiple instances of these caches
were co-located on the same media, which mixes their write streams, or if
a large-object cache were mixed with other, random workloads, such as the
CacheLib store for small objects.

Cache workloads have relaxed persistence requirements. It's not the end of
the world if an object disappears.

I can recommend [1] and [2] as an introduction to these workloads.

In my Plumbers talk [3] from last year, I sketched out how zoned storage
could benefit object caching on flash.

[1] https://www.usenix.org/conference/osdi20/presentation/berg
[2] https://engineering.fb.com/2021/10/26/core-data/kangaroo/
[3] https://lpc.events/event/16/contributions/1232/attachments/1066/2095/LPC%202022%20Zoned%20MC%20Improving%20object%20caches%20using%20ZNS%20V2.pdf

Cheers,
Hans
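The stream-mixing argument in the reply above can be illustrated with a toy model. This is only a sketch under stated assumptions and is not code from this thread: the zone size, the two cache instances "A" and "B", and the helper names `place` and `blocks_to_relocate` are all hypothetical. It counts how many still-valid blocks would have to be copied before zones can be reset when one instance's data is invalidated, comparing a shared write pointer with one open zone per stream.

```python
ZONE_BLOCKS = 4  # hypothetical zone capacity, in blocks


def place(writes, per_stream):
    """Append (stream, block) writes to zones and return the zone contents.

    With per_stream=False every write goes to one shared open zone.
    With per_stream=True each stream has its own open zone.
    """
    zones = []
    open_zone = {}  # placement key -> index of the currently open zone
    for stream, block in writes:
        key = stream if per_stream else "shared"
        idx = open_zone.get(key)
        if idx is None or len(zones[idx]) == ZONE_BLOCKS:
            zones.append([])  # open a fresh zone for this key
            idx = len(zones) - 1
            open_zone[key] = idx
        zones[idx].append((stream, block))
    return zones


def blocks_to_relocate(zones, dead_stream):
    """Count valid blocks sharing a zone with dead_stream's invalid data.

    These blocks must be copied elsewhere before the zone can be reset,
    which is the source of write amplification in this model.
    """
    moved = 0
    for zone in zones:
        if any(s == dead_stream for s, _ in zone):
            moved += sum(1 for s, _ in zone if s != dead_stream)
    return moved


# Two co-located cache instances, each writing 8 blocks sequentially,
# with their writes interleaved on the way to the device.
writes = [(s, b) for b in range(8) for s in ("A", "B")]

mixed = place(writes, per_stream=False)
separated = place(writes, per_stream=True)

print(blocks_to_relocate(mixed, "A"))      # 8
print(blocks_to_relocate(separated, "A"))  # 0
```

In this model both placements use four zones, but only the per-stream placement lets the zones holding instance A's data be reset without copying anything from instance B.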