* [LSF/MM/BPF TOPIC] Block storage copy offloading
@ 2026-01-23 22:19 Bart Van Assche
From: Bart Van Assche @ 2026-01-23 22:19 UTC (permalink / raw)
To: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org
Cc: lsf-pc, Jaegeuk Kim
Adoption of zoned storage is increasing in mobile devices. Log-
structured filesystems are better suited for zoned storage than
traditional filesystems. These filesystems perform garbage collection.
Garbage collection involves copying data on the storage medium.
Offloading the copying operation to the storage device reduces energy
consumption. Hence the proposal to discuss integration of copy
offloading in the Linux kernel block, SCSI and NVMe layers.
Other use cases for copy offloading include reducing network traffic in
NVMeOF setups and increasing throughput while copying data.
Note: when using fscrypt, the contents of files can be copied without
decrypting the data since how data is encrypted depends on the file
offset and not on the LBA at which data is stored. See also
https://docs.kernel.org/filesystems/fscrypt.html.
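A toy model may make this concrete. The sketch below is not real fscrypt;
it only mimics the property described above, namely that the keystream is
derived from the file offset rather than from the LBA, so ciphertext can
be relocated by a device-side copy and still decrypt correctly. All names
(`keystream`, `disk`, the hash-based cipher) are illustrative.

```python
# Toy model (not real fscrypt): encryption keyed by file offset, so
# ciphertext can be moved between LBAs without decrypting.
import hashlib

def keystream(key: bytes, file_offset: int, length: int) -> bytes:
    # Derive a per-offset keystream; real fscrypt uses AES with an IV
    # derived from the logical block offset within the file.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + file_offset.to_bytes(8, "little")
                              + counter.to_bytes(8, "little")).digest()
        counter += 1
    return out[:length]

def encrypt(key, file_offset, plaintext):
    ks = keystream(key, file_offset, len(plaintext))
    return bytes(a ^ b for a, b in zip(plaintext, ks))

decrypt = encrypt  # XOR with a keystream is symmetric

# The device holds ciphertext at some LBA; garbage collection moves it
# to a new LBA. Because the keystream depends only on the file offset,
# the relocated ciphertext still decrypts correctly.
disk = {}
key = b"toy-key"
data = b"hello zoned world"
disk[100] = encrypt(key, 4096, data)   # file offset 4096 stored at LBA 100
disk[555] = disk.pop(100)              # device-side copy: LBA 100 -> 555
assert decrypt(key, 4096, disk[555]) == data
```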
My goal is to publish a patch series before the LSF/MM/BPF summit starts
that implements the following approach, an approach that hasn't been
proposed yet as far as I know:
* Filesystems call a block layer function that initiates a copy offload
operation asynchronously. This function supports a source block
device, a source offset, a destination block device, a destination
offset and the number of bytes to be copied.
* That block layer function submits separate REQ_OP_COPY_SRC and
REQ_OP_COPY_DST operations. In both bios bi_private is set such that
it points at copy offloading metadata. The bi_private pointer is used
to associate the REQ_OP_COPY_SRC and REQ_OP_COPY_DST operations that
are involved in the same copying operation.
* There are two reasons why the choice has been made to have two copy
operations instead of one:
- Each bio supports a single offset and size (bi_iter). Copying data
involves a source offset and a destination offset. Although it would
be possible to store all the copying metadata in the bio data
buffer, this approach is not compatible with the existing bio
splitting code.
- Device mapper drivers only support a single LBA range per bio.
* After a device mapper driver has finished mapping a bio, the result of
the map operation is stored in the copy offloading metadata. This
probably can be realized by intercepting dm_submit_bio_remap() calls.
* The device mapper mapping process is repeated until all input and
output ranges have been mapped onto ranges not associated with a
device mapper device. Repeating this process is necessary in case of
stacked device mapper devices, e.g. dm-crypt on top of dm-linear.
* After the mapping process is finished, the block layer checks whether
all LBA ranges are associated with the same non-stacking block driver
(NVMe, SCSI, ...). If not, the copy offload operation fails and the
block layer falls back to REQ_OP_READ and REQ_OP_WRITE operations.
* One or more copy operations are submitted to the block driver. The
block driver is responsible for checking whether the copy operation
can be offloaded. While the SCSI EXTENDED COPY command supports
copying between logical units, whether the NVMe Copy command supports
copying across namespaces depends on the version of the NVMe
specification supported by the controller.
* The block layer verifies whether the copy operation copied all data.
  If not, it falls back to REQ_OP_READ and REQ_OP_WRITE.
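A userspace sketch of the proposed flow may help. This is not kernel
code; it only models the steps above under stated assumptions: the two
bio halves share a metadata object (standing in for bi_private), and the
copy is offloaded only when both halves resolve to the same low-level
driver, otherwise falling back to read + write. All names are
illustrative.

```python
# Toy userspace model of the proposed flow: a copy is split into a
# REQ_OP_COPY_SRC and a REQ_OP_COPY_DST bio that share a metadata object
# (standing in for bi_private); after mapping, the block layer either
# offloads or falls back to read + write.
from dataclasses import dataclass

@dataclass
class CopyMeta:
    src_dev: str
    dst_dev: str
    sectors: int

@dataclass
class Bio:
    op: str          # "COPY_SRC" or "COPY_DST"
    dev: str
    sector: int
    meta: CopyMeta   # shared object associating the two halves

def driver_of(dev: str) -> str:
    # Stand-in for resolving which low-level driver owns a device.
    return {"nvme0n1": "nvme", "nvme1n1": "nvme", "sda": "scsi"}[dev]

def submit_copy(src_dev, src_sector, dst_dev, dst_sector, sectors):
    meta = CopyMeta(src_dev, dst_dev, sectors)
    src = Bio("COPY_SRC", src_dev, src_sector, meta)
    dst = Bio("COPY_DST", dst_dev, dst_sector, meta)
    # Offload only if both halves resolve to the same low-level driver;
    # otherwise fall back to a plain read followed by a write.
    if driver_of(src.dev) == driver_of(dst.dev):
        return "offloaded"
    return "fallback: READ + WRITE"

assert submit_copy("nvme0n1", 0, "nvme1n1", 8, 8) == "offloaded"
assert submit_copy("nvme0n1", 0, "sda", 8, 8).startswith("fallback")
```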
Thanks,
Bart.
* Re: [LSF/MM/BPF TOPIC] Block storage copy offloading
From: Viacheslav Dubeyko @ 2026-01-26 18:18 UTC (permalink / raw)
To: Bart Van Assche, linux-block@vger.kernel.org,
linux-scsi@vger.kernel.org, linux-nvme@lists.infradead.org
Cc: lsf-pc, Jaegeuk Kim
On Fri, 2026-01-23 at 14:19 -0800, Bart Van Assche wrote:
> Adoption of zoned storage is increasing in mobile devices. Log-
> structured filesystems are better suited for zoned storage than
> traditional filesystems. These filesystems perform garbage
> collection.
> Garbage collection involves copying data on the storage medium.
> Offloading the copying operation to the storage device reduces energy
> consumption. Hence the proposal to discuss integration of copy
> offloading in the Linux kernel block, SCSI and NVMe layers.
>
> Other use-cases for copy offloading include reducing network traffic
> in NVMeOF setups while copying data and also increasing throughput
> while copying data.
>
The idea is interesting, but...

I am not completely sure that copy offloading to the storage device can
reduce energy consumption. The storage device needs to spend energy for
executing this operation anyway. Do you have any numbers that can prove
your point?

Also, I don't see how an LFS file system can manage this. An LFS file
system contains a sequence of logs, and a log contains both metadata and
user data. Even if one log contains only metadata and another one
contains only user data, the user-data locations have to be known and
stored into the metadata log(s) before the metadata log is written to
the volume.

So, what is your vision of how an LFS file system and the block layer
would collaborate? Which file system have you considered as a working
model for your approach?
Thanks,
Slava.
* Re: [LSF/MM/BPF TOPIC] Block storage copy offloading
From: Bart Van Assche @ 2026-01-26 19:12 UTC (permalink / raw)
To: Viacheslav Dubeyko, linux-block@vger.kernel.org,
linux-scsi@vger.kernel.org, linux-nvme@lists.infradead.org
Cc: lsf-pc, Jaegeuk Kim
On 1/26/26 10:18 AM, Viacheslav Dubeyko wrote:
> I am not completely sure that copy offloading to the storage device can
> reduce energy consumption. The storage device needs to spend energy for
> executing this operation, anyway. Do you have any numbers that can
> prove your point?
Yes, we have measurements that prove this point but unfortunately the
vendor that collected this data does not allow us to publish that data.
Reducing energy consumption matters for mobile devices. There are other
applications for copy offloading, e.g. in data centers and in enterprise
applications. I don't think that these other users care as much about
reducing energy consumption as we do.
> Which file system have you considered as working model of your approach?
Every LFS for zoned storage has to perform garbage collection, doesn't
it?
I think that we can discuss copy offloading without having to discuss
filesystem implementation details.
Thanks,
Bart.
* Re: [LSF/MM/BPF TOPIC] Block storage copy offloading
From: Viacheslav Dubeyko @ 2026-01-27 18:03 UTC (permalink / raw)
To: Bart Van Assche, linux-block@vger.kernel.org,
linux-scsi@vger.kernel.org, linux-nvme@lists.infradead.org
Cc: lsf-pc, Jaegeuk Kim
On Mon, 2026-01-26 at 11:12 -0800, Bart Van Assche wrote:
> On 1/26/26 10:18 AM, Viacheslav Dubeyko wrote:
> > I am not completely sure that copy offloading to the storage device
> > can reduce energy consumption. The storage device needs to spend
> > energy for executing this operation, anyway. Do you have any numbers
> > that can prove your point?
>
> Yes, we have measurements that prove this point but unfortunately the
> vendor that collected this data does not allow us to publish that data.
I see.
>
> Reducing energy consumption matters for mobile devices. There are
> other applications for copy offloading, e.g. in data centers and in
> enterprise applications. I don't think that these other users care as
> much about reducing energy consumption as we do.
>
> > Which file system have you considered as working model of your
> > approach?
>
> Every LFS for zoned storage has to perform garbage collection, doesn't
> it? I think that we can discuss copy offloading without having to
> discuss filesystem implementation details.
It is not exactly correct. :) Unfortunately, we do have to discuss the
file system implementation details.

First of all, not every LFS needs GC. SSDFS, for example, uses a
migration scheme instead of classical GC, and the migration scheme is
not a GC-based technique: the regular file system operations move the
valid blocks from an exhausted erase block (segment) into a clean one.

Even if GC must be used, then (1) the valid blocks need to be extracted
from the logs of the exhausted segment (the victim segment in GC
terminology), (2) a new log needs to be prepared, and (3) the log needs
to be stored into a clean or current segment. A log starts with a
metadata header that describes the location of the other metadata
structures and of the user-data portions in the log. For example, a
NILFS2 log contains a b-tree, other metadata, and user data. An SSDFS
log starts with a header that describes the locations of the block
bitmap and the offset translation table in the log; the offset
translation table, in turn, contains offsets to compressed portions of
logical blocks holding user data or other metadata (such as the inode
b-tree).

And even in the case of F2FS, offloaded copy operations need to be
accounted for in the metadata portion of the file system.

So, frankly speaking, I currently don't see a generic technique that
can work for all LFS file systems. But maybe I am simply missing your
point. :) That is why I am asking for an explanation of how the
suggested technique can work for LFS file systems, because GC is an
integrated subsystem of any LFS file system.
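A toy sketch, loosely modeled on the log layout described above (the
field names are illustrative, not the actual SSDFS or NILFS2 on-disk
format), shows why a log cannot simply be copied sector by sector: the
header's offsets are log-relative, so GC has to extract the valid blocks
and build a fresh log.

```python
def build_log(blocks):
    # A log is self-describing: its offset table records where each
    # logical block lives *inside this log*, so offsets are
    # log-relative rather than device-absolute.
    payload = b"".join(blocks)
    offsets, pos = [], 0
    for b in blocks:
        offsets.append((pos, len(b)))
        pos += len(b)
    return {"offset_table": offsets, "payload": payload}

def read_block(log, i):
    off, length = log["offset_table"][i]
    return log["payload"][off:off + length]

old = build_log([b"inode-btree", b"userdata-A", b"userdata-B"])
# GC keeps only the still-valid block (say, block 2). Copying the raw
# payload bytes as-is would leave the offset table describing dead
# data; instead, the valid blocks are extracted and a fresh,
# self-consistent log is built.
new = build_log([read_block(old, 2)])
assert read_block(new, 0) == b"userdata-B"
```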
Thanks,
Slava.
* Re: [LSF/MM/BPF TOPIC] Block storage copy offloading
From: Bart Van Assche @ 2026-01-27 19:11 UTC (permalink / raw)
To: Viacheslav Dubeyko, linux-block@vger.kernel.org,
linux-scsi@vger.kernel.org, linux-nvme@lists.infradead.org
Cc: lsf-pc, Jaegeuk Kim
On 1/27/26 10:03 AM, Viacheslav Dubeyko wrote:
> So, frankly speaking, I currently don't see a generic technique that
> can work for all LFS file systems.
If I change my topic proposal such that it says "some LFS can benefit
from copy offloading" instead of "all LFS can benefit from copy
offloading", is that sufficient to agree?
Thanks,
Bart.
* Re: [LSF/MM/BPF TOPIC] Block storage copy offloading
From: Viacheslav Dubeyko @ 2026-01-28 18:45 UTC (permalink / raw)
To: Bart Van Assche, linux-block@vger.kernel.org,
linux-scsi@vger.kernel.org, linux-nvme@lists.infradead.org
Cc: lsf-pc, Jaegeuk Kim, Slava.Dubeyko
On Tue, 2026-01-27 at 11:11 -0800, Bart Van Assche wrote:
> On 1/27/26 10:03 AM, Viacheslav Dubeyko wrote:
> > So, frankly speaking, I currently don't see a generic technique
> > that can work for all LFS file systems.
>
> If I change my topic proposal such that it says "some LFS can benefit
> from copy offloading" instead of "all LFS can benefit from copy
> offloading", is that sufficient to agree?
I assume that your approach is based on the capability to move one or
several physical sectors as-is in the background and then to correct
the file system metadata to reflect the new location of the user data
(or metadata). You need to do this as part of GC operations because the
file system uses a copy-on-write (COW) policy, and such a file system
may or may not be log-structured. If the file system is based on the
log concept, then we cannot simply move physical sectors as-is, because
we need to extract the valid blocks from the log and prepare a new log.
If we can offload this logic into the storage device, then we can use
this approach for LFS file systems; otherwise, we cannot talk about LFS
file systems at all. If a file system uses a COW policy and has GC but
does not use the log-structured concept, then we can use the suggested
approach, because we can move physical sectors as-is in the background.
However, if the file system uses compression or encryption, the
situation becomes complicated again: a logical block can be smaller
than a physical sector, and some metadata structure needs to keep track
of the size and location of each particular logical block. So, again,
moving physical sectors as-is could move invalidated logical blocks.

Potentially, we need to restrict the discussion to a COW policy and
non-compressed/non-encrypted data? But that limits the approach
significantly, and a real business case should be ready for compression
and encryption.
Thanks,
Slava.
* Re: [LSF/MM/BPF TOPIC] Block storage copy offloading
From: Keith Busch @ 2026-02-04 23:58 UTC (permalink / raw)
To: Bart Van Assche
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, lsf-pc, Jaegeuk Kim
On Fri, Jan 23, 2026 at 02:19:44PM -0800, Bart Van Assche wrote:
> Adoption of zoned storage is increasing in mobile devices. Log-
> structured filesystems are better suited for zoned storage than
> traditional filesystems. These filesystems perform garbage collection.
> Garbage collection involves copying data on the storage medium.
> Offloading the copying operation to the storage device reduces energy
> consumption. Hence the proposal to discuss integration of copy
> offloading in the Linux kernel block, SCSI and NVMe layers.
>
> Other use-cases for copy offloading include reducing network traffic in
> NVMeOF setups while copying data and also increasing throughput while
> copying data.
I'm interested in the topic; I'm just not sure about the approach. If
it doesn't support vectored sector sources, then it's much less
interesting. From the host point of view, I'd like to be able to submit
arbitrarily large bios to the block layer that can be split and merged
for optimal alignment to hardware limits. The two-bio approach looks
overly complicated in that respect.
* Re: [LSF/MM/BPF TOPIC] Block storage copy offloading
From: Bart Van Assche @ 2026-02-09 21:26 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-nvme@lists.infradead.org, lsf-pc, Jaegeuk Kim
On 2/4/26 3:58 PM, Keith Busch wrote:
> On Fri, Jan 23, 2026 at 02:19:44PM -0800, Bart Van Assche wrote:
>> Adoption of zoned storage is increasing in mobile devices. Log-
>> structured filesystems are better suited for zoned storage than
>> traditional filesystems. These filesystems perform garbage collection.
>> Garbage collection involves copying data on the storage medium.
>> Offloading the copying operation to the storage device reduces energy
>> consumption. Hence the proposal to discuss integration of copy
>> offloading in the Linux kernel block, SCSI and NVMe layers.
>>
>> Other use-cases for copy offloading include reducing network traffic in
>> NVMeOF setups while copying data and also increasing throughput while
>> copying data.
>
> I'm interested in the topic; I'm just not sure about the approach. If
> it doesn't support vectored sector sources, then it's much less
> interesting. From the host point of view, I'd like to be able to
> submit arbitrarily large bios to the block layer that can be split and
> merged for optimal alignment to hardware limits. The two-bio approach
> looks overly complicated in that respect.
Hi Keith,
How about supporting vectored sources with this approach:
* Copy requests with multiple discontiguous input or output ranges
are submitted as multiple bios - one bio for each contiguous range.
* Before these multiple bios are submitted, blk_start_plug() is called.
  After they have been submitted, blk_finish_plug() is called.
* After device mapper LBA translation has completed for all involved
bios, if all involved bios apply to the same input and output
block devices, and if sufficient requests are available, the block
layer submits all the translated requests at once to the block driver
by calling a new callback pointer that is added in struct blk_mq_ops.
* The block driver is responsible for combining the discontiguous
requests into a single copy offload command (if permitted by the
device limits).
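A userspace sketch of the vectored idea, with all names illustrative:
per-range copy bios are collected while "plugged", and at unplug the
driver-facing layer merges contiguous source ranges into a single copy
command carrying a vector of source ranges, as a multi-range copy
command would allow.

```python
# Toy model: collect per-range copy bios under a plug, then merge
# contiguous source ranges into one copy command at unplug time.
def merge_ranges(ranges):
    # ranges: list of (start_sector, nr_sectors), possibly discontiguous.
    merged = []
    for start, n in sorted(ranges):
        if merged and merged[-1][0] + merged[-1][1] == start:
            merged[-1][1] += n   # extend the previous contiguous range
        else:
            merged.append([start, n])
    return [tuple(r) for r in merged]

class Plug:
    def __init__(self):
        self.pending = []

    def queue_copy(self, src_start, nr):
        # One bio per contiguous source range, queued while plugged.
        self.pending.append((src_start, nr))

    def finish(self):
        # At unplug, emit a single copy command with a vector of
        # source ranges (merging adjacent ranges where possible).
        return {"op": "COPY", "src_ranges": merge_ranges(self.pending)}

plug = Plug()
for rng in [(0, 8), (8, 8), (64, 16)]:
    plug.queue_copy(*rng)
cmd = plug.finish()
assert cmd["src_ranges"] == [(0, 16), (64, 16)]
```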
Thanks,
Bart.