* [PATCH 0/5] block: another block copy offload
@ 2025-05-21 22:31 Keith Busch
2025-05-21 22:31 ` [PATCH 1/5] block: new sector copy api Keith Busch
` (6 more replies)
0 siblings, 7 replies; 46+ messages in thread
From: Keith Busch @ 2025-05-21 22:31 UTC (permalink / raw)
To: linux-block, linux-nvme; +Cc: Keith Busch
From: Keith Busch <kbusch@kernel.org>
I was never happy with previous block copy offload attempts, and I was
recently asked to take a look at this, so I had to take a stab at it
myself. Here goes.
Some key implementation differences from previous approaches:
1. Only one bio is needed to describe a copy request, so no plugging
or dispatch tricks are required. Like read and write requests, these
can be arbitrarily large and will be split as needed based on the
request_queue's limits. The bios are mergeable with other copy
commands on adjacent destination sectors.
2. A single bio can describe as many source sector ranges as you want
in a vector. This aligns with the nvme protocol's Copy implementation,
which can be used to efficiently defragment scattered blocks into a
contiguous destination with a single command.
Oh, and nvme-target support is included in this patchset too, so
there's a purely in-kernel way to exercise the code paths if you don't
have otherwise capable hardware. I also tested with qemu, since its
emulated nvme device supports copy offload too.
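As a rough, illustrative sketch of how an in-kernel caller is meant to
drive this (not part of the patches; the function name and sector
numbers are made up, and the interface is the one introduced in
patches 2 and 4):

static int example_gather(struct block_device *bdev)
{
	struct bio_vec srcs[] = {
		{ .bv_sector = 2048, .bv_sectors = 256 },	/* fragment 1 */
		{ .bv_sector = 8192, .bv_sectors = 256 },	/* fragment 2 */
	};

	/* Offloads if the device supports it, else falls back to read+write. */
	return blkdev_copy_range(bdev, 4096, srcs, ARRAY_SIZE(srcs),
				 GFP_KERNEL);
}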
Keith Busch (5):
block: new sector copy api
block: add support for copy offload
nvme: add support for copy offload
block: add support for vectored copies
nvmet: implement copy support for bdev backed target
block/bio.c | 25 +++++++
block/blk-core.c | 4 ++
block/blk-lib.c | 115 ++++++++++++++++++++++++++++++
block/blk-merge.c | 28 +++++++-
block/blk-sysfs.c | 9 +++
block/blk.h | 17 ++++-
block/ioctl.c | 89 +++++++++++++++++++++++
drivers/nvme/host/core.c | 61 ++++++++++++++++
drivers/nvme/target/io-cmd-bdev.c | 52 ++++++++++++++
include/linux/bio.h | 20 ++++++
include/linux/blk-mq.h | 5 ++
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 18 +++++
include/linux/bvec.h | 68 +++++++++++++++++-
include/linux/nvme.h | 42 ++++++++++-
include/uapi/linux/fs.h | 17 +++++
16 files changed, 566 insertions(+), 6 deletions(-)
--
2.47.1
^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH 1/5] block: new sector copy api
2025-05-21 22:31 [PATCH 0/5] block: another block copy offload Keith Busch
@ 2025-05-21 22:31 ` Keith Busch
2025-05-22 10:02 ` Hannes Reinecke
` (2 more replies)
2025-05-21 22:31 ` [PATCH 2/5] block: add support for copy offload Keith Busch
` (5 subsequent siblings)
6 siblings, 3 replies; 46+ messages in thread
From: Keith Busch @ 2025-05-21 22:31 UTC (permalink / raw)
To: linux-block, linux-nvme; +Cc: Keith Busch
From: Keith Busch <kbusch@kernel.org>
Provide a basic block level api to copy a range of a block device's
sectors to a new destination on the same device. This just reads the
source data into host memory, then writes it back out to the device at
the requested destination.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
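For reviewers, a minimal user space sketch of how the new BLKCPY ioctl
added below is meant to be used (not part of the patch; the device path
handling and sector values are arbitrary, everything is in 512-byte
sector units aligned to the logical block size, and it assumes a uapi
header with this patch applied):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/types.h>

static int copy_example(const char *dev)
{
	/* [0] = destination sector, [1] = source sector, [2] = nr of sectors */
	__u64 range[3] = { 8192, 0, 2048 };
	int fd, ret;

	fd = open(dev, O_RDWR);
	if (fd < 0)
		return -1;
	ret = ioctl(fd, BLKCPY, range);
	close(fd);
	return ret;
}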
block/blk-lib.c | 62 +++++++++++++++++++++++++++++++++++++++++
block/ioctl.c | 30 ++++++++++++++++++++
include/linux/blkdev.h | 2 ++
include/uapi/linux/fs.h | 3 ++
4 files changed, 97 insertions(+)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 4c9f20a689f7b..a819ded0ed3a9 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -368,3 +368,65 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
return ret;
}
EXPORT_SYMBOL(blkdev_issue_secure_erase);
+
+/**
+ * blkdev_copy - copy source sectors to a destination on the same block device
+ * @dst_sector: start sector of the destination to copy to
+ * @src_sector: start sector of the source to copy from
+ * @nr_sects: number of sectors to copy
+ * @gfp: allocation flags to use
+ */
+int blkdev_copy(struct block_device *bdev, sector_t dst_sector,
+ sector_t src_sector, sector_t nr_sects, gfp_t gfp)
+{
+ unsigned int nr_vecs = __blkdev_sectors_to_bio_pages(nr_sects);
+ unsigned int len = (unsigned int)nr_sects << SECTOR_SHIFT;
+ unsigned int size = min(len, nr_vecs * PAGE_SIZE);
+ struct bio *bio;
+ int ret = 0;
+ void *buf;
+
+ if (nr_sects > UINT_MAX >> SECTOR_SHIFT)
+ return -EINVAL;
+
+ buf = kvmalloc(size, gfp);
+ if (!buf)
+ return -ENOMEM;
+
+ nr_vecs = bio_add_max_vecs(buf, size);
+ bio = bio_alloc(bdev, nr_vecs, 0, gfp);
+
+ if (is_vmalloc_addr(buf))
+ bio_add_vmalloc(bio, buf, size);
+ else
+ bio_add_virt_nofail(bio, buf, size);
+
+ while (len) {
+ size = min(len, size);
+
+ bio_reset(bio, bdev, REQ_OP_READ);
+ bio->bi_iter.bi_sector = src_sector;
+ bio->bi_iter.bi_size = size;
+
+ ret = submit_bio_wait(bio);
+ if (ret)
+ break;
+
+ bio_reset(bio, bdev, REQ_OP_WRITE);
+ bio->bi_iter.bi_sector = dst_sector;
+ bio->bi_iter.bi_size = size;
+
+ ret = submit_bio_wait(bio);
+ if (ret)
+ break;
+
+ src_sector += size >> SECTOR_SHIFT;
+ dst_sector += size >> SECTOR_SHIFT;
+ len -= size;
+ }
+
+ bio_put(bio);
+ kvfree(buf);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_copy);
diff --git a/block/ioctl.c b/block/ioctl.c
index e472cc1030c60..6f03c65867348 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -212,6 +212,34 @@ static int blk_ioctl_secure_erase(struct block_device *bdev, blk_mode_t mode,
return err;
}
+static int blk_ioctl_copy(struct block_device *bdev, blk_mode_t mode,
+ void __user *argp)
+{
+ unsigned int lbs = bdev_logical_block_size(bdev) >> SECTOR_SHIFT;
+ uint64_t dst, src, end, nr, range[3];
+
+ if (!(mode & BLK_OPEN_WRITE))
+ return -EBADF;
+ if (copy_from_user(range, argp, sizeof(range)))
+ return -EFAULT;
+
+ dst = range[0];
+ src = range[1];
+ nr = range[2];
+
+ if (!(IS_ALIGNED(dst | src | nr, lbs)))
+ return -EINVAL;
+ if (check_add_overflow(src, nr - 1, &end))
+ return -EINVAL;
+ if (end >= bdev_nr_sectors(bdev))
+ return -EINVAL;
+ if (src < dst && src + nr > dst)
+ return -EINVAL;
+ if (dst < src && dst + nr > src)
+ return -EINVAL;
+
+ return blkdev_copy(bdev, dst, src, nr, GFP_KERNEL);
+}
static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
unsigned long arg)
@@ -575,6 +603,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, blk_mode_t mode,
return blk_ioctl_discard(bdev, mode, arg);
case BLKSECDISCARD:
return blk_ioctl_secure_erase(bdev, mode, argp);
+ case BLKCPY:
+ return blk_ioctl_copy(bdev, mode, argp);
case BLKZEROOUT:
return blk_ioctl_zeroout(bdev, mode, arg);
case BLKGETDISKSEQ:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 332b56f323d92..b7d71b126ec9b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1176,6 +1176,8 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop);
int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp);
+int blkdev_copy(struct block_device *bdev, sector_t dst_sector,
+ sector_t src_sector, sector_t nr_sects, gfp_t gfp);
#define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
#define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index e762e1af650c4..534f157ce22e9 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -215,6 +215,9 @@ struct fsxattr {
/* 130-136 are used by zoned block device ioctls (uapi/linux/blkzoned.h) */
/* 137-141 are used by blk-crypto ioctls (uapi/linux/blk-crypto.h) */
+/* [0] = destination lba, [1] = source lba, [2] = number of sectors */
+#define BLKCPY _IOWR(0x12,142,__u64[3])
+
#define BMAP_IOCTL 1 /* obsolete - kept for compatibility */
#define FIBMAP _IO(0x00,1) /* bmap access */
#define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */
--
2.47.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 2/5] block: add support for copy offload
2025-05-21 22:31 [PATCH 0/5] block: another block copy offload Keith Busch
2025-05-21 22:31 ` [PATCH 1/5] block: new sector copy api Keith Busch
@ 2025-05-21 22:31 ` Keith Busch
2025-05-22 13:49 ` Hannes Reinecke
2025-05-23 12:46 ` Christoph Hellwig
2025-05-21 22:31 ` [PATCH 3/5] nvme: " Keith Busch
` (4 subsequent siblings)
6 siblings, 2 replies; 46+ messages in thread
From: Keith Busch @ 2025-05-21 22:31 UTC (permalink / raw)
To: linux-block, linux-nvme; +Cc: Keith Busch
From: Keith Busch <kbusch@kernel.org>
Various storage protocols can support offloading block data copies.
Enhance the block layer to know about the device's copying capabilities,
introduce the new REQ_OP_COPY operation, and provide the infrastructure
to iterate, split, and merge these kinds of requests.
A copy command must provide the device with a list of source LBAs and
their lengths, and a destination LBA. The 'struct bio' type doesn't
readily have a way to describe such a thing. But a copy request doesn't
use host memory for data, so the bio's bio_vec is otherwise unused
space. This patch gives the bio_vec a second purpose: it can carry a
vector of source sector ranges instead of memory pages.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
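As an illustrative sketch only (it mirrors the blkdev_copy_offload()
helper added below), this is how a REQ_OP_COPY bio is built: the
bio_vec entries carry source sector ranges instead of pages, and the
destination goes in bi_iter.bi_sector:

static int example_copy_offload(struct block_device *bdev, sector_t dst,
				sector_t src, sector_t nr_sects)
{
	struct bio_vec bv = {
		.bv_sector = src,
		.bv_sectors = nr_sects,
	};
	struct bio *bio;
	int ret;

	bio = bio_alloc(bdev, 1, REQ_OP_COPY, GFP_KERNEL);
	bio_add_copy_src(bio, &bv);
	bio->bi_iter.bi_sector = dst;			/* destination LBA */
	bio->bi_iter.bi_size = nr_sects << SECTOR_SHIFT; /* total copy bytes */

	ret = submit_bio_wait(bio);
	bio_put(bio);
	return ret;
}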
block/bio.c | 25 ++++++++++++++
block/blk-core.c | 4 +++
block/blk-lib.c | 47 ++++++++++++++++++++++-----
block/blk-merge.c | 28 +++++++++++++++-
block/blk-sysfs.c | 9 ++++++
block/blk.h | 17 +++++++++-
include/linux/bio.h | 20 ++++++++++++
include/linux/blk-mq.h | 5 +++
include/linux/blk_types.h | 2 ++
include/linux/blkdev.h | 14 ++++++++
include/linux/bvec.h | 68 +++++++++++++++++++++++++++++++++++++--
11 files changed, 226 insertions(+), 13 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 3c0a558c90f52..9c73a895c987b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1156,6 +1156,31 @@ void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter)
bio_set_flag(bio, BIO_CLONED);
}
+static int bvec_try_merge_copy_src(struct bio *bio, struct bio_vec *src)
+{
+ struct bio_vec *bv;
+
+ if (!bio->bi_vcnt)
+ return false;
+
+ bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+ if (bv->bv_sector + src->bv_sectors != src->bv_sector)
+ return false;
+
+ bv->bv_sectors += src->bv_sectors;
+ return true;
+}
+
+int bio_add_copy_src(struct bio *bio, struct bio_vec *src)
+{
+ if (bvec_try_merge_copy_src(bio, src))
+ return 0;
+ if (bio->bi_vcnt >= bio->bi_max_vecs)
+ return -EINVAL;
+ bio->bi_io_vec[bio->bi_vcnt++] = *src;
+ return 0;
+}
+
static unsigned int get_contig_folio_len(unsigned int *num_pages,
struct page **pages, unsigned int i,
struct folio *folio, size_t left,
diff --git a/block/blk-core.c b/block/blk-core.c
index b862c66018f25..cb3d9879e2d65 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -837,6 +837,10 @@ void submit_bio_noacct(struct bio *bio)
if (!bdev_max_discard_sectors(bdev))
goto not_supported;
break;
+ case REQ_OP_COPY:
+ if (!bdev_copy_sectors(bdev))
+ goto not_supported;
+ break;
case REQ_OP_SECURE_ERASE:
if (!bdev_max_secure_erase_sectors(bdev))
goto not_supported;
diff --git a/block/blk-lib.c b/block/blk-lib.c
index a819ded0ed3a9..a538acbaa2cd7 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -369,14 +369,7 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
}
EXPORT_SYMBOL(blkdev_issue_secure_erase);
-/**
- * blkdev_copy - copy source sectors to a destination on the same block device
- * @dst_sector: start sector of the destination to copy to
- * @src_sector: start sector of the source to copy from
- * @nr_sects: number of sectors to copy
- * @gfp: allocation flags to use
- */
-int blkdev_copy(struct block_device *bdev, sector_t dst_sector,
+static int __blkdev_copy(struct block_device *bdev, sector_t dst_sector,
sector_t src_sector, sector_t nr_sects, gfp_t gfp)
{
unsigned int nr_vecs = __blkdev_sectors_to_bio_pages(nr_sects);
@@ -429,4 +422,42 @@ int blkdev_copy(struct block_device *bdev, sector_t dst_sector,
kvfree(buf);
return ret;
}
+
+static int blkdev_copy_offload(struct block_device *bdev, sector_t dst_sector,
+ sector_t src_sector, sector_t nr_sects, gfp_t gfp)
+{
+ struct bio *bio;
+ int ret;
+
+ struct bio_vec bv = {
+ .bv_sector = src_sector,
+ .bv_sectors = nr_sects,
+ };
+
+ bio = bio_alloc(bdev, 1, REQ_OP_COPY, gfp);
+ bio_add_copy_src(bio, &bv);
+ bio->bi_iter.bi_sector = dst_sector;
+ bio->bi_iter.bi_size = nr_sects << SECTOR_SHIFT;
+
+ ret = submit_bio_wait(bio);
+ bio_put(bio);
+ return ret;
+
+}
+
+/**
+ * blkdev_copy - copy source sectors to a destination on the same block device
+ * @dst_sector: start sector of the destination to copy to
+ * @src_sector: start sector of the source to copy from
+ * @nr_sects: number of sectors to copy
+ * @gfp: allocation flags to use
+ */
+int blkdev_copy(struct block_device *bdev, sector_t dst_sector,
+ sector_t src_sector, sector_t nr_sects, gfp_t gfp)
+{
+ if (bdev_copy_sectors(bdev))
+ return blkdev_copy_offload(bdev, dst_sector, src_sector,
+ nr_sects, gfp);
+ return __blkdev_copy(bdev, dst_sector, src_sector, nr_sects, gfp);
+}
EXPORT_SYMBOL_GPL(blkdev_copy);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3af1d284add50..8085fc0a27c2f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -399,6 +399,31 @@ struct bio *bio_split_write_zeroes(struct bio *bio,
return bio_submit_split(bio, max_sectors);
}
+struct bio *bio_split_copy(struct bio *bio, const struct queue_limits *lim,
+ unsigned *nr_segs)
+{
+ unsigned nsegs = 0, sectors = 0, mcss = lim->max_copy_segment_sectors;
+ struct bvec_iter iter;
+ struct bio_vec bv;
+
+ bio_for_each_copy_bvec(bv, bio, iter, mcss) {
+ unsigned s;
+
+ s = min(lim->max_copy_sectors - sectors, bv.bv_sectors);
+ nsegs += 1;
+ sectors += s;
+
+ if (nsegs >= lim->max_copy_segments || sectors >= lim->max_copy_sectors)
+ break;
+ }
+
+ if (sectors == bio_sectors(bio))
+ sectors = 0;
+
+ *nr_segs = nsegs;
+ return bio_submit_split(bio, sectors);
+}
+
/**
* bio_split_to_limits - split a bio to fit the queue limits
* @bio: bio to be split
@@ -467,6 +492,7 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
if (!boundary_sectors ||
req_op(rq) == REQ_OP_DISCARD ||
+ req_op(rq) == REQ_OP_COPY ||
req_op(rq) == REQ_OP_SECURE_ERASE)
return max_sectors;
return min(max_sectors,
@@ -753,7 +779,7 @@ static struct request *attempt_merge(struct request_queue *q,
req->__data_len += blk_rq_bytes(next);
- if (!blk_discard_mergable(req))
+ if (!blk_discard_mergable(req) && !blk_copy_mergable(req))
elv_merge_requests(q, req, next);
blk_crypto_rq_put_keyslot(next);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b2b9b89d6967c..93ce41f399363 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -132,6 +132,7 @@ static ssize_t queue_##_field##_show(struct gendisk *disk, char *page) \
QUEUE_SYSFS_LIMIT_SHOW(max_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_discard_segments)
+QUEUE_SYSFS_LIMIT_SHOW(max_copy_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_integrity_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_segment_size)
QUEUE_SYSFS_LIMIT_SHOW(max_write_streams)
@@ -160,6 +161,8 @@ static ssize_t queue_##_field##_show(struct gendisk *disk, char *page) \
QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_discard_sectors)
QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_hw_discard_sectors)
+QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_copy_sectors)
+QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_copy_segment_sectors)
QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_write_zeroes_sectors)
QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(atomic_write_max_sectors)
QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(atomic_write_boundary_sectors)
@@ -501,10 +504,13 @@ QUEUE_LIM_RO_ENTRY(queue_io_min, "minimum_io_size");
QUEUE_LIM_RO_ENTRY(queue_io_opt, "optimal_io_size");
QUEUE_LIM_RO_ENTRY(queue_max_discard_segments, "max_discard_segments");
+QUEUE_LIM_RO_ENTRY(queue_max_copy_segments, "max_copy_segments");
QUEUE_LIM_RO_ENTRY(queue_discard_granularity, "discard_granularity");
QUEUE_LIM_RO_ENTRY(queue_max_hw_discard_sectors, "discard_max_hw_bytes");
QUEUE_LIM_RW_ENTRY(queue_max_discard_sectors, "discard_max_bytes");
QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");
+QUEUE_RO_ENTRY(queue_max_copy_sectors, "copy_max_bytes");
+QUEUE_RO_ENTRY(queue_max_copy_segment_sectors, "copy_segment_max_bytes");
QUEUE_LIM_RO_ENTRY(queue_atomic_write_max_sectors, "atomic_write_max_bytes");
QUEUE_LIM_RO_ENTRY(queue_atomic_write_boundary_sectors,
@@ -644,6 +650,7 @@ static struct attribute *queue_attrs[] = {
&queue_max_sectors_entry.attr,
&queue_max_segments_entry.attr,
&queue_max_discard_segments_entry.attr,
+ &queue_max_copy_segments_entry.attr,
&queue_max_integrity_segments_entry.attr,
&queue_max_segment_size_entry.attr,
&queue_max_write_streams_entry.attr,
@@ -657,6 +664,8 @@ static struct attribute *queue_attrs[] = {
&queue_discard_granularity_entry.attr,
&queue_max_discard_sectors_entry.attr,
&queue_max_hw_discard_sectors_entry.attr,
+ &queue_max_copy_sectors_entry.attr,
+ &queue_max_copy_segment_sectors_entry.attr,
&queue_atomic_write_max_sectors_entry.attr,
&queue_atomic_write_boundary_sectors_entry.attr,
&queue_atomic_write_unit_min_entry.attr,
diff --git a/block/blk.h b/block/blk.h
index 37ec459fe6562..685f3eeca46e0 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -185,10 +185,20 @@ static inline bool blk_discard_mergable(struct request *req)
return false;
}
+static inline bool blk_copy_mergable(struct request *req)
+{
+ if (req_op(req) == REQ_OP_COPY &&
+ queue_max_copy_segments(req->q) > 1)
+ return true;
+ return false;
+}
+
static inline unsigned int blk_rq_get_max_segments(struct request *rq)
{
if (req_op(rq) == REQ_OP_DISCARD)
return queue_max_discard_segments(rq->q);
+ if (req_op(rq) == REQ_OP_COPY)
+ return queue_max_copy_segments(rq->q);
return queue_max_segments(rq->q);
}
@@ -200,7 +210,8 @@ static inline unsigned int blk_queue_get_max_sectors(struct request *rq)
if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
return min(q->limits.max_discard_sectors,
UINT_MAX >> SECTOR_SHIFT);
-
+ if (unlikely(op == REQ_OP_COPY))
+ return q->limits.max_copy_sectors;
if (unlikely(op == REQ_OP_WRITE_ZEROES))
return q->limits.max_write_zeroes_sectors;
@@ -347,6 +358,8 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
unsigned *nr_segs);
struct bio *bio_split_zone_append(struct bio *bio,
const struct queue_limits *lim, unsigned *nr_segs);
+struct bio *bio_split_copy(struct bio *bio, const struct queue_limits *lim,
+ unsigned *nsegs);
/*
* All drivers must accept single-segments bios that are smaller than PAGE_SIZE.
@@ -397,6 +410,8 @@ static inline struct bio *__bio_split_to_limits(struct bio *bio,
return bio_split_discard(bio, lim, nr_segs);
case REQ_OP_WRITE_ZEROES:
return bio_split_write_zeroes(bio, lim, nr_segs);
+ case REQ_OP_COPY:
+ return bio_split_copy(bio, lim, nr_segs);
default:
/* other operations can't be split */
*nr_segs = 0;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 9c37c66ef9ca3..e25bcde9ec59d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -54,6 +54,7 @@ static inline bool bio_has_data(struct bio *bio)
if (bio &&
bio->bi_iter.bi_size &&
bio_op(bio) != REQ_OP_DISCARD &&
+ bio_op(bio) != REQ_OP_COPY &&
bio_op(bio) != REQ_OP_SECURE_ERASE &&
bio_op(bio) != REQ_OP_WRITE_ZEROES)
return true;
@@ -68,6 +69,11 @@ static inline bool bio_no_advance_iter(const struct bio *bio)
bio_op(bio) == REQ_OP_WRITE_ZEROES;
}
+static inline bool bio_sector_advance_iter(const struct bio *bio)
+{
+ return bio_op(bio) == REQ_OP_COPY;
+}
+
static inline void *bio_data(struct bio *bio)
{
if (bio_has_data(bio))
@@ -100,6 +106,8 @@ static inline void bio_advance_iter(const struct bio *bio,
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
+ else if (bio_sector_advance_iter(bio))
+ bvec_iter_sector_advance(bio->bi_io_vec, iter, bytes);
else
bvec_iter_advance(bio->bi_io_vec, iter, bytes);
/* TODO: It is reasonable to complete bio with error here. */
@@ -114,6 +122,8 @@ static inline void bio_advance_iter_single(const struct bio *bio,
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
+ else if (bio_sector_advance_iter(bio))
+ bvec_iter_sector_advance_single(bio->bi_io_vec, iter, bytes);
else
bvec_iter_advance_single(bio->bi_io_vec, iter, bytes);
}
@@ -155,6 +165,15 @@ static inline void bio_advance(struct bio *bio, unsigned int nbytes)
((bvl = mp_bvec_iter_bvec((bio)->bi_io_vec, (iter))), 1); \
bio_advance_iter_single((bio), &(iter), (bvl).bv_len))
+#define __bio_for_each_copy_bvec(bvl, bio, iter, start, max) \
+ for (iter = (start); \
+ (iter).bi_size && \
+ ((bvl = copy_bvec_iter_bvec((bio)->bi_io_vec, (iter), max)), 1); \
+ bio_advance_iter_single((bio), &(iter), (bvl).bv_sectors << SECTOR_SHIFT))
+
+#define bio_for_each_copy_bvec(bvl, bio, iter, max) \
+ __bio_for_each_copy_bvec(bvl, bio, iter, (bio)->bi_iter, max)
+
/* iterate over multi-page bvec */
#define bio_for_each_bvec(bvl, bio, iter) \
__bio_for_each_bvec(bvl, bio, iter, (bio)->bi_iter)
@@ -409,6 +428,7 @@ extern void bio_uninit(struct bio *);
void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf);
void bio_chain(struct bio *, struct bio *);
+int bio_add_copy_src(struct bio *bio, struct bio_vec *src);
int __must_check bio_add_page(struct bio *bio, struct page *page, unsigned len,
unsigned off);
bool __must_check bio_add_folio(struct bio *bio, struct folio *folio,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index de8c85a03bb7f..49816e7f7df7d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1049,6 +1049,11 @@ struct req_iterator {
struct bio *bio;
};
+#define rq_for_each_copy_bvec(bvl, _rq, _iter) \
+ __rq_for_each_bio(_iter.bio, _rq) \
+ bio_for_each_copy_bvec(bvl, _iter.bio, _iter.iter, \
+ _rq->q->limits.max_copy_segment_sectors)
+
#define __rq_for_each_bio(_bio, rq) \
if ((rq->bio)) \
for (_bio = (rq)->bio; _bio; _bio = _bio->bi_next)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 3d1577f07c1c8..361d44c0d1317 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -355,6 +355,8 @@ enum req_op {
REQ_OP_ZONE_RESET = (__force blk_opf_t)13,
/* reset all the zone present on the device */
REQ_OP_ZONE_RESET_ALL = (__force blk_opf_t)15,
+ /* Copy offload sectors to the device */
+ REQ_OP_COPY = (__force blk_opf_t)17,
/* Driver private requests */
REQ_OP_DRV_IN = (__force blk_opf_t)34,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b7d71b126ec9b..e39ba0e91d43e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -399,9 +399,13 @@ struct queue_limits {
unsigned int atomic_write_hw_unit_max;
unsigned int atomic_write_unit_max;
+ unsigned int max_copy_sectors;
+ unsigned int max_copy_segment_sectors;
+
unsigned short max_segments;
unsigned short max_integrity_segments;
unsigned short max_discard_segments;
+ unsigned short max_copy_segments;
unsigned short max_write_streams;
unsigned int write_stream_granularity;
@@ -1271,6 +1275,11 @@ static inline unsigned short queue_max_discard_segments(const struct request_que
return q->limits.max_discard_segments;
}
+static inline unsigned short queue_max_copy_segments(const struct request_queue *q)
+{
+ return q->limits.max_copy_segments;
+}
+
static inline unsigned int queue_max_segment_size(const struct request_queue *q)
{
return q->limits.max_segment_size;
@@ -1380,6 +1389,11 @@ static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
return bdev_limits(bdev)->max_write_zeroes_sectors;
}
+static inline unsigned int bdev_copy_sectors(struct block_device *bdev)
+{
+ return bdev_limits(bdev)->max_copy_sectors;
+}
+
static inline bool bdev_nonrot(struct block_device *bdev)
{
return blk_queue_nonrot(bdev_get_queue(bdev));
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 204b22a99c4ba..7cc82738ede8a 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -21,6 +21,8 @@ struct page;
* @bv_page: First page associated with the address range.
* @bv_len: Number of bytes in the address range.
* @bv_offset: Start of the address range relative to the start of @bv_page.
+ * @bv_sector: Start sector associated with the source block range
+ * @bv_sectors: Number of sectors in the block range
*
* The following holds for a bvec if n * PAGE_SIZE < bv_offset + bv_len:
*
@@ -29,9 +31,17 @@ struct page;
* This holds because page_is_mergeable() checks the above property.
*/
struct bio_vec {
- struct page *bv_page;
- unsigned int bv_len;
- unsigned int bv_offset;
+ union {
+ struct {
+ struct page *bv_page;
+ unsigned int bv_len;
+ unsigned int bv_offset;
+ };
+ struct {
+ sector_t bv_sector;
+ sector_t bv_sectors;
+ };
+ };
};
/**
@@ -118,6 +128,21 @@ struct bvec_iter_all {
.bv_offset = mp_bvec_iter_offset((bvec), (iter)), \
})
+/* sector based bvec helpers */
+#define copy_bvec_iter_sector(bvec, iter) \
+ (__bvec_iter_bvec((bvec), (iter))->bv_sector) + \
+ ((iter).bi_bvec_done >> 9)
+
+#define copy_bvec_iter_sectors(bvec, iter) \
+ (__bvec_iter_bvec((bvec), (iter))->bv_sectors) - \
+ ((iter).bi_bvec_done >> 9)
+
+#define copy_bvec_iter_bvec(bvec, iter, max) \
+((struct bio_vec) { \
+ .bv_sector = copy_bvec_iter_sector((bvec), (iter)), \
+ .bv_sectors = min(max, copy_bvec_iter_sectors((bvec), (iter))), \
+})
+
/* For building single-page bvec in flight */
#define bvec_iter_offset(bvec, iter) \
(mp_bvec_iter_offset((bvec), (iter)) % PAGE_SIZE)
@@ -161,6 +186,30 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
return true;
}
+static inline bool bvec_iter_sector_advance(const struct bio_vec *bv,
+ struct bvec_iter *iter, unsigned bytes)
+{
+ unsigned int idx = iter->bi_idx;
+
+ if (WARN_ONCE(bytes > iter->bi_size,
+ "Attempted to advance past end of bvec iter\n")) {
+ iter->bi_size = 0;
+ return false;
+ }
+
+ iter->bi_size -= bytes;
+ bytes += iter->bi_bvec_done;
+
+ while (bytes && bytes >> 9 >= bv[idx].bv_sectors) {
+ bytes -= bv[idx].bv_sectors << 9;
+ idx++;
+ }
+
+ iter->bi_idx = idx;
+ iter->bi_bvec_done = bytes;
+ return true;
+}
+
/*
* A simpler version of bvec_iter_advance(), @bytes should not span
* across multiple bvec entries, i.e. bytes <= bv[i->bi_idx].bv_len
@@ -178,6 +227,19 @@ static inline void bvec_iter_advance_single(const struct bio_vec *bv,
iter->bi_size -= bytes;
}
+static inline void bvec_iter_sector_advance_single(const struct bio_vec *bv,
+ struct bvec_iter *iter, unsigned bytes)
+{
+ unsigned int done = iter->bi_bvec_done + bytes;
+
+ if (done == bv[iter->bi_idx].bv_sectors << 9) {
+ done = 0;
+ iter->bi_idx++;
+ }
+ iter->bi_bvec_done = done;
+ iter->bi_size -= bytes;
+}
+
#define for_each_bvec(bvl, bio_vec, iter, start) \
for (iter = (start); \
(iter).bi_size && \
--
2.47.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 3/5] nvme: add support for copy offload
2025-05-21 22:31 [PATCH 0/5] block: another block copy offload Keith Busch
2025-05-21 22:31 ` [PATCH 1/5] block: new sector copy api Keith Busch
2025-05-21 22:31 ` [PATCH 2/5] block: add support for copy offload Keith Busch
@ 2025-05-21 22:31 ` Keith Busch
2025-05-22 0:47 ` Caleb Sander Mateos
` (2 more replies)
2025-05-21 22:31 ` [PATCH 4/5] block: add support for vectored copies Keith Busch
` (3 subsequent siblings)
6 siblings, 3 replies; 46+ messages in thread
From: Keith Busch @ 2025-05-21 22:31 UTC (permalink / raw)
To: linux-block, linux-nvme; +Cc: Keith Busch
From: Keith Busch <kbusch@kernel.org>
Register the nvme namespace copy capablities with the request_queue
limits and implement support for the REQ_OP_COPY operation.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
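To give a concrete feel for the unit conversion done below (numbers are
illustrative only): with a 4096-byte logical block size, MSSRL=256 and
MCL=4096, the namespace would advertise

	max_copy_segment_sectors = 4096 * 256  >> 9 = 2048   (1 MiB per source range)
	max_copy_sectors         = 4096 * 4096 >> 9 = 32768  (16 MiB per command)
	max_copy_segments        = MSRC + 1

all in 512-byte sector units.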
drivers/nvme/host/core.c | 61 ++++++++++++++++++++++++++++++++++++++++
include/linux/nvme.h | 42 ++++++++++++++++++++++++++-
2 files changed, 102 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f69a232a000ac..3134fe85b1abc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -888,6 +888,52 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
return BLK_STS_OK;
}
+static inline blk_status_t nvme_setup_copy(struct nvme_ns *ns,
+ struct request *req, struct nvme_command *cmnd)
+{
+ struct nvme_copy_range *range;
+ struct req_iterator iter;
+ struct bio_vec bvec;
+ u16 control = 0;
+ int i = 0;
+
+ static const size_t alloc_size = sizeof(*range) * NVME_COPY_MAX_RANGES;
+
+ if (WARN_ON_ONCE(blk_rq_nr_phys_segments(req) >= NVME_COPY_MAX_RANGES))
+ return BLK_STS_IOERR;
+
+ range = kzalloc(alloc_size, GFP_ATOMIC | __GFP_NOWARN);
+ if (!range)
+ return BLK_STS_RESOURCE;
+
+ if (req->cmd_flags & REQ_FUA)
+ control |= NVME_RW_FUA;
+ if (req->cmd_flags & REQ_FAILFAST_DEV)
+ control |= NVME_RW_LR;
+
+ rq_for_each_copy_bvec(bvec, req, iter) {
+ u64 slba = nvme_sect_to_lba(ns->head, bvec.bv_sector);
+ u64 nlb = nvme_sect_to_lba(ns->head, bvec.bv_sectors) - 1;
+
+ range[i].slba = cpu_to_le64(slba);
+ range[i].nlb = cpu_to_le16(nlb);
+ i++;
+ }
+
+ memset(cmnd, 0, sizeof(*cmnd));
+ cmnd->copy.opcode = nvme_cmd_copy;
+ cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+ cmnd->copy.nr_range = i - 1;
+ cmnd->copy.sdlba = cpu_to_le64(nvme_sect_to_lba(ns->head,
+ blk_rq_pos(req)));
+ cmnd->copy.control = cpu_to_le16(control);
+
+ bvec_set_virt(&req->special_vec, range, alloc_size);
+ req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+ return BLK_STS_OK;
+}
+
static void nvme_set_app_tag(struct request *req, struct nvme_command *cmnd)
{
cmnd->rw.lbat = cpu_to_le16(bio_integrity(req->bio)->app_tag);
@@ -1106,6 +1152,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
case REQ_OP_DISCARD:
ret = nvme_setup_discard(ns, req, cmd);
break;
+ case REQ_OP_COPY:
+ ret = nvme_setup_copy(ns, req, cmd);
+ break;
case REQ_OP_READ:
ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
break;
@@ -2119,6 +2168,15 @@ static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
lim->max_write_zeroes_sectors = UINT_MAX;
else
lim->max_write_zeroes_sectors = ns->ctrl->max_zeroes_sectors;
+
+ if (ns->ctrl->oncs & NVME_CTRL_ONCS_NVMCPYS && id->mssrl && id->mcl) {
+ u32 mcss = bs * le16_to_cpu(id->mssrl) >> SECTOR_SHIFT;
+ u32 mcs = bs * le32_to_cpu(id->mcl) >> SECTOR_SHIFT;
+
+ lim->max_copy_segment_sectors = mcss;
+ lim->max_copy_sectors = mcs;
+ lim->max_copy_segments = id->msrc + 1;
+ }
return valid;
}
@@ -2526,6 +2584,9 @@ static int nvme_update_ns_info(struct nvme_ns *ns, struct nvme_ns_info *info)
nvme_init_integrity(ns->head, &lim, info);
lim.max_write_streams = ns_lim->max_write_streams;
lim.write_stream_granularity = ns_lim->write_stream_granularity;
+ lim.max_copy_segment_sectors = ns_lim->max_copy_segment_sectors;
+ lim.max_copy_sectors = ns_lim->max_copy_sectors;
+ lim.max_copy_segments = ns_lim->max_copy_segments;
ret = queue_limits_commit_update(ns->head->disk->queue, &lim);
set_capacity_and_notify(ns->head->disk, get_capacity(ns->disk));
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 51308f65b72fd..14f46ad1330b6 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -404,6 +404,7 @@ enum {
NVME_CTRL_ONCS_WRITE_ZEROES = 1 << 3,
NVME_CTRL_ONCS_RESERVATIONS = 1 << 5,
NVME_CTRL_ONCS_TIMESTAMP = 1 << 6,
+ NVME_CTRL_ONCS_NVMCPYS = 1 << 8,
NVME_CTRL_VWC_PRESENT = 1 << 0,
NVME_CTRL_OACS_SEC_SUPP = 1 << 0,
NVME_CTRL_OACS_NS_MNGT_SUPP = 1 << 3,
@@ -458,7 +459,10 @@ struct nvme_id_ns {
__le16 npdg;
__le16 npda;
__le16 nows;
- __u8 rsvd74[18];
+ __le16 mssrl;
+ __le32 mcl;
+ __u8 msrc;
+ __u8 rsvd81[11];
__le32 anagrpid;
__u8 rsvd96[3];
__u8 nsattr;
@@ -956,6 +960,7 @@ enum nvme_opcode {
nvme_cmd_resv_acquire = 0x11,
nvme_cmd_io_mgmt_recv = 0x12,
nvme_cmd_resv_release = 0x15,
+ nvme_cmd_copy = 0x19,
nvme_cmd_zone_mgmt_send = 0x79,
nvme_cmd_zone_mgmt_recv = 0x7a,
nvme_cmd_zone_append = 0x7d,
@@ -978,6 +983,7 @@ enum nvme_opcode {
nvme_opcode_name(nvme_cmd_resv_acquire), \
nvme_opcode_name(nvme_cmd_io_mgmt_recv), \
nvme_opcode_name(nvme_cmd_resv_release), \
+ nvme_opcode_name(nvme_cmd_copy), \
nvme_opcode_name(nvme_cmd_zone_mgmt_send), \
nvme_opcode_name(nvme_cmd_zone_mgmt_recv), \
nvme_opcode_name(nvme_cmd_zone_append))
@@ -1158,6 +1164,39 @@ struct nvme_dsm_range {
__le64 slba;
};
+struct nvme_copy_cmd {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd2;
+ __le64 metadata;
+ union nvme_data_ptr dptr;
+ __le64 sdlba;
+ __u8 nr_range;
+ __u8 format;
+ __le16 control;
+ __le16 cev;
+ __le16 dspec;
+ __le32 lbtl;
+ __le16 lbat;
+ __le16 lbatm;
+};
+
+#define NVME_COPY_MAX_RANGES 128
+struct nvme_copy_range {
+ __le32 spars;
+ __u32 rsvd4;
+ __le64 slba;
+ __le16 nlb;
+ __le16 cetype;
+ __le16 cev;
+ __le16 sopt;
+ __le32 elbt;
+ __le16 elbat;
+ __le16 elbatm;
+};
+
struct nvme_write_zeroes_cmd {
__u8 opcode;
__u8 flags;
@@ -1985,6 +2024,7 @@ struct nvme_command {
struct nvme_download_firmware dlfw;
struct nvme_format_cmd format;
struct nvme_dsm_cmd dsm;
+ struct nvme_copy_cmd copy;
struct nvme_write_zeroes_cmd write_zeroes;
struct nvme_zone_mgmt_send_cmd zms;
struct nvme_zone_mgmt_recv_cmd zmr;
--
2.47.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 4/5] block: add support for vectored copies
2025-05-21 22:31 [PATCH 0/5] block: another block copy offload Keith Busch
` (2 preceding siblings ...)
2025-05-21 22:31 ` [PATCH 3/5] nvme: " Keith Busch
@ 2025-05-21 22:31 ` Keith Busch
2025-05-22 13:58 ` Hannes Reinecke
2025-05-21 22:31 ` [PATCH 5/5] nvmet: implement copy support for bdev backed target Keith Busch
` (2 subsequent siblings)
6 siblings, 1 reply; 46+ messages in thread
From: Keith Busch @ 2025-05-21 22:31 UTC (permalink / raw)
To: linux-block, linux-nvme; +Cc: Keith Busch
From: Keith Busch <kbusch@kernel.org>
Copy offload can be used to defragment or garbage collect data spread
across the disk. Most storage protocols provide a way to specify
multiple sources in a single copy command, so introduce kernel and user
space interfaces to accomplish that.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
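A minimal user space sketch of the vectored BLKCPY_VEC ioctl added
below (not part of the patch; the two source ranges and the destination
are arbitrary values in 512-byte sector units, on an already-open fd):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/types.h>

static int copy_vec_example(int fd)
{
	struct copy_source srcs[2] = {
		{ .src_sector = 0,    .nr_sectors = 1024 },
		{ .src_sector = 4096, .nr_sectors = 1024 },
	};
	struct copy_range cr = {
		.dst_sector = 16384,
		.nr_ranges = 2,
		.sources = (__u64)(uintptr_t)srcs,
	};

	return ioctl(fd, BLKCPY_VEC, &cr);
}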
block/blk-lib.c | 50 ++++++++++++++++++++++++----------
block/ioctl.c | 59 +++++++++++++++++++++++++++++++++++++++++
include/linux/blkdev.h | 2 ++
include/uapi/linux/fs.h | 14 ++++++++++
4 files changed, 111 insertions(+), 14 deletions(-)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index a538acbaa2cd7..7513b876a5399 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -424,26 +424,46 @@ static int __blkdev_copy(struct block_device *bdev, sector_t dst_sector,
}
static int blkdev_copy_offload(struct block_device *bdev, sector_t dst_sector,
- sector_t src_sector, sector_t nr_sects, gfp_t gfp)
+ struct bio_vec *bv, int nr_vecs, gfp_t gfp)
{
+ unsigned size = 0;
struct bio *bio;
- int ret;
-
- struct bio_vec bv = {
- .bv_sector = src_sector,
- .bv_sectors = nr_sects,
- };
+ int ret, i;
- bio = bio_alloc(bdev, 1, REQ_OP_COPY, gfp);
- bio_add_copy_src(bio, &bv);
+ bio = bio_alloc(bdev, nr_vecs, REQ_OP_COPY, gfp);
+ for (i = 0; i < nr_vecs; i++) {
+ size += bv[i].bv_sectors << SECTOR_SHIFT;
+ bio_add_copy_src(bio, &bv[i]);
+ }
bio->bi_iter.bi_sector = dst_sector;
- bio->bi_iter.bi_size = nr_sects << SECTOR_SHIFT;
+ bio->bi_iter.bi_size = size;
ret = submit_bio_wait(bio);
bio_put(bio);
return ret;
+}
+
+/**
+ * blkdev_copy_range - copy range of sectors to a destination
+ * @dst_sector: start sector of the destination to copy to
+ * @bv: vector of source sectors
+ * @nr_vecs: number of source sector vectors
+ * @gfp: allocation flags to use
+ */
+int blkdev_copy_range(struct block_device *bdev, sector_t dst_sector,
+ struct bio_vec *bv, int nr_vecs, gfp_t gfp)
+{
+ int ret, i;
+ if (bdev_copy_sectors(bdev))
+ return blkdev_copy_offload(bdev, dst_sector, bv, nr_vecs, gfp);
+
+ for (i = 0, ret = 0; i < nr_vecs && !ret; i++)
+ ret = __blkdev_copy(bdev, dst_sector, bv[i].bv_sector,
+ bv[i].bv_sectors, gfp);
+ return ret;
}
+EXPORT_SYMBOL_GPL(blkdev_copy_range);
/**
* blkdev_copy - copy source sectors to a destination on the same block device
@@ -455,9 +475,11 @@ static int blkdev_copy_offload(struct block_device *bdev, sector_t dst_sector,
int blkdev_copy(struct block_device *bdev, sector_t dst_sector,
sector_t src_sector, sector_t nr_sects, gfp_t gfp)
{
- if (bdev_copy_sectors(bdev))
- return blkdev_copy_offload(bdev, dst_sector, src_sector,
- nr_sects, gfp);
- return __blkdev_copy(bdev, dst_sector, src_sector, nr_sects, gfp);
+ struct bio_vec bv = {
+ .bv_sector = src_sector,
+ .bv_sectors = nr_sects,
+ };
+
+ return blkdev_copy_range(bdev, dst_sector, &bv, 1, gfp);
}
EXPORT_SYMBOL_GPL(blkdev_copy);
diff --git a/block/ioctl.c b/block/ioctl.c
index 6f03c65867348..4b5095be19e1a 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -241,6 +241,63 @@ static int blk_ioctl_copy(struct block_device *bdev, blk_mode_t mode,
return blkdev_copy(bdev, dst, src, nr, GFP_KERNEL);
}
+static int blk_ioctl_copy_vec(struct block_device *bdev, blk_mode_t mode,
+ void __user *argp)
+{
+ sector_t align = bdev_logical_block_size(bdev) >> SECTOR_SHIFT;
+ struct bio_vec *bv, fast_bv[UIO_FASTIOV];
+ struct copy_range cr;
+ int i, nr, ret;
+ __u64 dst;
+
+ if (!(mode & BLK_OPEN_WRITE))
+ return -EBADF;
+ if (copy_from_user(&cr, argp, sizeof(cr)))
+ return -EFAULT;
+ if (!(IS_ALIGNED(cr.dst_sector, align)))
+ return -EINVAL;
+
+ nr = cr.nr_ranges;
+ if (nr <= UIO_FASTIOV) {
+ bv = fast_bv;
+ } else {
+ bv = kmalloc_array(nr, sizeof(*bv), GFP_KERNEL);
+ if (!bv)
+ return -ENOMEM;
+ }
+
+ dst = cr.dst_sector;
+ for (i = 0; i < nr; i++) {
+ struct copy_source csrc;
+ __u64 nr_sects, src;
+
+ if (copy_from_user(&csrc,
+ (void __user *)(cr.sources + i * sizeof(csrc)),
+ sizeof(csrc))) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ nr_sects = csrc.nr_sectors;
+ src = csrc.src_sector;
+ if (!(IS_ALIGNED(src | nr_sects, align)) ||
+ (src < dst && src + nr_sects > dst) ||
+ (dst < src && dst + nr_sects > src)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ bv[i].bv_sectors = nr_sects;
+ bv[i].bv_sector = src;
+ }
+
+ ret = blkdev_copy_range(bdev, dst, bv, nr, GFP_KERNEL);
+out:
+ if (bv != fast_bv)
+ kfree(bv);
+ return ret;
+}
+
static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
unsigned long arg)
{
@@ -605,6 +662,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, blk_mode_t mode,
return blk_ioctl_secure_erase(bdev, mode, argp);
case BLKCPY:
return blk_ioctl_copy(bdev, mode, argp);
+ case BLKCPY_VEC:
+ return blk_ioctl_copy_vec(bdev, mode, argp);
case BLKZEROOUT:
return blk_ioctl_zeroout(bdev, mode, arg);
case BLKGETDISKSEQ:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e39ba0e91d43e..a77f2298754b5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1182,6 +1182,8 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp);
int blkdev_copy(struct block_device *bdev, sector_t dst_sector,
sector_t src_sector, sector_t nr_sects, gfp_t gfp);
+int blkdev_copy_range(struct block_device *bdev, sector_t dst_sector,
+ struct bio_vec *bv, int nr_vecs, gfp_t gfp);
#define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
#define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 534f157ce22e9..aed965f74ea2c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -218,6 +218,20 @@ struct fsxattr {
/* [0] = destination lba, [1] = source lba, [2] = number of sectors */
#define BLKCPY _IOWR(0x12,142,__u64[3])
+struct copy_source {
+ __u64 src_sector;
+ __u64 nr_sectors;
+};
+
+struct copy_range {
+ __u64 dst_sector;
+ __u16 nr_ranges;
+ __u8 rsvd[6];
+ __u64 sources; /* user space pointer to struct copy_source[] */
+};
+#define BLKCPY_VEC _IOWR(0x12,143,struct copy_range)
+
+
#define BMAP_IOCTL 1 /* obsolete - kept for compatibility */
#define FIBMAP _IO(0x00,1) /* bmap access */
#define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */
--
2.47.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 5/5] nvmet: implement copy support for bdev backed target
2025-05-21 22:31 [PATCH 0/5] block: another block copy offload Keith Busch
` (3 preceding siblings ...)
2025-05-21 22:31 ` [PATCH 4/5] block: add support for vectored copies Keith Busch
@ 2025-05-21 22:31 ` Keith Busch
2025-05-22 13:59 ` Hannes Reinecke
2025-05-23 13:18 ` Christoph Hellwig
2025-05-22 15:52 ` [PATCH 0/5] block: another block copy offload Bart Van Assche
2025-07-03 14:47 ` Niklas Cassel
6 siblings, 2 replies; 46+ messages in thread
From: Keith Busch @ 2025-05-21 22:31 UTC (permalink / raw)
To: linux-block, linux-nvme; +Cc: Keith Busch
From: Keith Busch <kbusch@kernel.org>
The nvme block device target type does not have any particular limits
on copy commands, so all the settings are reported as the protocol's
maximum values.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
drivers/nvme/target/io-cmd-bdev.c | 52 +++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 83be0657e6df4..d90dedcd2352f 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,11 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
id->npda = id->npdg;
/* NOWS = Namespace Optimal Write Size */
id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+ /* Copy offload support */
+ id->mssrl = cpu_to_le16(U16_MAX);
+ id->mcl = cpu_to_le32(U32_MAX);
+ id->msrc = U8_MAX;
}
void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -412,6 +417,50 @@ static void nvmet_bdev_execute_discard(struct nvmet_req *req)
}
}
+static void nvmet_bdev_execute_copy(struct nvmet_req *req)
+{
+ struct bio_vec *bv, fast_bv[UIO_FASTIOV];
+ struct nvme_copy_range range;
+ u64 dst_sector, slba;
+ u16 status, nlb, nr;
+ int ret, i;
+
+ nr = req->cmd->copy.nr_range + 1;
+ if (nr <= UIO_FASTIOV) {
+ bv = fast_bv;
+ } else {
+ bv = kmalloc_array(nr, sizeof(*bv), GFP_KERNEL);
+ if (!bv) {
+ status = NVME_SC_INTERNAL;
+ goto done;
+ }
+ }
+
+ for (i = 0; i < nr; i++) {
+ status = nvmet_copy_from_sgl(req, i * sizeof(range), &range,
+ sizeof(range));
+ if (status)
+ goto done;
+
+ slba = le64_to_cpu(range.slba);
+ nlb = le16_to_cpu(range.nlb) + 1;
+ bv[i].bv_sector = nvmet_lba_to_sect(req->ns, slba);
+ bv[i].bv_sectors = nvmet_lba_to_sect(req->ns, nlb);
+ }
+
+ dst_sector = nvmet_lba_to_sect(req->ns, req->cmd->copy.sdlba);
+ ret = blkdev_copy_range(req->ns->bdev, dst_sector, bv, nr, GFP_KERNEL);
+ if (ret) {
+ req->error_slba = le64_to_cpu(dst_sector);
+ status = errno_to_nvme_status(req, ret);
+ } else
+ status = NVME_SC_SUCCESS;
+done:
+ nvmet_req_complete(req, status);
+ if (bv != fast_bv)
+ kfree(bv);
+}
+
static void nvmet_bdev_execute_dsm(struct nvmet_req *req)
{
if (!nvmet_check_data_len_lte(req, nvmet_dsm_len(req)))
@@ -474,6 +523,9 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
case nvme_cmd_write_zeroes:
req->execute = nvmet_bdev_execute_write_zeroes;
return 0;
+ case nvme_cmd_copy:
+ req->execute = nvmet_bdev_execute_copy;
+ return 0;
default:
return nvmet_report_invalid_opcode(req);
}
--
2.47.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 3/5] nvme: add support for copy offload
2025-05-21 22:31 ` [PATCH 3/5] nvme: " Keith Busch
@ 2025-05-22 0:47 ` Caleb Sander Mateos
2025-05-22 0:51 ` Caleb Sander Mateos
2025-05-22 13:54 ` Hannes Reinecke
2025-06-09 9:29 ` Niklas Cassel
2 siblings, 1 reply; 46+ messages in thread
From: Caleb Sander Mateos @ 2025-05-22 0:47 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-block, linux-nvme, Keith Busch
On Wed, May 21, 2025 at 3:31 PM Keith Busch <kbusch@meta.com> wrote:
>
> From: Keith Busch <kbusch@kernel.org>
>
> Register the nvme namespace copy capablities with the request_queue
nit: "capabilities"
> limits and implement support for the REQ_OP_COPY operation.
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> drivers/nvme/host/core.c | 61 ++++++++++++++++++++++++++++++++++++++++
> include/linux/nvme.h | 42 ++++++++++++++++++++++++++-
> 2 files changed, 102 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f69a232a000ac..3134fe85b1abc 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -888,6 +888,52 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
> return BLK_STS_OK;
> }
>
> +static inline blk_status_t nvme_setup_copy(struct nvme_ns *ns,
> + struct request *req, struct nvme_command *cmnd)
> +{
> + struct nvme_copy_range *range;
> + struct req_iterator iter;
> + struct bio_vec bvec;
> + u16 control = 0;
> + int i = 0;
Make this unsigned to avoid sign extension when used as an index?
> +
> + static const size_t alloc_size = sizeof(*range) * NVME_COPY_MAX_RANGES;
> +
> + if (WARN_ON_ONCE(blk_rq_nr_phys_segments(req) >= NVME_COPY_MAX_RANGES))
Should be > instead of >=?
> + return BLK_STS_IOERR;
> +
> + range = kzalloc(alloc_size, GFP_ATOMIC | __GFP_NOWARN);
> + if (!range)
> + return BLK_STS_RESOURCE;
> +
> + if (req->cmd_flags & REQ_FUA)
> + control |= NVME_RW_FUA;
> + if (req->cmd_flags & REQ_FAILFAST_DEV)
> + control |= NVME_RW_LR;
> +
> + rq_for_each_copy_bvec(bvec, req, iter) {
> + u64 slba = nvme_sect_to_lba(ns->head, bvec.bv_sector);
> + u64 nlb = nvme_sect_to_lba(ns->head, bvec.bv_sectors) - 1;
> +
> + range[i].slba = cpu_to_le64(slba);
> + range[i].nlb = cpu_to_le16(nlb);
> + i++;
> + }
> +
> + memset(cmnd, 0, sizeof(*cmnd));
> + cmnd->copy.opcode = nvme_cmd_copy;
> + cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
> + cmnd->copy.nr_range = i - 1;
> + cmnd->copy.sdlba = cpu_to_le64(nvme_sect_to_lba(ns->head,
> + blk_rq_pos(req)));
> + cmnd->copy.control = cpu_to_le16(control);
> +
> + bvec_set_virt(&req->special_vec, range, alloc_size);
alloc_size should be sizeof(*range) * i? Otherwise this exceeds the
amount of data used by the Copy command, which not all controllers
support (see bit LLDTS of SGLS in the Identify Controller data
structure). We have seen the same behavior with Dataset Management
(always specifying 4 KB of data), which also passes the maximum size
of the allocation to bvec_set_virt().
> + req->rq_flags |= RQF_SPECIAL_PAYLOAD;
> +
> + return BLK_STS_OK;
> +}
> +
> static void nvme_set_app_tag(struct request *req, struct nvme_command *cmnd)
> {
> cmnd->rw.lbat = cpu_to_le16(bio_integrity(req->bio)->app_tag);
> @@ -1106,6 +1152,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
> case REQ_OP_DISCARD:
> ret = nvme_setup_discard(ns, req, cmd);
> break;
> + case REQ_OP_COPY:
> + ret = nvme_setup_copy(ns, req, cmd);
> + break;
> case REQ_OP_READ:
> ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
> break;
> @@ -2119,6 +2168,15 @@ static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
> lim->max_write_zeroes_sectors = UINT_MAX;
> else
> lim->max_write_zeroes_sectors = ns->ctrl->max_zeroes_sectors;
> +
> + if (ns->ctrl->oncs & NVME_CTRL_ONCS_NVMCPYS && id->mssrl && id->mcl) {
Are the checks of MSSRL and MCL necessary? The spec says controllers
that support Copy are not allowed to set them to 0.
Best,
Caleb
> + u32 mcss = bs * le16_to_cpu(id->mssrl) >> SECTOR_SHIFT;
> + u32 mcs = bs * le32_to_cpu(id->mcl) >> SECTOR_SHIFT;
> +
> + lim->max_copy_segment_sectors = mcss;
> + lim->max_copy_sectors = mcs;
> + lim->max_copy_segments = id->msrc + 1;
> + }
> return valid;
> }
>
> @@ -2526,6 +2584,9 @@ static int nvme_update_ns_info(struct nvme_ns *ns, struct nvme_ns_info *info)
> nvme_init_integrity(ns->head, &lim, info);
> lim.max_write_streams = ns_lim->max_write_streams;
> lim.write_stream_granularity = ns_lim->write_stream_granularity;
> + lim.max_copy_segment_sectors = ns_lim->max_copy_segment_sectors;
> + lim.max_copy_sectors = ns_lim->max_copy_sectors;
> + lim.max_copy_segments = ns_lim->max_copy_segments;
> ret = queue_limits_commit_update(ns->head->disk->queue, &lim);
>
> set_capacity_and_notify(ns->head->disk, get_capacity(ns->disk));
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index 51308f65b72fd..14f46ad1330b6 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -404,6 +404,7 @@ enum {
> NVME_CTRL_ONCS_WRITE_ZEROES = 1 << 3,
> NVME_CTRL_ONCS_RESERVATIONS = 1 << 5,
> NVME_CTRL_ONCS_TIMESTAMP = 1 << 6,
> + NVME_CTRL_ONCS_NVMCPYS = 1 << 8,
> NVME_CTRL_VWC_PRESENT = 1 << 0,
> NVME_CTRL_OACS_SEC_SUPP = 1 << 0,
> NVME_CTRL_OACS_NS_MNGT_SUPP = 1 << 3,
> @@ -458,7 +459,10 @@ struct nvme_id_ns {
> __le16 npdg;
> __le16 npda;
> __le16 nows;
> - __u8 rsvd74[18];
> + __le16 mssrl;
> + __le32 mcl;
> + __u8 msrc;
> + __u8 rsvd81[11];
> __le32 anagrpid;
> __u8 rsvd96[3];
> __u8 nsattr;
> @@ -956,6 +960,7 @@ enum nvme_opcode {
> nvme_cmd_resv_acquire = 0x11,
> nvme_cmd_io_mgmt_recv = 0x12,
> nvme_cmd_resv_release = 0x15,
> + nvme_cmd_copy = 0x19,
> nvme_cmd_zone_mgmt_send = 0x79,
> nvme_cmd_zone_mgmt_recv = 0x7a,
> nvme_cmd_zone_append = 0x7d,
> @@ -978,6 +983,7 @@ enum nvme_opcode {
> nvme_opcode_name(nvme_cmd_resv_acquire), \
> nvme_opcode_name(nvme_cmd_io_mgmt_recv), \
> nvme_opcode_name(nvme_cmd_resv_release), \
> + nvme_opcode_name(nvme_cmd_copy), \
> nvme_opcode_name(nvme_cmd_zone_mgmt_send), \
> nvme_opcode_name(nvme_cmd_zone_mgmt_recv), \
> nvme_opcode_name(nvme_cmd_zone_append))
> @@ -1158,6 +1164,39 @@ struct nvme_dsm_range {
> __le64 slba;
> };
>
> +struct nvme_copy_cmd {
> + __u8 opcode;
> + __u8 flags;
> + __u16 command_id;
> + __le32 nsid;
> + __u64 rsvd2;
> + __le64 metadata;
> + union nvme_data_ptr dptr;
> + __le64 sdlba;
> + __u8 nr_range;
> + __u8 format;
> + __le16 control;
> + __le16 cev;
> + __le16 dspec;
> + __le32 lbtl;
> + __le16 lbat;
> + __le16 lbatm;
> +};
> +
> +#define NVME_COPY_MAX_RANGES 128
> +struct nvme_copy_range {
> + __le32 spars;
> + __u32 rsvd4;
> + __le64 slba;
> + __le16 nlb;
> + __le16 cetype;
> + __le16 cev;
> + __le16 sopt;
> + __le32 elbt;
> + __le16 elbat;
> + __le16 elbatm;
> +};
> +
> struct nvme_write_zeroes_cmd {
> __u8 opcode;
> __u8 flags;
> @@ -1985,6 +2024,7 @@ struct nvme_command {
> struct nvme_download_firmware dlfw;
> struct nvme_format_cmd format;
> struct nvme_dsm_cmd dsm;
> + struct nvme_copy_cmd copy;
> struct nvme_write_zeroes_cmd write_zeroes;
> struct nvme_zone_mgmt_send_cmd zms;
> struct nvme_zone_mgmt_recv_cmd zmr;
> --
> 2.47.1
>
>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/5] nvme: add support for copy offload
2025-05-22 0:47 ` Caleb Sander Mateos
@ 2025-05-22 0:51 ` Caleb Sander Mateos
2025-05-22 3:23 ` Keith Busch
0 siblings, 1 reply; 46+ messages in thread
From: Caleb Sander Mateos @ 2025-05-22 0:51 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-block, linux-nvme, Keith Busch
On Wed, May 21, 2025 at 5:47 PM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> On Wed, May 21, 2025 at 3:31 PM Keith Busch <kbusch@meta.com> wrote:
> >
> > From: Keith Busch <kbusch@kernel.org>
> >
> > Register the nvme namespace copy capablities with the request_queue
>
> nit: "capabilities"
>
> > limits and implement support for the REQ_OP_COPY operation.
> >
> > Signed-off-by: Keith Busch <kbusch@kernel.org>
> > ---
> > drivers/nvme/host/core.c | 61 ++++++++++++++++++++++++++++++++++++++++
> > include/linux/nvme.h | 42 ++++++++++++++++++++++++++-
> > 2 files changed, 102 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index f69a232a000ac..3134fe85b1abc 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -888,6 +888,52 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
> > return BLK_STS_OK;
> > }
> >
> > +static inline blk_status_t nvme_setup_copy(struct nvme_ns *ns,
> > + struct request *req, struct nvme_command *cmnd)
> > +{
> > + struct nvme_copy_range *range;
> > + struct req_iterator iter;
> > + struct bio_vec bvec;
> > + u16 control = 0;
> > + int i = 0;
>
> Make this unsigned to avoid sign extension when used as an index?
>
> > +
> > + static const size_t alloc_size = sizeof(*range) * NVME_COPY_MAX_RANGES;
> > +
> > + if (WARN_ON_ONCE(blk_rq_nr_phys_segments(req) >= NVME_COPY_MAX_RANGES))
>
> Should be > instead of >=?
>
> > + return BLK_STS_IOERR;
> > +
> > + range = kzalloc(alloc_size, GFP_ATOMIC | __GFP_NOWARN);
> > + if (!range)
> > + return BLK_STS_RESOURCE;
> > +
> > + if (req->cmd_flags & REQ_FUA)
> > + control |= NVME_RW_FUA;
> > + if (req->cmd_flags & REQ_FAILFAST_DEV)
> > + control |= NVME_RW_LR;
> > +
> > + rq_for_each_copy_bvec(bvec, req, iter) {
> > + u64 slba = nvme_sect_to_lba(ns->head, bvec.bv_sector);
> > + u64 nlb = nvme_sect_to_lba(ns->head, bvec.bv_sectors) - 1;
> > +
> > + range[i].slba = cpu_to_le64(slba);
> > + range[i].nlb = cpu_to_le16(nlb);
> > + i++;
> > + }
> > +
> > + memset(cmnd, 0, sizeof(*cmnd));
> > + cmnd->copy.opcode = nvme_cmd_copy;
> > + cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
> > + cmnd->copy.nr_range = i - 1;
> > + cmnd->copy.sdlba = cpu_to_le64(nvme_sect_to_lba(ns->head,
> > + blk_rq_pos(req)));
> > + cmnd->copy.control = cpu_to_le16(control);
> > +
> > + bvec_set_virt(&req->special_vec, range, alloc_size);
>
> alloc_size should be sizeof(*range) * i? Otherwise this exceeds the
> amount of data used by the Copy command, which not all controllers
> support (see bit LLDTS of SGLS in the Identify Controller data
> structure). We have seen the same behavior with Dataset Management
> (always specifying 4 KB of data), which also passes the maximum size
> of the allocation to bvec_set_virt().
I see that was added in commit 530436c45ef2e ("nvme: Discard
workaround for non-conformant devices"). I would rather wait for
evidence of non-conformant devices supporting Copy before implementing
the same spec-noncompliant workaround. It could be a quirk if
necessary.
Best,
Caleb
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/5] nvme: add support for copy offload
2025-05-22 0:51 ` Caleb Sander Mateos
@ 2025-05-22 3:23 ` Keith Busch
2025-05-22 3:41 ` Caleb Sander Mateos
0 siblings, 1 reply; 46+ messages in thread
From: Keith Busch @ 2025-05-22 3:23 UTC (permalink / raw)
To: Caleb Sander Mateos; +Cc: Keith Busch, linux-block, linux-nvme
On Wed, May 21, 2025 at 05:51:03PM -0700, Caleb Sander Mateos wrote:
> On Wed, May 21, 2025 at 5:47 PM Caleb Sander Mateos
> >
> > alloc_size should be sizeof(*range) * i? Otherwise this exceeds the
> > amount of data used by the Copy command, which not all controllers
> > support (see bit LLDTS of SGLS in the Identify Controller data
> > structure). We have seen the same behavior with Dataset Management
> > (always specifying 4 KB of data), which also passes the maximum size
> > of the allocation to bvec_set_virt().
>
> I see that was added in commit 530436c45ef2e ("nvme: Discard
> workaround for non-conformant devices"). I would rather wait for
> evidence of non-conformant devices supporting Copy before implementing
> the same spec-noncompliant workaround. It could be a quirk if
> necessary.
Right, that's exactly why I didn't bother allocating tighter to what the
command actually needs. The number of devices that would have needed a
DSM quirk for the full 4k was untenable, so making the quirk behavior
the default was a sane compromise. I suppose Copy is a more enterprisey
feature in comparison, so maybe we can count on devices doing dma
correctly?
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/5] nvme: add support for copy offload
2025-05-22 3:23 ` Keith Busch
@ 2025-05-22 3:41 ` Caleb Sander Mateos
2025-05-22 4:29 ` Keith Busch
2025-05-23 12:48 ` Christoph Hellwig
0 siblings, 2 replies; 46+ messages in thread
From: Caleb Sander Mateos @ 2025-05-22 3:41 UTC (permalink / raw)
To: Keith Busch; +Cc: Keith Busch, linux-block, linux-nvme
On Wed, May 21, 2025 at 8:23 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, May 21, 2025 at 05:51:03PM -0700, Caleb Sander Mateos wrote:
> > On Wed, May 21, 2025 at 5:47 PM Caleb Sander Mateos
> > >
> > > alloc_size should be sizeof(*range) * i? Otherwise this exceeds the
> > > amount of data used by the Copy command, which not all controllers
> > > support (see bit LLDTS of SGLS in the Identify Controller data
> > > structure). We have seen the same behavior with Dataset Management
> > > (always specifying 4 KB of data), which also passes the maximum size
> > > of the allocation to bvec_set_virt().
> >
> > I see that was added in commit 530436c45ef2e ("nvme: Discard
> > workaround for non-conformant devices"). I would rather wait for
> > evidence of non-conformant devices supporting Copy before implementing
> > the same spec-noncompliant workaround. It could be a quirk if
> > necessary.
>
> Right, that's exactly why I didn't bother allocating tighter to what the
> command actually needs. The number of devices that would have needed a
> DSM quirk for the full 4k was untenable, so making the quirk behavior
> the default was a sane compromise. I suppose Copy is a more enterprisey
> feature in comparison, so maybe we can count on devices doing dma
> correctly?
For the record, that change broke Linux hosts sending DSM commands to
our NVMe controller, which was validating that the SGL length exactly
matches the number of data bytes implied by the command. I'm sure
we're in the minority of NVMe controller vendors in aggressively
validating the NVMe command parameters, but it was unfortunate to
discover this change in Linux's behavior. We have since relaxed the
validation on the controller, so we wouldn't reject these Copy
commands. But the extra data in the NVMe command is still wasteful for
transports like NVMe/TCP in-capsule data where the entire data is
unconditionally transferred to the controller.
It would be great to start with a presumption of spec compliance. But
by all means, if we find a significant fraction of controller
implementations can't tolerate a sub-page data buffer, we can apply
the workaround from DSM.
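To be concrete, the sizing I'd expect is something along these lines in
nvme_setup_copy() (sketch only; it assumes the special_vec is set up via
bvec_set_virt() as for DSM, with 'i' being the number of ranges actually
filled in, as in the posted patch):

	/*
	 * Map only the descriptors the command actually uses instead of
	 * the full NVME_COPY_MAX_RANGES-sized allocation.
	 */
	bvec_set_virt(&req->special_vec, range, sizeof(*range) * i);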
Best,
Caleb
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/5] nvme: add support for copy offload
2025-05-22 3:41 ` Caleb Sander Mateos
@ 2025-05-22 4:29 ` Keith Busch
2025-05-22 14:16 ` Caleb Sander Mateos
2025-05-23 12:49 ` Christoph Hellwig
2025-05-23 12:48 ` Christoph Hellwig
1 sibling, 2 replies; 46+ messages in thread
From: Keith Busch @ 2025-05-22 4:29 UTC (permalink / raw)
To: Caleb Sander Mateos; +Cc: Keith Busch, linux-block, linux-nvme
On Wed, May 21, 2025 at 08:41:40PM -0700, Caleb Sander Mateos wrote:
> For the record, that change broke Linux hosts sending DSM commands to
> our NVMe controller, which was validating that the SGL length exactly
> matches the number of data bytes implied by the command. I'm sure
> we're in the minority of NVMe controller vendors in aggressively
> validating the NVMe command parameters, but it was unfortunate to
> discover this change in Linux's behavior.
This is a fabrics target you're talking about? I assume so because pci
would use PRP for a 4k payload, which doesn't encode transfer lengths.
All the offending controllers were pci, so maybe we could have
constrained the DSM over-allocation to that transport if we knew this
was causing problems for fabrics.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/5] block: new sector copy api
2025-05-21 22:31 ` [PATCH 1/5] block: new sector copy api Keith Busch
@ 2025-05-22 10:02 ` Hannes Reinecke
2025-05-22 16:43 ` Keith Busch
2025-05-22 19:22 ` Bart Van Assche
2025-05-23 12:45 ` Christoph Hellwig
2 siblings, 1 reply; 46+ messages in thread
From: Hannes Reinecke @ 2025-05-22 10:02 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme; +Cc: Keith Busch
On 5/22/25 00:31, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> Provide a basic block level api to copy a range of a block device's
> sectors to a new destination on the same device. This just reads the
> source data into host memory, then writes it back out to the device at
> the requested destination.
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> block/blk-lib.c | 62 +++++++++++++++++++++++++++++++++++++++++
> block/ioctl.c | 30 ++++++++++++++++++++
> include/linux/blkdev.h | 2 ++
> include/uapi/linux/fs.h | 3 ++
> 4 files changed, 97 insertions(+)
>
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 4c9f20a689f7b..a819ded0ed3a9 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -368,3 +368,65 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
> return ret;
> }
> EXPORT_SYMBOL(blkdev_issue_secure_erase);
> +
> +/**
> + * blkdev_copy - copy source sectors to a destination on the same block device
> + * @dst_sector: start sector of the destination to copy to
> + * @src_sector: start sector of the source to copy from
> + * @nr_sects: number of sectors to copy
> + * @gfp: allocation flags to use
> + */
> +int blkdev_copy(struct block_device *bdev, sector_t dst_sector,
> + sector_t src_sector, sector_t nr_sects, gfp_t gfp)
> +{
Hmm. This interface is for copies _within_ the same bdev only.
Shouldn't we rather expand it to have _two_ bdev arguments to
eventually handle copies between bdevs?
In the end the function itself wouldn't change...
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 2/5] block: add support for copy offload
2025-05-21 22:31 ` [PATCH 2/5] block: add support for copy offload Keith Busch
@ 2025-05-22 13:49 ` Hannes Reinecke
2025-05-23 12:46 ` Christoph Hellwig
1 sibling, 0 replies; 46+ messages in thread
From: Hannes Reinecke @ 2025-05-22 13:49 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme; +Cc: Keith Busch
On 5/22/25 00:31, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> Various storage protocols can support offloading block data copies.
> Enhance the block layer to know about the device's copying capabilities,
> introduce the new REQ_OP_COPY operation, and provide the infrastructure
> to iterate, split, and merge these kinds of requests.
>
> A copy command must provide the device with a list of source LBAs and
> their lengths, and a destination LBA. The 'struct bio' type doesn't
> readily have a way to describe such a thing. But a copy request doesn't
> use host memory for data, so the bio's bio_vec is unused space. This
> patch adds a new purpose to the bio_vec where it can provide a vector of
> sectors instead of memory pages.
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> block/bio.c | 25 ++++++++++++++
> block/blk-core.c | 4 +++
> block/blk-lib.c | 47 ++++++++++++++++++++++-----
> block/blk-merge.c | 28 +++++++++++++++-
> block/blk-sysfs.c | 9 ++++++
> block/blk.h | 17 +++++++++-
> include/linux/bio.h | 20 ++++++++++++
> include/linux/blk-mq.h | 5 +++
> include/linux/blk_types.h | 2 ++
> include/linux/blkdev.h | 14 ++++++++
> include/linux/bvec.h | 68 +++++++++++++++++++++++++++++++++++++--
> 11 files changed, 226 insertions(+), 13 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 3c0a558c90f52..9c73a895c987b 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1156,6 +1156,31 @@ void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter)
> bio_set_flag(bio, BIO_CLONED);
> }
>
> +static int bvec_try_merge_copy_src(struct bio *bio, struct bio_vec *src)
> +{
> + struct bio_vec *bv;
> +
> + if (!bio->bi_vcnt)
> + return false;
> +
> + bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> + if (bv->bv_sector + src->bv_sectors != src->bv_sector)
> + return false;
> +
> + bv->bv_sectors += src->bv_sectors;
> + return true;
> +}
> +
> +int bio_add_copy_src(struct bio *bio, struct bio_vec *src)
> +{
> + if (bvec_try_merge_copy_src(bio, src))
> + return 0;
> + if (bio->bi_vcnt >= bio->bi_max_vecs)
> + return -EINVAL;
> + bio->bi_io_vec[bio->bi_vcnt++] = *src;
> + return 0;
> +}
> +
> static unsigned int get_contig_folio_len(unsigned int *num_pages,
> struct page **pages, unsigned int i,
> struct folio *folio, size_t left,
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b862c66018f25..cb3d9879e2d65 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -837,6 +837,10 @@ void submit_bio_noacct(struct bio *bio)
> if (!bdev_max_discard_sectors(bdev))
> goto not_supported;
> break;
> + case REQ_OP_COPY:
> + if (!bdev_copy_sectors(bdev))
> + goto not_supported;
> + break;
> case REQ_OP_SECURE_ERASE:
> if (!bdev_max_secure_erase_sectors(bdev))
> goto not_supported;
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index a819ded0ed3a9..a538acbaa2cd7 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -369,14 +369,7 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
> }
> EXPORT_SYMBOL(blkdev_issue_secure_erase);
>
> -/**
> - * blkdev_copy - copy source sectors to a destination on the same block device
> - * @dst_sector: start sector of the destination to copy to
> - * @src_sector: start sector of the source to copy from
> - * @nr_sects: number of sectors to copy
> - * @gfp: allocation flags to use
> - */
> -int blkdev_copy(struct block_device *bdev, sector_t dst_sector,
> +static int __blkdev_copy(struct block_device *bdev, sector_t dst_sector,
> sector_t src_sector, sector_t nr_sects, gfp_t gfp)
> {
> unsigned int nr_vecs = __blkdev_sectors_to_bio_pages(nr_sects);
That's a bit odd, renaming a just introduced function.
But if you must...
Otherwise looks good.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/5] nvme: add support for copy offload
2025-05-21 22:31 ` [PATCH 3/5] nvme: " Keith Busch
2025-05-22 0:47 ` Caleb Sander Mateos
@ 2025-05-22 13:54 ` Hannes Reinecke
2025-05-23 12:50 ` Christoph Hellwig
2025-06-09 9:29 ` Niklas Cassel
2 siblings, 1 reply; 46+ messages in thread
From: Hannes Reinecke @ 2025-05-22 13:54 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme; +Cc: Keith Busch
On 5/22/25 00:31, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> Register the nvme namespace copy capablities with the request_queue
> limits and implement support for the REQ_OP_COPY operation.
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> drivers/nvme/host/core.c | 61 ++++++++++++++++++++++++++++++++++++++++
> include/linux/nvme.h | 42 ++++++++++++++++++++++++++-
> 2 files changed, 102 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f69a232a000ac..3134fe85b1abc 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -888,6 +888,52 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
> return BLK_STS_OK;
> }
>
> +static inline blk_status_t nvme_setup_copy(struct nvme_ns *ns,
> + struct request *req, struct nvme_command *cmnd)
> +{
> + struct nvme_copy_range *range;
> + struct req_iterator iter;
> + struct bio_vec bvec;
> + u16 control = 0;
> + int i = 0;
> +
> + static const size_t alloc_size = sizeof(*range) * NVME_COPY_MAX_RANGES;
> +
> + if (WARN_ON_ONCE(blk_rq_nr_phys_segments(req) >= NVME_COPY_MAX_RANGES))
> + return BLK_STS_IOERR;
> +
> + range = kzalloc(alloc_size, GFP_ATOMIC | __GFP_NOWARN);
> + if (!range)
> + return BLK_STS_RESOURCE;
> +
> + if (req->cmd_flags & REQ_FUA)
> + control |= NVME_RW_FUA;
> + if (req->cmd_flags & REQ_FAILFAST_DEV)
> + control |= NVME_RW_LR;
FAILFAST_DEV? Is that even set anywhere?
Otherwise looks ok.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 4/5] block: add support for vectored copies
2025-05-21 22:31 ` [PATCH 4/5] block: add support for vectored copies Keith Busch
@ 2025-05-22 13:58 ` Hannes Reinecke
2025-05-22 16:36 ` Keith Busch
0 siblings, 1 reply; 46+ messages in thread
From: Hannes Reinecke @ 2025-05-22 13:58 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme; +Cc: Keith Busch
On 5/22/25 00:31, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> Copy offload can be used to defrad or garbage collect data spread across
Defrag?
> the disk. Most storage protocols provide a way to specifiy multiple
> sources in a single copy commnd, so introduce kernel and user space
> interfaces to accomplish that.
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> block/blk-lib.c | 50 ++++++++++++++++++++++++----------
> block/ioctl.c | 59 +++++++++++++++++++++++++++++++++++++++++
> include/linux/blkdev.h | 2 ++
> include/uapi/linux/fs.h | 14 ++++++++++
> 4 files changed, 111 insertions(+), 14 deletions(-)
>
Any specific reason why this is a different patch and not folded into
patch 2? It really feels odd to keep updating interfaces that were just
added within the same patchset...
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index a538acbaa2cd7..7513b876a5399 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -424,26 +424,46 @@ static int __blkdev_copy(struct block_device *bdev, sector_t dst_sector,
> }
>
> static int blkdev_copy_offload(struct block_device *bdev, sector_t dst_sector,
> - sector_t src_sector, sector_t nr_sects, gfp_t gfp)
> + struct bio_vec *bv, int nr_vecs, gfp_t gfp)
> {
> + unsigned size = 0;
> struct bio *bio;
> - int ret;
> -
> - struct bio_vec bv = {
> - .bv_sector = src_sector,
> - .bv_sectors = nr_sects,
> - };
> + int ret, i;
>
> - bio = bio_alloc(bdev, 1, REQ_OP_COPY, gfp);
> - bio_add_copy_src(bio, &bv);
> + bio = bio_alloc(bdev, nr_vecs, REQ_OP_COPY, gfp);
> + for (i = 0; i < nr_vecs; i++) {
> + size += bv[i].bv_sectors << SECTOR_SHIFT;
> + bio_add_copy_src(bio, &bv[i]);
> + }
> bio->bi_iter.bi_sector = dst_sector;
> - bio->bi_iter.bi_size = nr_sects << SECTOR_SHIFT;
> + bio->bi_iter.bi_size = size;
>
> ret = submit_bio_wait(bio);
> bio_put(bio);
> return ret;
> +}
> +
> +/**
> + * blkdev_copy_range - copy range of sectors to a destination
> + * @dst_sector: start sector of the destination to copy to
> + * @bv: vector of source sectors
> + * @nr_vecs: number of source sector vectors
> + * @gfp: allocation flags to use
> + */
> +int blkdev_copy_range(struct block_device *bdev, sector_t dst_sector,
> + struct bio_vec *bv, int nr_vecs, gfp_t gfp)
> +{
> + int ret, i;
>
> + if (bdev_copy_sectors(bdev))
> + return blkdev_copy_offload(bdev, dst_sector, bv, nr_vecs, gfp);
> +
> + for (i = 0, ret = 0; i < nr_vecs && !ret; i++)
> + ret = __blkdev_copy(bdev, dst_sector, bv[i].bv_sector,
> + bv[i].bv_sectors, gfp);
> + return ret;
> }
> +EXPORT_SYMBOL_GPL(blkdev_copy_range);
>
> /**
> * blkdev_copy - copy source sectors to a destination on the same block device
> @@ -455,9 +475,11 @@ static int blkdev_copy_offload(struct block_device *bdev, sector_t dst_sector,
> int blkdev_copy(struct block_device *bdev, sector_t dst_sector,
> sector_t src_sector, sector_t nr_sects, gfp_t gfp)
> {
> - if (bdev_copy_sectors(bdev))
> - return blkdev_copy_offload(bdev, dst_sector, src_sector,
> - nr_sects, gfp);
> - return __blkdev_copy(bdev, dst_sector, src_sector, nr_sects, gfp);
> + struct bio_vec bv = {
> + .bv_sector = src_sector,
> + .bv_sectors = nr_sects,
> + };
> +
> + return blkdev_copy_range(bdev, dst_sector, &bv, 1, gfp);
> }
> EXPORT_SYMBOL_GPL(blkdev_copy);
> diff --git a/block/ioctl.c b/block/ioctl.c
> index 6f03c65867348..4b5095be19e1a 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -241,6 +241,63 @@ static int blk_ioctl_copy(struct block_device *bdev, blk_mode_t mode,
> return blkdev_copy(bdev, dst, src, nr, GFP_KERNEL);
> }
>
> +static int blk_ioctl_copy_vec(struct block_device *bdev, blk_mode_t mode,
> + void __user *argp)
> +{
> + sector_t align = bdev_logical_block_size(bdev) >> SECTOR_SHIFT;
> + struct bio_vec *bv, fast_bv[UIO_FASTIOV];
> + struct copy_range cr;
> + int i, nr, ret;
> + __u64 dst;
> +
> + if (!(mode & BLK_OPEN_WRITE))
> + return -EBADF;
> + if (copy_from_user(&cr, argp, sizeof(cr)))
> + return -EFAULT;
> + if (!(IS_ALIGNED(cr.dst_sector, align)))
> + return -EINVAL;
> +
> + nr = cr.nr_ranges;
> + if (nr <= UIO_FASTIOV) {
> + bv = fast_bv;
> + } else {
> + bv = kmalloc_array(nr, sizeof(*bv), GFP_KERNEL);
> + if (!bv)
> + return -ENOMEM;
> + }
> +
> + dst = cr.dst_sector;
> + for (i = 0; i < nr; i++) {
> + struct copy_source csrc;
> + __u64 nr_sects, src;
> +
> + if (copy_from_user(&csrc,
> + (void __user *)(cr.sources + i * sizeof(csrc)),
> + sizeof(csrc))) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + nr_sects = csrc.nr_sectors;
> + src = csrc.src_sector;
> + if (!(IS_ALIGNED(src | nr_sects, align)) ||
> + (src < dst && src + nr_sects > dst) ||
> + (dst < src && dst + nr_sects > src)) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + bv[i].bv_sectors = nr_sects;
> + bv[i].bv_sector = src;
> + }
> +
> + ret = blkdev_copy_range(bdev, dst, bv, nr, GFP_KERNEL);
> +out:
> + if (bv != fast_bv)
> + kfree(bv);
> + return ret;
> +}
> +
> static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
> unsigned long arg)
> {
> @@ -605,6 +662,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, blk_mode_t mode,
> return blk_ioctl_secure_erase(bdev, mode, argp);
> case BLKCPY:
> return blk_ioctl_copy(bdev, mode, argp);
> + case BLKCPY_VEC:
> + return blk_ioctl_copy_vec(bdev, mode, argp);
> case BLKZEROOUT:
> return blk_ioctl_zeroout(bdev, mode, arg);
> case BLKGETDISKSEQ:
And that makes it even worse; introducing two ioctls which basically do
the same thing (or where one is actually a special case of the other)
is probably not what we should be doing.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 5/5] nvmet: implement copy support for bdev backed target
2025-05-21 22:31 ` [PATCH 5/5] nvmet: implement copy support for bdev backed target Keith Busch
@ 2025-05-22 13:59 ` Hannes Reinecke
2025-05-23 13:18 ` Christoph Hellwig
1 sibling, 0 replies; 46+ messages in thread
From: Hannes Reinecke @ 2025-05-22 13:59 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme; +Cc: Keith Busch
On 5/22/25 00:31, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> The nvme block device target type does not have any particular limits on
> copy commands, so all the settings are the protocol's max.
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> drivers/nvme/target/io-cmd-bdev.c | 52 +++++++++++++++++++++++++++++++
> 1 file changed, 52 insertions(+)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/5] nvme: add support for copy offload
2025-05-22 4:29 ` Keith Busch
@ 2025-05-22 14:16 ` Caleb Sander Mateos
2025-05-23 12:49 ` Christoph Hellwig
1 sibling, 0 replies; 46+ messages in thread
From: Caleb Sander Mateos @ 2025-05-22 14:16 UTC (permalink / raw)
To: Keith Busch; +Cc: Keith Busch, linux-block, linux-nvme
On Wed, May 21, 2025 at 9:30 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, May 21, 2025 at 08:41:40PM -0700, Caleb Sander Mateos wrote:
> > For the record, that change broke Linux hosts sending DSM commands to
> > our NVMe controller, which was validating that the SGL length exactly
> > matches the number of data bytes implied by the command. I'm sure
> > we're in the minority of NVMe controller vendors in aggressively
> > validating the NVMe command parameters, but it was unfortunate to
> > discover this change in Linux's behavior.
>
> This is a fabrics target you're talking about? I assume so because pci
> would use PRP for a 4k payload, which doesn't encode transfer lengths.
> All the offending controllers were pci, so maybe we could have
> constrained the DSM over-allocation to that transport if we knew this
> was causing problems for fabrics.
Yes, fabrics. I would be fine with always mapping the full 4K data
length for PCI but using the exact length of the ranges for fabrics.
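Something along these lines, for example (sketch; NVME_F_FABRICS is the
existing ops flag, the rest of the variable names are assumed):

	size_t used = sizeof(*range) * segments;

	/*
	 * Fabrics transports encode the exact transfer length, so map only
	 * what the command implies there. Keep the legacy 4K mapping on
	 * PCI, where PRPs don't encode a length anyway.
	 */
	if (ns->ctrl->ops->flags & NVME_F_FABRICS)
		bvec_set_virt(&req->special_vec, range, used);
	else
		bvec_set_virt(&req->special_vec, range, alloc_size);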
Best,
Caleb
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 0/5] block: another block copy offload
2025-05-21 22:31 [PATCH 0/5] block: another block copy offload Keith Busch
` (4 preceding siblings ...)
2025-05-21 22:31 ` [PATCH 5/5] nvmet: implement copy support for bdev backed target Keith Busch
@ 2025-05-22 15:52 ` Bart Van Assche
2025-05-23 12:53 ` Christoph Hellwig
2025-07-03 14:47 ` Niklas Cassel
6 siblings, 1 reply; 46+ messages in thread
From: Bart Van Assche @ 2025-05-22 15:52 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme; +Cc: Keith Busch
On 5/21/25 3:31 PM, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> I was never happy with previous block copy offload attempts, so I had to
> take a stab at it. And I was recently asked to take a look at this, so
> here goes.
>
> Some key implementation differences from previous approaches:
>
> 1. Only one bio is needed to describe a copy request, so no plugging
> or dispatch tricks required. Like read and write requests, these
> can be artbitrarily large and will be split as needed based on the
> request_queue's limits. The bio's are mergeable with other copy
> commands on adjacent destination sectors.
>
> 2. You can describe as many source sectors as you want in a vector in
> a single bio. This aligns with the nvme protocol's Copy implementation,
> which can be used to efficiently defragment scattered blocks into a
> contiguous destination with a single command.
>
> Oh, and the nvme-target support was included with this patchset too, so
> there's a purely in-kernel way to test out the code paths if you don't
> have otherwise capable hardware. I also used qemu since that nvme device
> supports copy offload too.
Before any copy offloading code is merged in the upstream kernel, a plan
should be presented for how this partial implementation will evolve into
a generic solution. As Hannes already pointed out, support for
copying between block devices is missing. SCSI support is missing.
Device mapper support is missing. What is the plan for evolving this
patch series into a generic solution? Is it even possible to evolve this
patch series into a generic solution?
Thanks,
Bart.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 4/5] block: add support for vectored copies
2025-05-22 13:58 ` Hannes Reinecke
@ 2025-05-22 16:36 ` Keith Busch
0 siblings, 0 replies; 46+ messages in thread
From: Keith Busch @ 2025-05-22 16:36 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: Keith Busch, linux-block, linux-nvme
On Thu, May 22, 2025 at 03:58:18PM +0200, Hannes Reinecke wrote:
> On 5/22/25 00:31, Keith Busch wrote:
> > ---
> > block/blk-lib.c | 50 ++++++++++++++++++++++++----------
> > block/ioctl.c | 59 +++++++++++++++++++++++++++++++++++++++++
> > include/linux/blkdev.h | 2 ++
> > include/uapi/linux/fs.h | 14 ++++++++++
> > 4 files changed, 111 insertions(+), 14 deletions(-)
> >
> Any specific reason why this is a different patch, and not folded into
> patch 2? It really feels odd to continuously updating interfaces which
> have been added with the same patchset...
Sure, I can do that if that's preferred. I just started this as simple
as possible, and added new capabilities from there. I thought having the
patch set show the journey might make it easier to review. If the
evolving interfaces are not helping, though, I don't mind squashing them.
> > case BLKCPY:
> > return blk_ioctl_copy(bdev, mode, argp);
> > + case BLKCPY_VEC:
> > + return blk_ioctl_copy_vec(bdev, mode, argp);
> > case BLKZEROOUT:
> > return blk_ioctl_zeroout(bdev, mode, arg);
> > case BLKGETDISKSEQ:
>
> And that makes it even worse; introducing two ioctls which basically do
> the same thing (or where one is actually a special case of the other)
> is probably not what we should be doing.
There are many interfaces that have both a single and a vectored form;
it's like read vs readv. The use cases I'm working with are in-kernel,
though, so I don't strictly need these user interfaces here, but they
have been great for testing. I developed some tests that would work
well in blktests, for example.
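For illustration, the vectored ioctl ends up being used from userspace
roughly like this (sketch; the struct layout is inferred from the ioctl
handler above and the device path is just an example):

	#include <stdint.h>
	#include <stdio.h>
	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	int main(void)
	{
		/* two discontiguous 4KiB source ranges, one destination */
		struct copy_source srcs[2] = {
			{ .src_sector = 0,    .nr_sectors = 8 },
			{ .src_sector = 1024, .nr_sectors = 8 },
		};
		struct copy_range cr = {
			.dst_sector = 4096,
			.nr_ranges  = 2,
			.sources    = (uintptr_t)srcs,
		};
		int fd = open("/dev/nvme0n1", O_RDWR);

		if (fd < 0 || ioctl(fd, BLKCPY_VEC, &cr) < 0) {
			perror("BLKCPY_VEC");
			return 1;
		}
		return 0;
	}

The single-range BLKCPY call is just the degenerate case of the above.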
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/5] block: new sector copy api
2025-05-22 10:02 ` Hannes Reinecke
@ 2025-05-22 16:43 ` Keith Busch
0 siblings, 0 replies; 46+ messages in thread
From: Keith Busch @ 2025-05-22 16:43 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: Keith Busch, linux-block, linux-nvme
On Thu, May 22, 2025 at 12:02:07PM +0200, Hannes Reinecke wrote:
> > +/**
> > + * blkdev_copy - copy source sectors to a destination on the same block device
> > + * @dst_sector: start sector of the destination to copy to
> > + * @src_sector: start sector of the source to copy from
> > + * @nr_sects: number of sectors to copy
> > + * @gfp: allocation flags to use
> > + */
> > +int blkdev_copy(struct block_device *bdev, sector_t dst_sector,
> > + sector_t src_sector, sector_t nr_sects, gfp_t gfp)
> > +{
>
> Hmm. This interface is for copies _within_ the same bdev only.
> Shouldn't we rather expand it to have _two_ bdev arguments to
> eventually handle copies between bdevs?
> In the end the function itself wouldn't change...
Sure. To start, I think it could just fall back to the non-offloaded
instrumented "copy" path if you have two different bdevs. Utilizing copy
offload across multiple devices is a bit more complex, so I'm focusing
on simple copy for now, but want to leave it flexible enough for such
future enhancements too.
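Roughly what I'm picturing for the dispatch, as a sketch (the two-bdev
signature and the __blkdev_copy2 fallback helper are hypothetical, and
it uses the single-range offload helper from patch 2 for brevity):

	int blkdev_copy2(struct block_device *src_bdev, sector_t src_sector,
			 struct block_device *dst_bdev, sector_t dst_sector,
			 sector_t nr_sects, gfp_t gfp)
	{
		/*
		 * Offload only when both ranges live on the same device and
		 * it advertises copy support; otherwise fall back to the
		 * read+write loop.
		 */
		if (src_bdev == dst_bdev && bdev_copy_sectors(dst_bdev))
			return blkdev_copy_offload(dst_bdev, dst_sector,
						   src_sector, nr_sects, gfp);
		return __blkdev_copy2(src_bdev, src_sector, dst_bdev,
				      dst_sector, nr_sects, gfp);
	}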
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/5] block: new sector copy api
2025-05-21 22:31 ` [PATCH 1/5] block: new sector copy api Keith Busch
2025-05-22 10:02 ` Hannes Reinecke
@ 2025-05-22 19:22 ` Bart Van Assche
2025-05-22 20:04 ` Keith Busch
2025-05-23 12:45 ` Christoph Hellwig
2 siblings, 1 reply; 46+ messages in thread
From: Bart Van Assche @ 2025-05-22 19:22 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme; +Cc: Keith Busch
On 5/21/25 3:31 PM, Keith Busch wrote:
> +/**
> + * blkdev_copy - copy source sectors to a destination on the same block device
> + * @dst_sector: start sector of the destination to copy to
> + * @src_sector: start sector of the source to copy from
> + * @nr_sects: number of sectors to copy
> + * @gfp: allocation flags to use
> + */
> +int blkdev_copy(struct block_device *bdev, sector_t dst_sector,
> + sector_t src_sector, sector_t nr_sects, gfp_t gfp)
> +{
> + unsigned int nr_vecs = __blkdev_sectors_to_bio_pages(nr_sects);
> + unsigned int len = (unsigned int)nr_sects << SECTOR_SHIFT;
> + unsigned int size = min(len, nr_vecs * PAGE_SIZE);
> + struct bio *bio;
> + int ret = 0;
> + void *buf;
> +
> + if (nr_sects > UINT_MAX >> SECTOR_SHIFT)
> + return -EINVAL;
> +
> + buf = kvmalloc(size, gfp);
> + if (!buf)
> + return -ENOMEM;
> +
> + nr_vecs = bio_add_max_vecs(buf, size);
> + bio = bio_alloc(bdev, nr_vecs, 0, gfp);
> +
> + if (is_vmalloc_addr(buf))
> + bio_add_vmalloc(bio, buf, size);
> + else
> + bio_add_virt_nofail(bio, buf, size);
> +
> + while (len) {
> + size = min(len, size);
> +
> + bio_reset(bio, bdev, REQ_OP_READ);
> + bio->bi_iter.bi_sector = src_sector;
> + bio->bi_iter.bi_size = size;
> +
> + ret = submit_bio_wait(bio);
> + if (ret)
> + break;
> +
> + bio_reset(bio, bdev, REQ_OP_WRITE);
> + bio->bi_iter.bi_sector = dst_sector;
> + bio->bi_iter.bi_size = size;
> +
> + ret = submit_bio_wait(bio);
> + if (ret)
> + break;
> +
> + src_sector += size >> SECTOR_SHIFT;
> + dst_sector += size >> SECTOR_SHIFT;
> + len -= size;
> + }
> +
> + bio_put(bio);
> + kvfree(buf);
> + return ret;
> +}
Is avoiding code duplication still a goal in the Linux kernel project?
If so, should the above code be consolidated with kcopyd into a single
implementation?
Thanks,
Bart.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/5] block: new sector copy api
2025-05-22 19:22 ` Bart Van Assche
@ 2025-05-22 20:04 ` Keith Busch
0 siblings, 0 replies; 46+ messages in thread
From: Keith Busch @ 2025-05-22 20:04 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Keith Busch, linux-block, linux-nvme
On Thu, May 22, 2025 at 12:22:22PM -0700, Bart Van Assche wrote:
>
> Is avoiding code duplication still a goal in the Linux kernel project?
I feel like that's a generic goal of software development in general.
> If so, should the above code be consolidated with kcopyd into a single
> implementation?
This patch provides a synchronous interface, similar to other services
in blk-lib's APIs, like blkdev_issue_zeroout(). kcopyd, on the other
hand, is an asynchronous interface with its own zero-out implementation
(dm_kcopyd_zero). It's not like unifying these operations for different
use cases was a priority; the implementations don't have much in common,
so its not really duplicated code anyway.
Now, if we are able to settle on the REQ_OP_COPY implementation, then it
wouldn't be a big deal to have kcopyd use it too.
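For example, an asynchronous user like kcopyd could drive the new op
roughly like this (sketch built on REQ_OP_COPY/bio_add_copy_src from
this series; the helper names and completion plumbing are made up):

	static void copy_end_io(struct bio *bio)
	{
		/* hand the result back to the caller's state machine */
		complete(bio->bi_private);
		bio_put(bio);
	}

	static void submit_copy_async(struct block_device *bdev, sector_t dst,
				      sector_t src, sector_t nr_sects,
				      struct completion *done)
	{
		struct bio_vec bv = {
			.bv_sector  = src,
			.bv_sectors = nr_sects,
		};
		struct bio *bio = bio_alloc(bdev, 1, REQ_OP_COPY, GFP_NOIO);

		bio_add_copy_src(bio, &bv);
		bio->bi_iter.bi_sector = dst;
		bio->bi_iter.bi_size = nr_sects << SECTOR_SHIFT;
		bio->bi_private = done;
		bio->bi_end_io = copy_end_io;
		submit_bio(bio);
	}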
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/5] block: new sector copy api
2025-05-21 22:31 ` [PATCH 1/5] block: new sector copy api Keith Busch
2025-05-22 10:02 ` Hannes Reinecke
2025-05-22 19:22 ` Bart Van Assche
@ 2025-05-23 12:45 ` Christoph Hellwig
2025-05-23 17:02 ` Keith Busch
2 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2025-05-23 12:45 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-block, linux-nvme, Keith Busch
On Wed, May 21, 2025 at 03:31:03PM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> Provide a basic block level api to copy a range of a block device's
> sectors to a new destination on the same device. This just reads the
> source data into host memory, then writes it back out to the device at
> the requested destination.
As someone who recently spent a lot of time on optimizing such loops:
having a general API that allocates a buffer for each copy is a bad
idea. You'll want some kind of caller-provided, longer-lived allocation
if you regularly do such copies.
Maybe having common code is good to avoid copies, but I suspect most
real users would want their own.
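Roughly what I have in mind, as a sketch (name and signature made up):

	/*
	 * Caller owns 'buf' and can reuse it across many copies, e.g. a GC
	 * thread allocating it once at setup time.
	 */
	int blkdev_copy_buffered(struct block_device *bdev, sector_t dst_sector,
				 sector_t src_sector, sector_t nr_sects,
				 void *buf, unsigned int buf_bytes, gfp_t gfp);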
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 2/5] block: add support for copy offload
2025-05-21 22:31 ` [PATCH 2/5] block: add support for copy offload Keith Busch
2025-05-22 13:49 ` Hannes Reinecke
@ 2025-05-23 12:46 ` Christoph Hellwig
2025-05-23 13:26 ` Keith Busch
1 sibling, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2025-05-23 12:46 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-block, linux-nvme, Keith Busch
On Wed, May 21, 2025 at 03:31:04PM -0700, Keith Busch wrote:
> struct bio_vec {
> - struct page *bv_page;
> - unsigned int bv_len;
> - unsigned int bv_offset;
> + union {
> + struct {
> + struct page *bv_page;
> + unsigned int bv_len;
> + unsigned int bv_offset;
> + };
> + struct {
> + sector_t bv_sector;
> + sector_t bv_sectors;
> + };
> + };
Urrgg. Please don't overload the bio_vec. We've been working hard to
generalize it and share the data structures with more users in the
block layer. If having a bio for each source range is too much overhead
for your user case (but I'd like to numbers for that), we'll need to
find a way to do that without overloading the actual bio_vec structure.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/5] nvme: add support for copy offload
2025-05-22 3:41 ` Caleb Sander Mateos
2025-05-22 4:29 ` Keith Busch
@ 2025-05-23 12:48 ` Christoph Hellwig
1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2025-05-23 12:48 UTC (permalink / raw)
To: Caleb Sander Mateos; +Cc: Keith Busch, Keith Busch, linux-block, linux-nvme
On Wed, May 21, 2025 at 08:41:40PM -0700, Caleb Sander Mateos wrote:
> On Wed, May 21, 2025 at 8:23 PM Keith Busch <kbusch@kernel.org> wrote:
> >
> > On Wed, May 21, 2025 at 05:51:03PM -0700, Caleb Sander Mateos wrote:
> > > On Wed, May 21, 2025 at 5:47 PM Caleb Sander Mateos
> > > >
> > > > alloc_size should be sizeof(*range) * i? Otherwise this exceeds the
> > > > amount of data used by the Copy command, which not all controllers
> > > > support (see bit LLDTS of SGLS in the Identify Controller data
> > > > structure). We have seen the same behavior with Dataset Management
> > > > (always specifying 4 KB of data), which also passes the maximum size
> > > > of the allocation to bvec_set_virt().
> > >
> > > I see that was added in commit 530436c45ef2e ("nvme: Discard
> > > workaround for non-conformant devices"). I would rather wait for
> > > evidence of non-conformant devices supporting Copy before implementing
> > > the same spec-noncompliant workaround. It could be a quirk if
> > > necessary.
> >
> > Right, that's exactly why I didn't bother allocating tighter to what the
> > command actually needs. The number of devices that would have needed a
> > DSM quirk for the full 4k was untenable, so making the quirk behavior
> > the default was a sane compromise. I suppose Copy is a more enterprisey
> > feature in comparison, so maybe we can count on devices doing dma
> > correctly?
>
> For the record, that change broke Linux hosts sending DSM commands to
> our NVMe controller, which was validating that the SGL length exactly
> matches the number of data bytes implied by the command.
Next time please send a bug report and/or fix ASAP when you see
such changes.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/5] nvme: add support for copy offload
2025-05-22 4:29 ` Keith Busch
2025-05-22 14:16 ` Caleb Sander Mateos
@ 2025-05-23 12:49 ` Christoph Hellwig
1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2025-05-23 12:49 UTC (permalink / raw)
To: Keith Busch; +Cc: Caleb Sander Mateos, Keith Busch, linux-block, linux-nvme
On Wed, May 21, 2025 at 10:29:59PM -0600, Keith Busch wrote:
> On Wed, May 21, 2025 at 08:41:40PM -0700, Caleb Sander Mateos wrote:
> > For the record, that change broke Linux hosts sending DSM commands to
> > our NVMe controller, which was validating that the SGL length exactly
> > matches the number of data bytes implied by the command. I'm sure
> > we're in the minority of NVMe controller vendors in aggressively
> > validating the NVMe command parameters, but it was unfortunate to
> > discover this change in Linux's behavior.
>
> This is a fabrics target you're talking about? I assume so because pci
> would use PRP for a 4k payload, which doesn't encode transfer lengths.
> All the offending controllers were pci, so maybe we could have
> constrained the DSM over-allocation to that transport if we knew this
> was causing problems for fabrics.
Or constrain the workaround to PRPs which never encode the actual
length anyway (which probably is the source of the bugs).
The fact that NVMe went with this stupid PRP scheme instead of SGLs
still makes me angry..
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/5] nvme: add support for copy offload
2025-05-22 13:54 ` Hannes Reinecke
@ 2025-05-23 12:50 ` Christoph Hellwig
2025-05-23 14:22 ` Caleb Sander Mateos
0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2025-05-23 12:50 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: Keith Busch, linux-block, linux-nvme, Keith Busch
On Thu, May 22, 2025 at 03:54:46PM +0200, Hannes Reinecke wrote:
> > + if (req->cmd_flags & REQ_FUA)
> > + control |= NVME_RW_FUA;
> > + if (req->cmd_flags & REQ_FAILFAST_DEV)
> > + control |= NVME_RW_LR;
>
> FAILFAST_DEV? Is that even set anywhere?
That is a good question, but this is consistent with what we do for
other I/O commands.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 0/5] block: another block copy offload
2025-05-22 15:52 ` [PATCH 0/5] block: another block copy offload Bart Van Assche
@ 2025-05-23 12:53 ` Christoph Hellwig
0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2025-05-23 12:53 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Keith Busch, linux-block, linux-nvme, Keith Busch
On Thu, May 22, 2025 at 08:52:19AM -0700, Bart Van Assche wrote:
> Before any copy offloading code is merged in the upstream kernel, a plan
> should be presented for how this partial implementation will evolve into
> a generic solution. As Hannes already pointed out, support for
> copying between block devices is missing. SCSI support is missing.
> Device mapper support is missing. What is the plan for evolving this
> patch series into a generic solution? Is it even possible to evolve this
> patch series into a generic solution?
While I agree with some of this, please keep cross-device copies
out of the picture here. The cross-namespace copies and cross-LUN
copies in SCSI are a horrible misfeature that is almost impossible
to implement in an I/O stack. If someone ever manages to get them
to work they most certainly would have to use a very different
infrastructure outside the block layer.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 5/5] nvmet: implement copy support for bdev backed target
2025-05-21 22:31 ` [PATCH 5/5] nvmet: implement copy support for bdev backed target Keith Busch
2025-05-22 13:59 ` Hannes Reinecke
@ 2025-05-23 13:18 ` Christoph Hellwig
2025-05-23 14:00 ` Keith Busch
1 sibling, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2025-05-23 13:18 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-block, linux-nvme, Keith Busch
On Wed, May 21, 2025 at 03:31:07PM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> The nvme block device target type does not have any particular limits on
> copy commands, so all the settings are the protocol's max.
I need this additional bit so that the host code actually sees the
copy support:
diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index c7317299078d..355cb67a8b74 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -734,7 +734,8 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
NVME_CTRL_ONCS_WRITE_ZEROES |
- NVME_CTRL_ONCS_RESERVATIONS);
+ NVME_CTRL_ONCS_RESERVATIONS |
+ NVME_CTRL_ONCS_NVMCPYS);
/* XXX: don't report vwc if the underlying device is write through */
id->vwc = NVME_CTRL_VWC_PRESENT;
^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 2/5] block: add support for copy offload
2025-05-23 12:46 ` Christoph Hellwig
@ 2025-05-23 13:26 ` Keith Busch
2025-05-23 13:37 ` Christoph Hellwig
0 siblings, 1 reply; 46+ messages in thread
From: Keith Busch @ 2025-05-23 13:26 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Keith Busch, linux-block, linux-nvme
On Fri, May 23, 2025 at 05:46:57AM -0700, Christoph Hellwig wrote:
> On Wed, May 21, 2025 at 03:31:04PM -0700, Keith Busch wrote:
> > struct bio_vec {
> > - struct page *bv_page;
> > - unsigned int bv_len;
> > - unsigned int bv_offset;
> > + union {
> > + struct {
> > + struct page *bv_page;
> > + unsigned int bv_len;
> > + unsigned int bv_offset;
> > + };
> > + struct {
> > + sector_t bv_sector;
> > + sector_t bv_sectors;
> > + };
> > + };
>
> Urrgg. Please don't overload the bio_vec. We've been working hard to
> generalize it and share the data structures with more users in the
> block layer.
Darn, this part of the proposal is really the core concept of this patch
set that everything builds around. It's what allows submitting
arbitrarily large sized copy requests and letting the block layer
efficiently split a bio to the queue limits later.
> If having a bio for each source range is too much overhead
> for your user case (but I'd like to numbers for that), we'll need to
> find a way to do that without overloading the actual bio_vec structure.
Getting good numbers might be a problem in the near term. The current
generation of devices I have access to that can do copy offload don't
have asic support for it, so it is instrumented entirely in firmware.
The performance is currently underwhelming, but I expect next generation
to be much better.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 2/5] block: add support for copy offload
2025-05-23 13:26 ` Keith Busch
@ 2025-05-23 13:37 ` Christoph Hellwig
2025-05-23 13:48 ` Keith Busch
2025-05-27 21:33 ` Keith Busch
0 siblings, 2 replies; 46+ messages in thread
From: Christoph Hellwig @ 2025-05-23 13:37 UTC (permalink / raw)
To: Keith Busch; +Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme
On Fri, May 23, 2025 at 07:26:45AM -0600, Keith Busch wrote:
> > Urrgg. Please don't overload the bio_vec. We've been working hard to
> > generalize it and share the data structures with more users in the
> > block layer.
>
> Darn, this part of the proposal is really the core concept of this patch
> set that everything builds around. It's what allows submitting
> arbitrarily large sized copy requests and letting the block layer
> efficiently split a bio to the queue limits later.
Well, you can still do that without overloading the bio_vec by just
making bi_io_vec in the bio itself a union.
>
> > If having a bio for each source range is too much overhead
> > for your use case (but I'd like to see numbers for that), we'll need to
> > find a way to do that without overloading the actual bio_vec structure.
>
> Getting good numbers might be a problem in the near term. The current
> generation of devices I have access to that can do copy offload don't
> have asic support for it, so it is instrumented entirely in firmware.
> The performance is currently underwhelming, but I expect next generation
> to be much better.
I meant numbers for the all-in-one-bio vs multiple-bios approach.
For hardware I think the main benefit is to not use host DRAM
bandwidth.
Anyway, below is a patch to wire it up to the XFS garbage collection
daemon. It survives the xfstests test cases for GC when run on a
conventional device, but otherwise I've not done much testing with it.
It shows two things, though:
- right now there is no block layer merging, and we always see single
range bios. That is really annoying, and fixing the fs code to
submit multiple ranges in one go would be really annoying, as
extent-based completions hang off the bio completions. So I'd
really like block layer merges similar to what the old
multi-bio code or the discard code do.
- copy also needs to be handled by the zoned write plugs
- bio_add_copy_src not updating bi_size is unexpected and annoying :)
diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
index 8c541ca71872..e7dfdbbcf126 100644
--- a/fs/xfs/xfs_zone_gc.c
+++ b/fs/xfs/xfs_zone_gc.c
@@ -158,6 +158,8 @@ struct xfs_zone_gc_data {
* Iterator for the victim zone.
*/
struct xfs_zone_gc_iter iter;
+
+ bool can_copy;
};
/*
@@ -212,12 +214,19 @@ xfs_zone_gc_data_alloc(
if (bioset_init(&data->bio_set, 16, offsetof(struct xfs_gc_bio, bio),
BIOSET_NEED_BVECS))
goto out_free_recs;
- for (i = 0; i < XFS_ZONE_GC_NR_SCRATCH; i++) {
- data->scratch[i].folio =
- folio_alloc(GFP_KERNEL, get_order(XFS_GC_CHUNK_SIZE));
- if (!data->scratch[i].folio)
- goto out_free_scratch;
+
+ if (bdev_copy_sectors(mp->m_rtdev_targp->bt_bdev)) {
+ xfs_info(mp, "using copy offload");
+ data->can_copy = true;
+ } else {
+ for (i = 0; i < XFS_ZONE_GC_NR_SCRATCH; i++) {
+ data->scratch[i].folio = folio_alloc(GFP_KERNEL,
+ get_order(XFS_GC_CHUNK_SIZE));
+ if (!data->scratch[i].folio)
+ goto out_free_scratch;
+ }
}
+
INIT_LIST_HEAD(&data->reading);
INIT_LIST_HEAD(&data->writing);
INIT_LIST_HEAD(&data->resetting);
@@ -241,8 +250,10 @@ xfs_zone_gc_data_free(
{
int i;
- for (i = 0; i < XFS_ZONE_GC_NR_SCRATCH; i++)
- folio_put(data->scratch[i].folio);
+ if (!data->can_copy) {
+ for (i = 0; i < XFS_ZONE_GC_NR_SCRATCH; i++)
+ folio_put(data->scratch[i].folio);
+ }
bioset_exit(&data->bio_set);
kfree(data->iter.recs);
kfree(data);
@@ -589,6 +600,8 @@ static unsigned int
xfs_zone_gc_scratch_available(
struct xfs_zone_gc_data *data)
{
+ if (data->can_copy)
+ return UINT_MAX;
return XFS_GC_CHUNK_SIZE - data->scratch[data->scratch_idx].offset;
}
@@ -690,7 +703,10 @@ xfs_zone_gc_start_chunk(
return false;
}
- bio = bio_alloc_bioset(bdev, 1, REQ_OP_READ, GFP_NOFS, &data->bio_set);
+ bio = bio_alloc_bioset(bdev, 1,
+ data->can_copy ? REQ_OP_COPY : REQ_OP_READ,
+ GFP_NOFS, &data->bio_set);
+ bio->bi_end_io = xfs_zone_gc_end_io;
chunk = container_of(bio, struct xfs_gc_bio, bio);
chunk->ip = ip;
@@ -700,21 +716,38 @@ xfs_zone_gc_start_chunk(
xfs_rgbno_to_rtb(iter->victim_rtg, irec.rm_startblock);
chunk->new_daddr = daddr;
chunk->is_seq = is_seq;
- chunk->scratch = &data->scratch[data->scratch_idx];
chunk->data = data;
chunk->oz = oz;
- bio->bi_iter.bi_sector = xfs_rtb_to_daddr(mp, chunk->old_startblock);
- bio->bi_end_io = xfs_zone_gc_end_io;
- bio_add_folio_nofail(bio, chunk->scratch->folio, chunk->len,
- chunk->scratch->offset);
- chunk->scratch->offset += chunk->len;
- if (chunk->scratch->offset == XFS_GC_CHUNK_SIZE) {
- data->scratch_idx =
- (data->scratch_idx + 1) % XFS_ZONE_GC_NR_SCRATCH;
+ if (data->can_copy) {
+ struct bio_vec src = {
+ .bv_sector =
+ xfs_rtb_to_daddr(mp, chunk->old_startblock),
+ .bv_sectors = BTOBB(chunk->len),
+ };
+
+ bio_add_copy_src(bio, &src);
+ bio->bi_iter.bi_sector = daddr;
+ bio->bi_iter.bi_size = chunk->len;
+
+ WRITE_ONCE(chunk->state, XFS_GC_BIO_NEW);
+ list_add_tail(&chunk->entry, &data->writing);
+ } else {
+ chunk->scratch = &data->scratch[data->scratch_idx];
+
+ bio->bi_iter.bi_sector =
+ xfs_rtb_to_daddr(mp, chunk->old_startblock);
+ bio_add_folio_nofail(bio, chunk->scratch->folio, chunk->len,
+ chunk->scratch->offset);
+ chunk->scratch->offset += chunk->len;
+ if (chunk->scratch->offset == XFS_GC_CHUNK_SIZE) {
+ data->scratch_idx =
+ (data->scratch_idx + 1) %
+ XFS_ZONE_GC_NR_SCRATCH;
+ }
+ WRITE_ONCE(chunk->state, XFS_GC_BIO_NEW);
+ list_add_tail(&chunk->entry, &data->reading);
}
- WRITE_ONCE(chunk->state, XFS_GC_BIO_NEW);
- list_add_tail(&chunk->entry, &data->reading);
xfs_zone_gc_iter_advance(iter, irec.rm_blockcount);
submit_bio(bio);
@@ -839,10 +872,12 @@ xfs_zone_gc_finish_chunk(
return;
}
- chunk->scratch->freed += chunk->len;
- if (chunk->scratch->freed == chunk->scratch->offset) {
- chunk->scratch->offset = 0;
- chunk->scratch->freed = 0;
+ if (!chunk->data->can_copy) {
+ chunk->scratch->freed += chunk->len;
+ if (chunk->scratch->freed == chunk->scratch->offset) {
+ chunk->scratch->offset = 0;
+ chunk->scratch->freed = 0;
+ }
}
/*
^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 2/5] block: add support for copy offload
2025-05-23 13:37 ` Christoph Hellwig
@ 2025-05-23 13:48 ` Keith Busch
2025-05-26 5:22 ` Christoph Hellwig
2025-05-27 21:33 ` Keith Busch
1 sibling, 1 reply; 46+ messages in thread
From: Keith Busch @ 2025-05-23 13:48 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Keith Busch, linux-block, linux-nvme
On Fri, May 23, 2025 at 06:37:33AM -0700, Christoph Hellwig wrote:
> On Fri, May 23, 2025 at 07:26:45AM -0600, Keith Busch wrote:
> > Darn, this part of the proposal is really the core concept of this patch
> > set that everything builds around. It's what allows submitting
> > arbitrarily large sized copy requests and letting the block layer
> > efficiently split a bio to the queue limits later.
>
> Well, you can still do that without overloading the bio_vec by just
> making bi_io_vec in the bio itself a union.
I like that idea.
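Something like this, maybe (sketch only; the bio_copy_src name is made
up, and bi_io_vec stays a plain struct bio_vec pointer for data bios):

	struct bio_copy_src {
		sector_t	sector;
		sector_t	nr_sectors;
	};

	/* in struct bio, where bi_io_vec currently lives: */
	union {
		struct bio_vec		*bi_io_vec;	/* data payloads */
		struct bio_copy_src	*bi_copy_vec;	/* REQ_OP_COPY sources */
	};

Then the copy splitting and merging code only ever touches bi_copy_vec,
and the regular bvec iterators are left alone.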
> - bio_add_copy_src not updating bi_size is unexpected and annoying :)
Ha, I currently have the submitter responsible for bi_size. Your
suggestion will make this easier to use.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 5/5] nvmet: implement copy support for bdev backed target
2025-05-23 13:18 ` Christoph Hellwig
@ 2025-05-23 14:00 ` Keith Busch
2025-05-23 14:02 ` Christoph Hellwig
0 siblings, 1 reply; 46+ messages in thread
From: Keith Busch @ 2025-05-23 14:00 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Keith Busch, linux-block, linux-nvme
On Fri, May 23, 2025 at 06:18:31AM -0700, Christoph Hellwig wrote:
> I need this additional bit so that the host code actually sees the
> copy support:
Thanks! I think it just needs to be constrained to bdev backed targets
since I don't have file support for copy offload here. And less
important, but for spec compliance, I need to update the Command Effects
Log too.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 5/5] nvmet: implement copy support for bdev backed target
2025-05-23 14:00 ` Keith Busch
@ 2025-05-23 14:02 ` Christoph Hellwig
0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2025-05-23 14:02 UTC (permalink / raw)
To: Keith Busch; +Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme
On Fri, May 23, 2025 at 08:00:56AM -0600, Keith Busch wrote:
> On Fri, May 23, 2025 at 06:18:31AM -0700, Christoph Hellwig wrote:
> > I need this additional bit so that the host code actually sees the
> > copy support:
>
> Thanks! I think it just needs to be constrained to bdev backed targets
> since I don't have file support for copy offload here. And less
> important, but for spec compliance, I need to update the Command Effects
> Log too.
A file backend should be pretty trivial using vfs_copy_file_range.
On reflink-enabled file systems it might end up being too fast to
be useful, though :)
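Something like this, roughly (sketch; the handler name and error
handling are made up, only the vfs_copy_file_range() call is the real
API):

	static u16 nvmet_file_copy_range(struct nvmet_ns *ns, loff_t src_off,
					 loff_t dst_off, size_t len)
	{
		/* ns->file is the backing file of a file-backed namespace */
		ssize_t ret = vfs_copy_file_range(ns->file, src_off,
						  ns->file, dst_off, len, 0);

		if (ret < 0 || ret != len)
			return NVME_SC_INTERNAL;
		return NVME_SC_SUCCESS;
	}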
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/5] nvme: add support for copy offload
2025-05-23 12:50 ` Christoph Hellwig
@ 2025-05-23 14:22 ` Caleb Sander Mateos
0 siblings, 0 replies; 46+ messages in thread
From: Caleb Sander Mateos @ 2025-05-23 14:22 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Hannes Reinecke, Keith Busch, linux-block, linux-nvme,
Keith Busch
On Fri, May 23, 2025 at 5:50 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, May 22, 2025 at 03:54:46PM +0200, Hannes Reinecke wrote:
> > > + if (req->cmd_flags & REQ_FUA)
> > > + control |= NVME_RW_FUA;
> > > + if (req->cmd_flags & REQ_FAILFAST_DEV)
> > > + control |= NVME_RW_LR;
> >
> > FAILFAST_DEV? Is that even set anywhere?
>
> That is a good question, but this is consistent with what we do for
> other I/O commands.
Looks like it might be set by blk_mq_bio_to_request() and
blk_update_mixed_merge() for read-ahead bios:
#define REQ_FAILFAST_MASK \
	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)

	if (bio->bi_opf & REQ_RAHEAD)
		rq->cmd_flags |= REQ_FAILFAST_MASK;
Best,
Caleb
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/5] block: new sector copy api
2025-05-23 12:45 ` Christoph Hellwig
@ 2025-05-23 17:02 ` Keith Busch
2025-05-26 5:18 ` Christoph Hellwig
0 siblings, 1 reply; 46+ messages in thread
From: Keith Busch @ 2025-05-23 17:02 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Keith Busch, linux-block, linux-nvme
On Fri, May 23, 2025 at 05:45:10AM -0700, Christoph Hellwig wrote:
> On Wed, May 21, 2025 at 03:31:03PM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> >
> > Provide a basic block level api to copy a range of a block device's
> > sectors to a new destination on the same device. This just reads the
> > source data into host memory, then writes it back out to the device at
> > the requested destination.
>
> As someone who recently spent a lot of time on optimizing such loops:
> having a general API that allocates a buffer for each copy is a bad
> idea. You'll want some kind of caller-provided, longer-lived allocation
> if you regularly do such copies.
>
> Maybe having common code is good to avoid copies, but I suspect most
> real users would want their own.
But the end goal is that no host memory buffer would be needed at all.
A buffer is allocated only in the fallback path. If we have the caller
provide their buffer for that fallback case, that kind of defeats the
benefit of reduced memory utilization. So it sounds like you might be
suggesting that I don't even bother providing the host instrumented copy
fallback and provide an API that performs the copy only if it can be
offloaded. Yes?
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/5] block: new sector copy api
2025-05-23 17:02 ` Keith Busch
@ 2025-05-26 5:18 ` Christoph Hellwig
2025-05-27 17:45 ` Keith Busch
0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2025-05-26 5:18 UTC (permalink / raw)
To: Keith Busch; +Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme
On Fri, May 23, 2025 at 11:02:16AM -0600, Keith Busch wrote:
> > Maybe having common code is good to avoid copies, but I suspect most
> > real users would want their own.
>
> But the end goal is that no host memory buffer would be needed at all.
> A buffer is allocated only in the fallback path. If we have the caller
> provide their buffer for that fallback case, that kind of defeats the
> benefit of reduced memory utilization. So it sounds like you might be
> suggesting that I don't even bother providing the host instrumented copy
> fallback and provide an API that performs the copy only if it can be
> offloaded. Yes?
Well, I'd expect most use cases of things like GC to use the "fallback"
for now - because copy isn't actually that often supported, and when it
is performance might not be good enough. So we'll need to optimize
for both cases. Maybe a fallback code is useful, but I'd leave the
buffer management for it to the caller so that it can reuse them. Or
maybe don't bother given how trivial the code is anyway. The use case
for common code would be strong if it refactored existing code and
showed that existing callers actually have enough common logic that
it's worth the effort.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 2/5] block: add support for copy offload
2025-05-23 13:48 ` Keith Busch
@ 2025-05-26 5:22 ` Christoph Hellwig
0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2025-05-26 5:22 UTC (permalink / raw)
To: Keith Busch; +Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme
On Fri, May 23, 2025 at 07:48:54AM -0600, Keith Busch wrote:
> I like that idea.
>
> > - bio_add_copy_src not updating bi_size is unexpected and annoying :)
>
> Ha, I currently have the submitter responsible for bi_size. Your
> suggestion will make this easier to use.
Another thing about bio_add_copy_src: currently it gets passed a
bio_vec, which doesn't really help its own implementation much, and
actually makes the callers more cumbersome. I'd pass the sector and
number of sectors (or maybe a byte length, to be closer to the other
block interfaces?) as separate scalar arguments instead. That would
simplify the code both for the current callers of blkdev_copy_range and
for an iterative interface where callers build up the ranges without
having to allocate a bvec array first.
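I.e. something like this (sketch):

	int bio_add_copy_src(struct bio *bio, sector_t src_sector,
			     sector_t nr_sectors);

so callers can loop over their ranges directly instead of building up a
bio_vec array first.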
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/5] block: new sector copy api
2025-05-26 5:18 ` Christoph Hellwig
@ 2025-05-27 17:45 ` Keith Busch
2025-05-28 7:46 ` Christoph Hellwig
0 siblings, 1 reply; 46+ messages in thread
From: Keith Busch @ 2025-05-27 17:45 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Keith Busch, linux-block, linux-nvme
On Sun, May 25, 2025 at 10:18:27PM -0700, Christoph Hellwig wrote:
> Well, I'd expect most use cases of things like GC to use the "fallback"
> for now - because copy isn't actually that often supported, and when it
> is performance might not be good enough. So we'll need to optimize
> for both cases. Maybe a fallback code is useful, but I'd leave the
> buffer management for it to the caller so that it can reuse them. Or
> maybe don't bother given how trivial the code is anyway. The use case
> for common code would be strong if it refactored existing code and
> showed that existing callers actually have enough common logic that
> it's worth the effort.
Just fyi, the initial user I was planning to target with the block
layer's copy fallback isn't in kernel yet. Just an RFC at this moment on
btrfs:
https://lore.kernel.org/linux-btrfs/20250515163641.3449017-10-maharmstone@fb.com/
The blk-lib function could easily replace that patch's "do_copy()"
without too much refactoring on the btrfs side.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 2/5] block: add support for copy offload
2025-05-23 13:37 ` Christoph Hellwig
2025-05-23 13:48 ` Keith Busch
@ 2025-05-27 21:33 ` Keith Busch
2025-05-28 7:47 ` Christoph Hellwig
1 sibling, 1 reply; 46+ messages in thread
From: Keith Busch @ 2025-05-27 21:33 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Keith Busch, linux-block, linux-nvme
On Fri, May 23, 2025 at 06:37:33AM -0700, Christoph Hellwig wrote:
> Anyway, below is a patch to wire it up to the XFS garbage collection
> daemon. It survives the xfstests test cases for GC when run on a
> conventional device, but otherwise I've not done much testing with it.
>
> It shows two things, though:
>
> - right now there is no block layer merging, and we always see single
> range bios. That is really annoying, and fixing the fs code to
> submit multiple ranges in one go would be really annoying, as
> extent-based completions hang off the bio completions. So I'd
> really like to block layer merges similar to what the old
> multi-bio code or the discard code do.
This series should handle bio merging for REQ_OP_COPY. If it's
inconvenient to put multiple source sectors in a single bio, bios that
fit within the queue limits will merge into one request when submitted
under a blk_plug like you're doing.
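A minimal sketch of that plugging pattern (assuming the REQ_OP_COPY bios
have already been built; the plug is the only point here):

/*
 * Sketch only: submit several already-built copy bios under one plug so
 * that bios targeting adjacent destination sectors, and fitting within
 * the queue limits, can merge into a single request.
 */
static void submit_copy_bios(struct bio **bios, int nr)
{
	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);
	for (i = 0; i < nr; i++)
		submit_bio(bios[i]);
	blk_finish_plug(&plug);
}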
> - copy also needs to be handled by the zoned write plugs
This is about emulating "copy append" behavior for sequential zones? In
case it gets split, we can't safely dispatch multiple copy operations
targeting the same destination zone, right?
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/5] block: new sector copy api
2025-05-27 17:45 ` Keith Busch
@ 2025-05-28 7:46 ` Christoph Hellwig
2025-05-28 22:41 ` Keith Busch
0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2025-05-28 7:46 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
Mark Harmstone
On Tue, May 27, 2025 at 11:45:26AM -0600, Keith Busch wrote:
> Just fyi, the initial user I was planning to target with the block
> layer's copy fallback isn't in kernel yet. Just an RFC at this moment on
> btrfs:
>
> https://lore.kernel.org/linux-btrfs/20250515163641.3449017-10-maharmstone@fb.com/
>
> The blk-lib function could easily replace that patch's "do_copy()"
> without to much refactoring on the btrfs side.
Well, that code would be much better off using a long-lived buffer,
because the frequent allocations make it worse. Also, from talking to a
btrfs guy sitting next to me, at least the classic relocation can
actually be in the normal write path, so you probably also don't want it
to fail because a large memory allocation fails.
What we did for the GC code in XFS, which exists to do basically the
same thing, is a large allocation at mount time that we then just keep
reusing.
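A generic sketch of that allocate-once-and-reuse pattern (not the actual
XFS code, just the shape of it):

/*
 * Sketch only: set up a long-lived bounce buffer once at mount time and
 * reuse it for every GC copy instead of allocating per operation.
 */
struct gc_copy_ctx {
	void	*buf;		/* long-lived bounce buffer */
	size_t	buf_size;
};

static int gc_copy_ctx_init(struct gc_copy_ctx *ctx, size_t size)
{
	ctx->buf = kvmalloc(size, GFP_KERNEL);
	if (!ctx->buf)
		return -ENOMEM;
	ctx->buf_size = size;
	return 0;
}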
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 2/5] block: add support for copy offload
2025-05-27 21:33 ` Keith Busch
@ 2025-05-28 7:47 ` Christoph Hellwig
0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2025-05-28 7:47 UTC (permalink / raw)
To: Keith Busch; +Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme
On Tue, May 27, 2025 at 03:33:00PM -0600, Keith Busch wrote:
> > - copy also needs to be handled by the zoned write plugs
>
> This is about emulating "copy append" behavior for sequential zones? In
> case it gets split, we can't safely dispatch multiple copy operations
> targeting the same destination zone, right?
No, it's just about ensuring there are no multiple copies outstanding
that the HBA could reorder, i.e. just serialization. The only thing
that should be needed is adding REQ_OP_COPY to the switch statement
in blk_zone_plug_bio and making it go into the
blk_zone_wplug_handle_write case.
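Roughly the change being suggested (surrounding cases abbreviated; the
exact context in blk_zone_plug_bio may differ):

	switch (bio_op(bio)) {
	...
	case REQ_OP_WRITE:
	case REQ_OP_WRITE_ZEROES:
	case REQ_OP_COPY:	/* new: serialize copies through the zone write plug */
		return blk_zone_wplug_handle_write(bio, nr_segs);
	...
	}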
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/5] block: new sector copy api
2025-05-28 7:46 ` Christoph Hellwig
@ 2025-05-28 22:41 ` Keith Busch
2025-06-02 4:58 ` Christoph Hellwig
0 siblings, 1 reply; 46+ messages in thread
From: Keith Busch @ 2025-05-28 22:41 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Keith Busch, linux-block, linux-nvme, Mark Harmstone
On Wed, May 28, 2025 at 12:46:03AM -0700, Christoph Hellwig wrote:
> On Tue, May 27, 2025 at 11:45:26AM -0600, Keith Busch wrote:
> > Just fyi, the initial user I was planning to target with the block
> > layer's copy fallback isn't in kernel yet. Just an RFC at this moment on
> > btrfs:
> >
> > https://lore.kernel.org/linux-btrfs/20250515163641.3449017-10-maharmstone@fb.com/
> >
> > The blk-lib function could easily replace that patch's "do_copy()"
> > without too much refactoring on the btrfs side.
>
> Well, that code would be much better off using a long living buffer,
> because the frequent allocations are worse.
No argument against that. I'm just adding context for where this blk-lib
patch was targeted. I'm happy to help on both sides to make it more
usable, though refactoring other block copy implementations (splice,
kcopyd, xfs gc) to a common api looks like a much longer term project.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/5] block: new sector copy api
2025-05-28 22:41 ` Keith Busch
@ 2025-06-02 4:58 ` Christoph Hellwig
0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2025-06-02 4:58 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
Mark Harmstone
On Wed, May 28, 2025 at 04:41:58PM -0600, Keith Busch wrote:
> On Wed, May 28, 2025 at 12:46:03AM -0700, Christoph Hellwig wrote:
> > On Tue, May 27, 2025 at 11:45:26AM -0600, Keith Busch wrote:
> > > Just fyi, the initial user I was planning to target with the block
> > > layer's copy fallback isn't in kernel yet. Just an RFC at this moment on
> > > btrfs:
> > >
> > > https://lore.kernel.org/linux-btrfs/20250515163641.3449017-10-maharmstone@fb.com/
> > >
> > > The blk-lib function could easily replace that patch's "do_copy()"
> > > without too much refactoring on the btrfs side.
> >
> > Well, that code would be much better off using a long living buffer,
> > because the frequent allocations are worse.
>
> No argument against that. I'm just adding context for where this blk lib
> patch was targeted. I'm happy to help on both sides to make it more
> usable, though refactoring other block copy implementations (splice,
> kcopyd, xfs gc) to a common api looks like a much longer term project.
FYI, I'm happy to take care of XFS GC, but I don't think we need any
library code for that. And I doubt any other async implementation would
share much code besides the low-level helpers.
So the question is if there are many potential users of sync helpers.
Maybe the answer is no and we should just not bother with that code for
now and only add it when/if it actually starts sharing code?
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/5] nvme: add support for copy offload
2025-05-21 22:31 ` [PATCH 3/5] nvme: " Keith Busch
2025-05-22 0:47 ` Caleb Sander Mateos
2025-05-22 13:54 ` Hannes Reinecke
@ 2025-06-09 9:29 ` Niklas Cassel
2 siblings, 0 replies; 46+ messages in thread
From: Niklas Cassel @ 2025-06-09 9:29 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-block, linux-nvme, Keith Busch
Hello Keith,
On Wed, May 21, 2025 at 03:31:05PM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> Register the nvme namespace copy capabilities with the request_queue
> limits and implement support for the REQ_OP_COPY operation.
Since you never initialize Descriptor Format (DESFMT), you will use
Descriptor Format 0 (No SNSID, 16b Guard PI).
That is understandable, since your block layer API intentionally does not
support cross-device copies. But I would have expected you to somehow
mention the descriptor format used somewhere in the commit message and
somewhere in the code (as a comment).
I haven't followed all the work that has happened wrt. PI in the block
layer recently, so I don't know, but I can see that nvme_set_app_tag()
can set an app tag in the request itself, and that nvme_set_ref_tag()
supports both 16b Guard PI and 64b Guard PI.
Do we ever want to set the ELBT/ELBAT/ELBATM in the nvme_copy_range struct?
And if the namespace is using 64b Guard PI, do we want to use Descriptor
Format 1 (No SNSID, 32b/64b Guard PI) rather than Descriptor Format 0 ?
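For illustration, the kind of format selection being asked about might
look something like this (sketch only, not code from the series; the
NVME_COPY_DESC_FMT_* names are made up, while the guard type check uses
the existing namespace fields):

/*
 * Sketch only: pick the copy descriptor format from the namespace guard
 * type.  NVME_COPY_DESC_FMT_0/1 are hypothetical names for the spec's
 * descriptor format 0h/1h values.
 */
static u8 nvme_copy_desc_format(struct nvme_ns *ns)
{
	if (ns->head->guard_type == NVME_NVM_NS_64B_GUARD)
		return NVME_COPY_DESC_FMT_1;	/* 32b/64b Guard PI */
	return NVME_COPY_DESC_FMT_0;		/* 16b Guard PI */
}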
Kind regards,
Niklas
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 0/5] block: another block copy offload
2025-05-21 22:31 [PATCH 0/5] block: another block copy offload Keith Busch
` (5 preceding siblings ...)
2025-05-22 15:52 ` [PATCH 0/5] block: another block copy offload Bart Van Assche
@ 2025-07-03 14:47 ` Niklas Cassel
6 siblings, 0 replies; 46+ messages in thread
From: Niklas Cassel @ 2025-07-03 14:47 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-block, linux-nvme, Keith Busch
Hello Keith,
On Wed, May 21, 2025 at 03:31:02PM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> I was never happy with previous block copy offload attempts, so I had to
> take a stab at it. And I was recently asked to take a look at this, so
> here goes.
>
> Some key implementation differences from previous approaches:
>
> 1. Only one bio is needed to describe a copy request, so no plugging
> or dispatch tricks required. Like read and write requests, these
> can be arbitrarily large and will be split as needed based on the
> request_queue's limits. The bio's are mergeable with other copy
> commands on adjacent destination sectors.
>
> 2. You can describe as many source sectors as you want in a vector in
> a single bio. This aligns with the nvme protocol's Copy implementation,
> which can be used to efficiently defragment scattered blocks into a
> contiguous destination with a single command.
>
> Oh, and the nvme-target support was included with this patchset too, so
> there's a purely in-kernel way to test out the code paths if you don't
> have otherwise capable hardware. I also used qemu since that nvme device
> supports copy offload too.
In order to test this series, I wrote a simple user space program that
does:
1) open() on the raw block device, without O_DIRECT.
2) pwrite() to a few sectors with some non-zero data.
3) pread() from those sectors to make sure that the data was written; it was.
Since I haven't done any fsync(), both the read and the write go through
the page cache.
4) ioctl(.., BLKCPY_VEC, ..)
5) pread() on the destination sector.
In step 5, I read zero data.
I understand that BLKCPY_VEC is a copy offload command.
However, if I simply add an fsync() after the pwrite()s, then I will read
non-zero data in step 5, as expected.
My question: is it expected that ioctl(.., BLKCPY_VEC, ..) will bypass/ignore
the page cache?
Because, as far as I understand, the most common thing for BLK* operations
is to take the page cache into account, e.g. while BLKRESETZONE sends
down a command to the device, it also invalidates the corresponding pages
from the page cache.
With that logic, should ioctl(.., BLKCPY_VEC, ..) make sure that the src
pages are flushed down to the device before sending the actual copy
command to the device?
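For comparison, the discard ioctl path writes out and invalidates the
affected page cache range before issuing the command; something similar
for a copy ioctl might look like the sketch below (the src/dst range and
mode variables are placeholders, and the locking around the invalidation
is elided):

/*
 * Sketch only -- not part of the series.  Flush dirty source pages so
 * the device sees them, and drop stale destination pages, before
 * issuing the copy to the device.
 */
ret = sync_blockdev_range(bdev, src_start, src_end);
if (ret)
	return ret;
ret = truncate_bdev_range(bdev, mode, dst_start, dst_end);
if (ret)
	return ret;
/* ...then build and submit the REQ_OP_COPY bio. */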
I think that it is fine that the command ignores the data in the page cache,
since I guess in most cases, you will have a file system that is responsible
for the sectors being in sync, but perhaps we should document BLKCPY_VEC and
BLKCPY to more clearly highlight that they will bypass the page cache?
Which also makes me think, for storage devices that do not have a copy
command, blkdev_copy_range() will fall back to __blkdev_copy().
So in that case, I assume that the copy ioctl actually will take the page
cache into account?
Kind regards,
Niklas
^ permalink raw reply [flat|nested] 46+ messages in thread
end of thread, other threads:[~2025-07-03 14:47 UTC | newest]
Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-21 22:31 [PATCH 0/5] block: another block copy offload Keith Busch
2025-05-21 22:31 ` [PATCH 1/5] block: new sector copy api Keith Busch
2025-05-22 10:02 ` Hannes Reinecke
2025-05-22 16:43 ` Keith Busch
2025-05-22 19:22 ` Bart Van Assche
2025-05-22 20:04 ` Keith Busch
2025-05-23 12:45 ` Christoph Hellwig
2025-05-23 17:02 ` Keith Busch
2025-05-26 5:18 ` Christoph Hellwig
2025-05-27 17:45 ` Keith Busch
2025-05-28 7:46 ` Christoph Hellwig
2025-05-28 22:41 ` Keith Busch
2025-06-02 4:58 ` Christoph Hellwig
2025-05-21 22:31 ` [PATCH 2/5] block: add support for copy offload Keith Busch
2025-05-22 13:49 ` Hannes Reinecke
2025-05-23 12:46 ` Christoph Hellwig
2025-05-23 13:26 ` Keith Busch
2025-05-23 13:37 ` Christoph Hellwig
2025-05-23 13:48 ` Keith Busch
2025-05-26 5:22 ` Christoph Hellwig
2025-05-27 21:33 ` Keith Busch
2025-05-28 7:47 ` Christoph Hellwig
2025-05-21 22:31 ` [PATCH 3/5] nvme: " Keith Busch
2025-05-22 0:47 ` Caleb Sander Mateos
2025-05-22 0:51 ` Caleb Sander Mateos
2025-05-22 3:23 ` Keith Busch
2025-05-22 3:41 ` Caleb Sander Mateos
2025-05-22 4:29 ` Keith Busch
2025-05-22 14:16 ` Caleb Sander Mateos
2025-05-23 12:49 ` Christoph Hellwig
2025-05-23 12:48 ` Christoph Hellwig
2025-05-22 13:54 ` Hannes Reinecke
2025-05-23 12:50 ` Christoph Hellwig
2025-05-23 14:22 ` Caleb Sander Mateos
2025-06-09 9:29 ` Niklas Cassel
2025-05-21 22:31 ` [PATCH 4/5] block: add support for vectored copies Keith Busch
2025-05-22 13:58 ` Hannes Reinecke
2025-05-22 16:36 ` Keith Busch
2025-05-21 22:31 ` [PATCH 5/5] nvmet: implement copy support for bdev backed target Keith Busch
2025-05-22 13:59 ` Hannes Reinecke
2025-05-23 13:18 ` Christoph Hellwig
2025-05-23 14:00 ` Keith Busch
2025-05-23 14:02 ` Christoph Hellwig
2025-05-22 15:52 ` [PATCH 0/5] block: another block copy offload Bart Van Assche
2025-05-23 12:53 ` Christoph Hellwig
2025-07-03 14:47 ` Niklas Cassel