[PATCH v20 00/12] Implement copy offload support

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v20 00/12] Implement copy offload support
       [not found] <CGME20240520102747epcas5p33497a911ca70c991e5da8e22c5d1336b@epcas5p3.samsung.com>
@ 2024-05-20 10:20 ` Nitesh Shetty
       [not found]   ` <CGME20240520102830epcas5p27274901f3d0c2738c515709890b1dec4@epcas5p2.samsung.com>
                     ` (13 more replies)
  0 siblings, 14 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-20 10:20 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Nitesh Shetty, linux-block,
	linux-kernel, linux-doc, dm-devel, linux-nvme, linux-fsdevel

The patch series covers the points discussed in the past and most recently
in LSFMM'24[0].
We have covered the initial agreed requirements in this patch set and
further additional features suggested by the community.

This is next iteration of our previous patch set v19[1].
Copy offload is performed using two bio's -
1. Take a plug
2. The first bio containing destination info is prepared and sent,
        a request is formed.
3. This is followed by preparing and sending the second bio containing the
        source info.
4. This bio is merged with the request containing the destination info.
5. The plug is released, and the request containing source and destination
        bio's is sent to the driver.
This design helps to avoid putting payload (token) in the request,
as sending payload that is not data to the device is considered a bad
approach.

So copy offload works only for request based storage drivers.
We can make use of copy emulation in case copy offload capability is
absent.

Overall series supports:
========================
	1. Driver
		- NVMe Copy command (single NS, TP 4065), including support
		in nvme-target (for block and file back end).

	2. Block layer
		- Block-generic copy (REQ_OP_COPY_DST/SRC), operation with
                  interface accommodating two block-devs
                - Merging copy requests in request layer
		- Emulation, for in-kernel user when offload is natively
                absent
		- dm-linear support (for cases not requiring split)

	3. User-interface
		- copy_file_range

Testing
=======
	Copy offload can be tested on:
	a. QEMU: NVME simple copy (TP 4065). By setting nvme-ns
		parameters mssrl,mcl, msrc. For more info [2].
	b. Null block device
        c. NVMe Fabrics loopback.
	d. blktests[3]

	Emulation can be tested on any device.

	fio[4].

Infra and plumbing:
===================
        We populate copy_file_range callback in def_blk_fops.
        For devices that support copy-offload, use blkdev_copy_offload to
        achieve in-device copy.
        However for cases, where device doesn't support offload,
        use splice_copy_file_range.
        For in-kernel users (like NVMe fabrics), use blkdev_copy_offload
        if device is copy offload capable or else use emulation
        using blkdev_copy_emulation.
        Modify checks in copy_file_range to support block-device.

Blktests[3]
======================
	tests/block/035-040: Runs copy offload and emulation on null
                              block device.
	tests/block/050,055: Runs copy offload and emulation on test
                              nvme block device.
        tests/nvme/056-067: Create a loop backed fabrics device and
                              run copy offload and emulation.

Future Work
===========
	- loopback device copy offload support
	- upstream fio to use copy offload
	- upstream blktest to test copy offload
        - update man pages for copy_file_range
        - expand in-kernel users of copy offload

	These are to be taken up after this minimal series is agreed upon.

Additional links:
=================
	[0] https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zT14mmPGMiVCzUgB33C60tbQ@mail.gmail.com/
            https://lore.kernel.org/linux-nvme/f0e19ae4-b37a-e9a3-2be7-a5afb334a5c3@nvidia.com/
            https://lore.kernel.org/linux-nvme/20230113094648.15614-1-nj.shetty@samsung.com/
	[1] https://lore.kernel.org/linux-nvme/20231222061313.12260-1-nj.shetty@samsung.com/
	[2] https://qemu-project.gitlab.io/qemu/system/devices/nvme.html#simple-copy
	[3] https://github.com/nitesh-shetty/blktests/tree/feat/copy_offload/v19
	[4] https://github.com/SamsungDS/fio/tree/copyoffload-3.35-v14

Changes since v19:
=================
        - block, nvme: update queue limits atomically
                Also remove reviewed by tag from Hannes and Luis for
                these patches. As we feel these patches changed
                significantly from the previous one.
        - vfs: generic_copy_file_range to splice_file_range
        - rebased to latest linux-next

Changes since v18:
=================
        - block, nvmet, null: change of copy dst/src request opcodes form
            19,21 to 18,19 (Keith Busch)
            Change the copy bio submission order to destination copy bio
            first, followed by source copy bio.

Changes since v17:
=================
        - block, nvmet: static error fixes (Dan Carpenter, kernel test robot)
        - nvmet: pass COPY_FILE_SPLICE flag for vfs_copy_file_range in case
                        file backed nvmet device

Changes since v16:
=================
        - block: fixed memory leaks and renamed function (Jinyoung Choi)
        - nvmet: static error fixes (kernel test robot)

Changes since v15:
=================
        - fs, nvmet: don't fallback to copy emulation for copy offload IO
                    failure, user can still use emulation by disabling
                    device offload (Hannes)
        - block: patch,function description changes (Hannes)
        - added Reviewed-by from Hannes and Luis.

Changes since v14:
=================
        - block: (Bart Van Assche)
            1. BLK_ prefix addition to COPY_MAX_BYES and COPY_MAX_SEGMENTS
            2. Improved function,patch,cover-letter description
            3. Simplified refcount updating.
        - null-blk, nvme:
            4. static warning fixes (kernel test robot)

Changes since v13:
=================
        - block:
            1. Simplified copy offload and emulation helpers, now
                  caller needs to decide between offload/emulation fallback
            2. src,dst bio order change (Christoph Hellwig)
            3. refcount changes similar to dio (Christoph Hellwig)
            4. Single outstanding IO for copy emulation (Christoph Hellwig)
            5. use copy_max_sectors to identify copy offload
                  capability and other reviews (Damien, Christoph)
            6. Return status in endio handler (Christoph Hellwig)
        - nvme-fabrics: fallback to emulation in case of partial
          offload completion
        - in kernel user addition (Ming lei)
        - indentation, documentation, minor fixes, misc changes (Damien,
          Christoph)
        - blktests changes to test kernel changes

Changes since v12:
=================
        - block,nvme: Replaced token based approach with request based
          single namespace capable approach (Christoph Hellwig)

Changes since v11:
=================
        - Documentation: Improved documentation (Damien Le Moal)
        - block,nvme: ssize_t return values (Darrick J. Wong)
        - block: token is allocated to SECTOR_SIZE (Matthew Wilcox)
        - block: mem leak fix (Maurizio Lombardi)

Changes since v10:
=================
        - NVMeOF: optimization in NVMe fabrics (Chaitanya Kulkarni)
        - NVMeOF: sparse warnings (kernel test robot)

Changes since v9:
=================
        - null_blk, improved documentation, minor fixes(Chaitanya Kulkarni)
        - fio, expanded testing and minor fixes (Vincent Fu)

Changes since v8:
=================
        - null_blk, copy_max_bytes_hw is made config fs parameter
          (Damien Le Moal)
        - Negative error handling in copy_file_range (Christian Brauner)
        - minor fixes, better documentation (Damien Le Moal)
        - fio upgraded to 3.34 (Vincent Fu)

Changes since v7:
=================
        - null block copy offload support for testing (Damien Le Moal)
        - adding direct flag check for copy offload to block device,
	  as we are using generic_copy_file_range for cached cases.
        - Minor fixes

Changes since v6:
=================
        - copy_file_range instead of ioctl for direct block device
        - Remove support for multi range (vectored) copy
        - Remove ioctl interface for copy.
        - Remove offload support in dm kcopyd.

Changes since v5:
=================
	- Addition of blktests (Chaitanya Kulkarni)
        - Minor fix for fabrics file backed path
        - Remove buggy zonefs copy file range implementation.

Changes since v4:
=================
	- make the offload and emulation design asynchronous (Hannes
	  Reinecke)
	- fabrics loopback support
	- sysfs naming improvements (Damien Le Moal)
	- use kfree() instead of kvfree() in cio_await_completion
	  (Damien Le Moal)
	- use ranges instead of rlist to represent range_entry (Damien
	  Le Moal)
	- change argument ordering in blk_copy_offload suggested (Damien
	  Le Moal)
	- removed multiple copy limit and merged into only one limit
	  (Damien Le Moal)
	- wrap overly long lines (Damien Le Moal)
	- other naming improvements and cleanups (Damien Le Moal)
	- correctly format the code example in description (Damien Le
	  Moal)
	- mark blk_copy_offload as static (kernel test robot)

Changes since v3:
=================
	- added copy_file_range support for zonefs
	- added documentation about new sysfs entries
	- incorporated review comments on v3
	- minor fixes

Changes since v2:
=================
	- fixed possible race condition reported by Damien Le Moal
	- new sysfs controls as suggested by Damien Le Moal
	- fixed possible memory leak reported by Dan Carpenter, lkp
	- minor fixes

Changes since v1:
=================
	- sysfs documentation (Greg KH)
        - 2 bios for copy operation (Bart Van Assche, Mikulas Patocka,
          Martin K. Petersen, Douglas Gilbert)
        - better payload design (Darrick J. Wong)

Anuj Gupta (1):
  fs/read_write: Enable copy_file_range for block device.

Nitesh Shetty (11):
  block: Introduce queue limits and sysfs for copy-offload support
  Add infrastructure for copy offload in block and request layer.
  block: add copy offload support
  block: add emulation for copy
  fs, block: copy_file_range for def_blk_ops for direct block device
  nvme: add copy offload support
  nvmet: add copy command support for bdev and file ns
  dm: Add support for copy offload
  dm: Enable copy offload for dm-linear target
  null: Enable trace capability for null block
  null_blk: add support for copy offload

 Documentation/ABI/stable/sysfs-block |  23 ++
 Documentation/block/null_blk.rst     |   5 +
 block/blk-core.c                     |   7 +
 block/blk-lib.c                      | 427 +++++++++++++++++++++++++++
 block/blk-merge.c                    |  41 +++
 block/blk-settings.c                 |  34 ++-
 block/blk-sysfs.c                    |  43 +++
 block/blk.h                          |  16 +
 block/elevator.h                     |   1 +
 block/fops.c                         |  26 ++
 drivers/block/null_blk/Makefile      |   2 -
 drivers/block/null_blk/main.c        | 105 ++++++-
 drivers/block/null_blk/null_blk.h    |   1 +
 drivers/block/null_blk/trace.h       |  25 ++
 drivers/block/null_blk/zoned.c       |   1 -
 drivers/md/dm-linear.c               |   1 +
 drivers/md/dm-table.c                |  37 +++
 drivers/md/dm.c                      |   7 +
 drivers/nvme/host/constants.c        |   1 +
 drivers/nvme/host/core.c             |  81 ++++-
 drivers/nvme/host/trace.c            |  19 ++
 drivers/nvme/target/admin-cmd.c      |   9 +-
 drivers/nvme/target/io-cmd-bdev.c    |  71 +++++
 drivers/nvme/target/io-cmd-file.c    |  50 ++++
 drivers/nvme/target/nvmet.h          |   1 +
 drivers/nvme/target/trace.c          |  19 ++
 fs/read_write.c                      |   8 +-
 include/linux/bio.h                  |   6 +-
 include/linux/blk_types.h            |  10 +
 include/linux/blkdev.h               |  23 ++
 include/linux/device-mapper.h        |   3 +
 include/linux/nvme.h                 |  43 ++-
 32 files changed, 1124 insertions(+), 22 deletions(-)


base-commit: dbd9e2e056d8577375ae4b31ada94f8aa3769e8a
-- 
2.17.1


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support
       [not found]   ` <CGME20240520102830epcas5p27274901f3d0c2738c515709890b1dec4@epcas5p2.samsung.com>
@ 2024-05-20 10:20     ` Nitesh Shetty
  2024-05-20 14:33       ` Damien Le Moal
                         ` (3 more replies)
  0 siblings, 4 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-20 10:20 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Nitesh Shetty, linux-block,
	linux-kernel, linux-doc, dm-devel, linux-nvme, linux-fsdevel

Add device limits as sysfs entries,
	- copy_max_bytes (RW)
	- copy_max_hw_bytes (RO)

Above limits help to split the copy payload in block layer.
copy_max_bytes: maximum total length of copy in single payload.
copy_max_hw_bytes: Reflects the device supported maximum limit.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 Documentation/ABI/stable/sysfs-block | 23 +++++++++++++++
 block/blk-settings.c                 | 34 ++++++++++++++++++++--
 block/blk-sysfs.c                    | 43 ++++++++++++++++++++++++++++
 include/linux/blkdev.h               | 14 +++++++++
 4 files changed, 112 insertions(+), 2 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 831f19a32e08..52d8a253bf8e 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -165,6 +165,29 @@ Description:
 		last zone of the device which may be smaller.
 
 
+What:		/sys/block/<disk>/queue/copy_max_bytes
+Date:		May 2024
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RW] This is the maximum number of bytes that the block layer
+		will allow for a copy request. This is always smaller or
+		equal to the maximum size allowed by the hardware, indicated by
+		'copy_max_hw_bytes'. An attempt to set a value higher than
+		'copy_max_hw_bytes' will truncate this to 'copy_max_hw_bytes'.
+		Writing '0' to this file will disable offloading copies for this
+		device, instead copy is done via emulation.
+
+
+What:		/sys/block/<disk>/queue/copy_max_hw_bytes
+Date:		May 2024
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RO] This is the maximum number of bytes that the hardware
+		will allow for single data copy request.
+		A value of 0 means that the device does not support
+		copy offload.
+
+
 What:		/sys/block/<disk>/queue/crypto/
 Date:		February 2022
 Contact:	linux-block@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index a7fe8e90240a..67010ed82422 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -52,6 +52,9 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_write_zeroes_sectors = UINT_MAX;
 	lim->max_zone_append_sectors = UINT_MAX;
 	lim->max_user_discard_sectors = UINT_MAX;
+	lim->max_copy_hw_sectors = UINT_MAX;
+	lim->max_copy_sectors = UINT_MAX;
+	lim->max_user_copy_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -219,6 +222,9 @@ static int blk_validate_limits(struct queue_limits *lim)
 		lim->misaligned = 0;
 	}
 
+	lim->max_copy_sectors =
+		min(lim->max_copy_hw_sectors, lim->max_user_copy_sectors);
+
 	return blk_validate_zoned_limits(lim);
 }
 
@@ -231,10 +237,11 @@ int blk_set_default_limits(struct queue_limits *lim)
 {
 	/*
 	 * Most defaults are set by capping the bounds in blk_validate_limits,
-	 * but max_user_discard_sectors is special and needs an explicit
-	 * initialization to the max value here.
+	 * but max_user_discard_sectors and max_user_copy_sectors are special
+	 * and needs an explicit initialization to the max value here.
 	 */
 	lim->max_user_discard_sectors = UINT_MAX;
+	lim->max_user_copy_sectors = UINT_MAX;
 	return blk_validate_limits(lim);
 }
 
@@ -316,6 +323,25 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
 
+/*
+ * blk_queue_max_copy_hw_sectors - set max sectors for a single copy payload
+ * @q:	the request queue for the device
+ * @max_copy_sectors: maximum number of sectors to copy
+ */
+void blk_queue_max_copy_hw_sectors(struct request_queue *q,
+				   unsigned int max_copy_sectors)
+{
+	struct queue_limits *lim = &q->limits;
+
+	if (max_copy_sectors > (BLK_COPY_MAX_BYTES >> SECTOR_SHIFT))
+		max_copy_sectors = BLK_COPY_MAX_BYTES >> SECTOR_SHIFT;
+
+	lim->max_copy_hw_sectors = max_copy_sectors;
+	lim->max_copy_sectors =
+		min(max_copy_sectors, lim->max_user_copy_sectors);
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_hw_sectors);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
@@ -633,6 +659,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_segment_size = min_not_zero(t->max_segment_size,
 					   b->max_segment_size);
 
+	t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
+	t->max_copy_hw_sectors = min(t->max_copy_hw_sectors,
+				     b->max_copy_hw_sectors);
+
 	t->misaligned |= b->misaligned;
 
 	alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f0f9314ab65c..805c2b6b0393 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -205,6 +205,44 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
 	return queue_var_show(0, page);
 }
 
+static ssize_t queue_copy_hw_max_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%llu\n", (unsigned long long)
+		       q->limits.max_copy_hw_sectors << SECTOR_SHIFT);
+}
+
+static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%llu\n", (unsigned long long)
+		       q->limits.max_copy_sectors << SECTOR_SHIFT);
+}
+
+static ssize_t queue_copy_max_store(struct request_queue *q, const char *page,
+				    size_t count)
+{
+	unsigned long max_copy_bytes;
+	struct queue_limits lim;
+	ssize_t ret;
+	int err;
+
+	ret = queue_var_store(&max_copy_bytes, page, count);
+	if (ret < 0)
+		return ret;
+
+	if (max_copy_bytes & (queue_logical_block_size(q) - 1))
+		return -EINVAL;
+
+	blk_mq_freeze_queue(q);
+	lim = queue_limits_start_update(q);
+	lim.max_user_copy_sectors = max_copy_bytes >> SECTOR_SHIFT;
+	err = queue_limits_commit_update(q, &lim);
+	blk_mq_unfreeze_queue(q);
+
+	if (err)
+		return err;
+	return count;
+}
+
 static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(0, page);
@@ -505,6 +543,9 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
 QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
 QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
 
+QUEUE_RO_ENTRY(queue_copy_hw_max, "copy_max_hw_bytes");
+QUEUE_RW_ENTRY(queue_copy_max, "copy_max_bytes");
+
 QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
 QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
 QUEUE_RW_ENTRY(queue_poll, "io_poll");
@@ -618,6 +659,8 @@ static struct attribute *queue_attrs[] = {
 	&queue_discard_max_entry.attr,
 	&queue_discard_max_hw_entry.attr,
 	&queue_discard_zeroes_data_entry.attr,
+	&queue_copy_hw_max_entry.attr,
+	&queue_copy_max_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
 	&queue_zone_append_max_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aefdda9f4ec7..109d9f905c3c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -309,6 +309,10 @@ struct queue_limits {
 	unsigned int		discard_alignment;
 	unsigned int		zone_write_granularity;
 
+	unsigned int		max_copy_hw_sectors;
+	unsigned int		max_copy_sectors;
+	unsigned int		max_user_copy_sectors;
+
 	unsigned short		max_segments;
 	unsigned short		max_integrity_segments;
 	unsigned short		max_discard_segments;
@@ -933,6 +937,8 @@ void blk_queue_max_secure_erase_sectors(struct request_queue *q,
 		unsigned int max_sectors);
 extern void blk_queue_max_discard_sectors(struct request_queue *q,
 		unsigned int max_discard_sectors);
+extern void blk_queue_max_copy_hw_sectors(struct request_queue *q,
+					  unsigned int max_copy_sectors);
 extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
@@ -1271,6 +1277,14 @@ static inline unsigned int bdev_discard_granularity(struct block_device *bdev)
 	return bdev_get_queue(bdev)->limits.discard_granularity;
 }
 
+/* maximum copy offload length, this is set to 128MB based on current testing */
+#define BLK_COPY_MAX_BYTES		(1 << 27)
+
+static inline unsigned int bdev_max_copy_sectors(struct block_device *bdev)
+{
+	return bdev_get_queue(bdev)->limits.max_copy_sectors;
+}
+
 static inline unsigned int
 bdev_max_secure_erase_sectors(struct block_device *bdev)
 {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
       [not found]   ` <CGME20240520102842epcas5p4949334c2587a15b8adab2c913daa622f@epcas5p4.samsung.com>
@ 2024-05-20 10:20     ` Nitesh Shetty
  2024-05-20 15:00       ` Damien Le Moal
                         ` (4 more replies)
  0 siblings, 5 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-20 10:20 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Nitesh Shetty, linux-block,
	linux-kernel, linux-doc, dm-devel, linux-nvme, linux-fsdevel

We add two new opcode REQ_OP_COPY_DST, REQ_OP_COPY_SRC.
Since copy is a composite operation involving src and dst sectors/lba,
each needs to be represented by a separate bio to make it compatible
with device mapper.
We expect caller to take a plug and send bio with destination information,
followed by bio with source information.
Once the dst bio arrives we form a request and wait for source
bio. Upon arrival of source bio we merge these two bio's and send
corresponding request down to device driver.
Merging non copy offload bio is avoided by checking for copy specific
opcodes in merge function.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/blk-core.c          |  7 +++++++
 block/blk-merge.c         | 41 +++++++++++++++++++++++++++++++++++++++
 block/blk.h               | 16 +++++++++++++++
 block/elevator.h          |  1 +
 include/linux/bio.h       |  6 +-----
 include/linux/blk_types.h | 10 ++++++++++
 6 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ea44b13af9af..f18ee5f709c0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -122,6 +122,8 @@ static const char *const blk_op_name[] = {
 	REQ_OP_NAME(ZONE_FINISH),
 	REQ_OP_NAME(ZONE_APPEND),
 	REQ_OP_NAME(WRITE_ZEROES),
+	REQ_OP_NAME(COPY_SRC),
+	REQ_OP_NAME(COPY_DST),
 	REQ_OP_NAME(DRV_IN),
 	REQ_OP_NAME(DRV_OUT),
 };
@@ -838,6 +840,11 @@ void submit_bio_noacct(struct bio *bio)
 		 * requests.
 		 */
 		fallthrough;
+	case REQ_OP_COPY_SRC:
+	case REQ_OP_COPY_DST:
+		if (!q->limits.max_copy_sectors)
+			goto not_supported;
+		break;
 	default:
 		goto not_supported;
 	}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8534c35e0497..f8dc48a03379 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -154,6 +154,20 @@ static struct bio *bio_split_write_zeroes(struct bio *bio,
 	return bio_split(bio, lim->max_write_zeroes_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *bio_split_copy(struct bio *bio,
+				  const struct queue_limits *lim,
+				  unsigned int *nsegs)
+{
+	*nsegs = 1;
+	if (bio_sectors(bio) <= lim->max_copy_sectors)
+		return NULL;
+	/*
+	 * We don't support splitting for a copy bio. End it with EIO if
+	 * splitting is required and return an error pointer.
+	 */
+	return ERR_PTR(-EIO);
+}
+
 /*
  * Return the maximum number of sectors from the start of a bio that may be
  * submitted as a single request to a block device. If enough sectors remain,
@@ -362,6 +376,12 @@ struct bio *__bio_split_to_limits(struct bio *bio,
 	case REQ_OP_WRITE_ZEROES:
 		split = bio_split_write_zeroes(bio, lim, nr_segs, bs);
 		break;
+	case REQ_OP_COPY_SRC:
+	case REQ_OP_COPY_DST:
+		split = bio_split_copy(bio, lim, nr_segs);
+		if (IS_ERR(split))
+			return NULL;
+		break;
 	default:
 		split = bio_split_rw(bio, lim, nr_segs, bs,
 				get_max_io_size(bio, lim) << SECTOR_SHIFT);
@@ -925,6 +945,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
 	if (!rq_mergeable(rq) || !bio_mergeable(bio))
 		return false;
 
+	if (blk_copy_offload_mergable(rq, bio))
+		return true;
+
 	if (req_op(rq) != bio_op(bio))
 		return false;
 
@@ -958,6 +981,8 @@ enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
 {
 	if (blk_discard_mergable(rq))
 		return ELEVATOR_DISCARD_MERGE;
+	else if (blk_copy_offload_mergable(rq, bio))
+		return ELEVATOR_COPY_OFFLOAD_MERGE;
 	else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
 		return ELEVATOR_BACK_MERGE;
 	else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
@@ -1065,6 +1090,20 @@ static enum bio_merge_status bio_attempt_discard_merge(struct request_queue *q,
 	return BIO_MERGE_FAILED;
 }
 
+static enum bio_merge_status bio_attempt_copy_offload_merge(struct request *req,
+							    struct bio *bio)
+{
+	if (req->__data_len != bio->bi_iter.bi_size)
+		return BIO_MERGE_FAILED;
+
+	req->biotail->bi_next = bio;
+	req->biotail = bio;
+	req->nr_phys_segments++;
+	req->__data_len += bio->bi_iter.bi_size;
+
+	return BIO_MERGE_OK;
+}
+
 static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
 						   struct request *rq,
 						   struct bio *bio,
@@ -1085,6 +1124,8 @@ static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
 		break;
 	case ELEVATOR_DISCARD_MERGE:
 		return bio_attempt_discard_merge(q, rq, bio);
+	case ELEVATOR_COPY_OFFLOAD_MERGE:
+		return bio_attempt_copy_offload_merge(rq, bio);
 	default:
 		return BIO_MERGE_NONE;
 	}
diff --git a/block/blk.h b/block/blk.h
index 189bc25beb50..6528a2779b84 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -174,6 +174,20 @@ static inline bool blk_discard_mergable(struct request *req)
 	return false;
 }
 
+/*
+ * Copy offload sends a pair of bio with REQ_OP_COPY_DST and REQ_OP_COPY_SRC
+ * operation by taking a plug.
+ * Initially DST bio is sent which forms a request and
+ * waits for SRC bio to arrive. Once SRC bio arrives
+ * we merge it and send request down to driver.
+ */
+static inline bool blk_copy_offload_mergable(struct request *req,
+					     struct bio *bio)
+{
+	return (req_op(req) == REQ_OP_COPY_DST &&
+		bio_op(bio) == REQ_OP_COPY_SRC);
+}
+
 static inline unsigned int blk_rq_get_max_segments(struct request *rq)
 {
 	if (req_op(rq) == REQ_OP_DISCARD)
@@ -323,6 +337,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_COPY_SRC:
+	case REQ_OP_COPY_DST:
 		return true; /* non-trivial splitting decisions */
 	default:
 		break;
diff --git a/block/elevator.h b/block/elevator.h
index e9a050a96e53..c7a45c1f4156 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -18,6 +18,7 @@ enum elv_merge {
 	ELEVATOR_FRONT_MERGE	= 1,
 	ELEVATOR_BACK_MERGE	= 2,
 	ELEVATOR_DISCARD_MERGE	= 3,
+	ELEVATOR_COPY_OFFLOAD_MERGE	= 4,
 };
 
 struct blk_mq_alloc_data;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d5379548d684..528ef22dd65b 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -53,11 +53,7 @@ static inline unsigned int bio_max_segs(unsigned int nr_segs)
  */
 static inline bool bio_has_data(struct bio *bio)
 {
-	if (bio &&
-	    bio->bi_iter.bi_size &&
-	    bio_op(bio) != REQ_OP_DISCARD &&
-	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
-	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
+	if (bio && (bio_op(bio) == REQ_OP_READ || bio_op(bio) == REQ_OP_WRITE))
 		return true;
 
 	return false;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 781c4500491b..7f692bade271 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -342,6 +342,10 @@ enum req_op {
 	/* reset all the zone present on the device */
 	REQ_OP_ZONE_RESET_ALL	= (__force blk_opf_t)15,
 
+	/* copy offload src and dst operation */
+	REQ_OP_COPY_SRC		= (__force blk_opf_t)18,
+	REQ_OP_COPY_DST		= (__force blk_opf_t)19,
+
 	/* Driver private requests */
 	REQ_OP_DRV_IN		= (__force blk_opf_t)34,
 	REQ_OP_DRV_OUT		= (__force blk_opf_t)35,
@@ -430,6 +434,12 @@ static inline bool op_is_write(blk_opf_t op)
 	return !!(op & (__force blk_opf_t)1);
 }
 
+static inline bool op_is_copy(blk_opf_t op)
+{
+	return ((op & REQ_OP_MASK) == REQ_OP_COPY_SRC ||
+		(op & REQ_OP_MASK) == REQ_OP_COPY_DST);
+}
+
 /*
  * Check if the bio or request is one that needs special treatment in the
  * flush state machine.
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v20 03/12] block: add copy offload support
       [not found]   ` <CGME20240520102853epcas5p42d635d6712b8876ea22a45d730cb1378@epcas5p4.samsung.com>
@ 2024-05-20 10:20     ` Nitesh Shetty
  2024-06-01  6:16       ` Christoph Hellwig
  0 siblings, 1 reply; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-20 10:20 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Nitesh Shetty, linux-block,
	linux-kernel, linux-doc, dm-devel, linux-nvme, linux-fsdevel

Introduce blkdev_copy_offload to perform copy offload.
Issue REQ_OP_COPY_DST with destination info along with taking a plug.
This flows till request layer and waits for src bio to arrive.
Issue REQ_OP_COPY_SRC with source info and this bio reaches request
layer and merges with dst request.
For any reason, if a request comes to the driver with only one of src/dst
bio, we fail the copy offload.

Larger copy will be divided, based on max_copy_sectors limit.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 block/blk-lib.c        | 204 +++++++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h |   4 +
 2 files changed, 208 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 442da9dad042..e83461abb581 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -10,6 +10,22 @@
 
 #include "blk.h"
 
+/* Keeps track of all outstanding copy IO */
+struct blkdev_copy_io {
+	atomic_t refcount;
+	ssize_t copied;
+	int status;
+	struct task_struct *waiter;
+	void (*endio)(void *private, int status, ssize_t copied);
+	void *private;
+};
+
+/* Keeps track of single outstanding copy offload IO */
+struct blkdev_copy_offload_io {
+	struct blkdev_copy_io *cio;
+	loff_t offset;
+};
+
 static sector_t bio_discard_limit(struct block_device *bdev, sector_t sector)
 {
 	unsigned int discard_granularity = bdev_discard_granularity(bdev);
@@ -103,6 +119,194 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+static inline ssize_t blkdev_copy_sanity_check(struct block_device *bdev_in,
+					       loff_t pos_in,
+					       struct block_device *bdev_out,
+					       loff_t pos_out, size_t len)
+{
+	unsigned int align = max(bdev_logical_block_size(bdev_out),
+				 bdev_logical_block_size(bdev_in)) - 1;
+
+	if ((pos_in & align) || (pos_out & align) || (len & align) || !len ||
+	    len >= BLK_COPY_MAX_BYTES)
+		return -EINVAL;
+
+	return 0;
+}
+
+static inline void blkdev_copy_endio(struct blkdev_copy_io *cio)
+{
+	if (cio->endio) {
+		cio->endio(cio->private, cio->status, cio->copied);
+		kfree(cio);
+	} else {
+		struct task_struct *waiter = cio->waiter;
+
+		WRITE_ONCE(cio->waiter, NULL);
+		blk_wake_io_task(waiter);
+	}
+}
+
+/*
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to complete.
+ * Returns the length of bytes copied or error
+ */
+static ssize_t blkdev_copy_wait_for_completion_io(struct blkdev_copy_io *cio)
+{
+	ssize_t ret;
+
+	for (;;) {
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		if (!READ_ONCE(cio->waiter))
+			break;
+		blk_io_schedule();
+	}
+	__set_current_state(TASK_RUNNING);
+	ret = cio->copied;
+	kfree(cio);
+
+	return ret;
+}
+
+static void blkdev_copy_offload_src_endio(struct bio *bio)
+{
+	struct blkdev_copy_offload_io *offload_io = bio->bi_private;
+	struct blkdev_copy_io *cio = offload_io->cio;
+
+	if (bio->bi_status) {
+		cio->copied = min_t(ssize_t, offload_io->offset, cio->copied);
+		if (!cio->status)
+			cio->status = blk_status_to_errno(bio->bi_status);
+	}
+	bio_put(bio);
+	kfree(offload_io);
+
+	if (atomic_dec_and_test(&cio->refcount))
+		blkdev_copy_endio(cio);
+}
+
+/*
+ * @bdev:	block device
+ * @pos_in:	source offset
+ * @pos_out:	destination offset
+ * @len:	length in bytes to be copied
+ * @endio:	endio function to be called on completion of copy operation,
+ *		for synchronous operation this should be NULL
+ * @private:	endio function will be called with this private data,
+ *		for synchronous operation this should be NULL
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * For synchronous operation returns the length of bytes copied or error
+ * For asynchronous operation returns -EIOCBQUEUED or error
+ *
+ * Description:
+ *	Copy source offset to destination offset within block device, using
+ *	device's native copy offload feature.
+ *	We perform copy operation using 2 bio's.
+ *	1. We take a plug and send a REQ_OP_COPY_DST bio along with destination
+ *	sector and length. Once this bio reaches request layer, we form a
+ *	request and wait for dst bio to arrive.
+ *	2. We issue REQ_OP_COPY_SRC bio along with source sector, length.
+ *	Once this bio reaches request layer and find a request with previously
+ *	sent destination info we merge the source bio and return.
+ *	3. Release the plug and request is sent to driver
+ *	This design works only for drivers with request queue.
+ */
+ssize_t blkdev_copy_offload(struct block_device *bdev, loff_t pos_in,
+			    loff_t pos_out, size_t len,
+			    void (*endio)(void *, int, ssize_t),
+			    void *private, gfp_t gfp)
+{
+	struct blkdev_copy_io *cio;
+	struct blkdev_copy_offload_io *offload_io;
+	struct bio *src_bio, *dst_bio;
+	size_t rem, chunk;
+	size_t max_copy_bytes = bdev_max_copy_sectors(bdev) << SECTOR_SHIFT;
+	ssize_t ret;
+	struct blk_plug plug;
+
+	if (!max_copy_bytes)
+		return -EOPNOTSUPP;
+
+	ret = blkdev_copy_sanity_check(bdev, pos_in, bdev, pos_out, len);
+	if (ret)
+		return ret;
+
+	cio = kzalloc(sizeof(*cio), gfp);
+	if (!cio)
+		return -ENOMEM;
+	atomic_set(&cio->refcount, 1);
+	cio->waiter = current;
+	cio->endio = endio;
+	cio->private = private;
+
+	/*
+	 * If there is a error, copied will be set to least successfully
+	 * completed copied length
+	 */
+	cio->copied = len;
+	for (rem = len; rem > 0; rem -= chunk) {
+		chunk = min(rem, max_copy_bytes);
+
+		offload_io = kzalloc(sizeof(*offload_io), gfp);
+		if (!offload_io)
+			goto err_free_cio;
+		offload_io->cio = cio;
+		/*
+		 * For partial completion, we use offload_io->offset to truncate
+		 * successful copy length
+		 */
+		offload_io->offset = len - rem;
+
+		dst_bio = bio_alloc(bdev, 0, REQ_OP_COPY_DST, gfp);
+		if (!dst_bio)
+			goto err_free_offload_io;
+		dst_bio->bi_iter.bi_size = chunk;
+		dst_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
+
+		blk_start_plug(&plug);
+		src_bio = blk_next_bio(dst_bio, bdev, 0, REQ_OP_COPY_SRC, gfp);
+		if (!src_bio)
+			goto err_free_dst_bio;
+		src_bio->bi_iter.bi_size = chunk;
+		src_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
+		src_bio->bi_end_io = blkdev_copy_offload_src_endio;
+		src_bio->bi_private = offload_io;
+
+		atomic_inc(&cio->refcount);
+		submit_bio(src_bio);
+		blk_finish_plug(&plug);
+		pos_in += chunk;
+		pos_out += chunk;
+	}
+
+	if (atomic_dec_and_test(&cio->refcount))
+		blkdev_copy_endio(cio);
+	if (endio)
+		return -EIOCBQUEUED;
+
+	return blkdev_copy_wait_for_completion_io(cio);
+
+err_free_dst_bio:
+	bio_put(dst_bio);
+err_free_offload_io:
+	kfree(offload_io);
+err_free_cio:
+	cio->copied = min_t(ssize_t, cio->copied, (len - rem));
+	cio->status = -ENOMEM;
+	if (rem == len) {
+		ret = cio->status;
+		kfree(cio);
+		return ret;
+	}
+	if (cio->endio)
+		return cio->status;
+
+	return blkdev_copy_wait_for_completion_io(cio);
+}
+EXPORT_SYMBOL_GPL(blkdev_copy_offload);
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
 		struct bio **biop, unsigned flags)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 109d9f905c3c..3b88dcd5c433 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1079,6 +1079,10 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop);
 int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp);
+ssize_t blkdev_copy_offload(struct block_device *bdev, loff_t pos_in,
+			    loff_t pos_out, size_t len,
+			    void (*endio)(void *, int, ssize_t),
+			    void *private, gfp_t gfp_mask);
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v20 04/12] block: add emulation for copy
       [not found]   ` <CGME20240520102906epcas5p15b5a0b3c8edd0bf3073030a792a328bb@epcas5p1.samsung.com>
@ 2024-05-20 10:20     ` Nitesh Shetty
  2024-05-21  7:06       ` Hannes Reinecke
  2024-06-01  6:18       ` Christoph Hellwig
  0 siblings, 2 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-20 10:20 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Nitesh Shetty, Vincent Fu,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

For the devices which does not support copy, copy emulation is added.
It is required for in-kernel users like fabrics, where file descriptor is
not available and hence they can't use copy_file_range.
Copy-emulation is implemented by reading from source into memory and
writing to the corresponding destination.
At present in kernel user of emulation is fabrics.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/blk-lib.c        | 223 +++++++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h |   4 +
 2 files changed, 227 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e83461abb581..e0dd17f42c8e 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -26,6 +26,20 @@ struct blkdev_copy_offload_io {
 	loff_t offset;
 };
 
+/* Keeps track of single outstanding copy emulation IO */
+struct blkdev_copy_emulation_io {
+	struct blkdev_copy_io *cio;
+	struct work_struct emulation_work;
+	void *buf;
+	ssize_t buf_len;
+	loff_t pos_in;
+	loff_t pos_out;
+	ssize_t len;
+	struct block_device *bdev_in;
+	struct block_device *bdev_out;
+	gfp_t gfp;
+};
+
 static sector_t bio_discard_limit(struct block_device *bdev, sector_t sector)
 {
 	unsigned int discard_granularity = bdev_discard_granularity(bdev);
@@ -307,6 +321,215 @@ ssize_t blkdev_copy_offload(struct block_device *bdev, loff_t pos_in,
 }
 EXPORT_SYMBOL_GPL(blkdev_copy_offload);
 
+static void *blkdev_copy_alloc_buf(ssize_t req_size, ssize_t *alloc_size,
+				   gfp_t gfp)
+{
+	int min_size = PAGE_SIZE;
+	char *buf;
+
+	while (req_size >= min_size) {
+		buf = kvmalloc(req_size, gfp);
+		if (buf) {
+			*alloc_size = req_size;
+			return buf;
+		}
+		req_size >>= 1;
+	}
+
+	return NULL;
+}
+
+static struct bio *bio_map_buf(void *data, unsigned int len, gfp_t gfp)
+{
+	unsigned long kaddr = (unsigned long)data;
+	unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	unsigned long start = kaddr >> PAGE_SHIFT;
+	const int nr_pages = end - start;
+	bool is_vmalloc = is_vmalloc_addr(data);
+	struct page *page;
+	int offset, i;
+	struct bio *bio;
+
+	bio = bio_kmalloc(nr_pages, gfp);
+	if (!bio)
+		return ERR_PTR(-ENOMEM);
+	bio_init(bio, NULL, bio->bi_inline_vecs, nr_pages, 0);
+
+	if (is_vmalloc) {
+		flush_kernel_vmap_range(data, len);
+		bio->bi_private = data;
+	}
+
+	offset = offset_in_page(kaddr);
+	for (i = 0; i < nr_pages; i++) {
+		unsigned int bytes = PAGE_SIZE - offset;
+
+		if (len <= 0)
+			break;
+
+		if (bytes > len)
+			bytes = len;
+
+		if (!is_vmalloc)
+			page = virt_to_page(data);
+		else
+			page = vmalloc_to_page(data);
+		if (bio_add_page(bio, page, bytes, offset) < bytes) {
+			/* we don't support partial mappings */
+			bio_uninit(bio);
+			kfree(bio);
+			return ERR_PTR(-EINVAL);
+		}
+
+		data += bytes;
+		len -= bytes;
+		offset = 0;
+	}
+
+	return bio;
+}
+
+static void blkdev_copy_emulation_work(struct work_struct *work)
+{
+	struct blkdev_copy_emulation_io *emulation_io = container_of(work,
+			struct blkdev_copy_emulation_io, emulation_work);
+	struct blkdev_copy_io *cio = emulation_io->cio;
+	struct bio *read_bio, *write_bio;
+	loff_t pos_in = emulation_io->pos_in, pos_out = emulation_io->pos_out;
+	ssize_t rem, chunk;
+	int ret = 0;
+
+	for (rem = emulation_io->len; rem > 0; rem -= chunk) {
+		chunk = min_t(int, emulation_io->buf_len, rem);
+
+		read_bio = bio_map_buf(emulation_io->buf,
+				       emulation_io->buf_len,
+				       emulation_io->gfp);
+		if (IS_ERR(read_bio)) {
+			ret = PTR_ERR(read_bio);
+			break;
+		}
+		read_bio->bi_opf = REQ_OP_READ | REQ_SYNC;
+		bio_set_dev(read_bio, emulation_io->bdev_in);
+		read_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
+		read_bio->bi_iter.bi_size = chunk;
+		ret = submit_bio_wait(read_bio);
+		kfree(read_bio);
+		if (ret)
+			break;
+
+		write_bio = bio_map_buf(emulation_io->buf,
+					emulation_io->buf_len,
+					emulation_io->gfp);
+		if (IS_ERR(write_bio)) {
+			ret = PTR_ERR(write_bio);
+			break;
+		}
+		write_bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
+		bio_set_dev(write_bio, emulation_io->bdev_out);
+		write_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
+		write_bio->bi_iter.bi_size = chunk;
+		ret = submit_bio_wait(write_bio);
+		kfree(write_bio);
+		if (ret)
+			break;
+
+		pos_in += chunk;
+		pos_out += chunk;
+	}
+	cio->status = ret;
+	kvfree(emulation_io->buf);
+	kfree(emulation_io);
+	blkdev_copy_endio(cio);
+}
+
+static inline ssize_t queue_max_hw_bytes(struct request_queue *q)
+{
+	return min_t(ssize_t, queue_max_hw_sectors(q) << SECTOR_SHIFT,
+		     queue_max_segments(q) << PAGE_SHIFT);
+}
+/*
+ * @bdev_in:	source block device
+ * @pos_in:	source offset
+ * @bdev_out:	destination block device
+ * @pos_out:	destination offset
+ * @len:	length in bytes to be copied
+ * @endio:	endio function to be called on completion of copy operation,
+ *		for synchronous operation this should be NULL
+ * @private:	endio function will be called with this private data,
+ *		for synchronous operation this should be NULL
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * For synchronous operation returns the length of bytes copied or error
+ * For asynchronous operation returns -EIOCBQUEUED or error
+ *
+ * Description:
+ *	If native copy offload feature is absent, caller can use this function
+ *	to perform copy.
+ *	We store information required to perform the copy along with temporary
+ *	buffer allocation. We async punt copy emulation to a worker. And worker
+ *	performs copy in 2 steps.
+ *	1. Read data from source to temporary buffer
+ *	2. Write data to destination from temporary buffer
+ */
+ssize_t blkdev_copy_emulation(struct block_device *bdev_in, loff_t pos_in,
+			      struct block_device *bdev_out, loff_t pos_out,
+			      size_t len, void (*endio)(void *, int, ssize_t),
+			      void *private, gfp_t gfp)
+{
+	struct request_queue *in = bdev_get_queue(bdev_in);
+	struct request_queue *out = bdev_get_queue(bdev_out);
+	struct blkdev_copy_emulation_io *emulation_io;
+	struct blkdev_copy_io *cio;
+	ssize_t ret;
+	size_t max_hw_bytes = min(queue_max_hw_bytes(in),
+				  queue_max_hw_bytes(out));
+
+	ret = blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len);
+	if (ret)
+		return ret;
+
+	cio = kzalloc(sizeof(*cio), gfp);
+	if (!cio)
+		return -ENOMEM;
+
+	cio->waiter = current;
+	cio->copied = len;
+	cio->endio = endio;
+	cio->private = private;
+
+	emulation_io = kzalloc(sizeof(*emulation_io), gfp);
+	if (!emulation_io)
+		goto err_free_cio;
+	emulation_io->cio = cio;
+	INIT_WORK(&emulation_io->emulation_work, blkdev_copy_emulation_work);
+	emulation_io->pos_in = pos_in;
+	emulation_io->pos_out = pos_out;
+	emulation_io->len = len;
+	emulation_io->bdev_in = bdev_in;
+	emulation_io->bdev_out = bdev_out;
+	emulation_io->gfp = gfp;
+
+	emulation_io->buf = blkdev_copy_alloc_buf(min(max_hw_bytes, len),
+						  &emulation_io->buf_len, gfp);
+	if (!emulation_io->buf)
+		goto err_free_emulation_io;
+
+	schedule_work(&emulation_io->emulation_work);
+
+	if (cio->endio)
+		return -EIOCBQUEUED;
+
+	return blkdev_copy_wait_for_completion_io(cio);
+
+err_free_emulation_io:
+	kfree(emulation_io);
+err_free_cio:
+	kfree(cio);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(blkdev_copy_emulation);
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
 		struct bio **biop, unsigned flags)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3b88dcd5c433..8b1edb46880a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1083,6 +1083,10 @@ ssize_t blkdev_copy_offload(struct block_device *bdev, loff_t pos_in,
 			    loff_t pos_out, size_t len,
 			    void (*endio)(void *, int, ssize_t),
 			    void *private, gfp_t gfp_mask);
+ssize_t blkdev_copy_emulation(struct block_device *bdev_in, loff_t pos_in,
+			      struct block_device *bdev_out, loff_t pos_out,
+			      size_t len, void (*endio)(void *, int, ssize_t),
+			      void *private, gfp_t gfp);
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v20 05/12] fs/read_write: Enable copy_file_range for block device.
       [not found]   ` <CGME20240520102917epcas5p1bda532309b9174bf2702081f6f58daf7@epcas5p1.samsung.com>
@ 2024-05-20 10:20     ` Nitesh Shetty
  2024-05-21  7:07       ` Hannes Reinecke
  2024-05-25 23:02       ` Dave Chinner
  0 siblings, 2 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-20 10:20 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Nitesh Shetty, linux-block,
	linux-kernel, linux-doc, dm-devel, linux-nvme, linux-fsdevel

From: Anuj Gupta <anuj20.g@samsung.com>

This is a prep patch. Allow copy_file_range to work for block devices.
Relaxing generic_copy_file_checks allows us to reuse the existing infra,
instead of adding a new user interface for block copy offload.
Change generic_copy_file_checks to use ->f_mapping->host for both inode_in
and inode_out. Allow block device in generic_file_rw_checks.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 fs/read_write.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index ef6339391351..31645ca5ed58 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1413,8 +1413,8 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
 				    struct file *file_out, loff_t pos_out,
 				    size_t *req_count, unsigned int flags)
 {
-	struct inode *inode_in = file_inode(file_in);
-	struct inode *inode_out = file_inode(file_out);
+	struct inode *inode_in = file_in->f_mapping->host;
+	struct inode *inode_out = file_out->f_mapping->host;
 	uint64_t count = *req_count;
 	loff_t size_in;
 	int ret;
@@ -1726,7 +1726,9 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
 	/* Don't copy dirs, pipes, sockets... */
 	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
 		return -EISDIR;
-	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+	if (!S_ISREG(inode_in->i_mode) && !S_ISBLK(inode_in->i_mode))
+		return -EINVAL;
+	if ((inode_in->i_mode & S_IFMT) != (inode_out->i_mode & S_IFMT))
 		return -EINVAL;
 
 	if (!(file_in->f_mode & FMODE_READ) ||
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v20 06/12] fs, block: copy_file_range for def_blk_ops for direct block device
       [not found]   ` <CGME20240520102929epcas5p2f4456f6fa0005d90769615eb2c2bf273@epcas5p2.samsung.com>
@ 2024-05-20 10:20     ` Nitesh Shetty
  2024-05-25 23:09       ` Dave Chinner
  0 siblings, 1 reply; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-20 10:20 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Nitesh Shetty, linux-block,
	linux-kernel, linux-doc, dm-devel, linux-nvme, linux-fsdevel

For direct block device opened with O_DIRECT, use blkdev_copy_offload to
issue device copy offload, or use splice_copy_file_range in case
device copy offload capability is absent or the device files are not open
with O_DIRECT.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 block/fops.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/block/fops.c b/block/fops.c
index 376265935714..5a4bba4f43aa 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -17,6 +17,7 @@
 #include <linux/fs.h>
 #include <linux/iomap.h>
 #include <linux/module.h>
+#include <linux/splice.h>
 #include "blk.h"
 
 static inline struct inode *bdev_file_inode(struct file *file)
@@ -754,6 +755,30 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return ret;
 }
 
+static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
+				      struct file *file_out, loff_t pos_out,
+				      size_t len, unsigned int flags)
+{
+	struct block_device *in_bdev = I_BDEV(bdev_file_inode(file_in));
+	struct block_device *out_bdev = I_BDEV(bdev_file_inode(file_out));
+	ssize_t copied = 0;
+
+	if ((in_bdev == out_bdev) && bdev_max_copy_sectors(in_bdev) &&
+	    (file_in->f_iocb_flags & IOCB_DIRECT) &&
+	    (file_out->f_iocb_flags & IOCB_DIRECT)) {
+		copied = blkdev_copy_offload(in_bdev, pos_in, pos_out, len,
+					     NULL, NULL, GFP_KERNEL);
+		if (copied < 0)
+			copied = 0;
+	} else {
+		copied = splice_copy_file_range(file_in, pos_in + copied,
+						 file_out, pos_out + copied,
+						 len - copied);
+	}
+
+	return copied;
+}
+
 #define	BLKDEV_FALLOC_FL_SUPPORTED					\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
 		 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
@@ -859,6 +884,7 @@ const struct file_operations def_blk_fops = {
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= blkdev_fallocate,
 	.fop_flags	= FOP_BUFFER_RASYNC,
+	.copy_file_range = blkdev_copy_file_range,
 };
 
 static __init int blkdev_init(void)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v20 07/12] nvme: add copy offload support
       [not found]   ` <CGME20240520102940epcas5p2b5f38ceabe94bed3905fb386a0d65ec7@epcas5p2.samsung.com>
@ 2024-05-20 10:20     ` Nitesh Shetty
  2024-06-01  6:22       ` Christoph Hellwig
  0 siblings, 1 reply; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-20 10:20 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Nitesh Shetty,
	Javier González, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

Current design only supports single source range.
We receive a request with REQ_OP_COPY_DST.
Parse this request which consists of dst(1st) and src(2nd) bios.
Form a copy command (TP 4065)

trace event support for nvme_copy_cmd.
Set the device copy limits to queue limits.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 drivers/nvme/host/constants.c |  1 +
 drivers/nvme/host/core.c      | 81 ++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/trace.c     | 19 ++++++++
 include/linux/blkdev.h        |  1 +
 include/linux/nvme.h          | 43 +++++++++++++++++--
 5 files changed, 141 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index 6f2ebb5fcdb0..01b02e76e070 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -19,6 +19,7 @@ static const char * const nvme_ops[] = {
 	[nvme_cmd_resv_report] = "Reservation Report",
 	[nvme_cmd_resv_acquire] = "Reservation Acquire",
 	[nvme_cmd_resv_release] = "Reservation Release",
+	[nvme_cmd_copy] = "Copy Offload",
 	[nvme_cmd_zone_mgmt_send] = "Zone Management Send",
 	[nvme_cmd_zone_mgmt_recv] = "Zone Management Receive",
 	[nvme_cmd_zone_append] = "Zone Append",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 954f850f113a..238905f9fff6 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -795,6 +795,66 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
 	cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
+static inline blk_status_t nvme_setup_copy_offload(struct nvme_ns *ns,
+						   struct request *req,
+						   struct nvme_command *cmnd)
+{
+	struct nvme_copy_range *range = NULL;
+	struct bio *bio;
+	u64 dst_lba = 0, src_lba = 0, n_lba = 0;
+	u16 nr_range = 1, control = 0, seg = 1;
+
+	if (blk_rq_nr_phys_segments(req) != BLK_COPY_MAX_SEGMENTS)
+		return BLK_STS_IOERR;
+
+	/*
+	 * First bio contains information about destination and last bio
+	 * contains information about source.
+	 */
+	__rq_for_each_bio(bio, req) {
+		if (seg == blk_rq_nr_phys_segments(req)) {
+			src_lba = nvme_sect_to_lba(ns->head,
+							bio->bi_iter.bi_sector);
+			if (n_lba !=
+				bio->bi_iter.bi_size >> ns->head->lba_shift)
+				return BLK_STS_IOERR;
+		} else {
+			dst_lba = nvme_sect_to_lba(ns->head,
+							bio->bi_iter.bi_sector);
+			n_lba = bio->bi_iter.bi_size >> ns->head->lba_shift;
+		}
+		seg++;
+	}
+
+	if (req->cmd_flags & REQ_FUA)
+		control |= NVME_RW_FUA;
+
+	if (req->cmd_flags & REQ_FAILFAST_DEV)
+		control |= NVME_RW_LR;
+
+	memset(cmnd, 0, sizeof(*cmnd));
+	cmnd->copy.opcode = nvme_cmd_copy;
+	cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+	cmnd->copy.control = cpu_to_le16(control);
+	cmnd->copy.sdlba = cpu_to_le64(dst_lba);
+	cmnd->copy.nr_range = 0;
+
+	range = kmalloc_array(nr_range, sizeof(*range),
+			      GFP_ATOMIC | __GFP_NOWARN);
+	if (!range)
+		return BLK_STS_RESOURCE;
+
+	range[0].slba = cpu_to_le64(src_lba);
+	range[0].nlb = cpu_to_le16(n_lba - 1);
+
+	req->special_vec.bv_page = virt_to_page(range);
+	req->special_vec.bv_offset = offset_in_page(range);
+	req->special_vec.bv_len = sizeof(*range) * nr_range;
+	req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+	return BLK_STS_OK;
+}
+
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 		struct nvme_command *cmnd)
 {
@@ -1041,6 +1101,11 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
 	case REQ_OP_ZONE_APPEND:
 		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
 		break;
+	case REQ_OP_COPY_DST:
+		ret = nvme_setup_copy_offload(ns, req, cmd);
+		break;
+	case REQ_OP_COPY_SRC:
+		return BLK_STS_IOERR;
 	default:
 		WARN_ON_ONCE(1);
 		return BLK_STS_IOERR;
@@ -1218,7 +1283,7 @@ EXPORT_SYMBOL_NS_GPL(nvme_passthru_end, NVME_TARGET_PASSTHRU);
 
 /*
  * Recommended frequency for KATO commands per NVMe 1.4 section 7.12.1:
- * 
+ *
  *   The host should send Keep Alive commands at half of the Keep Alive Timeout
  *   accounting for transport roundtrip times [..].
  */
@@ -1802,6 +1867,18 @@ static void nvme_config_discard(struct nvme_ns *ns, struct queue_limits *lim)
 		lim->max_discard_segments = NVME_DSM_MAX_RANGES;
 }
 
+static void nvme_config_copy(struct nvme_ns *ns, struct nvme_id_ns *id,
+			     struct queue_limits *lim)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+
+	if (ctrl->oncs & NVME_CTRL_ONCS_COPY)
+		lim->max_copy_hw_sectors =
+			nvme_lba_to_sect(ns->head, id->mssrl);
+	else
+		lim->max_copy_hw_sectors = 0;
+}
+
 static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
 {
 	return uuid_equal(&a->uuid, &b->uuid) &&
@@ -2098,6 +2175,7 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 	if (!nvme_update_disk_info(ns, id, &lim))
 		capacity = 0;
 	nvme_config_discard(ns, &lim);
+	nvme_config_copy(ns, id, &lim);
 	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
 	    ns->head->ids.csi == NVME_CSI_ZNS)
 		nvme_update_zone_info(ns, &lim, &zi);
@@ -4833,6 +4911,7 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_download_firmware) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_format_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_dsm_cmd) != 64);
+	BUILD_BUG_ON(sizeof(struct nvme_copy_command) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_write_zeroes_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_abort_cmd) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_get_log_page_command) != 64);
diff --git a/drivers/nvme/host/trace.c b/drivers/nvme/host/trace.c
index 0288315f0050..dfc97fff886b 100644
--- a/drivers/nvme/host/trace.c
+++ b/drivers/nvme/host/trace.c
@@ -153,6 +153,23 @@ static const char *nvme_trace_read_write(struct trace_seq *p, u8 *cdw10)
 	return ret;
 }
 
+static const char *nvme_trace_copy(struct trace_seq *p, u8 *cdw10)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+	u64 sdlba = get_unaligned_le64(cdw10);
+	u8 nr_range = get_unaligned_le16(cdw10 + 8);
+	u16 control = get_unaligned_le16(cdw10 + 10);
+	u32 dsmgmt = get_unaligned_le32(cdw10 + 12);
+	u32 reftag = get_unaligned_le32(cdw10 + 16);
+
+	trace_seq_printf(p,
+		"sdlba=%llu, nr_range=%u, ctrl=0x%x, dsmgmt=%u, reftag=%u",
+		sdlba, nr_range, control, dsmgmt, reftag);
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+
 static const char *nvme_trace_dsm(struct trace_seq *p, u8 *cdw10)
 {
 	const char *ret = trace_seq_buffer_ptr(p);
@@ -340,6 +357,8 @@ const char *nvme_trace_parse_nvm_cmd(struct trace_seq *p,
 		return nvme_trace_resv_rel(p, cdw10);
 	case nvme_cmd_resv_report:
 		return nvme_trace_resv_report(p, cdw10);
+	case nvme_cmd_copy:
+		return nvme_trace_copy(p, cdw10);
 	default:
 		return nvme_trace_common(p, cdw10);
 	}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8b1edb46880a..1c5974bb23d5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1287,6 +1287,7 @@ static inline unsigned int bdev_discard_granularity(struct block_device *bdev)
 
 /* maximum copy offload length, this is set to 128MB based on current testing */
 #define BLK_COPY_MAX_BYTES		(1 << 27)
+#define BLK_COPY_MAX_SEGMENTS		2
 
 static inline unsigned int bdev_max_copy_sectors(struct block_device *bdev)
 {
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 425573202295..5275a0962a02 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -341,7 +341,7 @@ struct nvme_id_ctrl {
 	__u8			nvscc;
 	__u8			nwpc;
 	__le16			acwu;
-	__u8			rsvd534[2];
+	__le16			ocfs;
 	__le32			sgls;
 	__le32			mnan;
 	__u8			rsvd544[224];
@@ -369,6 +369,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
 	NVME_CTRL_ONCS_RESERVATIONS		= 1 << 5,
 	NVME_CTRL_ONCS_TIMESTAMP		= 1 << 6,
+	NVME_CTRL_ONCS_COPY			= 1 << 8,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
 	NVME_CTRL_OACS_NS_MNGT_SUPP		= 1 << 3,
@@ -418,7 +419,10 @@ struct nvme_id_ns {
 	__le16			npdg;
 	__le16			npda;
 	__le16			nows;
-	__u8			rsvd74[18];
+	__le16			mssrl;
+	__le32			mcl;
+	__u8			msrc;
+	__u8			rsvd91[11];
 	__le32			anagrpid;
 	__u8			rsvd96[3];
 	__u8			nsattr;
@@ -830,6 +834,7 @@ enum nvme_opcode {
 	nvme_cmd_resv_report	= 0x0e,
 	nvme_cmd_resv_acquire	= 0x11,
 	nvme_cmd_resv_release	= 0x15,
+	nvme_cmd_copy		= 0x19,
 	nvme_cmd_zone_mgmt_send	= 0x79,
 	nvme_cmd_zone_mgmt_recv	= 0x7a,
 	nvme_cmd_zone_append	= 0x7d,
@@ -853,7 +858,8 @@ enum nvme_opcode {
 		nvme_opcode_name(nvme_cmd_resv_release),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_send),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_recv),	\
-		nvme_opcode_name(nvme_cmd_zone_append))
+		nvme_opcode_name(nvme_cmd_zone_append),		\
+		nvme_opcode_name(nvme_cmd_copy))
 
 
 
@@ -1030,6 +1036,36 @@ struct nvme_dsm_range {
 	__le64			slba;
 };
 
+struct nvme_copy_command {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2;
+	__le64			metadata;
+	union nvme_data_ptr	dptr;
+	__le64			sdlba;
+	__u8			nr_range;
+	__u8			rsvd12;
+	__le16			control;
+	__le16			rsvd13;
+	__le16			dspec;
+	__le32			ilbrt;
+	__le16			lbat;
+	__le16			lbatm;
+};
+
+struct nvme_copy_range {
+	__le64			rsvd0;
+	__le64			slba;
+	__le16			nlb;
+	__le16			rsvd18;
+	__le32			rsvd20;
+	__le32			eilbrt;
+	__le16			elbat;
+	__le16			elbatm;
+};
+
 struct nvme_write_zeroes_cmd {
 	__u8			opcode;
 	__u8			flags;
@@ -1794,6 +1830,7 @@ struct nvme_command {
 		struct nvme_download_firmware dlfw;
 		struct nvme_format_cmd format;
 		struct nvme_dsm_cmd dsm;
+		struct nvme_copy_command copy;
 		struct nvme_write_zeroes_cmd write_zeroes;
 		struct nvme_zone_mgmt_send_cmd zms;
 		struct nvme_zone_mgmt_recv_cmd zmr;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v20 08/12] nvmet: add copy command support for bdev and file ns
       [not found]   ` <CGME20240520102952epcas5p18716d203d1810c38397e7fcc9a26922a@epcas5p1.samsung.com>
@ 2024-05-20 10:20     ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-20 10:20 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Nitesh Shetty, linux-block,
	linux-kernel, linux-doc, dm-devel, linux-nvme, linux-fsdevel

Add support for handling nvme_cmd_copy command on target.

For bdev-ns if backing device supports copy offload we call device copy
offload (blkdev_copy_offload).
In case of absence of device copy offload capability, we use copy emulation
(blkdev_copy_emulation)

For file-ns we call vfs_copy_file_range to service our request.

Currently target always shows copy capability by setting
NVME_CTRL_ONCS_COPY in controller ONCS.

loop target has copy support, which can be used to test copy offload.
trace event support for nvme_cmd_copy.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 drivers/nvme/target/admin-cmd.c   |  9 +++-
 drivers/nvme/target/io-cmd-bdev.c | 71 +++++++++++++++++++++++++++++++
 drivers/nvme/target/io-cmd-file.c | 50 ++++++++++++++++++++++
 drivers/nvme/target/nvmet.h       |  1 +
 drivers/nvme/target/trace.c       | 19 +++++++++
 5 files changed, 148 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index f5b7054a4a05..59021552084f 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -433,8 +433,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
 	id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
 	id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
 	id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
-			NVME_CTRL_ONCS_WRITE_ZEROES);
-
+			NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
 	/* XXX: don't report vwc if the underlying device is write through */
 	id->vwc = NVME_CTRL_VWC_PRESENT;
 
@@ -536,6 +535,12 @@ static void nvmet_execute_identify_ns(struct nvmet_req *req)
 
 	if (req->ns->bdev)
 		nvmet_bdev_set_limits(req->ns->bdev, id);
+	else {
+		id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+		id->mssrl = cpu_to_le16(BIO_MAX_VECS <<
+					(PAGE_SHIFT - SECTOR_SHIFT));
+		id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl));
+	}
 
 	/*
 	 * We just provide a single LBA format that matches what the
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 6426aac2634a..4c2c1784fec9 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,18 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
 	id->npda = id->npdg;
 	/* NOWS = Namespace Optimal Write Size */
 	id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+	if (bdev_max_copy_sectors(bdev)) {
+		id->msrc = id->msrc;
+		id->mssrl = cpu_to_le16((bdev_max_copy_sectors(bdev) <<
+				SECTOR_SHIFT) /	bdev_logical_block_size(bdev));
+		id->mcl = cpu_to_le32((__force u32)id->mssrl);
+	} else {
+		id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+		id->mssrl = cpu_to_le16((BIO_MAX_VECS << PAGE_SHIFT) /
+					bdev_logical_block_size(bdev));
+		id->mcl = cpu_to_le32((__force u32)id->mssrl);
+	}
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -451,6 +463,61 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req)
 	}
 }
 
+static void nvmet_bdev_copy_endio(void *private, int status,
+					    ssize_t copied)
+{
+	struct nvmet_req *rq = (struct nvmet_req *)private;
+	u16 nvme_status;
+
+	if (copied == rq->copy_len)
+		rq->cqe->result.u32 = cpu_to_le32(1);
+	else
+		rq->cqe->result.u32 = cpu_to_le32(0);
+
+	nvme_status = errno_to_nvme_status(rq, status);
+	nvmet_req_complete(rq, nvme_status);
+}
+
+/*
+ * At present we handle only one range entry, since copy offload is aligned with
+ * copy_file_range, only one entry is passed from block layer.
+ */
+static void nvmet_bdev_execute_copy(struct nvmet_req *rq)
+{
+	struct nvme_copy_range range;
+	struct nvme_command *cmd = rq->cmd;
+	ssize_t ret;
+	off_t dst, src;
+
+	u16 status;
+
+	status = nvmet_copy_from_sgl(rq, 0, &range, sizeof(range));
+	if (status)
+		goto err_rq_complete;
+
+	dst = le64_to_cpu(cmd->copy.sdlba) << rq->ns->blksize_shift;
+	src = le64_to_cpu(range.slba) << rq->ns->blksize_shift;
+	rq->copy_len = ((__force size_t)range.nlb + 1) << rq->ns->blksize_shift;
+
+	if (bdev_max_copy_sectors(rq->ns->bdev)) {
+		ret = blkdev_copy_offload(rq->ns->bdev, dst, src, rq->copy_len,
+					  nvmet_bdev_copy_endio,
+					  (void *)rq, GFP_KERNEL);
+	} else {
+		ret = blkdev_copy_emulation(rq->ns->bdev, dst,
+					    rq->ns->bdev, src, rq->copy_len,
+					    nvmet_bdev_copy_endio,
+					    (void *)rq, GFP_KERNEL);
+	}
+	if (ret == -EIOCBQUEUED)
+		return;
+
+	rq->cqe->result.u32 = cpu_to_le32(0);
+	status = errno_to_nvme_status(rq, ret);
+err_rq_complete:
+	nvmet_req_complete(rq, status);
+}
+
 u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 {
 	switch (req->cmd->common.opcode) {
@@ -469,6 +536,10 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 	case nvme_cmd_write_zeroes:
 		req->execute = nvmet_bdev_execute_write_zeroes;
 		return 0;
+	case nvme_cmd_copy:
+		req->execute = nvmet_bdev_execute_copy;
+		return 0;
+
 	default:
 		return nvmet_report_invalid_opcode(req);
 	}
diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index 2d068439b129..0a8337596f0c 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -322,6 +322,47 @@ static void nvmet_file_dsm_work(struct work_struct *w)
 	}
 }
 
+static void nvmet_file_copy_work(struct work_struct *w)
+{
+	struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
+	int nr_range = req->cmd->copy.nr_range + 1;
+	u16 status = 0;
+	int src, id;
+	ssize_t len, ret;
+	loff_t pos;
+
+	pos = le64_to_cpu(req->cmd->copy.sdlba) << req->ns->blksize_shift;
+	if (unlikely(pos + req->transfer_len > req->ns->size)) {
+		nvmet_req_complete(req, errno_to_nvme_status(req, -ENOSPC));
+		return;
+	}
+
+	for (id = 0 ; id < nr_range; id++) {
+		struct nvme_copy_range range;
+
+		status = nvmet_copy_from_sgl(req, id * sizeof(range), &range,
+					     sizeof(range));
+		if (status)
+			break;
+
+		src = (le64_to_cpu(range.slba) << (req->ns->blksize_shift));
+		len = (le16_to_cpu(range.nlb) + 1) << (req->ns->blksize_shift);
+		ret = vfs_copy_file_range(req->ns->file, src, req->ns->file,
+					  pos, len, COPY_FILE_SPLICE);
+		pos += ret;
+		if (ret != len) {
+			req->cqe->result.u32 = cpu_to_le32(id);
+			if (ret < 0)
+				status = errno_to_nvme_status(req, ret);
+			else
+				status = errno_to_nvme_status(req, -EIO);
+			break;
+		}
+	}
+
+	nvmet_req_complete(req, status);
+}
+
 static void nvmet_file_execute_dsm(struct nvmet_req *req)
 {
 	if (!nvmet_check_data_len_lte(req, nvmet_dsm_len(req)))
@@ -330,6 +371,12 @@ static void nvmet_file_execute_dsm(struct nvmet_req *req)
 	queue_work(nvmet_wq, &req->f.work);
 }
 
+static void nvmet_file_execute_copy(struct nvmet_req *req)
+{
+	INIT_WORK(&req->f.work, nvmet_file_copy_work);
+	queue_work(nvmet_wq, &req->f.work);
+}
+
 static void nvmet_file_write_zeroes_work(struct work_struct *w)
 {
 	struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
@@ -376,6 +423,9 @@ u16 nvmet_file_parse_io_cmd(struct nvmet_req *req)
 	case nvme_cmd_write_zeroes:
 		req->execute = nvmet_file_execute_write_zeroes;
 		return 0;
+	case nvme_cmd_copy:
+		req->execute = nvmet_file_execute_copy;
+		return 0;
 	default:
 		return nvmet_report_invalid_opcode(req);
 	}
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 2f22b07eab29..480e4ffa01fd 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -406,6 +406,7 @@ struct nvmet_req {
 	struct device		*p2p_client;
 	u16			error_loc;
 	u64			error_slba;
+	size_t			copy_len;
 };
 
 #define NVMET_MAX_MPOOL_BVEC		16
diff --git a/drivers/nvme/target/trace.c b/drivers/nvme/target/trace.c
index 8d1806a82887..8be6f77ebfd8 100644
--- a/drivers/nvme/target/trace.c
+++ b/drivers/nvme/target/trace.c
@@ -92,6 +92,23 @@ static const char *nvmet_trace_dsm(struct trace_seq *p, u8 *cdw10)
 	return ret;
 }
 
+static const char *nvmet_trace_copy(struct trace_seq *p, u8 *cdw10)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+	u64 sdlba = get_unaligned_le64(cdw10);
+	u8 nr_range = get_unaligned_le16(cdw10 + 8);
+	u16 control = get_unaligned_le16(cdw10 + 10);
+	u32 dsmgmt = get_unaligned_le32(cdw10 + 12);
+	u32 reftag = get_unaligned_le32(cdw10 +  16);
+
+	trace_seq_printf(p,
+		"sdlba=%llu, nr_range=%u, ctrl=1x%x, dsmgmt=%u, reftag=%u",
+		sdlba, nr_range, control, dsmgmt, reftag);
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+
 static const char *nvmet_trace_common(struct trace_seq *p, u8 *cdw10)
 {
 	const char *ret = trace_seq_buffer_ptr(p);
@@ -195,6 +212,8 @@ const char *nvmet_trace_parse_nvm_cmd(struct trace_seq *p,
 		return nvmet_trace_zone_mgmt_send(p, cdw10);
 	case nvme_cmd_zone_mgmt_recv:
 		return nvmet_trace_zone_mgmt_recv(p, cdw10);
+	case nvme_cmd_copy:
+		return nvmet_trace_copy(p, cdw10);
 	default:
 		return nvmet_trace_common(p, cdw10);
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v20 09/12] dm: Add support for copy offload
       [not found]   ` <CGME20240520103004epcas5p4a18f3f6ba0f218d57b0ab4bb84c6ff18@epcas5p4.samsung.com>
@ 2024-05-20 10:20     ` Nitesh Shetty
  2024-05-21  7:11       ` Hannes Reinecke
  0 siblings, 1 reply; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-20 10:20 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Nitesh Shetty, linux-block,
	linux-kernel, linux-doc, dm-devel, linux-nvme, linux-fsdevel

Before enabling copy for dm target, check if underlying devices and
dm target support copy. Avoid split happening inside dm target.
Fail early if the request needs split, currently splitting copy
request is not supported.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 drivers/md/dm-table.c         | 37 +++++++++++++++++++++++++++++++++++
 drivers/md/dm.c               |  7 +++++++
 include/linux/device-mapper.h |  3 +++
 3 files changed, 47 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index cc66a27c363a..d58c67ecd794 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1899,6 +1899,38 @@ static bool dm_table_supports_nowait(struct dm_table *t)
 	return true;
 }
 
+static int device_not_copy_capable(struct dm_target *ti, struct dm_dev *dev,
+				   sector_t start, sector_t len, void *data)
+{
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+
+	return !q->limits.max_copy_sectors;
+}
+
+static bool dm_table_supports_copy(struct dm_table *t)
+{
+	struct dm_target *ti;
+	unsigned int i;
+
+	for (i = 0; i < t->num_targets; i++) {
+		ti = dm_table_get_target(t, i);
+
+		if (!ti->copy_offload_supported)
+			return false;
+
+		/*
+		 * target provides copy support (as implied by setting
+		 * 'copy_offload_supported')
+		 * and it relies on _all_ data devices having copy support.
+		 */
+		if (!ti->type->iterate_devices ||
+		    ti->type->iterate_devices(ti, device_not_copy_capable, NULL))
+			return false;
+	}
+
+	return true;
+}
+
 static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
 				      sector_t start, sector_t len, void *data)
 {
@@ -1975,6 +2007,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 		limits->discard_misaligned = 0;
 	}
 
+	if (!dm_table_supports_copy(t)) {
+		limits->max_copy_sectors = 0;
+		limits->max_copy_hw_sectors = 0;
+	}
+
 	if (!dm_table_supports_write_zeroes(t))
 		limits->max_write_zeroes_sectors = 0;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 597dd7a25823..070b41b83a97 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1717,6 +1717,13 @@ static blk_status_t __split_and_process_bio(struct clone_info *ci)
 	if (unlikely(ci->is_abnormal_io))
 		return __process_abnormal_io(ci, ti);
 
+	if ((unlikely(op_is_copy(ci->bio->bi_opf)) &&
+	    max_io_len(ti, ci->sector) < ci->sector_count)) {
+		DMERR("Error, IO size(%u) > max target size(%llu)\n",
+		      ci->sector_count, max_io_len(ti, ci->sector));
+		return BLK_STS_IOERR;
+	}
+
 	/*
 	 * Only support bio polling for normal IO, and the target io is
 	 * exactly inside the dm_io instance (verified in dm_poll_dm_io)
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 82b2195efaca..6868941bc7d9 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -397,6 +397,9 @@ struct dm_target {
 	 * bio_set_dev(). NOTE: ideally a target should _not_ need this.
 	 */
 	bool needs_bio_set_dev:1;
+
+	/* copy offload is supported */
+	bool copy_offload_supported:1;
 };
 
 void *dm_per_bio_data(struct bio *bio, size_t data_size);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v20 10/12] dm: Enable copy offload for dm-linear target
       [not found]   ` <CGME20240520103016epcas5p31b9a0f3637959626d49763609ebda6ef@epcas5p3.samsung.com>
@ 2024-05-20 10:20     ` Nitesh Shetty
  2024-05-20 23:25       ` Bart Van Assche
  0 siblings, 1 reply; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-20 10:20 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Nitesh Shetty, linux-block,
	linux-kernel, linux-doc, dm-devel, linux-nvme, linux-fsdevel

Setting copy_offload_supported flag to enable offload.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 drivers/md/dm-linear.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 2d3e186ca87e..cfec2fac28e1 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -62,6 +62,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	ti->num_discard_bios = 1;
 	ti->num_secure_erase_bios = 1;
 	ti->num_write_zeroes_bios = 1;
+	ti->copy_offload_supported = 1;
 	ti->private = lc;
 	return 0;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v20 11/12] null: Enable trace capability for null block
       [not found]   ` <CGME20240520103027epcas5p4789defe8ab3bff23bd2abcf019689fa2@epcas5p4.samsung.com>
@ 2024-05-20 10:20     ` Nitesh Shetty
  2024-05-20 23:27       ` Bart Van Assche
  2024-06-01  6:23       ` Christoph Hellwig
  0 siblings, 2 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-20 10:20 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Nitesh Shetty, linux-block,
	linux-kernel, linux-doc, dm-devel, linux-nvme, linux-fsdevel

This is a prep patch to enable copy trace capability.
At present only zoned null_block is using trace, so we decoupled trace
and zoned dependency to make it usable in null_blk driver also.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 drivers/block/null_blk/Makefile | 2 --
 drivers/block/null_blk/main.c   | 3 +++
 drivers/block/null_blk/trace.h  | 2 ++
 drivers/block/null_blk/zoned.c  | 1 -
 4 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/block/null_blk/Makefile b/drivers/block/null_blk/Makefile
index 84c36e512ab8..672adcf0ad24 100644
--- a/drivers/block/null_blk/Makefile
+++ b/drivers/block/null_blk/Makefile
@@ -5,7 +5,5 @@ ccflags-y			+= -I$(src)
 
 obj-$(CONFIG_BLK_DEV_NULL_BLK)	+= null_blk.o
 null_blk-objs			:= main.o
-ifeq ($(CONFIG_BLK_DEV_ZONED), y)
 null_blk-$(CONFIG_TRACING) 	+= trace.o
-endif
 null_blk-$(CONFIG_BLK_DEV_ZONED) += zoned.o
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 5d56ad4ce01a..b33b9ebfebd2 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -11,6 +11,9 @@
 #include <linux/init.h>
 #include "null_blk.h"
 
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
 #undef pr_fmt
 #define pr_fmt(fmt)	"null_blk: " fmt
 
diff --git a/drivers/block/null_blk/trace.h b/drivers/block/null_blk/trace.h
index 82b8f6a5e5f0..f9eadac6b22f 100644
--- a/drivers/block/null_blk/trace.h
+++ b/drivers/block/null_blk/trace.h
@@ -30,6 +30,7 @@ static inline void __assign_disk_name(char *name, struct gendisk *disk)
 }
 #endif
 
+#ifdef CONFIG_BLK_DEV_ZONED
 TRACE_EVENT(nullb_zone_op,
 	    TP_PROTO(struct nullb_cmd *cmd, unsigned int zone_no,
 		     unsigned int zone_cond),
@@ -73,6 +74,7 @@ TRACE_EVENT(nullb_report_zones,
 	    TP_printk("%s nr_zones=%u",
 		      __print_disk_name(__entry->disk), __entry->nr_zones)
 );
+#endif /* CONFIG_BLK_DEV_ZONED */
 
 #endif /* _TRACE_NULLB_H */
 
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 5b5a63adacc1..47470a732d21 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -3,7 +3,6 @@
 #include <linux/bitmap.h>
 #include "null_blk.h"
 
-#define CREATE_TRACE_POINTS
 #include "trace.h"
 
 #undef pr_fmt
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v20 12/12] null_blk: add support for copy offload
       [not found]   ` <CGME20240520103039epcas5p4373f7234162a32222ac225b976ae30ce@epcas5p4.samsung.com>
@ 2024-05-20 10:20     ` Nitesh Shetty
  2024-05-20 23:42       ` Bart Van Assche
  0 siblings, 1 reply; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-20 10:20 UTC (permalink / raw)
  To: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Nitesh Shetty, Vincent Fu,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

Implementation is based on existing read and write infrastructure.
copy_max_bytes: A new configfs and module parameter is introduced, which
can be used to set hardware/driver supported maximum copy limit.
Only request based queue mode will support for copy offload.
Added tracefs support to copy IO tracing.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
---
 Documentation/block/null_blk.rst  |   5 ++
 drivers/block/null_blk/main.c     | 102 +++++++++++++++++++++++++++++-
 drivers/block/null_blk/null_blk.h |   1 +
 drivers/block/null_blk/trace.h    |  23 +++++++
 4 files changed, 128 insertions(+), 3 deletions(-)

diff --git a/Documentation/block/null_blk.rst b/Documentation/block/null_blk.rst
index 4dd78f24d10a..6153e02fcf13 100644
--- a/Documentation/block/null_blk.rst
+++ b/Documentation/block/null_blk.rst
@@ -149,3 +149,8 @@ zone_size=[MB]: Default: 256
 zone_nr_conv=[nr_conv]: Default: 0
   The number of conventional zones to create when block device is zoned.  If
   zone_nr_conv >= nr_zones, it will be reduced to nr_zones - 1.
+
+copy_max_bytes=[size in bytes]: Default: COPY_MAX_BYTES
+  A module and configfs parameter which can be used to set hardware/driver
+  supported maximum copy offload limit.
+  COPY_MAX_BYTES(=128MB at present) is defined in fs.h
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index b33b9ebfebd2..dcfbd5275414 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -172,6 +172,10 @@ static int g_max_sectors;
 module_param_named(max_sectors, g_max_sectors, int, 0444);
 MODULE_PARM_DESC(max_sectors, "Maximum size of a command (in 512B sectors)");
 
+static unsigned long g_copy_max_bytes = BLK_COPY_MAX_BYTES;
+module_param_named(copy_max_bytes, g_copy_max_bytes, ulong, 0444);
+MODULE_PARM_DESC(copy_max_bytes, "Maximum size of a copy command (in bytes)");
+
 static unsigned int nr_devices = 1;
 module_param(nr_devices, uint, 0444);
 MODULE_PARM_DESC(nr_devices, "Number of devices to register");
@@ -433,6 +437,7 @@ NULLB_DEVICE_ATTR(home_node, uint, NULL);
 NULLB_DEVICE_ATTR(queue_mode, uint, NULL);
 NULLB_DEVICE_ATTR(blocksize, uint, NULL);
 NULLB_DEVICE_ATTR(max_sectors, uint, NULL);
+NULLB_DEVICE_ATTR(copy_max_bytes, uint, NULL);
 NULLB_DEVICE_ATTR(irqmode, uint, NULL);
 NULLB_DEVICE_ATTR(hw_queue_depth, uint, NULL);
 NULLB_DEVICE_ATTR(index, uint, NULL);
@@ -577,6 +582,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
 	&nullb_device_attr_queue_mode,
 	&nullb_device_attr_blocksize,
 	&nullb_device_attr_max_sectors,
+	&nullb_device_attr_copy_max_bytes,
 	&nullb_device_attr_irqmode,
 	&nullb_device_attr_hw_queue_depth,
 	&nullb_device_attr_index,
@@ -687,7 +693,7 @@ static ssize_t memb_group_features_show(struct config_item *item, char *page)
 			"shared_tags,size,submit_queues,use_per_node_hctx,"
 			"virt_boundary,zoned,zone_capacity,zone_max_active,"
 			"zone_max_open,zone_nr_conv,zone_offline,zone_readonly,"
-			"zone_size,zone_append_max_sectors\n");
+			"zone_size,zone_append_max_sectors,copy_max_bytes\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -753,6 +759,7 @@ static struct nullb_device *null_alloc_dev(void)
 	dev->queue_mode = g_queue_mode;
 	dev->blocksize = g_bs;
 	dev->max_sectors = g_max_sectors;
+	dev->copy_max_bytes = g_copy_max_bytes;
 	dev->irqmode = g_irqmode;
 	dev->hw_queue_depth = g_hw_queue_depth;
 	dev->blocking = g_blocking;
@@ -1221,6 +1228,81 @@ static int null_transfer(struct nullb *nullb, struct page *page,
 	return err;
 }
 
+static inline int nullb_setup_copy(struct nullb *nullb, struct request *req,
+				   bool is_fua)
+{
+	sector_t sector_in = 0, sector_out = 0;
+	loff_t offset_in, offset_out;
+	void *in, *out;
+	ssize_t chunk, rem = 0;
+	struct bio *bio;
+	struct nullb_page *t_page_in, *t_page_out;
+	u16 seg = 1;
+	int status = -EIO;
+
+	if (blk_rq_nr_phys_segments(req) != BLK_COPY_MAX_SEGMENTS)
+		return status;
+
+	/*
+	 * First bio contains information about destination and last bio
+	 * contains information about source.
+	 */
+	__rq_for_each_bio(bio, req) {
+		if (seg == blk_rq_nr_phys_segments(req)) {
+			sector_in = bio->bi_iter.bi_sector;
+			if (rem != bio->bi_iter.bi_size)
+				return status;
+		} else {
+			sector_out = bio->bi_iter.bi_sector;
+			rem = bio->bi_iter.bi_size;
+		}
+		seg++;
+	}
+
+	trace_nullb_copy_op(req, sector_out << SECTOR_SHIFT,
+			    sector_in << SECTOR_SHIFT, rem);
+
+	spin_lock_irq(&nullb->lock);
+	while (rem > 0) {
+		chunk = min_t(size_t, nullb->dev->blocksize, rem);
+		offset_in = (sector_in & SECTOR_MASK) << SECTOR_SHIFT;
+		offset_out = (sector_out & SECTOR_MASK) << SECTOR_SHIFT;
+
+		if (null_cache_active(nullb) && !is_fua)
+			null_make_cache_space(nullb, PAGE_SIZE);
+
+		t_page_in = null_lookup_page(nullb, sector_in, false,
+					     !null_cache_active(nullb));
+		if (!t_page_in)
+			goto err;
+		t_page_out = null_insert_page(nullb, sector_out,
+					      !null_cache_active(nullb) ||
+					      is_fua);
+		if (!t_page_out)
+			goto err;
+
+		in = kmap_local_page(t_page_in->page);
+		out = kmap_local_page(t_page_out->page);
+
+		memcpy(out + offset_out, in + offset_in, chunk);
+		kunmap_local(out);
+		kunmap_local(in);
+		__set_bit(sector_out & SECTOR_MASK, t_page_out->bitmap);
+
+		if (is_fua)
+			null_free_sector(nullb, sector_out, true);
+
+		rem -= chunk;
+		sector_in += chunk >> SECTOR_SHIFT;
+		sector_out += chunk >> SECTOR_SHIFT;
+	}
+
+	status = 0;
+err:
+	spin_unlock_irq(&nullb->lock);
+	return status;
+}
+
 static blk_status_t null_handle_rq(struct nullb_cmd *cmd)
 {
 	struct request *rq = blk_mq_rq_from_pdu(cmd);
@@ -1230,13 +1312,16 @@ static blk_status_t null_handle_rq(struct nullb_cmd *cmd)
 	sector_t sector = blk_rq_pos(rq);
 	struct req_iterator iter;
 	struct bio_vec bvec;
+	bool fua = rq->cmd_flags & REQ_FUA;
+
+	if (op_is_copy(req_op(rq)))
+		return nullb_setup_copy(nullb, rq, fua);
 
 	spin_lock_irq(&nullb->lock);
 	rq_for_each_segment(bvec, rq, iter) {
 		len = bvec.bv_len;
 		err = null_transfer(nullb, bvec.bv_page, len, bvec.bv_offset,
-				     op_is_write(req_op(rq)), sector,
-				     rq->cmd_flags & REQ_FUA);
+				     op_is_write(req_op(rq)), sector, fua);
 		if (err)
 			break;
 		sector += len >> SECTOR_SHIFT;
@@ -1721,6 +1806,12 @@ static void null_config_discard(struct nullb *nullb, struct queue_limits *lim)
 	lim->max_hw_discard_sectors = UINT_MAX >> 9;
 }
 
+static void null_config_copy(struct nullb *nullb, struct queue_limits *lim)
+{
+	lim->max_copy_hw_sectors = nullb->dev->copy_max_bytes >> SECTOR_SHIFT;
+	lim->max_copy_sectors = nullb->dev->copy_max_bytes >> SECTOR_SHIFT;
+}
+
 static const struct block_device_operations null_ops = {
 	.owner		= THIS_MODULE,
 	.report_zones	= null_report_zones,
@@ -1843,6 +1934,9 @@ static int null_validate_conf(struct nullb_device *dev)
 		return -EINVAL;
 	}
 
+	if (dev->queue_mode == NULL_Q_BIO)
+		dev->copy_max_bytes = 0;
+
 	return 0;
 }
 
@@ -1909,6 +2003,8 @@ static int null_add_dev(struct nullb_device *dev)
 	if (dev->virt_boundary)
 		lim.virt_boundary_mask = PAGE_SIZE - 1;
 	null_config_discard(nullb, &lim);
+	null_config_copy(nullb, &lim);
+
 	if (dev->zoned) {
 		rv = null_init_zoned_dev(dev, &lim);
 		if (rv)
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 3234e6c85eed..c588729c17bd 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -91,6 +91,7 @@ struct nullb_device {
 	unsigned int queue_mode; /* block interface */
 	unsigned int blocksize; /* block size */
 	unsigned int max_sectors; /* Max sectors per command */
+	unsigned long copy_max_bytes; /* Max copy offload length in bytes */
 	unsigned int irqmode; /* IRQ completion handler */
 	unsigned int hw_queue_depth; /* queue depth */
 	unsigned int index; /* index of the disk, only valid with a disk */
diff --git a/drivers/block/null_blk/trace.h b/drivers/block/null_blk/trace.h
index f9eadac6b22f..cda1a2249978 100644
--- a/drivers/block/null_blk/trace.h
+++ b/drivers/block/null_blk/trace.h
@@ -76,6 +76,29 @@ TRACE_EVENT(nullb_report_zones,
 );
 #endif /* CONFIG_BLK_DEV_ZONED */
 
+TRACE_EVENT(nullb_copy_op,
+		TP_PROTO(struct request *req,
+			 sector_t dst, sector_t src, size_t len),
+		TP_ARGS(req, dst, src, len),
+		TP_STRUCT__entry(
+				 __array(char, disk, DISK_NAME_LEN)
+				 __field(enum req_op, op)
+				 __field(sector_t, dst)
+				 __field(sector_t, src)
+				 __field(size_t, len)
+		),
+		TP_fast_assign(
+			       __entry->op = req_op(req);
+			       __assign_disk_name(__entry->disk, req->q->disk);
+			       __entry->dst = dst;
+			       __entry->src = src;
+			       __entry->len = len;
+		),
+		TP_printk("%s req=%-15s: dst=%llu, src=%llu, len=%lu",
+			  __print_disk_name(__entry->disk),
+			  blk_op_str(__entry->op),
+			  __entry->dst, __entry->src, __entry->len)
+);
 #endif /* _TRACE_NULLB_H */
 
 #undef TRACE_INCLUDE_PATH
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support
  2024-05-20 10:20     ` [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support Nitesh Shetty
@ 2024-05-20 14:33       ` Damien Le Moal
  2024-05-21  8:15         ` Nitesh Shetty
  2024-05-20 22:42       ` Bart Van Assche
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 89+ messages in thread
From: Damien Le Moal @ 2024-05-20 14:33 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

On 2024/05/20 12:20, Nitesh Shetty wrote:
> Add device limits as sysfs entries,
> 	- copy_max_bytes (RW)
> 	- copy_max_hw_bytes (RO)
> 
> Above limits help to split the copy payload in block layer.
> copy_max_bytes: maximum total length of copy in single payload.
> copy_max_hw_bytes: Reflects the device supported maximum limit.
> 
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> ---
>  Documentation/ABI/stable/sysfs-block | 23 +++++++++++++++
>  block/blk-settings.c                 | 34 ++++++++++++++++++++--
>  block/blk-sysfs.c                    | 43 ++++++++++++++++++++++++++++
>  include/linux/blkdev.h               | 14 +++++++++
>  4 files changed, 112 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> index 831f19a32e08..52d8a253bf8e 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -165,6 +165,29 @@ Description:
>  		last zone of the device which may be smaller.
>  
>  
> +What:		/sys/block/<disk>/queue/copy_max_bytes
> +Date:		May 2024
> +Contact:	linux-block@vger.kernel.org
> +Description:
> +		[RW] This is the maximum number of bytes that the block layer
> +		will allow for a copy request. This is always smaller or
> +		equal to the maximum size allowed by the hardware, indicated by
> +		'copy_max_hw_bytes'. An attempt to set a value higher than
> +		'copy_max_hw_bytes' will truncate this to 'copy_max_hw_bytes'.
> +		Writing '0' to this file will disable offloading copies for this
> +		device, instead copy is done via emulation.
> +
> +
> +What:		/sys/block/<disk>/queue/copy_max_hw_bytes
> +Date:		May 2024
> +Contact:	linux-block@vger.kernel.org
> +Description:
> +		[RO] This is the maximum number of bytes that the hardware
> +		will allow for single data copy request.
> +		A value of 0 means that the device does not support
> +		copy offload.
> +
> +
>  What:		/sys/block/<disk>/queue/crypto/
>  Date:		February 2022
>  Contact:	linux-block@vger.kernel.org
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index a7fe8e90240a..67010ed82422 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -52,6 +52,9 @@ void blk_set_stacking_limits(struct queue_limits *lim)
>  	lim->max_write_zeroes_sectors = UINT_MAX;
>  	lim->max_zone_append_sectors = UINT_MAX;
>  	lim->max_user_discard_sectors = UINT_MAX;
> +	lim->max_copy_hw_sectors = UINT_MAX;
> +	lim->max_copy_sectors = UINT_MAX;
> +	lim->max_user_copy_sectors = UINT_MAX;
>  }
>  EXPORT_SYMBOL(blk_set_stacking_limits);
>  
> @@ -219,6 +222,9 @@ static int blk_validate_limits(struct queue_limits *lim)
>  		lim->misaligned = 0;
>  	}
>  
> +	lim->max_copy_sectors =
> +		min(lim->max_copy_hw_sectors, lim->max_user_copy_sectors);
> +
>  	return blk_validate_zoned_limits(lim);
>  }
>  
> @@ -231,10 +237,11 @@ int blk_set_default_limits(struct queue_limits *lim)
>  {
>  	/*
>  	 * Most defaults are set by capping the bounds in blk_validate_limits,
> -	 * but max_user_discard_sectors is special and needs an explicit
> -	 * initialization to the max value here.
> +	 * but max_user_discard_sectors and max_user_copy_sectors are special
> +	 * and needs an explicit initialization to the max value here.

s/needs/need

>  	 */
>  	lim->max_user_discard_sectors = UINT_MAX;
> +	lim->max_user_copy_sectors = UINT_MAX;
>  	return blk_validate_limits(lim);
>  }
>  
> @@ -316,6 +323,25 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
>  }
>  EXPORT_SYMBOL(blk_queue_max_discard_sectors);
>  
> +/*
> + * blk_queue_max_copy_hw_sectors - set max sectors for a single copy payload
> + * @q:	the request queue for the device
> + * @max_copy_sectors: maximum number of sectors to copy
> + */
> +void blk_queue_max_copy_hw_sectors(struct request_queue *q,
> +				   unsigned int max_copy_sectors)
> +{
> +	struct queue_limits *lim = &q->limits;
> +
> +	if (max_copy_sectors > (BLK_COPY_MAX_BYTES >> SECTOR_SHIFT))
> +		max_copy_sectors = BLK_COPY_MAX_BYTES >> SECTOR_SHIFT;
> +
> +	lim->max_copy_hw_sectors = max_copy_sectors;
> +	lim->max_copy_sectors =
> +		min(max_copy_sectors, lim->max_user_copy_sectors);
> +}
> +EXPORT_SYMBOL_GPL(blk_queue_max_copy_hw_sectors);

Hmm... Such helper seems to not fit with Christoph's changes of the limits
initialization as that is not necessarily done using &q->limits but depending on
the driver, a different limit structure. So shouldn't this function be passed a
queue_limits struct pointer instead of the request queue pointer ?

> +
>  /**
>   * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
>   * @q:  the request queue for the device
> @@ -633,6 +659,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  	t->max_segment_size = min_not_zero(t->max_segment_size,
>  					   b->max_segment_size);
>  
> +	t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
> +	t->max_copy_hw_sectors = min(t->max_copy_hw_sectors,
> +				     b->max_copy_hw_sectors);
> +
>  	t->misaligned |= b->misaligned;
>  
>  	alignment = queue_limit_alignment_offset(b, start);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index f0f9314ab65c..805c2b6b0393 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -205,6 +205,44 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
>  	return queue_var_show(0, page);
>  }
>  
> +static ssize_t queue_copy_hw_max_show(struct request_queue *q, char *page)
> +{
> +	return sprintf(page, "%llu\n", (unsigned long long)
> +		       q->limits.max_copy_hw_sectors << SECTOR_SHIFT);
> +}
> +
> +static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
> +{
> +	return sprintf(page, "%llu\n", (unsigned long long)
> +		       q->limits.max_copy_sectors << SECTOR_SHIFT);
> +}

Given that you repeat the same pattern twice, may be add a queue_var64_show()
helper ? (naming can be changed).

> +
> +static ssize_t queue_copy_max_store(struct request_queue *q, const char *page,
> +				    size_t count)
> +{
> +	unsigned long max_copy_bytes;
> +	struct queue_limits lim;
> +	ssize_t ret;
> +	int err;
> +
> +	ret = queue_var_store(&max_copy_bytes, page, count);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (max_copy_bytes & (queue_logical_block_size(q) - 1))
> +		return -EINVAL;
> +
> +	blk_mq_freeze_queue(q);
> +	lim = queue_limits_start_update(q);
> +	lim.max_user_copy_sectors = max_copy_bytes >> SECTOR_SHIFT;

max_copy_bytes is an unsigned long, so 64 bits on 64-bit arch and
max_user_copy_sectors is an unsigned int, so 32-bits. There are thus no
guarantees that this will not overflow. A check is needed.

> +	err = queue_limits_commit_update(q, &lim);
> +	blk_mq_unfreeze_queue(q);
> +
> +	if (err)

You can reuse ret here. No need for adding the err variable.

> +		return err;
> +	return count;
> +}
> +
>  static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
>  {
>  	return queue_var_show(0, page);
> @@ -505,6 +543,9 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
>  QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
>  QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
>  
> +QUEUE_RO_ENTRY(queue_copy_hw_max, "copy_max_hw_bytes");
> +QUEUE_RW_ENTRY(queue_copy_max, "copy_max_bytes");
> +
>  QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
>  QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
>  QUEUE_RW_ENTRY(queue_poll, "io_poll");
> @@ -618,6 +659,8 @@ static struct attribute *queue_attrs[] = {
>  	&queue_discard_max_entry.attr,
>  	&queue_discard_max_hw_entry.attr,
>  	&queue_discard_zeroes_data_entry.attr,
> +	&queue_copy_hw_max_entry.attr,
> +	&queue_copy_max_entry.attr,
>  	&queue_write_same_max_entry.attr,
>  	&queue_write_zeroes_max_entry.attr,
>  	&queue_zone_append_max_entry.attr,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index aefdda9f4ec7..109d9f905c3c 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -309,6 +309,10 @@ struct queue_limits {
>  	unsigned int		discard_alignment;
>  	unsigned int		zone_write_granularity;
>  
> +	unsigned int		max_copy_hw_sectors;
> +	unsigned int		max_copy_sectors;
> +	unsigned int		max_user_copy_sectors;
> +
>  	unsigned short		max_segments;
>  	unsigned short		max_integrity_segments;
>  	unsigned short		max_discard_segments;
> @@ -933,6 +937,8 @@ void blk_queue_max_secure_erase_sectors(struct request_queue *q,
>  		unsigned int max_sectors);
>  extern void blk_queue_max_discard_sectors(struct request_queue *q,
>  		unsigned int max_discard_sectors);
> +extern void blk_queue_max_copy_hw_sectors(struct request_queue *q,
> +					  unsigned int max_copy_sectors);
>  extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
>  		unsigned int max_write_same_sectors);
>  extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
> @@ -1271,6 +1277,14 @@ static inline unsigned int bdev_discard_granularity(struct block_device *bdev)
>  	return bdev_get_queue(bdev)->limits.discard_granularity;
>  }
>  
> +/* maximum copy offload length, this is set to 128MB based on current testing */

Current testing will not be current in a while... So may be simply say
"arbitrary" or something. Also please capitalize the first letter of the
comment. So something like:

/* Arbitrary absolute limit of 128 MB for copy offload. */

> +#define BLK_COPY_MAX_BYTES		(1 << 27)

Also, it is not clear from the name if this is a soft limit or a cap on the
hardware limit... So at least please adjust the comment to say which one it is.

> +
> +static inline unsigned int bdev_max_copy_sectors(struct block_device *bdev)
> +{
> +	return bdev_get_queue(bdev)->limits.max_copy_sectors;
> +}
> +
>  static inline unsigned int
>  bdev_max_secure_erase_sectors(struct block_device *bdev)
>  {

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-20 10:20     ` [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer Nitesh Shetty
@ 2024-05-20 15:00       ` Damien Le Moal
  2024-05-21 10:50         ` Nitesh Shetty
  2024-05-20 23:00       ` Bart Van Assche
                         ` (3 subsequent siblings)
  4 siblings, 1 reply; 89+ messages in thread
From: Damien Le Moal @ 2024-05-20 15:00 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

On 2024/05/20 12:20, Nitesh Shetty wrote:
> We add two new opcode REQ_OP_COPY_DST, REQ_OP_COPY_SRC.
> Since copy is a composite operation involving src and dst sectors/lba,
> each needs to be represented by a separate bio to make it compatible
> with device mapper.

Why ? The beginning of the sentence isn't justification enough for the two new
operation codes ? The 2 sentences should be reversed for easier reading:
justification first naturally leads to the reader understanding why the codes
are needed.

Also: s/opcode/operations


> We expect caller to take a plug and send bio with destination information,
> followed by bio with source information.

expect ? Plugging is optional. Does copy offload require it ? Please clarify this.

> Once the dst bio arrives we form a request and wait for source

arrives ? You mean "is submitted" ?

s/and wait for/and wait for the

> bio. Upon arrival of source bio we merge these two bio's and send

s/arrival/submission ?

s/of/of the
s/bio's/BIOs
s/and send/and send the
s/down to/down to the

> corresponding request down to device driver.
> Merging non copy offload bio is avoided by checking for copy specific
> opcodes in merge function.

Super unclear... What are you trying to say here ? That merging copy offload
BIOs with other BIOs is not allowed ? That is already handled. Only BIOs &
requests with the same operation can be merged. The code below also suggests
that you allow merging copy offloads... So I really do not understand this.

> 
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> ---
>  block/blk-core.c          |  7 +++++++
>  block/blk-merge.c         | 41 +++++++++++++++++++++++++++++++++++++++
>  block/blk.h               | 16 +++++++++++++++
>  block/elevator.h          |  1 +
>  include/linux/bio.h       |  6 +-----
>  include/linux/blk_types.h | 10 ++++++++++
>  6 files changed, 76 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index ea44b13af9af..f18ee5f709c0 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -122,6 +122,8 @@ static const char *const blk_op_name[] = {
>  	REQ_OP_NAME(ZONE_FINISH),
>  	REQ_OP_NAME(ZONE_APPEND),
>  	REQ_OP_NAME(WRITE_ZEROES),
> +	REQ_OP_NAME(COPY_SRC),
> +	REQ_OP_NAME(COPY_DST),
>  	REQ_OP_NAME(DRV_IN),
>  	REQ_OP_NAME(DRV_OUT),
>  };
> @@ -838,6 +840,11 @@ void submit_bio_noacct(struct bio *bio)
>  		 * requests.
>  		 */
>  		fallthrough;
> +	case REQ_OP_COPY_SRC:
> +	case REQ_OP_COPY_DST:
> +		if (!q->limits.max_copy_sectors)
> +			goto not_supported;
> +		break;
>  	default:
>  		goto not_supported;
>  	}
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 8534c35e0497..f8dc48a03379 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -154,6 +154,20 @@ static struct bio *bio_split_write_zeroes(struct bio *bio,
>  	return bio_split(bio, lim->max_write_zeroes_sectors, GFP_NOIO, bs);
>  }
>  
> +static struct bio *bio_split_copy(struct bio *bio,
> +				  const struct queue_limits *lim,
> +				  unsigned int *nsegs)
> +{
> +	*nsegs = 1;
> +	if (bio_sectors(bio) <= lim->max_copy_sectors)
> +		return NULL;
> +	/*
> +	 * We don't support splitting for a copy bio. End it with EIO if
> +	 * splitting is required and return an error pointer.
> +	 */
> +	return ERR_PTR(-EIO);
> +}

Hmm... Why not check that the copy request is small enough and will not be split
when it is submitted ? Something like blk_check_zone_append() does with
REQ_OP_ZONE_APPEND ? So adding a blk_check_copy_offload(). That would also
include the limits check from the previous hunk.

> +
>  /*
>   * Return the maximum number of sectors from the start of a bio that may be
>   * submitted as a single request to a block device. If enough sectors remain,
> @@ -362,6 +376,12 @@ struct bio *__bio_split_to_limits(struct bio *bio,
>  	case REQ_OP_WRITE_ZEROES:
>  		split = bio_split_write_zeroes(bio, lim, nr_segs, bs);
>  		break;
> +	case REQ_OP_COPY_SRC:
> +	case REQ_OP_COPY_DST:
> +		split = bio_split_copy(bio, lim, nr_segs);
> +		if (IS_ERR(split))
> +			return NULL;
> +		break;

See above.

>  	default:
>  		split = bio_split_rw(bio, lim, nr_segs, bs,
>  				get_max_io_size(bio, lim) << SECTOR_SHIFT);
> @@ -925,6 +945,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
>  	if (!rq_mergeable(rq) || !bio_mergeable(bio))
>  		return false;
>  
> +	if (blk_copy_offload_mergable(rq, bio))
> +		return true;
> +
>  	if (req_op(rq) != bio_op(bio))
>  		return false;
>  
> @@ -958,6 +981,8 @@ enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
>  {
>  	if (blk_discard_mergable(rq))
>  		return ELEVATOR_DISCARD_MERGE;
> +	else if (blk_copy_offload_mergable(rq, bio))
> +		return ELEVATOR_COPY_OFFLOAD_MERGE;
>  	else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
>  		return ELEVATOR_BACK_MERGE;
>  	else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
> @@ -1065,6 +1090,20 @@ static enum bio_merge_status bio_attempt_discard_merge(struct request_queue *q,
>  	return BIO_MERGE_FAILED;
>  }
>  
> +static enum bio_merge_status bio_attempt_copy_offload_merge(struct request *req,
> +							    struct bio *bio)
> +{
> +	if (req->__data_len != bio->bi_iter.bi_size)
> +		return BIO_MERGE_FAILED;
> +
> +	req->biotail->bi_next = bio;
> +	req->biotail = bio;
> +	req->nr_phys_segments++;
> +	req->__data_len += bio->bi_iter.bi_size;

Arg... You seem to be assuming that the source BIO always comes right after the
destination request... What if copy offloads are being concurrently issued ?
Shouldn't you check somehow that the pair is a match ? Or are you relying on the
per-context plugging which prevents that from happening in the first place ? But
that would assumes that you never ever sleep trying to allocate the source BIO
after the destination BIO/request are prepared and plugged.

> +
> +	return BIO_MERGE_OK;
> +}
> +
>  static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
>  						   struct request *rq,
>  						   struct bio *bio,
> @@ -1085,6 +1124,8 @@ static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
>  		break;
>  	case ELEVATOR_DISCARD_MERGE:
>  		return bio_attempt_discard_merge(q, rq, bio);
> +	case ELEVATOR_COPY_OFFLOAD_MERGE:
> +		return bio_attempt_copy_offload_merge(rq, bio);
>  	default:
>  		return BIO_MERGE_NONE;
>  	}
> diff --git a/block/blk.h b/block/blk.h
> index 189bc25beb50..6528a2779b84 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -174,6 +174,20 @@ static inline bool blk_discard_mergable(struct request *req)
>  	return false;
>  }
>  
> +/*
> + * Copy offload sends a pair of bio with REQ_OP_COPY_DST and REQ_OP_COPY_SRC
> + * operation by taking a plug.
> + * Initially DST bio is sent which forms a request and
> + * waits for SRC bio to arrive. Once SRC bio arrives
> + * we merge it and send request down to driver.
> + */
> +static inline bool blk_copy_offload_mergable(struct request *req,
> +					     struct bio *bio)
> +{
> +	return (req_op(req) == REQ_OP_COPY_DST &&
> +		bio_op(bio) == REQ_OP_COPY_SRC);
> +}

This function is really not needed at all (used in one place only).

> +
>  static inline unsigned int blk_rq_get_max_segments(struct request *rq)
>  {
>  	if (req_op(rq) == REQ_OP_DISCARD)
> @@ -323,6 +337,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
>  	case REQ_OP_DISCARD:
>  	case REQ_OP_SECURE_ERASE:
>  	case REQ_OP_WRITE_ZEROES:
> +	case REQ_OP_COPY_SRC:
> +	case REQ_OP_COPY_DST:
>  		return true; /* non-trivial splitting decisions */

See above. Limits should be checked on submission.

>  	default:
>  		break;
> diff --git a/block/elevator.h b/block/elevator.h
> index e9a050a96e53..c7a45c1f4156 100644
> --- a/block/elevator.h
> +++ b/block/elevator.h
> @@ -18,6 +18,7 @@ enum elv_merge {
>  	ELEVATOR_FRONT_MERGE	= 1,
>  	ELEVATOR_BACK_MERGE	= 2,
>  	ELEVATOR_DISCARD_MERGE	= 3,
> +	ELEVATOR_COPY_OFFLOAD_MERGE	= 4,
>  };
>  
>  struct blk_mq_alloc_data;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index d5379548d684..528ef22dd65b 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -53,11 +53,7 @@ static inline unsigned int bio_max_segs(unsigned int nr_segs)
>   */
>  static inline bool bio_has_data(struct bio *bio)
>  {
> -	if (bio &&
> -	    bio->bi_iter.bi_size &&
> -	    bio_op(bio) != REQ_OP_DISCARD &&
> -	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
> -	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
> +	if (bio && (bio_op(bio) == REQ_OP_READ || bio_op(bio) == REQ_OP_WRITE))
>  		return true;

This change seems completely broken and out of place. This would cause a return
of false for zone append operations.

>  
>  	return false;
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 781c4500491b..7f692bade271 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -342,6 +342,10 @@ enum req_op {
>  	/* reset all the zone present on the device */
>  	REQ_OP_ZONE_RESET_ALL	= (__force blk_opf_t)15,
>  
> +	/* copy offload src and dst operation */

s/src/source
s/dst/destination
s/operation/operations

> +	REQ_OP_COPY_SRC		= (__force blk_opf_t)18,
> +	REQ_OP_COPY_DST		= (__force blk_opf_t)19,
> +
>  	/* Driver private requests */
>  	REQ_OP_DRV_IN		= (__force blk_opf_t)34,
>  	REQ_OP_DRV_OUT		= (__force blk_opf_t)35,
> @@ -430,6 +434,12 @@ static inline bool op_is_write(blk_opf_t op)
>  	return !!(op & (__force blk_opf_t)1);
>  }
>  
> +static inline bool op_is_copy(blk_opf_t op)
> +{
> +	return ((op & REQ_OP_MASK) == REQ_OP_COPY_SRC ||
> +		(op & REQ_OP_MASK) == REQ_OP_COPY_DST);
> +}

May be use a switch here to avoid the double masking of op ?

> +
>  /*
>   * Check if the bio or request is one that needs special treatment in the
>   * flush state machine.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support
  2024-05-20 10:20     ` [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support Nitesh Shetty
  2024-05-20 14:33       ` Damien Le Moal
@ 2024-05-20 22:42       ` Bart Van Assche
  2024-05-21 14:25         ` Nitesh Shetty
  2024-05-21  6:57       ` Hannes Reinecke
  2024-06-01  5:53       ` Christoph Hellwig
  3 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2024-05-20 22:42 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 5/20/24 03:20, Nitesh Shetty wrote:
> +static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
> +{
> +	return sprintf(page, "%llu\n", (unsigned long long)
> +		       q->limits.max_copy_sectors << SECTOR_SHIFT);
> +}
> +
> +static ssize_t queue_copy_max_store(struct request_queue *q, const char *page,
> +				    size_t count)
> +{
> +	unsigned long max_copy_bytes;
> +	struct queue_limits lim;
> +	ssize_t ret;
> +	int err;
> +
> +	ret = queue_var_store(&max_copy_bytes, page, count);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (max_copy_bytes & (queue_logical_block_size(q) - 1))
> +		return -EINVAL;

Wouldn't it be more user-friendly if this check would be left out? Does any code
depend on max_copy_bytes being a multiple of the logical block size?

> +	blk_mq_freeze_queue(q);
> +	lim = queue_limits_start_update(q);
> +	lim.max_user_copy_sectors = max_copy_bytes >> SECTOR_SHIFT;
> +	err = queue_limits_commit_update(q, &lim);
> +	blk_mq_unfreeze_queue(q);
> +
> +	if (err)
> +		return err;
> +	return count;
> +}

queue_copy_max_show() shows max_copy_sectors while queue_copy_max_store()
modifies max_user_copy_sectors. Is that perhaps a bug?

> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index aefdda9f4ec7..109d9f905c3c 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -309,6 +309,10 @@ struct queue_limits {
>   	unsigned int		discard_alignment;
>   	unsigned int		zone_write_granularity;
>   
> +	unsigned int		max_copy_hw_sectors;
> +	unsigned int		max_copy_sectors;
> +	unsigned int		max_user_copy_sectors;

Two new limits are documented in Documentation/ABI/stable/sysfs-block while three
new parameters are added in struct queue_limits. Why three new limits instead of
two? Please add a comment above the new parameters that explains the role of the
new parameters.

> +/* maximum copy offload length, this is set to 128MB based on current testing */
> +#define BLK_COPY_MAX_BYTES		(1 << 27)

"current testing" sounds vague. Why is this limit required? Why to cap what the
driver reports instead of using the value reported by the driver without modifying it?

Additionally, since this constant is only used in source code that occurs in the
block/ directory, please move the definition of this constant into a source or header
file in the block/ directory.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 00/12] Implement copy offload support
  2024-05-20 10:20 ` [PATCH v20 00/12] Implement copy offload support Nitesh Shetty
                     ` (11 preceding siblings ...)
       [not found]   ` <CGME20240520103039epcas5p4373f7234162a32222ac225b976ae30ce@epcas5p4.samsung.com>
@ 2024-05-20 22:54   ` Bart Van Assche
  2024-06-01  5:47   ` Christoph Hellwig
  13 siblings, 0 replies; 89+ messages in thread
From: Bart Van Assche @ 2024-05-20 22:54 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 5/20/24 03:20, Nitesh Shetty wrote:
> 4. This bio is merged with the request containing the destination info.

bios with different operation types must never be merged. From attempt_merge():

	if (req_op(req) != req_op(next))
		return NULL;

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-20 10:20     ` [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer Nitesh Shetty
  2024-05-20 15:00       ` Damien Le Moal
@ 2024-05-20 23:00       ` Bart Van Assche
  2024-05-21 11:17         ` Nitesh Shetty
  2024-05-21  7:01       ` Hannes Reinecke
                         ` (2 subsequent siblings)
  4 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2024-05-20 23:00 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 5/20/24 03:20, Nitesh Shetty wrote:
> Upon arrival of source bio we merge these two bio's and send
> corresponding request down to device driver.

bios with different operation types must not be merged.

> +static enum bio_merge_status bio_attempt_copy_offload_merge(struct request *req,
> +							    struct bio *bio)
> +{
> +	if (req->__data_len != bio->bi_iter.bi_size)
> +		return BIO_MERGE_FAILED;
> +
> +	req->biotail->bi_next = bio;
> +	req->biotail = bio;
> +	req->nr_phys_segments++;
> +	req->__data_len += bio->bi_iter.bi_size;
> +
> +	return BIO_MERGE_OK;
> +}

This function appends a bio to a request. Hence, the name of this function is
wrong.

> @@ -1085,6 +1124,8 @@ static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
>   		break;
>   	case ELEVATOR_DISCARD_MERGE:
>   		return bio_attempt_discard_merge(q, rq, bio);
> +	case ELEVATOR_COPY_OFFLOAD_MERGE:
> +		return bio_attempt_copy_offload_merge(rq, bio);
>   	default:
>   		return BIO_MERGE_NONE;
>   	}

Is any code added in this patch series that causes an I/O scheduler to return
ELEVATOR_COPY_OFFLOAD_MERGE?

> +static inline bool blk_copy_offload_mergable(struct request *req,
> +					     struct bio *bio)
> +{
> +	return (req_op(req) == REQ_OP_COPY_DST &&
> +		bio_op(bio) == REQ_OP_COPY_SRC);
> +}

bios with different operation types must not be merged. Please rename this function.

> +static inline bool op_is_copy(blk_opf_t op)
> +{
> +	return ((op & REQ_OP_MASK) == REQ_OP_COPY_SRC ||
> +		(op & REQ_OP_MASK) == REQ_OP_COPY_DST);
> +}

The above function is not used in this patch. Please introduce new functions in the
patch in which these are used for the first time.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 10/12] dm: Enable copy offload for dm-linear target
  2024-05-20 10:20     ` [PATCH v20 10/12] dm: Enable copy offload for dm-linear target Nitesh Shetty
@ 2024-05-20 23:25       ` Bart Van Assche
  2024-05-21 14:48         ` Nitesh Shetty
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2024-05-20 23:25 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 5/20/24 03:20, Nitesh Shetty wrote:
> Setting copy_offload_supported flag to enable offload.

I think that the description of this patch should explain why it is safe
to set the 'copy_offload_supported' flag for the dm-linear driver.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 11/12] null: Enable trace capability for null block
  2024-05-20 10:20     ` [PATCH v20 11/12] null: Enable trace capability for null block Nitesh Shetty
@ 2024-05-20 23:27       ` Bart Van Assche
  2024-06-01  6:23       ` Christoph Hellwig
  1 sibling, 0 replies; 89+ messages in thread
From: Bart Van Assche @ 2024-05-20 23:27 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 5/20/24 03:20, Nitesh Shetty wrote:
> This is a prep patch to enable copy trace capability.
> At present only zoned null_block is using trace, so we decoupled trace
> and zoned dependency to make it usable in null_blk driver also.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 12/12] null_blk: add support for copy offload
  2024-05-20 10:20     ` [PATCH v20 12/12] null_blk: add support for copy offload Nitesh Shetty
@ 2024-05-20 23:42       ` Bart Van Assche
  2024-05-21 14:46         ` Nitesh Shetty
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2024-05-20 23:42 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, Vincent Fu, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

On 5/20/24 03:20, Nitesh Shetty wrote:
> +	if (blk_rq_nr_phys_segments(req) != BLK_COPY_MAX_SEGMENTS)
> +		return status;

Why is this check necessary?

> +	/*
> +	 * First bio contains information about destination and last bio
> +	 * contains information about source.
> +	 */

Please check this at runtime (WARN_ON_ONCE()?).

> +	__rq_for_each_bio(bio, req) {
> +		if (seg == blk_rq_nr_phys_segments(req)) {
> +			sector_in = bio->bi_iter.bi_sector;
> +			if (rem != bio->bi_iter.bi_size)
> +				return status;
> +		} else {
> +			sector_out = bio->bi_iter.bi_sector;
> +			rem = bio->bi_iter.bi_size;
> +		}
> +		seg++;
> +	}

_rq_for_each_bio() iterates over the bios in a request. Does a copy
offload request always have two bios - one copy destination bio and
one copy source bio? If so, is 'seg' a bio counter? Why is that bio
counter compared with the number of physical segments in the request?

> +	trace_nullb_copy_op(req, sector_out << SECTOR_SHIFT,
> +			    sector_in << SECTOR_SHIFT, rem);
> +
> +	spin_lock_irq(&nullb->lock);
> +	while (rem > 0) {
> +		chunk = min_t(size_t, nullb->dev->blocksize, rem);
> +		offset_in = (sector_in & SECTOR_MASK) << SECTOR_SHIFT;
> +		offset_out = (sector_out & SECTOR_MASK) << SECTOR_SHIFT;
> +
> +		if (null_cache_active(nullb) && !is_fua)
> +			null_make_cache_space(nullb, PAGE_SIZE);
> +
> +		t_page_in = null_lookup_page(nullb, sector_in, false,
> +					     !null_cache_active(nullb));
> +		if (!t_page_in)
> +			goto err;
> +		t_page_out = null_insert_page(nullb, sector_out,
> +					      !null_cache_active(nullb) ||
> +					      is_fua);
> +		if (!t_page_out)
> +			goto err;
> +
> +		in = kmap_local_page(t_page_in->page);
> +		out = kmap_local_page(t_page_out->page);
> +
> +		memcpy(out + offset_out, in + offset_in, chunk);
> +		kunmap_local(out);
> +		kunmap_local(in);
> +		__set_bit(sector_out & SECTOR_MASK, t_page_out->bitmap);
> +
> +		if (is_fua)
> +			null_free_sector(nullb, sector_out, true);
> +
> +		rem -= chunk;
> +		sector_in += chunk >> SECTOR_SHIFT;
> +		sector_out += chunk >> SECTOR_SHIFT;
> +	}
> +
> +	status = 0;
> +err:
> +	spin_unlock_irq(&nullb->lock);

In the worst case, how long does this loop disable interrupts?

> +TRACE_EVENT(nullb_copy_op,
> +		TP_PROTO(struct request *req,
> +			 sector_t dst, sector_t src, size_t len),
> +		TP_ARGS(req, dst, src, len),
> +		TP_STRUCT__entry(
> +				 __array(char, disk, DISK_NAME_LEN)
> +				 __field(enum req_op, op)
> +				 __field(sector_t, dst)
> +				 __field(sector_t, src)
> +				 __field(size_t, len)
> +		),

Isn't __string() preferred over __array() since the former occupies less space
in the trace buffer?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support
  2024-05-20 10:20     ` [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support Nitesh Shetty
  2024-05-20 14:33       ` Damien Le Moal
  2024-05-20 22:42       ` Bart Van Assche
@ 2024-05-21  6:57       ` Hannes Reinecke
  2024-06-01  5:53       ` Christoph Hellwig
  3 siblings, 0 replies; 89+ messages in thread
From: Hannes Reinecke @ 2024-05-21  6:57 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

On 5/20/24 12:20, Nitesh Shetty wrote:
> Add device limits as sysfs entries,
> 	- copy_max_bytes (RW)
> 	- copy_max_hw_bytes (RO)
> 
> Above limits help to split the copy payload in block layer.
> copy_max_bytes: maximum total length of copy in single payload.
> copy_max_hw_bytes: Reflects the device supported maximum limit.
> 
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> ---
>   Documentation/ABI/stable/sysfs-block | 23 +++++++++++++++
>   block/blk-settings.c                 | 34 ++++++++++++++++++++--
>   block/blk-sysfs.c                    | 43 ++++++++++++++++++++++++++++
>   include/linux/blkdev.h               | 14 +++++++++
>   4 files changed, 112 insertions(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-20 10:20     ` [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer Nitesh Shetty
  2024-05-20 15:00       ` Damien Le Moal
  2024-05-20 23:00       ` Bart Van Assche
@ 2024-05-21  7:01       ` Hannes Reinecke
  2024-05-24  6:54         ` Nitesh Shetty
                           ` (2 more replies)
  2024-05-22 18:05       ` Bart Van Assche
  2024-05-24 20:33       ` Bart Van Assche
  4 siblings, 3 replies; 89+ messages in thread
From: Hannes Reinecke @ 2024-05-21  7:01 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

On 5/20/24 12:20, Nitesh Shetty wrote:
> We add two new opcode REQ_OP_COPY_DST, REQ_OP_COPY_SRC.
> Since copy is a composite operation involving src and dst sectors/lba,
> each needs to be represented by a separate bio to make it compatible
> with device mapper.
> We expect caller to take a plug and send bio with destination information,
> followed by bio with source information.
> Once the dst bio arrives we form a request and wait for source
> bio. Upon arrival of source bio we merge these two bio's and send
> corresponding request down to device driver.
> Merging non copy offload bio is avoided by checking for copy specific
> opcodes in merge function.
> 
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> ---
>   block/blk-core.c          |  7 +++++++
>   block/blk-merge.c         | 41 +++++++++++++++++++++++++++++++++++++++
>   block/blk.h               | 16 +++++++++++++++
>   block/elevator.h          |  1 +
>   include/linux/bio.h       |  6 +-----
>   include/linux/blk_types.h | 10 ++++++++++
>   6 files changed, 76 insertions(+), 5 deletions(-)
> 
I am a bit unsure about leveraging 'merge' here. As Bart pointed out, 
this is arguably as mis-use of the 'merge' functionality as we don't
actually merge bios, but rather use the information from these bios to
form the actual request.
Wouldn't it be better to use bio_chain here, and send out the eventual
request from the end_io function of the bio chain?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 04/12] block: add emulation for copy
  2024-05-20 10:20     ` [PATCH v20 04/12] block: add emulation for copy Nitesh Shetty
@ 2024-05-21  7:06       ` Hannes Reinecke
  2024-05-21 11:29         ` Nitesh Shetty
  2024-06-01  6:18       ` Christoph Hellwig
  1 sibling, 1 reply; 89+ messages in thread
From: Hannes Reinecke @ 2024-05-21  7:06 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Vincent Fu, linux-block,
	linux-kernel, linux-doc, dm-devel, linux-nvme, linux-fsdevel

On 5/20/24 12:20, Nitesh Shetty wrote:
> For the devices which does not support copy, copy emulation is added.
> It is required for in-kernel users like fabrics, where file descriptor is
> not available and hence they can't use copy_file_range.
> Copy-emulation is implemented by reading from source into memory and
> writing to the corresponding destination.
> At present in kernel user of emulation is fabrics.
> 
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> ---
>   block/blk-lib.c        | 223 +++++++++++++++++++++++++++++++++++++++++
>   include/linux/blkdev.h |   4 +
>   2 files changed, 227 insertions(+)
> 
Again, I'm not sure if we need this.
After all, copy offload _is_optional, so we need to be prepared to 
handle systems where it's not supported. In the end, the caller might
decide to do something else entirely; having an in-kernel emulation 
would defeat that.
And with adding an emulation to nullblk we already have an emulation
target to try if people will want to start experimenting.
So I'd rather not have this but rather let the caller deal with the
fact that copy offload is optional.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 05/12] fs/read_write: Enable copy_file_range for block device.
  2024-05-20 10:20     ` [PATCH v20 05/12] fs/read_write: Enable copy_file_range for block device Nitesh Shetty
@ 2024-05-21  7:07       ` Hannes Reinecke
  2024-05-25 23:02       ` Dave Chinner
  1 sibling, 0 replies; 89+ messages in thread
From: Hannes Reinecke @ 2024-05-21  7:07 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

On 5/20/24 12:20, Nitesh Shetty wrote:
> From: Anuj Gupta <anuj20.g@samsung.com>
> 
> This is a prep patch. Allow copy_file_range to work for block devices.
> Relaxing generic_copy_file_checks allows us to reuse the existing infra,
> instead of adding a new user interface for block copy offload.
> Change generic_copy_file_checks to use ->f_mapping->host for both inode_in
> and inode_out. Allow block device in generic_file_rw_checks.
> 
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>   fs/read_write.c | 8 +++++---
>   1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index ef6339391351..31645ca5ed58 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1413,8 +1413,8 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
>   				    struct file *file_out, loff_t pos_out,
>   				    size_t *req_count, unsigned int flags)
>   {
> -	struct inode *inode_in = file_inode(file_in);
> -	struct inode *inode_out = file_inode(file_out);
> +	struct inode *inode_in = file_in->f_mapping->host;
> +	struct inode *inode_out = file_out->f_mapping->host;
>   	uint64_t count = *req_count;
>   	loff_t size_in;
>   	int ret;
> @@ -1726,7 +1726,9 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
>   	/* Don't copy dirs, pipes, sockets... */
>   	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
>   		return -EISDIR;
> -	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> +	if (!S_ISREG(inode_in->i_mode) && !S_ISBLK(inode_in->i_mode))
> +		return -EINVAL;
> +	if ((inode_in->i_mode & S_IFMT) != (inode_out->i_mode & S_IFMT))
>   		return -EINVAL;
>   
>   	if (!(file_in->f_mode & FMODE_READ) ||

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 09/12] dm: Add support for copy offload
  2024-05-20 10:20     ` [PATCH v20 09/12] dm: Add support for copy offload Nitesh Shetty
@ 2024-05-21  7:11       ` Hannes Reinecke
  2024-05-21 14:08         ` Nitesh Shetty
  0 siblings, 1 reply; 89+ messages in thread
From: Hannes Reinecke @ 2024-05-21  7:11 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: martin.petersen, bvanassche, david, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

On 5/20/24 12:20, Nitesh Shetty wrote:
> Before enabling copy for dm target, check if underlying devices and
> dm target support copy. Avoid split happening inside dm target.
> Fail early if the request needs split, currently splitting copy
> request is not supported.
> 
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>   drivers/md/dm-table.c         | 37 +++++++++++++++++++++++++++++++++++
>   drivers/md/dm.c               |  7 +++++++
>   include/linux/device-mapper.h |  3 +++
>   3 files changed, 47 insertions(+)
> 
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index cc66a27c363a..d58c67ecd794 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -1899,6 +1899,38 @@ static bool dm_table_supports_nowait(struct dm_table *t)
>   	return true;
>   }
>   
> +static int device_not_copy_capable(struct dm_target *ti, struct dm_dev *dev,
> +				   sector_t start, sector_t len, void *data)
> +{
> +	struct request_queue *q = bdev_get_queue(dev->bdev);
> +
> +	return !q->limits.max_copy_sectors;
> +}
> +
> +static bool dm_table_supports_copy(struct dm_table *t)
> +{
> +	struct dm_target *ti;
> +	unsigned int i;
> +
> +	for (i = 0; i < t->num_targets; i++) {
> +		ti = dm_table_get_target(t, i);
> +
> +		if (!ti->copy_offload_supported)
> +			return false;
> +
> +		/*
> +		 * target provides copy support (as implied by setting
> +		 * 'copy_offload_supported')
> +		 * and it relies on _all_ data devices having copy support.
> +		 */
> +		if (!ti->type->iterate_devices ||
> +		    ti->type->iterate_devices(ti, device_not_copy_capable, NULL))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
>   static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
>   				      sector_t start, sector_t len, void *data)
>   {
> @@ -1975,6 +2007,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>   		limits->discard_misaligned = 0;
>   	}
>   
> +	if (!dm_table_supports_copy(t)) {
> +		limits->max_copy_sectors = 0;
> +		limits->max_copy_hw_sectors = 0;
> +	}
> +
>   	if (!dm_table_supports_write_zeroes(t))
>   		limits->max_write_zeroes_sectors = 0;
>   
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 597dd7a25823..070b41b83a97 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1717,6 +1717,13 @@ static blk_status_t __split_and_process_bio(struct clone_info *ci)
>   	if (unlikely(ci->is_abnormal_io))
>   		return __process_abnormal_io(ci, ti);
>   
> +	if ((unlikely(op_is_copy(ci->bio->bi_opf)) &&
> +	    max_io_len(ti, ci->sector) < ci->sector_count)) {
> +		DMERR("Error, IO size(%u) > max target size(%llu)\n",
> +		      ci->sector_count, max_io_len(ti, ci->sector));
> +		return BLK_STS_IOERR;
> +	}
> +
>   	/*
>   	 * Only support bio polling for normal IO, and the target io is
>   	 * exactly inside the dm_io instance (verified in dm_poll_dm_io)
> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> index 82b2195efaca..6868941bc7d9 100644
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -397,6 +397,9 @@ struct dm_target {
>   	 * bio_set_dev(). NOTE: ideally a target should _not_ need this.
>   	 */
>   	bool needs_bio_set_dev:1;
> +
> +	/* copy offload is supported */
> +	bool copy_offload_supported:1;
>   };
>   
>   void *dm_per_bio_data(struct bio *bio, size_t data_size);

Errm. Not sure this will work. DM tables might be arbitrarily, requiring 
us to _split_ the copy offload request according to the underlying 
component devices. But we explicitly disallowed a split in one of the 
earlier patches.
Or am I wrong?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support
  2024-05-20 14:33       ` Damien Le Moal
@ 2024-05-21  8:15         ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-21  8:15 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 7209 bytes --]

On 20/05/24 04:33PM, Damien Le Moal wrote:
>On 2024/05/20 12:20, Nitesh Shetty wrote:
>> @@ -231,10 +237,11 @@ int blk_set_default_limits(struct queue_limits *lim)
>>  {
>>  	/*
>>  	 * Most defaults are set by capping the bounds in blk_validate_limits,
>> -	 * but max_user_discard_sectors is special and needs an explicit
>> -	 * initialization to the max value here.
>> +	 * but max_user_discard_sectors and max_user_copy_sectors are special
>> +	 * and needs an explicit initialization to the max value here.
>
>s/needs/need

acked
>
>>  	 */
>>  	lim->max_user_discard_sectors = UINT_MAX;
>> +	lim->max_user_copy_sectors = UINT_MAX;
>>  	return blk_validate_limits(lim);
>>  }
>>
>> @@ -316,6 +323,25 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
>>  }
>>  EXPORT_SYMBOL(blk_queue_max_discard_sectors);
>>
>> +/*
>> + * blk_queue_max_copy_hw_sectors - set max sectors for a single copy payload
>> + * @q:	the request queue for the device
>> + * @max_copy_sectors: maximum number of sectors to copy
>> + */
>> +void blk_queue_max_copy_hw_sectors(struct request_queue *q,
>> +				   unsigned int max_copy_sectors)
>> +{
>> +	struct queue_limits *lim = &q->limits;
>> +
>> +	if (max_copy_sectors > (BLK_COPY_MAX_BYTES >> SECTOR_SHIFT))
>> +		max_copy_sectors = BLK_COPY_MAX_BYTES >> SECTOR_SHIFT;
>> +
>> +	lim->max_copy_hw_sectors = max_copy_sectors;
>> +	lim->max_copy_sectors =
>> +		min(max_copy_sectors, lim->max_user_copy_sectors);
>> +}
>> +EXPORT_SYMBOL_GPL(blk_queue_max_copy_hw_sectors);
>
>Hmm... Such helper seems to not fit with Christoph's changes of the limits
>initialization as that is not necessarily done using &q->limits but depending on
>the driver, a different limit structure. So shouldn't this function be passed a
>queue_limits struct pointer instead of the request queue pointer ?
>
Acked, we made a mistake, we are no longer using this function after moving
to atomic limits change. We will remove this function in next version.

>> +
>>  /**
>>   * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
>>   * @q:  the request queue for the device
>> @@ -633,6 +659,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>>  	t->max_segment_size = min_not_zero(t->max_segment_size,
>>  					   b->max_segment_size);
>>
>> +	t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
>> +	t->max_copy_hw_sectors = min(t->max_copy_hw_sectors,
>> +				     b->max_copy_hw_sectors);
>> +
>>  	t->misaligned |= b->misaligned;
>>
>>  	alignment = queue_limit_alignment_offset(b, start);
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index f0f9314ab65c..805c2b6b0393 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -205,6 +205,44 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
>>  	return queue_var_show(0, page);
>>  }
>>
>> +static ssize_t queue_copy_hw_max_show(struct request_queue *q, char *page)
>> +{
>> +	return sprintf(page, "%llu\n", (unsigned long long)
>> +		       q->limits.max_copy_hw_sectors << SECTOR_SHIFT);
>> +}
>> +
>> +static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
>> +{
>> +	return sprintf(page, "%llu\n", (unsigned long long)
>> +		       q->limits.max_copy_sectors << SECTOR_SHIFT);
>> +}
>
>Given that you repeat the same pattern twice, may be add a queue_var64_show()
>helper ? (naming can be changed).
>
Acked

>> +
>> +static ssize_t queue_copy_max_store(struct request_queue *q, const char *page,
>> +				    size_t count)
>> +{
>> +	unsigned long max_copy_bytes;
>> +	struct queue_limits lim;
>> +	ssize_t ret;
>> +	int err;
>> +
>> +	ret = queue_var_store(&max_copy_bytes, page, count);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	if (max_copy_bytes & (queue_logical_block_size(q) - 1))
>> +		return -EINVAL;
>> +
>> +	blk_mq_freeze_queue(q);
>> +	lim = queue_limits_start_update(q);
>> +	lim.max_user_copy_sectors = max_copy_bytes >> SECTOR_SHIFT;
>
>max_copy_bytes is an unsigned long, so 64 bits on 64-bit arch and
>max_user_copy_sectors is an unsigned int, so 32-bits. There are thus no
>guarantees that this will not overflow. A check is needed.
>
Acked

>> +	err = queue_limits_commit_update(q, &lim);
>> +	blk_mq_unfreeze_queue(q);
>> +
>> +	if (err)
>
>You can reuse ret here. No need for adding the err variable.
Acked

>
>> +		return err;
>> +	return count;
>> +}
>> +
>>  static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
>>  {
>>  	return queue_var_show(0, page);
>> @@ -505,6 +543,9 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
>>  QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
>>  QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
>>
>> +QUEUE_RO_ENTRY(queue_copy_hw_max, "copy_max_hw_bytes");
>> +QUEUE_RW_ENTRY(queue_copy_max, "copy_max_bytes");
>> +
>>  QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
>>  QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
>>  QUEUE_RW_ENTRY(queue_poll, "io_poll");
>> @@ -618,6 +659,8 @@ static struct attribute *queue_attrs[] = {
>>  	&queue_discard_max_entry.attr,
>>  	&queue_discard_max_hw_entry.attr,
>>  	&queue_discard_zeroes_data_entry.attr,
>> +	&queue_copy_hw_max_entry.attr,
>> +	&queue_copy_max_entry.attr,
>>  	&queue_write_same_max_entry.attr,
>>  	&queue_write_zeroes_max_entry.attr,
>>  	&queue_zone_append_max_entry.attr,
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index aefdda9f4ec7..109d9f905c3c 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -309,6 +309,10 @@ struct queue_limits {
>>  	unsigned int		discard_alignment;
>>  	unsigned int		zone_write_granularity;
>>
>> +	unsigned int		max_copy_hw_sectors;
>> +	unsigned int		max_copy_sectors;
>> +	unsigned int		max_user_copy_sectors;
>> +
>>  	unsigned short		max_segments;
>>  	unsigned short		max_integrity_segments;
>>  	unsigned short		max_discard_segments;
>> @@ -933,6 +937,8 @@ void blk_queue_max_secure_erase_sectors(struct request_queue *q,
>>  		unsigned int max_sectors);
>>  extern void blk_queue_max_discard_sectors(struct request_queue *q,
>>  		unsigned int max_discard_sectors);
>> +extern void blk_queue_max_copy_hw_sectors(struct request_queue *q,
>> +					  unsigned int max_copy_sectors);
>>  extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
>>  		unsigned int max_write_same_sectors);
>>  extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
>> @@ -1271,6 +1277,14 @@ static inline unsigned int bdev_discard_granularity(struct block_device *bdev)
>>  	return bdev_get_queue(bdev)->limits.discard_granularity;
>>  }
>>
>> +/* maximum copy offload length, this is set to 128MB based on current testing */
>
>Current testing will not be current in a while... So may be simply say
>"arbitrary" or something. Also please capitalize the first letter of the
>comment. So something like:
>
>/* Arbitrary absolute limit of 128 MB for copy offload. */
>
>> +#define BLK_COPY_MAX_BYTES		(1 << 27)
>
>Also, it is not clear from the name if this is a soft limit or a cap on the
>hardware limit... So at least please adjust the comment to say which one it is.
>
Acked, it is a soft limit.

Thank You,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-20 15:00       ` Damien Le Moal
@ 2024-05-21 10:50         ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-21 10:50 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 10073 bytes --]

On 20/05/24 05:00PM, Damien Le Moal wrote:
>On 2024/05/20 12:20, Nitesh Shetty wrote:
>> We add two new opcode REQ_OP_COPY_DST, REQ_OP_COPY_SRC.
>> Since copy is a composite operation involving src and dst sectors/lba,
>> each needs to be represented by a separate bio to make it compatible
>> with device mapper.
>
>Why ? The beginning of the sentence isn't justification enough for the two new
>operation codes ? The 2 sentences should be reversed for easier reading:
>justification first naturally leads to the reader understanding why the codes
>are needed.
>
>Also: s/opcode/operations
>
>
Acked

>> We expect caller to take a plug and send bio with destination information,
>> followed by bio with source information.
>
>expect ? Plugging is optional. Does copy offload require it ? Please clarify this.
>
"caller" here refers to blkdev_copy_offload. The caller of blkdev_copy_offload
does not need to take the plug. We should have made this clear.
Sorry for the confusion, will update the description to reflect the same
in next version.

>> Once the dst bio arrives we form a request and wait for source
>
>arrives ? You mean "is submitted" ?
>
>s/and wait for/and wait for the
>
>> bio. Upon arrival of source bio we merge these two bio's and send
>
>s/arrival/submission ?
>
>s/of/of the
>s/bio's/BIOs
>s/and send/and send the
>s/down to/down to the
>
acked for above.

>> corresponding request down to device driver.
>> Merging non copy offload bio is avoided by checking for copy specific
>> opcodes in merge function.
>
>Super unclear... What are you trying to say here ? That merging copy offload
>BIOs with other BIOs is not allowed ? That is already handled. Only BIOs &
>requests with the same operation can be merged. The code below also suggests
>that you allow merging copy offloads... So I really do not understand this.
>
You are right, only BIOs & requests with same operation can be merged.
We are not merging copy offloads. We are only merging src and dst BIO to
form a copy offload request.
We will remove this in next version.

>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 8534c35e0497..f8dc48a03379 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -154,6 +154,20 @@ static struct bio *bio_split_write_zeroes(struct bio *bio,
>>  	return bio_split(bio, lim->max_write_zeroes_sectors, GFP_NOIO, bs);
>>  }
>>
>> +static struct bio *bio_split_copy(struct bio *bio,
>> +				  const struct queue_limits *lim,
>> +				  unsigned int *nsegs)
>> +{
>> +	*nsegs = 1;
>> +	if (bio_sectors(bio) <= lim->max_copy_sectors)
>> +		return NULL;
>> +	/*
>> +	 * We don't support splitting for a copy bio. End it with EIO if
>> +	 * splitting is required and return an error pointer.
>> +	 */
>> +	return ERR_PTR(-EIO);
>> +}
>
>Hmm... Why not check that the copy request is small enough and will not be split
>when it is submitted ? Something like blk_check_zone_append() does with
>REQ_OP_ZONE_APPEND ? So adding a blk_check_copy_offload(). That would also
>include the limits check from the previous hunk.
Yes, that could be one way. But we followed approach similar to discard.

>
>> +
>>  /*
>>   * Return the maximum number of sectors from the start of a bio that may be
>>   * submitted as a single request to a block device. If enough sectors remain,
>> @@ -362,6 +376,12 @@ struct bio *__bio_split_to_limits(struct bio *bio,
>>  	case REQ_OP_WRITE_ZEROES:
>>  		split = bio_split_write_zeroes(bio, lim, nr_segs, bs);
>>  		break;
>> +	case REQ_OP_COPY_SRC:
>> +	case REQ_OP_COPY_DST:
>> +		split = bio_split_copy(bio, lim, nr_segs);
>> +		if (IS_ERR(split))
>> +			return NULL;
>> +		break;
>
>See above.
>
>>  	default:
>>  		split = bio_split_rw(bio, lim, nr_segs, bs,
>>  				get_max_io_size(bio, lim) << SECTOR_SHIFT);
>> @@ -925,6 +945,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
>>  	if (!rq_mergeable(rq) || !bio_mergeable(bio))
>>  		return false;
>>
>> +	if (blk_copy_offload_mergable(rq, bio))
>> +		return true;
>> +
>>  	if (req_op(rq) != bio_op(bio))
>>  		return false;
>>
>> @@ -958,6 +981,8 @@ enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
>>  {
>>  	if (blk_discard_mergable(rq))
>>  		return ELEVATOR_DISCARD_MERGE;
>> +	else if (blk_copy_offload_mergable(rq, bio))
>> +		return ELEVATOR_COPY_OFFLOAD_MERGE;
>>  	else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
>>  		return ELEVATOR_BACK_MERGE;
>>  	else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
>> @@ -1065,6 +1090,20 @@ static enum bio_merge_status bio_attempt_discard_merge(struct request_queue *q,
>>  	return BIO_MERGE_FAILED;
>>  }
>>
>> +static enum bio_merge_status bio_attempt_copy_offload_merge(struct request *req,
>> +							    struct bio *bio)
>> +{
>> +	if (req->__data_len != bio->bi_iter.bi_size)
>> +		return BIO_MERGE_FAILED;
>> +
>> +	req->biotail->bi_next = bio;
>> +	req->biotail = bio;
>> +	req->nr_phys_segments++;
>> +	req->__data_len += bio->bi_iter.bi_size;
>
>Arg... You seem to be assuming that the source BIO always comes right after the
>destination request... What if copy offloads are being concurrently issued ?
>Shouldn't you check somehow that the pair is a match ? Or are you relying on the
>per-context plugging which prevents that from happening in the first place ? But
>that would assumes that you never ever sleep trying to allocate the source BIO
>after the destination BIO/request are prepared and plugged.
>
Yes, we are rely on per-context plugging for copy to work.
Incase for any reason we are not able to merge src and dst BIOs, and the
request reaches driver layer with only one of the src or dst BIO,
we fail this request in driver layer.
This approach simplifies the overall copy plumbing.

>> +
>> +	return BIO_MERGE_OK;
>> +}
>> +
>>  static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
>>  						   struct request *rq,
>>  						   struct bio *bio,
>> @@ -1085,6 +1124,8 @@ static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
>>  		break;
>>  	case ELEVATOR_DISCARD_MERGE:
>>  		return bio_attempt_discard_merge(q, rq, bio);
>> +	case ELEVATOR_COPY_OFFLOAD_MERGE:
>> +		return bio_attempt_copy_offload_merge(rq, bio);
>>  	default:
>>  		return BIO_MERGE_NONE;
>>  	}
>> diff --git a/block/blk.h b/block/blk.h
>> index 189bc25beb50..6528a2779b84 100644
>> --- a/block/blk.h
>> +++ b/block/blk.h
>> @@ -174,6 +174,20 @@ static inline bool blk_discard_mergable(struct request *req)
>>  	return false;
>>  }
>>
>> +/*
>> + * Copy offload sends a pair of bio with REQ_OP_COPY_DST and REQ_OP_COPY_SRC
>> + * operation by taking a plug.
>> + * Initially DST bio is sent which forms a request and
>> + * waits for SRC bio to arrive. Once SRC bio arrives
>> + * we merge it and send request down to driver.
>> + */
>> +static inline bool blk_copy_offload_mergable(struct request *req,
>> +					     struct bio *bio)
>> +{
>> +	return (req_op(req) == REQ_OP_COPY_DST &&
>> +		bio_op(bio) == REQ_OP_COPY_SRC);
>> +}
>
>This function is really not needed at all (used in one place only).
>
This is used at two places and we felt this provides better readability,
similar to discard.

>> +
>>  static inline unsigned int blk_rq_get_max_segments(struct request *rq)
>>  {
>>  	if (req_op(rq) == REQ_OP_DISCARD)
>> @@ -323,6 +337,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
>>  	case REQ_OP_DISCARD:
>>  	case REQ_OP_SECURE_ERASE:
>>  	case REQ_OP_WRITE_ZEROES:
>> +	case REQ_OP_COPY_SRC:
>> +	case REQ_OP_COPY_DST:
>>  		return true; /* non-trivial splitting decisions */
>
>See above. Limits should be checked on submission.
>
>>  	default:
>>  		break;
>> diff --git a/block/elevator.h b/block/elevator.h
>> index e9a050a96e53..c7a45c1f4156 100644
>> --- a/block/elevator.h
>> +++ b/block/elevator.h
>> @@ -18,6 +18,7 @@ enum elv_merge {
>>  	ELEVATOR_FRONT_MERGE	= 1,
>>  	ELEVATOR_BACK_MERGE	= 2,
>>  	ELEVATOR_DISCARD_MERGE	= 3,
>> +	ELEVATOR_COPY_OFFLOAD_MERGE	= 4,
>>  };
>>
>>  struct blk_mq_alloc_data;
>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>> index d5379548d684..528ef22dd65b 100644
>> --- a/include/linux/bio.h
>> +++ b/include/linux/bio.h
>> @@ -53,11 +53,7 @@ static inline unsigned int bio_max_segs(unsigned int nr_segs)
>>   */
>>  static inline bool bio_has_data(struct bio *bio)
>>  {
>> -	if (bio &&
>> -	    bio->bi_iter.bi_size &&
>> -	    bio_op(bio) != REQ_OP_DISCARD &&
>> -	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
>> -	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
>> +	if (bio && (bio_op(bio) == REQ_OP_READ || bio_op(bio) == REQ_OP_WRITE))
>>  		return true;
>
>This change seems completely broken and out of place. This would cause a return
>of false for zone append operations.
>
We changed this based on previous review comments[1].
Idea is to replace this with a positive check.
But we did miss adding ZONE_APPEND in this.
We will add ZONE_APPEND check in next version.

[1] https://lore.kernel.org/linux-block/20230720074256.GA5042@lst.de/
>>
>>  	return false;
>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index 781c4500491b..7f692bade271 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -342,6 +342,10 @@ enum req_op {
>>  	/* reset all the zone present on the device */
>>  	REQ_OP_ZONE_RESET_ALL	= (__force blk_opf_t)15,
>>
>> +	/* copy offload src and dst operation */
>
>s/src/source
>s/dst/destination
>s/operation/operations
>
Acked

>> +	REQ_OP_COPY_SRC		= (__force blk_opf_t)18,
>> +	REQ_OP_COPY_DST		= (__force blk_opf_t)19,
>> +
>>  	/* Driver private requests */
>>  	REQ_OP_DRV_IN		= (__force blk_opf_t)34,
>>  	REQ_OP_DRV_OUT		= (__force blk_opf_t)35,
>> @@ -430,6 +434,12 @@ static inline bool op_is_write(blk_opf_t op)
>>  	return !!(op & (__force blk_opf_t)1);
>>  }
>>
>> +static inline bool op_is_copy(blk_opf_t op)
>> +{
>> +	return ((op & REQ_OP_MASK) == REQ_OP_COPY_SRC ||
>> +		(op & REQ_OP_MASK) == REQ_OP_COPY_DST);
>> +}
>
>May be use a switch here to avoid the double masking of op ?
>
Acked

Thank you,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-20 23:00       ` Bart Van Assche
@ 2024-05-21 11:17         ` Nitesh Shetty
  2024-05-22 17:58           ` Bart Van Assche
  0 siblings, 1 reply; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-21 11:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2276 bytes --]

On 20/05/24 04:00PM, Bart Van Assche wrote:
>On 5/20/24 03:20, Nitesh Shetty wrote:
>>Upon arrival of source bio we merge these two bio's and send
>>corresponding request down to device driver.
>
>bios with different operation types must not be merged.
>
Copy is a composite operation which has two operation read and
write combined, so we chose two operation types which reflects that.

>>+static enum bio_merge_status bio_attempt_copy_offload_merge(struct request *req,
>>+							    struct bio *bio)
>>+{
>>+	if (req->__data_len != bio->bi_iter.bi_size)
>>+		return BIO_MERGE_FAILED;
>>+
>>+	req->biotail->bi_next = bio;
>>+	req->biotail = bio;
>>+	req->nr_phys_segments++;
>>+	req->__data_len += bio->bi_iter.bi_size;
>>+
>>+	return BIO_MERGE_OK;
>>+}
>
>This function appends a bio to a request. Hence, the name of this function is
>wrong.
>
We followed the naming convention from discard(bio_attempt_discard_merge)
which does similar thing.
But we are open to renaming, if overall community also feels the same.

>>@@ -1085,6 +1124,8 @@ static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
>>  		break;
>>  	case ELEVATOR_DISCARD_MERGE:
>>  		return bio_attempt_discard_merge(q, rq, bio);
>>+	case ELEVATOR_COPY_OFFLOAD_MERGE:
>>+		return bio_attempt_copy_offload_merge(rq, bio);
>>  	default:
>>  		return BIO_MERGE_NONE;
>>  	}
>
>Is any code added in this patch series that causes an I/O scheduler to return
>ELEVATOR_COPY_OFFLOAD_MERGE?
>
yes, blk_try_merge returns ELEVATOR_COPY_OFFLOAD_MERGE.

>>+static inline bool blk_copy_offload_mergable(struct request *req,
>>+					     struct bio *bio)
>>+{
>>+	return (req_op(req) == REQ_OP_COPY_DST &&
>>+		bio_op(bio) == REQ_OP_COPY_SRC);
>>+}
>
>bios with different operation types must not be merged. Please rename this function.
>
As described above we need two different opcodes.
As far as function renaming, we followed discard's naming. But open to
any suggestion.

>>+static inline bool op_is_copy(blk_opf_t op)
>>+{
>>+	return ((op & REQ_OP_MASK) == REQ_OP_COPY_SRC ||
>>+		(op & REQ_OP_MASK) == REQ_OP_COPY_DST);
>>+}
>
>The above function is not used in this patch. Please introduce new functions in the
>patch in which these are used for the first time.
>
Acked

Thank You
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 04/12] block: add emulation for copy
  2024-05-21  7:06       ` Hannes Reinecke
@ 2024-05-21 11:29         ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-21 11:29 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, david, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Vincent Fu, linux-block,
	linux-kernel, linux-doc, dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1790 bytes --]

On 21/05/24 09:06AM, Hannes Reinecke wrote:
>On 5/20/24 12:20, Nitesh Shetty wrote:
>>For the devices which does not support copy, copy emulation is added.
>>It is required for in-kernel users like fabrics, where file descriptor is
>>not available and hence they can't use copy_file_range.
>>Copy-emulation is implemented by reading from source into memory and
>>writing to the corresponding destination.
>>At present in kernel user of emulation is fabrics.
>>
>>Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
>>Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
>>---
>>  block/blk-lib.c        | 223 +++++++++++++++++++++++++++++++++++++++++
>>  include/linux/blkdev.h |   4 +
>>  2 files changed, 227 insertions(+)
>>
>Again, I'm not sure if we need this.
>After all, copy offload _is_optional, so we need to be prepared to 
>handle systems where it's not supported. In the end, the caller might
>decide to do something else entirely; having an in-kernel emulation 
>would defeat that.
>And with adding an emulation to nullblk we already have an emulation
>target to try if people will want to start experimenting.
>So I'd rather not have this but rather let the caller deal with the
>fact that copy offload is optional.
>
Unlike previous iteration, blkdev_copy_offload doesn't fallback to emulation
incase offload fails.
This is one more option caller/user can leverage, if for some reason
device copy offload is not supported/optimal.
It is upto to the caller to decide, if it wants to use copy emulation.
Moreover we found this is very useful for fabrics scenario, where this saves
the network bandwidth, by sending offload over the network rather than
read+write(when target doesn't support offload).

Thank you
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 09/12] dm: Add support for copy offload
  2024-05-21  7:11       ` Hannes Reinecke
@ 2024-05-21 14:08         ` Nitesh Shetty
  2024-05-22  6:22           ` Hannes Reinecke
  0 siblings, 1 reply; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-21 14:08 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, david, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1330 bytes --]

On 21/05/24 09:11AM, Hannes Reinecke wrote:
>On 5/20/24 12:20, Nitesh Shetty wrote:
>>Before enabling copy for dm target, check if underlying devices and
>>dm target support copy. Avoid split happening inside dm target.
>>Fail early if the request needs split, currently splitting copy
>>request is not supported.
>>
>>Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>---
>>@@ -397,6 +397,9 @@ struct dm_target {
>>  	 * bio_set_dev(). NOTE: ideally a target should _not_ need this.
>>  	 */
>>  	bool needs_bio_set_dev:1;
>>+
>>+	/* copy offload is supported */
>>+	bool copy_offload_supported:1;
>>  };
>>  void *dm_per_bio_data(struct bio *bio, size_t data_size);
>
>Errm. Not sure this will work. DM tables might be arbitrarily, 
>requiring us to _split_ the copy offload request according to the 
>underlying component devices. But we explicitly disallowed a split in 
>one of the earlier patches.
>Or am I wrong?
>
Yes you are right w.r.to split, we disallow split.
But this flag indicates whether we support copy offload in dm-target or
not. At present we support copy offload only in dm-linear.
For other dm-target, eventhough underlaying device supports copy
offload, dm-target based on it wont support copy offload.
If the present series get merged, we can test and integrate more
targets.

Regards,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support
  2024-05-20 22:42       ` Bart Van Assche
@ 2024-05-21 14:25         ` Nitesh Shetty
  2024-05-22 17:49           ` Bart Van Assche
  0 siblings, 1 reply; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-21 14:25 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 3368 bytes --]

On 20/05/24 03:42PM, Bart Van Assche wrote:
>On 5/20/24 03:20, Nitesh Shetty wrote:
>>+static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
>>+{
>>+	return sprintf(page, "%llu\n", (unsigned long long)
>>+		       q->limits.max_copy_sectors << SECTOR_SHIFT);
>>+}
>>+
>>+static ssize_t queue_copy_max_store(struct request_queue *q, const char *page,
>>+				    size_t count)
>>+{
>>+	unsigned long max_copy_bytes;
>>+	struct queue_limits lim;
>>+	ssize_t ret;
>>+	int err;
>>+
>>+	ret = queue_var_store(&max_copy_bytes, page, count);
>>+	if (ret < 0)
>>+		return ret;
>>+
>>+	if (max_copy_bytes & (queue_logical_block_size(q) - 1))
>>+		return -EINVAL;
>
>Wouldn't it be more user-friendly if this check would be left out? Does any code
>depend on max_copy_bytes being a multiple of the logical block size?
>
In block layer, we use max_copy_bytes to split larger copy into
device supported copy size.
Simple copy spec requires length to be logical block size aligned.
Hence this check.

>>+	blk_mq_freeze_queue(q);
>>+	lim = queue_limits_start_update(q);
>>+	lim.max_user_copy_sectors = max_copy_bytes >> SECTOR_SHIFT;
>>+	err = queue_limits_commit_update(q, &lim);
>>+	blk_mq_unfreeze_queue(q);
>>+
>>+	if (err)
>>+		return err;
>>+	return count;
>>+}
>
>queue_copy_max_show() shows max_copy_sectors while queue_copy_max_store()
>modifies max_user_copy_sectors. Is that perhaps a bug?
>
This follows discard implementaion[1].
max_copy_sectors gets updated in queue_limits_commits_update.

[1] https://lore.kernel.org/linux-block/20240213073425.1621680-7-hch@lst.de/

>>diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>index aefdda9f4ec7..109d9f905c3c 100644
>>--- a/include/linux/blkdev.h
>>+++ b/include/linux/blkdev.h
>>@@ -309,6 +309,10 @@ struct queue_limits {
>>  	unsigned int		discard_alignment;
>>  	unsigned int		zone_write_granularity;
>>+	unsigned int		max_copy_hw_sectors;
>>+	unsigned int		max_copy_sectors;
>>+	unsigned int		max_user_copy_sectors;
>
>Two new limits are documented in Documentation/ABI/stable/sysfs-block while three
>new parameters are added in struct queue_limits. Why three new limits instead of
>two? Please add a comment above the new parameters that explains the role of the
>new parameters.
>
Similar to discard, only 2 limits are exposed to user.

>>+/* maximum copy offload length, this is set to 128MB based on current testing */
>>+#define BLK_COPY_MAX_BYTES		(1 << 27)
>
>"current testing" sounds vague. Why is this limit required? Why to cap what the
>driver reports instead of using the value reported by the driver without modifying it?
>
Here we are expecting BLK_COPY_MAX_BYTES >= driver supported limit.

We do support copy length larger than device supported limit.
In block later(blkdev_copy_offload), we split larger copy into device
supported limit and send down.
We added this check to make sure userspace doesn't consume all the
kernel resources[2].
We can remove/expand this arbitary limit moving forward.

[2] https://lore.kernel.org/linux-block/YRu1WFImFulfpk7s@kroah.com/

>Additionally, since this constant is only used in source code that occurs in the
>block/ directory, please move the definition of this constant into a source or header
>file in the block/ directory.
>
We are using this in null block driver as well, so we need to keep it
here.

Thank You,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 12/12] null_blk: add support for copy offload
  2024-05-20 23:42       ` Bart Van Assche
@ 2024-05-21 14:46         ` Nitesh Shetty
  2024-05-22 17:52           ` Bart Van Assche
  0 siblings, 1 reply; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-21 14:46 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, Vincent Fu, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 3163 bytes --]

On 20/05/24 04:42PM, Bart Van Assche wrote:
>On 5/20/24 03:20, Nitesh Shetty wrote:
>>+	if (blk_rq_nr_phys_segments(req) != BLK_COPY_MAX_SEGMENTS)
>>+		return status;
>
>Why is this check necessary?
>
>>+	/*
>>+	 * First bio contains information about destination and last bio
>>+	 * contains information about source.
>>+	 */
>
>Please check this at runtime (WARN_ON_ONCE()?).
>
>>+	__rq_for_each_bio(bio, req) {
>>+		if (seg == blk_rq_nr_phys_segments(req)) {
>>+			sector_in = bio->bi_iter.bi_sector;
>>+			if (rem != bio->bi_iter.bi_size)
>>+				return status;
>>+		} else {
>>+			sector_out = bio->bi_iter.bi_sector;
>>+			rem = bio->bi_iter.bi_size;
>>+		}
>>+		seg++;
>>+	}
>
>_rq_for_each_bio() iterates over the bios in a request. Does a copy
>offload request always have two bios - one copy destination bio and
>one copy source bio? If so, is 'seg' a bio counter? Why is that bio
>counter compared with the number of physical segments in the request?
>
Yes, your observation is right. We are treating first bio as dst and
second as src. If not for that comparision, we might need to store the
index in a temporary variable and parse based on index value.

>>+	trace_nullb_copy_op(req, sector_out << SECTOR_SHIFT,
>>+			    sector_in << SECTOR_SHIFT, rem);
>>+
>>+	spin_lock_irq(&nullb->lock);
>>+	while (rem > 0) {
>>+		chunk = min_t(size_t, nullb->dev->blocksize, rem);
>>+		offset_in = (sector_in & SECTOR_MASK) << SECTOR_SHIFT;
>>+		offset_out = (sector_out & SECTOR_MASK) << SECTOR_SHIFT;
>>+
>>+		if (null_cache_active(nullb) && !is_fua)
>>+			null_make_cache_space(nullb, PAGE_SIZE);
>>+
>>+		t_page_in = null_lookup_page(nullb, sector_in, false,
>>+					     !null_cache_active(nullb));
>>+		if (!t_page_in)
>>+			goto err;
>>+		t_page_out = null_insert_page(nullb, sector_out,
>>+					      !null_cache_active(nullb) ||
>>+					      is_fua);
>>+		if (!t_page_out)
>>+			goto err;
>>+
>>+		in = kmap_local_page(t_page_in->page);
>>+		out = kmap_local_page(t_page_out->page);
>>+
>>+		memcpy(out + offset_out, in + offset_in, chunk);
>>+		kunmap_local(out);
>>+		kunmap_local(in);
>>+		__set_bit(sector_out & SECTOR_MASK, t_page_out->bitmap);
>>+
>>+		if (is_fua)
>>+			null_free_sector(nullb, sector_out, true);
>>+
>>+		rem -= chunk;
>>+		sector_in += chunk >> SECTOR_SHIFT;
>>+		sector_out += chunk >> SECTOR_SHIFT;
>>+	}
>>+
>>+	status = 0;
>>+err:
>>+	spin_unlock_irq(&nullb->lock);
>
>In the worst case, how long does this loop disable interrupts?
>
We havn't measured this. But this should be similar to read and write in
present infra, as we followed similar approach.

>>+TRACE_EVENT(nullb_copy_op,
>>+		TP_PROTO(struct request *req,
>>+			 sector_t dst, sector_t src, size_t len),
>>+		TP_ARGS(req, dst, src, len),
>>+		TP_STRUCT__entry(
>>+				 __array(char, disk, DISK_NAME_LEN)
>>+				 __field(enum req_op, op)
>>+				 __field(sector_t, dst)
>>+				 __field(sector_t, src)
>>+				 __field(size_t, len)
>>+		),
>
>Isn't __string() preferred over __array() since the former occupies less space
>in the trace buffer?
>
Again we followed the present existing implementation, to have a simpler
series to review.

Thank you,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 10/12] dm: Enable copy offload for dm-linear target
  2024-05-20 23:25       ` Bart Van Assche
@ 2024-05-21 14:48         ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-21 14:48 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 363 bytes --]

On 20/05/24 04:25PM, Bart Van Assche wrote:
>On 5/20/24 03:20, Nitesh Shetty wrote:
>>Setting copy_offload_supported flag to enable offload.
>
>I think that the description of this patch should explain why it is safe
>to set the 'copy_offload_supported' flag for the dm-linear driver.
>
Acked, will add more description in next version.

Thank You,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 09/12] dm: Add support for copy offload
  2024-05-21 14:08         ` Nitesh Shetty
@ 2024-05-22  6:22           ` Hannes Reinecke
  2024-05-22  7:10             ` Nitesh Shetty
  0 siblings, 1 reply; 89+ messages in thread
From: Hannes Reinecke @ 2024-05-22  6:22 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, david, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

On 5/21/24 16:08, Nitesh Shetty wrote:
> On 21/05/24 09:11AM, Hannes Reinecke wrote:
>> On 5/20/24 12:20, Nitesh Shetty wrote:
>>> Before enabling copy for dm target, check if underlying devices and
>>> dm target support copy. Avoid split happening inside dm target.
>>> Fail early if the request needs split, currently splitting copy
>>> request is not supported.
>>>
>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>> ---
>>> @@ -397,6 +397,9 @@ struct dm_target {
>>>       * bio_set_dev(). NOTE: ideally a target should _not_ need this.
>>>       */
>>>      bool needs_bio_set_dev:1;
>>> +
>>> +    /* copy offload is supported */
>>> +    bool copy_offload_supported:1;
>>>  };
>>>  void *dm_per_bio_data(struct bio *bio, size_t data_size);
>>
>> Errm. Not sure this will work. DM tables might be arbitrarily, 
>> requiring us to _split_ the copy offload request according to the 
>> underlying component devices. But we explicitly disallowed a split in 
>> one of the earlier patches.
>> Or am I wrong?
>>
> Yes you are right w.r.to split, we disallow split.
> But this flag indicates whether we support copy offload in dm-target or
> not. At present we support copy offload only in dm-linear.
> For other dm-target, eventhough underlaying device supports copy
> offload, dm-target based on it wont support copy offload.
> If the present series get merged, we can test and integrate more
> targets.
> 
But dm-linear can be concatenated, too; you can easily use dm-linear
to tie several devices together.
Which again would require a copy-offload range to be split.
Hmm?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 09/12] dm: Add support for copy offload
  2024-05-22  6:22           ` Hannes Reinecke
@ 2024-05-22  7:10             ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-22  7:10 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, david, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2345 bytes --]

On 22/05/24 08:22AM, Hannes Reinecke wrote:
>On 5/21/24 16:08, Nitesh Shetty wrote:
>>On 21/05/24 09:11AM, Hannes Reinecke wrote:
>>>On 5/20/24 12:20, Nitesh Shetty wrote:
>>>>Before enabling copy for dm target, check if underlying devices and
>>>>dm target support copy. Avoid split happening inside dm target.
>>>>Fail early if the request needs split, currently splitting copy
>>>>request is not supported.
>>>>
>>>>Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>>>---
>>>>@@ -397,6 +397,9 @@ struct dm_target {
>>>>      * bio_set_dev(). NOTE: ideally a target should _not_ need this.
>>>>      */
>>>>     bool needs_bio_set_dev:1;
>>>>+
>>>>+    /* copy offload is supported */
>>>>+    bool copy_offload_supported:1;
>>>> };
>>>> void *dm_per_bio_data(struct bio *bio, size_t data_size);
>>>
>>>Errm. Not sure this will work. DM tables might be arbitrarily, 
>>>requiring us to _split_ the copy offload request according to the 
>>>underlying component devices. But we explicitly disallowed a split 
>>>in one of the earlier patches.
>>>Or am I wrong?
>>>
>>Yes you are right w.r.to split, we disallow split.
>>But this flag indicates whether we support copy offload in dm-target or
>>not. At present we support copy offload only in dm-linear.
>>For other dm-target, eventhough underlaying device supports copy
>>offload, dm-target based on it wont support copy offload.
>>If the present series get merged, we can test and integrate more
>>targets.
>>
>But dm-linear can be concatenated, too; you can easily use dm-linear
>to tie several devices together.
>Which again would require a copy-offload range to be split.
>Hmm?
>
Sorry, I dont understand the concern here. I see 3 possibilites here.

1. Both src and dst IO lies in same underlying device. This will succeed.
2. src and dst lie in different devices. This will fail.
	a. src or dst needs to be split, if one or both of them
	spans across the underlying block device boundary. In this case we
	fail the IO in dm layer(refer patch 9).
	b. src and dst doesn't split in dm,
	but they wont be merged in request later as they belong to
	different block device.
	Hence the request reaches the driver with single bio and will fail
	in driver(refer patch 7)

Does this address your concern, or do you have something else in mind ?

Thank you,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support
  2024-05-21 14:25         ` Nitesh Shetty
@ 2024-05-22 17:49           ` Bart Van Assche
  2024-05-23  7:05             ` Nitesh Shetty
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2024-05-22 17:49 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 5/21/24 07:25, Nitesh Shetty wrote:
> On 20/05/24 03:42PM, Bart Van Assche wrote:
>> On 5/20/24 03:20, Nitesh Shetty wrote:
>>> +    if (max_copy_bytes & (queue_logical_block_size(q) - 1))
>>> +        return -EINVAL;
>>
>> Wouldn't it be more user-friendly if this check would be left out? Does any code
>> depend on max_copy_bytes being a multiple of the logical block size?
>>
> In block layer, we use max_copy_bytes to split larger copy into
> device supported copy size.
> Simple copy spec requires length to be logical block size aligned.
> Hence this check.

Will blkdev_copy_sanity_check() reject invalid copy requests even if this
check is left out?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 12/12] null_blk: add support for copy offload
  2024-05-21 14:46         ` Nitesh Shetty
@ 2024-05-22 17:52           ` Bart Van Assche
  2024-05-23  6:55             ` Nitesh Shetty
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2024-05-22 17:52 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, Vincent Fu, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

On 5/21/24 07:46, Nitesh Shetty wrote:
> On 20/05/24 04:42PM, Bart Van Assche wrote:
>> On 5/20/24 03:20, Nitesh Shetty wrote:
>>> +    __rq_for_each_bio(bio, req) {
>>> +        if (seg == blk_rq_nr_phys_segments(req)) {
>>> +            sector_in = bio->bi_iter.bi_sector;
>>> +            if (rem != bio->bi_iter.bi_size)
>>> +                return status;
>>> +        } else {
>>> +            sector_out = bio->bi_iter.bi_sector;
>>> +            rem = bio->bi_iter.bi_size;
>>> +        }
>>> +        seg++;
>>> +    }
>>
>> _rq_for_each_bio() iterates over the bios in a request. Does a copy
>> offload request always have two bios - one copy destination bio and
>> one copy source bio? If so, is 'seg' a bio counter? Why is that bio
>> counter compared with the number of physical segments in the request?
>>
> Yes, your observation is right. We are treating first bio as dst and
> second as src. If not for that comparision, we might need to store the
> index in a temporary variable and parse based on index value.

I'm still wondering why 'seg' is compared with blk_rq_nr_phys_segments(req).

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-21 11:17         ` Nitesh Shetty
@ 2024-05-22 17:58           ` Bart Van Assche
  0 siblings, 0 replies; 89+ messages in thread
From: Bart Van Assche @ 2024-05-22 17:58 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 5/21/24 04:17, Nitesh Shetty wrote:
> On 20/05/24 04:00PM, Bart Van Assche wrote:
>> On 5/20/24 03:20, Nitesh Shetty wrote:
>>> +static inline bool blk_copy_offload_mergable(struct request *req,
>>> +                         struct bio *bio)
>>> +{
>>> +    return (req_op(req) == REQ_OP_COPY_DST &&
>>> +        bio_op(bio) == REQ_OP_COPY_SRC);
>>> +}
>>
>> bios with different operation types must not be merged. Please rename this function.
>>
> As far as function renaming, we followed discard's naming. But open to
> any suggestion.

req_attempt_discard_merge() checks whether two REQ_OP_DISCARD bios can be merged.
The above function checks something else, namely whether REQ_OP_COPY_DST and
REQ_OP_COPY_SRC can be combined into a copy offload operation. Hence my request
not to use the verb "merge" for combining REQ_OP_COPY_SRC and REQ_OP_COPY_DST
operations.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-20 10:20     ` [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer Nitesh Shetty
                         ` (2 preceding siblings ...)
  2024-05-21  7:01       ` Hannes Reinecke
@ 2024-05-22 18:05       ` Bart Van Assche
  2024-05-23 11:34         ` Nitesh Shetty
  2024-05-24 20:33       ` Bart Van Assche
  4 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2024-05-22 18:05 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 5/20/24 03:20, Nitesh Shetty wrote:
> We add two new opcode REQ_OP_COPY_DST, REQ_OP_COPY_SRC.
> Since copy is a composite operation involving src and dst sectors/lba,
> each needs to be represented by a separate bio to make it compatible
> with device mapper.
> We expect caller to take a plug and send bio with destination information,
> followed by bio with source information.
> Once the dst bio arrives we form a request and wait for source
> bio. Upon arrival of source bio we merge these two bio's and send
> corresponding request down to device driver.
> Merging non copy offload bio is avoided by checking for copy specific
> opcodes in merge function.

Plugs are per task. Can the following happen?
* Task A calls blk_start_plug()
* Task B calls blk_start_plug()
* Task A submits a REQ_OP_COPY_DST bio and a REQ_OP_COPY_SRC bio.
* Task B submits a REQ_OP_COPY_DST bio and a REQ_OP_COPY_SRC bio.
* Task A calls blk_finish_plug()
* Task B calls blk_finish_plug()
* The REQ_OP_COPY_DST bio from task A and the REQ_OP_COPY_SRC bio from
   task B are combined into a single request.
* The REQ_OP_COPY_DST bio from task B and the REQ_OP_COPY_SRC bio from
   task A are combined into a single request.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 12/12] null_blk: add support for copy offload
  2024-05-22 17:52           ` Bart Van Assche
@ 2024-05-23  6:55             ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-23  6:55 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, Vincent Fu, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1499 bytes --]

On 22/05/24 10:52AM, Bart Van Assche wrote:
>On 5/21/24 07:46, Nitesh Shetty wrote:
>>On 20/05/24 04:42PM, Bart Van Assche wrote:
>>>On 5/20/24 03:20, Nitesh Shetty wrote:
>>>>+    __rq_for_each_bio(bio, req) {
>>>>+        if (seg == blk_rq_nr_phys_segments(req)) {
>>>>+            sector_in = bio->bi_iter.bi_sector;
>>>>+            if (rem != bio->bi_iter.bi_size)
>>>>+                return status;
>>>>+        } else {
>>>>+            sector_out = bio->bi_iter.bi_sector;
>>>>+            rem = bio->bi_iter.bi_size;
>>>>+        }
>>>>+        seg++;
>>>>+    }
>>>
>>>_rq_for_each_bio() iterates over the bios in a request. Does a copy
>>>offload request always have two bios - one copy destination bio and
>>>one copy source bio? If so, is 'seg' a bio counter? Why is that bio
>>>counter compared with the number of physical segments in the request?
>>>
>>Yes, your observation is right. We are treating first bio as dst and
>>second as src. If not for that comparision, we might need to store the
>>index in a temporary variable and parse based on index value.
>
>I'm still wondering why 'seg' is compared with blk_rq_nr_phys_segments(req).
>
In this case blk_rq_nr_phys_segments is used as counter value(==2), which tells
its a src IO. But using a macro instead of this comparison will avoid this
confusion. We will change this in next version to make it explicit.

Thank you,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support
  2024-05-22 17:49           ` Bart Van Assche
@ 2024-05-23  7:05             ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-23  7:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 842 bytes --]

On 22/05/24 10:49AM, Bart Van Assche wrote:
>On 5/21/24 07:25, Nitesh Shetty wrote:
>>On 20/05/24 03:42PM, Bart Van Assche wrote:
>>>On 5/20/24 03:20, Nitesh Shetty wrote:
>>>>+    if (max_copy_bytes & (queue_logical_block_size(q) - 1))
>>>>+        return -EINVAL;
>>>
>>>Wouldn't it be more user-friendly if this check would be left out? Does any code
>>>depend on max_copy_bytes being a multiple of the logical block size?
>>>
>>In block layer, we use max_copy_bytes to split larger copy into
>>device supported copy size.
>>Simple copy spec requires length to be logical block size aligned.
>>Hence this check.
>
>Will blkdev_copy_sanity_check() reject invalid copy requests even if this
>check is left out?
>
Yes, you are right. We have checks both places.
We will remove checks in one of the places.

Thank you,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-22 18:05       ` Bart Van Assche
@ 2024-05-23 11:34         ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-23 11:34 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1546 bytes --]

On 22/05/24 11:05AM, Bart Van Assche wrote:
>On 5/20/24 03:20, Nitesh Shetty wrote:
>>We add two new opcode REQ_OP_COPY_DST, REQ_OP_COPY_SRC.
>>Since copy is a composite operation involving src and dst sectors/lba,
>>each needs to be represented by a separate bio to make it compatible
>>with device mapper.
>>We expect caller to take a plug and send bio with destination information,
>>followed by bio with source information.
>>Once the dst bio arrives we form a request and wait for source
>>bio. Upon arrival of source bio we merge these two bio's and send
>>corresponding request down to device driver.
>>Merging non copy offload bio is avoided by checking for copy specific
>>opcodes in merge function.
>
>Plugs are per task. Can the following happen?
We rely on per-context plugging to avoid this.
>* Task A calls blk_start_plug()
>* Task B calls blk_start_plug()
>* Task A submits a REQ_OP_COPY_DST bio and a REQ_OP_COPY_SRC bio.
Lets say this forms request A and stored in plug A
>* Task B submits a REQ_OP_COPY_DST bio and a REQ_OP_COPY_SRC bio.
Lets say this forms request B and stored in plug B
>* Task A calls blk_finish_plug()
>* Task B calls blk_finish_plug()
>* The REQ_OP_COPY_DST bio from task A and the REQ_OP_COPY_SRC bio from
>  task B are combined into a single request.
Here task A picks plug A and hence request A
>* The REQ_OP_COPY_DST bio from task B and the REQ_OP_COPY_SRC bio from
>  task A are combined into a single request.
same as above, request B

So we expect this case not to happen.

Thank you,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-21  7:01       ` Hannes Reinecke
@ 2024-05-24  6:54         ` Nitesh Shetty
       [not found]         ` <66503bc7.630a0220.56c85.8b9dSMTPIN_ADDED_BROKEN@mx.google.com>
  2024-05-28 14:07         ` Bart Van Assche
  2 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-24  6:54 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, david, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1913 bytes --]

On 21/05/24 09:01AM, Hannes Reinecke wrote:
>On 5/20/24 12:20, Nitesh Shetty wrote:
>>We add two new opcode REQ_OP_COPY_DST, REQ_OP_COPY_SRC.
>>Since copy is a composite operation involving src and dst sectors/lba,
>>each needs to be represented by a separate bio to make it compatible
>>with device mapper.
>>We expect caller to take a plug and send bio with destination information,
>>followed by bio with source information.
>>Once the dst bio arrives we form a request and wait for source
>>bio. Upon arrival of source bio we merge these two bio's and send
>>corresponding request down to device driver.
>>Merging non copy offload bio is avoided by checking for copy specific
>>opcodes in merge function.
>>
>>Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
>>---
>>  block/blk-core.c          |  7 +++++++
>>  block/blk-merge.c         | 41 +++++++++++++++++++++++++++++++++++++++
>>  block/blk.h               | 16 +++++++++++++++
>>  block/elevator.h          |  1 +
>>  include/linux/bio.h       |  6 +-----
>>  include/linux/blk_types.h | 10 ++++++++++
>>  6 files changed, 76 insertions(+), 5 deletions(-)
>>
>I am a bit unsure about leveraging 'merge' here. As Bart pointed out, 
>this is arguably as mis-use of the 'merge' functionality as we don't
>actually merge bios, but rather use the information from these bios to
>form the actual request.
>Wouldn't it be better to use bio_chain here, and send out the eventual
>request from the end_io function of the bio chain?
>
With above bio chain approach, I see a difficulty while dealing with
stacked/partitioned devices, as they might get remapped.
Or am I missing something here ?

Bart, Hannes,
Regarding merge, does it looks any better, if we use single request
operation such as REQ_OP_COPY and use op_flags(REQ_COPY_DST/REQ_COPY_SRC)
to identify dst and src bios ?


Regards,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
       [not found]         ` <66503bc7.630a0220.56c85.8b9dSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2024-05-24 13:52           ` Bart Van Assche
  2024-05-27  8:27             ` Nitesh Shetty
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2024-05-24 13:52 UTC (permalink / raw)
  To: Nitesh Shetty, Hannes Reinecke
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 5/23/24 23:54, Nitesh Shetty wrote:
> Regarding merge, does it looks any better, if we use single request
> operation such as REQ_OP_COPY and use op_flags(REQ_COPY_DST/REQ_COPY_SRC)
> to identify dst and src bios ?

I prefer to keep the current approach (REQ_COPY_DST/REQ_COPY_SRC) and to
use a more appropriate verb than "merge", e.g. "combine".

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-20 10:20     ` [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer Nitesh Shetty
                         ` (3 preceding siblings ...)
  2024-05-22 18:05       ` Bart Van Assche
@ 2024-05-24 20:33       ` Bart Van Assche
  2024-05-29  6:17         ` Nitesh Shetty
  4 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2024-05-24 20:33 UTC (permalink / raw)
  To: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 5/20/24 03:20, Nitesh Shetty wrote:
> We add two new opcode REQ_OP_COPY_DST, REQ_OP_COPY_SRC.
> Since copy is a composite operation involving src and dst sectors/lba,
> each needs to be represented by a separate bio to make it compatible
> with device mapper.
> We expect caller to take a plug and send bio with destination information,
> followed by bio with source information.
> Once the dst bio arrives we form a request and wait for source
> bio. Upon arrival of source bio we merge these two bio's and send
> corresponding request down to device driver.
> Merging non copy offload bio is avoided by checking for copy specific
> opcodes in merge function.

In this patch I don't see any changes for blk_attempt_bio_merge(). Does
this mean that combining REQ_OP_COPY_DST and REQ_OP_COPY_SRC will never
happen if the QUEUE_FLAG_NOMERGES request queue flag has been set?

Can it happen that the REQ_NOMERGE flag is set by __bio_split_to_limits()
for REQ_OP_COPY_DST or REQ_OP_COPY_SRC bios? Will this happen if the
following condition is met?

dst_bio->nr_phys_segs + src_bio->nr_phys_segs > max_segments

Is it allowed to set REQ_PREFLUSH or REQ_FUA for REQ_OP_COPY_DST or
REQ_OP_COPY_SRC bios? I'm asking this because these flags disable merging.

 From include/linux/blk_types.h:

#define REQ_NOMERGE_FLAGS (REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA)

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 05/12] fs/read_write: Enable copy_file_range for block device.
  2024-05-20 10:20     ` [PATCH v20 05/12] fs/read_write: Enable copy_file_range for block device Nitesh Shetty
  2024-05-21  7:07       ` Hannes Reinecke
@ 2024-05-25 23:02       ` Dave Chinner
  2024-05-28  5:57         ` Nitesh Shetty
  1 sibling, 1 reply; 89+ messages in thread
From: Dave Chinner @ 2024-05-25 23:02 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

On Mon, May 20, 2024 at 03:50:18PM +0530, Nitesh Shetty wrote:
> From: Anuj Gupta <anuj20.g@samsung.com>
> 
> This is a prep patch. Allow copy_file_range to work for block devices.
> Relaxing generic_copy_file_checks allows us to reuse the existing infra,
> instead of adding a new user interface for block copy offload.
> Change generic_copy_file_checks to use ->f_mapping->host for both inode_in
> and inode_out. Allow block device in generic_file_rw_checks.
> 
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>  fs/read_write.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index ef6339391351..31645ca5ed58 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1413,8 +1413,8 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
>  				    struct file *file_out, loff_t pos_out,
>  				    size_t *req_count, unsigned int flags)
>  {
> -	struct inode *inode_in = file_inode(file_in);
> -	struct inode *inode_out = file_inode(file_out);
> +	struct inode *inode_in = file_in->f_mapping->host;
> +	struct inode *inode_out = file_out->f_mapping->host;
>  	uint64_t count = *req_count;
>  	loff_t size_in;
>  	int ret;

Ok, so this changes from file->f_inode to file->mapping->host. No
doubt this is because of how bdev inode mappings are munged.
However, the first code that is run here is:

	ret = generic_file_rw_checks(file_in, file_out);

and that function still uses file_inode().

Hence there checks:

> @@ -1726,7 +1726,9 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
>  	/* Don't copy dirs, pipes, sockets... */
>  	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
>  		return -EISDIR;
> -	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> +	if (!S_ISREG(inode_in->i_mode) && !S_ISBLK(inode_in->i_mode))
> +		return -EINVAL;
> +	if ((inode_in->i_mode & S_IFMT) != (inode_out->i_mode & S_IFMT))
>  		return -EINVAL;

.... are being done on different inodes to the rest of
generic_copy_file_checks() when block devices are used.

Is this correct? If so, this needs a pair of comments (one for each
function) to explain why the specific inode used for these functions
is correct for block devices....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 06/12] fs, block: copy_file_range for def_blk_ops for direct block device
  2024-05-20 10:20     ` [PATCH v20 06/12] fs, block: copy_file_range for def_blk_ops for direct " Nitesh Shetty
@ 2024-05-25 23:09       ` Dave Chinner
  2024-05-27  8:43         ` Nitesh Shetty
  0 siblings, 1 reply; 89+ messages in thread
From: Dave Chinner @ 2024-05-25 23:09 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

On Mon, May 20, 2024 at 03:50:19PM +0530, Nitesh Shetty wrote:
> For direct block device opened with O_DIRECT, use blkdev_copy_offload to
> issue device copy offload, or use splice_copy_file_range in case
> device copy offload capability is absent or the device files are not open
> with O_DIRECT.
> 
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>  block/fops.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/block/fops.c b/block/fops.c
> index 376265935714..5a4bba4f43aa 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -17,6 +17,7 @@
>  #include <linux/fs.h>
>  #include <linux/iomap.h>
>  #include <linux/module.h>
> +#include <linux/splice.h>
>  #include "blk.h"
>  
>  static inline struct inode *bdev_file_inode(struct file *file)
> @@ -754,6 +755,30 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  	return ret;
>  }
>  
> +static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
> +				      struct file *file_out, loff_t pos_out,
> +				      size_t len, unsigned int flags)
> +{
> +	struct block_device *in_bdev = I_BDEV(bdev_file_inode(file_in));
> +	struct block_device *out_bdev = I_BDEV(bdev_file_inode(file_out));
> +	ssize_t copied = 0;
> +
> +	if ((in_bdev == out_bdev) && bdev_max_copy_sectors(in_bdev) &&
> +	    (file_in->f_iocb_flags & IOCB_DIRECT) &&
> +	    (file_out->f_iocb_flags & IOCB_DIRECT)) {
> +		copied = blkdev_copy_offload(in_bdev, pos_in, pos_out, len,
> +					     NULL, NULL, GFP_KERNEL);
> +		if (copied < 0)
> +			copied = 0;
> +	} else {
> +		copied = splice_copy_file_range(file_in, pos_in + copied,
> +						 file_out, pos_out + copied,
> +						 len - copied);
> +	}

This should not fall back to a page cache copy.

We keep being told by application developers that if the fast
hardware/filesystem offload fails, then an error should be returned
so the application can determine what the fallback operation should
be.

It may well be that the application falls back to "copy through the
page cache", but that is an application policy choice, not a
something the kernel offload driver should be making mandatory.

Userspace has to handle copy offload failure anyway, so they a
fallback path regardless of whether copy_file_range() works on block
devices or not...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-24 13:52           ` Bart Van Assche
@ 2024-05-27  8:27             ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-27  8:27 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Hannes Reinecke, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara, martin.petersen, david,
	damien.lemoal, anuj20.g, joshi.k, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 473 bytes --]

On 24/05/24 06:52AM, Bart Van Assche wrote:
>On 5/23/24 23:54, Nitesh Shetty wrote:
>>Regarding merge, does it looks any better, if we use single request
>>operation such as REQ_OP_COPY and use op_flags(REQ_COPY_DST/REQ_COPY_SRC)
>>to identify dst and src bios ?
>
>I prefer to keep the current approach (REQ_COPY_DST/REQ_COPY_SRC) and to
>use a more appropriate verb than "merge", e.g. "combine".
>
Thanks for the suggesion, will do this in next version.

--Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 06/12] fs, block: copy_file_range for def_blk_ops for direct block device
  2024-05-25 23:09       ` Dave Chinner
@ 2024-05-27  8:43         ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-27  8:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2599 bytes --]

On 26/05/24 09:09AM, Dave Chinner wrote:
>On Mon, May 20, 2024 at 03:50:19PM +0530, Nitesh Shetty wrote:
>> For direct block device opened with O_DIRECT, use blkdev_copy_offload to
>> issue device copy offload, or use splice_copy_file_range in case
>> device copy offload capability is absent or the device files are not open
>> with O_DIRECT.
>>
>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> ---
>>  block/fops.c | 26 ++++++++++++++++++++++++++
>>  1 file changed, 26 insertions(+)
>>
>> diff --git a/block/fops.c b/block/fops.c
>> index 376265935714..5a4bba4f43aa 100644
>> --- a/block/fops.c
>> +++ b/block/fops.c
>> @@ -17,6 +17,7 @@
>>  #include <linux/fs.h>
>>  #include <linux/iomap.h>
>>  #include <linux/module.h>
>> +#include <linux/splice.h>
>>  #include "blk.h"
>>
>>  static inline struct inode *bdev_file_inode(struct file *file)
>> @@ -754,6 +755,30 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>>  	return ret;
>>  }
>>
>> +static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
>> +				      struct file *file_out, loff_t pos_out,
>> +				      size_t len, unsigned int flags)
>> +{
>> +	struct block_device *in_bdev = I_BDEV(bdev_file_inode(file_in));
>> +	struct block_device *out_bdev = I_BDEV(bdev_file_inode(file_out));
>> +	ssize_t copied = 0;
>> +
>> +	if ((in_bdev == out_bdev) && bdev_max_copy_sectors(in_bdev) &&
>> +	    (file_in->f_iocb_flags & IOCB_DIRECT) &&
>> +	    (file_out->f_iocb_flags & IOCB_DIRECT)) {
>> +		copied = blkdev_copy_offload(in_bdev, pos_in, pos_out, len,
>> +					     NULL, NULL, GFP_KERNEL);
>> +		if (copied < 0)
>> +			copied = 0;
>> +	} else {
>> +		copied = splice_copy_file_range(file_in, pos_in + copied,
>> +						 file_out, pos_out + copied,
>> +						 len - copied);
>> +	}
>
>This should not fall back to a page cache copy.
>
>We keep being told by application developers that if the fast
>hardware/filesystem offload fails, then an error should be returned
>so the application can determine what the fallback operation should
>be.
>
>It may well be that the application falls back to "copy through the
>page cache", but that is an application policy choice, not a
>something the kernel offload driver should be making mandatory.
>
>Userspace has to handle copy offload failure anyway, so they a
>fallback path regardless of whether copy_file_range() works on block
>devices or not...
>
Makes sense, We will remove fallback part in next version.

Thank you,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 05/12] fs/read_write: Enable copy_file_range for block device.
  2024-05-25 23:02       ` Dave Chinner
@ 2024-05-28  5:57         ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-28  5:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2707 bytes --]

On 26/05/24 09:02AM, Dave Chinner wrote:
>On Mon, May 20, 2024 at 03:50:18PM +0530, Nitesh Shetty wrote:
>> From: Anuj Gupta <anuj20.g@samsung.com>
>>
>> This is a prep patch. Allow copy_file_range to work for block devices.
>> Relaxing generic_copy_file_checks allows us to reuse the existing infra,
>> instead of adding a new user interface for block copy offload.
>> Change generic_copy_file_checks to use ->f_mapping->host for both inode_in
>> and inode_out. Allow block device in generic_file_rw_checks.
>>
>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> ---
>>  fs/read_write.c | 8 +++++---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index ef6339391351..31645ca5ed58 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1413,8 +1413,8 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
>>  				    struct file *file_out, loff_t pos_out,
>>  				    size_t *req_count, unsigned int flags)
>>  {
>> -	struct inode *inode_in = file_inode(file_in);
>> -	struct inode *inode_out = file_inode(file_out);
>> +	struct inode *inode_in = file_in->f_mapping->host;
>> +	struct inode *inode_out = file_out->f_mapping->host;
>>  	uint64_t count = *req_count;
>>  	loff_t size_in;
>>  	int ret;
>
>Ok, so this changes from file->f_inode to file->mapping->host. No
>doubt this is because of how bdev inode mappings are munged.
>However, the first code that is run here is:
>
>	ret = generic_file_rw_checks(file_in, file_out);
>
>and that function still uses file_inode().
>
>Hence there checks:
>
>> @@ -1726,7 +1726,9 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
>>  	/* Don't copy dirs, pipes, sockets... */
>>  	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
>>  		return -EISDIR;
>> -	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
>> +	if (!S_ISREG(inode_in->i_mode) && !S_ISBLK(inode_in->i_mode))
>> +		return -EINVAL;
>> +	if ((inode_in->i_mode & S_IFMT) != (inode_out->i_mode & S_IFMT))
>>  		return -EINVAL;
>
>.... are being done on different inodes to the rest of
>generic_copy_file_checks() when block devices are used.
>
>Is this correct? If so, this needs a pair of comments (one for each
>function) to explain why the specific inode used for these functions
>is correct for block devices....
>
We were getting wrong size with file_inode() for block device, but we
missed to do it here in generic_file_rw_checks.
We will change the generic_file_rw_checks to use file->mapping->host
to make it consistent.

Thank You,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-21  7:01       ` Hannes Reinecke
  2024-05-24  6:54         ` Nitesh Shetty
       [not found]         ` <66503bc7.630a0220.56c85.8b9dSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2024-05-28 14:07         ` Bart Van Assche
  2 siblings, 0 replies; 89+ messages in thread
From: Bart Van Assche @ 2024-05-28 14:07 UTC (permalink / raw)
  To: Hannes Reinecke, Nitesh Shetty, Jens Axboe, Jonathan Corbet,
	Alasdair Kergon, Mike Snitzer, Mikulas Patocka, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
	Alexander Viro, Christian Brauner, Jan Kara
  Cc: martin.petersen, david, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 5/21/24 00:01, Hannes Reinecke wrote:
> On 5/20/24 12:20, Nitesh Shetty wrote:
>> We add two new opcode REQ_OP_COPY_DST, REQ_OP_COPY_SRC.
>> Since copy is a composite operation involving src and dst sectors/lba,
>> each needs to be represented by a separate bio to make it compatible
>> with device mapper.
>> We expect caller to take a plug and send bio with destination 
>> information,
>> followed by bio with source information.
>> Once the dst bio arrives we form a request and wait for source
>> bio. Upon arrival of source bio we merge these two bio's and send
>> corresponding request down to device driver.
>> Merging non copy offload bio is avoided by checking for copy specific
>> opcodes in merge function.
>>
> I am a bit unsure about leveraging 'merge' here. As Bart pointed out, 
> this is arguably as mis-use of the 'merge' functionality as we don't
> actually merge bios, but rather use the information from these bios to
> form the actual request.
> Wouldn't it be better to use bio_chain here, and send out the eventual
> request from the end_io function of the bio chain?

Let me formulate this a bit stronger: I think this patch series abuses
the merge functionality and also that it should use another mechanism
for combining REQ_OP_COPY_DST and REQ_OP_COPY_SRC. See also my email
with concerns about using the merge functionality:
https://lore.kernel.org/linux-block/eda6c198-3a29-4da4-94db-305cfe28d3d6@acm.org/.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-24 20:33       ` Bart Van Assche
@ 2024-05-29  6:17         ` Nitesh Shetty
  2024-05-29  7:48           ` Damien Le Moal
  0 siblings, 1 reply; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-29  6:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2001 bytes --]

On 24/05/24 01:33PM, Bart Van Assche wrote:
>On 5/20/24 03:20, Nitesh Shetty wrote:
>>We add two new opcode REQ_OP_COPY_DST, REQ_OP_COPY_SRC.
>>Since copy is a composite operation involving src and dst sectors/lba,
>>each needs to be represented by a separate bio to make it compatible
>>with device mapper.
>>We expect caller to take a plug and send bio with destination information,
>>followed by bio with source information.
>>Once the dst bio arrives we form a request and wait for source
>>bio. Upon arrival of source bio we merge these two bio's and send
>>corresponding request down to device driver.
>>Merging non copy offload bio is avoided by checking for copy specific
>>opcodes in merge function.
>
>In this patch I don't see any changes for blk_attempt_bio_merge(). Does
>this mean that combining REQ_OP_COPY_DST and REQ_OP_COPY_SRC will never
>happen if the QUEUE_FLAG_NOMERGES request queue flag has been set?
>
Yes, in this case copy won't work, as both src and dst bio reach driver
as part of separate requests.
We will add this as part of documentation.

>Can it happen that the REQ_NOMERGE flag is set by __bio_split_to_limits()
>for REQ_OP_COPY_DST or REQ_OP_COPY_SRC bios? Will this happen if the
>following condition is met?
>
>dst_bio->nr_phys_segs + src_bio->nr_phys_segs > max_segments
>
No, this should not happen. We don't use bio_split_rw for copy.
We have added a separate function to check for split incase of
copy(bio_split_copy), which doesn't allow copy bio splits,
hence REQ_NOMERGE flag won't be set.

>Is it allowed to set REQ_PREFLUSH or REQ_FUA for REQ_OP_COPY_DST or
>REQ_OP_COPY_SRC bios? I'm asking this because these flags disable merging.
>
>From include/linux/blk_types.h:
>
>#define REQ_NOMERGE_FLAGS (REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA)
>
No, setting these flags won't allow copy bio's to merge and hence copy
won't work.
We suggest to use helper API blkdev_copy_offload to achieve the
copy which won't be setting these flags.

Thank You,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-29  6:17         ` Nitesh Shetty
@ 2024-05-29  7:48           ` Damien Le Moal
  2024-05-29 22:41             ` Bart Van Assche
  2024-06-01  5:57             ` Christoph Hellwig
  0 siblings, 2 replies; 89+ messages in thread
From: Damien Le Moal @ 2024-05-29  7:48 UTC (permalink / raw)
  To: Nitesh Shetty, Bart Van Assche
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 5/29/24 15:17, Nitesh Shetty wrote:
> On 24/05/24 01:33PM, Bart Van Assche wrote:
>> On 5/20/24 03:20, Nitesh Shetty wrote:
>>> We add two new opcode REQ_OP_COPY_DST, REQ_OP_COPY_SRC.
>>> Since copy is a composite operation involving src and dst sectors/lba,
>>> each needs to be represented by a separate bio to make it compatible
>>> with device mapper.
>>> We expect caller to take a plug and send bio with destination information,
>>> followed by bio with source information.
>>> Once the dst bio arrives we form a request and wait for source
>>> bio. Upon arrival of source bio we merge these two bio's and send
>>> corresponding request down to device driver.
>>> Merging non copy offload bio is avoided by checking for copy specific
>>> opcodes in merge function.
>>
>> In this patch I don't see any changes for blk_attempt_bio_merge(). Does
>> this mean that combining REQ_OP_COPY_DST and REQ_OP_COPY_SRC will never
>> happen if the QUEUE_FLAG_NOMERGES request queue flag has been set?
>>
> Yes, in this case copy won't work, as both src and dst bio reach driver
> as part of separate requests.
> We will add this as part of documentation.

So that means that 2 major SAS HBAs which set this flag (megaraid and mpt3sas)
will not get support for copy offload ? Not ideal, by far.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-29  7:48           ` Damien Le Moal
@ 2024-05-29 22:41             ` Bart Van Assche
  2024-05-30  7:16               ` Nitesh Shetty
       [not found]               ` <665850bd.050a0220.a5e6b.5b72SMTPIN_ADDED_BROKEN@mx.google.com>
  2024-06-01  5:57             ` Christoph Hellwig
  1 sibling, 2 replies; 89+ messages in thread
From: Bart Van Assche @ 2024-05-29 22:41 UTC (permalink / raw)
  To: Damien Le Moal, Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 5/29/24 12:48 AM, Damien Le Moal wrote:
> On 5/29/24 15:17, Nitesh Shetty wrote:
>> On 24/05/24 01:33PM, Bart Van Assche wrote:
>>> On 5/20/24 03:20, Nitesh Shetty wrote:
>>>> We add two new opcode REQ_OP_COPY_DST, REQ_OP_COPY_SRC.
>>>> Since copy is a composite operation involving src and dst sectors/lba,
>>>> each needs to be represented by a separate bio to make it compatible
>>>> with device mapper.
>>>> We expect caller to take a plug and send bio with destination information,
>>>> followed by bio with source information.
>>>> Once the dst bio arrives we form a request and wait for source
>>>> bio. Upon arrival of source bio we merge these two bio's and send
>>>> corresponding request down to device driver.
>>>> Merging non copy offload bio is avoided by checking for copy specific
>>>> opcodes in merge function.
>>>
>>> In this patch I don't see any changes for blk_attempt_bio_merge(). Does
>>> this mean that combining REQ_OP_COPY_DST and REQ_OP_COPY_SRC will never
>>> happen if the QUEUE_FLAG_NOMERGES request queue flag has been set?
>>>
>> Yes, in this case copy won't work, as both src and dst bio reach driver
>> as part of separate requests.
>> We will add this as part of documentation.
> 
> So that means that 2 major SAS HBAs which set this flag (megaraid and mpt3sas)
> will not get support for copy offload ? Not ideal, by far.

QUEUE_FLAG_NOMERGES can also be set through sysfs (see also
queue_nomerges_store()). This is one of the reasons why using the merge
infrastructure for combining REQ_OP_COPY_DST and REQ_OP_COPY_SRC is
unacceptable.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-29 22:41             ` Bart Van Assche
@ 2024-05-30  7:16               ` Nitesh Shetty
       [not found]               ` <665850bd.050a0220.a5e6b.5b72SMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-30  7:16 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Damien Le Moal, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara, martin.petersen, david, hare,
	damien.lemoal, anuj20.g, joshi.k, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 7100 bytes --]

On 29/05/24 03:41PM, Bart Van Assche wrote:
>On 5/29/24 12:48 AM, Damien Le Moal wrote:
>>On 5/29/24 15:17, Nitesh Shetty wrote:
>>>On 24/05/24 01:33PM, Bart Van Assche wrote:
>>>>On 5/20/24 03:20, Nitesh Shetty wrote:
>>>>>We add two new opcode REQ_OP_COPY_DST, REQ_OP_COPY_SRC.
>>>>>Since copy is a composite operation involving src and dst sectors/lba,
>>>>>each needs to be represented by a separate bio to make it compatible
>>>>>with device mapper.
>>>>>We expect caller to take a plug and send bio with destination information,
>>>>>followed by bio with source information.
>>>>>Once the dst bio arrives we form a request and wait for source
>>>>>bio. Upon arrival of source bio we merge these two bio's and send
>>>>>corresponding request down to device driver.
>>>>>Merging non copy offload bio is avoided by checking for copy specific
>>>>>opcodes in merge function.
>>>>
>>>>In this patch I don't see any changes for blk_attempt_bio_merge(). Does
>>>>this mean that combining REQ_OP_COPY_DST and REQ_OP_COPY_SRC will never
>>>>happen if the QUEUE_FLAG_NOMERGES request queue flag has been set?
>>>>
>>>Yes, in this case copy won't work, as both src and dst bio reach driver
>>>as part of separate requests.
>>>We will add this as part of documentation.
>>
>>So that means that 2 major SAS HBAs which set this flag (megaraid and mpt3sas)
>>will not get support for copy offload ? Not ideal, by far.
>
>QUEUE_FLAG_NOMERGES can also be set through sysfs (see also
>queue_nomerges_store()). This is one of the reasons why using the merge
>infrastructure for combining REQ_OP_COPY_DST and REQ_OP_COPY_SRC is
>unacceptable.
>

Bart, Damien, Hannes,
Thanks for your review.
We tried a slightly modified approach which simplifies this patch and
also avoids merge path.
Also with this, we should be able to solve the QUEUE_FLAG_MERGES issue.
Previously we also tried payload/token based approach,
which avoids merge path and tries to combine bios in driver.
But we received feedback that it wasn't the right approach [1].
Do below changes look any better or do you guys have anything else in mind ?

[1] https://lore.kernel.org/linux-block/ZIKphgDavKVPREnw@infradead.org/


diff --git a/block/blk-core.c b/block/blk-core.c
index 82c3ae22d76d..7158bac8cc57 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -122,6 +122,8 @@ static const char *const blk_op_name[] = {
  	REQ_OP_NAME(ZONE_FINISH),
  	REQ_OP_NAME(ZONE_APPEND),
  	REQ_OP_NAME(WRITE_ZEROES),
+	REQ_OP_NAME(COPY_SRC),
+	REQ_OP_NAME(COPY_DST),
  	REQ_OP_NAME(DRV_IN),
  	REQ_OP_NAME(DRV_OUT),
  };
@@ -839,6 +841,11 @@ void submit_bio_noacct(struct bio *bio)
  		 * requests.
  		 */
  		fallthrough;
+	case REQ_OP_COPY_SRC:
+	case REQ_OP_COPY_DST:
+		if (!q->limits.max_copy_sectors)
+			goto not_supported;
+		break;
  	default:
  		goto not_supported;
  	}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8534c35e0497..a651e7c114d0 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -154,6 +154,20 @@ static struct bio *bio_split_write_zeroes(struct bio *bio,
  	return bio_split(bio, lim->max_write_zeroes_sectors, GFP_NOIO, bs);
  }
  
+static struct bio *bio_split_copy(struct bio *bio,
+				  const struct queue_limits *lim,
+				  unsigned int *nsegs)
+{
+	*nsegs = 1;
+	if (bio_sectors(bio) <= lim->max_copy_sectors)
+		return NULL;
+	/*
+	 * We don't support splitting for a copy bio. End it with EIO if
+	 * splitting is required and return an error pointer.
+	 */
+	return ERR_PTR(-EIO);
+}
+
  /*
   * Return the maximum number of sectors from the start of a bio that may be
   * submitted as a single request to a block device. If enough sectors remain,
@@ -362,6 +376,12 @@ struct bio *__bio_split_to_limits(struct bio *bio,
  	case REQ_OP_WRITE_ZEROES:
  		split = bio_split_write_zeroes(bio, lim, nr_segs, bs);
  		break;
+	case REQ_OP_COPY_SRC:
+	case REQ_OP_COPY_DST:
+		split = bio_split_copy(bio, lim, nr_segs);
+		if (IS_ERR(split))
+			return NULL;
+		break;
  	default:
  		split = bio_split_rw(bio, lim, nr_segs, bs,
  				get_max_io_size(bio, lim) << SECTOR_SHIFT);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3b4df8e5ac9e..6d4ffbdade28 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2833,6 +2833,63 @@ static void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
  		blk_mq_commit_rqs(hctx, queued, false);
  }
  
+/*
+ * Copy offload sends a pair of bio with REQ_OP_COPY_DST and REQ_OP_COPY_SRC
+ * operation by taking a plug.
+ * Initially DST bio is sent which forms a request and
+ * waits for SRC bio to arrive. Once SRC bio arrives
+ * we combine it and send request down to driver.
+ */
+static inline bool blk_copy_offload_combine_ok(struct request *req,
+					      struct bio *bio)
+{
+	return (req_op(req) == REQ_OP_COPY_DST &&
+		bio_op(bio) == REQ_OP_COPY_SRC);
+}
+
+static int blk_copy_offload_combine(struct request *req, struct bio *bio)
+{
+	if (!blk_copy_offload_combine_ok(req, bio))
+		return 1;
+
+	if (req->__data_len != bio->bi_iter.bi_size)
+		return 1;
+
+	req->biotail->bi_next = bio;
+	req->biotail = bio;
+	req->nr_phys_segments++;
+	req->__data_len += bio->bi_iter.bi_size;
+
+	return 0;
+}
+
+static inline bool blk_copy_offload_attempt_combine(struct request_queue *q,
+					     struct bio *bio)
+{
+	struct blk_plug *plug = current->plug;
+	struct request *rq;
+
+	if (!plug || rq_list_empty(plug->mq_list))
+		return false;
+
+	rq_list_for_each(&plug->mq_list, rq) {
+		if (rq->q == q) {
+			if (!blk_copy_offload_combine(rq, bio))
+				return true;
+			break;
+		}
+
+		/*
+		 * Only keep iterating plug list for combines if we have multiple
+		 * queues
+		 */
+		if (!plug->multiple_queues)
+			break;
+	}
+	return false;
+}
+
  static bool blk_mq_attempt_bio_merge(struct request_queue *q,
  				     struct bio *bio, unsigned int nr_segs)
  {
@@ -2977,6 +3034,9 @@ void blk_mq_submit_bio(struct bio *bio)
  	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
  		goto queue_exit;
  
+	if (blk_copy_offload_attempt_combine(q, bio))
+		goto queue_exit;
+
  	if (blk_queue_is_zoned(q) && blk_zone_plug_bio(bio, nr_segs))
  		goto queue_exit;
  
diff --git a/block/blk.h b/block/blk.h
index 189bc25beb50..112c6736f44c 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -323,6 +323,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
  	case REQ_OP_DISCARD:
  	case REQ_OP_SECURE_ERASE:
  	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_COPY_SRC:
+	case REQ_OP_COPY_DST:
  		return true; /* non-trivial splitting decisions */
  	default:
  		break;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 781c4500491b..22a08425d13e 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -342,6 +342,10 @@ enum req_op {
  	/* reset all the zone present on the device */
  	REQ_OP_ZONE_RESET_ALL	= (__force blk_opf_t)15,
  
+	/* copy offload source and destination operations */
+	REQ_OP_COPY_SRC		= (__force blk_opf_t)18,
+	REQ_OP_COPY_DST		= (__force blk_opf_t)19,
+
  	/* Driver private requests */
  	REQ_OP_DRV_IN		= (__force blk_opf_t)34,
  	REQ_OP_DRV_OUT		= (__force blk_opf_t)35,
--·
2.34.1


[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
       [not found]               ` <665850bd.050a0220.a5e6b.5b72SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2024-05-30 17:11                 ` Bart Van Assche
  2024-05-31 10:17                   ` Nitesh Shetty
                                     ` (2 more replies)
  0 siblings, 3 replies; 89+ messages in thread
From: Bart Van Assche @ 2024-05-30 17:11 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Damien Le Moal, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara, martin.petersen, david, hare,
	damien.lemoal, anuj20.g, joshi.k, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

On 5/30/24 00:16, Nitesh Shetty wrote:
> +static inline bool blk_copy_offload_attempt_combine(struct request_queue *q,
> +                         struct bio *bio)
> +{
> +    struct blk_plug *plug = current->plug;
> +    struct request *rq;
> +
> +    if (!plug || rq_list_empty(plug->mq_list))
> +        return false;
> +
> +    rq_list_for_each(&plug->mq_list, rq) {
> +        if (rq->q == q) {
> +            if (!blk_copy_offload_combine(rq, bio))
> +                return true;
> +            break;
> +        }
> +
> +        /*
> +         * Only keep iterating plug list for combines if we have multiple
> +         * queues
> +         */
> +        if (!plug->multiple_queues)
> +            break;
> +    }
> +    return false;
> +}

This new approach has the following two disadvantages:
* Without plug, REQ_OP_COPY_SRC and REQ_OP_COPY_DST are not combined. These two
   operation types are the only operation types for which not using a plug causes
   an I/O failure.
* A loop is required to combine the REQ_OP_COPY_SRC and REQ_OP_COPY_DST operations.

Please switch to the approach Hannes suggested, namely bio chaining. Chaining
REQ_OP_COPY_SRC and REQ_OP_COPY_DST bios before these are submitted eliminates the
two disadvantages mentioned above.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-30 17:11                 ` Bart Van Assche
@ 2024-05-31 10:17                   ` Nitesh Shetty
       [not found]                   ` <6659b691.630a0220.90195.d0ebSMTPIN_ADDED_BROKEN@mx.google.com>
  2024-06-01  5:59                   ` Christoph Hellwig
  2 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-05-31 10:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Damien Le Moal, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara, martin.petersen, david, hare,
	damien.lemoal, anuj20.g, joshi.k, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2060 bytes --]

On 30/05/24 10:11AM, Bart Van Assche wrote:
>On 5/30/24 00:16, Nitesh Shetty wrote:
>>+static inline bool blk_copy_offload_attempt_combine(struct request_queue *q,
>>+                         struct bio *bio)
>>+{
>>+    struct blk_plug *plug = current->plug;
>>+    struct request *rq;
>>+
>>+    if (!plug || rq_list_empty(plug->mq_list))
>>+        return false;
>>+
>>+    rq_list_for_each(&plug->mq_list, rq) {
>>+        if (rq->q == q) {
>>+            if (!blk_copy_offload_combine(rq, bio))
>>+                return true;
>>+            break;
>>+        }
>>+
>>+        /*
>>+         * Only keep iterating plug list for combines if we have multiple
>>+         * queues
>>+         */
>>+        if (!plug->multiple_queues)
>>+            break;
>>+    }
>>+    return false;
>>+}
>
>This new approach has the following two disadvantages:
>* Without plug, REQ_OP_COPY_SRC and REQ_OP_COPY_DST are not combined. These two
>  operation types are the only operation types for which not using a plug causes
>  an I/O failure.
>* A loop is required to combine the REQ_OP_COPY_SRC and REQ_OP_COPY_DST operations.
>
>Please switch to the approach Hannes suggested, namely bio chaining. Chaining
>REQ_OP_COPY_SRC and REQ_OP_COPY_DST bios before these are submitted eliminates the
>two disadvantages mentioned above.
>
Bart, Hannes,

I see the following challenges with bio-chained approach.
1. partitioned device:
	We need to add the code which iterates over all bios and adjusts
	the sectors offsets.
2. dm/stacked device:
	We need to make major changes in dm, such as allocating cloned
	bios, IO splits, IO offset mappings. All of which need to
	iterate over chained BIOs.

Overall with chained BIOs we need to add a special handling only for copy
to iterate over chained BIOs and do the same thing which is being done
for single BIO at present.
Or am I missing something here ?

Thank You,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
       [not found]                   ` <6659b691.630a0220.90195.d0ebSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2024-05-31 23:45                     ` Bart Van Assche
  0 siblings, 0 replies; 89+ messages in thread
From: Bart Van Assche @ 2024-05-31 23:45 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Damien Le Moal, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara, martin.petersen, david, hare,
	damien.lemoal, anuj20.g, joshi.k, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

On 5/31/24 03:17, Nitesh Shetty wrote:
> I see the following challenges with bio-chained approach.
> 1. partitioned device:
>      We need to add the code which iterates over all bios and adjusts
>      the sectors offsets.
> 2. dm/stacked device:
>      We need to make major changes in dm, such as allocating cloned
>      bios, IO splits, IO offset mappings. All of which need to
>      iterate over chained BIOs.
> 
> Overall with chained BIOs we need to add a special handling only for copy
> to iterate over chained BIOs and do the same thing which is being done
> for single BIO at present.
> Or am I missing something here ?

Hmm ... aren't chained bios submitted individually? See e.g.
bio_chain_and_submit(). In other words, it shouldn't be necessary to
add any code that iterates over bio chains.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 00/12] Implement copy offload support
  2024-05-20 10:20 ` [PATCH v20 00/12] Implement copy offload support Nitesh Shetty
                     ` (12 preceding siblings ...)
  2024-05-20 22:54   ` [PATCH v20 00/12] Implement copy offload support Bart Van Assche
@ 2024-06-01  5:47   ` Christoph Hellwig
  2024-06-03 10:53     ` Nitesh Shetty
  13 siblings, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2024-06-01  5:47 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

On Mon, May 20, 2024 at 03:50:13PM +0530, Nitesh Shetty wrote:
> So copy offload works only for request based storage drivers.

I don't think that is actually true.  It just requires a fair amount of
code in a bio based driver to match the bios up.

I'm missing any kind of information on what this patch set as-is
actually helps with.  What operations are sped up, for what operations
does it reduce resource usage?

Part of that might be that the included use case of offloading
copy_file_range doesn't seem particularly useful - on any advance
file system that would be done using reflinks anyway.

Have you considered hooking into dm-kcopyd which would be an
instant win instead?  Or into garbage collection in zoned or other
log structured file systems?  Those would probably really like
multiple source bios, though.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support
  2024-05-20 10:20     ` [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support Nitesh Shetty
                         ` (2 preceding siblings ...)
  2024-05-21  6:57       ` Hannes Reinecke
@ 2024-06-01  5:53       ` Christoph Hellwig
  2024-06-03  6:43         ` Nitesh Shetty
  3 siblings, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2024-06-01  5:53 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

On Mon, May 20, 2024 at 03:50:14PM +0530, Nitesh Shetty wrote:
> Add device limits as sysfs entries,
> 	- copy_max_bytes (RW)
> 	- copy_max_hw_bytes (RO)
> 
> Above limits help to split the copy payload in block layer.
> copy_max_bytes: maximum total length of copy in single payload.
> copy_max_hw_bytes: Reflects the device supported maximum limit.

That's a bit of a weird way to phrase the commit log as the queue_limits
are the main thing (and there are three of them as required for the
scheme to work).  The sysfs attributes really are just an artifact.

> @@ -231,10 +237,11 @@ int blk_set_default_limits(struct queue_limits *lim)
>  {
>  	/*
>  	 * Most defaults are set by capping the bounds in blk_validate_limits,
> -	 * but max_user_discard_sectors is special and needs an explicit
> -	 * initialization to the max value here.
> +	 * but max_user_discard_sectors and max_user_copy_sectors are special
> +	 * and needs an explicit initialization to the max value here.

s/needs/need/

> +/*
> + * blk_queue_max_copy_hw_sectors - set max sectors for a single copy payload
> + * @q:	the request queue for the device
> + * @max_copy_sectors: maximum number of sectors to copy
> + */
> +void blk_queue_max_copy_hw_sectors(struct request_queue *q,
> +				   unsigned int max_copy_sectors)
> +{
> +	struct queue_limits *lim = &q->limits;
> +
> +	if (max_copy_sectors > (BLK_COPY_MAX_BYTES >> SECTOR_SHIFT))
> +		max_copy_sectors = BLK_COPY_MAX_BYTES >> SECTOR_SHIFT;
> +
> +	lim->max_copy_hw_sectors = max_copy_sectors;
> +	lim->max_copy_sectors =
> +		min(max_copy_sectors, lim->max_user_copy_sectors);
> +}
> +EXPORT_SYMBOL_GPL(blk_queue_max_copy_hw_sectors);

Please don't add new blk_queue_* helpers, everything should go through
the atomic queue limits API now.  Also capping the hardware limit
here looks odd.

> +	if (max_copy_bytes & (queue_logical_block_size(q) - 1))
> +		return -EINVAL;

This should probably go into blk_validate_limits and just round down.

Also most block limits are in kb.  Not that I really know why we are
doing that, but is there a good reason to deviate from that scheme?


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-29  7:48           ` Damien Le Moal
  2024-05-29 22:41             ` Bart Van Assche
@ 2024-06-01  5:57             ` Christoph Hellwig
  1 sibling, 0 replies; 89+ messages in thread
From: Christoph Hellwig @ 2024-06-01  5:57 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Nitesh Shetty, Bart Van Assche, Jens Axboe, Jonathan Corbet,
	Alasdair Kergon, Mike Snitzer, Mikulas Patocka, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
	Alexander Viro, Christian Brauner, Jan Kara, martin.petersen,
	david, hare, damien.lemoal, anuj20.g, joshi.k, nitheshshetty,
	gost.dev, linux-block, linux-kernel, linux-doc, dm-devel,
	linux-nvme, linux-fsdevel

On Wed, May 29, 2024 at 04:48:18PM +0900, Damien Le Moal wrote:
> > Yes, in this case copy won't work, as both src and dst bio reach driver
> > as part of separate requests.
> > We will add this as part of documentation.
> 
> So that means that 2 major SAS HBAs which set this flag (megaraid and mpt3sas)
> will not get support for copy offload ? Not ideal, by far.

We really need to ignore the normerges flag for this.
As should we for the precedence in the discard merging.

And the drivers need to stop setting this flag as they have no business
to, but that's a separate discussion (and the flag can still be set
manually).


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-05-30 17:11                 ` Bart Van Assche
  2024-05-31 10:17                   ` Nitesh Shetty
       [not found]                   ` <6659b691.630a0220.90195.d0ebSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2024-06-01  5:59                   ` Christoph Hellwig
  2024-06-03 17:12                     ` Bart Van Assche
  2 siblings, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2024-06-01  5:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Nitesh Shetty, Damien Le Moal, Jens Axboe, Jonathan Corbet,
	Alasdair Kergon, Mike Snitzer, Mikulas Patocka, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
	Alexander Viro, Christian Brauner, Jan Kara, martin.petersen,
	david, hare, damien.lemoal, anuj20.g, joshi.k, nitheshshetty,
	gost.dev, linux-block, linux-kernel, linux-doc, dm-devel,
	linux-nvme, linux-fsdevel

On Thu, May 30, 2024 at 10:11:15AM -0700, Bart Van Assche wrote:
> This new approach has the following two disadvantages:
> * Without plug, REQ_OP_COPY_SRC and REQ_OP_COPY_DST are not combined. These two
>   operation types are the only operation types for which not using a plug causes
>   an I/O failure.

So?  We can clearly document that and even fail submission with a helpful
message trivially to enforce that.

> * A loop is required to combine the REQ_OP_COPY_SRC and REQ_OP_COPY_DST operations.

I don't see why that is a problem.  Especiallly once we allow multiple
sources which is really useful for GC workloads we'll need to do that
anyway.

> Please switch to the approach Hannes suggested, namely bio chaining. Chaining
> REQ_OP_COPY_SRC and REQ_OP_COPY_DST bios before these are submitted eliminates the
> two disadvantages mentioned above.

No.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 03/12] block: add copy offload support
  2024-05-20 10:20     ` [PATCH v20 03/12] block: add copy offload support Nitesh Shetty
@ 2024-06-01  6:16       ` Christoph Hellwig
  2024-06-04 10:50         ` Nitesh Shetty
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2024-06-01  6:16 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

> +/* Keeps track of all outstanding copy IO */
> +struct blkdev_copy_io {
> +	atomic_t refcount;
> +	ssize_t copied;
> +	int status;
> +	struct task_struct *waiter;
> +	void (*endio)(void *private, int status, ssize_t copied);
> +	void *private;
> +};
> +
> +/* Keeps track of single outstanding copy offload IO */
> +struct blkdev_copy_offload_io {
> +	struct blkdev_copy_io *cio;
> +	loff_t offset;
> +};

The structure names confuse me, and the comments make things even worse.

AFAICT:

blkdev_copy_io is a per-call structure, I'd name it blkdev_copy_ctx.
blkdev_copy_offload_io is per-bio pair, and something like blkdev_copy_chunk
might be a better idea.  Or we could just try to kill it entirely and add
a field to struct bio in the union currently holding the integrity
information.

I'm also quite confused what kind of offset this offset field is.  The
type and name suggest it is an offset in a file, which for a block device
based helper is pretty odd to start with.  blkdev_copy_offload
initializes it to len - rem, so it kind is an offset, but relative
to the operation and not to a file. blkdev_copy_offload_src_endio then
uses to set the ->copied field, but based on a min which means
->copied can only be decreased.  Something is really off there.

Taking about types and units: blkdev_copy_offload obviously can only
work in terms of LBAs.  Any reason to not make it work in terms of
512-byte block layer sectors instead of in bytes?

> +	if ((pos_in & align) || (pos_out & align) || (len & align) || !len ||
> +	    len >= BLK_COPY_MAX_BYTES)
> +		return -EINVAL;

This can be cleaned up an optimized a bit:

	if (!len || len >= BLK_COPY_MAX_BYTES)
		return -EINVAL;
	if ((pos_in | pos_out | len) & align)
		return -EINVAL;
	
> + *
> + * For synchronous operation returns the length of bytes copied or error
> + * For asynchronous operation returns -EIOCBQUEUED or error
> + *
> + * Description:
> + *	Copy source offset to destination offset within block device, using
> + *	device's native copy offload feature.
> + *	We perform copy operation using 2 bio's.
> + *	1. We take a plug and send a REQ_OP_COPY_DST bio along with destination
> + *	sector and length. Once this bio reaches request layer, we form a
> + *	request and wait for dst bio to arrive.
> + *	2. We issue REQ_OP_COPY_SRC bio along with source sector, length.
> + *	Once this bio reaches request layer and find a request with previously
> + *	sent destination info we merge the source bio and return.
> + *	3. Release the plug and request is sent to driver
> + *	This design works only for drivers with request queue.

The wording with all the We here is a bit odd.  Much of this also seem
superfluous or at least misplaced in the kernel doc comment as it doesn't
document the API, but just what is done in the code below.

> +	cio = kzalloc(sizeof(*cio), gfp);
> +	if (!cio)
> +		return -ENOMEM;
> +	atomic_set(&cio->refcount, 1);
> +	cio->waiter = current;
> +	cio->endio = endio;
> +	cio->private = private;

For the main use this could be allocated on-stack.  Is there any good
reason to not let callers that really want an async version to implement
the async behavior themselves using suitable helpers?

> +		src_bio = blk_next_bio(dst_bio, bdev, 0, REQ_OP_COPY_SRC, gfp);

Please switch to use bio_chain_and_submit, which is a easier to
understand API.  I'm trying to phase out blk_next_bio in favour of
bio_chain_and_submit over the next few merge windows.

> +		if (!src_bio)
> +			goto err_free_dst_bio;
> +		src_bio->bi_iter.bi_size = chunk;
> +		src_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
> +		src_bio->bi_end_io = blkdev_copy_offload_src_endio;
> +		src_bio->bi_private = offload_io;
> +
> +		atomic_inc(&cio->refcount);
> +		submit_bio(src_bio);
> +		blk_finish_plug(&plug);

plugs should be hold over all  I/Os, submitted from the same caller,
which is the point of them.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 04/12] block: add emulation for copy
  2024-05-20 10:20     ` [PATCH v20 04/12] block: add emulation for copy Nitesh Shetty
  2024-05-21  7:06       ` Hannes Reinecke
@ 2024-06-01  6:18       ` Christoph Hellwig
  1 sibling, 0 replies; 89+ messages in thread
From: Christoph Hellwig @ 2024-06-01  6:18 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Vincent Fu, linux-block,
	linux-kernel, linux-doc, dm-devel, linux-nvme, linux-fsdevel

On Mon, May 20, 2024 at 03:50:17PM +0530, Nitesh Shetty wrote:
> For the devices which does not support copy, copy emulation is added.
> It is required for in-kernel users like fabrics, where file descriptor is
> not available and hence they can't use copy_file_range.
> Copy-emulation is implemented by reading from source into memory and
> writing to the corresponding destination.
> At present in kernel user of emulation is fabrics.

I still don't see the point of offering this in the block layer,
at least in this form.  Caller usually can pre-allocate a buffer
if they need regular copies instead of doing constant allocation
and freeing which puts a lot of stress on the page allocator.

> +static void *blkdev_copy_alloc_buf(ssize_t req_size, ssize_t *alloc_size,
> +				   gfp_t gfp)
> +{
> +	int min_size = PAGE_SIZE;
> +	char *buf;
> +
> +	while (req_size >= min_size) {
> +		buf = kvmalloc(req_size, gfp);
> +		if (buf) {
> +			*alloc_size = req_size;
> +			return buf;
> +		}
> +		req_size >>= 1;
> +	}
> +
> +	return NULL;

And requiring a kernel mapping for data is is never used through the
kernel mapping is pretty silly as well.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 07/12] nvme: add copy offload support
  2024-05-20 10:20     ` [PATCH v20 07/12] nvme: add copy offload support Nitesh Shetty
@ 2024-06-01  6:22       ` Christoph Hellwig
  2024-06-03 11:43         ` Nitesh Shetty
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2024-06-01  6:22 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, Javier Gonz??lez, linux-block,
	linux-kernel, linux-doc, dm-devel, linux-nvme, linux-fsdevel

On Mon, May 20, 2024 at 03:50:20PM +0530, Nitesh Shetty wrote:
> +	if (blk_rq_nr_phys_segments(req) != BLK_COPY_MAX_SEGMENTS)
> +		return BLK_STS_IOERR;

This sounds like BLK_COPY_MAX_SEGMENTS is misnamed.  Right now this is
not a max segments, but the exact number of segments required.

>  /*
>   * Recommended frequency for KATO commands per NVMe 1.4 section 7.12.1:
> - * 
> + *

Please submit this whitespace fix separately.

> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 8b1edb46880a..1c5974bb23d5 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1287,6 +1287,7 @@ static inline unsigned int bdev_discard_granularity(struct block_device *bdev)
>  
>  /* maximum copy offload length, this is set to 128MB based on current testing */
>  #define BLK_COPY_MAX_BYTES		(1 << 27)
> +#define BLK_COPY_MAX_SEGMENTS		2

... and this doesn't belong into a NVMe patch.  I'd also expect that
the block layer would verify this before sending of the request to the driver.

> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index 425573202295..5275a0962a02 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h

Note that we've usually kept adding new protocol bits to nvme.h separate
from the implementation in the host or target code.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 11/12] null: Enable trace capability for null block
  2024-05-20 10:20     ` [PATCH v20 11/12] null: Enable trace capability for null block Nitesh Shetty
  2024-05-20 23:27       ` Bart Van Assche
@ 2024-06-01  6:23       ` Christoph Hellwig
  2024-06-03 11:04         ` Nitesh Shetty
  1 sibling, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2024-06-01  6:23 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

On Mon, May 20, 2024 at 03:50:24PM +0530, Nitesh Shetty wrote:
> This is a prep patch to enable copy trace capability.
> At present only zoned null_block is using trace, so we decoupled trace
> and zoned dependency to make it usable in null_blk driver also.

No need to mention the "prep patch", just state what you are doing.
Any this could just go out to Jens ASAP.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support
  2024-06-01  5:53       ` Christoph Hellwig
@ 2024-06-03  6:43         ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-06-03  6:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Alexander Viro, Christian Brauner, Jan Kara, martin.petersen,
	bvanassche, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2595 bytes --]

On 01/06/24 07:53AM, Christoph Hellwig wrote:
>On Mon, May 20, 2024 at 03:50:14PM +0530, Nitesh Shetty wrote:
>> Add device limits as sysfs entries,
>> 	- copy_max_bytes (RW)
>> 	- copy_max_hw_bytes (RO)
>>
>> Above limits help to split the copy payload in block layer.
>> copy_max_bytes: maximum total length of copy in single payload.
>> copy_max_hw_bytes: Reflects the device supported maximum limit.
>
>That's a bit of a weird way to phrase the commit log as the queue_limits
>are the main thing (and there are three of them as required for the
>scheme to work).  The sysfs attributes really are just an artifact.
>
Acked, we will update commit log.

>> @@ -231,10 +237,11 @@ int blk_set_default_limits(struct queue_limits *lim)
>>  {
>>  	/*
>>  	 * Most defaults are set by capping the bounds in blk_validate_limits,
>> -	 * but max_user_discard_sectors is special and needs an explicit
>> -	 * initialization to the max value here.
>> +	 * but max_user_discard_sectors and max_user_copy_sectors are special
>> +	 * and needs an explicit initialization to the max value here.
>
>s/needs/need/
>
Acked.

>> +/*
>> + * blk_queue_max_copy_hw_sectors - set max sectors for a single copy payload
>> + * @q:	the request queue for the device
>> + * @max_copy_sectors: maximum number of sectors to copy
>> + */
>> +void blk_queue_max_copy_hw_sectors(struct request_queue *q,
>> +				   unsigned int max_copy_sectors)
>> +{
>> +	struct queue_limits *lim = &q->limits;
>> +
>> +	if (max_copy_sectors > (BLK_COPY_MAX_BYTES >> SECTOR_SHIFT))
>> +		max_copy_sectors = BLK_COPY_MAX_BYTES >> SECTOR_SHIFT;
>> +
>> +	lim->max_copy_hw_sectors = max_copy_sectors;
>> +	lim->max_copy_sectors =
>> +		min(max_copy_sectors, lim->max_user_copy_sectors);
>> +}
>> +EXPORT_SYMBOL_GPL(blk_queue_max_copy_hw_sectors);
>
>Please don't add new blk_queue_* helpers, everything should go through
>the atomic queue limits API now.  Also capping the hardware limit
>here looks odd.
>
This was a mistake, we are not using this function. We will remove this
function in next version.

>> +	if (max_copy_bytes & (queue_logical_block_size(q) - 1))
>> +		return -EINVAL;
>
>This should probably go into blk_validate_limits and just round down.
>
Bart also pointed out similar thing. We do same check before issuing
copy offload, so we will remove this check.

>Also most block limits are in kb.  Not that I really know why we are
>doing that, but is there a good reason to deviate from that scheme?
>
We followed discard as a reference, but we can move to kb, if that helps
with overall readability.

Thank you,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 00/12] Implement copy offload support
  2024-06-01  5:47   ` Christoph Hellwig
@ 2024-06-03 10:53     ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-06-03 10:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Alexander Viro, Christian Brauner, Jan Kara, martin.petersen,
	bvanassche, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2475 bytes --]

On 01/06/24 07:47AM, Christoph Hellwig wrote:
>On Mon, May 20, 2024 at 03:50:13PM +0530, Nitesh Shetty wrote:
>> So copy offload works only for request based storage drivers.
>
>I don't think that is actually true.  It just requires a fair amount of
>code in a bio based driver to match the bios up.
>
>I'm missing any kind of information on what this patch set as-is
>actually helps with.  What operations are sped up, for what operations
>does it reduce resource usage?
>
The major benefit of this copy-offload/emulation framework is
observed in fabrics setup, for copy workloads across the network.
The host will send offload command over the network and actual copy
can be achieved using emulation on the target (hence patch 4).
This results in higher performance and lower network consumption,
as compared to read and write travelling across the network.
With this design of copy-offload/emulation we are able to see the
following improvements as compared to userspace read + write on a
NVMeOF TCP setup:

Setup1: Network Speed: 1000Mb/s
	Host PC: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
	Target PC: AMD Ryzen 9 5900X 12-Core Processor
	block size 8k:
	Improvement in IO BW from 106 MiB/s to 360 MiB/s
	Network utilisation drops from  97% to 6%.
	block-size 1M:
	Improvement in IO BW from 104 MiB/s to 2677 MiB/s
	Network utilisation drops from 92% to 0.66%.

Setup2: Network Speed: 100Gb/s
	Server: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 72 cores
	(host and target have the same configuration)
	block-size 8k:
	17.5% improvement in IO BW (794 MiB/s to 933 MiB/s).
	Network utilisation drops from  6.75% to 0.16%.

>Part of that might be that the included use case of offloading
>copy_file_range doesn't seem particularly useful - on any advance
>file system that would be done using reflinks anyway.
>
Instead of coining a new user interface just for copy,
we thought of using existing infra for plumbing.
When this series gets merged, we can add io-uring interface.

>Have you considered hooking into dm-kcopyd which would be an
>instant win instead?  Or into garbage collection in zoned or other
>log structured file systems?  Those would probably really like
>multiple source bios, though.
>
Our initial few version of the series had dm-kcopyd use case.
We dropped it, to make overall series lightweight and make it
easier to review and test.
When the current series gets merged, we will start adding
more in-kernel users in next phase.

Thank you,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 11/12] null: Enable trace capability for null block
  2024-06-01  6:23       ` Christoph Hellwig
@ 2024-06-03 11:04         ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-06-03 11:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Alexander Viro, Christian Brauner, Jan Kara, martin.petersen,
	bvanassche, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 568 bytes --]

On 01/06/24 08:23AM, Christoph Hellwig wrote:
>On Mon, May 20, 2024 at 03:50:24PM +0530, Nitesh Shetty wrote:
>> This is a prep patch to enable copy trace capability.
>> At present only zoned null_block is using trace, so we decoupled trace
>> and zoned dependency to make it usable in null_blk driver also.
>
>No need to mention the "prep patch", just state what you are doing.
Acked.
>Any this could just go out to Jens ASAP.
>
There is no user of trace apart from zoned and copy offload.
Hence this would not make sense as separate patch.

Thank you,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 07/12] nvme: add copy offload support
  2024-06-01  6:22       ` Christoph Hellwig
@ 2024-06-03 11:43         ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-06-03 11:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Alexander Viro, Christian Brauner, Jan Kara, martin.petersen,
	bvanassche, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, Javier Gonz??lez, linux-block,
	linux-kernel, linux-doc, dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1530 bytes --]

On 01/06/24 08:22AM, Christoph Hellwig wrote:
>On Mon, May 20, 2024 at 03:50:20PM +0530, Nitesh Shetty wrote:
>> +	if (blk_rq_nr_phys_segments(req) != BLK_COPY_MAX_SEGMENTS)
>> +		return BLK_STS_IOERR;
>
>This sounds like BLK_COPY_MAX_SEGMENTS is misnamed.  Right now this is
>not a max segments, but the exact number of segments required.
>
We will move this check to block layer, with name
BLK_COPY_TOTAL_SEGMENTS.

>>  /*
>>   * Recommended frequency for KATO commands per NVMe 1.4 section 7.12.1:
>> - *
>> + *
>
>Please submit this whitespace fix separately.
>
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 8b1edb46880a..1c5974bb23d5 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1287,6 +1287,7 @@ static inline unsigned int bdev_discard_granularity(struct block_device *bdev)
>>
>>  /* maximum copy offload length, this is set to 128MB based on current testing */
>>  #define BLK_COPY_MAX_BYTES		(1 << 27)
>> +#define BLK_COPY_MAX_SEGMENTS		2
>
>... and this doesn't belong into a NVMe patch.  I'd also expect that
>the block layer would verify this before sending of the request to the driver.
>
Acked

>> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
>> index 425573202295..5275a0962a02 100644
>> --- a/include/linux/nvme.h
>> +++ b/include/linux/nvme.h
>
>Note that we've usually kept adding new protocol bits to nvme.h separate
>from the implementation in the host or target code.
>
Acked, will move it to a separate patch.

Thank you,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-06-01  5:59                   ` Christoph Hellwig
@ 2024-06-03 17:12                     ` Bart Van Assche
  2024-06-04  4:40                       ` Christoph Hellwig
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2024-06-03 17:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nitesh Shetty, Damien Le Moal, Jens Axboe, Jonathan Corbet,
	Alasdair Kergon, Mike Snitzer, Mikulas Patocka, Keith Busch,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara, martin.petersen, david, hare,
	damien.lemoal, anuj20.g, joshi.k, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

On 5/31/24 22:59, Christoph Hellwig wrote:
> On Thu, May 30, 2024 at 10:11:15AM -0700, Bart Van Assche wrote:
>> This new approach has the following two disadvantages:
>> * Without plug, REQ_OP_COPY_SRC and REQ_OP_COPY_DST are not combined. These two
>>    operation types are the only operation types for which not using a plug causes
>>    an I/O failure.
> 
> So?  We can clearly document that and even fail submission with a helpful
> message trivially to enforce that.

Consider the following use case:
* Task A calls blk_start_plug()
* Task B calls blk_start_plug()
* Task A submits a REQ_OP_COPY_DST bio and a REQ_OP_COPY_SRC bio.
* Task B submits a REQ_OP_COPY_DST bio and a REQ_OP_COPY_SRC bio.
* The stacking driver to which all REQ_OP_COPY_* operations have been
   submitted processes bios asynchronusly.
* Task A calls blk_finish_plug()
* Task B calls blk_finish_plug()
* The REQ_OP_COPY_DST bio from task A and the REQ_OP_COPY_SRC bio from
   task B are combined into a single request.
* The REQ_OP_COPY_DST bio from task B and the REQ_OP_COPY_SRC bio from
   task A are combined into a single request.

This results in silent and hard-to-debug data corruption. Do you agree
that we should not restrict copy offloading to stacking drivers that
process bios synchronously and also that this kind of data corruption
should be prevented?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support
@ 2024-06-04  4:31 Christoph Hellwig
  2024-06-04  7:05 ` Hannes Reinecke
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2024-06-04  4:31 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Christoph Hellwig, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, bvanassche, david, hare, damien.lemoal, anuj20.g,
	joshi.k, nitheshshetty, gost.dev, linux-block, linux-kernel,
	linux-doc, dm-devel, linux-nvme, linux-fsdevel

On Mon, Jun 03, 2024 at 06:43:56AM +0000, Nitesh Shetty wrote:
>> Also most block limits are in kb.  Not that I really know why we are
>> doing that, but is there a good reason to deviate from that scheme?
>>
> We followed discard as a reference, but we can move to kb, if that helps
> with overall readability.

I'm not really sure what is better.  Does anyone remember why we did
the _kb version?  Either way some amount of consistency would be nice.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-06-03 17:12                     ` Bart Van Assche
@ 2024-06-04  4:40                       ` Christoph Hellwig
  2024-06-04 11:44                         ` Bart Van Assche
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2024-06-04  4:40 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Nitesh Shetty, Damien Le Moal, Jens Axboe,
	Jonathan Corbet, Alasdair Kergon, Mike Snitzer, Mikulas Patocka,
	Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara, martin.petersen, david, hare,
	damien.lemoal, anuj20.g, joshi.k, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

On Mon, Jun 03, 2024 at 10:12:48AM -0700, Bart Van Assche wrote:
> Consider the following use case:
> * Task A calls blk_start_plug()
> * Task B calls blk_start_plug()
> * Task A submits a REQ_OP_COPY_DST bio and a REQ_OP_COPY_SRC bio.
> * Task B submits a REQ_OP_COPY_DST bio and a REQ_OP_COPY_SRC bio.
> * The stacking driver to which all REQ_OP_COPY_* operations have been
>   submitted processes bios asynchronusly.
> * Task A calls blk_finish_plug()
> * Task B calls blk_finish_plug()
> * The REQ_OP_COPY_DST bio from task A and the REQ_OP_COPY_SRC bio from
>   task B are combined into a single request.
> * The REQ_OP_COPY_DST bio from task B and the REQ_OP_COPY_SRC bio from
>   task A are combined into a single request.
>
> This results in silent and hard-to-debug data corruption. Do you agree
> that we should not restrict copy offloading to stacking drivers that
> process bios synchronously and also that this kind of data corruption
> should be prevented?

There is no requirement to process them synchronously, there is just
a requirement to preserve the order.  Note that my suggestion a few
arounds ago also included a copy id to match them up.  If we don't
need that I'm happy to leave it away.  If need it it to make stacking
drivers' lifes easier that suggestion still stands.

>
> Thanks,
>
> Bart.
---end quoted text---

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support
  2024-06-04  4:31 [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support Christoph Hellwig
@ 2024-06-04  7:05 ` Hannes Reinecke
  2024-06-05  8:17   ` Christoph Hellwig
  0 siblings, 1 reply; 89+ messages in thread
From: Hannes Reinecke @ 2024-06-04  7:05 UTC (permalink / raw)
  To: Christoph Hellwig, Nitesh Shetty
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Alexander Viro, Christian Brauner, Jan Kara, martin.petersen,
	bvanassche, david, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 6/4/24 06:31, Christoph Hellwig wrote:
> On Mon, Jun 03, 2024 at 06:43:56AM +0000, Nitesh Shetty wrote:
>>> Also most block limits are in kb.  Not that I really know why we are
>>> doing that, but is there a good reason to deviate from that scheme?
>>>
>> We followed discard as a reference, but we can move to kb, if that helps
>> with overall readability.
> 
> I'm not really sure what is better.  Does anyone remember why we did
> the _kb version?  Either way some amount of consistency would be nice.
> 
If memory serves correctly we introduced the _kb versions as a 
convenience to the user; exposing values in 512 bytes increments tended
to be confusing, especially when it comes to LBA values (is the size in 
units of hardware sector size? 512 increments? kilobytes?)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 03/12] block: add copy offload support
  2024-06-01  6:16       ` Christoph Hellwig
@ 2024-06-04 10:50         ` Nitesh Shetty
  0 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-06-04 10:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Alexander Viro, Christian Brauner, Jan Kara, martin.petersen,
	bvanassche, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 5167 bytes --]

On 01/06/24 08:16AM, Christoph Hellwig wrote:
>> +/* Keeps track of all outstanding copy IO */
>> +struct blkdev_copy_io {
>> +	atomic_t refcount;
>> +	ssize_t copied;
>> +	int status;
>> +	struct task_struct *waiter;
>> +	void (*endio)(void *private, int status, ssize_t copied);
>> +	void *private;
>> +};
>> +
>> +/* Keeps track of single outstanding copy offload IO */
>> +struct blkdev_copy_offload_io {
>> +	struct blkdev_copy_io *cio;
>> +	loff_t offset;
>> +};
>
>The structure names confuse me, and the comments make things even worse.
>
>AFAICT:
>
>blkdev_copy_io is a per-call structure, I'd name it blkdev_copy_ctx.
>blkdev_copy_offload_io is per-bio pair, and something like blkdev_copy_chunk
Acked, your suggestion for structure name looks better.

>might be a better idea.  Or we could just try to kill it entirely and add
>a field to struct bio in the union currently holding the integrity
>information.
We will explore this.

>I'm also quite confused what kind of offset this offset field is.  The
>type and name suggest it is an offset in a file, which for a block device
>based helper is pretty odd to start with.  blkdev_copy_offload
>initializes it to len - rem, so it kind is an offset, but relative
>to the operation and not to a file. blkdev_copy_offload_src_endio then
>uses to set the ->copied field, but based on a min which means
>->copied can only be decreased.  Something is really off there.
>
Offset in this context, is with respect to the operation.
Overall idea was to handle partial copy, where in some of the split copy IO fails.
In this case we want to return minimum bytes copied.
We can try to store the offset in a temporary variable similar to
pos_out, pos_in instead of current (len - rem), to avoid the confusion.
We will update this in next version.

>Taking about types and units: blkdev_copy_offload obviously can only
>work in terms of LBAs.  Any reason to not make it work in terms of
>512-byte block layer sectors instead of in bytes?
>
Just that number of places where we need to sector shift were
comparatively more. We will update this to 512-byte sectors in next
version.

>> +	if ((pos_in & align) || (pos_out & align) || (len & align) || !len ||
>> +	    len >= BLK_COPY_MAX_BYTES)
>> +		return -EINVAL;
>
>This can be cleaned up an optimized a bit:
>
>	if (!len || len >= BLK_COPY_MAX_BYTES)
>		return -EINVAL;
>	if ((pos_in | pos_out | len) & align)
>		return -EINVAL;
>	
Acked.

>> + *
>> + * For synchronous operation returns the length of bytes copied or error
>> + * For asynchronous operation returns -EIOCBQUEUED or error
>> + *
>> + * Description:
>> + *	Copy source offset to destination offset within block device, using
>> + *	device's native copy offload feature.
>> + *	We perform copy operation using 2 bio's.
>> + *	1. We take a plug and send a REQ_OP_COPY_DST bio along with destination
>> + *	sector and length. Once this bio reaches request layer, we form a
>> + *	request and wait for dst bio to arrive.
>> + *	2. We issue REQ_OP_COPY_SRC bio along with source sector, length.
>> + *	Once this bio reaches request layer and find a request with previously
>> + *	sent destination info we merge the source bio and return.
>> + *	3. Release the plug and request is sent to driver
>> + *	This design works only for drivers with request queue.
>
>The wording with all the We here is a bit odd.  Much of this also seem
>superfluous or at least misplaced in the kernel doc comment as it doesn't
>document the API, but just what is done in the code below.
>
Since we were doing IO in unconventional way, we felt would be better to
document this, for easy followup.
We will remove this in next version and document just API.

>> +	cio = kzalloc(sizeof(*cio), gfp);
>> +	if (!cio)
>> +		return -ENOMEM;
>> +	atomic_set(&cio->refcount, 1);
>> +	cio->waiter = current;
>> +	cio->endio = endio;
>> +	cio->private = private;
>
>For the main use this could be allocated on-stack.  Is there any good
>reason to not let callers that really want an async version to implement
>the async behavior themselves using suitable helpers?
>
We cannot do on-stack allocation of cio as we use it in endio handler.
cio will be used to track partial IO completion as well.
Callers requiring async implementation would need to manage all this
bookkeeping themselves, leading to duplication of code. We felt it is
better to do it here onetime.
Do you see it any differently ?

>> +		src_bio = blk_next_bio(dst_bio, bdev, 0, REQ_OP_COPY_SRC, gfp);
>
>Please switch to use bio_chain_and_submit, which is a easier to
>understand API.  I'm trying to phase out blk_next_bio in favour of
>bio_chain_and_submit over the next few merge windows.
>
Acked

>> +		if (!src_bio)
>> +			goto err_free_dst_bio;
>> +		src_bio->bi_iter.bi_size = chunk;
>> +		src_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
>> +		src_bio->bi_end_io = blkdev_copy_offload_src_endio;
>> +		src_bio->bi_private = offload_io;
>> +
>> +		atomic_inc(&cio->refcount);
>> +		submit_bio(src_bio);
>> +		blk_finish_plug(&plug);
>
>plugs should be hold over all  I/Os, submitted from the same caller,
>which is the point of them.
>
Acked

Thank You,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-06-04  4:40                       ` Christoph Hellwig
@ 2024-06-04 11:44                         ` Bart Van Assche
  2024-06-05  8:20                           ` Christoph Hellwig
       [not found]                           ` <CGME20240606072827epcas5p285de8d4f3b0f6d3a87f8341414336b42@epcas5p2.samsung.com>
  0 siblings, 2 replies; 89+ messages in thread
From: Bart Van Assche @ 2024-06-04 11:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nitesh Shetty, Damien Le Moal, Jens Axboe, Jonathan Corbet,
	Alasdair Kergon, Mike Snitzer, Mikulas Patocka, Keith Busch,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara, martin.petersen, david, hare,
	damien.lemoal, anuj20.g, joshi.k, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

On 6/3/24 21:40, Christoph Hellwig wrote:
> There is no requirement to process them synchronously, there is just
> a requirement to preserve the order.  Note that my suggestion a few
> arounds ago also included a copy id to match them up.  If we don't
> need that I'm happy to leave it away.  If need it it to make stacking
> drivers' lifes easier that suggestion still stands.

Including an ID in REQ_OP_COPY_DST and REQ_OP_COPY_SRC operations sounds
much better to me than abusing the merge infrastructure for combining
these two operations into a single request. With the ID-based approach
stacking drivers are allowed to process copy bios asynchronously and it
is no longer necessary to activate merging for copy operations if
merging is disabled (QUEUE_FLAG_NOMERGES).

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support
  2024-06-04  7:05 ` Hannes Reinecke
@ 2024-06-05  8:17   ` Christoph Hellwig
  0 siblings, 0 replies; 89+ messages in thread
From: Christoph Hellwig @ 2024-06-05  8:17 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Nitesh Shetty, Jens Axboe, Jonathan Corbet,
	Alasdair Kergon, Mike Snitzer, Mikulas Patocka, Keith Busch,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara, martin.petersen, bvanassche, david,
	damien.lemoal, anuj20.g, joshi.k, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

On Tue, Jun 04, 2024 at 09:05:03AM +0200, Hannes Reinecke wrote:
> On 6/4/24 06:31, Christoph Hellwig wrote:
>> On Mon, Jun 03, 2024 at 06:43:56AM +0000, Nitesh Shetty wrote:
>>>> Also most block limits are in kb.  Not that I really know why we are
>>>> doing that, but is there a good reason to deviate from that scheme?
>>>>
>>> We followed discard as a reference, but we can move to kb, if that helps
>>> with overall readability.
>>
>> I'm not really sure what is better.  Does anyone remember why we did
>> the _kb version?  Either way some amount of consistency would be nice.
>>
> If memory serves correctly we introduced the _kb versions as a convenience 
> to the user; exposing values in 512 bytes increments tended
> to be confusing, especially when it comes to LBA values (is the size in 
> units of hardware sector size? 512 increments? kilobytes?)

Maybe.  In the meantime I did a bit more of research, and only
max_sectors and max_hw_sectors are reported in kb.  chunk_sectors is
reported in 512 byte sectors, and everything else is reported in bytes.

So sticking to bytes is probably right, and I was wrong about "most block
limits above".  Sorry.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-06-04 11:44                         ` Bart Van Assche
@ 2024-06-05  8:20                           ` Christoph Hellwig
       [not found]                             ` <CGME20240624105121epcas5p3a5a8c73bd5ef19c02e922e5829a4dff0@epcas5p3.samsung.com>
       [not found]                           ` <CGME20240606072827epcas5p285de8d4f3b0f6d3a87f8341414336b42@epcas5p2.samsung.com>
  1 sibling, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2024-06-05  8:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Nitesh Shetty, Damien Le Moal, Jens Axboe,
	Jonathan Corbet, Alasdair Kergon, Mike Snitzer, Mikulas Patocka,
	Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara, martin.petersen, david, hare,
	damien.lemoal, anuj20.g, joshi.k, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

On Tue, Jun 04, 2024 at 04:44:34AM -0700, Bart Van Assche wrote:
> On 6/3/24 21:40, Christoph Hellwig wrote:
>> There is no requirement to process them synchronously, there is just
>> a requirement to preserve the order.  Note that my suggestion a few
>> arounds ago also included a copy id to match them up.  If we don't
>> need that I'm happy to leave it away.  If need it it to make stacking
>> drivers' lifes easier that suggestion still stands.
>
> Including an ID in REQ_OP_COPY_DST and REQ_OP_COPY_SRC operations sounds
> much better to me than abusing the merge infrastructure for combining
> these two operations into a single request. With the ID-based approach
> stacking drivers are allowed to process copy bios asynchronously and it
> is no longer necessary to activate merging for copy operations if
> merging is disabled (QUEUE_FLAG_NOMERGES).

Again, we can decided on QUEUE_FLAG_NOMERGES per request type.  In fact
I think we should not use it for discards as that just like copy
is a very different kind of "merge".

I'm in fact much more happy about avoiding the copy_id IFF we can.  It
it a fair amout of extra overhead, so we should only add it if there
is a real need for it

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
       [not found]                           ` <CGME20240606072827epcas5p285de8d4f3b0f6d3a87f8341414336b42@epcas5p2.samsung.com>
@ 2024-06-06  7:28                             ` Nitesh Shetty
       [not found]                             ` <66618886.630a0220.4d4fc.1c9cSMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-06-06  7:28 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Damien Le Moal, Jens Axboe, Jonathan Corbet,
	Alasdair Kergon, Mike Snitzer, Mikulas Patocka, Keith Busch,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara, martin.petersen, david, hare,
	damien.lemoal, anuj20.g, joshi.k, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2819 bytes --]

On 04/06/24 04:44AM, Bart Van Assche wrote:
>On 6/3/24 21:40, Christoph Hellwig wrote:
>>There is no requirement to process them synchronously, there is just
>>a requirement to preserve the order.  Note that my suggestion a few
>>arounds ago also included a copy id to match them up.  If we don't
>>need that I'm happy to leave it away.  If need it it to make stacking
>>drivers' lifes easier that suggestion still stands.
>
>Including an ID in REQ_OP_COPY_DST and REQ_OP_COPY_SRC operations sounds
>much better to me than abusing the merge infrastructure for combining
>these two operations into a single request. With the ID-based approach
>stacking drivers are allowed to process copy bios asynchronously and it
>is no longer necessary to activate merging for copy operations if
>merging is disabled (QUEUE_FLAG_NOMERGES).
>
Single request, with bio merging approach:
The current approach is to send a single request to driver,
which contains both destination and source information inside separate bios.
Do you have any different approach in mind ?

If we want to proceed with this single request based approach,
we need to merge the destination request with source BIOs at some point.
a. We chose to do it via plug approach.
b. Alternative I see is scheduler merging, but here we need some way to
hold the request which has destination info, until source bio is also
submitted.
c. Is there any other way, which I am missing here ?

Limitation of current plug based approach:
I missed the possibility of asynchronous submission by stacked device.
Since we enabled only dm-linear, we had synchronous submission till now
and our test cases worked fine.
But in future if we start enabling dm targets with asynchronous submission,
the current plug based approach won't work.
The case where Bart mentioned possibility of 2 different tasks sending
copy[1] and they are getting merged wrongly is valid in this case.
There will be corruption, copy ID approach can solve this wrong merging.

Copy ID based merging might preserve the order, but we still need the
copy destination request to wait for copy source bio to merge.

Copy ID approach:
We see 3 possibilities here:
1. No merging: If we include copy-id in src and dst bio, the bio's will get
submitted separately and reach to the driver as separate requests.
How do we plan to form a copy command in driver ?
2. Merging BIOs:
At some point we need to match the src bio with the dst bio and send this
information together to the driver. The current implementation.
This still does not solve the asynchronous submission problem, mentioned
above.
3. Chaining BIOs:
This won't work with stacked devices as there will be cloning, and hence
chain won't be maintained.

[1] https://lore.kernel.org/all/d7ae00c8-c038-4bed-937e-222251bc627a@acm.org/

Thank You,
Nitesh Shetty

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
       [not found]                             ` <66618886.630a0220.4d4fc.1c9cSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2024-06-06 16:44                               ` Bart Van Assche
  0 siblings, 0 replies; 89+ messages in thread
From: Bart Van Assche @ 2024-06-06 16:44 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Christoph Hellwig, Damien Le Moal, Jens Axboe, Jonathan Corbet,
	Alasdair Kergon, Mike Snitzer, Mikulas Patocka, Keith Busch,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara, martin.petersen, david, hare,
	damien.lemoal, anuj20.g, joshi.k, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

On 6/6/24 01:28, Nitesh Shetty wrote:
> On 04/06/24 04:44AM, Bart Van Assche wrote:
>> On 6/3/24 21:40, Christoph Hellwig wrote:
>>> There is no requirement to process them synchronously, there is just
>>> a requirement to preserve the order.  Note that my suggestion a few
>>> arounds ago also included a copy id to match them up.  If we don't
>>> need that I'm happy to leave it away.  If need it it to make stacking
>>> drivers' lifes easier that suggestion still stands.
>>
>> Including an ID in REQ_OP_COPY_DST and REQ_OP_COPY_SRC operations sounds
>> much better to me than abusing the merge infrastructure for combining
>> these two operations into a single request. With the ID-based approach
>> stacking drivers are allowed to process copy bios asynchronously and it
>> is no longer necessary to activate merging for copy operations if
>> merging is disabled (QUEUE_FLAG_NOMERGES).
>>
> Single request, with bio merging approach:
> The current approach is to send a single request to driver,
> which contains both destination and source information inside separate bios.
> Do you have any different approach in mind ?

No. I did not propose to change how copy offload requests are sent to block
drivers (other than stacking drivers).

> If we want to proceed with this single request based approach,
> we need to merge the destination request with source BIOs at some point.
> a. We chose to do it via plug approach.
> b. Alternative I see is scheduler merging, but here we need some way to
> hold the request which has destination info, until source bio is also
> submitted.
> c. Is there any other way, which I am missing here ?

There are already exceptions in blk_mq_submit_bio() for zoned writes and for
flush bios. Another exception could be added for REQ_OP_COPY_* bios. I'm not
claiming that this is the best possible alternative. I'm only mentioning this
to show that there are alternatives.

> Copy ID approach:
> We see 3 possibilities here:
> 1. No merging: If we include copy-id in src and dst bio, the bio's will get
> submitted separately and reach to the driver as separate requests.
> How do we plan to form a copy command in driver ?
> 2. Merging BIOs:
> At some point we need to match the src bio with the dst bio and send this
> information together to the driver. The current implementation.
> This still does not solve the asynchronous submission problem, mentioned
> above.
> 3. Chaining BIOs:
> This won't work with stacked devices as there will be cloning, and hence
> chain won't be maintained.

I prefer option (2). Option (1) could result in a deadlock because the source
and destination bio both would have to be converted into a request before
these are sent to a request-based driver.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
       [not found]                             ` <CGME20240624105121epcas5p3a5a8c73bd5ef19c02e922e5829a4dff0@epcas5p3.samsung.com>
@ 2024-06-24 10:44                               ` Nitesh Shetty
       [not found]                               ` <6679526f.170a0220.9ffd.aefaSMTPIN_ADDED_BROKEN@mx.google.com>
       [not found]                               ` <66795280.630a0220.f3ccd.b80cSMTPIN_ADDED_BROKEN@mx.google.com>
  2 siblings, 0 replies; 89+ messages in thread
From: Nitesh Shetty @ 2024-06-24 10:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Damien Le Moal, Jens Axboe, Jonathan Corbet,
	Alasdair Kergon, Mike Snitzer, Mikulas Patocka, Keith Busch,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara, martin.petersen, david, hare,
	damien.lemoal, anuj20.g, joshi.k, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

[-- Attachment #1: ^[ --]
[-- Type: text/plain, Size: 3492 bytes --]

On 05/06/24 10:20AM, Christoph Hellwig wrote:
>On Tue, Jun 04, 2024 at 04:44:34AM -0700, Bart Van Assche wrote:
>> On 6/3/24 21:40, Christoph Hellwig wrote:
>>> There is no requirement to process them synchronously, there is just
>>> a requirement to preserve the order.  Note that my suggestion a few
>>> arounds ago also included a copy id to match them up.  If we don't
>>> need that I'm happy to leave it away.  If need it it to make stacking
>>> drivers' lifes easier that suggestion still stands.
>>
>> Including an ID in REQ_OP_COPY_DST and REQ_OP_COPY_SRC operations sounds
>> much better to me than abusing the merge infrastructure for combining
>> these two operations into a single request. With the ID-based approach
>> stacking drivers are allowed to process copy bios asynchronously and it
>> is no longer necessary to activate merging for copy operations if
>> merging is disabled (QUEUE_FLAG_NOMERGES).
>
>Again, we can decided on QUEUE_FLAG_NOMERGES per request type.  In fact
>I think we should not use it for discards as that just like copy
>is a very different kind of "merge".
>
>I'm in fact much more happy about avoiding the copy_id IFF we can.  It
>it a fair amout of extra overhead, so we should only add it if there
>is a real need for it

Christoph, Martin, Bart, Hannes, Damien

We have iterated over couple of designs for copy offload, but ended up
with changing them owing to some drawbacks. I would like to take your
opinion on how we can plumb copy offload in block and dm layer.

For reference, I have listed the approaches we have taken in the past.

a. Token/payload based approach:
1. Here we allocate a buffer/payload.
2. First source BIO is sent along with the buffer.
3. Once the buffer reaches driver, it is filled with the source LBA
and length and namespace info. And the request is completed.
4. Then destination BIO is sent with same buffer.
5. Once the buffer reaches driver, it retrieves the source information from
the BIO and forms a copy command and sends it down to device.

We received feedback that putting anything inside payload which is not
data, is not a good idea[1].

b. Plug based approach:
1. We take a plug.
2. Send a destination BIO, this forms a copy request and waits for source BIO
to arrive.
3. Send a source BIO, this merges with the copy request which was formed
by destination BIO.
4. We release the plug, then copy request reaches driver layer and forms
a copy command.

This approach won't work with stacked devices which has asynchronous
submission.
Overall taking plug and merging BIOs received not so good feedback from
community.

c. List/ctx based approach:
A new member is added to bio, bio_copy_ctx, which will a union with
bi_integrity. Idea is once a copy bio reaches blk_mq_submit_bio, it will
add the bio to this list.
1. Send the destination BIO, once this reaches blk_mq_submit_bio, this
will add the destination BIO to the list inside bi_copy_ctx and return
without forming any request.
2. Send source BIO, once this reaches blk_mq_submit_bio, this will
retrieve the destination BIO from bi_copy_ctx and form a request with
destination BIO and source BIO. After this request will be sent to
driver.

This work is still in POC phase[2]. But this approach makes lifetime
management of BIO complicated, especially during failure cases.

Thank You,
Nitesh Shetty

[1] https://lore.kernel.org/linux-block/20230605121732.28468-1-nj.shetty@samsung.com/
[2] https://github.com/SamsungDS/linux/tree/feat/co/v21/ctx

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
       [not found]                               ` <6679526f.170a0220.9ffd.aefaSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2024-06-24 16:25                                 ` Bart Van Assche
  2024-06-24 21:55                                   ` Damien Le Moal
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2024-06-24 16:25 UTC (permalink / raw)
  To: Nitesh Shetty, Christoph Hellwig
  Cc: Damien Le Moal, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 6/24/24 3:44 AM, Nitesh Shetty wrote:
> For reference, I have listed the approaches we have taken in the past.
> 
> a. Token/payload based approach:
> 1. Here we allocate a buffer/payload.
> 2. First source BIO is sent along with the buffer.
> 3. Once the buffer reaches driver, it is filled with the source LBA
> and length and namespace info. And the request is completed.
> 4. Then destination BIO is sent with same buffer.
> 5. Once the buffer reaches driver, it retrieves the source information from
> the BIO and forms a copy command and sends it down to device.
> 
> We received feedback that putting anything inside payload which is not
> data, is not a good idea[1].

A token-based approach (pairing copy_src and copy_dst based on a token)
is completely different from a payload-based approach (copy offload
parameters stored in the bio payload). From [1] (I agree with what has
been quoted): "In general every time we tried to come up with a request
payload that is not just data passed to the device it has been a
nightmare." [ ... ] "The only thing we'd need is a sequence number / idr
/ etc to find an input and output side match up, as long as we
stick to the proper namespace scope."

> c. List/ctx based approach:
> A new member is added to bio, bio_copy_ctx, which will a union with
> bi_integrity. Idea is once a copy bio reaches blk_mq_submit_bio, it will
> add the bio to this list.
> 1. Send the destination BIO, once this reaches blk_mq_submit_bio, this
> will add the destination BIO to the list inside bi_copy_ctx and return
> without forming any request.
> 2. Send source BIO, once this reaches blk_mq_submit_bio, this will
> retrieve the destination BIO from bi_copy_ctx and form a request with
> destination BIO and source BIO. After this request will be sent to
> driver.
> 
> This work is still in POC phase[2]. But this approach makes lifetime
> management of BIO complicated, especially during failure cases.

Associating src and dst operations by embedding a pointer to a third
data structure in struct bio is an implementation choice and is not the
only possibility for assocating src and dst operations. Hence, the
bio lifetime complexity mentioned above is not inherent to the list
based approach but is a result of the implementation choice made for
associating src and dst operations.

Has it been considered to combine the list-based approach for managing
unpaired copy operations with the token based approach for pairing copy
src and copy dst operations?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-06-24 16:25                                 ` Bart Van Assche
@ 2024-06-24 21:55                                   ` Damien Le Moal
  2024-06-25 18:18                                     ` Bart Van Assche
  0 siblings, 1 reply; 89+ messages in thread
From: Damien Le Moal @ 2024-06-24 21:55 UTC (permalink / raw)
  To: Bart Van Assche, Nitesh Shetty, Christoph Hellwig
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Alexander Viro, Christian Brauner, Jan Kara, martin.petersen,
	david, hare, damien.lemoal, anuj20.g, joshi.k, nitheshshetty,
	gost.dev, linux-block, linux-kernel, linux-doc, dm-devel,
	linux-nvme, linux-fsdevel

On 2024/06/25 1:25, Bart Van Assche wrote:
> On 6/24/24 3:44 AM, Nitesh Shetty wrote:
>> For reference, I have listed the approaches we have taken in the past.
>>
>> a. Token/payload based approach:
>> 1. Here we allocate a buffer/payload.
>> 2. First source BIO is sent along with the buffer.
>> 3. Once the buffer reaches driver, it is filled with the source LBA
>> and length and namespace info. And the request is completed.
>> 4. Then destination BIO is sent with same buffer.
>> 5. Once the buffer reaches driver, it retrieves the source information from
>> the BIO and forms a copy command and sends it down to device.
>>
>> We received feedback that putting anything inside payload which is not
>> data, is not a good idea[1].
> 
> A token-based approach (pairing copy_src and copy_dst based on a token)
> is completely different from a payload-based approach (copy offload
> parameters stored in the bio payload). From [1] (I agree with what has
> been quoted): "In general every time we tried to come up with a request
> payload that is not just data passed to the device it has been a
> nightmare." [ ... ] "The only thing we'd need is a sequence number / idr
> / etc to find an input and output side match up, as long as we
> stick to the proper namespace scope."
> 
>> c. List/ctx based approach:
>> A new member is added to bio, bio_copy_ctx, which will a union with
>> bi_integrity. Idea is once a copy bio reaches blk_mq_submit_bio, it will
>> add the bio to this list.
>> 1. Send the destination BIO, once this reaches blk_mq_submit_bio, this
>> will add the destination BIO to the list inside bi_copy_ctx and return
>> without forming any request.
>> 2. Send source BIO, once this reaches blk_mq_submit_bio, this will
>> retrieve the destination BIO from bi_copy_ctx and form a request with
>> destination BIO and source BIO. After this request will be sent to
>> driver.
>>
>> This work is still in POC phase[2]. But this approach makes lifetime
>> management of BIO complicated, especially during failure cases.
> 
> Associating src and dst operations by embedding a pointer to a third
> data structure in struct bio is an implementation choice and is not the
> only possibility for assocating src and dst operations. Hence, the
> bio lifetime complexity mentioned above is not inherent to the list
> based approach but is a result of the implementation choice made for
> associating src and dst operations.
> 
> Has it been considered to combine the list-based approach for managing
> unpaired copy operations with the token based approach for pairing copy
> src and copy dst operations?

I am still a little confused as to why we need 2 BIOs, one for src and one for
dst... Is it because of the overly complex scsi extended copy support ?

Given that the main use case is copy offload for data within the same device,
using a single BIO which somehow can carry a list of LBA sources and a single
destination LBA would be far simpler and perfectly matching nvme simple copy and
ATA write gathered. And I think that this would also match the simplest case for
scsi extended copy as well.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
       [not found]                               ` <66795280.630a0220.f3ccd.b80cSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2024-06-24 22:58                                 ` Keith Busch
  0 siblings, 0 replies; 89+ messages in thread
From: Keith Busch @ 2024-06-24 22:58 UTC (permalink / raw)
  To: Nitesh Shetty
  Cc: Christoph Hellwig, Bart Van Assche, Damien Le Moal, Jens Axboe,
	Jonathan Corbet, Alasdair Kergon, Mike Snitzer, Mikulas Patocka,
	Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara, martin.petersen, david, hare,
	damien.lemoal, anuj20.g, joshi.k, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

On Mon, Jun 24, 2024 at 04:14:07PM +0530, Nitesh Shetty wrote:
> c. List/ctx based approach:
> A new member is added to bio, bio_copy_ctx, which will a union with
> bi_integrity. Idea is once a copy bio reaches blk_mq_submit_bio, it will
> add the bio to this list.

Is there a reason to tie this to CONFIG_BLK_DEV_INTEGRITY? Why not use
the bio_io_vec? There's no user data here, so that's unused, right?

> 1. Send the destination BIO, once this reaches blk_mq_submit_bio, this
> will add the destination BIO to the list inside bi_copy_ctx and return
> without forming any request.
> 2. Send source BIO, once this reaches blk_mq_submit_bio, this will
> retrieve the destination BIO from bi_copy_ctx and form a request with
> destination BIO and source BIO. After this request will be sent to
> driver.

Like Damien, I also don't see the point of the 2-bio requirement. Treat
it like discard, and drivers can allocate "special".

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-06-24 21:55                                   ` Damien Le Moal
@ 2024-06-25 18:18                                     ` Bart Van Assche
  2024-06-25 21:18                                       ` Damien Le Moal
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2024-06-25 18:18 UTC (permalink / raw)
  To: Damien Le Moal, Nitesh Shetty, Christoph Hellwig
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Alexander Viro, Christian Brauner, Jan Kara, martin.petersen,
	david, hare, damien.lemoal, anuj20.g, joshi.k, nitheshshetty,
	gost.dev, linux-block, linux-kernel, linux-doc, dm-devel,
	linux-nvme, linux-fsdevel

On 6/24/24 2:55 PM, Damien Le Moal wrote:
> I am still a little confused as to why we need 2 BIOs, one for src and one for
> dst... Is it because of the overly complex scsi extended copy support ?
> 
> Given that the main use case is copy offload for data within the same device,
> using a single BIO which somehow can carry a list of LBA sources and a single
> destination LBA would be far simpler and perfectly matching nvme simple copy and
> ATA write gathered. And I think that this would also match the simplest case for
> scsi extended copy as well.

Hi Damien,

What are the implications for the device mapper code if the copy source
and destination LBAs are encoded in the bio payload instead of in
bio->bi_sector?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-06-25 18:18                                     ` Bart Van Assche
@ 2024-06-25 21:18                                       ` Damien Le Moal
  2024-06-26  5:22                                         ` Christoph Hellwig
  0 siblings, 1 reply; 89+ messages in thread
From: Damien Le Moal @ 2024-06-25 21:18 UTC (permalink / raw)
  To: Bart Van Assche, Nitesh Shetty, Christoph Hellwig
  Cc: Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
	Alexander Viro, Christian Brauner, Jan Kara, martin.petersen,
	david, hare, damien.lemoal, anuj20.g, joshi.k, nitheshshetty,
	gost.dev, linux-block, linux-kernel, linux-doc, dm-devel,
	linux-nvme, linux-fsdevel

On 6/26/24 03:18, Bart Van Assche wrote:
> On 6/24/24 2:55 PM, Damien Le Moal wrote:
>> I am still a little confused as to why we need 2 BIOs, one for src and one for
>> dst... Is it because of the overly complex scsi extended copy support ?
>>
>> Given that the main use case is copy offload for data within the same device,
>> using a single BIO which somehow can carry a list of LBA sources and a single
>> destination LBA would be far simpler and perfectly matching nvme simple copy and
>> ATA write gathered. And I think that this would also match the simplest case for
>> scsi extended copy as well.
> 
> Hi Damien,
> 
> What are the implications for the device mapper code if the copy source
> and destination LBAs are encoded in the bio payload instead of in
> bio->bi_sector?

DM can deal with "abnormal" BIOs on its own. There is code for that.
See is_abnormal_io() and __process_abnormal_io(). Sure, that will need more code
compared to a bio sector+size based simple split, but I do not think it is a big
deal given the potential benefits of the offloading.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-06-25 21:18                                       ` Damien Le Moal
@ 2024-06-26  5:22                                         ` Christoph Hellwig
  2024-06-28 13:53                                           ` Bart Van Assche
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2024-06-26  5:22 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Nitesh Shetty, Christoph Hellwig, Jens Axboe,
	Jonathan Corbet, Alasdair Kergon, Mike Snitzer, Mikulas Patocka,
	Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Alexander Viro,
	Christian Brauner, Jan Kara, martin.petersen, david, hare,
	damien.lemoal, anuj20.g, joshi.k, nitheshshetty, gost.dev,
	linux-block, linux-kernel, linux-doc, dm-devel, linux-nvme,
	linux-fsdevel

On Wed, Jun 26, 2024 at 06:18:18AM +0900, Damien Le Moal wrote:
> 
> DM can deal with "abnormal" BIOs on its own. There is code for that.
> See is_abnormal_io() and __process_abnormal_io(). Sure, that will need more code
> compared to a bio sector+size based simple split, but I do not think it is a big
> deal given the potential benefits of the offloading.

It's not just dm.  You also need it in the partition remapping code
(mandatory), md (nice to have), etc.

And then we have the whole mess of what is in the payload for the I/O
stack vs what is in the payload for the on the wire protocol, which
will have different formatting and potentially also different sizes.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer.
  2024-06-26  5:22                                         ` Christoph Hellwig
@ 2024-06-28 13:53                                           ` Bart Van Assche
  0 siblings, 0 replies; 89+ messages in thread
From: Bart Van Assche @ 2024-06-28 13:53 UTC (permalink / raw)
  To: Christoph Hellwig, Damien Le Moal
  Cc: Nitesh Shetty, Jens Axboe, Jonathan Corbet, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Keith Busch, Sagi Grimberg,
	Chaitanya Kulkarni, Alexander Viro, Christian Brauner, Jan Kara,
	martin.petersen, david, hare, damien.lemoal, anuj20.g, joshi.k,
	nitheshshetty, gost.dev, linux-block, linux-kernel, linux-doc,
	dm-devel, linux-nvme, linux-fsdevel

On 6/25/24 10:22 PM, Christoph Hellwig wrote:
> It's not just dm.  You also need it in the partition remapping code
> (mandatory), md (nice to have), etc.
> 
> And then we have the whole mess of what is in the payload for the I/O
> stack vs what is in the payload for the on the wire protocol, which
> will have different formatting and potentially also different sizes.

Drivers like dm-linear rely on bio splitting. If the COPY_SRC and
COPY_DST operations travel independently down a stacked block driver
hierarchy, a separate data structure is needed to keep track of which
operations have been split and to combine the split operations into
requests. Isn't this an argument in favor of storing the source and
destination parameters in a single bio?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

end of thread, other threads:[~2024-06-28 13:54 UTC | newest]

Thread overview: 89+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <CGME20240520102747epcas5p33497a911ca70c991e5da8e22c5d1336b@epcas5p3.samsung.com>
2024-05-20 10:20 ` [PATCH v20 00/12] Implement copy offload support Nitesh Shetty
     [not found]   ` <CGME20240520102830epcas5p27274901f3d0c2738c515709890b1dec4@epcas5p2.samsung.com>
2024-05-20 10:20     ` [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support Nitesh Shetty
2024-05-20 14:33       ` Damien Le Moal
2024-05-21  8:15         ` Nitesh Shetty
2024-05-20 22:42       ` Bart Van Assche
2024-05-21 14:25         ` Nitesh Shetty
2024-05-22 17:49           ` Bart Van Assche
2024-05-23  7:05             ` Nitesh Shetty
2024-05-21  6:57       ` Hannes Reinecke
2024-06-01  5:53       ` Christoph Hellwig
2024-06-03  6:43         ` Nitesh Shetty
     [not found]   ` <CGME20240520102842epcas5p4949334c2587a15b8adab2c913daa622f@epcas5p4.samsung.com>
2024-05-20 10:20     ` [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer Nitesh Shetty
2024-05-20 15:00       ` Damien Le Moal
2024-05-21 10:50         ` Nitesh Shetty
2024-05-20 23:00       ` Bart Van Assche
2024-05-21 11:17         ` Nitesh Shetty
2024-05-22 17:58           ` Bart Van Assche
2024-05-21  7:01       ` Hannes Reinecke
2024-05-24  6:54         ` Nitesh Shetty
     [not found]         ` <66503bc7.630a0220.56c85.8b9dSMTPIN_ADDED_BROKEN@mx.google.com>
2024-05-24 13:52           ` Bart Van Assche
2024-05-27  8:27             ` Nitesh Shetty
2024-05-28 14:07         ` Bart Van Assche
2024-05-22 18:05       ` Bart Van Assche
2024-05-23 11:34         ` Nitesh Shetty
2024-05-24 20:33       ` Bart Van Assche
2024-05-29  6:17         ` Nitesh Shetty
2024-05-29  7:48           ` Damien Le Moal
2024-05-29 22:41             ` Bart Van Assche
2024-05-30  7:16               ` Nitesh Shetty
     [not found]               ` <665850bd.050a0220.a5e6b.5b72SMTPIN_ADDED_BROKEN@mx.google.com>
2024-05-30 17:11                 ` Bart Van Assche
2024-05-31 10:17                   ` Nitesh Shetty
     [not found]                   ` <6659b691.630a0220.90195.d0ebSMTPIN_ADDED_BROKEN@mx.google.com>
2024-05-31 23:45                     ` Bart Van Assche
2024-06-01  5:59                   ` Christoph Hellwig
2024-06-03 17:12                     ` Bart Van Assche
2024-06-04  4:40                       ` Christoph Hellwig
2024-06-04 11:44                         ` Bart Van Assche
2024-06-05  8:20                           ` Christoph Hellwig
     [not found]                             ` <CGME20240624105121epcas5p3a5a8c73bd5ef19c02e922e5829a4dff0@epcas5p3.samsung.com>
2024-06-24 10:44                               ` Nitesh Shetty
     [not found]                               ` <6679526f.170a0220.9ffd.aefaSMTPIN_ADDED_BROKEN@mx.google.com>
2024-06-24 16:25                                 ` Bart Van Assche
2024-06-24 21:55                                   ` Damien Le Moal
2024-06-25 18:18                                     ` Bart Van Assche
2024-06-25 21:18                                       ` Damien Le Moal
2024-06-26  5:22                                         ` Christoph Hellwig
2024-06-28 13:53                                           ` Bart Van Assche
     [not found]                               ` <66795280.630a0220.f3ccd.b80cSMTPIN_ADDED_BROKEN@mx.google.com>
2024-06-24 22:58                                 ` Keith Busch
     [not found]                           ` <CGME20240606072827epcas5p285de8d4f3b0f6d3a87f8341414336b42@epcas5p2.samsung.com>
2024-06-06  7:28                             ` Nitesh Shetty
     [not found]                             ` <66618886.630a0220.4d4fc.1c9cSMTPIN_ADDED_BROKEN@mx.google.com>
2024-06-06 16:44                               ` Bart Van Assche
2024-06-01  5:57             ` Christoph Hellwig
     [not found]   ` <CGME20240520102853epcas5p42d635d6712b8876ea22a45d730cb1378@epcas5p4.samsung.com>
2024-05-20 10:20     ` [PATCH v20 03/12] block: add copy offload support Nitesh Shetty
2024-06-01  6:16       ` Christoph Hellwig
2024-06-04 10:50         ` Nitesh Shetty
     [not found]   ` <CGME20240520102906epcas5p15b5a0b3c8edd0bf3073030a792a328bb@epcas5p1.samsung.com>
2024-05-20 10:20     ` [PATCH v20 04/12] block: add emulation for copy Nitesh Shetty
2024-05-21  7:06       ` Hannes Reinecke
2024-05-21 11:29         ` Nitesh Shetty
2024-06-01  6:18       ` Christoph Hellwig
     [not found]   ` <CGME20240520102917epcas5p1bda532309b9174bf2702081f6f58daf7@epcas5p1.samsung.com>
2024-05-20 10:20     ` [PATCH v20 05/12] fs/read_write: Enable copy_file_range for block device Nitesh Shetty
2024-05-21  7:07       ` Hannes Reinecke
2024-05-25 23:02       ` Dave Chinner
2024-05-28  5:57         ` Nitesh Shetty
     [not found]   ` <CGME20240520102929epcas5p2f4456f6fa0005d90769615eb2c2bf273@epcas5p2.samsung.com>
2024-05-20 10:20     ` [PATCH v20 06/12] fs, block: copy_file_range for def_blk_ops for direct " Nitesh Shetty
2024-05-25 23:09       ` Dave Chinner
2024-05-27  8:43         ` Nitesh Shetty
     [not found]   ` <CGME20240520102940epcas5p2b5f38ceabe94bed3905fb386a0d65ec7@epcas5p2.samsung.com>
2024-05-20 10:20     ` [PATCH v20 07/12] nvme: add copy offload support Nitesh Shetty
2024-06-01  6:22       ` Christoph Hellwig
2024-06-03 11:43         ` Nitesh Shetty
     [not found]   ` <CGME20240520102952epcas5p18716d203d1810c38397e7fcc9a26922a@epcas5p1.samsung.com>
2024-05-20 10:20     ` [PATCH v20 08/12] nvmet: add copy command support for bdev and file ns Nitesh Shetty
     [not found]   ` <CGME20240520103004epcas5p4a18f3f6ba0f218d57b0ab4bb84c6ff18@epcas5p4.samsung.com>
2024-05-20 10:20     ` [PATCH v20 09/12] dm: Add support for copy offload Nitesh Shetty
2024-05-21  7:11       ` Hannes Reinecke
2024-05-21 14:08         ` Nitesh Shetty
2024-05-22  6:22           ` Hannes Reinecke
2024-05-22  7:10             ` Nitesh Shetty
     [not found]   ` <CGME20240520103016epcas5p31b9a0f3637959626d49763609ebda6ef@epcas5p3.samsung.com>
2024-05-20 10:20     ` [PATCH v20 10/12] dm: Enable copy offload for dm-linear target Nitesh Shetty
2024-05-20 23:25       ` Bart Van Assche
2024-05-21 14:48         ` Nitesh Shetty
     [not found]   ` <CGME20240520103027epcas5p4789defe8ab3bff23bd2abcf019689fa2@epcas5p4.samsung.com>
2024-05-20 10:20     ` [PATCH v20 11/12] null: Enable trace capability for null block Nitesh Shetty
2024-05-20 23:27       ` Bart Van Assche
2024-06-01  6:23       ` Christoph Hellwig
2024-06-03 11:04         ` Nitesh Shetty
     [not found]   ` <CGME20240520103039epcas5p4373f7234162a32222ac225b976ae30ce@epcas5p4.samsung.com>
2024-05-20 10:20     ` [PATCH v20 12/12] null_blk: add support for copy offload Nitesh Shetty
2024-05-20 23:42       ` Bart Van Assche
2024-05-21 14:46         ` Nitesh Shetty
2024-05-22 17:52           ` Bart Van Assche
2024-05-23  6:55             ` Nitesh Shetty
2024-05-20 22:54   ` [PATCH v20 00/12] Implement copy offload support Bart Van Assche
2024-06-01  5:47   ` Christoph Hellwig
2024-06-03 10:53     ` Nitesh Shetty
2024-06-04  4:31 [PATCH v20 01/12] block: Introduce queue limits and sysfs for copy-offload support Christoph Hellwig
2024-06-04  7:05 ` Hannes Reinecke
2024-06-05  8:17   ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).