linux-mm.kvack.org archive mirror
* [PATCH v3 0/4] Optimizing disk file & dmabuf copies via copy_file_range.
@ 2025-05-30 10:39 wangtao
  2025-05-30 10:39 ` [PATCH v3 1/4] fs: allow cross-FS copy_file_range for memory-backed files wangtao
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: wangtao @ 2025-05-30 10:39 UTC (permalink / raw)
  To: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro,
	brauner, hughd, akpm, amir73il
  Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack,
	baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel,
	linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu,
	feng.han, wangtao

Typical dmabuf file reading steps (a minimal C sketch follows the list):
1. dmabuf_fd = dmabuf_alloc(len, heap_fd)
2. vaddr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, dmabuf_fd, 0)
3. file_fd = open(file_path, O_RDONLY)
4. read(file_fd, vaddr, len)
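
A minimal C sketch of these four steps (the system heap path and the model
file path are examples only; error handling omitted):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/dma-heap.h>

static int dmabuf_alloc(size_t len, int heap_fd)
{
	struct dma_heap_allocation_data data = {
		.len = len,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};

	/* step 1: allocate a dmabuf fd from the heap */
	if (ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data) < 0)
		return -1;
	return data.fd;
}

int main(void)
{
	size_t len = 32UL << 20;
	int heap_fd = open("/dev/dma_heap/system", O_RDONLY | O_CLOEXEC);
	int dmabuf_fd = dmabuf_alloc(len, heap_fd);              /* step 1 */
	void *vaddr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_SHARED, dmabuf_fd, 0);            /* step 2 */
	int file_fd = open("/data/model.bin", O_RDONLY);         /* step 3 */
	ssize_t n = read(file_fd, vaddr, len);                   /* step 4 */

	printf("read %zd bytes into the dmabuf\n", n);
	return 0;
}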

dmabuf's attachment/map/fence model uses remap_pfn_range for mmap, which
lacks Direct I/O support[1]. Buffered I/O therefore adds noticeable latency
when loading large AI model files.

My previous patch added a dmabuf ioctl for Direct I/O file operations and
showed good performance on low-power CPUs[2][3]. Christian suggested using
the existing uAPIs (read/sendfile/splice/copy_file_range) instead of new
ioctls.

sendfile/splice/copy_file_range (c_f_r below) enable zero-copy via Direct
I/O for disk-to-disk and disk-to-network transfers:
sendfile(skt_fd, disk_fd):    [DISK] -DMA-> [pipe(buf)] -DMA-> [NIC]
sendfile(dst_disk, src_disk): [DISK] -DMA-> [pipe(buf)] -DMA-> [DISK]
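
For reference, a minimal sketch of the disk-to-socket case (skt_fd/disk_fd
setup omitted; this only illustrates the existing uAPI and is not part of
this series):

#include <sys/sendfile.h>
#include <unistd.h>

/* Disk to socket with no CPU copy: pages are DMA'd from the disk into
 * kernel buffers and handed to the NIC; the CPU only submits the I/O. */
static ssize_t send_disk_to_socket(int skt_fd, int disk_fd, size_t len)
{
	off_t off = 0;
	size_t done = 0;

	while (done < len) {
		ssize_t n = sendfile(skt_fd, disk_fd, &off, len - done);

		if (n <= 0)
			return n < 0 ? n : (ssize_t)done;
		done += n;
	}
	return done;
}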

Analysis shows the existing uAPIs cannot achieve zero-copy from disk to
dmabuf. Since dmabuf has no regular file ops, tmpfs stands in as the
memory-backed target for the disk-to-memory CPU-copy analysis:
| Method            | CPU Copies | Key Overhead               |
|-------------------|------------|----------------------------|
| 1. Buffer R+W     | 2          | Alloc(cache) & 2 CPU copies|
| 2. Direct R+W     | 1          | GUP(usr_buf) & 1 CPU copy  |
| 3. Mmap+Buffer R  | 1          | Alloc(cache) & 1 CPU copy  |
| 4. Mmap+Direct R  | 0          | GUP(mem_buf) ~50% CPU      |
| 5. Buffer Sendfile| 1          | Alloc(cache) & 1 CPU copy  |
| 6. Direct Sendfile| 1          | Small pipe, high IRQ       |
| 7. Buffer Splice  | 1          | Alloc(cache) & 1 CPU copy  |
| 8. Direct Splice  | 1          | Larger pipe buffer         |
| 9. c_f_r          | N/A        | Cross-FS blocked           |

GUP: get_user_pages
Alloc(cache): allocate page cache

Data flows:
1. [DISK] -DMA-> [Alloc(cache)] -COPY-> [usr_buf] -COPY-> [MEM]
2. [DISK] -DMA-> [GUP(usr_buf)] -COPY-> [MEM]
3. [DISK] -DMA-> [Alloc(cache)] -COPY-> [mem_buf]
4. [DISK] -DMA-> [GUP(mem_buf)]
5. [DISK] -DMA-> [pipe(Alloc(cache))] -COPY-> [tmpfs page]
6. [DISK] -DMA-> [pipe(buf)] -COPY-> [tmpfs page]
7. [DISK] -DMA-> [big_pipe(Alloc(cache))] -COPY-> [tmpfs page]
8. [DISK] -DMA-> [big_pipe(buf)] -COPY-> [tmpfs page]
9. [DISK] -DMA-> [tmpfs page] (blocked)

Key findings:
- Buffered I/O requires page cache allocation and at least one CPU copy
- Read+Write incurs excessive CPU copies and is not analyzed further; in the
  tables below, R/W denotes the mmap+Read (or mmap+Write) approach
- Mmap+Direct needs zero CPU copies but pays ~50% CPU in GUP overhead, and
  dmabuf mappings (remap_pfn_range) do not support Direct I/O
- sendfile/splice go through an intermediate pipe and still need 1 CPU copy
- c_f_r limitations: cross-FS copies are blocked and memory-backed
  filesystems do not implement it

Modifications:
1. Enable cross-FS c_f_r for memory file types
2. Add dmabuf c_f_r callbacks for [DISK] -DMA-> [dmabuf] (see the sketch
   below)
3. Keep tmpfs c_f_r as a local test only (not submitted upstream) due to
   lock_page deadlock risks
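
With modifications 1 and 2 in place, loading a disk file into a dmabuf from
userspace becomes a single copy_file_range() call on a dmabuf fd allocated as
in the earlier sketch. A rough sketch (O_DIRECT, the helper name, and the
single-shot call are illustrative only):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* [DISK] -DMA-> [dmabuf]: the exporter's callback pulls the data in through
 * the disk file's read_iter, without a bounce buffer. */
static ssize_t load_file_into_dmabuf(int dmabuf_fd, const char *path, size_t len)
{
	int disk_fd = open(path, O_RDONLY | O_DIRECT);
	ssize_t n;

	if (disk_fd < 0)
		return -1;
	/* A real loader would loop until len bytes are copied. */
	n = copy_file_range(disk_fd, NULL, dmabuf_fd, NULL, len, 0);
	close(disk_fd);
	return n;
}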

Performance (1GHz CPU, UFS4@4GB):
1. tmpfs(memfd) direct c_f_r (1197 MB/s): +15% vs mmap&read (1014 MB/s)
2. udmabuf+memfd (2318 MB/s): +50% vs mmap&read (1457 MB/s)
3. dmabuf direct c_f_r (3405 MB/s): 260% faster than buffered I/O (918 MB/s)
   and 40% faster than udmabuf (2318 MB/s)

|    32x32MB Read 1024MB  |Creat-ms|Close-ms|  I/O-ms|I/O-MB/s| I/O%
|-------------------------|--------|--------|--------|--------|-----
| 1)Beg  dmabuf buffer R/W|     52 |      5 |   1170 |    918 | 100%
| 2)    udmabuf buffer R/W|    591 |    326 |   1281 |    838 |  91%
| 3)      memfd buffer R/W|      1 |    323 |   2370 |    453 |  49%
| 4)      memfd direct R/W|      1 |    321 |   1058 |   1014 | 110%
| 5) memfd buffer sendfile|      1 |    329 |   1577 |    681 |  74%
| 6) memfd direct sendfile|      1 |    327 |   2672 |    401 |  43%
| 7)   memfd buffer splice|      2 |    321 |   1729 |    621 |  67%
| 8)   memfd direct splice|      2 |    324 |   1528 |    702 |  76%
| 9)    memfd buffer c_f_r|      1 |    325 |   1586 |    677 |  73%
|10)    memfd direct c_f_r|      1 |    323 |    897 |   1197 | 130%
|11)      u+mfd buffer R/W|    609 |    344 |   2207 |    486 |  52%
|12)      u+mfd direct R/W|    580 |    342 |    737 |   1457 | 158%
|13) u+mfd buffer sendfile|    582 |    343 |   1270 |    845 |  92%
|14) u+mfd direct sendfile|    573 |    344 |   2254 |    476 |  51%
|15)   u+mfd buffer splice|    584 |    341 |   1202 |    893 |  97%
|16)   u+mfd direct splice|    564 |    340 |    851 |   1263 | 137%
|17)    u+mfd buffer c_f_r|    585 |    344 |   1244 |    863 |  94%
|18)    u+mfd direct c_f_r|    578 |    341 |    581 |   1848 | 201%
|19)  udmabuf buffer c_f_r|    585 |    328 |   1163 |    923 | 100%
|20)  udmabuf direct c_f_r|    579 |    328 |    464 |   2318 | 252%
|21)   dmabuf buffer c_f_r|     48 |      5 |   1058 |   1015 | 110%
|22)   dmabuf direct c_f_r|     48 |      5 |    316 |   3405 | 370%
|23)End  dmabuf buffer R/W|     48 |      5 |   1173 |    915 |  99%

u+mfd = udma+memfd = a udmabuf created from a pre-allocated memfd.
The page cache is cleared during the tests to simulate real-world large
file loading.
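
For reproduction, a typical way to drop the cache between runs (root
required; not necessarily the exact method used for the numbers above):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Flush dirty data, then drop the page cache and slab caches. */
static void drop_caches(void)
{
	int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);

	if (fd < 0)
		return;
	sync();
	if (write(fd, "3", 1) != 1)
		perror("drop_caches");
	close(fd);
}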

dmabuf file Use Cases:
- Loading large AI models using dmabuf
- Real-time data capture and storage with dmabuf
- Persisting task snapshots in Android

v1: [2]
v1 -> v2: [3]
 The dma-buf exporter verifies exclusive access to the dmabuf's sg_table.
v2 -> v3:
 copy_file_range supports copying from disk files to memory files.
 Implement the copy_file_range callback functions for dmabuf/udmabuf.

Reference:
[1] https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com
[2] https://lore.kernel.org/all/20250513092803.2096-1-tao.wangtao@honor.com
[3] https://lore.kernel.org/all/20250516092148.12778-1-tao.wangtao@honor.com


wangtao (4):
  fs: allow cross-FS copy_file_range for memory-backed files
  dmabuf: Implement copy_file_range for dmabuf
  udmabuf: Implement udmabuf rw_file callback
  dmabuf:system_heap Implement system_heap exporter's rw_file callback.

 drivers/dma-buf/dma-buf.c           | 32 ++++++++++++
 drivers/dma-buf/heaps/system_heap.c | 79 +++++++++++++++++++++++++++++
 drivers/dma-buf/udmabuf.c           | 59 +++++++++++++++++++++
 fs/read_write.c                     | 71 +++++++++++++++++++-------
 include/linux/dma-buf.h             | 16 ++++++
 include/linux/fs.h                  |  2 +
 6 files changed, 240 insertions(+), 19 deletions(-)

-- 
2.17.1



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v3 1/4] fs: allow cross-FS copy_file_range for memory-backed files
  2025-05-30 10:39 [PATCH v3 0/4] Optimizing disk file & dmabuf copies via copy_file_range wangtao
@ 2025-05-30 10:39 ` wangtao
  2025-05-30 13:43   ` Amir Goldstein
  2025-05-30 10:39 ` [PATCH v3 2/4] dmabuf: Implement copy_file_range for dmabuf wangtao
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: wangtao @ 2025-05-30 10:39 UTC (permalink / raw)
  To: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro,
	brauner, hughd, akpm, amir73il
  Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack,
	baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel,
	linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu,
	feng.han, wangtao

Memory-backed files can optimize copy performance via
copy_file_range callbacks. Compared to mmap&read: reduces
GUP (get_user_pages) overhead; vs sendfile/splice: eliminates
one memory copy; supports dmabuf zero-copy implementation.

Signed-off-by: wangtao <tao.wangtao@honor.com>
---
 fs/read_write.c    | 71 +++++++++++++++++++++++++++++++++-------------
 include/linux/fs.h |  2 ++
 2 files changed, 54 insertions(+), 19 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index bb0ed26a0b3a..591c6db7b785 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1469,6 +1469,20 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 }
 #endif
 
+static inline bool is_copy_memory_file_to_file(struct file *file_in,
+				struct file *file_out)
+{
+	return (file_in->f_op->fop_flags & FOP_MEMORY_FILE) &&
+		file_in->f_op->copy_file_range && file_out->f_op->write_iter;
+}
+
+static inline bool is_copy_file_to_memory_file(struct file *file_in,
+				struct file *file_out)
+{
+	return (file_out->f_op->fop_flags & FOP_MEMORY_FILE) &&
+		file_in->f_op->read_iter && file_out->f_op->copy_file_range;
+}
+
 /*
  * Performs necessary checks before doing a file copy
  *
@@ -1484,11 +1498,23 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
 	struct inode *inode_out = file_inode(file_out);
 	uint64_t count = *req_count;
 	loff_t size_in;
+	bool splice = flags & COPY_FILE_SPLICE;
+	bool has_memory_file;
 	int ret;
 
-	ret = generic_file_rw_checks(file_in, file_out);
-	if (ret)
-		return ret;
+	/* Skip generic checks, allow cross-sb copies for dma-buf/tmpfs */
+	has_memory_file = is_copy_memory_file_to_file(file_in, file_out) ||
+			  is_copy_file_to_memory_file(file_in, file_out);
+	if (!splice && has_memory_file) {
+		if (!(file_in->f_mode & FMODE_READ) ||
+		    !(file_out->f_mode & FMODE_WRITE) ||
+		    (file_out->f_flags & O_APPEND))
+			return -EBADF;
+	} else {
+		ret = generic_file_rw_checks(file_in, file_out);
+		if (ret)
+			return ret;
+	}
 
 	/*
 	 * We allow some filesystems to handle cross sb copy, but passing
@@ -1500,7 +1526,7 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
 	 * and several different sets of file_operations, but they all end up
 	 * using the same ->copy_file_range() function pointer.
 	 */
-	if (flags & COPY_FILE_SPLICE) {
+	if (splice || has_memory_file) {
 		/* cross sb splice is allowed */
 	} else if (file_out->f_op->copy_file_range) {
 		if (file_in->f_op->copy_file_range !=
@@ -1581,23 +1607,30 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	 * same sb using clone, but for filesystems where both clone and copy
 	 * are supported (e.g. nfs,cifs), we only call the copy method.
 	 */
-	if (!splice && file_out->f_op->copy_file_range) {
-		ret = file_out->f_op->copy_file_range(file_in, pos_in,
-						      file_out, pos_out,
-						      len, flags);
-	} else if (!splice && file_in->f_op->remap_file_range && samesb) {
-		ret = file_in->f_op->remap_file_range(file_in, pos_in,
-				file_out, pos_out,
-				min_t(loff_t, MAX_RW_COUNT, len),
-				REMAP_FILE_CAN_SHORTEN);
-		/* fallback to splice */
-		if (ret <= 0)
+	if (!splice) {
+		if (is_copy_memory_file_to_file(file_in, file_out)) {
+			ret = file_in->f_op->copy_file_range(file_in, pos_in,
+					file_out, pos_out, len, flags);
+		} else if (is_copy_file_to_memory_file(file_in, file_out)) {
+			ret = file_out->f_op->copy_file_range(file_in, pos_in,
+					file_out, pos_out, len, flags);
+		} else if (file_out->f_op->copy_file_range) {
+			ret = file_out->f_op->copy_file_range(file_in, pos_in,
+							file_out, pos_out,
+							len, flags);
+		} else if (file_in->f_op->remap_file_range && samesb) {
+			ret = file_in->f_op->remap_file_range(file_in, pos_in,
+					file_out, pos_out,
+					min_t(loff_t, MAX_RW_COUNT, len),
+					REMAP_FILE_CAN_SHORTEN);
+			/* fallback to splice */
+			if (ret <= 0)
+				splice = true;
+		} else if (samesb) {
+			/* Fallback to splice for same sb copy for backward compat */
 			splice = true;
-	} else if (samesb) {
-		/* Fallback to splice for same sb copy for backward compat */
-		splice = true;
+		}
 	}
-
 	file_end_write(file_out);
 
 	if (!splice)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 016b0fe1536e..37df1b497418 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2187,6 +2187,8 @@ struct file_operations {
 #define FOP_ASYNC_LOCK		((__force fop_flags_t)(1 << 6))
 /* File system supports uncached read/write buffered IO */
 #define FOP_DONTCACHE		((__force fop_flags_t)(1 << 7))
+/* Supports cross-FS copy_file_range for memory file */
+#define FOP_MEMORY_FILE		((__force fop_flags_t)(1 << 8))
 
 /* Wrap a directory iterator that needs exclusive inode access */
 int wrap_directory_iterator(struct file *, struct dir_context *,
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 2/4] dmabuf: Implement copy_file_range for dmabuf
  2025-05-30 10:39 [PATCH v3 0/4] Optimizing disk file & dmabuf copies via copy_file_range wangtao
  2025-05-30 10:39 ` [PATCH v3 1/4] fs: allow cross-FS copy_file_range for memory-backed files wangtao
@ 2025-05-30 10:39 ` wangtao
  2025-05-30 10:39 ` [PATCH v3 3/4] udmabuf: Implement udmabuf rw_file callback wangtao
  2025-05-30 10:39 ` [PATCH v3 4/4] dmabuf:system_heap Implement system_heap exporter's " wangtao
  3 siblings, 0 replies; 9+ messages in thread
From: wangtao @ 2025-05-30 10:39 UTC (permalink / raw)
  To: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro,
	brauner, hughd, akpm, amir73il
  Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack,
	baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel,
	linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu,
	feng.han, wangtao

First determine whether the dmabuf reads from or writes to the other file,
then call the exporter's rw_file callback.

Signed-off-by: wangtao <tao.wangtao@honor.com>
---
 drivers/dma-buf/dma-buf.c | 32 ++++++++++++++++++++++++++++++++
 include/linux/dma-buf.h   | 16 ++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 5baa83b85515..fc9bf54c921a 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -523,7 +523,38 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file)
 	spin_unlock(&dmabuf->name_lock);
 }
 
+static ssize_t dma_buf_rw_file(struct dma_buf *dmabuf, loff_t my_pos,
+	struct file *file, loff_t pos, size_t count, bool is_write)
+{
+	if (!dmabuf->ops->rw_file)
+		return -EINVAL;
+
+	if (my_pos >= dmabuf->size)
+		count = 0;
+	else
+		count = min_t(size_t, count, dmabuf->size - my_pos);
+	if (!count)
+		return 0;
+
+	return dmabuf->ops->rw_file(dmabuf, my_pos, file, pos, count, is_write);
+}
+
+static ssize_t dma_buf_copy_file_range(struct file *file_in, loff_t pos_in,
+	struct file *file_out, loff_t pos_out,
+	size_t count, unsigned int flags)
+{
+	if (is_dma_buf_file(file_in) && file_out->f_op->write_iter)
+		return dma_buf_rw_file(file_in->private_data, pos_in,
+				file_out, pos_out, count, true);
+	else if (is_dma_buf_file(file_out) && file_in->f_op->read_iter)
+		return dma_buf_rw_file(file_out->private_data, pos_out,
+				file_in, pos_in, count, false);
+	else
+		return -EINVAL;
+}
+
 static const struct file_operations dma_buf_fops = {
+	.fop_flags = FOP_MEMORY_FILE,
 	.release	= dma_buf_file_release,
 	.mmap		= dma_buf_mmap_internal,
 	.llseek		= dma_buf_llseek,
@@ -531,6 +562,7 @@ static const struct file_operations dma_buf_fops = {
 	.unlocked_ioctl	= dma_buf_ioctl,
 	.compat_ioctl	= compat_ptr_ioctl,
 	.show_fdinfo	= dma_buf_show_fdinfo,
+	.copy_file_range = dma_buf_copy_file_range,
 };
 
 /*
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 36216d28d8bd..d3636e985399 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -22,6 +22,7 @@
 #include <linux/fs.h>
 #include <linux/dma-fence.h>
 #include <linux/wait.h>
+#include <uapi/linux/dma-buf.h>
 
 struct device;
 struct dma_buf;
@@ -285,6 +286,21 @@ struct dma_buf_ops {
 
 	int (*vmap)(struct dma_buf *dmabuf, struct iosys_map *map);
 	void (*vunmap)(struct dma_buf *dmabuf, struct iosys_map *map);
+
+	/**
+	 * @rw_file:
+	 *
+	 * If an Exporter needs to support Direct I/O file operations, it can
+	 * implement this optional callback. The exporter must verify that no
+	 * other objects hold the sg_table, ensure exclusive access to the
+	 * dmabuf's sg_table, and only then proceed with the I/O operation.
+	 *
+	 * Returns:
+	 *
+	 * Bytes copied on success or a negative error code on failure.
+	 */
+	ssize_t (*rw_file)(struct dma_buf *dmabuf, loff_t my_pos,
+		struct file *file, loff_t pos, size_t count, bool is_write);
 };
 
 /**
-- 
2.17.1




^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 3/4] udmabuf: Implement udmabuf rw_file callback
  2025-05-30 10:39 [PATCH v3 0/4] Optimizing disk file & dmabuf copies via copy_file_range wangtao
  2025-05-30 10:39 ` [PATCH v3 1/4] fs: allow cross-FS copy_file_range for memory-backed files wangtao
  2025-05-30 10:39 ` [PATCH v3 2/4] dmabuf: Implement copy_file_range for dmabuf wangtao
@ 2025-05-30 10:39 ` wangtao
  2025-05-30 14:24   ` kernel test robot
  2025-05-30 10:39 ` [PATCH v3 4/4] dmabuf:system_heap Implement system_heap exporter's " wangtao
  3 siblings, 1 reply; 9+ messages in thread
From: wangtao @ 2025-05-30 10:39 UTC (permalink / raw)
  To: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro,
	brauner, hughd, akpm, amir73il
  Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack,
	baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel,
	linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu,
	feng.han, wangtao

Construct a bio_vec array from the pinned folios, then call the other
file's read_iter/write_iter callbacks to do the I/O.
Test data shows direct I/O copy_file_range improves performance by
over 50% vs direct I/O mmap&read (2557 MB/s vs 1534 MB/s).

Test data:
|    32x32MB Read 1024MB  |Creat-ms|Close-ms|  I/O-ms|I/O-MB/s| I/O%
|-------------------------|--------|--------|--------|--------|-----
| 1)Beg udmabuf buffer R/W|    580 |    323 |   1238 |    867 | 100%
| 2)     dmabuf buffer R/W|     48 |      5 |   1149 |    934 | 107%
| 3) udma+memfd buffer R/W|    597 |    340 |   2157 |    497 |  57%
| 4) udma+memfd direct R/W|    573 |    340 |    700 |   1534 | 176%
| 5) u+mfd buffer sendfile|    577 |    340 |   1204 |    891 | 102%
| 6) u+mfd direct sendfile|    567 |    339 |   2272 |    472 |  54%
| 7)   u+mfd buffer splice|    570 |    337 |   1114 |    964 | 111%
| 8)   u+mfd direct splice|    564 |    335 |    793 |   1355 | 156%
| 9)  udmabuf buffer c_f_r|    577 |    323 |   1059 |   1014 | 116%
|10)  udmabuf direct c_f_r|    582 |    325 |    420 |   2557 | 294%
|11)End udmabuf buffer R/W|    586 |    323 |   1188 |    903 | 104%

Signed-off-by: wangtao <tao.wangtao@honor.com>
---
 drivers/dma-buf/udmabuf.c | 59 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index e74e36a8ecda..573275a51674 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -284,6 +284,64 @@ static int end_cpu_udmabuf(struct dma_buf *buf,
 	return 0;
 }
 
+static ssize_t udmabuf_rw_file(struct dma_buf *dmabuf, loff_t my_pos,
+			struct file *other, loff_t pos,
+			size_t count, bool is_write)
+{
+	struct udmabuf *ubuf = dmabuf->priv;
+	loff_t my_end = my_pos + count, bv_beg, bv_end = 0;
+	pgoff_t pg_idx = my_pos / PAGE_SIZE;
+	pgoff_t pg_end = DIV_ROUND_UP(my_end, PAGE_SIZE);
+	size_t i, bv_off, bv_len, bv_num, bv_idx = 0, bv_total = 0;
+	struct bio_vec *bvec;
+	struct kiocb kiocb;
+	struct iov_iter iter;
+	unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST;
+	ssize_t ret = 0, rw_total = 0;
+	struct folio *folio;
+
+	bv_num = min_t(size_t, pg_end - pg_idx + 1, 1024);
+	bvec = kvcalloc(bv_num, sizeof(*bvec), GFP_KERNEL);
+	if (!bvec)
+		return -ENOMEM;
+
+	init_sync_kiocb(&kiocb, other);
+	kiocb.ki_pos = pos;
+
+	for (i = 0; i < ubuf->nr_pinned && my_pos < my_end; i++) {
+		folio = ubuf->pinned_folios[i];
+		bv_beg = bv_end;
+		bv_end += folio_size(folio);
+		if (bv_end <= my_pos)
+			continue;
+
+		bv_len = min(bv_end, my_end) - my_pos;
+		bv_off = my_pos - bv_beg;
+		my_pos += bv_len;
+		bv_total += bv_len;
+		bvec_set_page(&bvec[bv_idx], &folio->page, bv_len, bv_off);
+		if (++bv_idx < bv_num && my_pos < my_end)
+			continue;
+
+		/* start R/W if bvec is full or count reaches zero. */
+		iov_iter_bvec(&iter, direction, bvec, bv_idx, bv_total);
+		if (is_write)
+			ret = other->f_op->write_iter(&kiocb, &iter);
+		else
+			ret = other->f_op->read_iter(&kiocb, &iter);
+		if (ret <= 0)
+			break;
+		rw_total += ret;
+		if (ret < bv_total || fatal_signal_pending(current))
+			break;
+
+		bv_idx = bv_total = 0;
+	}
+	kvfree(bvec);
+
+	return rw_total > 0 ? rw_total : ret;
+}
+
 static const struct dma_buf_ops udmabuf_ops = {
 	.cache_sgt_mapping = true,
 	.map_dma_buf	   = map_udmabuf,
@@ -294,6 +352,7 @@ static const struct dma_buf_ops udmabuf_ops = {
 	.vunmap		   = vunmap_udmabuf,
 	.begin_cpu_access  = begin_cpu_udmabuf,
 	.end_cpu_access    = end_cpu_udmabuf,
+	.rw_file = udmabuf_rw_file,
 };
 
 #define SEALS_WANTED (F_SEAL_SHRINK)
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 4/4] dmabuf:system_heap Implement system_heap exporter's rw_file callback.
  2025-05-30 10:39 [PATCH v3 0/4] Optimizing disk file & dmabuf copies via copy_file_range wangtao
                   ` (2 preceding siblings ...)
  2025-05-30 10:39 ` [PATCH v3 3/4] udmabuf: Implement udmabuf rw_file callback wangtao
@ 2025-05-30 10:39 ` wangtao
  3 siblings, 0 replies; 9+ messages in thread
From: wangtao @ 2025-05-30 10:39 UTC (permalink / raw)
  To: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro,
	brauner, hughd, akpm, amir73il
  Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack,
	baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel,
	linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu,
	feng.han, wangtao

First verify that the system_heap exporter has exclusive access to the dmabuf.
Build a bio_vec array from the sg_table, then invoke the target file's
read_iter/write_iter callbacks to do the I/O.
This outperforms buffered-I/O mmap/read by 250%, beats direct I/O udmabuf
copy_file_range by over 30%, and has significantly lower initialization time
than udmabuf.

Test data:
|    32x32MB Read 1024MB  |Creat-ms|Close-ms|  I/O-ms|I/O-MB/s| I/O%
|-------------------------|--------|--------|--------|--------|-----
| 1)Beg  dmabuf buffer R/W|     47 |      5 |   1125 |    954 | 100%
| 2)    udmabuf buffer R/W|    576 |    323 |   1228 |    874 |  91%
| 3) udma+memfd buffer R/W|    596 |    340 |   2166 |    495 |  51%
| 4) udma+memfd direct R/W|    570 |    338 |    711 |   1510 | 158%
| 5)  udmabuf buffer c_f_r|    578 |    329 |   1128 |    952 |  99%
| 6)  udmabuf direct c_f_r|    570 |    324 |    405 |   2651 | 277%
| 7)   dmabuf buffer c_f_r|     47 |      5 |   1035 |   1037 | 108%
| 8)   dmabuf direct c_f_r|     51 |      5 |    309 |   3480 | 364%
| 9)End  dmabuf buffer R/W|     48 |      5 |   1153 |    931 |  97%

|    32x32MB Write 1024MB |Creat-ms|Close-ms|  I/O-ms|I/O-MB/s| I/O%
|-------------------------|--------|--------|--------|--------|-----
| 1)Beg  dmabuf buffer R/W|     50 |      5 |   1405 |    764 | 100%
| 2)    udmabuf buffer R/W|    580 |    341 |   1337 |    803 | 105%
| 3) udma+memfd buffer R/W|    588 |    331 |   1820 |    590 |  77%
| 4) udma+memfd direct R/W|    585 |    333 |    662 |   1622 | 212%
| 5)  udmabuf buffer c_f_r|    577 |    329 |   1326 |    810 | 106%
| 6)  udmabuf direct c_f_r|    580 |    330 |    602 |   1784 | 233%
| 7)   dmabuf buffer c_f_r|     49 |      5 |   1330 |    807 | 105%
| 8)   dmabuf direct c_f_r|     49 |      5 |    344 |   3127 | 409%
| 9)End  dmabuf buffer R/W|     50 |      5 |   1442 |    745 |  97%

Signed-off-by: wangtao <tao.wangtao@honor.com>
---
 drivers/dma-buf/heaps/system_heap.c | 79 +++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index 26d5dc89ea16..d3a1956ebad8 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -20,6 +20,9 @@
 #include <linux/scatterlist.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
+#include <linux/bvec.h>
+#include <linux/bio.h>
+#include <linux/uio.h>
 
 static struct dma_heap *sys_heap;
 
@@ -281,6 +284,81 @@ static void system_heap_vunmap(struct dma_buf *dmabuf, struct iosys_map *map)
 	iosys_map_clear(map);
 }
 
+static ssize_t system_heap_buffer_rw_other(struct system_heap_buffer *buffer,
+			loff_t my_pos, struct file *other, loff_t pos,
+			size_t count, bool is_write)
+{
+	struct sg_table *sgt = &buffer->sg_table;
+	struct scatterlist *sg;
+	loff_t my_end = my_pos + count, bv_beg, bv_end = 0;
+	pgoff_t pg_idx = my_pos / PAGE_SIZE;
+	pgoff_t pg_end = DIV_ROUND_UP(my_end, PAGE_SIZE);
+	size_t i, bv_off, bv_len, bv_num, bv_idx = 0, bv_total = 0;
+	struct bio_vec *bvec;
+	struct kiocb kiocb;
+	struct iov_iter iter;
+	unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST;
+	ssize_t ret = 0, rw_total = 0;
+
+	bv_num = min_t(size_t, pg_end - pg_idx + 1, 1024);
+	bvec = kvcalloc(bv_num, sizeof(*bvec), GFP_KERNEL);
+	if (!bvec)
+		return -ENOMEM;
+
+	init_sync_kiocb(&kiocb, other);
+	kiocb.ki_pos = pos;
+
+	for_each_sg(sgt->sgl, sg, sgt->nents, i) {
+		if (my_pos >= my_end)
+			break;
+		bv_beg = bv_end;
+		bv_end += sg->length;
+		if (bv_end <= my_pos)
+			continue;
+
+		bv_len = min(bv_end, my_end) - my_pos;
+		bv_off = sg->offset + my_pos - bv_beg;
+		my_pos += bv_len;
+		bv_total += bv_len;
+		bvec_set_page(&bvec[bv_idx], sg_page(sg), bv_len, bv_off);
+		if (++bv_idx < bv_num && my_pos < my_end)
+			continue;
+
+		/* start R/W if bvec is full or count reaches zero. */
+		iov_iter_bvec(&iter, direction, bvec, bv_idx, bv_total);
+		if (is_write)
+			ret = other->f_op->write_iter(&kiocb, &iter);
+		else
+			ret = other->f_op->read_iter(&kiocb, &iter);
+		if (ret <= 0)
+			break;
+		rw_total += ret;
+		if (ret < bv_total || fatal_signal_pending(current))
+			break;
+
+		bv_idx = bv_total = 0;
+	}
+	kvfree(bvec);
+
+	return rw_total > 0 ? rw_total : ret;
+}
+
+static ssize_t system_heap_dma_buf_rw_file(struct dma_buf *dmabuf,
+			loff_t my_pos, struct file *file, loff_t pos,
+			size_t count, bool is_write)
+{
+	struct system_heap_buffer *buffer = dmabuf->priv;
+	ssize_t ret = -EBUSY;
+
+	mutex_lock(&buffer->lock);
+	if (list_empty(&buffer->attachments) && !buffer->vmap_cnt)
+		ret = system_heap_buffer_rw_other(buffer, my_pos,
+			file, pos, count, is_write);
+	mutex_unlock(&buffer->lock);
+
+	return ret;
+}
+
 static void system_heap_dma_buf_release(struct dma_buf *dmabuf)
 {
 	struct system_heap_buffer *buffer = dmabuf->priv;
@@ -308,6 +386,7 @@ static const struct dma_buf_ops system_heap_buf_ops = {
 	.mmap = system_heap_mmap,
 	.vmap = system_heap_vmap,
 	.vunmap = system_heap_vunmap,
+	.rw_file = system_heap_dma_buf_rw_file,
 	.release = system_heap_dma_buf_release,
 };
 
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 1/4] fs: allow cross-FS copy_file_range for memory-backed files
  2025-05-30 10:39 ` [PATCH v3 1/4] fs: allow cross-FS copy_file_range for memory-backed files wangtao
@ 2025-05-30 13:43   ` Amir Goldstein
  2025-06-03  1:28     ` wangtao
  0 siblings, 1 reply; 9+ messages in thread
From: Amir Goldstein @ 2025-05-30 13:43 UTC (permalink / raw)
  To: wangtao
  Cc: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro,
	brauner, hughd, akpm, benjamin.gaignard, Brian.Starkey, jstultz,
	tjmercier, jack, baolin.wang, linux-media, dri-devel,
	linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm,
	bintian.wang, yipengxiang, liulu.liu, feng.han

On Fri, May 30, 2025 at 12:40 PM wangtao <tao.wangtao@honor.com> wrote:
>
> Memory-backed files can optimize copy performance via
> copy_file_range callbacks. Compared to mmap&read: reduces
> GUP (get_user_pages) overhead; vs sendfile/splice: eliminates
> one memory copy; supports dmabuf zero-copy implementation.
>
> Signed-off-by: wangtao <tao.wangtao@honor.com>
> ---

Hi wangtao,

Let me rephrase my reaction gently:
Regardless of the proposed API extension, and referring to the
implementation itself -
I wrote the current code and I can barely understand the logic every time
I come back to read it.

Your changes make it too messy to be reviewed/maintained.
I have a few suggestions for simplifications (see below).

The complication arises from the fact that normally the destination fops
are used to call the copy_file_range() method, and this case deviates from
that standard; we need to make the code simpler by using a helper to choose
the fops to use.

>  fs/read_write.c    | 71 +++++++++++++++++++++++++++++++++-------------
>  include/linux/fs.h |  2 ++
>  2 files changed, 54 insertions(+), 19 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index bb0ed26a0b3a..591c6db7b785 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1469,6 +1469,20 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
>  }
>  #endif
>
> +static inline bool is_copy_memory_file_to_file(struct file *file_in,
> +                               struct file *file_out)
> +{
> +       return (file_in->f_op->fop_flags & FOP_MEMORY_FILE) &&
> +               file_in->f_op->copy_file_range && file_out->f_op->write_iter;
> +}
> +
> +static inline bool is_copy_file_to_memory_file(struct file *file_in,
> +                               struct file *file_out)
> +{
> +       return (file_out->f_op->fop_flags & FOP_MEMORY_FILE) &&
> +               file_in->f_op->read_iter && file_out->f_op->copy_file_range;
> +}
> +

Please combine that into a helper:

const struct file_operations *memory_copy_file_ops(struct file *file_in,
						   struct file *file_out)

which returns either file_in->f_op, file_out->f_op or NULL.
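
Roughly something like this (untested sketch, reusing the FOP_MEMORY_FILE
checks from your patch):

static const struct file_operations *memory_copy_file_ops(struct file *file_in,
							   struct file *file_out)
{
	/* Memory file is the source: its ->copy_file_range() writes out. */
	if ((file_in->f_op->fop_flags & FOP_MEMORY_FILE) &&
	    file_in->f_op->copy_file_range && file_out->f_op->write_iter)
		return file_in->f_op;

	/* Memory file is the destination: its ->copy_file_range() reads in. */
	if ((file_out->f_op->fop_flags & FOP_MEMORY_FILE) &&
	    file_out->f_op->copy_file_range && file_in->f_op->read_iter)
		return file_out->f_op;

	return NULL;
}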


>  /*
>   * Performs necessary checks before doing a file copy
>   *
> @@ -1484,11 +1498,23 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
>         struct inode *inode_out = file_inode(file_out);
>         uint64_t count = *req_count;
>         loff_t size_in;
> +       bool splice = flags & COPY_FILE_SPLICE;
> +       bool has_memory_file;
>         int ret;
>
> -       ret = generic_file_rw_checks(file_in, file_out);
> -       if (ret)
> -               return ret;
> +       /* Skip generic checks, allow cross-sb copies for dma-buf/tmpfs */
> +       has_memory_file = is_copy_memory_file_to_file(file_in, file_out) ||
> +                         is_copy_file_to_memory_file(file_in, file_out);
> +       if (!splice && has_memory_file) {
> +               if (!(file_in->f_mode & FMODE_READ) ||
> +                   !(file_out->f_mode & FMODE_WRITE) ||
> +                   (file_out->f_flags & O_APPEND))
> +                       return -EBADF;

I don't like that this both duplicates code and does not check all the checks in
generic_file_rw_checks() for the non-memory file side.
I do not currently have a suggestion how to make this less messy,
more human readable and correct.

> +       } else {
> +               ret = generic_file_rw_checks(file_in, file_out);
> +               if (ret)
> +                       return ret;
> +       }
>
>         /*
>          * We allow some filesystems to handle cross sb copy, but passing
> @@ -1500,7 +1526,7 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
>          * and several different sets of file_operations, but they all end up
>          * using the same ->copy_file_range() function pointer.
>          */
> -       if (flags & COPY_FILE_SPLICE) {
> +       if (splice || has_memory_file) {
>                 /* cross sb splice is allowed */

This comment is not accurate - it should say "cross fs-type", because it skips
the test that compares the ->copy_file_range pointers, not the sb.
If you are making changes to this function, this should be clarified.

>         } else if (file_out->f_op->copy_file_range) {
>                 if (file_in->f_op->copy_file_range !=
> @@ -1581,23 +1607,30 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>          * same sb using clone, but for filesystems where both clone and copy
>          * are supported (e.g. nfs,cifs), we only call the copy method.
>          */
> -       if (!splice && file_out->f_op->copy_file_range) {
> -               ret = file_out->f_op->copy_file_range(file_in, pos_in,
> -                                                     file_out, pos_out,
> -                                                     len, flags);
> -       } else if (!splice && file_in->f_op->remap_file_range && samesb) {
> -               ret = file_in->f_op->remap_file_range(file_in, pos_in,
> -                               file_out, pos_out,
> -                               min_t(loff_t, MAX_RW_COUNT, len),
> -                               REMAP_FILE_CAN_SHORTEN);
> -               /* fallback to splice */
> -               if (ret <= 0)
> +       if (!splice) {
> +               if (is_copy_memory_file_to_file(file_in, file_out)) {
> +                       ret = file_in->f_op->copy_file_range(file_in, pos_in,
> +                                       file_out, pos_out, len, flags);
> +               } else if (is_copy_file_to_memory_file(file_in, file_out)) {
> +                       ret = file_out->f_op->copy_file_range(file_in, pos_in,
> +                                       file_out, pos_out, len, flags);
> +               } else if (file_out->f_op->copy_file_range) {
> +                       ret = file_out->f_op->copy_file_range(file_in, pos_in,
> +                                                       file_out, pos_out,
> +                                                       len, flags);
> +               } else if (file_in->f_op->remap_file_range && samesb) {
> +                       ret = file_in->f_op->remap_file_range(file_in, pos_in,
> +                                       file_out, pos_out,
> +                                       min_t(loff_t, MAX_RW_COUNT, len),
> +                                       REMAP_FILE_CAN_SHORTEN);
> +                       /* fallback to splice */
> +                       if (ret <= 0)
> +                               splice = true;
> +               } else if (samesb) {
> +                       /* Fallback to splice for same sb copy for backward compat */
>                         splice = true;
> -       } else if (samesb) {
> -               /* Fallback to splice for same sb copy for backward compat */
> -               splice = true;
> +               }

This is the way-too-ugly-to-live part.
Was pretty bad before and now it is unacceptable.
way too much unneeded nesting and too hard to follow.

First of all, please use splice label and call goto splice (before the comment)
instead of adding another nesting level.
I would do that even if not adding memory fd copy support.

Second, please store the result of mem_fops = memory_copy_file_ops()
and start with:

+	if (mem_fops) {
+		ret = mem_fops->copy_file_range(file_in, pos_in,
+						file_out, pos_out,
+						len, flags);
+	} else if (file_out->f_op->copy_file_range) { ...

and update the comment above to say that:

+ * For copy to/from memory fd, we always call the copy method of the memory fd.
   */

I think that makes the code not more ugly than it already is.

Thanks,
Amir.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 3/4] udmabuf: Implement udmabuf rw_file callback
  2025-05-30 10:39 ` [PATCH v3 3/4] udmabuf: Implement udmabuf rw_file callback wangtao
@ 2025-05-30 14:24   ` kernel test robot
  2025-06-03  1:32     ` wangtao
  0 siblings, 1 reply; 9+ messages in thread
From: kernel test robot @ 2025-05-30 14:24 UTC (permalink / raw)
  To: wangtao, sumit.semwal, christian.koenig, kraxel, vivek.kasireddy,
	viro, brauner, hughd, akpm, amir73il
  Cc: oe-kbuild-all, benjamin.gaignard, Brian.Starkey, jstultz,
	tjmercier, jack, baolin.wang, linux-media, dri-devel,
	linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm,
	bintian.wang, yipengxiang, liulu.liu, feng.han, wangtao

Hi wangtao,

kernel test robot noticed the following build errors:

[auto build test ERROR on brauner-vfs/vfs.all]
[also build test ERROR on next-20250530]
[cannot apply to linus/master v6.15]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/wangtao/fs-allow-cross-FS-copy_file_range-for-memory-backed-files/20250530-184146
base:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.all
patch link:    https://lore.kernel.org/r/20250530103941.11092-4-tao.wangtao%40honor.com
patch subject: [PATCH v3 3/4] udmabuf: Implement udmabuf rw_file callback
config: sparc64-randconfig-002-20250530 (https://download.01.org/0day-ci/archive/20250530/202505302235.mDzENMSm-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250530/202505302235.mDzENMSm-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505302235.mDzENMSm-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   drivers/dma-buf/udmabuf.c: In function 'udmabuf_rw_file':
>> drivers/dma-buf/udmabuf.c:298:25: error: storage size of 'iter' isn't known
     298 |         struct iov_iter iter;
         |                         ^~~~
>> drivers/dma-buf/udmabuf.c:299:45: error: 'ITER_SOURCE' undeclared (first use in this function)
     299 |         unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST;
         |                                             ^~~~~~~~~~~
   drivers/dma-buf/udmabuf.c:299:45: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/dma-buf/udmabuf.c:299:59: error: 'ITER_DEST' undeclared (first use in this function)
     299 |         unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST;
         |                                                           ^~~~~~~~~
>> drivers/dma-buf/udmabuf.c:327:17: error: implicit declaration of function 'iov_iter_bvec'; did you mean 'bvec_iter_bvec'? [-Wimplicit-function-declaration]
     327 |                 iov_iter_bvec(&iter, direction, bvec, bv_idx, bv_total);
         |                 ^~~~~~~~~~~~~
         |                 bvec_iter_bvec
>> drivers/dma-buf/udmabuf.c:298:25: warning: unused variable 'iter' [-Wunused-variable]
     298 |         struct iov_iter iter;
         |                         ^~~~


vim +298 drivers/dma-buf/udmabuf.c

   286	
   287	static ssize_t udmabuf_rw_file(struct dma_buf *dmabuf, loff_t my_pos,
   288				struct file *other, loff_t pos,
   289				size_t count, bool is_write)
   290	{
   291		struct udmabuf *ubuf = dmabuf->priv;
   292		loff_t my_end = my_pos + count, bv_beg, bv_end = 0;
   293		pgoff_t pg_idx = my_pos / PAGE_SIZE;
   294		pgoff_t pg_end = DIV_ROUND_UP(my_end, PAGE_SIZE);
   295		size_t i, bv_off, bv_len, bv_num, bv_idx = 0, bv_total = 0;
   296		struct bio_vec *bvec;
   297		struct kiocb kiocb;
 > 298		struct iov_iter iter;
 > 299		unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST;
   300		ssize_t ret = 0, rw_total = 0;
   301		struct folio *folio;
   302	
   303		bv_num = min_t(size_t, pg_end - pg_idx + 1, 1024);
   304		bvec = kvcalloc(bv_num, sizeof(*bvec), GFP_KERNEL);
   305		if (!bvec)
   306			return -ENOMEM;
   307	
   308		init_sync_kiocb(&kiocb, other);
   309		kiocb.ki_pos = pos;
   310	
   311		for (i = 0; i < ubuf->nr_pinned && my_pos < my_end; i++) {
   312			folio = ubuf->pinned_folios[i];
   313			bv_beg = bv_end;
   314			bv_end += folio_size(folio);
   315			if (bv_end <= my_pos)
   316				continue;
   317	
   318			bv_len = min(bv_end, my_end) - my_pos;
   319			bv_off = my_pos - bv_beg;
   320			my_pos += bv_len;
   321			bv_total += bv_len;
   322			bvec_set_page(&bvec[bv_idx], &folio->page, bv_len, bv_off);
   323			if (++bv_idx < bv_num && my_pos < my_end)
   324				continue;
   325	
   326			/* start R/W if bvec is full or count reaches zero. */
 > 327			iov_iter_bvec(&iter, direction, bvec, bv_idx, bv_total);
   328			if (is_write)
   329				ret = other->f_op->write_iter(&kiocb, &iter);
   330			else
   331				ret = other->f_op->read_iter(&kiocb, &iter);
   332			if (ret <= 0)
   333				break;
   334			rw_total += ret;
   335			if (ret < bv_total || fatal_signal_pending(current))
   336				break;
   337	
   338			bv_idx = bv_total = 0;
   339		}
   340		kvfree(bvec);
   341	
   342		return rw_total > 0 ? rw_total : ret;
   343	}
   344	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH v3 1/4] fs: allow cross-FS copy_file_range for memory-backed files
  2025-05-30 13:43   ` Amir Goldstein
@ 2025-06-03  1:28     ` wangtao
  0 siblings, 0 replies; 9+ messages in thread
From: wangtao @ 2025-06-03  1:28 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: sumit.semwal@linaro.org, christian.koenig@amd.com,
	kraxel@redhat.com, vivek.kasireddy@intel.com,
	viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com,
	akpm@linux-foundation.org, benjamin.gaignard@collabora.com,
	Brian.Starkey@arm.com, jstultz@google.com, tjmercier@google.com,
	jack@suse.cz, baolin.wang@linux.alibaba.com,
	linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linaro-mm-sig@lists.linaro.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	wangbintian(BintianWang), yipengxiang, liulu 00013167,
	hanfeng 00012985



> -----Original Message-----
> From: Amir Goldstein <amir73il@gmail.com>
> Sent: Friday, May 30, 2025 9:44 PM
> To: wangtao <tao.wangtao@honor.com>
> Cc: sumit.semwal@linaro.org; christian.koenig@amd.com;
> kraxel@redhat.com; vivek.kasireddy@intel.com; viro@zeniv.linux.org.uk;
> brauner@kernel.org; hughd@google.com; akpm@linux-foundation.org;
> benjamin.gaignard@collabora.com; Brian.Starkey@arm.com;
> jstultz@google.com; tjmercier@google.com; jack@suse.cz;
> baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org; dri-
> devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-
> kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
> mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>;
> yipengxiang <yipengxiang@honor.com>; liulu 00013167
> <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com>
> Subject: Re: [PATCH v3 1/4] fs: allow cross-FS copy_file_range for memory-
> backed files
> 
> On Fri, May 30, 2025 at 12:40 PM wangtao <tao.wangtao@honor.com> wrote:
> >
> > Memory-backed files can optimize copy performance via copy_file_range
> > callbacks. Compared to mmap&read: reduces GUP (get_user_pages)
> > overhead; vs sendfile/splice: eliminates one memory copy; supports
> > dmabuf zero-copy implementation.
> >
> > Signed-off-by: wangtao <tao.wangtao@honor.com>
> > ---
> 
> Hi wangtao,
> 
> Let me rephrase my reaction gently:
> Regardless of the proposed API extension, and referring to the
> implementation itself - I wrote the current code and I can barely understand
> the logic every time I come back to read it.
> 
> Your changes make it too messy to be reviewed/maintained.
> I have a few suggestions for simplifications (see below).
> 
> The complication arises from the fact that normally the destination fops
> are used to call the copy_file_range() method, and this case deviates from
> that standard; we need to make the code simpler by using a helper to choose
> the fops to use.
> 
> >  fs/read_write.c    | 71 +++++++++++++++++++++++++++++++++-----------
> --
> >  include/linux/fs.h |  2 ++
> >  2 files changed, 54 insertions(+), 19 deletions(-)
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c index
> > bb0ed26a0b3a..591c6db7b785 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1469,6 +1469,20 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int,
> out_fd,
> > int, in_fd,  }  #endif
> >
> > +static inline bool is_copy_memory_file_to_file(struct file *file_in,
> > +                               struct file *file_out) {
> > +       return (file_in->f_op->fop_flags & FOP_MEMORY_FILE) &&
> > +               file_in->f_op->copy_file_range &&
> > +file_out->f_op->write_iter; }
> > +
> > +static inline bool is_copy_file_to_memory_file(struct file *file_in,
> > +                               struct file *file_out) {
> > +       return (file_out->f_op->fop_flags & FOP_MEMORY_FILE) &&
> > +               file_in->f_op->read_iter &&
> > +file_out->f_op->copy_file_range; }
> > +
> 
> Please combine that to a helper:
> const struct file_operations *memory_copy_file_ops(struct file *file_in,
> struct file *file_out) which returns either file_in->f_op, file_out->f_op or
> NULL.
> 
Great suggestions! I'll update them in the new patch version soon.

> 
> >  /*
> >   * Performs necessary checks before doing a file copy
> >   *
> > @@ -1484,11 +1498,23 @@ static int generic_copy_file_checks(struct file
> *file_in, loff_t pos_in,
> >         struct inode *inode_out = file_inode(file_out);
> >         uint64_t count = *req_count;
> >         loff_t size_in;
> > +       bool splice = flags & COPY_FILE_SPLICE;
> > +       bool has_memory_file;
> >         int ret;
> >
> > -       ret = generic_file_rw_checks(file_in, file_out);
> > -       if (ret)
> > -               return ret;
> > +       /* Skip generic checks, allow cross-sb copies for dma-buf/tmpfs */
> > +       has_memory_file = is_copy_memory_file_to_file(file_in, file_out) ||
> > +                         is_copy_file_to_memory_file(file_in, file_out);
> > +       if (!splice && has_memory_file) {
> > +               if (!(file_in->f_mode & FMODE_READ) ||
> > +                   !(file_out->f_mode & FMODE_WRITE) ||
> > +                   (file_out->f_flags & O_APPEND))
> > +                       return -EBADF;
> 
> I don't like that this both duplicates code and does not check all the checks in
> generic_file_rw_checks() for the non-memory file side.
> I do not currently have a suggestion how to make this less messy, more
> human readable and correct.
Since dmabuf files aren't regular files, we'll skip generic_file_rw_checks.
Adding an essential_file_rw_checks() could reduce code duplication.
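
Roughly along these lines (name and exact checks still to be settled):

/* Mode/flags checks shared by regular and memory-backed files. */
static int essential_file_rw_checks(struct file *file_in, struct file *file_out)
{
	if (!(file_in->f_mode & FMODE_READ) ||
	    !(file_out->f_mode & FMODE_WRITE) ||
	    (file_out->f_flags & O_APPEND))
		return -EBADF;

	return 0;
}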

> 
> > +       } else {
> > +               ret = generic_file_rw_checks(file_in, file_out);
> > +               if (ret)
> > +                       return ret;
> > +       }
> >
> >         /*
> >          * We allow some filesystems to handle cross sb copy, but
> > passing @@ -1500,7 +1526,7 @@ static int generic_copy_file_checks(struct
> file *file_in, loff_t pos_in,
> >          * and several different sets of file_operations, but they all end up
> >          * using the same ->copy_file_range() function pointer.
> >          */
> > -       if (flags & COPY_FILE_SPLICE) {
> > +       if (splice || has_memory_file) {
> >                 /* cross sb splice is allowed */
> 
> This comment is not accurate - it should say "cross fs-type", because it skips
> the test that compares the ->copy_file_range pointers, not the sb.
> If you are making changes to this function, this should be clarified.
> 
Will fix this shortly.

> >         } else if (file_out->f_op->copy_file_range) {
> >                 if (file_in->f_op->copy_file_range != @@ -1581,23
> > +1607,30 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> >          * same sb using clone, but for filesystems where both clone and copy
> >          * are supported (e.g. nfs,cifs), we only call the copy method.
> >          */
> > -       if (!splice && file_out->f_op->copy_file_range) {
> > -               ret = file_out->f_op->copy_file_range(file_in, pos_in,
> > -                                                     file_out, pos_out,
> > -                                                     len, flags);
> > -       } else if (!splice && file_in->f_op->remap_file_range && samesb) {
> > -               ret = file_in->f_op->remap_file_range(file_in, pos_in,
> > -                               file_out, pos_out,
> > -                               min_t(loff_t, MAX_RW_COUNT, len),
> > -                               REMAP_FILE_CAN_SHORTEN);
> > -               /* fallback to splice */
> > -               if (ret <= 0)
> > +       if (!splice) {
> > +               if (is_copy_memory_file_to_file(file_in, file_out)) {
> > +                       ret = file_in->f_op->copy_file_range(file_in, pos_in,
> > +                                       file_out, pos_out, len, flags);
> > +               } else if (is_copy_file_to_memory_file(file_in, file_out)) {
> > +                       ret = file_out->f_op->copy_file_range(file_in, pos_in,
> > +                                       file_out, pos_out, len, flags);
> > +               } else if (file_out->f_op->copy_file_range) {
> > +                       ret = file_out->f_op->copy_file_range(file_in, pos_in,
> > +                                                       file_out, pos_out,
> > +                                                       len, flags);
> > +               } else if (file_in->f_op->remap_file_range && samesb) {
> > +                       ret = file_in->f_op->remap_file_range(file_in, pos_in,
> > +                                       file_out, pos_out,
> > +                                       min_t(loff_t, MAX_RW_COUNT, len),
> > +                                       REMAP_FILE_CAN_SHORTEN);
> > +                       /* fallback to splice */
> > +                       if (ret <= 0)
> > +                               splice = true;
> > +               } else if (samesb) {
> > +                       /* Fallback to splice for same sb copy for
> > + backward compat */
> >                         splice = true;
> > -       } else if (samesb) {
> > -               /* Fallback to splice for same sb copy for backward compat */
> > -               splice = true;
> > +               }
> 
> This is the way-too-ugly-to-live part.
> Was pretty bad before and now it is unacceptable.
> way too much unneeded nesting and too hard to follow.
> 
> First of all, please use splice label and call goto splice (before the comment)
> instead of adding another nesting level.
> I would do that even if not adding memory fd copy support.
> 
> Second, please store the result of mem_fops = memory_copy_file_ops()
> and start with:
> 
> +	if (mem_fops) {
> +		ret = mem_fops->copy_file_range(file_in, pos_in,
> +						file_out, pos_out,
> +						len, flags);
> +	} else if (file_out->f_op->copy_file_range) { ...
> 
> and update the comment above to say that:
> 
> + * For copy to/from memory fd, we always call the copy method of the memory fd.
>    */
> 
> I think that makes the code not more ugly than it already is.
> 
> Thanks,
> Amir.
Good advice - thanks a lot!

Regards,
Wangtao.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH v3 3/4] udmabuf: Implement udmabuf rw_file callback
  2025-05-30 14:24   ` kernel test robot
@ 2025-06-03  1:32     ` wangtao
  0 siblings, 0 replies; 9+ messages in thread
From: wangtao @ 2025-06-03  1:32 UTC (permalink / raw)
  To: kernel test robot, sumit.semwal@linaro.org,
	christian.koenig@amd.com, kraxel@redhat.com,
	vivek.kasireddy@intel.com, viro@zeniv.linux.org.uk,
	brauner@kernel.org, hughd@google.com, akpm@linux-foundation.org,
	amir73il@gmail.com
  Cc: oe-kbuild-all@lists.linux.dev, benjamin.gaignard@collabora.com,
	Brian.Starkey@arm.com, jstultz@google.com, tjmercier@google.com,
	jack@suse.cz, baolin.wang@linux.alibaba.com,
	linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linaro-mm-sig@lists.linaro.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	wangbintian(BintianWang), yipengxiang, liulu 00013167,
	hanfeng 00012985



> -----Original Message-----
> From: kernel test robot <lkp@intel.com>
> Sent: Friday, May 30, 2025 10:25 PM
> To: wangtao <tao.wangtao@honor.com>; sumit.semwal@linaro.org;
> christian.koenig@amd.com; kraxel@redhat.com; vivek.kasireddy@intel.com;
> viro@zeniv.linux.org.uk; brauner@kernel.org; hughd@google.com;
> akpm@linux-foundation.org; amir73il@gmail.com
> Cc: oe-kbuild-all@lists.linux.dev; benjamin.gaignard@collabora.com;
> Brian.Starkey@arm.com; jstultz@google.com; tjmercier@google.com;
> jack@suse.cz; baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org;
> dri-devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-
> kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
> mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>;
> yipengxiang <yipengxiang@honor.com>; liulu 00013167
> <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com>; wangtao
> <tao.wangtao@honor.com>
> Subject: Re: [PATCH v3 3/4] udmabuf: Implement udmabuf rw_file callback
> 
> Hi wangtao,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on brauner-vfs/vfs.all] [also build test ERROR on next-
> 20250530] [cannot apply to linus/master v6.15] [If your patch is applied to the
> wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/wangtao/fs-allow-
> cross-FS-copy_file_range-for-memory-backed-files/20250530-184146
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.all
> patch link:    https://lore.kernel.org/r/20250530103941.11092-4-
> tao.wangtao%40honor.com
> patch subject: [PATCH v3 3/4] udmabuf: Implement udmabuf rw_file callback
> config: sparc64-randconfig-002-20250530 (https://download.01.org/0day-
> ci/archive/20250530/202505302235.mDzENMSm-lkp@intel.com/config)
> compiler: sparc64-linux-gcc (GCC) 15.1.0 reproduce (this is a W=1 build):
> (https://download.01.org/0day-
> ci/archive/20250530/202505302235.mDzENMSm-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes:
> | https://lore.kernel.org/oe-kbuild-all/202505302235.mDzENMSm-lkp@intel.
> | com/
> 
> All error/warnings (new ones prefixed by >>):
> 
Quick note: I don't have a local sparc64 build setup, so I'll explicitly
add the missing header dependencies to be safe.

Regards,
Wangtao.
>    drivers/dma-buf/udmabuf.c: In function 'udmabuf_rw_file':
> >> drivers/dma-buf/udmabuf.c:298:25: error: storage size of 'iter' isn't
> >> known
>      298 |         struct iov_iter iter;
>          |                         ^~~~
> >> drivers/dma-buf/udmabuf.c:299:45: error: 'ITER_SOURCE' undeclared
> >> (first use in this function)
>      299 |         unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST;
>          |                                             ^~~~~~~~~~~
>    drivers/dma-buf/udmabuf.c:299:45: note: each undeclared identifier is
> reported only once for each function it appears in
> >> drivers/dma-buf/udmabuf.c:299:59: error: 'ITER_DEST' undeclared
> >> (first use in this function)
>      299 |         unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST;
>          |                                                           ^~~~~~~~~
> >> drivers/dma-buf/udmabuf.c:327:17: error: implicit declaration of
> >> function 'iov_iter_bvec'; did you mean 'bvec_iter_bvec'?
> >> [-Wimplicit-function-declaration]
>      327 |                 iov_iter_bvec(&iter, direction, bvec, bv_idx, bv_total);
>          |                 ^~~~~~~~~~~~~
>          |                 bvec_iter_bvec
> >> drivers/dma-buf/udmabuf.c:298:25: warning: unused variable 'iter'
> >> [-Wunused-variable]
>      298 |         struct iov_iter iter;
>          |                         ^~~~
> 
> 
> vim +298 drivers/dma-buf/udmabuf.c
> 
>    286
>    287	static ssize_t udmabuf_rw_file(struct dma_buf *dmabuf, loff_t
> my_pos,
>    288				struct file *other, loff_t pos,
>    289				size_t count, bool is_write)
>    290	{
>    291		struct udmabuf *ubuf = dmabuf->priv;
>    292		loff_t my_end = my_pos + count, bv_beg, bv_end = 0;
>    293		pgoff_t pg_idx = my_pos / PAGE_SIZE;
>    294		pgoff_t pg_end = DIV_ROUND_UP(my_end, PAGE_SIZE);
>    295		size_t i, bv_off, bv_len, bv_num, bv_idx = 0, bv_total = 0;
>    296		struct bio_vec *bvec;
>    297		struct kiocb kiocb;
>  > 298		struct iov_iter iter;
>  > 299		unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST;
>    300		ssize_t ret = 0, rw_total = 0;
>    301		struct folio *folio;
>    302
>    303		bv_num = min_t(size_t, pg_end - pg_idx + 1, 1024);
>    304		bvec = kvcalloc(bv_num, sizeof(*bvec), GFP_KERNEL);
>    305		if (!bvec)
>    306			return -ENOMEM;
>    307
>    308		init_sync_kiocb(&kiocb, other);
>    309		kiocb.ki_pos = pos;
>    310
>    311		for (i = 0; i < ubuf->nr_pinned && my_pos < my_end; i++) {
>    312			folio = ubuf->pinned_folios[i];
>    313			bv_beg = bv_end;
>    314			bv_end += folio_size(folio);
>    315			if (bv_end <= my_pos)
>    316				continue;
>    317
>    318			bv_len = min(bv_end, my_end) - my_pos;
>    319			bv_off = my_pos - bv_beg;
>    320			my_pos += bv_len;
>    321			bv_total += bv_len;
>    322			bvec_set_page(&bvec[bv_idx], &folio->page, bv_len,
> bv_off);
>    323			if (++bv_idx < bv_num && my_pos < my_end)
>    324				continue;
>    325
>    326			/* start R/W if bvec is full or count reaches zero. */
>  > 327			iov_iter_bvec(&iter, direction, bvec, bv_idx,
> bv_total);
>    328			if (is_write)
>    329				ret = other->f_op->write_iter(&kiocb, &iter);
>    330			else
>    331				ret = other->f_op->read_iter(&kiocb, &iter);
>    332			if (ret <= 0)
>    333				break;
>    334			rw_total += ret;
>    335			if (ret < bv_total || fatal_signal_pending(current))
>    336				break;
>    337
>    338			bv_idx = bv_total = 0;
>    339		}
>    340		kvfree(bvec);
>    341
>    342		return rw_total > 0 ? rw_total : ret;
>    343	}
>    344
> 
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-06-03  1:32 UTC | newest]

Thread overview: 9+ messages
-- links below jump to the message on this page --
2025-05-30 10:39 [PATCH v3 0/4] Optimizing disk file & dmabuf copies via copy_file_range wangtao
2025-05-30 10:39 ` [PATCH v3 1/4] fs: allow cross-FS copy_file_range for memory-backed files wangtao
2025-05-30 13:43   ` Amir Goldstein
2025-06-03  1:28     ` wangtao
2025-05-30 10:39 ` [PATCH v3 2/4] dmabuf: Implement copy_file_range for dmabuf wangtao
2025-05-30 10:39 ` [PATCH v3 3/4] udmabuf: Implement udmabuf rw_file callback wangtao
2025-05-30 14:24   ` kernel test robot
2025-06-03  1:32     ` wangtao
2025-05-30 10:39 ` [PATCH v3 4/4] dmabuf:system_heap Implement system_heap exporter's " wangtao
