* [PATCH v3 0/4] Optimizing disk file & dmabuf copies via copy_file_range.
@ 2025-05-30 10:39 wangtao
From: wangtao @ 2025-05-30 10:39 UTC
To: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro,
brauner, hughd, akpm, amir73il
Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack,
baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel,
linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu,
feng.han, wangtao
Typical dmabuf file reading steps:
1. dmabuf_fd = dmabuf_alloc(len, heap_fd)
2. vaddr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, dmabuf_fd, 0)
3. file_fd = open(file_path, O_RDONLY)
4. read(file_fd, vaddr, len)
dmabuf's attachment/map/fence model uses remap_pfn_range for mmap, which
lacks Direct I/O support [1]. Buffered I/O therefore adds noticeable latency
when loading large AI model files.
My previous patch added a dmabuf ioctl for Direct I/O file operations,
which showed good performance on low-power CPUs [2][3]. Christian suggested
using existing uAPIs (read/sendfile/splice/copy_file_range) instead of
adding new ioctls.
sendfile/splice/copy_file_range (c_f_r) enable zero-copy via Direct I/O
for disk-to-network and disk-to-disk transfers:
sendfile(skt_fd, disk_fd): [DISK] -DMA-> [pipe(buf)] -DMA-> [NIC]
sendfile(dst_disk, src_disk): [DISK] -DMA-> [pipe(buf)] -DMA-> [DISK]
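For reference, the disk-to-disk flow above as a minimal userspace sketch
(fd names are illustrative):

#include <sys/sendfile.h>
#include <sys/types.h>

/* [DISK] -DMA-> [pipe(buf)] -DMA-> [DISK]: the data stays in the kernel,
 * so no userspace buffer, no GUP, and no user-side CPU copy. */
static ssize_t copy_disk_to_disk(int src_disk, int dst_disk, size_t len)
{
	off_t off = 0;

	return sendfile(dst_disk, src_disk, &off, len);
}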
Analysis shows the existing uAPIs cannot achieve zero-copy from disk to
dmabuf. Since dmabuf lacks the needed file ops, tmpfs stands in for the
disk-to-memory CPU-copy analysis:
| Method | CPU Copies | Key Overhead |
|-------------------|------------|----------------------------|
| 1. Buffer R+W | 2 | Alloc(cache) & 2 CPU copies|
| 2. Direct R+W | 1 | GUP(usr_buf) & 1 CPU copy |
| 3. Mmap+Buffer R | 1 | Alloc(cache) & 1 CPU copy |
| 4. Mmap+Direct R | 0 | GUP(mem_buf) ~50% CPU |
| 5. Buffer Sendfile| 1 | Alloc(cache) & 1 CPU copy |
| 6. Direct Sendfile| 1 | Small pipe, high IRQ |
| 7. Buffer Splice | 1 | Alloc(cache) & 1 CPU copy |
| 8. Direct Splice | 1 | Larger pipe buffer |
| 9. c_f_r | N/A | Cross-FS blocked |
GUP: get_user_pages
Alloc(cache): allocate page cache
c_f_r: copy_file_range
Data flows:
1. [DISK] -DMA-> [Alloc(cache)] -COPY-> [usr_buf] -COPY-> [MEM]
2. [DISK] -DMA-> [GUP(usr_buf)] -COPY-> [MEM]
3. [DISK] -DMA-> [Alloc(cache)] -COPY-> [mem_buf]
4. [DISK] -DMA-> [GUP(mem_buf)]
5. [DISK] -DMA-> [pipe(Alloc(cache))] -COPY-> [tmpfs page]
6. [DISK] -DMA-> [pipe(buf)] -COPY-> [tmpfs page]
7. [DISK] -DMA-> [big_pipe(Alloc(cache))] -COPY-> [tmpfs page]
8. [DISK] -DMA-> [big_pipe(buf)] -COPY-> [tmpfs page]
9. [DISK] -DMA-> [tmpfs page] (blocked)
Key findings:
- Buffer I/O requires page cache allocation and at least one CPU copy.
- Read+Write incurs excessive CPU copies and is not analyzed further;
  later comparisons use mmap+Read ("Read") instead.
- Mmap+Direct needs zero CPU copies, but GUP costs ~50% CPU, and dmabuf's
  mmap does not support Direct I/O at all.
- sendfile/splice require an intermediate pipe, costing one CPU copy.
- c_f_r limitations: cross-FS copies are blocked and memory-backed
  filesystems are not supported.
Modifications:
1. Enable cross-FS c_f_r for memory file types.
2. Add dmabuf c_f_r callbacks for [DISK] -DMA-> [dmabuf].
3. Test tmpfs c_f_r locally only (not submitted upstream) due to
   lock_page deadlock risks.
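As an illustration, a hedged userspace sketch of the copy path these
changes enable (dmabuf_fd comes from a dma-heap allocation; error handling
is trimmed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy a disk file into a dmabuf with copy_file_range() and O_DIRECT,
 * avoiding both GUP and the intermediate pipe copy. */
static ssize_t load_into_dmabuf(const char *path, int dmabuf_fd, size_t len)
{
	loff_t in_off = 0, out_off = 0;
	ssize_t n, total = 0;
	int file_fd = open(path, O_RDONLY | O_DIRECT);

	if (file_fd < 0)
		return -1;
	while ((size_t)total < len) {
		n = copy_file_range(file_fd, &in_off, dmabuf_fd, &out_off,
				    len - total, 0);
		if (n <= 0)
			break;	/* error or EOF */
		total += n;
	}
	close(file_fd);
	return total;
}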
Performance (1GHz CPU, UFS4@4GB):
1. tmpfs(memfd) direct c_f_r (1197 MB/s): +15% vs mmap&read (1014 MB/s)
2. udmabuf+memfd direct c_f_r (2318 MB/s): +50% vs mmap&read (1457 MB/s)
3. dmabuf direct c_f_r (3405 MB/s): 260% faster than buffer I/O (918 MB/s),
   40% faster than udmabuf direct c_f_r (2318 MB/s)
| 32x32MB Read 1024MB |Creat-ms|Close-ms| I/O-ms|I/O-MB/s| I/O%
|-------------------------|--------|--------|--------|--------|-----
| 1)Beg dmabuf buffer R/W| 52 | 5 | 1170 | 918 | 100%
| 2) udmabuf buffer R/W| 591 | 326 | 1281 | 838 | 91%
| 3) memfd buffer R/W| 1 | 323 | 2370 | 453 | 49%
| 4) memfd direct R/W| 1 | 321 | 1058 | 1014 | 110%
| 5) memfd buffer sendfile| 1 | 329 | 1577 | 681 | 74%
| 6) memfd direct sendfile| 1 | 327 | 2672 | 401 | 43%
| 7) memfd buffer splice| 2 | 321 | 1729 | 621 | 67%
| 8) memfd direct splice| 2 | 324 | 1528 | 702 | 76%
| 9) memfd buffer c_f_r| 1 | 325 | 1586 | 677 | 73%
|10) memfd direct c_f_r| 1 | 323 | 897 | 1197 | 130%
|11) u+mfd buffer R/W| 609 | 344 | 2207 | 486 | 52%
|12) u+mfd direct R/W| 580 | 342 | 737 | 1457 | 158%
|13) u+mfd buffer sendfile| 582 | 343 | 1270 | 845 | 92%
|14) u+mfd direct sendfile| 573 | 344 | 2254 | 476 | 51%
|15) u+mfd buffer splice| 584 | 341 | 1202 | 893 | 97%
|16) u+mfd direct splice| 564 | 340 | 851 | 1263 | 137%
|17) u+mfd buffer c_f_r| 585 | 344 | 1244 | 863 | 94%
|18) u+mfd direct c_f_r| 578 | 341 | 581 | 1848 | 201%
|19) udmabuf buffer c_f_r| 585 | 328 | 1163 | 923 | 100%
|20) udmabuf direct c_f_r| 579 | 328 | 464 | 2318 | 252%
|21) dmabuf buffer c_f_r| 48 | 5 | 1058 | 1015 | 110%
|22) dmabuf direct c_f_r| 48 | 5 | 316 | 3405 | 370%
|23)End dmabuf buffer R/W| 48 | 5 | 1173 | 915 | 99%
u+mfd = udma+memfd = udmabuf + pre-allocated memfd combo.
Cache cleared during tests to simulate real-world large file loading.
dmabuf file use cases:
- Loading large AI models using dmabuf
- Real-time data capture and storage with dmabuf
- Persisting task snapshots in Android
v1: [2]
v1 -> v2: [3]
The dma-buf exporter verifies exclusive access to the dmabuf's sg_table.
v2 -> v3:
copy_file_range supports copying from disk files to memory files.
Implement the copy_file_range callback functions for dmabuf/udmabuf.
Reference:
[1] https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com
[2] https://lore.kernel.org/all/20250513092803.2096-1-tao.wangtao@honor.com
[3] https://lore.kernel.org/all/20250516092148.12778-1-tao.wangtao@honor.com
wangtao (4):
fs: allow cross-FS copy_file_range for memory-backed files
dmabuf: Implement copy_file_range for dmabuf
udmabuf: Implement udmabuf rw_file callback
dmabuf:system_heap Implement system_heap exporter's rw_file callback.
drivers/dma-buf/dma-buf.c | 32 ++++++++++++
drivers/dma-buf/heaps/system_heap.c | 79 +++++++++++++++++++++++++++++
drivers/dma-buf/udmabuf.c | 59 +++++++++++++++++++++
fs/read_write.c | 71 +++++++++++++++++++-------
include/linux/dma-buf.h | 16 ++++++
include/linux/fs.h | 2 +
6 files changed, 240 insertions(+), 19 deletions(-)
--
2.17.1
* [PATCH v3 1/4] fs: allow cross-FS copy_file_range for memory-backed files
From: wangtao @ 2025-05-30 10:39 UTC
To: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro,
brauner, hughd, akpm, amir73il
Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack,
baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel,
linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu,
feng.han, wangtao
Memory-backed files can optimize copy performance via copy_file_range
callbacks. Compared to mmap&read, this reduces GUP (get_user_pages)
overhead; compared to sendfile/splice, it eliminates one memory copy;
and it enables a zero-copy dmabuf implementation.
Signed-off-by: wangtao <tao.wangtao@honor.com>
---
fs/read_write.c | 71 +++++++++++++++++++++++++++++++++-------------
include/linux/fs.h | 2 ++
2 files changed, 54 insertions(+), 19 deletions(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index bb0ed26a0b3a..591c6db7b785 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1469,6 +1469,20 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
}
#endif
+static inline bool is_copy_memory_file_to_file(struct file *file_in,
+ struct file *file_out)
+{
+ return (file_in->f_op->fop_flags & FOP_MEMORY_FILE) &&
+ file_in->f_op->copy_file_range && file_out->f_op->write_iter;
+}
+
+static inline bool is_copy_file_to_memory_file(struct file *file_in,
+ struct file *file_out)
+{
+ return (file_out->f_op->fop_flags & FOP_MEMORY_FILE) &&
+ file_in->f_op->read_iter && file_out->f_op->copy_file_range;
+}
+
/*
* Performs necessary checks before doing a file copy
*
@@ -1484,11 +1498,23 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
struct inode *inode_out = file_inode(file_out);
uint64_t count = *req_count;
loff_t size_in;
+ bool splice = flags & COPY_FILE_SPLICE;
+ bool has_memory_file;
int ret;
- ret = generic_file_rw_checks(file_in, file_out);
- if (ret)
- return ret;
+ /* Skip generic checks, allow cross-sb copies for dma-buf/tmpfs */
+ has_memory_file = is_copy_memory_file_to_file(file_in, file_out) ||
+ is_copy_file_to_memory_file(file_in, file_out);
+ if (!splice && has_memory_file) {
+ if (!(file_in->f_mode & FMODE_READ) ||
+ !(file_out->f_mode & FMODE_WRITE) ||
+ (file_out->f_flags & O_APPEND))
+ return -EBADF;
+ } else {
+ ret = generic_file_rw_checks(file_in, file_out);
+ if (ret)
+ return ret;
+ }
/*
* We allow some filesystems to handle cross sb copy, but passing
@@ -1500,7 +1526,7 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
* and several different sets of file_operations, but they all end up
* using the same ->copy_file_range() function pointer.
*/
- if (flags & COPY_FILE_SPLICE) {
+ if (splice || has_memory_file) {
/* cross sb splice is allowed */
} else if (file_out->f_op->copy_file_range) {
if (file_in->f_op->copy_file_range !=
@@ -1581,23 +1607,30 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
* same sb using clone, but for filesystems where both clone and copy
* are supported (e.g. nfs,cifs), we only call the copy method.
*/
- if (!splice && file_out->f_op->copy_file_range) {
- ret = file_out->f_op->copy_file_range(file_in, pos_in,
- file_out, pos_out,
- len, flags);
- } else if (!splice && file_in->f_op->remap_file_range && samesb) {
- ret = file_in->f_op->remap_file_range(file_in, pos_in,
- file_out, pos_out,
- min_t(loff_t, MAX_RW_COUNT, len),
- REMAP_FILE_CAN_SHORTEN);
- /* fallback to splice */
- if (ret <= 0)
+ if (!splice) {
+ if (is_copy_memory_file_to_file(file_in, file_out)) {
+ ret = file_in->f_op->copy_file_range(file_in, pos_in,
+ file_out, pos_out, len, flags);
+ } else if (is_copy_file_to_memory_file(file_in, file_out)) {
+ ret = file_out->f_op->copy_file_range(file_in, pos_in,
+ file_out, pos_out, len, flags);
+ } else if (file_out->f_op->copy_file_range) {
+ ret = file_out->f_op->copy_file_range(file_in, pos_in,
+ file_out, pos_out,
+ len, flags);
+ } else if (file_in->f_op->remap_file_range && samesb) {
+ ret = file_in->f_op->remap_file_range(file_in, pos_in,
+ file_out, pos_out,
+ min_t(loff_t, MAX_RW_COUNT, len),
+ REMAP_FILE_CAN_SHORTEN);
+ /* fallback to splice */
+ if (ret <= 0)
+ splice = true;
+ } else if (samesb) {
+ /* Fallback to splice for same sb copy for backward compat */
splice = true;
- } else if (samesb) {
- /* Fallback to splice for same sb copy for backward compat */
- splice = true;
+ }
}
-
file_end_write(file_out);
if (!splice)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 016b0fe1536e..37df1b497418 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2187,6 +2187,8 @@ struct file_operations {
#define FOP_ASYNC_LOCK ((__force fop_flags_t)(1 << 6))
/* File system supports uncached read/write buffered IO */
#define FOP_DONTCACHE ((__force fop_flags_t)(1 << 7))
+/* Supports cross-FS copy_file_range for memory file */
+#define FOP_MEMORY_FILE ((__force fop_flags_t)(1 << 8))
/* Wrap a directory iterator that needs exclusive inode access */
int wrap_directory_iterator(struct file *, struct dir_context *,
--
2.17.1
* [PATCH v3 2/4] dmabuf: Implement copy_file_range for dmabuf
From: wangtao @ 2025-05-30 10:39 UTC
To: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro,
brauner, hughd, akpm, amir73il
Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack,
baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel,
linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu,
feng.han, wangtao
First determine whether the dmabuf reads from or writes to the other file,
then call the exporter's rw_file callback.
Signed-off-by: wangtao <tao.wangtao@honor.com>
---
drivers/dma-buf/dma-buf.c | 32 ++++++++++++++++++++++++++++++++
include/linux/dma-buf.h | 16 ++++++++++++++++
2 files changed, 48 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 5baa83b85515..fc9bf54c921a 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -523,7 +523,38 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file)
spin_unlock(&dmabuf->name_lock);
}
+static ssize_t dma_buf_rw_file(struct dma_buf *dmabuf, loff_t my_pos,
+ struct file *file, loff_t pos, size_t count, bool is_write)
+{
+ if (!dmabuf->ops->rw_file)
+ return -EINVAL;
+
+ if (my_pos >= dmabuf->size)
+ count = 0;
+ else
+ count = min_t(size_t, count, dmabuf->size - my_pos);
+ if (!count)
+ return 0;
+
+ return dmabuf->ops->rw_file(dmabuf, my_pos, file, pos, count, is_write);
+}
+
+static ssize_t dma_buf_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t count, unsigned int flags)
+{
+ if (is_dma_buf_file(file_in) && file_out->f_op->write_iter)
+ return dma_buf_rw_file(file_in->private_data, pos_in,
+ file_out, pos_out, count, true);
+ else if (is_dma_buf_file(file_out) && file_in->f_op->read_iter)
+ return dma_buf_rw_file(file_out->private_data, pos_out,
+ file_in, pos_in, count, false);
+ else
+ return -EINVAL;
+}
+
static const struct file_operations dma_buf_fops = {
+ .fop_flags = FOP_MEMORY_FILE,
.release = dma_buf_file_release,
.mmap = dma_buf_mmap_internal,
.llseek = dma_buf_llseek,
@@ -531,6 +562,7 @@ static const struct file_operations dma_buf_fops = {
.unlocked_ioctl = dma_buf_ioctl,
.compat_ioctl = compat_ptr_ioctl,
.show_fdinfo = dma_buf_show_fdinfo,
+ .copy_file_range = dma_buf_copy_file_range,
};
/*
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 36216d28d8bd..d3636e985399 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -22,6 +22,7 @@
#include <linux/fs.h>
#include <linux/dma-fence.h>
#include <linux/wait.h>
+#include <uapi/linux/dma-buf.h>
struct device;
struct dma_buf;
@@ -285,6 +286,21 @@ struct dma_buf_ops {
int (*vmap)(struct dma_buf *dmabuf, struct iosys_map *map);
void (*vunmap)(struct dma_buf *dmabuf, struct iosys_map *map);
+
+ /**
+ * @rw_file:
+ *
+ * If an Exporter needs to support Direct I/O file operations, it can
+ * implement this optional callback. The exporter must verify that no
+ * other objects hold the sg_table, ensure exclusive access to the
+ * dmabuf's sg_table, and only then proceed with the I/O operation.
+ *
+ * Returns:
+ *
+ * The number of bytes copied on success or a negative error code on failure.
+ */
+ ssize_t (*rw_file)(struct dma_buf *dmabuf, loff_t my_pos,
+ struct file *file, loff_t pos, size_t count, bool is_write);
};
/**
--
2.17.1
* [PATCH v3 3/4] udmabuf: Implement udmabuf rw_file callback
From: wangtao @ 2025-05-30 10:39 UTC
To: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro,
brauner, hughd, akpm, amir73il
Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack,
baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel,
linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu,
feng.han, wangtao
Construct a bio_vec array from the udmabuf's pinned folios, then call the
other file's read_iter/write_iter callbacks to perform the I/O.
Test data shows direct I/O copy_file_range improves performance by
over 50% vs direct I/O mmap&read (2557 MB/s vs 1534 MB/s).
Test data:
| 32x32MB Read 1024MB |Creat-ms|Close-ms| I/O-ms|I/O-MB/s| I/O%
|-------------------------|--------|--------|--------|--------|-----
| 1)Beg udmabuf buffer R/W| 580 | 323 | 1238 | 867 | 100%
| 2) dmabuf buffer R/W| 48 | 5 | 1149 | 934 | 107%
| 3) udma+memfd buffer R/W| 597 | 340 | 2157 | 497 | 57%
| 4) udma+memfd direct R/W| 573 | 340 | 700 | 1534 | 176%
| 5) u+mfd buffer sendfile| 577 | 340 | 1204 | 891 | 102%
| 6) u+mfd direct sendfile| 567 | 339 | 2272 | 472 | 54%
| 7) u+mfd buffer splice| 570 | 337 | 1114 | 964 | 111%
| 8) u+mfd direct splice| 564 | 335 | 793 | 1355 | 156%
| 9) udmabuf buffer c_f_r| 577 | 323 | 1059 | 1014 | 116%
|10) udmabuf direct c_f_r| 582 | 325 | 420 | 2557 | 294%
|11)End udmabuf buffer R/W| 586 | 323 | 1188 | 903 | 104%
Signed-off-by: wangtao <tao.wangtao@honor.com>
---
drivers/dma-buf/udmabuf.c | 59 +++++++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index e74e36a8ecda..573275a51674 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -284,6 +284,64 @@ static int end_cpu_udmabuf(struct dma_buf *buf,
return 0;
}
+static ssize_t udmabuf_rw_file(struct dma_buf *dmabuf, loff_t my_pos,
+ struct file *other, loff_t pos,
+ size_t count, bool is_write)
+{
+ struct udmabuf *ubuf = dmabuf->priv;
+ loff_t my_end = my_pos + count, bv_beg, bv_end = 0;
+ pgoff_t pg_idx = my_pos / PAGE_SIZE;
+ pgoff_t pg_end = DIV_ROUND_UP(my_end, PAGE_SIZE);
+ size_t i, bv_off, bv_len, bv_num, bv_idx = 0, bv_total = 0;
+ struct bio_vec *bvec;
+ struct kiocb kiocb;
+ struct iov_iter iter;
+ unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST;
+ ssize_t ret = 0, rw_total = 0;
+ struct folio *folio;
+
+ bv_num = min_t(size_t, pg_end - pg_idx + 1, 1024);
+ bvec = kvcalloc(bv_num, sizeof(*bvec), GFP_KERNEL);
+ if (!bvec)
+ return -ENOMEM;
+
+ init_sync_kiocb(&kiocb, other);
+ kiocb.ki_pos = pos;
+
+ for (i = 0; i < ubuf->nr_pinned && my_pos < my_end; i++) {
+ folio = ubuf->pinned_folios[i];
+ bv_beg = bv_end;
+ bv_end += folio_size(folio);
+ if (bv_end <= my_pos)
+ continue;
+
+ bv_len = min(bv_end, my_end) - my_pos;
+ bv_off = my_pos - bv_beg;
+ my_pos += bv_len;
+ bv_total += bv_len;
+ bvec_set_page(&bvec[bv_idx], &folio->page, bv_len, bv_off);
+ if (++bv_idx < bv_num && my_pos < my_end)
+ continue;
+
+ /* start R/W if bvec is full or count reaches zero. */
+ iov_iter_bvec(&iter, direction, bvec, bv_idx, bv_total);
+ if (is_write)
+ ret = other->f_op->write_iter(&kiocb, &iter);
+ else
+ ret = other->f_op->read_iter(&kiocb, &iter);
+ if (ret <= 0)
+ break;
+ rw_total += ret;
+ if (ret < bv_total || fatal_signal_pending(current))
+ break;
+
+ bv_idx = bv_total = 0;
+ }
+ kvfree(bvec);
+
+ return rw_total > 0 ? rw_total : ret;
+}
+
static const struct dma_buf_ops udmabuf_ops = {
.cache_sgt_mapping = true,
.map_dma_buf = map_udmabuf,
@@ -294,6 +352,7 @@ static const struct dma_buf_ops udmabuf_ops = {
.vunmap = vunmap_udmabuf,
.begin_cpu_access = begin_cpu_udmabuf,
.end_cpu_access = end_cpu_udmabuf,
+ .rw_file = udmabuf_rw_file,
};
#define SEALS_WANTED (F_SEAL_SHRINK)
--
2.17.1
* [PATCH v3 4/4] dmabuf:system_heap Implement system_heap exporter's rw_file callback.
From: wangtao @ 2025-05-30 10:39 UTC
To: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro,
brauner, hughd, akpm, amir73il
Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack,
baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel,
linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu,
feng.han, wangtao
First verify that the system_heap exporter has exclusive access to the
dmabuf. Build a bio_vec array from the sg_table, then invoke the target
file's read_iter/write_iter callbacks for I/O.
This outperforms buffered-I/O mmap/read by 250% and beats direct I/O
udmabuf copy_file_range by over 30%, with initialization time
significantly lower than udmabuf.
Test data:
| 32x32MB Read 1024MB |Creat-ms|Close-ms| I/O-ms|I/O-MB/s| I/O%
|-------------------------|--------|--------|--------|--------|-----
| 1)Beg dmabuf buffer R/W| 47 | 5 | 1125 | 954 | 100%
| 2) udmabuf buffer R/W| 576 | 323 | 1228 | 874 | 91%
| 3) udma+memfd buffer R/W| 596 | 340 | 2166 | 495 | 51%
| 4) udma+memfd direct R/W| 570 | 338 | 711 | 1510 | 158%
| 5) udmabuf buffer c_f_r| 578 | 329 | 1128 | 952 | 99%
| 6) udmabuf direct c_f_r| 570 | 324 | 405 | 2651 | 277%
| 7) dmabuf buffer c_f_r| 47 | 5 | 1035 | 1037 | 108%
| 8) dmabuf direct c_f_r| 51 | 5 | 309 | 3480 | 364%
| 9)End dmabuf buffer R/W| 48 | 5 | 1153 | 931 | 97%
| 32x32MB Write 1024MB |Creat-ms|Close-ms| I/O-ms|I/O-MB/s| I/O%
|-------------------------|--------|--------|--------|--------|-----
| 1)Beg dmabuf buffer R/W| 50 | 5 | 1405 | 764 | 100%
| 2) udmabuf buffer R/W| 580 | 341 | 1337 | 803 | 105%
| 3) udma+memfd buffer R/W| 588 | 331 | 1820 | 590 | 77%
| 4) udma+memfd direct R/W| 585 | 333 | 662 | 1622 | 212%
| 5) udmabuf buffer c_f_r| 577 | 329 | 1326 | 810 | 106%
| 6) udmabuf direct c_f_r| 580 | 330 | 602 | 1784 | 233%
| 7) dmabuf buffer c_f_r| 49 | 5 | 1330 | 807 | 105%
| 8) dmabuf direct c_f_r| 49 | 5 | 344 | 3127 | 409%
| 9)End dmabuf buffer R/W| 50 | 5 | 1442 | 745 | 97%
Signed-off-by: wangtao <tao.wangtao@honor.com>
---
drivers/dma-buf/heaps/system_heap.c | 79 +++++++++++++++++++++++++++++
1 file changed, 79 insertions(+)
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index 26d5dc89ea16..d3a1956ebad8 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -20,6 +20,9 @@
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
+#include <linux/bvec.h>
+#include <linux/bio.h>
+#include <linux/uio.h>
static struct dma_heap *sys_heap;
@@ -281,6 +284,81 @@ static void system_heap_vunmap(struct dma_buf *dmabuf, struct iosys_map *map)
iosys_map_clear(map);
}
+static ssize_t system_heap_buffer_rw_other(struct system_heap_buffer *buffer,
+ loff_t my_pos, struct file *other, loff_t pos,
+ size_t count, bool is_write)
+{
+ struct sg_table *sgt = &buffer->sg_table;
+ struct scatterlist *sg;
+ loff_t my_end = my_pos + count, bv_beg, bv_end = 0;
+ pgoff_t pg_idx = my_pos / PAGE_SIZE;
+ pgoff_t pg_end = DIV_ROUND_UP(my_end, PAGE_SIZE);
+ size_t i, bv_off, bv_len, bv_num, bv_idx = 0, bv_total = 0;
+ struct bio_vec *bvec;
+ struct kiocb kiocb;
+ struct iov_iter iter;
+ unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST;
+ ssize_t ret = 0, rw_total = 0;
+
+ bv_num = min_t(size_t, pg_end - pg_idx + 1, 1024);
+ bvec = kvcalloc(bv_num, sizeof(*bvec), GFP_KERNEL);
+ if (!bvec)
+ return -ENOMEM;
+
+ init_sync_kiocb(&kiocb, other);
+ kiocb.ki_pos = pos;
+
+ for_each_sg(sgt->sgl, sg, sgt->nents, i) {
+ if (my_pos >= my_end)
+ break;
+ bv_beg = bv_end;
+ bv_end += sg->length;
+ if (bv_end <= my_pos)
+ continue;
+
+ bv_len = min(bv_end, my_end) - my_pos;
+ bv_off = sg->offset + my_pos - bv_beg;
+ my_pos += bv_len;
+ bv_total += bv_len;
+ bvec_set_page(&bvec[bv_idx], sg_page(sg), bv_len, bv_off);
+ if (++bv_idx < bv_num && my_pos < my_end)
+ continue;
+
+ /* start R/W if bvec is full or count reaches zero. */
+ iov_iter_bvec(&iter, direction, bvec, bv_idx, bv_total);
+ if (is_write)
+ ret = other->f_op->write_iter(&kiocb, &iter);
+ else
+ ret = other->f_op->read_iter(&kiocb, &iter);
+ if (ret <= 0)
+ break;
+ rw_total += ret;
+ if (ret < bv_total || fatal_signal_pending(current))
+ break;
+
+ bv_idx = bv_total = 0;
+ }
+ kvfree(bvec);
+
+ return rw_total > 0 ? rw_total : ret;
+}
+
+static ssize_t system_heap_dma_buf_rw_file(struct dma_buf *dmabuf,
+ loff_t my_pos, struct file *file, loff_t pos,
+ size_t count, bool is_write)
+{
+ struct system_heap_buffer *buffer = dmabuf->priv;
+ ssize_t ret = -EBUSY;
+
+ mutex_lock(&buffer->lock);
+ if (list_empty(&buffer->attachments) && !buffer->vmap_cnt)
+ ret = system_heap_buffer_rw_other(buffer, my_pos,
+ file, pos, count, is_write);
+ mutex_unlock(&buffer->lock);
+
+ return ret;
+}
+
static void system_heap_dma_buf_release(struct dma_buf *dmabuf)
{
struct system_heap_buffer *buffer = dmabuf->priv;
@@ -308,6 +386,7 @@ static const struct dma_buf_ops system_heap_buf_ops = {
.mmap = system_heap_mmap,
.vmap = system_heap_vmap,
.vunmap = system_heap_vunmap,
+ .rw_file = system_heap_dma_buf_rw_file,
.release = system_heap_dma_buf_release,
};
--
2.17.1
* Re: [PATCH v3 1/4] fs: allow cross-FS copy_file_range for memory-backed files
From: Amir Goldstein @ 2025-05-30 13:43 UTC
To: wangtao
Cc: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro,
brauner, hughd, akpm, benjamin.gaignard, Brian.Starkey, jstultz,
tjmercier, jack, baolin.wang, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm,
bintian.wang, yipengxiang, liulu.liu, feng.han
On Fri, May 30, 2025 at 12:40 PM wangtao <tao.wangtao@honor.com> wrote:
>
> Memory-backed files can optimize copy performance via
> copy_file_range callbacks. Compared to mmap&read: reduces
> GUP (get_user_pages) overhead; vs sendfile/splice: eliminates
> one memory copy; supports dmabuf zero-copy implementation.
>
> Signed-off-by: wangtao <tao.wangtao@honor.com>
> ---
Hi wangtao,
Let me rephrase my reaction gently:
Regardless of the proposed API extension, and referring to the
implementation itself -
I wrote the current code and I can barely understand the logic every time
I come back to read it.
Your changes make it too messy to be reviewed/maintained.
I have a few suggestions for simplifications (see below).
The complication arises from the fact that normally the destination fops
are used to call the ->copy_file_range() method, and this case deviates
from that standard; we need to make the code simpler by using a helper
to choose the fops to use.
> fs/read_write.c | 71 +++++++++++++++++++++++++++++++++-------------
> include/linux/fs.h | 2 ++
> 2 files changed, 54 insertions(+), 19 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index bb0ed26a0b3a..591c6db7b785 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1469,6 +1469,20 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
> }
> #endif
>
> +static inline bool is_copy_memory_file_to_file(struct file *file_in,
> + struct file *file_out)
> +{
> + return (file_in->f_op->fop_flags & FOP_MEMORY_FILE) &&
> + file_in->f_op->copy_file_range && file_out->f_op->write_iter;
> +}
> +
> +static inline bool is_copy_file_to_memory_file(struct file *file_in,
> + struct file *file_out)
> +{
> + return (file_out->f_op->fop_flags & FOP_MEMORY_FILE) &&
> + file_in->f_op->read_iter && file_out->f_op->copy_file_range;
> +}
> +
Please combine that into a helper:

const struct file_operations *memory_copy_file_ops(struct file *file_in,
						   struct file *file_out)

which returns either file_in->f_op, file_out->f_op, or NULL.
> /*
> * Performs necessary checks before doing a file copy
> *
> @@ -1484,11 +1498,23 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> struct inode *inode_out = file_inode(file_out);
> uint64_t count = *req_count;
> loff_t size_in;
> + bool splice = flags & COPY_FILE_SPLICE;
> + bool has_memory_file;
> int ret;
>
> - ret = generic_file_rw_checks(file_in, file_out);
> - if (ret)
> - return ret;
> + /* Skip generic checks, allow cross-sb copies for dma-buf/tmpfs */
> + has_memory_file = is_copy_memory_file_to_file(file_in, file_out) ||
> + is_copy_file_to_memory_file(file_in, file_out);
> + if (!splice && has_memory_file) {
> + if (!(file_in->f_mode & FMODE_READ) ||
> + !(file_out->f_mode & FMODE_WRITE) ||
> + (file_out->f_flags & O_APPEND))
> + return -EBADF;
I don't like that this both duplicates code and does not perform all the
checks from generic_file_rw_checks() for the non-memory file side.
I do not currently have a suggestion for how to make this less messy,
more human-readable, and correct.
> + } else {
> + ret = generic_file_rw_checks(file_in, file_out);
> + if (ret)
> + return ret;
> + }
>
> /*
> * We allow some filesystems to handle cross sb copy, but passing
> @@ -1500,7 +1526,7 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> * and several different sets of file_operations, but they all end up
> * using the same ->copy_file_range() function pointer.
> */
> - if (flags & COPY_FILE_SPLICE) {
> + if (splice || has_memory_file) {
> /* cross sb splice is allowed */
This comment is not accurate - it should say "cross fs-type", because it skips
the test that compares the ->copy_file_range pointers, not the sb.
If you are making changes to this function, this should be clarified.
> } else if (file_out->f_op->copy_file_range) {
> if (file_in->f_op->copy_file_range !=
> @@ -1581,23 +1607,30 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> * same sb using clone, but for filesystems where both clone and copy
> * are supported (e.g. nfs,cifs), we only call the copy method.
> */
> - if (!splice && file_out->f_op->copy_file_range) {
> - ret = file_out->f_op->copy_file_range(file_in, pos_in,
> - file_out, pos_out,
> - len, flags);
> - } else if (!splice && file_in->f_op->remap_file_range && samesb) {
> - ret = file_in->f_op->remap_file_range(file_in, pos_in,
> - file_out, pos_out,
> - min_t(loff_t, MAX_RW_COUNT, len),
> - REMAP_FILE_CAN_SHORTEN);
> - /* fallback to splice */
> - if (ret <= 0)
> + if (!splice) {
> + if (is_copy_memory_file_to_file(file_in, file_out)) {
> + ret = file_in->f_op->copy_file_range(file_in, pos_in,
> + file_out, pos_out, len, flags);
> + } else if (is_copy_file_to_memory_file(file_in, file_out)) {
> + ret = file_out->f_op->copy_file_range(file_in, pos_in,
> + file_out, pos_out, len, flags);
> + } else if (file_out->f_op->copy_file_range) {
> + ret = file_out->f_op->copy_file_range(file_in, pos_in,
> + file_out, pos_out,
> + len, flags);
> + } else if (file_in->f_op->remap_file_range && samesb) {
> + ret = file_in->f_op->remap_file_range(file_in, pos_in,
> + file_out, pos_out,
> + min_t(loff_t, MAX_RW_COUNT, len),
> + REMAP_FILE_CAN_SHORTEN);
> + /* fallback to splice */
> + if (ret <= 0)
> + splice = true;
> + } else if (samesb) {
> + /* Fallback to splice for same sb copy for backward compat */
> splice = true;
> - } else if (samesb) {
> - /* Fallback to splice for same sb copy for backward compat */
> - splice = true;
> + }
This is the way-too-ugly-to-live part.
It was pretty bad before and now it is unacceptable:
way too much unneeded nesting, and too hard to follow.
First of all, please use a splice label and goto splice (before the comment)
instead of adding another nesting level.
I would do that even if not adding memory fd copy support.
Second, please store the result of mem_fops = memory_copy_file_ops()
and start with:

+	if (mem_fops) {
+		ret = mem_fops->copy_file_range(file_in, pos_in,
+						file_out, pos_out,
+						len, flags);
+	} else if (file_out->f_op->copy_file_range) { ...

and update the comment above to say that:

+ * For copy to/from memory fd, we always call the copy method of the memory fd.
 */
I think that makes the code not more ugly than it already is.
Thanks,
Amir.
* Re: [PATCH v3 3/4] udmabuf: Implement udmabuf rw_file callback
From: kernel test robot @ 2025-05-30 14:24 UTC
To: wangtao, sumit.semwal, christian.koenig, kraxel, vivek.kasireddy,
viro, brauner, hughd, akpm, amir73il
Cc: oe-kbuild-all, benjamin.gaignard, Brian.Starkey, jstultz,
tjmercier, jack, baolin.wang, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm,
bintian.wang, yipengxiang, liulu.liu, feng.han, wangtao
Hi wangtao,
kernel test robot noticed the following build errors:
[auto build test ERROR on brauner-vfs/vfs.all]
[also build test ERROR on next-20250530]
[cannot apply to linus/master v6.15]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/wangtao/fs-allow-cross-FS-copy_file_range-for-memory-backed-files/20250530-184146
base: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.all
patch link: https://lore.kernel.org/r/20250530103941.11092-4-tao.wangtao%40honor.com
patch subject: [PATCH v3 3/4] udmabuf: Implement udmabuf rw_file callback
config: sparc64-randconfig-002-20250530 (https://download.01.org/0day-ci/archive/20250530/202505302235.mDzENMSm-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250530/202505302235.mDzENMSm-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505302235.mDzENMSm-lkp@intel.com/
All error/warnings (new ones prefixed by >>):
drivers/dma-buf/udmabuf.c: In function 'udmabuf_rw_file':
>> drivers/dma-buf/udmabuf.c:298:25: error: storage size of 'iter' isn't known
298 | struct iov_iter iter;
| ^~~~
>> drivers/dma-buf/udmabuf.c:299:45: error: 'ITER_SOURCE' undeclared (first use in this function)
299 | unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST;
| ^~~~~~~~~~~
drivers/dma-buf/udmabuf.c:299:45: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/dma-buf/udmabuf.c:299:59: error: 'ITER_DEST' undeclared (first use in this function)
299 | unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST;
| ^~~~~~~~~
>> drivers/dma-buf/udmabuf.c:327:17: error: implicit declaration of function 'iov_iter_bvec'; did you mean 'bvec_iter_bvec'? [-Wimplicit-function-declaration]
327 | iov_iter_bvec(&iter, direction, bvec, bv_idx, bv_total);
| ^~~~~~~~~~~~~
| bvec_iter_bvec
>> drivers/dma-buf/udmabuf.c:298:25: warning: unused variable 'iter' [-Wunused-variable]
298 | struct iov_iter iter;
| ^~~~
vim +298 drivers/dma-buf/udmabuf.c
286
287 static ssize_t udmabuf_rw_file(struct dma_buf *dmabuf, loff_t my_pos,
288 struct file *other, loff_t pos,
289 size_t count, bool is_write)
290 {
291 struct udmabuf *ubuf = dmabuf->priv;
292 loff_t my_end = my_pos + count, bv_beg, bv_end = 0;
293 pgoff_t pg_idx = my_pos / PAGE_SIZE;
294 pgoff_t pg_end = DIV_ROUND_UP(my_end, PAGE_SIZE);
295 size_t i, bv_off, bv_len, bv_num, bv_idx = 0, bv_total = 0;
296 struct bio_vec *bvec;
297 struct kiocb kiocb;
> 298 struct iov_iter iter;
> 299 unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST;
300 ssize_t ret = 0, rw_total = 0;
301 struct folio *folio;
302
303 bv_num = min_t(size_t, pg_end - pg_idx + 1, 1024);
304 bvec = kvcalloc(bv_num, sizeof(*bvec), GFP_KERNEL);
305 if (!bvec)
306 return -ENOMEM;
307
308 init_sync_kiocb(&kiocb, other);
309 kiocb.ki_pos = pos;
310
311 for (i = 0; i < ubuf->nr_pinned && my_pos < my_end; i++) {
312 folio = ubuf->pinned_folios[i];
313 bv_beg = bv_end;
314 bv_end += folio_size(folio);
315 if (bv_end <= my_pos)
316 continue;
317
318 bv_len = min(bv_end, my_end) - my_pos;
319 bv_off = my_pos - bv_beg;
320 my_pos += bv_len;
321 bv_total += bv_len;
322 bvec_set_page(&bvec[bv_idx], &folio->page, bv_len, bv_off);
323 if (++bv_idx < bv_num && my_pos < my_end)
324 continue;
325
326 /* start R/W if bvec is full or count reaches zero. */
> 327 iov_iter_bvec(&iter, direction, bvec, bv_idx, bv_total);
328 if (is_write)
329 ret = other->f_op->write_iter(&kiocb, &iter);
330 else
331 ret = other->f_op->read_iter(&kiocb, &iter);
332 if (ret <= 0)
333 break;
334 rw_total += ret;
335 if (ret < bv_total || fatal_signal_pending(current))
336 break;
337
338 bv_idx = bv_total = 0;
339 }
340 kvfree(bvec);
341
342 return rw_total > 0 ? rw_total : ret;
343 }
344
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* RE: [PATCH v3 1/4] fs: allow cross-FS copy_file_range for memory-backed files
From: wangtao @ 2025-06-03 1:28 UTC
To: Amir Goldstein
Cc: sumit.semwal@linaro.org, christian.koenig@amd.com,
kraxel@redhat.com, vivek.kasireddy@intel.com,
viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com,
akpm@linux-foundation.org, benjamin.gaignard@collabora.com,
Brian.Starkey@arm.com, jstultz@google.com, tjmercier@google.com,
jack@suse.cz, baolin.wang@linux.alibaba.com,
linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org,
linaro-mm-sig@lists.linaro.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
wangbintian(BintianWang), yipengxiang, liulu 00013167,
hanfeng 00012985
>
> On Fri, May 30, 2025 at 12:40 PM wangtao <tao.wangtao@honor.com> wrote:
> >
> > Memory-backed files can optimize copy performance via copy_file_range
> > callbacks. Compared to mmap&read: reduces GUP (get_user_pages)
> > overhead; vs sendfile/splice: eliminates one memory copy; supports
> > dmabuf zero-copy implementation.
> >
> > Signed-off-by: wangtao <tao.wangtao@honor.com>
> > ---
>
> Hi wangtao,
>
> Let me rephrase my reaction gently:
> Regardless of the proposed API extension, and referring to the
> implementation itself - I wrote the current code and I can barely understand
> the logic every time I come back to read it.
>
> Your changes make it too messy to be reviewed/maintained.
> I have a few suggestions for simplifications (see below).
>
> The complication arise from the fact that normally the destination fops are
> used to call the copy_range() method and this case is a deviation from this
> standard, we need to make the cope simpler using a helper to choose the
> fops to use.
>
> > fs/read_write.c | 71 +++++++++++++++++++++++++++++++++-------------
> > include/linux/fs.h | 2 ++
> > 2 files changed, 54 insertions(+), 19 deletions(-)
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index bb0ed26a0b3a..591c6db7b785 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1469,6 +1469,20 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
> > }
> > #endif
> >
> > +static inline bool is_copy_memory_file_to_file(struct file *file_in,
> > + struct file *file_out)
> > +{
> > + return (file_in->f_op->fop_flags & FOP_MEMORY_FILE) &&
> > + file_in->f_op->copy_file_range && file_out->f_op->write_iter;
> > +}
> > +
> > +static inline bool is_copy_file_to_memory_file(struct file *file_in,
> > + struct file *file_out)
> > +{
> > + return (file_out->f_op->fop_flags & FOP_MEMORY_FILE) &&
> > + file_in->f_op->read_iter && file_out->f_op->copy_file_range;
> > +}
> > +
>
> Please combine that to a helper:
> const struct file_operations *memory_copy_file_ops(struct file *file_in,
> struct file *file_out) which returns either file_in->f_op, file_out->f_op or
> NULL.
>
Great suggestions! I'll update them in the new patch version soon.
>
> > /*
> > * Performs necessary checks before doing a file copy
> > *
> > @@ -1484,11 +1498,23 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> > struct inode *inode_out = file_inode(file_out);
> > uint64_t count = *req_count;
> > loff_t size_in;
> > + bool splice = flags & COPY_FILE_SPLICE;
> > + bool has_memory_file;
> > int ret;
> >
> > - ret = generic_file_rw_checks(file_in, file_out);
> > - if (ret)
> > - return ret;
> > + /* Skip generic checks, allow cross-sb copies for dma-buf/tmpfs */
> > + has_memory_file = is_copy_memory_file_to_file(file_in, file_out) ||
> > + is_copy_file_to_memory_file(file_in, file_out);
> > + if (!splice && has_memory_file) {
> > + if (!(file_in->f_mode & FMODE_READ) ||
> > + !(file_out->f_mode & FMODE_WRITE) ||
> > + (file_out->f_flags & O_APPEND))
> > + return -EBADF;
>
> I don't like that this both duplicates code and does not check all the checks in
> generic_file_rw_checks() for the non-memory file side.
> I do not currently have a suggestion how to make this less messy, more
> human readable and correct.
Since dmabuf files aren't regular files, we'll skip generic_file_rw_checks().
Adding an essential_file_rw_checks() helper could reduce code duplication.
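A possible shape for that helper, reusing the checks already in this patch
(illustrative only; which of the generic checks to keep is still open):

static int essential_file_rw_checks(struct file *file_in, struct file *file_out)
{
	/* the subset of generic_file_rw_checks() that still applies to
	 * non-regular files such as dmabuf */
	if (!(file_in->f_mode & FMODE_READ) ||
	    !(file_out->f_mode & FMODE_WRITE) ||
	    (file_out->f_flags & O_APPEND))
		return -EBADF;
	return 0;
}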
>
> > + } else {
> > + ret = generic_file_rw_checks(file_in, file_out);
> > + if (ret)
> > + return ret;
> > + }
> >
> > /*
> > * We allow some filesystems to handle cross sb copy, but passing
> > @@ -1500,7 +1526,7 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> > * and several different sets of file_operations, but they all end up
> > * using the same ->copy_file_range() function pointer.
> > */
> > - if (flags & COPY_FILE_SPLICE) {
> > + if (splice || has_memory_file) {
> > /* cross sb splice is allowed */
>
> This comment is not accurate - it should say "cross fs-type", because it skips
> the test that compares the ->copy_file_range pointers, not the sb.
> If you are making changes to this function, this should be clarified.
>
Will fix this shortly.
> > } else if (file_out->f_op->copy_file_range) {
> > if (file_in->f_op->copy_file_range !=
> > @@ -1581,23 +1607,30 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > * same sb using clone, but for filesystems where both clone and copy
> > * are supported (e.g. nfs,cifs), we only call the copy method.
> > */
> > - if (!splice && file_out->f_op->copy_file_range) {
> > - ret = file_out->f_op->copy_file_range(file_in, pos_in,
> > - file_out, pos_out,
> > - len, flags);
> > - } else if (!splice && file_in->f_op->remap_file_range && samesb) {
> > - ret = file_in->f_op->remap_file_range(file_in, pos_in,
> > - file_out, pos_out,
> > - min_t(loff_t, MAX_RW_COUNT, len),
> > - REMAP_FILE_CAN_SHORTEN);
> > - /* fallback to splice */
> > - if (ret <= 0)
> > + if (!splice) {
> > + if (is_copy_memory_file_to_file(file_in, file_out)) {
> > + ret = file_in->f_op->copy_file_range(file_in, pos_in,
> > + file_out, pos_out, len, flags);
> > + } else if (is_copy_file_to_memory_file(file_in, file_out)) {
> > + ret = file_out->f_op->copy_file_range(file_in, pos_in,
> > + file_out, pos_out, len, flags);
> > + } else if (file_out->f_op->copy_file_range) {
> > + ret = file_out->f_op->copy_file_range(file_in, pos_in,
> > + file_out, pos_out,
> > + len, flags);
> > + } else if (file_in->f_op->remap_file_range && samesb) {
> > + ret = file_in->f_op->remap_file_range(file_in, pos_in,
> > + file_out, pos_out,
> > + min_t(loff_t, MAX_RW_COUNT, len),
> > + REMAP_FILE_CAN_SHORTEN);
> > + /* fallback to splice */
> > + if (ret <= 0)
> > + splice = true;
> > + } else if (samesb) {
> > + /* Fallback to splice for same sb copy for backward compat */
> > splice = true;
> > - } else if (samesb) {
> > - /* Fallback to splice for same sb copy for backward compat */
> > - splice = true;
> > + }
>
> This is the way-too-ugly-to-live part.
> Was pretty bad before and now it is unacceptable.
> way too much unneeded nesting and too hard to follow.
>
> First of all, please use splice label and call goto splice (before the comment)
> instead of adding another nesting level.
> I would do that even if not adding memory fd copy support.
>
> Second, please store the result of mem_fops = memory_copy_file_ops()
> and start with:
>
> +	if (mem_fops) {
> +		ret = mem_fops->copy_file_range(file_in, pos_in,
> +						file_out, pos_out,
> +						len, flags);
> +	} else if (file_out->f_op->copy_file_range) { ...
>
> and update the comment above to say that:
>
> + * For copy to/from memory fd, we always call the copy method of the memory fd.
>  */
>
> I think that makes the code not more ugly than it already is.
>
> Thanks,
> Amir.
Good advice - thanks a lot!
Regards,
Wangtao.
* RE: [PATCH v3 3/4] udmabuf: Implement udmabuf rw_file callback
From: wangtao @ 2025-06-03 1:32 UTC
To: kernel test robot, sumit.semwal@linaro.org,
christian.koenig@amd.com, kraxel@redhat.com,
vivek.kasireddy@intel.com, viro@zeniv.linux.org.uk,
brauner@kernel.org, hughd@google.com, akpm@linux-foundation.org,
amir73il@gmail.com
Cc: oe-kbuild-all@lists.linux.dev, benjamin.gaignard@collabora.com,
Brian.Starkey@arm.com, jstultz@google.com, tjmercier@google.com,
jack@suse.cz, baolin.wang@linux.alibaba.com,
linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org,
linaro-mm-sig@lists.linaro.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
wangbintian(BintianWang), yipengxiang, liulu 00013167,
hanfeng 00012985
> Hi wangtao,
>
> kernel test robot noticed the following build errors:
>
> All error/warnings (new ones prefixed by >>):
>
Quick note: I don't have a local sparc64 build setup, so I'll explicitly
add the missing header dependencies to be safe.
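The undeclared symbols all come from the iov_iter API, so the likely fix
(my reading of the robot log, not a tested patch) is a direct include:

/* struct iov_iter, ITER_SOURCE/ITER_DEST and iov_iter_bvec() are all
 * declared here; other configs only pulled it in indirectly. */
#include <linux/uio.h>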
Regards,
Wangtao.
> drivers/dma-buf/udmabuf.c: In function 'udmabuf_rw_file':
> >> drivers/dma-buf/udmabuf.c:298:25: error: storage size of 'iter' isn't
> >> known
> 298 | struct iov_iter iter;
> | ^~~~
> >> drivers/dma-buf/udmabuf.c:299:45: error: 'ITER_SOURCE' undeclared
> >> (first use in this function)
> 299 | unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST;
> | ^~~~~~~~~~~
> drivers/dma-buf/udmabuf.c:299:45: note: each undeclared identifier is
> reported only once for each function it appears in
> >> drivers/dma-buf/udmabuf.c:299:59: error: 'ITER_DEST' undeclared
> >> (first use in this function)
> 299 | unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST;
> | ^~~~~~~~~
> >> drivers/dma-buf/udmabuf.c:327:17: error: implicit declaration of
> >> function 'iov_iter_bvec'; did you mean 'bvec_iter_bvec'?
> >> [-Wimplicit-function-declaration]
> 327 | iov_iter_bvec(&iter, direction, bvec, bv_idx, bv_total);
> | ^~~~~~~~~~~~~
> | bvec_iter_bvec
> >> drivers/dma-buf/udmabuf.c:298:25: warning: unused variable 'iter'
> >> [-Wunused-variable]
> 298 | struct iov_iter iter;
> | ^~~~