* [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
@ 2025-06-03  9:52 wangtao
  2025-06-03  9:52 ` [PATCH v4 1/4] fs: allow cross-FS copy_file_range for memory file with direct I/O wangtao
  ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: wangtao @ 2025-06-03  9:52 UTC (permalink / raw)
  To: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro,
	brauner, hughd, akpm, amir73il
  Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack,
	baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel,
	linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu,
	feng.han, wangtao

Main steps to load file data into a dmabuf:
1. dmabuf_fd = dmabuf_alloc(len, heap_fd)
2. vaddr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, dmabuf_fd, 0)
3. file_fd = open(file_path, O_RDONLY)
4. read(file_fd, vaddr, len)

dmabuf's attachment/map/fence model sets VM_PFNMAP for mmap, which lacks
direct I/O support [1]. Buffered I/O adds latency when loading large AI
model files. My previous patch added a dmabuf ioctl for direct I/O file
operations, showing good performance on low-power CPUs [2][3]. Christian
suggested using existing uAPIs (read/sendfile/splice/copy_file_range)
instead of new ioctls.

sendfile/splice/copy_file_range (c_f_r) enable zero-copy via direct I/O
for disk-to-disk/network transfers:
sendfile(skt_fd, disk_fd):    [DISK] -DMA-> [pipe(buf)] -DMA-> [NIC]
sendfile(dst_disk, src_disk): [DISK] -DMA-> [pipe(buf)] -DMA-> [DISK]

Analysis shows existing uAPIs can't achieve zero-copy disk-to-dmabuf.
Since dmabuf lacks file ops, tmpfs is used for the disk-to-tmpfs CPU
analysis below:

| Method             | CPU Copies | Key Overhead               |
|--------------------|------------|----------------------------|
| 1. Buffer R+W      | 2          | Alloc(cache) & 2 CPU copies|
| 2. Direct R+W      | 1          | GUP(usr_buf) & 1 CPU copy  |
| 3. Mmap+Buffer R   | 1          | Alloc(cache) & 1 CPU copy  |
| 4. Mmap+Direct R   | 0          | GUP(mem_buf) ~50% CPU      |
| 5. Buffer Sendfile | 1          | Alloc(cache) & 1 CPU copy  |
| 6. Direct Sendfile | 1          | Small pipe, high IRQ       |
| 7. Buffer Splice   | 1          | Alloc(cache) & 1 CPU copy  |
| 8. Direct Splice   | 1          | Larger pipe buffer         |
| 9. c_f_r           | N/A        | Cross-FS blocked           |
GUP: get_user_pages
Alloc(cache): allocate page cache

Data flows:
1. [DISK] -DMA-> [Alloc(cache)] -COPY-> [usr_buf] -COPY-> [MEM]
2. [DISK] -DMA-> [GUP(usr_buf)] -COPY-> [MEM]
3. [DISK] -DMA-> [Alloc(cache)] -COPY-> [mem_buf]
4. [DISK] -DMA-> [GUP(mem_buf)]
5. [DISK] -DMA-> [pipe(Alloc(cache))] -COPY-> [tmpfs page]
6. [DISK] -DMA-> [pipe(buf)] -COPY-> [tmpfs page]
7. [DISK] -DMA-> [big_pipe(Alloc(cache))] -COPY-> [tmpfs page]
8. [DISK] -DMA-> [big_pipe(buf)] -COPY-> [tmpfs page]
9. [DISK] -DMA-> [tmpfs page] (blocked)

Key findings:
- Buffered I/O requires page-cache allocation and at least one CPU copy.
- Read+Write incurs excessive CPU copies and is not analyzed further;
  later comparisons use Read instead of mmap+Read.
- Mmap+Direct has zero copies but ~50% GUP overhead, and dmabuf does not
  support it.
- sendfile/splice require an intermediate pipe, needing 1 CPU copy.
- c_f_r limitations: cross-FS copies are blocked, and memory-FS support
  is missing.

Modifications:
1. Enable cross-FS c_f_r for memory file types.
2. Add dmabuf c_f_r callbacks for [DISK] -DMA-> [dmabuf].
3. Test tmpfs c_f_r locally only (not upstream) due to lock_page
   deadlock risks.

Performance (1GHz CPU, UFS4@4GB):
1. tmpfs(memfd) direct c_f_r (1197 MB/s): +15% vs mmap&read (1014 MB/s)
2. udmabuf+memfd (2318 MB/s): +50% vs mmap&read (1457 MB/s)
3.
dmabuf direct c_f_r (3405 MB/s): 260% faster than buffer IO (918 MB/s),
   40% faster than udmabuf (2318 MB/s)

| 32x32MB Read 1024MB     |Creat-ms|Close-ms| I/O-ms |I/O-MB/s| I/O% |
|-------------------------|--------|--------|--------|--------|------|
| 1)Beg dmabuf buffer R/W |   52   |    5   |  1170  |   918  | 100% |
| 2) udmabuf buffer R/W   |  591   |  326   |  1281  |   838  |  91% |
| 3) memfd buffer R/W     |    1   |  323   |  2370  |   453  |  49% |
| 4) memfd direct R/W     |    1   |  321   |  1058  |  1014  | 110% |
| 5) memfd buffer sendfile|    1   |  329   |  1577  |   681  |  74% |
| 6) memfd direct sendfile|    1   |  327   |  2672  |   401  |  43% |
| 7) memfd buffer splice  |    2   |  321   |  1729  |   621  |  67% |
| 8) memfd direct splice  |    2   |  324   |  1528  |   702  |  76% |
| 9) memfd buffer c_f_r   |    1   |  325   |  1586  |   677  |  73% |
|10) memfd direct c_f_r   |    1   |  323   |   897  |  1197  | 130% |
|11) u+mfd buffer R/W     |  609   |  344   |  2207  |   486  |  52% |
|12) u+mfd direct R/W     |  580   |  342   |   737  |  1457  | 158% |
|13) u+mfd buffer sendfile|  582   |  343   |  1270  |   845  |  92% |
|14) u+mfd direct sendfile|  573   |  344   |  2254  |   476  |  51% |
|15) u+mfd buffer splice  |  584   |  341   |  1202  |   893  |  97% |
|16) u+mfd direct splice  |  564   |  340   |   851  |  1263  | 137% |
|17) u+mfd buffer c_f_r   |  585   |  344   |  1244  |   863  |  94% |
|18) u+mfd direct c_f_r   |  578   |  341   |   581  |  1848  | 201% |
|19) udmabuf buffer c_f_r |  585   |  328   |  1163  |   923  | 100% |
|20) udmabuf direct c_f_r |  579   |  328   |   464  |  2318  | 252% |
|21) dmabuf buffer c_f_r  |   48   |    5   |  1058  |  1015  | 110% |
|22) dmabuf direct c_f_r  |   48   |    5   |   316  |  3405  | 370% |
|23)End dmabuf buffer R/W |   48   |    5   |  1173  |   915  |  99% |

u+mfd = udma+memfd = udmabuf + pre-allocated memfd combo.
Cache cleared during tests to simulate real-world large file loading.
dmabuf file use cases:
- Loading large AI models using dmabuf
- Real-time data capture and storage with dmabuf
- Persisting task snapshots in Android

v3 -> v4:
  Add memory_copy_file_fops to simplify code and add FMODE_ODIRECT check.
  Explicitly add dependency headers for udmabuf.
  Simplify rw_file implementation in udmabuf/system_heap.
  Set FMODE_ODIRECT for dmabuf supporting direct I/O.
v2 -> v3: [4]
  copy_file_range supports copying from disk files to memory files.
  Implement the copy_file_range callback functions for dmabuf/udmabuf.
v1 -> v2: [3]
  Dma-buf exporter verifies exclusive access to the dmabuf's sgtable.
v1: [2]

Reference:
[1] https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com
[2] https://lore.kernel.org/all/20250513092803.2096-1-tao.wangtao@honor.com
[3] https://lore.kernel.org/all/20250516092148.12778-1-tao.wangtao@honor.com
[4] https://lore.kernel.org/all/20250530103941.11092-1-tao.wangtao@honor.com

wangtao (4):
  fs: allow cross-FS copy_file_range for memory file with direct I/O
  dmabuf: Implement copy_file_range callback for dmabuf direct I/O prep
  udmabuf: Implement udmabuf direct I/O
  dmabuf: system_heap: Implement system_heap dmabuf direct I/O

 drivers/dma-buf/dma-buf.c           | 32 +++++++++++++
 drivers/dma-buf/heaps/system_heap.c | 69 +++++++++++++++++++++++++++++
 drivers/dma-buf/udmabuf.c           | 54 ++++++++++++++++++++++
 fs/read_write.c                     | 64 +++++++++++++++++++++-----
 include/linux/dma-buf.h             | 16 +++++++
 include/linux/fs.h                  |  2 +
 6 files changed, 225 insertions(+), 12 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 30+ messages in thread
* [PATCH v4 1/4] fs: allow cross-FS copy_file_range for memory file with direct I/O
  2025-06-03  9:52 [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range wangtao
@ 2025-06-03  9:52 ` wangtao
  2025-06-03 10:56   ` Amir Goldstein
  2025-06-03  9:52 ` [PATCH v4 2/4] dmabuf: Implement copy_file_range callback for dmabuf direct I/O prep wangtao
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: wangtao @ 2025-06-03  9:52 UTC (permalink / raw)
  To: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro,
	brauner, hughd, akpm, amir73il
  Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack,
	baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel,
	linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu,
	feng.han, wangtao

Memory files can optimize copy performance via copy_file_range callbacks:
- Compared to mmap&read: reduces GUP (get_user_pages) overhead
- Compared to sendfile/splice: eliminates one memory copy
- Supports dma-buf direct I/O zero-copy implementation

Suggested-by: Christian König <christian.koenig@amd.com>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: wangtao <tao.wangtao@honor.com>
---
 fs/read_write.c    | 64 +++++++++++++++++++++++++++++++++++++---------
 include/linux/fs.h |  2 ++
 2 files changed, 54 insertions(+), 12 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index bb0ed26a0b3a..ecb4f753c632 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1469,6 +1469,31 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 }
 #endif
 
+static const struct file_operations *memory_copy_file_ops(
+		struct file *file_in, struct file *file_out)
+{
+	if ((file_in->f_op->fop_flags & FOP_MEMORY_FILE) &&
+	    (file_in->f_mode & FMODE_CAN_ODIRECT) &&
+	    file_in->f_op->copy_file_range && file_out->f_op->write_iter)
+		return file_in->f_op;
+	else if ((file_out->f_op->fop_flags & FOP_MEMORY_FILE) &&
+		 (file_out->f_mode & FMODE_CAN_ODIRECT) &&
+		 file_in->f_op->read_iter && file_out->f_op->copy_file_range)
+		return file_out->f_op;
+	else
+		return NULL;
+}
+
+static int essential_file_rw_checks(struct file *file_in, struct file *file_out)
+{
+	if (!(file_in->f_mode & FMODE_READ) ||
+	    !(file_out->f_mode & FMODE_WRITE) ||
+	    (file_out->f_flags & O_APPEND))
+		return -EBADF;
+
+	return 0;
+}
+
 /*
  * Performs necessary checks before doing a file copy
  *
@@ -1484,9 +1509,16 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
 	struct inode *inode_out = file_inode(file_out);
 	uint64_t count = *req_count;
 	loff_t size_in;
+	bool splice = flags & COPY_FILE_SPLICE;
+	const struct file_operations *mem_fops;
 	int ret;
 
-	ret = generic_file_rw_checks(file_in, file_out);
+	/* The dma-buf file is not a regular file. */
+	mem_fops = memory_copy_file_ops(file_in, file_out);
+	if (splice || mem_fops == NULL)
+		ret = generic_file_rw_checks(file_in, file_out);
+	else
+		ret = essential_file_rw_checks(file_in, file_out);
 	if (ret)
 		return ret;
 
@@ -1500,8 +1532,10 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
 	 * and several different sets of file_operations, but they all end up
 	 * using the same ->copy_file_range() function pointer.
 	 */
-	if (flags & COPY_FILE_SPLICE) {
+	if (splice) {
 		/* cross sb splice is allowed */
+	} else if (mem_fops != NULL) {
+		/* cross-fs copy is allowed for memory file. */
 	} else if (file_out->f_op->copy_file_range) {
 		if (file_in->f_op->copy_file_range !=
 		    file_out->f_op->copy_file_range)
@@ -1554,6 +1588,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	ssize_t ret;
 	bool splice = flags & COPY_FILE_SPLICE;
 	bool samesb = file_inode(file_in)->i_sb == file_inode(file_out)->i_sb;
+	const struct file_operations *mem_fops;
 
 	if (flags & ~COPY_FILE_SPLICE)
 		return -EINVAL;
@@ -1574,18 +1609,27 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (len == 0)
 		return 0;
 
+	if (splice)
+		goto do_splice;
+
 	file_start_write(file_out);
 
 	/*
 	 * Cloning is supported by more file systems, so we implement copy on
 	 * same sb using clone, but for filesystems where both clone and copy
 	 * are supported (e.g. nfs,cifs), we only call the copy method.
+	 * For copy to/from memory file, we always call the copy method of the
+	 * memory file.
 	 */
-	if (!splice && file_out->f_op->copy_file_range) {
+	mem_fops = memory_copy_file_ops(file_in, file_out);
+	if (mem_fops) {
+		ret = mem_fops->copy_file_range(file_in, pos_in,
+						file_out, pos_out, len, flags);
+	} else if (file_out->f_op->copy_file_range) {
 		ret = file_out->f_op->copy_file_range(file_in, pos_in,
-						      file_out, pos_out,
-						      len, flags);
-	} else if (!splice && file_in->f_op->remap_file_range && samesb) {
+						      file_out, pos_out,
+						      len, flags);
+	} else if (file_in->f_op->remap_file_range && samesb) {
 		ret = file_in->f_op->remap_file_range(file_in, pos_in,
 				file_out, pos_out,
 				min_t(loff_t, MAX_RW_COUNT, len),
@@ -1603,6 +1647,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (!splice)
 		goto done;
 
+do_splice:
 	/*
 	 * We can get here for same sb copy of filesystems that do not implement
 	 * ->copy_file_range() in case filesystem does not support clone or in
@@ -1786,12 +1831,7 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
 		return -EINVAL;
 
-	if (!(file_in->f_mode & FMODE_READ) ||
-	    !(file_out->f_mode & FMODE_WRITE) ||
-	    (file_out->f_flags & O_APPEND))
-		return -EBADF;
-
-	return 0;
+	return essential_file_rw_checks(file_in, file_out);
 }
 
 int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 016b0fe1536e..37df1b497418 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2187,6 +2187,8 @@ struct file_operations {
 #define FOP_ASYNC_LOCK		((__force fop_flags_t)(1 << 6))
 /* File system supports uncached read/write buffered IO */
 #define FOP_DONTCACHE		((__force fop_flags_t)(1 << 7))
+/* Supports cross-FS copy_file_range for memory file */
+#define FOP_MEMORY_FILE		((__force fop_flags_t)(1 << 8))
 
 /* Wrap a directory iterator that needs exclusive inode access */
 int wrap_directory_iterator(struct file *, struct dir_context *,
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread
* Re: [PATCH v4 1/4] fs: allow cross-FS copy_file_range for memory file with direct I/O 2025-06-03 9:52 ` [PATCH v4 1/4] fs: allow cross-FS copy_file_range for memory file with direct I/O wangtao @ 2025-06-03 10:56 ` Amir Goldstein 2025-06-03 12:38 ` wangtao 0 siblings, 1 reply; 30+ messages in thread From: Amir Goldstein @ 2025-06-03 10:56 UTC (permalink / raw) To: wangtao Cc: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro, brauner, hughd, akpm, benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack, baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu, feng.han On Tue, Jun 3, 2025 at 11:53 AM wangtao <tao.wangtao@honor.com> wrote: > > Memory files can optimize copy performance via copy_file_range callbacks: > -Compared to mmap&read: reduces GUP (get_user_pages) overhead > -Compared to sendfile/splice: eliminates one memory copy > -Supports dma-buf direct I/O zero-copy implementation > > Suggested by: Christian König <christian.koenig@amd.com> > Suggested by: Amir Goldstein <amir73il@gmail.com> > Signed-off-by: wangtao <tao.wangtao@honor.com> > --- > fs/read_write.c | 64 +++++++++++++++++++++++++++++++++++++--------- > include/linux/fs.h | 2 ++ > 2 files changed, 54 insertions(+), 12 deletions(-) > > diff --git a/fs/read_write.c b/fs/read_write.c > index bb0ed26a0b3a..ecb4f753c632 100644 > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -1469,6 +1469,31 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd, > } > #endif > > +static const struct file_operations *memory_copy_file_ops( > + struct file *file_in, struct file *file_out) > +{ > + if ((file_in->f_op->fop_flags & FOP_MEMORY_FILE) && > + (file_in->f_mode & FMODE_CAN_ODIRECT) && > + file_in->f_op->copy_file_range && file_out->f_op->write_iter) > + return file_in->f_op; > + else if ((file_out->f_op->fop_flags & FOP_MEMORY_FILE) && > + (file_out->f_mode & FMODE_CAN_ODIRECT) && > + file_in->f_op->read_iter 
&& file_out->f_op->copy_file_range) > + return file_out->f_op; > + else > + return NULL; > +} > + > +static int essential_file_rw_checks(struct file *file_in, struct file *file_out) > +{ > + if (!(file_in->f_mode & FMODE_READ) || > + !(file_out->f_mode & FMODE_WRITE) || > + (file_out->f_flags & O_APPEND)) > + return -EBADF; > + > + return 0; > +} > + > /* > * Performs necessary checks before doing a file copy > * > @@ -1484,9 +1509,16 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in, > struct inode *inode_out = file_inode(file_out); > uint64_t count = *req_count; > loff_t size_in; > + bool splice = flags & COPY_FILE_SPLICE; > + const struct file_operations *mem_fops; > int ret; > > - ret = generic_file_rw_checks(file_in, file_out); > + /* The dma-buf file is not a regular file. */ > + mem_fops = memory_copy_file_ops(file_in, file_out); > + if (splice || mem_fops == NULL) nit: use !mem_fops please Considering that the flag COPY_FILE_SPLICE is not allowed from userspace and is only called by nfsd and ksmbd I think we should assert and deny the combination of mem_fops && splice because it is very much unexpected. After asserting this, it would be nicer to write as: if (mem_fops) ret = essential_file_rw_checks(file_in, file_out); else ret = generic_file_rw_checks(file_in, file_out); > + else > + ret = essential_file_rw_checks(file_in, file_out); > if (ret) > return ret; > > @@ -1500,8 +1532,10 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in, > * and several different sets of file_operations, but they all end up > * using the same ->copy_file_range() function pointer. > */ > - if (flags & COPY_FILE_SPLICE) { > + if (splice) { > /* cross sb splice is allowed */ > + } else if (mem_fops != NULL) { With the assertion that splice && mem_fops is not allowed if (splice || mem_fops) { would go well together because they both allow cross-fs copy not only cross sb. > + /* cross-fs copy is allowed for memory file. 
*/ > } else if (file_out->f_op->copy_file_range) { > if (file_in->f_op->copy_file_range != > file_out->f_op->copy_file_range) > @@ -1554,6 +1588,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, > ssize_t ret; > bool splice = flags & COPY_FILE_SPLICE; > bool samesb = file_inode(file_in)->i_sb == file_inode(file_out)->i_sb; > + const struct file_operations *mem_fops; > > if (flags & ~COPY_FILE_SPLICE) > return -EINVAL; > @@ -1574,18 +1609,27 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, > if (len == 0) > return 0; > > + if (splice) > + goto do_splice; > + > file_start_write(file_out); > goto do_splice needs to be after file_start_write Please wait for feedback from vfs maintainers before posting another version addressing my review comments. Thanks, Amir. ^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH v4 1/4] fs: allow cross-FS copy_file_range for memory file with direct I/O 2025-06-03 10:56 ` Amir Goldstein @ 2025-06-03 12:38 ` wangtao 2025-06-03 12:43 ` Amir Goldstein 0 siblings, 1 reply; 30+ messages in thread From: wangtao @ 2025-06-03 12:38 UTC (permalink / raw) To: Amir Goldstein Cc: sumit.semwal@linaro.org, christian.koenig@amd.com, kraxel@redhat.com, vivek.kasireddy@intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com, akpm@linux-foundation.org, benjamin.gaignard@collabora.com, Brian.Starkey@arm.com, jstultz@google.com, tjmercier@google.com, jack@suse.cz, baolin.wang@linux.alibaba.com, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, wangbintian(BintianWang), yipengxiang, liulu 00013167, hanfeng 00012985 > -----Original Message----- > From: Amir Goldstein <amir73il@gmail.com> > Sent: Tuesday, June 3, 2025 6:57 PM > To: wangtao <tao.wangtao@honor.com> > Cc: sumit.semwal@linaro.org; christian.koenig@amd.com; > kraxel@redhat.com; vivek.kasireddy@intel.com; viro@zeniv.linux.org.uk; > brauner@kernel.org; hughd@google.com; akpm@linux-foundation.org; > benjamin.gaignard@collabora.com; Brian.Starkey@arm.com; > jstultz@google.com; tjmercier@google.com; jack@suse.cz; > baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org; dri- > devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux- > kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux- > mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>; > yipengxiang <yipengxiang@honor.com>; liulu 00013167 > <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com> > Subject: Re: [PATCH v4 1/4] fs: allow cross-FS copy_file_range for memory > file with direct I/O > > On Tue, Jun 3, 2025 at 11:53 AM wangtao <tao.wangtao@honor.com> wrote: > > > > Memory files can optimize copy performance via copy_file_range callbacks: > > 
-Compared to mmap&read: reduces GUP (get_user_pages) overhead > > -Compared to sendfile/splice: eliminates one memory copy -Supports > > dma-buf direct I/O zero-copy implementation > > > > Suggested by: Christian König <christian.koenig@amd.com> Suggested by: > > Amir Goldstein <amir73il@gmail.com> > > Signed-off-by: wangtao <tao.wangtao@honor.com> > > --- > > fs/read_write.c | 64 +++++++++++++++++++++++++++++++++++++----- > ---- > > include/linux/fs.h | 2 ++ > > 2 files changed, 54 insertions(+), 12 deletions(-) > > > > diff --git a/fs/read_write.c b/fs/read_write.c index > > bb0ed26a0b3a..ecb4f753c632 100644 > > --- a/fs/read_write.c > > +++ b/fs/read_write.c > > @@ -1469,6 +1469,31 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, > out_fd, > > int, in_fd, } #endif > > > > +static const struct file_operations *memory_copy_file_ops( > > + struct file *file_in, struct file *file_out) { > > + if ((file_in->f_op->fop_flags & FOP_MEMORY_FILE) && > > + (file_in->f_mode & FMODE_CAN_ODIRECT) && > > + file_in->f_op->copy_file_range && file_out->f_op->write_iter) > > + return file_in->f_op; > > + else if ((file_out->f_op->fop_flags & FOP_MEMORY_FILE) && > > + (file_out->f_mode & FMODE_CAN_ODIRECT) && > > + file_in->f_op->read_iter && file_out->f_op->copy_file_range) > > + return file_out->f_op; > > + else > > + return NULL; > > +} > > + > > +static int essential_file_rw_checks(struct file *file_in, struct file > > +*file_out) { > > + if (!(file_in->f_mode & FMODE_READ) || > > + !(file_out->f_mode & FMODE_WRITE) || > > + (file_out->f_flags & O_APPEND)) > > + return -EBADF; > > + > > + return 0; > > +} > > + > > /* > > * Performs necessary checks before doing a file copy > > * > > @@ -1484,9 +1509,16 @@ static int generic_copy_file_checks(struct file > *file_in, loff_t pos_in, > > struct inode *inode_out = file_inode(file_out); > > uint64_t count = *req_count; > > loff_t size_in; > > + bool splice = flags & COPY_FILE_SPLICE; > > + const struct file_operations *mem_fops; > > int 
ret; > > > > - ret = generic_file_rw_checks(file_in, file_out); > > + /* The dma-buf file is not a regular file. */ > > + mem_fops = memory_copy_file_ops(file_in, file_out); > > + if (splice || mem_fops == NULL) > > nit: use !mem_fops please > > Considering that the flag COPY_FILE_SPLICE is not allowed from userspace > and is only called by nfsd and ksmbd I think we should assert and deny the > combination of mem_fops && splice because it is very much unexpected. > > After asserting this, it would be nicer to write as: > if (mem_fops) > ret = essential_file_rw_checks(file_in, file_out); > else > ret = generic_file_rw_checks(file_in, file_out); > Got it. Thanks. > > + else > > + ret = essential_file_rw_checks(file_in, file_out); > > if (ret) > > return ret; > > > > @@ -1500,8 +1532,10 @@ static int generic_copy_file_checks(struct file > *file_in, loff_t pos_in, > > * and several different sets of file_operations, but they all end up > > * using the same ->copy_file_range() function pointer. > > */ > > - if (flags & COPY_FILE_SPLICE) { > > + if (splice) { > > /* cross sb splice is allowed */ > > + } else if (mem_fops != NULL) { > > With the assertion that splice && mem_fops is not allowed if (splice || > mem_fops) { > > would go well together because they both allow cross-fs copy not only cross > sb. > Got it. > > + /* cross-fs copy is allowed for memory file. 
*/ > > } else if (file_out->f_op->copy_file_range) { > > if (file_in->f_op->copy_file_range != > > file_out->f_op->copy_file_range) @@ -1554,6 > > +1588,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, > > ssize_t ret; > > bool splice = flags & COPY_FILE_SPLICE; > > bool samesb = file_inode(file_in)->i_sb == > > file_inode(file_out)->i_sb; > > + const struct file_operations *mem_fops; > > > > if (flags & ~COPY_FILE_SPLICE) > > return -EINVAL; > > @@ -1574,18 +1609,27 @@ ssize_t vfs_copy_file_range(struct file *file_in, > loff_t pos_in, > > if (len == 0) > > return 0; > > > > + if (splice) > > + goto do_splice; > > + > > file_start_write(file_out); > > > > goto do_splice needs to be after file_start_write > > Please wait for feedback from vfs maintainers before posting another > version addressing my review comments. > Are you asking whether both the goto do_splice and the do_splice label should be enclosed between file_start_write and file_end_write? Regards, Wangtao. > Thanks, > Amir. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4 1/4] fs: allow cross-FS copy_file_range for memory file with direct I/O 2025-06-03 12:38 ` wangtao @ 2025-06-03 12:43 ` Amir Goldstein 0 siblings, 0 replies; 30+ messages in thread From: Amir Goldstein @ 2025-06-03 12:43 UTC (permalink / raw) To: wangtao Cc: sumit.semwal@linaro.org, christian.koenig@amd.com, kraxel@redhat.com, vivek.kasireddy@intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com, akpm@linux-foundation.org, benjamin.gaignard@collabora.com, Brian.Starkey@arm.com, jstultz@google.com, tjmercier@google.com, jack@suse.cz, baolin.wang@linux.alibaba.com, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, wangbintian(BintianWang), yipengxiang, liulu 00013167, hanfeng 00012985 On Tue, Jun 3, 2025 at 2:38 PM wangtao <tao.wangtao@honor.com> wrote: > > > > > -----Original Message----- > > From: Amir Goldstein <amir73il@gmail.com> > > Sent: Tuesday, June 3, 2025 6:57 PM > > To: wangtao <tao.wangtao@honor.com> > > Cc: sumit.semwal@linaro.org; christian.koenig@amd.com; > > kraxel@redhat.com; vivek.kasireddy@intel.com; viro@zeniv.linux.org.uk; > > brauner@kernel.org; hughd@google.com; akpm@linux-foundation.org; > > benjamin.gaignard@collabora.com; Brian.Starkey@arm.com; > > jstultz@google.com; tjmercier@google.com; jack@suse.cz; > > baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org; dri- > > devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux- > > kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux- > > mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>; > > yipengxiang <yipengxiang@honor.com>; liulu 00013167 > > <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com> > > Subject: Re: [PATCH v4 1/4] fs: allow cross-FS copy_file_range for memory > > file with direct I/O > > > > On Tue, Jun 3, 2025 at 11:53 AM wangtao <tao.wangtao@honor.com> wrote: > > > > > > 
Memory files can optimize copy performance via copy_file_range callbacks: > > > -Compared to mmap&read: reduces GUP (get_user_pages) overhead > > > -Compared to sendfile/splice: eliminates one memory copy -Supports > > > dma-buf direct I/O zero-copy implementation > > > > > > Suggested by: Christian König <christian.koenig@amd.com> Suggested by: > > > Amir Goldstein <amir73il@gmail.com> > > > Signed-off-by: wangtao <tao.wangtao@honor.com> > > > --- > > > fs/read_write.c | 64 +++++++++++++++++++++++++++++++++++++----- > > ---- > > > include/linux/fs.h | 2 ++ > > > 2 files changed, 54 insertions(+), 12 deletions(-) > > > > > > diff --git a/fs/read_write.c b/fs/read_write.c index > > > bb0ed26a0b3a..ecb4f753c632 100644 > > > --- a/fs/read_write.c > > > +++ b/fs/read_write.c > > > @@ -1469,6 +1469,31 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, > > out_fd, > > > int, in_fd, } #endif > > > > > > +static const struct file_operations *memory_copy_file_ops( > > > + struct file *file_in, struct file *file_out) { > > > + if ((file_in->f_op->fop_flags & FOP_MEMORY_FILE) && > > > + (file_in->f_mode & FMODE_CAN_ODIRECT) && > > > + file_in->f_op->copy_file_range && file_out->f_op->write_iter) > > > + return file_in->f_op; > > > + else if ((file_out->f_op->fop_flags & FOP_MEMORY_FILE) && > > > + (file_out->f_mode & FMODE_CAN_ODIRECT) && > > > + file_in->f_op->read_iter && file_out->f_op->copy_file_range) > > > + return file_out->f_op; > > > + else > > > + return NULL; > > > +} > > > + > > > +static int essential_file_rw_checks(struct file *file_in, struct file > > > +*file_out) { > > > + if (!(file_in->f_mode & FMODE_READ) || > > > + !(file_out->f_mode & FMODE_WRITE) || > > > + (file_out->f_flags & O_APPEND)) > > > + return -EBADF; > > > + > > > + return 0; > > > +} > > > + > > > /* > > > * Performs necessary checks before doing a file copy > > > * > > > @@ -1484,9 +1509,16 @@ static int generic_copy_file_checks(struct file > > *file_in, loff_t pos_in, > > > struct inode *inode_out 
= file_inode(file_out); > > > uint64_t count = *req_count; > > > loff_t size_in; > > > + bool splice = flags & COPY_FILE_SPLICE; > > > + const struct file_operations *mem_fops; > > > int ret; > > > > > > - ret = generic_file_rw_checks(file_in, file_out); > > > + /* The dma-buf file is not a regular file. */ > > > + mem_fops = memory_copy_file_ops(file_in, file_out); > > > + if (splice || mem_fops == NULL) > > > > nit: use !mem_fops please > > > > Considering that the flag COPY_FILE_SPLICE is not allowed from userspace > > and is only called by nfsd and ksmbd I think we should assert and deny the > > combination of mem_fops && splice because it is very much unexpected. > > > > After asserting this, it would be nicer to write as: > > if (mem_fops) > > ret = essential_file_rw_checks(file_in, file_out); > > else > > ret = generic_file_rw_checks(file_in, file_out); > > > Got it. Thanks. > > > + else > > > + ret = essential_file_rw_checks(file_in, file_out); > > > if (ret) > > > return ret; > > > > > > @@ -1500,8 +1532,10 @@ static int generic_copy_file_checks(struct file > > *file_in, loff_t pos_in, > > > * and several different sets of file_operations, but they all end up > > > * using the same ->copy_file_range() function pointer. > > > */ > > > - if (flags & COPY_FILE_SPLICE) { > > > + if (splice) { > > > /* cross sb splice is allowed */ > > > + } else if (mem_fops != NULL) { > > > > With the assertion that splice && mem_fops is not allowed if (splice || > > mem_fops) { > > > > would go well together because they both allow cross-fs copy not only cross > > sb. > > > Git it. > > > > + /* cross-fs copy is allowed for memory file. 
*/ > > > } else if (file_out->f_op->copy_file_range) { > > > if (file_in->f_op->copy_file_range != > > > file_out->f_op->copy_file_range) @@ -1554,6 > > > +1588,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, > > > ssize_t ret; > > > bool splice = flags & COPY_FILE_SPLICE; > > > bool samesb = file_inode(file_in)->i_sb == > > > file_inode(file_out)->i_sb; > > > + const struct file_operations *mem_fops; > > > > > > if (flags & ~COPY_FILE_SPLICE) > > > return -EINVAL; > > > @@ -1574,18 +1609,27 @@ ssize_t vfs_copy_file_range(struct file *file_in, > > loff_t pos_in, > > > if (len == 0) > > > return 0; > > > > > > + if (splice) > > > + goto do_splice; > > > + > > > file_start_write(file_out); > > > > > > > goto do_splice needs to be after file_start_write > > > > Please wait for feedback from vfs maintainers before posting another > > version addressing my review comments. > > > Are you asking whether both the goto do_splice and the do_splice label should > be enclosed between file_start_write and file_end_write? No I was just wrong please ignore this comment. Thanks, Amir. ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v4 2/4] dmabuf: Implement copy_file_range callback for dmabuf direct I/O prep 2025-06-03 9:52 [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range wangtao 2025-06-03 9:52 ` [PATCH v4 1/4] fs: allow cross-FS copy_file_range for memory file with direct I/O wangtao @ 2025-06-03 9:52 ` wangtao 2025-06-03 10:42 ` Christian König 2025-06-03 13:04 ` Christoph Hellwig 2025-06-03 9:52 ` [PATCH v4 3/4] udmabuf: Implement udmabuf direct I/O wangtao ` (2 subsequent siblings) 4 siblings, 2 replies; 30+ messages in thread From: wangtao @ 2025-06-03 9:52 UTC (permalink / raw) To: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro, brauner, hughd, akpm, amir73il Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack, baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu, feng.han, wangtao First determine if dmabuf reads from or writes to the file. Then call exporter's rw_file callback function. 
Signed-off-by: wangtao <tao.wangtao@honor.com> --- drivers/dma-buf/dma-buf.c | 32 ++++++++++++++++++++++++++++++++ include/linux/dma-buf.h | 16 ++++++++++++++++ 2 files changed, 48 insertions(+) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 5baa83b85515..fc9bf54c921a 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -523,7 +523,38 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file) spin_unlock(&dmabuf->name_lock); } +static ssize_t dma_buf_rw_file(struct dma_buf *dmabuf, loff_t my_pos, + struct file *file, loff_t pos, size_t count, bool is_write) +{ + if (!dmabuf->ops->rw_file) + return -EINVAL; + + if (my_pos >= dmabuf->size) + count = 0; + else + count = min_t(size_t, count, dmabuf->size - my_pos); + if (!count) + return 0; + + return dmabuf->ops->rw_file(dmabuf, my_pos, file, pos, count, is_write); +} + +static ssize_t dma_buf_copy_file_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + size_t count, unsigned int flags) +{ + if (is_dma_buf_file(file_in) && file_out->f_op->write_iter) + return dma_buf_rw_file(file_in->private_data, pos_in, + file_out, pos_out, count, true); + else if (is_dma_buf_file(file_out) && file_in->f_op->read_iter) + return dma_buf_rw_file(file_out->private_data, pos_out, + file_in, pos_in, count, false); + else + return -EINVAL; +} + static const struct file_operations dma_buf_fops = { + .fop_flags = FOP_MEMORY_FILE, .release = dma_buf_file_release, .mmap = dma_buf_mmap_internal, .llseek = dma_buf_llseek, @@ -531,6 +562,7 @@ static const struct file_operations dma_buf_fops = { .unlocked_ioctl = dma_buf_ioctl, .compat_ioctl = compat_ptr_ioctl, .show_fdinfo = dma_buf_show_fdinfo, + .copy_file_range = dma_buf_copy_file_range, }; /* diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 36216d28d8bd..d3636e985399 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -22,6 +22,7 @@ #include <linux/fs.h> 
#include <linux/dma-fence.h> #include <linux/wait.h> +#include <uapi/linux/dma-buf.h> struct device; struct dma_buf; @@ -285,6 +286,21 @@ struct dma_buf_ops { int (*vmap)(struct dma_buf *dmabuf, struct iosys_map *map); void (*vunmap)(struct dma_buf *dmabuf, struct iosys_map *map); + + /** + * @rw_file: + * + * If an Exporter needs to support Direct I/O file operations, it can + * implement this optional callback. The exporter must verify that no + * other objects hold the sg_table, ensure exclusive access to the + * dmabuf's sg_table, and only then proceed with the I/O operation. + * + * Returns: + * + * 0 on success or a negative error code on failure. + */ + ssize_t (*rw_file)(struct dma_buf *dmabuf, loff_t my_pos, + struct file *file, loff_t pos, size_t count, bool is_write); }; /** -- 2.17.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
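The bounds check in dma_buf_rw_file() above is plain interval arithmetic: an I/O starting at or past the end of the dmabuf transfers nothing, and one starting inside it is truncated to the bytes that remain. A minimal userspace model of that rule (the helper name is illustrative, not part of the patch):

```python
def clamp_rw_count(my_pos, count, buf_size):
    """Model of the clamp in dma_buf_rw_file(): limit the transfer to
    the bytes remaining in a buffer of buf_size starting at my_pos."""
    if my_pos >= buf_size:
        return 0
    return min(count, buf_size - my_pos)
```

A zero result short-circuits before the exporter callback runs, so rw_file() implementations never see a range that extends past the buffer.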
* Re: [PATCH v4 2/4] dmabuf: Implement copy_file_range callback for dmabuf direct I/O prep 2025-06-03 9:52 ` [PATCH v4 2/4] dmabuf: Implement copy_file_range callback for dmabuf direct I/O prep wangtao @ 2025-06-03 10:42 ` Christian König 2025-06-03 12:26 ` wangtao 2025-06-03 13:04 ` Christoph Hellwig 1 sibling, 1 reply; 30+ messages in thread From: Christian König @ 2025-06-03 10:42 UTC (permalink / raw) To: wangtao, sumit.semwal, kraxel, vivek.kasireddy, viro, brauner, hughd, akpm, amir73il Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack, baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu, feng.han On 6/3/25 11:52, wangtao wrote: > First determine if dmabuf reads from or writes to the file. > Then call exporter's rw_file callback function. > > Signed-off-by: wangtao <tao.wangtao@honor.com> > --- > drivers/dma-buf/dma-buf.c | 32 ++++++++++++++++++++++++++++++++ > include/linux/dma-buf.h | 16 ++++++++++++++++ > 2 files changed, 48 insertions(+) > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c > index 5baa83b85515..fc9bf54c921a 100644 > --- a/drivers/dma-buf/dma-buf.c > +++ b/drivers/dma-buf/dma-buf.c > @@ -523,7 +523,38 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file) > spin_unlock(&dmabuf->name_lock); > } > > +static ssize_t dma_buf_rw_file(struct dma_buf *dmabuf, loff_t my_pos, > + struct file *file, loff_t pos, size_t count, bool is_write) > +{ > + if (!dmabuf->ops->rw_file) > + return -EINVAL; > + > + if (my_pos >= dmabuf->size) > + count = 0; > + else > + count = min_t(size_t, count, dmabuf->size - my_pos); > + if (!count) > + return 0; > + > + return dmabuf->ops->rw_file(dmabuf, my_pos, file, pos, count, is_write); > +} > + > +static ssize_t dma_buf_copy_file_range(struct file *file_in, loff_t pos_in, > + struct file *file_out, loff_t pos_out, > + size_t count, unsigned int flags) > +{ > + if (is_dma_buf_file(file_in) && 
file_out->f_op->write_iter) > + return dma_buf_rw_file(file_in->private_data, pos_in, > + file_out, pos_out, count, true); > + else if (is_dma_buf_file(file_out) && file_in->f_op->read_iter) > + return dma_buf_rw_file(file_out->private_data, pos_out, > + file_in, pos_in, count, false); > + else > + return -EINVAL; > +} > + > static const struct file_operations dma_buf_fops = { > + .fop_flags = FOP_MEMORY_FILE, > .release = dma_buf_file_release, > .mmap = dma_buf_mmap_internal, > .llseek = dma_buf_llseek, > @@ -531,6 +562,7 @@ static const struct file_operations dma_buf_fops = { > .unlocked_ioctl = dma_buf_ioctl, > .compat_ioctl = compat_ptr_ioctl, > .show_fdinfo = dma_buf_show_fdinfo, > + .copy_file_range = dma_buf_copy_file_range, > }; > > /* > diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h > index 36216d28d8bd..d3636e985399 100644 > --- a/include/linux/dma-buf.h > +++ b/include/linux/dma-buf.h > @@ -22,6 +22,7 @@ > #include <linux/fs.h> > #include <linux/dma-fence.h> > #include <linux/wait.h> > +#include <uapi/linux/dma-buf.h> > > struct device; > struct dma_buf; > @@ -285,6 +286,21 @@ struct dma_buf_ops { > > int (*vmap)(struct dma_buf *dmabuf, struct iosys_map *map); > void (*vunmap)(struct dma_buf *dmabuf, struct iosys_map *map); > + > + /** > + * @rw_file: > + * > + * If an Exporter needs to support Direct I/O file operations, it can > + * implement this optional callback. The exporter must verify that no > + * other objects hold the sg_table, ensure exclusive access to the > + * dmabuf's sg_table, and only then proceed with the I/O operation. Explain why and not what. E.g. something like "Allows direct I/O between this DMA-buf and the file". Completely drop mentioning the sg_table, that is irrelevant. Exclusive access depends on how the exporter implements the whole thing. Regards, Christian. > + * > + * Returns: > + * > + * 0 on success or a negative error code on failure. 
> + */ > + ssize_t (*rw_file)(struct dma_buf *dmabuf, loff_t my_pos, > + struct file *file, loff_t pos, size_t count, bool is_write); > }; > > /**
* RE: [PATCH v4 2/4] dmabuf: Implement copy_file_range callback for dmabuf direct I/O prep 2025-06-03 10:42 ` Christian König @ 2025-06-03 12:26 ` wangtao 0 siblings, 0 replies; 30+ messages in thread From: wangtao @ 2025-06-03 12:26 UTC (permalink / raw) To: Christian König, sumit.semwal@linaro.org, kraxel@redhat.com, vivek.kasireddy@intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com, akpm@linux-foundation.org, amir73il@gmail.com Cc: benjamin.gaignard@collabora.com, Brian.Starkey@arm.com, jstultz@google.com, tjmercier@google.com, jack@suse.cz, baolin.wang@linux.alibaba.com, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, wangbintian(BintianWang), yipengxiang, liulu 00013167, hanfeng 00012985 > -----Original Message----- > From: Christian König <christian.koenig@amd.com> > Sent: Tuesday, June 3, 2025 6:42 PM > To: wangtao <tao.wangtao@honor.com>; sumit.semwal@linaro.org; > kraxel@redhat.com; vivek.kasireddy@intel.com; viro@zeniv.linux.org.uk; > brauner@kernel.org; hughd@google.com; akpm@linux-foundation.org; > amir73il@gmail.com > Cc: benjamin.gaignard@collabora.com; Brian.Starkey@arm.com; > jstultz@google.com; tjmercier@google.com; jack@suse.cz; > baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org; dri- > devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux- > kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux- > mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>; > yipengxiang <yipengxiang@honor.com>; liulu 00013167 > <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com> > Subject: Re: [PATCH v4 2/4] dmabuf: Implement copy_file_range callback for > dmabuf direct I/O prep > > > > On 6/3/25 11:52, wangtao wrote: > > First determine if dmabuf reads from or writes to the file. > > Then call exporter's rw_file callback function. 
> > > > Signed-off-by: wangtao <tao.wangtao@honor.com> > > --- > > drivers/dma-buf/dma-buf.c | 32 ++++++++++++++++++++++++++++++++ > > include/linux/dma-buf.h | 16 ++++++++++++++++ > > 2 files changed, 48 insertions(+) > > > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c > > index 5baa83b85515..fc9bf54c921a 100644 > > --- a/drivers/dma-buf/dma-buf.c > > +++ b/drivers/dma-buf/dma-buf.c > > @@ -523,7 +523,38 @@ static void dma_buf_show_fdinfo(struct seq_file > *m, struct file *file) > > spin_unlock(&dmabuf->name_lock); > > } > > > > +static ssize_t dma_buf_rw_file(struct dma_buf *dmabuf, loff_t my_pos, > > + struct file *file, loff_t pos, size_t count, bool is_write) { > > + if (!dmabuf->ops->rw_file) > > + return -EINVAL; > > + > > + if (my_pos >= dmabuf->size) > > + count = 0; > > + else > > + count = min_t(size_t, count, dmabuf->size - my_pos); > > + if (!count) > > + return 0; > > + > > + return dmabuf->ops->rw_file(dmabuf, my_pos, file, pos, count, > > +is_write); } > > + > > +static ssize_t dma_buf_copy_file_range(struct file *file_in, loff_t pos_in, > > + struct file *file_out, loff_t pos_out, > > + size_t count, unsigned int flags) > > +{ > > + if (is_dma_buf_file(file_in) && file_out->f_op->write_iter) > > + return dma_buf_rw_file(file_in->private_data, pos_in, > > + file_out, pos_out, count, true); > > + else if (is_dma_buf_file(file_out) && file_in->f_op->read_iter) > > + return dma_buf_rw_file(file_out->private_data, pos_out, > > + file_in, pos_in, count, false); > > + else > > + return -EINVAL; > > +} > > + > > static const struct file_operations dma_buf_fops = { > > + .fop_flags = FOP_MEMORY_FILE, > > .release = dma_buf_file_release, > > .mmap = dma_buf_mmap_internal, > > .llseek = dma_buf_llseek, > > @@ -531,6 +562,7 @@ static const struct file_operations dma_buf_fops = { > > .unlocked_ioctl = dma_buf_ioctl, > > .compat_ioctl = compat_ptr_ioctl, > > .show_fdinfo = dma_buf_show_fdinfo, > > + .copy_file_range = 
dma_buf_copy_file_range, > > }; > > > > /* > > diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index > > 36216d28d8bd..d3636e985399 100644 > > --- a/include/linux/dma-buf.h > > +++ b/include/linux/dma-buf.h > > @@ -22,6 +22,7 @@ > > #include <linux/fs.h> > > #include <linux/dma-fence.h> > > #include <linux/wait.h> > > +#include <uapi/linux/dma-buf.h> > > > > struct device; > > struct dma_buf; > > @@ -285,6 +286,21 @@ struct dma_buf_ops { > > > > int (*vmap)(struct dma_buf *dmabuf, struct iosys_map *map); > > void (*vunmap)(struct dma_buf *dmabuf, struct iosys_map *map); > > + > > + /** > > + * @rw_file: > > + * > > + * If an Exporter needs to support Direct I/O file operations, it can > > + * implement this optional callback. The exporter must verify that no > > + * other objects hold the sg_table, ensure exclusive access to the > > + * dmabuf's sg_table, and only then proceed with the I/O operation. > > Explain why and not what. E.g. something like "Allows direct I/O between > this DMA-buf and the file". > > Completely drop mentioning the sg_table, that is irrelevant. Exclusive access > depends on how the exporter implements the whole thing. > Will fix this shortly. Regards, Wangtao. > Regards, > Christian. > > > + * > > + * Returns: > > + * > > + * 0 on success or a negative error code on failure. > > + */ > > + ssize_t (*rw_file)(struct dma_buf *dmabuf, loff_t my_pos, > > + struct file *file, loff_t pos, size_t count, bool is_write); > > }; > > > > /** ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4 2/4] dmabuf: Implement copy_file_range callback for dmabuf direct I/O prep 2025-06-03 9:52 ` [PATCH v4 2/4] dmabuf: Implement copy_file_range callback for dmabuf direct I/O prep wangtao 2025-06-03 10:42 ` Christian König @ 2025-06-03 13:04 ` Christoph Hellwig 1 sibling, 0 replies; 30+ messages in thread From: Christoph Hellwig @ 2025-06-03 13:04 UTC (permalink / raw) To: wangtao Cc: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro, brauner, hughd, akpm, amir73il, benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack, baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu, feng.han On Tue, Jun 03, 2025 at 05:52:43PM +0800, wangtao wrote: > +static ssize_t dma_buf_rw_file(struct dma_buf *dmabuf, loff_t my_pos, > + struct file *file, loff_t pos, size_t count, bool is_write) > +{ > + if (!dmabuf->ops->rw_file) > + return -EINVAL; > + > + if (my_pos >= dmabuf->size) > + count = 0; > + else > + count = min_t(size_t, count, dmabuf->size - my_pos); > + if (!count) > + return 0; > + > + return dmabuf->ops->rw_file(dmabuf, my_pos, file, pos, count, is_write); So despite claiming in the cover letter that dmabufs can't support direct I/O you are just reimplementing it badly here using a side interface. ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v4 3/4] udmabuf: Implement udmabuf direct I/O 2025-06-03 9:52 [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range wangtao 2025-06-03 9:52 ` [PATCH v4 1/4] fs: allow cross-FS copy_file_range for memory file with direct I/O wangtao 2025-06-03 9:52 ` [PATCH v4 2/4] dmabuf: Implement copy_file_range callback for dmabuf direct I/O prep wangtao @ 2025-06-03 9:52 ` wangtao 2025-06-03 9:52 ` [PATCH v4 4/4] dmabuf:system_heap Implement system_heap dmabuf " wangtao 2025-06-03 13:00 ` [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range Christoph Hellwig 4 siblings, 0 replies; 30+ messages in thread From: wangtao @ 2025-06-03 9:52 UTC (permalink / raw) To: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro, brauner, hughd, akpm, amir73il Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack, baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu, feng.han, wangtao Construct bio_vec from folios, then call the other file's r/w callbacks for IO operations. Test data shows direct I/O copy_file_range improves performance by over 50% vs direct I/O mmap&read (2557 vs 1534). 
Test data: | 32x32MB Read 1024MB |Creat-ms|Close-ms| I/O-ms|I/O-MB/s| I/O% |-------------------------|--------|--------|--------|--------|----- | 1)Beg udmabuf buffer R/W| 580 | 323 | 1238 | 867 | 100% | 2) dmabuf buffer R/W| 48 | 5 | 1149 | 934 | 107% | 3) udma+memfd buffer R/W| 597 | 340 | 2157 | 497 | 57% | 4) udma+memfd direct R/W| 573 | 340 | 700 | 1534 | 176% | 5) u+mfd buffer sendfile| 577 | 340 | 1204 | 891 | 102% | 6) u+mfd direct sendfile| 567 | 339 | 2272 | 472 | 54% | 7) u+mfd buffer splice| 570 | 337 | 1114 | 964 | 111% | 8) u+mfd direct splice| 564 | 335 | 793 | 1355 | 156% | 9) udmabuf buffer c_f_r| 577 | 323 | 1059 | 1014 | 116% |10) udmabuf direct c_f_r| 582 | 325 | 420 | 2557 | 294% |11)End udmabuf buffer R/W| 586 | 323 | 1188 | 903 | 104% Signed-off-by: wangtao <tao.wangtao@honor.com> --- drivers/dma-buf/udmabuf.c | 54 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 54 insertions(+) diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c index e74e36a8ecda..511567b15340 100644 --- a/drivers/dma-buf/udmabuf.c +++ b/drivers/dma-buf/udmabuf.c @@ -15,6 +15,8 @@ #include <linux/udmabuf.h> #include <linux/vmalloc.h> #include <linux/iosys-map.h> +#include <linux/bvec.h> +#include <linux/uio.h> static int list_limit = 1024; module_param(list_limit, int, 0644); @@ -284,6 +286,55 @@ static int end_cpu_udmabuf(struct dma_buf *buf, return 0; } +static ssize_t udmabuf_rw_file(struct dma_buf *dmabuf, loff_t my_pos, + struct file *other, loff_t pos, + size_t count, bool is_write) +{ + struct udmabuf *ubuf = dmabuf->priv; + loff_t my_end = my_pos + count, bv_beg, bv_end = 0; + size_t i, bv_off, bv_len, bv_idx = 0; + struct bio_vec *bvec; + struct kiocb kiocb; + struct iov_iter iter; + unsigned int direction = is_write ? 
ITER_SOURCE : ITER_DEST; + ssize_t ret = 0; + struct folio *folio; + + bvec = kvcalloc(ubuf->nr_pinned, sizeof(*bvec), GFP_KERNEL); + if (!bvec) + return -ENOMEM; + + init_sync_kiocb(&kiocb, other); + kiocb.ki_pos = pos; + + for (i = 0; i < ubuf->nr_pinned; i++) { + folio = ubuf->pinned_folios[i]; + bv_beg = bv_end; + if (bv_beg >= my_end) + break; + bv_end += folio_size(folio); + if (bv_end <= my_pos) + continue; + + bv_len = min(bv_end, my_end) - max(my_pos, bv_beg); + bv_off = my_pos > bv_beg ? my_pos - bv_beg : 0; + bvec_set_page(&bvec[bv_idx], &folio->page, bv_len, bv_off); + ++bv_idx; + } + + if (bv_idx > 0) { + /* start R/W. */ + iov_iter_bvec(&iter, direction, bvec, bv_idx, count); + if (is_write) + ret = other->f_op->write_iter(&kiocb, &iter); + else + ret = other->f_op->read_iter(&kiocb, &iter); + } + kvfree(bvec); + + return ret; +} + static const struct dma_buf_ops udmabuf_ops = { .cache_sgt_mapping = true, .map_dma_buf = map_udmabuf, @@ -294,6 +345,7 @@ static const struct dma_buf_ops udmabuf_ops = { .vunmap = vunmap_udmabuf, .begin_cpu_access = begin_cpu_udmabuf, .end_cpu_access = end_cpu_udmabuf, + .rw_file = udmabuf_rw_file, }; #define SEALS_WANTED (F_SEAL_SHRINK) @@ -455,6 +507,8 @@ static long udmabuf_create(struct miscdevice *device, ret = PTR_ERR(dmabuf); goto err; } + /* Support direct I/O */ + dmabuf->file->f_mode |= FMODE_CAN_ODIRECT; /* * Ownership of ubuf is held by the dmabuf from here. * If the following dma_buf_fd() fails, dma_buf_put() cleans up both the -- 2.17.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
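The bvec construction in udmabuf_rw_file() above is an interval intersection of the requested window [my_pos, my_pos + count) with the running extent of each pinned folio. A userspace model of that loop, where a list of sizes stands in for folio_size() and each emitted tuple stands in for a bvec_set_page() call:

```python
def build_bvecs(folio_sizes, my_pos, count):
    """Clip [my_pos, my_pos + count) against each folio's extent,
    emitting (folio_index, offset_in_folio, length) per overlap,
    mirroring the windowing loop in udmabuf_rw_file()."""
    my_end = my_pos + count
    bv_end = 0
    bvecs = []
    for i, size in enumerate(folio_sizes):
        bv_beg = bv_end
        if bv_beg >= my_end:
            break              # already past the requested window
        bv_end += size
        if bv_end <= my_pos:
            continue           # folio lies entirely before the window
        bv_len = min(bv_end, my_end) - max(my_pos, bv_beg)
        bv_off = my_pos - bv_beg if my_pos > bv_beg else 0
        bvecs.append((i, bv_off, bv_len))
    return bvecs
```

In the kernel path, count has already been clamped by dma_buf_rw_file(), so the emitted lengths always sum to the requested count.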
* [PATCH v4 4/4] dmabuf:system_heap Implement system_heap dmabuf direct I/O 2025-06-03 9:52 [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range wangtao ` (2 preceding siblings ...) 2025-06-03 9:52 ` [PATCH v4 3/4] udmabuf: Implement udmabuf direct I/O wangtao @ 2025-06-03 9:52 ` wangtao 2025-06-03 13:00 ` [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range Christoph Hellwig 4 siblings, 0 replies; 30+ messages in thread From: wangtao @ 2025-06-03 9:52 UTC (permalink / raw) To: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro, brauner, hughd, akpm, amir73il Cc: benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack, baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu, feng.han, wangtao First verify system_heap exporter has exclusive dmabuf access. Build bio_vec from sgtable, then invoke target file's r/w callbacks for IO. Outperforms buffer IO mmap/read by 250%, beats direct I/O udmabuf copy_file_range by over 30% with initialization time significantly lower than udmabuf. 
Test data: | 32x32MB Read 1024MB |Creat-ms|Close-ms| I/O-ms|I/O-MB/s| I/O% |-------------------------|--------|--------|--------|--------|----- | 1)Beg dmabuf buffer R/W| 47 | 5 | 1125 | 954 | 100% | 2) udmabuf buffer R/W| 576 | 323 | 1228 | 874 | 91% | 3) udma+memfd buffer R/W| 596 | 340 | 2166 | 495 | 51% | 4) udma+memfd direct R/W| 570 | 338 | 711 | 1510 | 158% | 5) udmabuf buffer c_f_r| 578 | 329 | 1128 | 952 | 99% | 6) udmabuf direct c_f_r| 570 | 324 | 405 | 2651 | 277% | 7) dmabuf buffer c_f_r| 47 | 5 | 1035 | 1037 | 108% | 8) dmabuf direct c_f_r| 51 | 5 | 309 | 3480 | 364% | 9)End dmabuf buffer R/W| 48 | 5 | 1153 | 931 | 97% | 32x32MB Write 1024MB |Creat-ms|Close-ms| I/O-ms|I/O-MB/s| I/O% |-------------------------|--------|--------|--------|--------|----- | 1)Beg dmabuf buffer R/W| 50 | 5 | 1405 | 764 | 100% | 2) udmabuf buffer R/W| 580 | 341 | 1337 | 803 | 105% | 3) udma+memfd buffer R/W| 588 | 331 | 1820 | 590 | 77% | 4) udma+memfd direct R/W| 585 | 333 | 662 | 1622 | 212% | 5) udmabuf buffer c_f_r| 577 | 329 | 1326 | 810 | 106% | 6) udmabuf direct c_f_r| 580 | 330 | 602 | 1784 | 233% | 7) dmabuf buffer c_f_r| 49 | 5 | 1330 | 807 | 105% | 8) dmabuf direct c_f_r| 49 | 5 | 344 | 3127 | 409% | 9)End dmabuf buffer R/W| 50 | 5 | 1442 | 745 | 97% Signed-off-by: wangtao <tao.wangtao@honor.com> --- drivers/dma-buf/heaps/system_heap.c | 69 +++++++++++++++++++++++++++++ 1 file changed, 69 insertions(+) diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c index 26d5dc89ea16..85ffff7ef855 100644 --- a/drivers/dma-buf/heaps/system_heap.c +++ b/drivers/dma-buf/heaps/system_heap.c @@ -20,6 +20,8 @@ #include <linux/scatterlist.h> #include <linux/slab.h> #include <linux/vmalloc.h> +#include <linux/bvec.h> +#include <linux/uio.h> static struct dma_heap *sys_heap; @@ -281,6 +283,70 @@ static void system_heap_vunmap(struct dma_buf *dmabuf, struct iosys_map *map) iosys_map_clear(map); } +static ssize_t system_heap_buffer_rw_other(struct 
system_heap_buffer *buffer, + loff_t my_pos, struct file *other, loff_t pos, + size_t count, bool is_write) +{ + struct sg_table *sgt = &buffer->sg_table; + struct scatterlist *sg; + loff_t my_end = my_pos + count, bv_beg, bv_end = 0; + size_t i, bv_off, bv_len, bv_idx = 0; + struct bio_vec *bvec; + struct kiocb kiocb; + struct iov_iter iter; + unsigned int direction = is_write ? ITER_SOURCE : ITER_DEST; + ssize_t ret = 0; + + bvec = kvcalloc(sgt->orig_nents, sizeof(*bvec), GFP_KERNEL); + if (!bvec) + return -ENOMEM; + + init_sync_kiocb(&kiocb, other); + kiocb.ki_pos = pos; + + for_each_sgtable_sg(sgt, sg, i) { + bv_beg = bv_end; + if (bv_beg >= my_end) + break; + bv_end += sg->offset + sg->length; + if (bv_end <= my_pos) + continue; + + bv_len = min(bv_end, my_end) - max(my_pos, bv_beg); + bv_off = sg->offset + (my_pos > bv_beg ? my_pos - bv_beg : 0); + bvec_set_page(&bvec[bv_idx], sg_page(sg), bv_len, bv_off); + ++bv_idx; + } + + if (bv_idx > 0) { + /* start R/W. */ + iov_iter_bvec(&iter, direction, bvec, bv_idx, count); + if (is_write) + ret = other->f_op->write_iter(&kiocb, &iter); + else + ret = other->f_op->read_iter(&kiocb, &iter); + } + kvfree(bvec); + + return ret; +} + +static ssize_t system_heap_dma_buf_rw_file(struct dma_buf *dmabuf, + loff_t my_pos, struct file *file, loff_t pos, + size_t count, bool is_write) +{ + struct system_heap_buffer *buffer = dmabuf->priv; + ssize_t ret = -EBUSY; + + mutex_lock(&buffer->lock); + if (list_empty(&buffer->attachments) && !buffer->vmap_cnt) + ret = system_heap_buffer_rw_other(buffer, my_pos, + file, pos, count, is_write); + mutex_unlock(&buffer->lock); + + return ret; +} + static void system_heap_dma_buf_release(struct dma_buf *dmabuf) { struct system_heap_buffer *buffer = dmabuf->priv; @@ -308,6 +374,7 @@ static const struct dma_buf_ops system_heap_buf_ops = { .mmap = system_heap_mmap, .vmap = system_heap_vmap, .vunmap = system_heap_vunmap, + .rw_file = system_heap_dma_buf_rw_file, .release = 
system_heap_dma_buf_release, }; @@ -400,6 +467,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap, ret = PTR_ERR(dmabuf); goto free_pages; } + /* Support direct I/O */ + dmabuf->file->f_mode |= FMODE_CAN_ODIRECT; return dmabuf; free_pages: -- 2.17.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range 2025-06-03 9:52 [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range wangtao ` (3 preceding siblings ...) 2025-06-03 9:52 ` [PATCH v4 4/4] dmabuf:system_heap Implement system_heap dmabuf " wangtao @ 2025-06-03 13:00 ` Christoph Hellwig 2025-06-03 13:14 ` Christian König 4 siblings, 1 reply; 30+ messages in thread From: Christoph Hellwig @ 2025-06-03 13:00 UTC (permalink / raw) To: wangtao Cc: sumit.semwal, christian.koenig, kraxel, vivek.kasireddy, viro, brauner, hughd, akpm, amir73il, benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack, baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu, feng.han This is a really weird interface. No one has yet to explain why dmabuf is so special that we can't support direct I/O to it when we can support it to otherwise exotic mappings like PCI P2P ones. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range 2025-06-03 13:00 ` [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range Christoph Hellwig @ 2025-06-03 13:14 ` Christian König 2025-06-03 13:19 ` Christoph Hellwig 0 siblings, 1 reply; 30+ messages in thread From: Christian König @ 2025-06-03 13:14 UTC (permalink / raw) To: Christoph Hellwig, wangtao Cc: sumit.semwal, kraxel, vivek.kasireddy, viro, brauner, hughd, akpm, amir73il, benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack, baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu, feng.han On 6/3/25 15:00, Christoph Hellwig wrote: > This is a really weird interface. No one has yet to explain why dmabuf > is so special that we can't support direct I/O to it when we can support > it to otherwise exotic mappings like PCI P2P ones. With udmabuf you can do direct I/O, it's just inefficient to walk the page tables for it when you already have an array of all the folios. Regards, Christian. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range 2025-06-03 13:14 ` Christian König @ 2025-06-03 13:19 ` Christoph Hellwig 2025-06-03 14:18 ` Christian König 2025-06-06 9:52 ` wangtao 0 siblings, 2 replies; 30+ messages in thread From: Christoph Hellwig @ 2025-06-03 13:19 UTC (permalink / raw) To: Christian König Cc: Christoph Hellwig, wangtao, sumit.semwal, kraxel, vivek.kasireddy, viro, brauner, hughd, akpm, amir73il, benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack, baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu, feng.han On Tue, Jun 03, 2025 at 03:14:20PM +0200, Christian König wrote: > On 6/3/25 15:00, Christoph Hellwig wrote: > > This is a really weird interface. No one has yet to explain why dmabuf > > is so special that we can't support direct I/O to it when we can support > > it to otherwise exotic mappings like PCI P2P ones. > > With udmabuf you can do direct I/O, it's just inefficient to walk the > page tables for it when you already have an array of all the folios. Does it matter compared to the I/O in this case? Either way there has been talk (in case of networking implementations) that use a dmabuf as a first class container for lower level I/O. I'd much rather do that than adding odd side interfaces. I.e. have a version of splice that doesn't bother with the pipe, but instead just uses in-kernel direct I/O on one side and dmabuf-provided folios on the other. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range 2025-06-03 13:19 ` Christoph Hellwig @ 2025-06-03 14:18 ` Christian König 2025-06-03 14:28 ` Christoph Hellwig 2025-06-06 9:52 ` wangtao 1 sibling, 1 reply; 30+ messages in thread From: Christian König @ 2025-06-03 14:18 UTC (permalink / raw) To: Christoph Hellwig Cc: wangtao, sumit.semwal, kraxel, vivek.kasireddy, viro, brauner, hughd, akpm, amir73il, benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack, baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu, feng.han On 6/3/25 15:19, Christoph Hellwig wrote: > On Tue, Jun 03, 2025 at 03:14:20PM +0200, Christian König wrote: >> On 6/3/25 15:00, Christoph Hellwig wrote: >>> This is a really weird interface. No one has yet to explain why dmabuf >>> is so special that we can't support direct I/O to it when we can support >>> it to otherwise exotic mappings like PCI P2P ones. >> >> With udmabuf you can do direct I/O, it's just inefficient to walk the >> page tables for it when you already have an array of all the folios. > > Does it matter compared to the I/O in this case? It unfortunately does, see the numbers on patch 3 and 4. I'm not very keen about it either, but I don't see much other way to do this. > Either way there has been talk (in case of networking implementations) > that use a dmabuf as a first class container for lower level I/O. > I'd much rather do that than adding odd side interfaces. I.e. have > a version of splice that doesn't bother with the pipe, but instead > just uses in-kernel direct I/O on one side and dmabuf-provided folios > on the other. That would work for me as well. But if splice or copy_file_range is used is not that important to me. My question is rather if it's ok to call f_op->write_iter() and f_op->read_iter() with pages allocated by alloc_pages(), e.g. where drivers potentially ignore the page count and just re-use pages as they like? 
Regards, Christian.
* Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range 2025-06-03 14:18 ` Christian König @ 2025-06-03 14:28 ` Christoph Hellwig 2025-06-03 15:55 ` Christian König 0 siblings, 1 reply; 30+ messages in thread From: Christoph Hellwig @ 2025-06-03 14:28 UTC (permalink / raw) To: Christian König Cc: Christoph Hellwig, wangtao, sumit.semwal, kraxel, vivek.kasireddy, viro, brauner, hughd, akpm, amir73il, benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack, baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu, feng.han On Tue, Jun 03, 2025 at 04:18:22PM +0200, Christian König wrote: > > Does it matter compared to the I/O in this case? > > It unfortunately does, see the numbers on patch 3 and 4. That's kinda weird. Why does the page table lookup tage so much time compared to normal I/O? > My question is rather if it's ok to call f_op->write_iter() and > f_op->read_iter() with pages allocated by alloc_pages(), e.g. > where drivers potentially ignore the page count and just re-use pages > as they like? read_iter and write_iter with ITER_BVEC just use the pages as source and destination of the I/O. They must not touch the refcounts or do anything fancy with them. Various places in the kernel rely on that. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range 2025-06-03 14:28 ` Christoph Hellwig @ 2025-06-03 15:55 ` Christian König 2025-06-03 16:01 ` Christoph Hellwig 0 siblings, 1 reply; 30+ messages in thread From: Christian König @ 2025-06-03 15:55 UTC (permalink / raw) To: Christoph Hellwig Cc: wangtao, sumit.semwal, kraxel, vivek.kasireddy, viro, brauner, hughd, akpm, amir73il, benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, jack, baolin.wang, linux-media, dri-devel, linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm, bintian.wang, yipengxiang, liulu.liu, feng.han On 6/3/25 16:28, Christoph Hellwig wrote: > On Tue, Jun 03, 2025 at 04:18:22PM +0200, Christian König wrote: >>> Does it matter compared to the I/O in this case? >> >> It unfortunately does, see the numbers on patch 3 and 4. > > That's kinda weird. Why does the page table lookup tage so much > time compared to normal I/O? I have absolutely no idea. It's rather surprising for me as well. The user seems to have a rather slow CPU paired with fast I/O, but it still looks rather fishy to me. Additional to that allocating memory through memfd_create() is *much* slower on that box than through dma-buf-heaps (which basically just uses GFP and an array). We have seen something similar with customers systems which we couldn't explain so far. >> My question is rather if it's ok to call f_op->write_iter() and >> f_op->read_iter() with pages allocated by alloc_pages(), e.g. >> where drivers potentially ignore the page count and just re-use pages >> as they like? > > read_iter and write_iter with ITER_BVEC just use the pages as source > and destination of the I/O. They must not touch the refcounts or > do anything fancy with them. Various places in the kernel rely on > that. Perfect, thanks for that info. Regards, Christian. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
  2025-06-03 15:55 ` Christian König
@ 2025-06-03 16:01 ` Christoph Hellwig
  2025-06-06  9:59 ` wangtao
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2025-06-03 16:01 UTC (permalink / raw)
To: Christian König
Cc: Christoph Hellwig, wangtao, sumit.semwal, kraxel, vivek.kasireddy,
  viro, brauner, hughd, akpm, amir73il, benjamin.gaignard, Brian.Starkey,
  jstultz, tjmercier, jack, baolin.wang, linux-media, dri-devel,
  linaro-mm-sig, linux-kernel, linux-fsdevel, linux-mm, bintian.wang,
  yipengxiang, liulu.liu, feng.han

On Tue, Jun 03, 2025 at 05:55:18PM +0200, Christian König wrote:
> On 6/3/25 16:28, Christoph Hellwig wrote:
> > On Tue, Jun 03, 2025 at 04:18:22PM +0200, Christian König wrote:
> >>> Does it matter compared to the I/O in this case?
> >>
> >> It unfortunately does, see the numbers on patch 3 and 4.
> >
> > That's kinda weird. Why does the page table lookup take so much
> > time compared to normal I/O?
>
> I have absolutely no idea. It's rather surprising to me as well.
>
> The user seems to have a rather slow CPU paired with fast I/O, but it
> still looks rather fishy to me.
>
> In addition to that, allocating memory through memfd_create() is *much*
> slower on that box than through dma-buf heaps (which basically just use
> GFP and an array).

Can someone try to reproduce these results on a normal system before
we're building infrastructure based on these numbers?

^ permalink raw reply	[flat|nested] 30+ messages in thread
* RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
  2025-06-03 16:01 ` Christoph Hellwig
@ 2025-06-06  9:59 ` wangtao
  0 siblings, 0 replies; 30+ messages in thread
From: wangtao @ 2025-06-06 9:59 UTC (permalink / raw)
To: Christoph Hellwig, Christian König
Cc: sumit.semwal@linaro.org, kraxel@redhat.com, vivek.kasireddy@intel.com,
  viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com,
  akpm@linux-foundation.org, amir73il@gmail.com,
  benjamin.gaignard@collabora.com, Brian.Starkey@arm.com,
  jstultz@google.com, tjmercier@google.com, jack@suse.cz,
  baolin.wang@linux.alibaba.com, linux-media@vger.kernel.org,
  dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
  linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
  linux-mm@kvack.org, wangbintian(BintianWang), yipengxiang,
  liulu 00013167, hanfeng 00012985

> -----Original Message-----
> From: Christoph Hellwig <hch@infradead.org>
> Sent: Wednesday, June 4, 2025 12:02 AM
> To: Christian König <christian.koenig@amd.com>
> Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via
> copy_file_range
>
> On Tue, Jun 03, 2025 at 05:55:18PM +0200, Christian König wrote:
> > On 6/3/25 16:28, Christoph Hellwig wrote:
> > > On Tue, Jun 03, 2025 at 04:18:22PM +0200, Christian König wrote:
> > >>> Does it matter compared to the I/O in this case?
> > >>
> > >> It unfortunately does, see the numbers on patch 3 and 4.
> > >
> > > That's kinda weird. Why does the page table lookup take so much
> > > time compared to normal I/O?
> >
> > I have absolutely no idea. It's rather surprising to me as well.
> >
> > The user seems to have a rather slow CPU paired with fast I/O, but it
> > still looks rather fishy to me.
> >
> > In addition to that, allocating memory through memfd_create() is
> > *much* slower on that box than through dma-buf heaps (which basically
> > just use GFP and an array).
>
> Can someone try to reproduce these results on a normal system before
> we're building infrastructure based on these numbers?

Here's my test program. If anyone's interested, please help test it.

Regards,
Wangtao.

[PATCH] Add dmabuf direct I/O zero-copy test program

Compare latency and throughput of file read/write for memfd,
udmabuf+memfd, udmabuf, and dmabuf buffers.

memfd supports buffer I/O and direct I/O via the read/write, sendfile,
and splice user APIs. udmabuf/dmabuf only support buffer I/O via
read/write, lacking direct I/O, sendfile, and splice support. A
previous patch added dmabuf's copy_file_range callback, enabling
buffer/direct I/O file copies for udmabuf/dmabuf.

u+memfd denotes using a memfd-created udmabuf with memfd's user APIs
for file copying.
usage: dmabuf-dio [file_path] [size_MB]

Signed-off-by: wangtao <tao.wangtao@honor.com>
---
 tools/testing/selftests/dmabuf-heaps/Makefile |   1 +
 .../selftests/dmabuf-heaps/dmabuf-dio.c       | 617 ++++++++++++++++++
 2 files changed, 618 insertions(+)
 create mode 100644 tools/testing/selftests/dmabuf-heaps/dmabuf-dio.c

diff --git a/tools/testing/selftests/dmabuf-heaps/Makefile b/tools/testing/selftests/dmabuf-heaps/Makefile
index 9e7e158d5fa3..beb6b3e55e17 100644
--- a/tools/testing/selftests/dmabuf-heaps/Makefile
+++ b/tools/testing/selftests/dmabuf-heaps/Makefile
@@ -2,5 +2,6 @@
 CFLAGS += -static -O3 -Wl,-no-as-needed -Wall $(KHDR_INCLUDES)
 
 TEST_GEN_PROGS = dmabuf-heap
+TEST_GEN_PROGS += dmabuf-dio
 
 include ../lib.mk
diff --git a/tools/testing/selftests/dmabuf-heaps/dmabuf-dio.c b/tools/testing/selftests/dmabuf-heaps/dmabuf-dio.c
new file mode 100644
index 000000000000..eae902a27f29
--- /dev/null
+++ b/tools/testing/selftests/dmabuf-heaps/dmabuf-dio.c
@@ -0,0 +1,617 @@
+#include <linux/dma-heap.h>
+#include <linux/dma-buf.h>
+#include <linux/udmabuf.h>
+#include <sys/mman.h>
+#include <sys/sendfile.h>
+#include <sys/ioctl.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdio.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <asm/unistd.h>
+#include <time.h>
+#include <errno.h>
+
+#ifndef TEMP_FAILURE_RETRY
+#define TEMP_FAILURE_RETRY(exp) ({ \
+    __typeof__(exp) _rc; \
+    do { \
+        _rc = (exp); \
+    } while (_rc == -1 && errno == EINTR); \
+    _rc; })
+#endif
+
+#if 1
+int memfd_create(const char *name, unsigned flags)
+{
+    return syscall(__NR_memfd_create, name, flags);
+}
+
+ssize_t copy_file_range(int fd_in, off_t *off_in, int fd_out, off_t *off_out,
+        size_t len, unsigned flags)
+{
+    return syscall(__NR_copy_file_range, fd_in, off_in, fd_out, off_out, len, flags);
+}
+#endif
+
+int alloc_memfd(size_t size)
+{
+    int memfd = memfd_create("ubuf", MFD_ALLOW_SEALING);
+    if (memfd < 0)
+        return -1;
+
+    int ret = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);
+    if (ret < 0)
+        return -1;
+    ret = TEMP_FAILURE_RETRY(ftruncate(memfd, size));
+    if (ret < 0)
+        return -1;
+    return memfd;
+}
+
+int alloc_udmabuf(size_t size, int memfd)
+{
+    static int udev_fd = -1;
+    if (udev_fd < 0) {
+        udev_fd = open("/dev/udmabuf", O_RDONLY);
+        if (udev_fd < 0)
+            return -1;
+    }
+
+    struct udmabuf_create uc = {0};
+    uc.memfd = memfd;
+    uc.offset = 0;
+    uc.size = size;
+    int buf_fd = TEMP_FAILURE_RETRY(ioctl(udev_fd, UDMABUF_CREATE, &uc));
+    if (buf_fd < 0)
+        return -1;
+
+    return buf_fd;
+}
+
+int alloc_dmabuf(size_t size)
+{
+    static int heap_fd = -1;
+
+    struct dma_heap_allocation_data heap_data = { 0 };
+    heap_data.len = size; // length of data to be allocated in bytes
+    heap_data.fd_flags = O_RDWR | O_CLOEXEC; // permissions for the memory to be allocated
+
+    if (heap_fd < 0) {
+        heap_fd = open("/dev/dma_heap/system", O_RDONLY);
+        if (heap_fd < 0)
+            return -1;
+    }
+
+    int ret = TEMP_FAILURE_RETRY(ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &heap_data));
+    if (ret < 0) {
+        return -1;
+    }
+    if (heap_data.fd < 0)
+        return -1;
+
+    return heap_data.fd;
+}
+
+static inline long times_us_duration(struct timespec *ts_start, struct timespec *ts_end)
+{
+    long long start = ts_start->tv_sec * 1000000 + ts_start->tv_nsec / 1000;
+    long long end = ts_end->tv_sec * 1000000 + ts_end->tv_nsec / 1000;
+    return end - start;
+}
+
+static inline long time_us2ms(long us)
+{
+    return (us + 1000 - 1) / 1000;
+}
+
+void drop_pagecaches(int file_fd, loff_t offset, size_t len)
+{
+    if (file_fd >= 0 && len > 0) {
+        posix_fadvise(file_fd, offset, len, POSIX_FADV_DONTNEED);
+        return;
+    }
+
+    int fd = open("/proc/sys/vm/drop_caches", O_WRONLY | O_CLOEXEC);
+    if (fd < 0) {
+        printf("open drop_caches failed %d\n", errno);
+        return;
+    }
+    write(fd, "3", 1);
+    close(fd);
+}
+
+const size_t SIZ_MB = 1024 * 1024;
+const size_t DMABUF_SIZE_MAX = SIZ_MB * 32;
+
+static inline unsigned char test_data_value(unsigned int val)
+{
+    return val % 253;
+}
+
+void test_fill_data(unsigned char* ptr, unsigned int val, size_t sz, bool fast)
+{
+    if (sz > 0 && fast) {
+        ptr[0] = test_data_value(val);
+        ptr[sz / 2] = test_data_value(val + sz / 2);
+        ptr[sz - 1] = test_data_value(val + sz - 1);
+        return;
+    }
+    for (size_t i = 0; i < sz; i++) {
+        ptr[i] = test_data_value(val + i);
+    }
+}
+
+bool test_check_data(unsigned char* ptr, unsigned int val, size_t sz, bool fast)
+{
+    if (sz > 0 && fast) {
+        if (ptr[0] != test_data_value(val))
+            return false;
+        if (ptr[sz / 2] != test_data_value(val + sz / 2))
+            return false;
+        if (ptr[sz - 1] != test_data_value(val + sz - 1))
+            return false;
+        return true;
+    }
+    for (size_t i = 0; i < sz; i++) {
+        if (ptr[i] != test_data_value(val + i))
+            return false;
+    }
+    return true;
+}
+
+enum mem_buf_type {
+    BUF_MEMFD,
+    BUF_UDMA_MEMFD,
+    BUF_UDMABUF,
+    BUF_DMABUF,
+    BUF_TYPE_MAX,
+};
+
+enum copy_io_type {
+    IO_MAP_READ_WRITE,
+    IO_SENDFILE,
+    IO_SPLICE,
+    IO_COPY_FILE_RANGE,
+    IO_TYPE_MAX,
+};
+
+static const char *mem_buf_type_descs[BUF_TYPE_MAX] = {
+    "memfd", "u+memfd", "udmabuf", "dmabuf",
+};
+
+static const char *io_type_descs[IO_TYPE_MAX] = {
+    "R/W", "sendfile", "splice", "c_f_r",
+};
+
+struct mem_buf_st {
+    enum mem_buf_type buf_type_;
+    int io_fd_;
+    int mem_fd_;
+    int buf_fd_;
+    size_t buf_len_;
+    unsigned char *buf_ptr_;
+};
+
+struct mem_buf_tc {
+    enum mem_buf_type buf_type_;
+    enum copy_io_type io_type_;
+    int file_fd_;
+    bool direct_io_;
+    size_t io_len_;
+    long times_create_;
+    long times_data_;
+    long times_io_;
+    long times_close_;
+};
+
+void membuf_deinit(struct mem_buf_st *membuf)
+{
+    if (membuf->buf_ptr_ != NULL && membuf->buf_ptr_ != MAP_FAILED)
+        munmap(membuf->buf_ptr_, membuf->buf_len_);
+    membuf->buf_ptr_ = NULL;
+    if (membuf->mem_fd_ > 0)
+        close(membuf->mem_fd_);
+    if (membuf->buf_fd_ > 0)
+        close(membuf->buf_fd_);
+    membuf->mem_fd_ = -1;
+    membuf->buf_fd_ = -1;
+}
+
+bool membuf_init(struct mem_buf_st *membuf, size_t len, enum mem_buf_type buf_type)
+{
+    int map_fd = -1;
+
+    membuf->mem_fd_ = -1;
+    membuf->buf_fd_ = -1;
+    membuf->buf_len_ = len;
+    membuf->buf_ptr_ = NULL;
+    if (buf_type <= BUF_UDMABUF) {
+        membuf->mem_fd_ = alloc_memfd(len);
+        if (membuf->mem_fd_ < 0) {
+            printf("alloc memfd %zd failed %d\n", len, errno);
+            return false;
+        }
+        map_fd = membuf->mem_fd_;
+        if (buf_type > BUF_MEMFD) {
+            membuf->buf_fd_ = alloc_udmabuf(len, membuf->mem_fd_);
+            if (membuf->buf_fd_ < 0) {
+                printf("alloc udmabuf %zd failed %d\n", len, errno);
+                return false;
+            }
+            if (buf_type == BUF_UDMABUF)
+                map_fd = membuf->buf_fd_;
+        }
+    } else {
+        membuf->buf_fd_ = alloc_dmabuf(len);
+        if (membuf->buf_fd_ < 0) {
+            printf("alloc dmabuf %zd failed %d\n", len, errno);
+            return false;
+        }
+        map_fd = membuf->buf_fd_;
+    }
+    membuf->io_fd_ = map_fd;
+    membuf->buf_ptr_ = (unsigned char *)mmap(NULL, len,
+            PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0);
+    if (membuf->buf_ptr_ == MAP_FAILED) {
+        printf("fd %d map %zd failed %d\n", map_fd, len, errno);
+        membuf->buf_ptr_ = NULL;
+        return false;
+    }
+    return true;
+}
+
+ssize_t membuf_read_write(const struct mem_buf_st *membuf, int file_fd,
+        loff_t off, bool is_read)
+{
+    if (!membuf->buf_ptr_)
+        return -1;
+    lseek(file_fd, off, SEEK_SET);
+    if (is_read)
+        return read(file_fd, membuf->buf_ptr_, membuf->buf_len_);
+    else
+        return write(file_fd, membuf->buf_ptr_, membuf->buf_len_);
+}
+
+ssize_t membuf_sendfile(const struct mem_buf_st *membuf, int file_fd,
+        loff_t off, bool is_read)
+{
+    int mem_fd = membuf->io_fd_;
+    size_t buf_len = membuf->buf_len_;
+
+    if (mem_fd < 0)
+        return -__LINE__;
+
+    lseek(mem_fd, 0, SEEK_SET);
+    lseek(file_fd, off, SEEK_SET);
+    if (is_read)
+        return sendfile(mem_fd, file_fd, NULL, buf_len);
+    else
+        return sendfile(file_fd, mem_fd, NULL, buf_len);
+}
+
+ssize_t membuf_splice(const struct mem_buf_st *membuf, int file_fd,
+        loff_t off, bool is_read)
+{
+    size_t len = 0, out_len = 0, buf_len = membuf->buf_len_;
+    int mem_fd = membuf->io_fd_;
+    int fd_in = file_fd, fd_out = mem_fd;
+    ssize_t ret = 0;
+    static int s_pipe_fds[2] = { -1, -1};
+
+    if (mem_fd < 0)
+        return -__LINE__;
+
+    lseek(mem_fd, 0, SEEK_SET);
+    lseek(file_fd, off, SEEK_SET);
+    if (s_pipe_fds[0] < 0) {
+        const int pipe_size = SIZ_MB * 32;
+        int pipe_fds[2];
+        ret = pipe(pipe_fds);
+        if (ret < 0)
+            return -__LINE__;
+        ret = fcntl(pipe_fds[1], F_SETPIPE_SZ, pipe_size);
+        if (ret < 0)
+            return -__LINE__;
+        ret = fcntl(pipe_fds[0], F_GETPIPE_SZ, pipe_size);
+        if (ret != pipe_size)
+            return -__LINE__;
+        s_pipe_fds[0] = pipe_fds[0];
+        s_pipe_fds[1] = pipe_fds[1];
+    }
+
+    if (!is_read) {
+        fd_in = mem_fd;
+        fd_out = file_fd;
+    }
+
+    while (buf_len > len) {
+        ret = splice(fd_in, NULL, s_pipe_fds[1], NULL, buf_len - len, SPLICE_F_NONBLOCK);
+        if (ret <= 0)
+            break;
+        len += ret;
+        do {
+            ret = splice(s_pipe_fds[0], NULL, fd_out, NULL, len - out_len, 0);
+            if (ret <= 0)
+                break;
+            out_len += ret;
+        } while (out_len < len);
+    }
+    return out_len > 0 ? out_len : ret;
+}
+
+ssize_t membuf_cfr(const struct mem_buf_st *membuf, int file_fd, loff_t off,
+        bool is_read)
+{
+    loff_t mem_pos = 0;
+    loff_t file_pos = off;
+    size_t out_len = 0, buf_len = membuf->buf_len_;
+    int mem_fd = membuf->io_fd_;
+    int fd_in = file_fd, fd_out = mem_fd;
+    loff_t pos_in = file_pos, pos_out = mem_pos;
+    ssize_t ret = 0;
+
+    if (mem_fd < 0)
+        return -__LINE__;
+
+    lseek(mem_fd, mem_pos, SEEK_SET);
+    lseek(file_fd, file_pos, SEEK_SET);
+
+    if (!is_read) {
+        fd_in = mem_fd;
+        fd_out = file_fd;
+        pos_in = mem_pos;
+        pos_out = file_pos;
+    }
+
+    while (buf_len > out_len) {
+        ret = copy_file_range(fd_in, &pos_in, fd_out, &pos_out, buf_len - out_len, 0);
+        if (ret <= 0)
+            break;
+        out_len += ret;
+    }
+    return out_len > 0 ? out_len : ret;
+}
+
+ssize_t membuf_io(const struct mem_buf_st *membuf, int file_fd, loff_t off,
+        bool is_read, enum copy_io_type io_type)
+{
+    ssize_t ret = 0;
+    if (io_type == IO_MAP_READ_WRITE) {
+        ret = membuf_read_write(membuf, file_fd, off, is_read);
+    } else if (io_type == IO_SENDFILE) {
+        ret = membuf_sendfile(membuf, file_fd, off, is_read);
+    } else if (io_type == IO_SPLICE) {
+        ret = membuf_splice(membuf, file_fd, off, is_read);
+    } else if (io_type == IO_COPY_FILE_RANGE) {
+        ret = membuf_cfr(membuf, file_fd, off, is_read);
+    } else
+        return -1;
+    if (ret < 0)
+        printf("membuf_io io failed %d\n", errno);
+    return ret;
+}
+
+const char *membuf_tc_desc(const struct mem_buf_tc *tc)
+{
+    static char buf[32];
+    snprintf(buf, sizeof(buf), "%s %s %s", mem_buf_type_descs[tc->buf_type_],
+        tc->direct_io_ ? "direct" : "buffer", io_type_descs[tc->io_type_]);
+    return buf;
+}
+
+bool test_membuf(struct mem_buf_tc *tc, loff_t pos, size_t file_len,
+        size_t buf_siz, bool is_read, bool clean_pagecaches)
+{
+    loff_t off = pos, file_end = pos + file_len;
+    int file_fd = tc->file_fd_;
+    int i = 0, buf_num;
+    struct mem_buf_st *membufs;
+    struct timespec ts_start, ts_end;
+    ssize_t ret;
+
+    if (buf_siz > file_len)
+        buf_siz = file_len;
+    buf_num = (file_len + buf_siz - 1) / buf_siz;
+    membufs = (struct mem_buf_st *)malloc(sizeof(*membufs) * buf_num);
+    if (!membufs)
+        return false;
+
+    memset(membufs, 0, sizeof(*membufs) * buf_num);
+    drop_pagecaches(-1, 0, 0);
+    for (i = 0; i < buf_num && off < file_end; i++, off += buf_siz) {
+        if (buf_siz > file_end - off)
+            buf_siz = file_end - off;
+
+        if (clean_pagecaches)
+            drop_pagecaches(file_fd, off, buf_siz);
+
+        clock_gettime(CLOCK_MONOTONIC, &ts_start);
+        if (!membuf_init(&membufs[i], buf_siz, tc->buf_type_)) {
+            printf("alloc %s %d failed\n", membuf_tc_desc(tc), i);
+            break;
+        }
+        clock_gettime(CLOCK_MONOTONIC, &ts_end);
+        tc->times_create_ += times_us_duration(&ts_start, &ts_end);
+
+        clock_gettime(CLOCK_MONOTONIC, &ts_start);
+        if (!membufs[i].buf_ptr_) {
+            printf("map %s %d failed\n", membuf_tc_desc(tc), i);
+            break;
+        }
+        if (!is_read)
+            test_fill_data(membufs[i].buf_ptr_, off + 1, buf_siz, true);
+        clock_gettime(CLOCK_MONOTONIC, &ts_end);
+        tc->times_data_ += times_us_duration(&ts_start, &ts_end);
+
+        clock_gettime(CLOCK_MONOTONIC, &ts_start);
+        ret = membuf_io(&membufs[i], file_fd, off, is_read, tc->io_type_);
+        if (ret < 0 || ret != buf_siz) {
+            printf("membuf_io %s %d rw %zd ret %zd failed %d\n",
+                membuf_tc_desc(tc), i, buf_siz, ret, errno);
+            break;
+        }
+        clock_gettime(CLOCK_MONOTONIC, &ts_end);
+        tc->times_io_ += times_us_duration(&ts_start, &ts_end);
+
+        clock_gettime(CLOCK_MONOTONIC, &ts_start);
+        if (!test_check_data(membufs[i].buf_ptr_, off + 1, buf_siz, true)) {
+            printf("check data %s %d failed\n", membuf_tc_desc(tc), i);
+            break;
+        }
+        clock_gettime(CLOCK_MONOTONIC, &ts_end);
+        tc->times_data_ += times_us_duration(&ts_start, &ts_end);
+
+        if (clean_pagecaches)
+            drop_pagecaches(file_fd, off, buf_siz);
+    }
+
+    clock_gettime(CLOCK_MONOTONIC, &ts_start);
+    for (i = 0; i < buf_num; i++) {
+        membuf_deinit(&membufs[i]);
+    }
+    clock_gettime(CLOCK_MONOTONIC, &ts_end);
+    tc->times_close_ += times_us_duration(&ts_start, &ts_end);
+    drop_pagecaches(-1, 0, 0);
+    tc->io_len_ = off - pos;
+    return off - pos >= file_end;
+}
+
+bool prepare_init_file(int file_fd, loff_t off, size_t file_len)
+{
+    struct mem_buf_st membuf = {};
+    ssize_t ret;
+
+    ftruncate(file_fd, off + file_len);
+
+    if (!membuf_init(&membuf, file_len, BUF_MEMFD))
+        return false;
+
+    test_fill_data(membuf.buf_ptr_, off + 1, file_len, false);
+    ret = membuf_io(&membuf, file_fd, off, false, IO_MAP_READ_WRITE);
+    membuf_deinit(&membuf);
+
+    return ret >= file_len;
+}
+
+bool prepare_file(const char *filepath, loff_t off, size_t file_len)
+{
+    ssize_t file_end;
+    bool suc = true;
+    int flags = O_RDWR | O_CLOEXEC | O_LARGEFILE | O_CREAT;
+    int file_fd = open(filepath, flags, 0660);
+    if (file_fd < 0) {
+        printf("open %s failed %d\n", filepath, errno);
+        return false;
+    }
+
+    file_end = (size_t)lseek(file_fd, 0, SEEK_END);
+    if (file_end < off + file_len)
+        suc = prepare_init_file(file_fd, off, file_len);
+
+    close(file_fd);
+    return suc;
+}
+
+void test_membuf_cases(struct mem_buf_tc *test_cases, int test_count,
+        loff_t off, size_t file_len, size_t buf_siz,
+        bool is_read, bool clean_pagecaches)
+{
+    char title[64];
+    long file_MB = (file_len + SIZ_MB - 1) / SIZ_MB;
+    long buf_MB = (buf_siz + SIZ_MB - 1) / SIZ_MB;
+    long base_io_time, io_times;
+    long base_io_speed, io_speed;
+    struct mem_buf_tc *tc;
+    int n;
+
+    for (n = 0; n < test_count; n++) {
+        test_membuf(&test_cases[n], off, file_len, buf_siz, is_read, clean_pagecaches);
+    }
+
+    snprintf(title, sizeof(title), "%ldx%ldMB %s %4ldMB",
+        (file_MB + buf_MB - 1) / buf_MB, buf_MB,
+        is_read ? "Read" : "Write", file_MB);
+
+    base_io_time = test_cases[0].times_io_ ?: 1;
+    base_io_speed = test_cases[0].io_len_ / base_io_time ? : 1000;
+
+    printf("| %-23s|%8s|%8s|%8s|%8s| I/O%%\n", title,
+        "Creat-ms", "Close-ms", "I/O-ms", "I/O-MB/s");
+    printf("|---------------------------|--------|--------|--------|--------|-----\n");
+    for (n = 0; n < test_count; n++) {
+        tc = &test_cases[n];
+        io_times = tc->times_io_;
+        io_speed = io_times > 0 ? tc->io_len_ / io_times : 0;
+
+        printf("|%2d) %23s| %6ld | %6ld | %6ld | %6ld |%4ld%%\n", n + 1,
+            membuf_tc_desc(tc), time_us2ms(tc->times_create_),
+            time_us2ms(tc->times_close_), time_us2ms(io_times),
+            io_speed, io_speed * 100 / base_io_speed);
+    }
+    return;
+}
+
+void test_all_membufs(const char *filepath, loff_t off, size_t file_len, size_t buf_siz,
+        bool is_read, bool clean_pagecaches)
+{
+    int buffer_fd;
+    int direct_fd;
+
+    buffer_fd = open(filepath, O_RDWR | O_CLOEXEC | O_LARGEFILE);
+    direct_fd = open(filepath, O_RDWR | O_CLOEXEC | O_LARGEFILE | O_DIRECT);
+    if (buffer_fd < 0 || direct_fd < 0) {
+        printf("buffer_fd %d direct_fd %d\n", buffer_fd, direct_fd);
+        return;
+    }
+
+    struct mem_buf_tc test_cases[] = {
+        {BUF_MEMFD, IO_MAP_READ_WRITE, buffer_fd, false},
+        {BUF_MEMFD, IO_MAP_READ_WRITE, direct_fd, true},
+        {BUF_UDMA_MEMFD, IO_MAP_READ_WRITE, buffer_fd, false},
+        {BUF_UDMA_MEMFD, IO_MAP_READ_WRITE, direct_fd, true},
+        {BUF_UDMA_MEMFD, IO_SENDFILE, buffer_fd, false},
+        {BUF_UDMA_MEMFD, IO_SENDFILE, direct_fd, true},
+        {BUF_UDMA_MEMFD, IO_SPLICE, buffer_fd, false},
+        {BUF_UDMA_MEMFD, IO_SPLICE, direct_fd, true},
+        {BUF_UDMABUF, IO_MAP_READ_WRITE, buffer_fd, false},
+        {BUF_DMABUF, IO_MAP_READ_WRITE, buffer_fd, false},
+        {BUF_UDMABUF, IO_COPY_FILE_RANGE, buffer_fd, false},
+        {BUF_UDMABUF, IO_COPY_FILE_RANGE, direct_fd, true},
+        {BUF_DMABUF, IO_COPY_FILE_RANGE, buffer_fd, false},
+        {BUF_DMABUF, IO_COPY_FILE_RANGE, direct_fd, true},
+    };
+
+    test_membuf_cases(test_cases, sizeof(test_cases) / sizeof(test_cases[0]),
+        off, file_len, buf_siz, is_read, clean_pagecaches);
+    close(buffer_fd);
+    close(direct_fd);
+}
+
+void usage(void)
+{
+    printf("usage: dmabuf-dio [file_path] [size_MB]\n");
+    return;
+}
+
+int main(int argc, char *argv[])
+{
+    const char *file_path = "/data/membuf.tmp";
+    size_t file_len = SIZ_MB * 1024;
+
+    if (argc > 1)
+        file_path = argv[1];
+    if (argc > 2)
+        file_len = atoi(argv[2]) * SIZ_MB;
+    if (file_len < 0)
+        file_len = SIZ_MB * 1024;
+    if (!prepare_file(file_path, 0, file_len)) {
+        usage();
+        return -1;
+    }
+
+    test_all_membufs(file_path, 0, file_len, SIZ_MB * 32, true, true);
+    return 0;
+}
--

^ permalink raw reply related	[flat|nested] 30+ messages in thread
* RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
  2025-06-03 13:19 ` Christoph Hellwig
  2025-06-03 14:18 ` Christian König
@ 2025-06-06  9:52 ` wangtao
  2025-06-06 11:20 ` Christian König
  1 sibling, 1 reply; 30+ messages in thread
From: wangtao @ 2025-06-06 9:52 UTC (permalink / raw)
To: Christoph Hellwig, Christian König
Cc: sumit.semwal@linaro.org, kraxel@redhat.com, vivek.kasireddy@intel.com,
  viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com,
  akpm@linux-foundation.org, amir73il@gmail.com,
  benjamin.gaignard@collabora.com, Brian.Starkey@arm.com,
  jstultz@google.com, tjmercier@google.com, jack@suse.cz,
  baolin.wang@linux.alibaba.com, linux-media@vger.kernel.org,
  dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
  linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
  linux-mm@kvack.org, wangbintian(BintianWang), yipengxiang,
  liulu 00013167, hanfeng 00012985

> -----Original Message-----
> From: Christoph Hellwig <hch@infradead.org>
> Sent: Tuesday, June 3, 2025 9:20 PM
> To: Christian König <christian.koenig@amd.com>
> Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via
> copy_file_range
>
> On Tue, Jun 03, 2025 at 03:14:20PM +0200, Christian König wrote:
> > On 6/3/25 15:00, Christoph Hellwig wrote:
> > > This is a really weird interface. No one has yet explained why
> > > dmabuf is so special that we can't support direct I/O to it when we
> > > can support it to otherwise exotic mappings like PCI P2P ones.
> >
> > With udmabuf you can do direct I/O, it's just inefficient to walk the
> > page tables for it when you already have an array of all the folios.
>
> Does it matter compared to the I/O in this case?
>
> Either way there has been talk (in case of networking implementations)
> that use a dmabuf as a first class container for lower level I/O.
> I'd much rather do that than adding odd side interfaces. I.e. have a
> version of splice that doesn't bother with the pipe, but instead just
> uses in-kernel direct I/O on one side and dmabuf-provided folios on the
> other.

If the VFS layer recognized the dmabuf type and acquired its sg_table
and folios, zero-copy could also be achieved. I initially thought
dmabuf acts as a driver and shouldn't be handled by the VFS, so I made
dmabuf implement copy_file_range callbacks to support direct I/O
zero-copy. I'm open to both approaches. What's the preference of the
VFS experts?

Regards,
Wangtao.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
  2025-06-06  9:52 ` wangtao
@ 2025-06-06 11:20 ` Christian König
  2025-06-09  4:35 ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Christian König @ 2025-06-06 11:20 UTC (permalink / raw)
To: wangtao, Christoph Hellwig
Cc: sumit.semwal@linaro.org, kraxel@redhat.com, vivek.kasireddy@intel.com,
  viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com,
  akpm@linux-foundation.org, amir73il@gmail.com,
  benjamin.gaignard@collabora.com, Brian.Starkey@arm.com,
  jstultz@google.com, tjmercier@google.com, jack@suse.cz,
  baolin.wang@linux.alibaba.com, linux-media@vger.kernel.org,
  dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
  linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
  linux-mm@kvack.org, wangbintian(BintianWang), yipengxiang,
  liulu 00013167, hanfeng 00012985

On 6/6/25 11:52, wangtao wrote:
>> On Tue, Jun 03, 2025 at 03:14:20PM +0200, Christian König wrote:
>>> On 6/3/25 15:00, Christoph Hellwig wrote:
>>>> This is a really weird interface. No one has yet explained why
>>>> dmabuf is so special that we can't support direct I/O to it when we
>>>> can support it to otherwise exotic mappings like PCI P2P ones.
>>>
>>> With udmabuf you can do direct I/O, it's just inefficient to walk the
>>> page tables for it when you already have an array of all the folios.
>>
>> Does it matter compared to the I/O in this case?
>>
>> Either way there has been talk (in case of networking implementations)
>> that use a dmabuf as a first class container for lower level I/O.
>> I'd much rather do that than adding odd side interfaces. I.e. have a
>> version of splice that doesn't bother with the pipe, but instead just
>> uses in-kernel direct I/O on one side and dmabuf-provided folios on
>> the other.
>
> If the VFS layer recognized the dmabuf type and acquired its sg_table
> and folios, zero-copy could also be achieved. I initially thought
> dmabuf acts as a driver and shouldn't be handled by the VFS, so I made
> dmabuf implement copy_file_range callbacks to support direct I/O
> zero-copy. I'm open to both approaches. What's the preference of the
> VFS experts?

That would probably be illegal. Using the sg_table in the DMA-buf
implementation turned out to be a mistake.

The question Christoph raised was rather: why is your CPU so slow that
walking the page tables has a significant overhead compared to the
actual I/O?

Regards,
Christian.

> Regards,
> Wangtao.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
  2025-06-06 11:20 ` Christian König
@ 2025-06-09  4:35 ` Christoph Hellwig
  2025-06-09  9:32 ` wangtao
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2025-06-09 4:35 UTC (permalink / raw)
To: Christian König
Cc: wangtao, Christoph Hellwig, sumit.semwal@linaro.org,
  kraxel@redhat.com, vivek.kasireddy@intel.com, viro@zeniv.linux.org.uk,
  brauner@kernel.org, hughd@google.com, akpm@linux-foundation.org,
  amir73il@gmail.com, benjamin.gaignard@collabora.com,
  Brian.Starkey@arm.com, jstultz@google.com, tjmercier@google.com,
  jack@suse.cz, baolin.wang@linux.alibaba.com,
  linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org,
  linaro-mm-sig@lists.linaro.org, linux-kernel@vger.kernel.org,
  linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
  wangbintian(BintianWang), yipengxiang, liulu 00013167, hanfeng 00012985

On Fri, Jun 06, 2025 at 01:20:48PM +0200, Christian König wrote:
> > dmabuf acts as a driver and shouldn't be handled by the VFS, so I made
> > dmabuf implement copy_file_range callbacks to support direct I/O
> > zero-copy. I'm open to both approaches. What's the preference of the
> > VFS experts?
>
> That would probably be illegal. Using the sg_table in the DMA-buf
> implementation turned out to be a mistake.

Two things here that should not be directly conflated. Using the
sg_table was a huge mistake, and we should try to move dmabuf to switch
that to a pure dma_addr_t/len array now that the new DMA API supporting
that has been merged. Is there any chance the dma-buf maintainers could
start to kick this off? I'm of course happy to assist.

But that notwithstanding, dma-buf is THE buffer sharing mechanism in
the kernel, and we should promote it instead of reinventing it badly.
And there is a use case for having a fully DMA mapped buffer in the
block layer and I/O path, especially on systems with an IOMMU. So
having an iov_iter backed by a dma-buf would be extremely helpful.
That's mostly lib/iov_iter.c code, not VFS, though.

> The question Christoph raised was rather: why is your CPU so slow that
> walking the page tables has a significant overhead compared to the
> actual I/O?

Yes, that's really puzzling and should be addressed first.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
  2025-06-09  4:35 ` Christoph Hellwig
@ 2025-06-09  9:32 ` wangtao
  2025-06-10 10:52 ` Christian König
  2025-06-10 13:34 ` Christoph Hellwig
  0 siblings, 2 replies; 30+ messages in thread
From: wangtao @ 2025-06-09 9:32 UTC (permalink / raw)
To: Christoph Hellwig, Christian König
Cc: sumit.semwal@linaro.org, kraxel@redhat.com, vivek.kasireddy@intel.com,
  viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com,
  akpm@linux-foundation.org, amir73il@gmail.com,
  benjamin.gaignard@collabora.com, Brian.Starkey@arm.com,
  jstultz@google.com, tjmercier@google.com, jack@suse.cz,
  baolin.wang@linux.alibaba.com, linux-media@vger.kernel.org,
  dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
  linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
  linux-mm@kvack.org, wangbintian(BintianWang), yipengxiang,
  liulu 00013167, hanfeng 00012985

> -----Original Message-----
> From: Christoph Hellwig <hch@infradead.org>
> Sent: Monday, June 9, 2025 12:35 PM
> Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via
> copy_file_range
>
> On Fri, Jun 06, 2025 at 01:20:48PM +0200, Christian König wrote:
> > > dmabuf acts as a driver and shouldn't be handled by the VFS, so I
> > > made dmabuf implement copy_file_range callbacks to support direct
> > > I/O zero-copy. I'm open to both approaches. What's the preference
> > > of the VFS experts?
> >
> > That would probably be illegal. Using the sg_table in the DMA-buf
> > implementation turned out to be a mistake.
>
> Two things here that should not be directly conflated. Using the
> sg_table was a huge mistake, and we should try to move dmabuf to switch
> that to a pure dma_addr_t/len array now that the new DMA API supporting
> that has been merged. Is there any chance the dma-buf maintainers could
> start to kick this off? I'm of course happy to assist.

I'm a bit confused: don't dmabuf importers need to traverse the
sg_table to access folios or dma_addr/len pairs? Do you mean
restricting sg_table access (e.g., only via iov_iter), or are you
proposing an alternative approach?

> But that notwithstanding, dma-buf is THE buffer sharing mechanism in
> the kernel, and we should promote it instead of reinventing it badly.
> And there is a use case for having a fully DMA mapped buffer in the
> block layer and I/O path, especially on systems with an IOMMU.
> So having an iov_iter backed by a dma-buf would be extremely helpful.
> That's mostly lib/iov_iter.c code, not VFS, though.

Are you suggesting adding an ITER_DMABUF type to iov_iter, or
implementing dmabuf-to-iov_bvec conversion within iov_iter?

> > The question Christoph raised was rather: why is your CPU so slow
> > that walking the page tables has a significant overhead compared to
> > the actual I/O?
>
> Yes, that's really puzzling and should be addressed first.

With high CPU performance (e.g., 3 GHz), GUP (get_user_pages) overhead
is relatively low, as the 3 GHz results show.
| 32x32MB Read 1024MB       |Creat-ms|Close-ms| I/O-ms |I/O-MB/s| I/O%
|---------------------------|--------|--------|--------|--------|-----
| 1) memfd direct R/W       |      1 |    118 |    312 |   3448 | 100%
| 2) u+memfd direct R/W     |    196 |    123 |    295 |   3651 | 105%
| 3) u+memfd direct sendfile|    175 |    102 |    976 |   1100 |  31%
| 4) u+memfd direct splice  |    173 |    103 |    443 |   2428 |  70%
| 5) udmabuf buffer R/W     |    183 |    100 |    453 |   2375 |  68%
| 6) dmabuf buffer R/W      |     34 |      4 |    427 |   2519 |  73%
| 7) udmabuf direct c_f_r   |    200 |    102 |    278 |   3874 | 112%
| 8) dmabuf direct c_f_r    |     36 |      5 |    269 |   4002 | 116%

With lower CPU performance (e.g., 1GHz), GUP overhead becomes more
significant (as seen in 1GHz tests).

| 32x32MB Read 1024MB       |Creat-ms|Close-ms| I/O-ms |I/O-MB/s| I/O%
|---------------------------|--------|--------|--------|--------|-----
| 1) memfd direct R/W       |      2 |    393 |    969 |   1109 | 100%
| 2) u+memfd direct R/W     |    592 |    424 |    570 |   1884 | 169%
| 3) u+memfd direct sendfile|    587 |    356 |   2229 |    481 |  43%
| 4) u+memfd direct splice  |    568 |    352 |    795 |   1350 | 121%
| 5) udmabuf buffer R/W     |    597 |    343 |   1238 |    867 |  78%
| 6) dmabuf buffer R/W      |     69 |     13 |   1128 |    952 |  85%
| 7) udmabuf direct c_f_r   |    595 |    345 |    372 |   2889 | 260%
| 8) dmabuf direct c_f_r    |     80 |     13 |    274 |   3929 | 354%

Regards,
Wangtao.

^ permalink raw reply	[flat|nested] 30+ messages in thread
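[Editorial note: for readers who want to see what the c_f_r rows above exercise from userspace, the caller side is just a copy_file_range() loop. A minimal, self-contained sketch follows; it copies between two ordinary temp files on one filesystem rather than into a dmabuf (the dmabuf target requires the patched kernel), and the function name `demo_cfr` and temp-file paths are illustrative, not part of the series.]

```c
/* Userspace sketch of the copy_file_range() loop behind the c_f_r rows
 * above. Copies 4 KiB between two temp files on the same filesystem;
 * the kernel performs the copy, so no user buffer is touched during it.
 * Returns 0 on success, -1 on failure. */
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int demo_cfr(void)
{
	char src[] = "/tmp/cfr_src_XXXXXX", dst[] = "/tmp/cfr_dst_XXXXXX";
	char buf[4096], chk[4096];
	int in = mkstemp(src), out = mkstemp(dst);

	if (in < 0 || out < 0)
		return -1;
	memset(buf, 0xa5, sizeof(buf));
	if (write(in, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
		return -1;

	/* In-kernel copy: copy_file_range() may return short counts,
	 * so loop until the requested length is transferred. */
	off_t off_in = 0, off_out = 0;
	size_t left = sizeof(buf);
	while (left > 0) {
		ssize_t n = copy_file_range(in, &off_in, out, &off_out,
					    left, 0);
		if (n <= 0)
			return -1;
		left -= (size_t)n;
	}

	/* Verify the destination matches the source. */
	if (pread(out, chk, sizeof(chk), 0) != (ssize_t)sizeof(chk))
		return -1;
	int ok = memcmp(buf, chk, sizeof(buf)) == 0;
	unlink(src);
	unlink(dst);
	close(in);
	close(out);
	return ok ? 0 : -1;
}
```

The series' point is that the same loop, with `out` replaced by a dmabuf fd, would currently fail with EXDEV because cross-FS copies to memory files are blocked.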
* Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range 2025-06-09 9:32 ` wangtao @ 2025-06-10 10:52 ` Christian König 2025-06-10 13:37 ` Christoph Hellwig 2025-06-10 13:34 ` Christoph Hellwig 1 sibling, 1 reply; 30+ messages in thread From: Christian König @ 2025-06-10 10:52 UTC (permalink / raw) To: wangtao, Christoph Hellwig Cc: sumit.semwal@linaro.org, kraxel@redhat.com, vivek.kasireddy@intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com, akpm@linux-foundation.org, amir73il@gmail.com, benjamin.gaignard@collabora.com, Brian.Starkey@arm.com, jstultz@google.com, tjmercier@google.com, jack@suse.cz, baolin.wang@linux.alibaba.com, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, wangbintian(BintianWang), yipengxiang, liulu 00013167, hanfeng 00012985 On 6/9/25 11:32, wangtao wrote: >> -----Original Message----- >> From: Christoph Hellwig <hch@infradead.org> >> Sent: Monday, June 9, 2025 12:35 PM >> To: Christian König <christian.koenig@amd.com> >> Cc: wangtao <tao.wangtao@honor.com>; Christoph Hellwig >> <hch@infradead.org>; sumit.semwal@linaro.org; kraxel@redhat.com; >> vivek.kasireddy@intel.com; viro@zeniv.linux.org.uk; brauner@kernel.org; >> hughd@google.com; akpm@linux-foundation.org; amir73il@gmail.com; >> benjamin.gaignard@collabora.com; Brian.Starkey@arm.com; >> jstultz@google.com; tjmercier@google.com; jack@suse.cz; >> baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org; dri- >> devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux- >> kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux- >> mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>; >> yipengxiang <yipengxiang@honor.com>; liulu 00013167 >> <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com> >> Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via >> copy_file_range >> >> On Fri, Jun 06, 
2025 at 01:20:48PM +0200, Christian König wrote:
>>>> dmabuf acts as a driver and shouldn't be handled by VFS, so I made
>>>> dmabuf implement copy_file_range callbacks to support direct I/O
>>>> zero-copy. I'm open to both approaches. What's the preference of VFS
>>>> experts?
>>>
>>> That would probably be illegal. Using the sg_table in the DMA-buf
>>> implementation turned out to be a mistake.
>>
>> Two thing here that should not be directly conflated. Using the sg_table was
>> a huge mistake, and we should try to move dmabuf to switch that to a pure
> I'm a bit confused: don't dmabuf importers need to traverse sg_table to
> access folios or dma_addr/len? Do you mean restricting sg_table access
> (e.g., only via iov_iter) or proposing alternative approaches?

No, accessing the pages/folios inside the sg_table of a DMA-buf is strictly
forbidden. We have removed most use cases of that over the years and push
back on generating new ones.

>> dma_addr_t/len array now that the new DMA API supporting that has been
>> merged. Is there any chance the dma-buf maintainers could start to kick this
>> off? I'm of course happy to assist.

Work on that is already underway for some time.

Most GPU drivers already do the sg_table -> DMA array conversion; I need
to push on the remaining ones to clean up. But there are also tons of
other users of dma_buf_map_attachment() which need to be converted.

>> But that notwithstanding, dma-buf is THE buffer sharing mechanism in the
>> kernel, and we should promote it instead of reinventing it badly.
>> And there is a use case for having a fully DMA mapped buffer in the block
>> layer and I/O path, especially on systems with an IOMMU.
>> So having an iov_iter backed by a dma-buf would be extremely helpful.
>> That's mostly lib/iov_iter.c code, not VFS, though.
> Are you suggesting adding an ITER_DMABUF type to iov_iter, or
> implementing dmabuf-to-iov_bvec conversion within iov_iter?

That would be rather nice to have, yeah.
>>> The question Christoph raised was rather why is your CPU so slow that
>>> walking the page tables has a significant overhead compared to the
>>> actual I/O?
>>
>> Yes, that's really puzzling and should be addressed first.
> With high CPU performance (e.g., 3GHz), GUP (get_user_pages) overhead
> is relatively low (observed in 3GHz tests).

Even on a low end CPU walking the page tables and grabbing references
shouldn't be that much of an overhead.

There must be some reason why you see so much CPU overhead. E.g.
compound pages are broken up or similar which should not happen in
the first place.

Regards,
Christian.

> | 32x32MB Read 1024MB       |Creat-ms|Close-ms| I/O-ms |I/O-MB/s| I/O%
> |---------------------------|--------|--------|--------|--------|-----
> | 1) memfd direct R/W       |      1 |    118 |    312 |   3448 | 100%
> | 2) u+memfd direct R/W     |    196 |    123 |    295 |   3651 | 105%
> | 3) u+memfd direct sendfile|    175 |    102 |    976 |   1100 |  31%
> | 4) u+memfd direct splice  |    173 |    103 |    443 |   2428 |  70%
> | 5) udmabuf buffer R/W     |    183 |    100 |    453 |   2375 |  68%
> | 6) dmabuf buffer R/W      |     34 |      4 |    427 |   2519 |  73%
> | 7) udmabuf direct c_f_r   |    200 |    102 |    278 |   3874 | 112%
> | 8) dmabuf direct c_f_r    |     36 |      5 |    269 |   4002 | 116%
>
> With lower CPU performance (e.g., 1GHz), GUP overhead becomes more
> significant (as seen in 1GHz tests).
> | 32x32MB Read 1024MB       |Creat-ms|Close-ms| I/O-ms |I/O-MB/s| I/O%
> |---------------------------|--------|--------|--------|--------|-----
> | 1) memfd direct R/W       |      2 |    393 |    969 |   1109 | 100%
> | 2) u+memfd direct R/W     |    592 |    424 |    570 |   1884 | 169%
> | 3) u+memfd direct sendfile|    587 |    356 |   2229 |    481 |  43%
> | 4) u+memfd direct splice  |    568 |    352 |    795 |   1350 | 121%
> | 5) udmabuf buffer R/W     |    597 |    343 |   1238 |    867 |  78%
> | 6) dmabuf buffer R/W      |     69 |     13 |   1128 |    952 |  85%
> | 7) udmabuf direct c_f_r   |    595 |    345 |    372 |   2889 | 260%
> | 8) dmabuf direct c_f_r    |     80 |     13 |    274 |   3929 | 354%
>
> Regards,
> Wangtao.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
  2025-06-10 10:52 ` Christian König
@ 2025-06-10 13:37 ` Christoph Hellwig
  2025-06-13  9:43 ` wangtao
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2025-06-10 13:37 UTC (permalink / raw)
To: Christian König
Cc: wangtao, Christoph Hellwig, sumit.semwal@linaro.org, kraxel@redhat.com,
  vivek.kasireddy@intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org,
  hughd@google.com, akpm@linux-foundation.org, amir73il@gmail.com,
  benjamin.gaignard@collabora.com, Brian.Starkey@arm.com, jstultz@google.com,
  tjmercier@google.com, jack@suse.cz, baolin.wang@linux.alibaba.com,
  linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org,
  linaro-mm-sig@lists.linaro.org, linux-kernel@vger.kernel.org,
  linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
  wangbintian(BintianWang), yipengxiang, liulu 00013167, hanfeng 00012985

On Tue, Jun 10, 2025 at 12:52:18PM +0200, Christian König wrote:
> >> dma_addr_t/len array now that the new DMA API supporting that has been
> >> merged. Is there any chance the dma-buf maintainers could start to kick this
> >> off? I'm of course happy to assist.
>
> Work on that is already underway for some time.
>
> Most GPU drivers already do sg_table -> DMA array conversion, I need
> to push on the remaining to clean up.

Do you have a pointer?

> >> Yes, that's really puzzling and should be addressed first.
> > With high CPU performance (e.g., 3GHz), GUP (get_user_pages) overhead
> > is relatively low (observed in 3GHz tests).
>
> Even on a low end CPU walking the page tables and grabbing references
> shouldn't be that much of an overhead.

Yes.

> There must be some reason why you see so much CPU overhead. E.g.
> compound pages are broken up or similar which should not happen in
> the first place.

pin_user_pages unfortunately outputs an array of PAGE_SIZE (modulo offset
and a shorter last length) struct pages.
The block direct I/O code has fairly recently grown code to reassemble
folios from those pages, which sped up some workloads.

Is this test using the block device or iomap direct I/O code? What kernel
version is it run on?

^ permalink raw reply	[flat|nested] 30+ messages in thread
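[Editorial note: the reassembly mentioned above is essentially a merge of physically contiguous PAGE_SIZE entries back into larger extents. The following is a userspace model of just that merging step, for illustration only — `coalesce_pfns` and `struct extent` are made-up names, not kernel API.]

```c
/* Illustrative model of folio reassembly: pin_user_pages() hands back
 * one struct page per PAGE_SIZE unit, and the direct I/O code then
 * merges physically contiguous entries into larger bio_vec-style
 * extents. This sketch models only the merging over page frame
 * numbers; it is not kernel code. */
#include <assert.h>
#include <stddef.h>

struct extent {
	unsigned long first_pfn;	/* first page frame of the extent */
	size_t npages;			/* contiguous pages in the extent */
};

/* Coalesce an array of page frame numbers into contiguous extents.
 * 'out' must have room for at least 'n' entries (worst case: nothing
 * is contiguous). Returns the number of extents produced. */
static size_t coalesce_pfns(const unsigned long *pfn, size_t n,
			    struct extent *out)
{
	size_t nr = 0;

	for (size_t i = 0; i < n; i++) {
		if (nr && pfn[i] == out[nr - 1].first_pfn + out[nr - 1].npages)
			out[nr - 1].npages++;	/* extends previous extent */
		else
			out[nr++] = (struct extent){ pfn[i], 1 };
	}
	return nr;
}
```

With large folios most inputs collapse into a few extents; with a fragmented page array the loop still walks every entry, which is the per-page cost being discussed.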
* RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range 2025-06-10 13:37 ` Christoph Hellwig @ 2025-06-13 9:43 ` wangtao 2025-06-16 5:24 ` Christoph Hellwig 0 siblings, 1 reply; 30+ messages in thread From: wangtao @ 2025-06-13 9:43 UTC (permalink / raw) To: Christoph Hellwig, Christian König Cc: sumit.semwal@linaro.org, kraxel@redhat.com, vivek.kasireddy@intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com, akpm@linux-foundation.org, amir73il@gmail.com, benjamin.gaignard@collabora.com, Brian.Starkey@arm.com, jstultz@google.com, tjmercier@google.com, jack@suse.cz, baolin.wang@linux.alibaba.com, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, wangbintian(BintianWang), yipengxiang, liulu 00013167, hanfeng 00012985 > On Tue, Jun 10, 2025 at 12:52:18PM +0200, Christian König wrote: > > >> dma_addr_t/len array now that the new DMA API supporting that has > > >> been merged. Is there any chance the dma-buf maintainers could > > >> start to kick this off? I'm of course happy to assist. > > > > Work on that is already underway for some time. > > > > Most GPU drivers already do sg_table -> DMA array conversion, I need > > to push on the remaining to clean up. > > Do you have a pointer? > > > >> Yes, that's really puzzling and should be addressed first. > > > With high CPU performance (e.g., 3GHz), GUP (get_user_pages) > > > overhead is relatively low (observed in 3GHz tests). > > > > Even on a low end CPU walking the page tables and grabbing references > > shouldn't be that much of an overhead. > > Yes. > > > > > There must be some reason why you see so much CPU overhead. E.g. > > compound pages are broken up or similar which should not happen in the > > first place. > > pin_user_pages outputs an array of PAGE_SIZE (modulo offset and shorter > last length) array strut pages unfortunately. 
> The block direct I/O code has
> grown code to reassemble folios from them fairly recently which did speed
> up some workloads.
>
> Is this test using the block device or iomap direct I/O code? What kernel
> version is it run on?

Here's my analysis on Linux 6.6 with F2FS/iomap, comparing udmabuf+memfd
direct read vs dmabuf direct c_f_r:

Systrace: On a high-end 3 GHz CPU, the former occupies >80% of the runtime
vs <20% for the latter. On a low-end 1 GHz CPU, the former becomes
CPU-bound.

Perf: For the former, bio_iov_iter_get_pages/get_user_pages dominate the
latency. The latter avoids this via lightweight bvec assignments.

|- 13.03% __arm64_sys_read
|-|- 13.03% f2fs_file_read_iter
|-|-|- 13.03% __iomap_dio_rw
|-|-|-|- 12.95% iomap_dio_bio_iter
|-|-|-|-|- 10.69% bio_iov_iter_get_pages
|-|-|-|-|-|- 10.53% iov_iter_extract_pages
|-|-|-|-|-|-|- 10.53% pin_user_pages_fast
|-|-|-|-|-|-|-|- 10.53% internal_get_user_pages_fast
|-|-|-|-|-|-|-|-|- 10.23% __gup_longterm_locked
|-|-|-|-|-|-|-|-|-|- 8.85% __get_user_pages
|-|-|-|-|-|-|-|-|-|-|- 6.26% handle_mm_fault
|-|-|-|-|- 1.91% iomap_dio_submit_bio
|-|-|-|-|-|- 1.64% submit_bio
|- 1.13% __arm64_sys_copy_file_range
|-|- 1.13% vfs_copy_file_range
|-|-|- 1.13% dma_buf_copy_file_range
|-|-|-|- 1.13% system_heap_dma_buf_rw_file
|-|-|-|-|- 1.13% f2fs_file_read_iter
|-|-|-|-|-|- 1.13% __iomap_dio_rw
|-|-|-|-|-|-|- 1.13% iomap_dio_bio_iter
|-|-|-|-|-|-|-|- 1.13% iomap_dio_submit_bio
|-|-|-|-|-|-|-|-|- 1.08% submit_bio

Large folios can reduce GUP overhead, but the result is still
significantly slower than the dmabuf-to-bio_vec conversion.

Regards,
Wangtao.

^ permalink raw reply	[flat|nested] 30+ messages in thread
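[Editorial note: the `dma_buf_copy_file_range` -> `system_heap_dma_buf_rw_file` frames in the trace above belong to this patch series. The "lightweight bvec assignments" can be sketched at pseudocode level roughly as below; this is not the series' actual code, and the signature and details are assumptions inferred from the trace, using existing kernel helpers (`for_each_sgtable_sg`, `bvec_set_page`, `iov_iter_bvec`).]

```c
/* Pseudocode-level sketch (not buildable standalone): how a dmabuf
 * copy_file_range callback could feed the buffer's pages to the
 * filesystem's read_iter/write_iter without any GUP. One bvec is
 * assigned per sg entry, so the cost is O(nents), not O(npages). */
static ssize_t system_heap_dma_buf_rw_file(struct dma_buf *dmabuf,
					   struct file *file, loff_t pos,
					   size_t len, bool is_write)
{
	struct system_heap_buffer *buffer = dmabuf->priv;
	struct sg_table *sgt = &buffer->sg_table;
	struct scatterlist *sg;
	struct bio_vec *bvec;
	struct iov_iter iter;
	struct kiocb kiocb;
	unsigned int i, nents = sgt->orig_nents;
	ssize_t ret;

	bvec = kvcalloc(nents, sizeof(*bvec), GFP_KERNEL);
	if (!bvec)
		return -ENOMEM;

	/* Cheap per-sg assignment: no page-table walk, no page refs. */
	for_each_sgtable_sg(sgt, sg, i)
		bvec_set_page(&bvec[i], sg_page(sg), sg->length, sg->offset);

	iov_iter_bvec(&iter, is_write ? ITER_SOURCE : ITER_DEST,
		      bvec, nents, len);
	init_sync_kiocb(&kiocb, file);
	kiocb.ki_pos = pos;

	ret = is_write ? call_write_iter(file, &kiocb, &iter)
		       : call_read_iter(file, &kiocb, &iter);
	kvfree(bvec);
	return ret;
}
```

This also makes concrete why the thread keeps returning to Christian's objection: the sketch touches `sg_page()` on the sg_table, which is exactly the access pattern the dma-buf maintainers want to eliminate.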
* Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range 2025-06-13 9:43 ` wangtao @ 2025-06-16 5:24 ` Christoph Hellwig 0 siblings, 0 replies; 30+ messages in thread From: Christoph Hellwig @ 2025-06-16 5:24 UTC (permalink / raw) To: wangtao Cc: Christoph Hellwig, Christian König, sumit.semwal@linaro.org, kraxel@redhat.com, vivek.kasireddy@intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com, akpm@linux-foundation.org, amir73il@gmail.com, benjamin.gaignard@collabora.com, Brian.Starkey@arm.com, jstultz@google.com, tjmercier@google.com, jack@suse.cz, baolin.wang@linux.alibaba.com, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, wangbintian(BintianWang), yipengxiang, liulu 00013167, hanfeng 00012985 On Fri, Jun 13, 2025 at 09:43:08AM +0000, wangtao wrote: > Here's my analysis on Linux 6.6 with F2FS/iomap. Linux 6.6 is almost two years old and completely irrelevant. Please provide numbers on 6.16 or current Linus' tree. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range 2025-06-09 9:32 ` wangtao 2025-06-10 10:52 ` Christian König @ 2025-06-10 13:34 ` Christoph Hellwig 2025-06-13 9:33 ` wangtao 1 sibling, 1 reply; 30+ messages in thread From: Christoph Hellwig @ 2025-06-10 13:34 UTC (permalink / raw) To: wangtao Cc: Christoph Hellwig, Christian König, sumit.semwal@linaro.org, kraxel@redhat.com, vivek.kasireddy@intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, hughd@google.com, akpm@linux-foundation.org, amir73il@gmail.com, benjamin.gaignard@collabora.com, Brian.Starkey@arm.com, jstultz@google.com, tjmercier@google.com, jack@suse.cz, baolin.wang@linux.alibaba.com, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, wangbintian(BintianWang), yipengxiang, liulu 00013167, hanfeng 00012985 On Mon, Jun 09, 2025 at 09:32:20AM +0000, wangtao wrote: > Are you suggesting adding an ITER_DMABUF type to iov_iter, Yes. ^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
  2025-06-10 13:34 ` Christoph Hellwig
@ 2025-06-13  9:33 ` wangtao
  2025-06-16  5:25 ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: wangtao @ 2025-06-13 9:33 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Christian König, sumit.semwal@linaro.org, kraxel@redhat.com,
  vivek.kasireddy@intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org,
  hughd@google.com, akpm@linux-foundation.org, amir73il@gmail.com,
  benjamin.gaignard@collabora.com, Brian.Starkey@arm.com, jstultz@google.com,
  tjmercier@google.com, jack@suse.cz, baolin.wang@linux.alibaba.com,
  linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org,
  linaro-mm-sig@lists.linaro.org, linux-kernel@vger.kernel.org,
  linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
  wangbintian(BintianWang), yipengxiang, liulu 00013167, hanfeng 00012985

> On Mon, Jun 09, 2025 at 09:32:20AM +0000, wangtao wrote:
> > Are you suggesting adding an ITER_DMABUF type to iov_iter,
>
> Yes.

May I clarify: do all disk operations require data to pass through
memory (reading into memory or writing from memory)? In the block layer,
the bio structure uses bio_iov_iter_get_pages to convert iter_type
objects into memory-backed bio_vec representations.
However, some dmabufs are not memory-based, making page-to-bio_vec
conversion impossible. This suggests adding a callback function in
dma_buf_ops to handle dmabuf-to-bio_vec conversion. Interestingly, if
such a callback exists, the need for a dedicated ITER_DMABUF type might
disappear. Would you like to discuss potential implementation tradeoffs
here?

Regards,
Wangtao.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
  2025-06-13  9:33 ` wangtao
@ 2025-06-16  5:25 ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2025-06-16 5:25 UTC (permalink / raw)
To: wangtao
Cc: Christoph Hellwig, Christian König, sumit.semwal@linaro.org,
  kraxel@redhat.com, vivek.kasireddy@intel.com, viro@zeniv.linux.org.uk,
  brauner@kernel.org, hughd@google.com, akpm@linux-foundation.org,
  amir73il@gmail.com, benjamin.gaignard@collabora.com, Brian.Starkey@arm.com,
  jstultz@google.com, tjmercier@google.com, jack@suse.cz,
  baolin.wang@linux.alibaba.com, linux-media@vger.kernel.org,
  dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
  linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
  linux-mm@kvack.org, wangbintian(BintianWang), yipengxiang,
  liulu 00013167, hanfeng 00012985

On Fri, Jun 13, 2025 at 09:33:55AM +0000, wangtao wrote:
> > On Mon, Jun 09, 2025 at 09:32:20AM +0000, wangtao wrote:
> > > Are you suggesting adding an ITER_DMABUF type to iov_iter,
> >
> > Yes.
>
> May I clarify: Do all disk operations require data to pass through
> memory (reading into memory or writing from memory)? In the block layer,
> the bio structure uses bio_iov_iter_get_pages to convert iter_type
> objects into memory-backed bio_vec representations.
> However, some dmabufs are not memory-based, making page-to-bio_vec
> conversion impossible. This suggests adding a callback function in
> dma_buf_ops to handle dmabuf-to-bio_vec conversion.

bios do support PCI P2P transfers. This could be fairly easily extended
to other peer-to-peer transfers if we manage to come up with a coherent
model for them. No need for a callback.

^ permalink raw reply	[flat|nested] 30+ messages in thread
end of thread, other threads:[~2025-06-16  5:25 UTC | newest]

Thread overview: 30+ messages
2025-06-03  9:52 [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range wangtao
2025-06-03  9:52 ` [PATCH v4 1/4] fs: allow cross-FS copy_file_range for memory file with direct I/O wangtao
2025-06-03 10:56 ` Amir Goldstein
2025-06-03 12:38 ` wangtao
2025-06-03 12:43 ` Amir Goldstein
2025-06-03  9:52 ` [PATCH v4 2/4] dmabuf: Implement copy_file_range callback for dmabuf direct I/O prep wangtao
2025-06-03 10:42 ` Christian König
2025-06-03 12:26 ` wangtao
2025-06-03 13:04 ` Christoph Hellwig
2025-06-03  9:52 ` [PATCH v4 3/4] udmabuf: Implement udmabuf direct I/O wangtao
2025-06-03  9:52 ` [PATCH v4 4/4] dmabuf:system_heap Implement system_heap dmabuf " wangtao
2025-06-03 13:00 ` [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range Christoph Hellwig
2025-06-03 13:14 ` Christian König
2025-06-03 13:19 ` Christoph Hellwig
2025-06-03 14:18 ` Christian König
2025-06-03 14:28 ` Christoph Hellwig
2025-06-03 15:55 ` Christian König
2025-06-03 16:01 ` Christoph Hellwig
2025-06-06  9:59 ` wangtao
2025-06-06  9:52 ` wangtao
2025-06-06 11:20 ` Christian König
2025-06-09  4:35 ` Christoph Hellwig
2025-06-09  9:32 ` wangtao
2025-06-10 10:52 ` Christian König
2025-06-10 13:37 ` Christoph Hellwig
2025-06-13  9:43 ` wangtao
2025-06-16  5:24 ` Christoph Hellwig
2025-06-10 13:34 ` Christoph Hellwig
2025-06-13  9:33 ` wangtao
2025-06-16  5:25 ` Christoph Hellwig