* [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
@ 2024-07-30 7:57 Huan Yang
2024-07-30 7:57 ` [PATCH v2 1/5] dma-buf: heaps: " Huan Yang
` (6 more replies)
0 siblings, 7 replies; 26+ messages in thread
From: Huan Yang @ 2024-07-30 7:57 UTC (permalink / raw)
To: Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Christian König, linux-media, dri-devel,
linaro-mm-sig, linux-kernel
Cc: opensource.kernel, Huan Yang
Background
====
Some users may need to load a file into a dma-buf; the current way is:
1. allocate a dma-buf, get dma-buf fd
2. mmap dma-buf fd into user vaddr
3. read(file_fd, vaddr, fsz)
Because dma-buf user mappings cannot support direct I/O[1], the file
read must use buffered I/O.
This means that during the process of reading the file into dma-buf,
page cache needs to be generated, and the corresponding content needs to
be first copied to the page cache before being copied to the dma-buf.
This approach worked well for relatively small files, as the page cache
caches the file content and thus improves subsequent read performance.
However, there are new challenges currently, especially as AI models are
becoming larger and need to be shared between DMA devices and the CPU
via dma-buf.
For example, our 7B model file is around 3.4GB. Using the method above
would mean generating a total of 3.4GB of page cache (even if it is
later reclaimed) and copying 3.4GB of content between the page cache
and the dma-buf.
Because system memory is limited, files in the gigabyte range cannot
persist in memory indefinitely, so this portion of page cache provides
little benefit for subsequent reads. Additionally, the page cache
consumes extra system resources due to the additional copying required
by the CPU.
Therefore, I think it is necessary for dma-buf to support direct I/O.
However, direct I/O file reads cannot be performed through the buffer
that user space mmaps from the dma-buf.[1]
Here are some discussions on implementing direct I/O using dma-buf:
mmap[1]
---
dma-buf user-mapped vaddrs can never be used for direct I/O.
udmabuf[2]
---
Currently, udmabuf can use the memfd method to read files into a
dma-buf in direct I/O mode.
However, for large sizes, the current udmabuf needs its size_limit
(default 64MB) adjusted accordingly.
Using udmabuf for files at the 3GB level is therefore not a very good
approach: it needs internal adjustments to handle this[3], or else
creation fails.
Still, it is a viable way to give dma-buf direct I/O support.
However, the file read must be initiated after the memory allocation
completes, and race conditions must be handled carefully.
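For reference, a hypothetical sketch of the memfd + udmabuf path mentioned above (struct udmabuf_create and UDMABUF_CREATE mirror include/uapi/linux/udmabuf.h; the fallback constant values are assumptions for libcs that hide the GNU-specific definitions, and the function name is illustrative):

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* GNU/Linux-specific constants, defined here in case the libc headers
 * do not expose them; the values match the Linux uapi headers. */
#ifndef MFD_ALLOW_SEALING
#define MFD_ALLOW_SEALING 0x0002U
#endif
#ifndef F_ADD_SEALS
#define F_ADD_SEALS (1024 + 9)
#define F_SEAL_SHRINK 0x0002
#endif

/* Mirrors include/uapi/linux/udmabuf.h */
struct udmabuf_create {
	uint32_t memfd;
	uint32_t flags;
	uint64_t offset;
	uint64_t size;
};
#define UDMABUF_CREATE _IOW('u', 0x42, struct udmabuf_create)

/*
 * Read a file into a memfd, then wrap the memfd pages in a dma-buf via
 * udmabuf. Returns the dma-buf fd, or -1 on failure. aligned_size must
 * be page aligned, and the memfd must be sealed against shrinking.
 */
int file_to_udmabuf(const char *file_path, size_t aligned_size)
{
	struct udmabuf_create create = { 0 };
	void *vaddr;
	int file_fd, memfd, devfd, buf_fd = -1;

	file_fd = open(file_path, O_RDONLY); /* O_DIRECT also works here */
	if (file_fd < 0)
		return -1;
	memfd = (int)syscall(SYS_memfd_create, "file-backing",
			     MFD_ALLOW_SEALING);
	if (memfd < 0)
		goto out_file;
	if (ftruncate(memfd, aligned_size) < 0)
		goto out_memfd;
	fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

	/* Unlike a dma-buf mapping, a memfd mapping is a valid direct
	 * I/O target. */
	vaddr = mmap(NULL, aligned_size, PROT_READ | PROT_WRITE,
		     MAP_SHARED, memfd, 0);
	if (vaddr != MAP_FAILED) {
		ssize_t done = read(file_fd, vaddr, aligned_size);

		munmap(vaddr, aligned_size);
		if (done < 0)
			goto out_memfd;
	}

	devfd = open("/dev/udmabuf", O_RDWR);
	if (devfd >= 0) {
		create.memfd = (uint32_t)memfd;
		create.size = aligned_size;
		buf_fd = ioctl(devfd, UDMABUF_CREATE, &create);
		close(devfd);
	}
out_memfd:
	close(memfd);
out_file:
	close(file_fd);
	return buf_fd;
}
```

Note that the allocation (memfd_create + ftruncate) still completes before the read starts, which is the serialization this patchset tries to overlap.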
sendfile/splice[4]
---
Another way to enable dma-buf to support direct I/O is by implementing
splice_write/write_iter in the dma-buf file operations (fops) to adapt
to the sendfile method.
However, the current sendfile/splice calls are based on a pipe. When
using direct I/O to read a file, the content must first be copied into
the buffer allocated by the pipe (64KB by default), and then the
dma-buf fops' splice_write is called to write the content into the
dma-buf.
This approach serially reads one pipe-sized chunk of the file into the
pipe buffer and waits for it to be written into the dma-buf before
reading the next chunk, so I/O performance under direct I/O is
relatively weak.
Moreover, because of the pipe buffer, a CPU copy is still needed even
when direct I/O avoids generating additional page cache.
copy_file_range[5]
---
As for copy_file_range, it only supports copying files within the same
file system, so it is likewise impractical here.
So, currently, there is no particularly suitable solution on VFS to
allow dma-buf to support direct I/O for large file reads.
This patchset provides an idea to complete file reads when requesting a
dma-buf.
Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
===
This patch provides a method to immediately read the file content after
the dma-buf is allocated, and only returns the dma-buf file descriptor
after the file is fully read.
Until the dma-buf file descriptor is returned, no thread other than the
current one can access it, so we don't need to worry about race
conditions.
Map the dma-buf to the vmalloc area and initiate file reads in kernel
space, supporting both buffer I/O and direct I/O.
This patch adds the DMA_HEAP_ALLOC_AND_READ_FILE heap flag for users.
When a user needs to allocate a dma-buf and read a file, they should
pass this heap flag. As the size of the file being read is fixed, there
is no need to pass the 'len' parameter. Instead, file_fd must be passed
to indicate to the kernel which file to read.
The file's open flags determine the read mode.
Note that if direct I/O (O_DIRECT) is used to read the file, the file
size must be page aligned. (with patch 2-5, this restriction is lifted)
Therefore, for the user, len and file_fd are mutually exclusive, and
they are combined using a union.
Once the user obtains the dma-buf fd, the dma-buf directly contains the
file content.
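A minimal check of the layout implied by the union above (the struct is reproduced from patch 1's uapi change; the helper function below is illustrative, not part of the patchset): the union keeps sizeof() and field offsets unchanged, with file_fd aliasing the low bits of len.

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_HEAP_ALLOC_AND_READ_FILE 0x1

/* Mirrors struct dma_heap_allocation_data after patch 1 */
struct dma_heap_allocation_data {
	union {
		uint64_t len;     /* plain allocation: buffer size */
		uint32_t file_fd; /* alloc-and-read: file to load */
	};
	uint32_t fd;
	uint32_t fd_flags;
	uint64_t heap_flags;
};

/* Fill a request for alloc-and-read: file_fd takes the place of len. */
static inline struct dma_heap_allocation_data
alloc_and_read_request(int file_fd, uint32_t fd_flags)
{
	struct dma_heap_allocation_data data = { 0 };

	data.file_fd = (uint32_t)file_fd;
	data.fd_flags = fd_flags;
	data.heap_flags = DMA_HEAP_ALLOC_AND_READ_FILE;
	return data;
}
```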
Patch 1 implements this.
Patches 2-5 provide an approach for improving performance.
The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
synchronously read files using direct I/O.
This approach helps to save CPU copying and avoid a certain degree of
memory thrashing (page cache generation and reclamation)
When dealing with large file sizes, the benefits of this approach become
particularly significant.
However, beyond saving system resources, there are also ways to improve
performance:
Due to the large file size, for example an AI 7B model of around 3.4GB,
the time taken to allocate the DMA-BUF memory is relatively long.
Waiting for the allocation to complete before reading the file adds to
the overall time consumption. The total time for DMA-BUF allocation and
file read is therefore given by the formula
T(total) = T(alloc) + T(I/O)
However, if we change our approach, we don't necessarily need to wait
for the DMA-BUF allocation to complete before initiating I/O. In fact,
during the allocation process we already hold a portion of the pages,
which means that waiting for all subsequent page allocations to
complete before starting the file read wastes the pages that have
already been allocated.
The allocation of pages is sequential, and the reading of the file is
also sequential, with the content and size corresponding to the file.
This means that the memory location for each page, which holds the
content of a specific position in the file, can be determined at the
time of allocation.
However, to fully leverage I/O performance, it is best to wait and
gather a certain number of pages before initiating batch processing.
The default gather size is 128MB, and each gathered batch can be
treated as one file-read work item: the gathered pages are mapped into
the vmalloc area to obtain a contiguous virtual address, which is used
as the buffer for the corresponding part of the file. So when direct
I/O is used to read a file, the file content is written directly into
the dma-buf's backing memory without any additional copying (compare
with the pipe buffer).
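Because the allocation and the file are both sequential, each batch's file offset and size are fully determined in advance. The batching arithmetic can be sketched as follows (GATHER_LIMIT and the struct/function names are illustrative, not the patchset's identifiers):

```c
#include <stddef.h>

#define GATHER_LIMIT (128UL << 20) /* default gather size: 128MB */

struct read_work {
	size_t offset; /* file offset this batch starts at */
	size_t size;   /* bytes this batch reads */
};

/*
 * Split a file of fsize bytes into gather-sized read works, filling
 * works[] (at most max entries). Returns the number of works planned.
 */
size_t plan_gather_works(size_t fsize, struct read_work *works, size_t max)
{
	size_t off = 0, n = 0;

	while (off < fsize && n < max) {
		size_t sz = fsize - off;

		if (sz > GATHER_LIMIT)
			sz = GATHER_LIMIT; /* full batch, commit it */
		works[n].offset = off;
		works[n].size = sz;
		off += sz;
		n++;
	}
	return n;
}
```

For a 3.4GB model this yields a few dozen works, each large enough to keep direct I/O efficient while allocation continues in parallel.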
Consider the other ways of reading into a dma-buf. Reading after mmap
requires mapping the dma-buf's pages into the user virtual address
space, and udmabuf's memfd needs the same operation. Even with sendfile
support, the file copy still needs a buffer that must be set up.
So mapping pages into the vmalloc area incurs no additional performance
overhead compared to the other methods.[6]
Certainly, the administrator can also modify the gather size through patch5.
The formula for the time taken for system_heap buffer allocation and
file reading through async_read is as follows:
T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
Compared to the synchronous read:
T(total) = T(alloc) + T(I/O)
If either the allocation time or the I/O time is long, the total is
dominated by the maximum of the two; the smaller one is hidden inside
it.
Therefore, the larger the size of the file that needs to be read, the
greater the corresponding benefits will be.
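The two timing formulas above, expressed as a tiny model (the nanosecond inputs in the usage below are illustrative, not measurements): with async read, only the first gathered batch is allocated serially, after which the remaining allocation and the I/O overlap.

```c
#include <stdint.h>

static inline uint64_t max_u64(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/* Synchronous read: T(total) = T(alloc) + T(I/O) */
uint64_t t_sync(uint64_t t_alloc, uint64_t t_io)
{
	return t_alloc + t_io;
}

/* Async read: T(total) = T(first gather page) + max(T(remain alloc), T(I/O)) */
uint64_t t_async(uint64_t t_first_gather, uint64_t t_remain_alloc,
		 uint64_t t_io)
{
	return t_first_gather + max_u64(t_remain_alloc, t_io);
}
```

For example, t_sync(200, 4000) = 4200 while t_async(10, 190, 4000) = 4010: the allocation time after the first batch is hidden behind the I/O.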
How to use
===
Consider the current pathway for loading model files into DMA-BUF:
1. open dma-heap, get heap fd
2. open file, get file_fd(can't use O_DIRECT)
3. use file len to allocate dma-buf, get dma-buf fd
4. mmap dma-buf fd, get vaddr
5. read(file_fd, vaddr, file_size) into dma-buf pages
6. share, attach, whatever you want
Using DMA_HEAP_ALLOC_AND_READ_FILE requires just a small change:
1. open dma-heap, get heap fd
2. open file, get file_fd(buffer/direct)
3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag, set file_fd
instead of len. get dma-buf fd(contains file content)
4. share, attach, whatever you want
So it is easy to test.
How to test
===
The performance comparison will be conducted for the following scenarios:
1. normal
2. udmabuf with [3] patch
3. sendfile
4. only patch 1
5. patch1 - patch4.
normal:
1. open dma-heap, get heap fd
2. open file, get file_fd(can't use O_DIRECT)
3. use file len to allocate dma-buf, get dma-buf fd
4. mmap dma-buf fd, get vaddr
5. read(file_fd, vaddr, file_size) into dma-buf pages
6. share, attach, whatever you want
UDMA-BUF step:
1. memfd_create
2. open file(buffer/direct)
3. udmabuf create
4. mmap memfd
5. read file into memfd vaddr
Sendfile steps (requires suitable splice_write/write_iter support; used
here only for comparison):
1. open dma-heap, get heap fd
2. open file, get file_fd(buffer/direct)
3. use file len to allocate dma-buf, get dma-buf fd
4. sendfile file_fd to dma-buf fd
5. share, attach, whatever you want
patch1/patch1-4:
1. open dma-heap, get heap fd
2. open file, get file_fd(buffer/direct)
3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag, set file_fd
instead of len. get dma-buf fd(contains file content)
4. share, attach, whatever you want
You can create a file to test it and compare the performance gap
between the schemes. It is best to compare file sizes ranging from KB
to MB to GB.
The following test data will compare the performance differences between 512KB,
8MB, 1GB, and 3GB under various scenarios.
Performance Test
===
12G RAM phone
UFS 4.0 (maximum speed 4GB/s),
f2fs
kernel 6.1 with patch [7] (otherwise kvec direct I/O reads are not
supported.)
no memory pressure.
drop_caches is triggered before each test.
The average of 5 test results:
| scheme-size | 512KB(ns) | 8MB(ns) | 1GB(ns) | 3GB(ns) |
| ------------------- | ---------- | ---------- | ------------- | ------------- |
| normal | 2,790,861 | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
| udmabuf buffer I/O | 1,704,046 | 11,313,476 | 821,348,000 | 2,108,419,923 |
| sendfile buffer I/O | 3,261,261 | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
| patch1-4 buffer I/O | 2,064,538 | 10,771,474 | 986,338,800 | 2,187,570,861 |
| sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
| patch1 direct I/O | 813,215 | 6,962,092 | 2,364,211,877 | 5,648,897,554 |
| udmabuf direct I/O | 1,289,554 | 8,968,138 | 921,480,784 | 2,158,305,738 |
| patch1-4 direct I/O | 1,957,661 | 6,581,999 | 520,003,538 | 1,400,006,107 |
So, based on the test results:
When the file is large, the patchset delivers the highest performance.
Compared to normal, the full patchset is about a 50% improvement, while
patch 1 alone shows a 41% degradation.
patch1 typical performance breakdown is as follows:
1. alloc cost 188,802,693 ns
2. vmap cost 42,491,385 ns
3. file read cost 4,180,876,702 ns
Therefore, directly performing a single direct I/O read on a large file
may not be the most optimal way for performance.
The direct I/O implementation based on sendfile performs the worst.
When the file size is small, the performance differences are not
significant, which is consistent with expectations.
Suggested use cases
===
1. When there is a need to read large files and system resources are
   scarce, especially when the amount of memory is limited (GB level).
   In this scenario, using direct I/O for the file read can even bring
   performance improvements. (may need patch 2-3)
2. For embedded devices with limited RAM, using direct I/O saves system
   resources and avoids unnecessary data copying. So even though
   performance is lower when reading small files, it can still be used
   effectively.
3. If there is sufficient memory, pinning the page cache of the model
   files in memory and placing the files on an EROFS file system for
   read-only access may be better. (EROFS does not support direct I/O.)
Changelog
===
v1 [8]
v1->v2:
Use the heap-flag method for alloc-and-read instead of adding a new
dma-buf ioctl command. [9]
Split the patchset to facilitate review and testing:
patch 1 implements alloc-and-read, exposed through the heap flag.
patch 2-4 offer async read.
patch 5 makes the gather limit configurable.
Reference
===
[1] https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
[2] https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
[3] https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
[4] https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
[5] https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
[6] https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
[7] https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230209102954.528942-7-dhowells@redhat.com/
[8] https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
[9] https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
Huan Yang (5):
dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
dma-buf: heaps: Introduce async alloc read ops
dma-buf: heaps: support alloc async read file
dma-buf: heaps: system_heap alloc support async read
dma-buf: heaps: configurable async read gather limit
drivers/dma-buf/dma-heap.c | 552 +++++++++++++++++++++++++++-
drivers/dma-buf/heaps/system_heap.c | 70 +++-
include/linux/dma-heap.h | 53 ++-
include/uapi/linux/dma-heap.h | 11 +-
4 files changed, 673 insertions(+), 13 deletions(-)
base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
--
2.45.2
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v2 1/5] dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 7:57 [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag Huan Yang
@ 2024-07-30 7:57 ` Huan Yang
2024-07-31 11:08 ` kernel test robot
2024-07-30 7:57 ` [PATCH v2 2/5] dma-buf: heaps: Introduce async alloc read ops Huan Yang
` (5 subsequent siblings)
6 siblings, 1 reply; 26+ messages in thread
From: Huan Yang @ 2024-07-30 7:57 UTC (permalink / raw)
To: Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Christian König, linux-media, dri-devel,
linaro-mm-sig, linux-kernel
Cc: opensource.kernel, Huan Yang
Some users may need to load a file into a dma-buf; the current way is:
1. allocate a dma-buf, get dma-buf fd
2. mmap dma-buf fd into user vaddr
3. read(file_fd, vaddr, fsz)
Because dma-buf cannot support direct I/O (its pages cannot be pinned,
and its mapping is not purely page-based), the file read must use
buffered I/O.
This means that during the process of reading the file into dma-buf,
page cache needs to be generated, and the corresponding content needs to
be first copied to the page cache before being copied to the dma-buf.
This method worked well for relatively small files, as the page cache
caches the file content and thus improves subsequent read performance.
However, there are new challenges currently, especially as AI models are
becoming larger and need to be shared between DMA devices and the CPU
via dma-buf.
For example, the current 3B model file size is around 3.4GB. Using the
previous method would mean generating a total of 3.4GB of page cache
(even if it will be reclaimed), and also requiring the copying of 3.4GB
of content between page cache and dma-buf.
Due to the limited nature of system memory, files in the gigabyte range
cannot persist in memory indefinitely, so this portion of page cache may
not provide much assistance for subsequent reads. Additionally, the
existence of page cache will consume additional system resources due to
the extra copying required by the CPU.
Therefore, it is necessary for dma-buf to support direct I/O.
This patch provides a method to immediately read the file content after
the dma-buf is allocated, and only returns the dma-buf file descriptor
after the file is fully read.
Until the dma-buf file descriptor is returned, no thread other than the
current one can access it, so we don't need to worry about race
conditions.
Map the dma-buf to the vmalloc area and initiate file reads in kernel
space, supporting both buffer I/O and direct I/O.
This patch adds the DMA_HEAP_ALLOC_AND_READ_FILE heap flag for upper
layers.
When a user needs to allocate a dma-buf and read a file, they should
pass this flag. As the size of the file being read is fixed, there is
no need to pass the 'len' parameter.
Instead, file_fd must be passed to indicate to the kernel which file to
read, and the file's open flags determine the read mode. Note that if
direct I/O (O_DIRECT) is used to read the file, the file size must be
page aligned.
Therefore, for the user, len and file_fd are mutually exclusive, and
they are combined using a union.
Once the user obtains the dma-buf fd, the dma-buf directly contains the
file content.
Signed-off-by: Huan Yang <link@vivo.com>
---
drivers/dma-buf/dma-heap.c | 127 +++++++++++++++++++++++++++++++++-
include/uapi/linux/dma-heap.h | 11 ++-
2 files changed, 133 insertions(+), 5 deletions(-)
diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index 2298ca5e112e..f19b944d4eaa 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -43,12 +43,128 @@ struct dma_heap {
struct cdev heap_cdev;
};
+/**
+ * struct dma_heap_file - wrap the file, read task for dma_heap allocate use.
+ * @file: file to read from.
+ * @fsize: file size.
+ */
+struct dma_heap_file {
+ struct file *file;
+ size_t fsize;
+};
+
static LIST_HEAD(heap_list);
static DEFINE_MUTEX(heap_list_lock);
static dev_t dma_heap_devt;
static struct class *dma_heap_class;
static DEFINE_XARRAY_ALLOC(dma_heap_minors);
+static int init_dma_heap_file(struct dma_heap_file *heap_file, int file_fd)
+{
+ struct file *file;
+ size_t fsz;
+
+ file = fget(file_fd);
+ if (!file)
+ return -EINVAL;
+
+ /* Direct I/O only supports PAGE_SIZE aligned files. */
+ fsz = i_size_read(file_inode(file));
+ if (file->f_flags & O_DIRECT && !PAGE_ALIGNED(fsz)) {
+ fput(file);
+ return -EINVAL;
+ }
+
+ heap_file->fsize = fsz;
+ heap_file->file = file;
+
+ return 0;
+}
+
+static void deinit_dma_heap_file(struct dma_heap_file *heap_file)
+{
+ fput(heap_file->file);
+}
+
+/**
+ * dma_heap_read_file_sync - read the file synchronously into the dma-buf.
+ * @dmabuf: the dma-buf we have already allocated and exported.
+ * @heap_file: file info wrapper to read from.
+ *
+ * Whether buffered I/O or direct I/O is used depends on the mode the
+ * file was opened with.
+ * Remember: with direct I/O, the file size must be page aligned.
+ * Since the buffer used for the file read is provided by the dma-buf,
+ * direct I/O fills the file content straight into the dma-buf without
+ * any additional CPU copy.
+ *
+ * Returns 0 on success, negative errno if anything goes wrong.
+ */
+static int dma_heap_read_file_sync(struct dma_buf *dmabuf,
+ struct dma_heap_file *heap_file)
+{
+ struct iosys_map map;
+ ssize_t bytes;
+ int ret;
+
+ ret = dma_buf_vmap(dmabuf, &map);
+ if (ret)
+ return ret;
+
+ /*
+ * kernel_read_file() handles the file read; if the returned byte
+ * count does not match the file size, the read failed.
+ */
+ bytes = kernel_read_file(heap_file->file, 0, &map.vaddr, dmabuf->size,
+ &heap_file->fsize, READING_POLICY);
+ if (bytes != heap_file->fsize)
+ ret = -EIO;
+
+ dma_buf_vunmap(dmabuf, &map);
+
+ return ret;
+}
+
+static int dma_heap_buffer_alloc_and_read(struct dma_heap *heap, int file_fd,
+ u32 fd_flags, u64 heap_flags)
+{
+ struct dma_heap_file heap_file;
+ struct dma_buf *dmabuf;
+ int ret, fd;
+
+ ret = init_dma_heap_file(&heap_file, file_fd);
+ if (ret)
+ return ret;
+
+ dmabuf = heap->ops->allocate(heap, heap_file.fsize, fd_flags,
+ heap_flags);
+ if (IS_ERR(dmabuf)) {
+ ret = PTR_ERR(dmabuf);
+ goto error_file;
+ }
+
+ ret = dma_heap_read_file_sync(dmabuf, &heap_file);
+ if (ret)
+ goto error_put;
+
+ ret = dma_buf_fd(dmabuf, fd_flags);
+ if (ret < 0)
+ goto error_put;
+
+ fd = ret;
+
+ deinit_dma_heap_file(&heap_file);
+
+ return fd;
+
+error_put:
+ dma_buf_put(dmabuf);
+error_file:
+ deinit_dma_heap_file(&heap_file);
+
+ return ret;
+}
+
static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
u32 fd_flags,
u64 heap_flags)
@@ -108,9 +224,14 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
return -EINVAL;
- fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
- heap_allocation->fd_flags,
- heap_allocation->heap_flags);
+ if (heap_allocation->heap_flags & DMA_HEAP_ALLOC_AND_READ_FILE)
+ fd = dma_heap_buffer_alloc_and_read(
+ heap, heap_allocation->file_fd,
+ heap_allocation->fd_flags, heap_allocation->heap_flags);
+ else
+ fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
+ heap_allocation->fd_flags,
+ heap_allocation->heap_flags);
if (fd < 0)
return fd;
diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
index a4cf716a49fa..ef2fbd885825 100644
--- a/include/uapi/linux/dma-heap.h
+++ b/include/uapi/linux/dma-heap.h
@@ -18,13 +18,17 @@
/* Valid FD_FLAGS are O_CLOEXEC, O_RDONLY, O_WRONLY, O_RDWR */
#define DMA_HEAP_VALID_FD_FLAGS (O_CLOEXEC | O_ACCMODE)
+/* Heap reads the file after allocation; the len field carries a file fd instead */
+#define DMA_HEAP_ALLOC_AND_READ_FILE 00000001
+
/* Currently no heap flags */
-#define DMA_HEAP_VALID_HEAP_FLAGS (0ULL)
+#define DMA_HEAP_VALID_HEAP_FLAGS (DMA_HEAP_ALLOC_AND_READ_FILE)
/**
* struct dma_heap_allocation_data - metadata passed from userspace for
* allocations
* @len: size of the allocation
+ * @file_fd: file descriptor to read the allocation from
* @fd: will be populated with a fd which provides the
* handle to the allocated dma-buf
* @fd_flags: file descriptor flags used when allocating
@@ -33,7 +37,10 @@
* Provided by userspace as an argument to the ioctl
*/
struct dma_heap_allocation_data {
- __u64 len;
+ union {
+ __u64 len;
+ __u32 file_fd;
+ };
__u32 fd;
__u32 fd_flags;
__u64 heap_flags;
--
2.45.2
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v2 2/5] dma-buf: heaps: Introduce async alloc read ops
2024-07-30 7:57 [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag Huan Yang
2024-07-30 7:57 ` [PATCH v2 1/5] dma-buf: heaps: " Huan Yang
@ 2024-07-30 7:57 ` Huan Yang
2024-07-30 7:57 ` [PATCH v2 3/5] dma-buf: heaps: support alloc async read file Huan Yang
` (4 subsequent siblings)
6 siblings, 0 replies; 26+ messages in thread
From: Huan Yang @ 2024-07-30 7:57 UTC (permalink / raw)
To: Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Christian König, linux-media, dri-devel,
linaro-mm-sig, linux-kernel
Cc: opensource.kernel, Huan Yang
The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
synchronously read files using direct I/O.
This approach helps to save CPU copying and avoid a certain degree of
memory thrashing (page cache generation and reclamation)
When dealing with large file sizes, the benefits of this approach become
particularly significant.
However, beyond saving system resources, there are also ways to improve
performance:
Due to the large file size, for example an AI 7B model of around 3.4GB,
the time taken to allocate DMA-BUF memory will be relatively long.
Waiting for the allocation to complete before reading the file adds to
the overall time consumption. The total time for DMA-BUF allocation and
file read is therefore given by the formula
T(total) = T(alloc) + T(I/O)
However, if we change our approach, we don't necessarily need to wait
for the DMA-BUF allocation to complete before initiating I/O. In fact,
during the allocation process we already hold a portion of the pages,
which means that waiting for all subsequent page allocations to
complete before starting the file read wastes the pages that have
already been allocated.
The allocation of pages is sequential, and the reading of the file is
also sequential, with the content and size corresponding to the file.
This means that the memory location for each page, which holds the
content of a specific position in the file, can be determined at the
time of allocation.
However, to fully leverage I/O performance, it is best to wait and
gather a certain number of pages before initiating batch processing.
This patch only provides the allocate_async_read heap op, without the
rest of the infrastructure for completing async reads or the
corresponding heap implementation. When a heap does not implement the
allocate_async_read op, the file is read synchronously, waiting for the
dma-buf allocation to finish first.
Signed-off-by: Huan Yang <link@vivo.com>
---
drivers/dma-buf/dma-heap.c | 14 ++++++++++----
include/linux/dma-heap.h | 8 ++++++--
2 files changed, 16 insertions(+), 6 deletions(-)
diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index f19b944d4eaa..91e241763ebc 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -131,21 +131,27 @@ static int dma_heap_buffer_alloc_and_read(struct dma_heap *heap, int file_fd,
struct dma_heap_file heap_file;
struct dma_buf *dmabuf;
int ret, fd;
+ bool async_read = heap->ops->allocate_async_read ? true : false;
ret = init_dma_heap_file(&heap_file, file_fd);
if (ret)
return ret;
- dmabuf = heap->ops->allocate(heap, heap_file.fsize, fd_flags,
- heap_flags);
+ if (async_read)
+ dmabuf = heap->ops->allocate_async_read(heap, &heap_file,
+ fd_flags, heap_flags);
+ else
+ dmabuf = heap->ops->allocate(heap, heap_file.fsize, fd_flags,
+ heap_flags);
if (IS_ERR(dmabuf)) {
ret = PTR_ERR(dmabuf);
goto error_file;
}
- ret = dma_heap_read_file_sync(dmabuf, &heap_file);
- if (ret)
+ if (!async_read && dma_heap_read_file_sync(dmabuf, &heap_file)) {
+ ret = -EIO;
goto error_put;
+ }
ret = dma_buf_fd(dmabuf, fd_flags);
if (ret < 0)
diff --git a/include/linux/dma-heap.h b/include/linux/dma-heap.h
index 064bad725061..824acbf5a1bc 100644
--- a/include/linux/dma-heap.h
+++ b/include/linux/dma-heap.h
@@ -13,11 +13,12 @@
#include <linux/types.h>
struct dma_heap;
+struct dma_heap_file;
/**
* struct dma_heap_ops - ops to operate on a given heap
- * @allocate: allocate dmabuf and return struct dma_buf ptr
- *
+ * @allocate: allocate dmabuf and return struct dma_buf ptr
+ * @allocate_async_read: allocate a dmabuf and asynchronously read a file into it.
* allocate returns dmabuf on success, ERR_PTR(-errno) on error.
*/
struct dma_heap_ops {
@@ -25,6 +26,9 @@ struct dma_heap_ops {
unsigned long len,
u32 fd_flags,
u64 heap_flags);
+ struct dma_buf *(*allocate_async_read)(struct dma_heap *heap,
+ struct dma_heap_file *heap_file,
+ u32 fd_flags, u64 heap_flags);
};
/**
--
2.45.2
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v2 3/5] dma-buf: heaps: support alloc async read file
2024-07-30 7:57 [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag Huan Yang
2024-07-30 7:57 ` [PATCH v2 1/5] dma-buf: heaps: " Huan Yang
2024-07-30 7:57 ` [PATCH v2 2/5] dma-buf: heaps: Introduce async alloc read ops Huan Yang
@ 2024-07-30 7:57 ` Huan Yang
2024-07-31 14:44 ` kernel test robot
2024-07-30 7:57 ` [PATCH v2 4/5] dma-buf: heaps: system_heap alloc support async read Huan Yang
` (3 subsequent siblings)
6 siblings, 1 reply; 26+ messages in thread
From: Huan Yang @ 2024-07-30 7:57 UTC (permalink / raw)
To: Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Christian König, linux-media, dri-devel,
linaro-mm-sig, linux-kernel
Cc: opensource.kernel, Huan Yang
This patch completes the infrastructure for async reads. It treats
memory allocation as the producer and assigns file reading to the
heap_fwork_t thread as the consumer.
The heap gathers each allocated page and, once a certain amount
(default 128MB) has been gathered, packages the batch and passes it to
heap_fwork_t to initiate the file read.
This process is handled by the helper function
dma_heap_gather_file_page. Each heap declares a task and passes in
every allocated page, then waits for the file read to finish before
returning the dma-buf.
Because the memory allocation and file reading correspond to each other,
the number of gathers during the prepare process and submit process can
determine the offset in the file as well as the size to be read.
When a gathered batch initiates a read, it is packaged into a work item
and passed to the heap_fwork_t thread, containing the offset and size
of the file range being read, the buffer obtained by mapping the
gathered pages into vmalloc, and the credentials used during the read.
The buffer for file reading is provided by mapping the gathered pages to
vmalloc. This means that if direct I/O is used to read a file, the file
content will be directly transferred to the corresponding memory of the
dma-buf, without the need for additional CPU copying and intermediate
buffers.
Although direct I/O requires page alignment, this patch automatically
adapts to the file size and uses buffered I/O to read the unaligned
tail.
Note that heap_fwork_t is a single kernel thread, which means the file
read work is executed serially. Considering that the default amount of
I/O initiated at a time is 128MB, which is already quite large,
multiple threads would not help accelerate I/O performance.
So this is best suited to reading large files into a dma-buf.
Signed-off-by: Huan Yang <link@vivo.com>
---
drivers/dma-buf/dma-heap.c | 423 ++++++++++++++++++++++++++++++++++++-
include/linux/dma-heap.h | 45 ++++
2 files changed, 462 insertions(+), 6 deletions(-)
diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index 91e241763ebc..df1b2518f126 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -18,6 +18,7 @@
#include <linux/uaccess.h>
#include <linux/syscalls.h>
#include <linux/dma-heap.h>
+#include <linux/vmalloc.h>
#include <uapi/linux/dma-heap.h>
#define DEVNAME "dma_heap"
@@ -46,42 +47,419 @@ struct dma_heap {
/**
* struct dma_heap_file - wrap the file, read task for dma_heap allocate use.
* @file: file to read from.
+ * @cred: copy of the user's credentials, used by the read kthread.
+ * @glimit: The size limit for gathering. Whenever the page of the
+ * gather reaches the limit, file I/O is triggered.
+ * This is the maximum limit for the current ALLOC_AND_READ
+ * operation.
* @fsize: file size.
+ * @direct: use direct IO?
*/
struct dma_heap_file {
struct file *file;
+ struct cred *cred;
+ size_t glimit;
size_t fsize;
+ bool direct;
};
+/**
+ * struct dma_heap_file_work - represents one real dma_heap file-read work.
+ * @vaddr: contiguous virtual address allocated by vmap; the target
+ * buffer of the file read.
+ *
+ * @start_size: file read start offset, same as @dma_heap_file_task->roffset.
+ *
+ * @need_size: file read size, same as @dma_heap_file_task->rsize.
+ *
+ * @heap_file: file wrapper.
+ *
+ * @list: child node of @dma_heap_file_control->works.
+ *
+ * @refp: same as @dma_heap_file_task->ref; put at the end of the read.
+ *
+ * @failp: set to true if any work's I/O failed; points to
+ * @dma_heap_file_task->fail.
+ */
+struct dma_heap_file_work {
+ void *vaddr;
+ ssize_t start_size;
+ ssize_t need_size;
+ struct dma_heap_file *heap_file;
+ struct list_head list;
+ atomic_t *refp;
+ bool *failp;
+};
+
+/**
+ * struct dma_heap_file_task - represents a dma_heap file read process
+ * @ref: current file work counter, if zero, allocate and read
+ * done.
+ *
+ * @roffset: last read offset, current prepared work' begin file
+ * start offset.
+ *
+ * @rsize: current allocated page size use to read, if reach rbatch,
+ * trigger commit.
+ *
+ * @nr_gathered: gather target for the current batch; the minimum of
+ * @glimit and the remaining amount to allocate.
+ *
+ * @heap_file: current dma_heap_file
+ *
+ * @parray: page array used for vmap; sized for @glimit worth of
+ * pages (the maximum). Since the file read is single
+ * threaded, one page array can be reused while preparing
+ * each work. Each entry covers PAGE_SIZE, as vmap needs.
+ *
+ * @pindex: index of the next free slot in @parray.
+ *
+ * @fail: set to true if any file read work failed.
+ *
+ * dma_heap_file_task is the producer side of the file read: it prepares a
+ * work while dma-buf pages are allocated, commits it once the current batch
+ * is full, then prepares the next one. After all batches are queued, the
+ * caller keeps preparing the dma-buf, but before returning the dma-buf fd
+ * it must wait for the file read to end and check the read result.
+ */
+struct dma_heap_file_task {
+ atomic_t ref;
+ size_t roffset;
+ size_t rsize;
+ size_t nr_gathered;
+ struct dma_heap_file *heap_file;
+ struct page **parray;
+ unsigned int pindex;
+ bool fail;
+};
+
+/**
+ * struct dma_heap_file_control - global control of dma_heap file read.
+ * @works: @dma_heap_file_work's list head.
+ *
+ * @threadwq: wait queue for @work_thread; when a work is committed,
+ * @work_thread wakes up and reads that work's file contents.
+ *
+ * @workwq: wait queue on which the main thread waits for the file
+ * read to end when allocation finishes first; driven by
+ * @dma_heap_file_task's ref.
+ *
+ * @work_thread: file read kthread, the consumer of the committed works.
+ *
+ * @heap_fwork_cachep: kmem cache for @dma_heap_file_work, which is allocated and freed frequently.
+ *
+ * @nr_work: global count of committed works.
+ */
+struct dma_heap_file_control {
+ struct list_head works;
+ spinlock_t lock; /* protects @works only */
+ wait_queue_head_t threadwq;
+ wait_queue_head_t workwq;
+ struct task_struct *work_thread;
+ struct kmem_cache *heap_fwork_cachep;
+ atomic_t nr_work;
+};
+
+static struct dma_heap_file_control *heap_fctl;
static LIST_HEAD(heap_list);
static DEFINE_MUTEX(heap_list_lock);
static dev_t dma_heap_devt;
static struct class *dma_heap_class;
static DEFINE_XARRAY_ALLOC(dma_heap_minors);
+static struct dma_heap_file_work *
+init_file_work(struct dma_heap_file_task *heap_ftask)
+{
+ struct dma_heap_file_work *heap_fwork;
+ struct dma_heap_file *heap_file = heap_ftask->heap_file;
+
+ if (READ_ONCE(heap_ftask->fail))
+ return NULL;
+
+ heap_fwork = kmem_cache_alloc(heap_fctl->heap_fwork_cachep, GFP_KERNEL);
+ if (unlikely(!heap_fwork))
+ return NULL;
+
+ /*
+ * Map the gathered pages into the vmalloc area to get a contiguous
+ * virtual address even though the physical pages may be scattered,
+ * and use it to trigger the file read. With direct I/O, the file
+ * content is read straight into the dma-buf pages with no extra copy.
+ *
+ * Once we have the vaddr, the gathered pages can be handed back to
+ * the caller, so the dma-buf export is not affected even if the file
+ * read has not finished.
+ */
+ heap_fwork->vaddr = vmap(heap_ftask->parray, heap_ftask->pindex, VM_MAP,
+ PAGE_KERNEL);
+ if (unlikely(!heap_fwork->vaddr)) {
+ kmem_cache_free(heap_fctl->heap_fwork_cachep, heap_fwork);
+ return NULL;
+ }
+
+ heap_fwork->heap_file = heap_file;
+ heap_fwork->start_size = heap_ftask->roffset;
+ heap_fwork->need_size = heap_ftask->rsize;
+ heap_fwork->refp = &heap_ftask->ref;
+ heap_fwork->failp = &heap_ftask->fail;
+ atomic_inc(&heap_ftask->ref);
+ return heap_fwork;
+}
+
+static void deinit_file_work(struct dma_heap_file_work *heap_fwork)
+{
+ vunmap(heap_fwork->vaddr);
+ atomic_dec(heap_fwork->refp);
+ wake_up(&heap_fctl->workwq);
+
+ kmem_cache_free(heap_fctl->heap_fwork_cachep, heap_fwork);
+}
+
+/**
+ * dma_heap_submit_file_read - commit the gathered pages and trigger the I/O
+ * @heap_ftask: the file read task holding the gathered pages
+ *
+ * This also checks whether the read has reached the file tail. Direct I/O
+ * submissions need care with reads that are not page aligned: the unaligned
+ * tail portion must be read with buffered I/O instead.
+ * Returns:
+ * 0 on success, a negative errno on failure
+ */
+static int dma_heap_submit_file_read(struct dma_heap_file_task *heap_ftask)
+{
+ struct dma_heap_file_work *heap_fwork = init_file_work(heap_ftask);
+ struct page *last = NULL;
+ struct dma_heap_file *heap_file = heap_ftask->heap_file;
+ size_t start = heap_ftask->roffset;
+ struct file *file = heap_file->file;
+ size_t fsz = heap_file->fsize;
+
+ if (unlikely(!heap_fwork))
+ return -ENOMEM;
+
+ /*
+ * If the file size is not page aligned, direct I/O cannot read the
+ * tail. So when we reach the tail, hold back the last page and read
+ * it with buffered I/O.
+ */
+ if (heap_file->direct && start + heap_ftask->rsize > fsz) {
+ heap_fwork->need_size -= PAGE_SIZE;
+ last = heap_ftask->parray[heap_ftask->pindex - 1];
+ }
+
+ spin_lock(&heap_fctl->lock);
+ list_add_tail(&heap_fwork->list, &heap_fctl->works);
+ spin_unlock(&heap_fctl->lock);
+ atomic_inc(&heap_fctl->nr_work);
+
+ wake_up(&heap_fctl->threadwq);
+
+ if (last) {
+ char *buf, *pathp;
+ ssize_t err;
+ void *buffer;
+
+ buf = kmalloc(PATH_MAX, GFP_KERNEL);
+ if (unlikely(!buf))
+ return -ENOMEM;
+
+ start = PAGE_ALIGN_DOWN(fsz);
+
+ pathp = file_path(file, buf, PATH_MAX);
+ if (IS_ERR(pathp)) {
+ kfree(buf);
+ return PTR_ERR(pathp);
+ }
+
+ /* Use the page's kernel address as the file read buffer. */
+ buffer = kmap_local_page(last);
+ err = kernel_read_file_from_path(pathp, start, &buffer,
+ fsz - start, &fsz,
+ READING_POLICY);
+ kunmap_local(buffer);
+ kfree(buf);
+ if (err < 0)
+ return err;
+ }
+
+ heap_ftask->roffset += heap_ftask->rsize;
+ heap_ftask->rsize = 0;
+ heap_ftask->pindex = 0;
+ heap_ftask->nr_gathered = min_t(size_t,
+ PAGE_ALIGN(fsz) - heap_ftask->roffset,
+ heap_ftask->nr_gathered);
+ return 0;
+}
+
+int dma_heap_gather_file_page(struct dma_heap_file_task *heap_ftask,
+ struct page *page)
+{
+ struct page **array = heap_ftask->parray;
+ int index = heap_ftask->pindex;
+ int num = compound_nr(page), i;
+ unsigned long sz = page_size(page);
+
+ heap_ftask->rsize += sz;
+ for (i = 0; i < num; ++i)
+ array[index++] = &page[i];
+ heap_ftask->pindex = index;
+
+ if (heap_ftask->rsize < heap_ftask->nr_gathered)
+ return 0;
+
+ /* Reached the gather limit, trigger the file read. */
+ return dma_heap_submit_file_read(heap_ftask);
+}
+
+int dma_heap_wait_for_file_read(struct dma_heap_file_task *heap_ftask)
+{
+ wait_event_freezable(heap_fctl->workwq,
+ atomic_read(&heap_ftask->ref) == 0);
+ return heap_ftask->fail ? -EIO : 0;
+}
+
+int dma_heap_end_file_read(struct dma_heap_file_task *heap_ftask)
+{
+ int ret;
+
+ ret = dma_heap_wait_for_file_read(heap_ftask);
+ kvfree(heap_ftask->parray);
+ kfree(heap_ftask);
+
+ return ret;
+}
+
+struct dma_heap_file_task *
+dma_heap_declare_file_read(struct dma_heap_file *heap_file)
+{
+ struct dma_heap_file_task *heap_ftask =
+ kzalloc(sizeof(*heap_ftask), GFP_KERNEL);
+ if (unlikely(!heap_ftask))
+ return NULL;
+
+ /*
+ * glimit is the maximum size any prepared work can reach, so a page
+ * array with that many entries is sufficient.
+ */
+ heap_ftask->parray = kvmalloc_array(heap_file->glimit >> PAGE_SHIFT,
+ sizeof(struct page *), GFP_KERNEL);
+ if (unlikely(!heap_ftask->parray))
+ goto put;
+
+ heap_ftask->heap_file = heap_file;
+ heap_ftask->nr_gathered = heap_file->glimit;
+ return heap_ftask;
+
+put:
+ kfree(heap_ftask);
+ return NULL;
+}
+
+static void __work_this_io(struct dma_heap_file_work *heap_fwork)
+{
+ struct dma_heap_file *heap_file = heap_fwork->heap_file;
+ struct file *file = heap_file->file;
+ ssize_t start = heap_fwork->start_size;
+ ssize_t size = heap_fwork->need_size;
+ void *buffer = heap_fwork->vaddr;
+ const struct cred *old_cred;
+ ssize_t err;
+
+ /* Read the file with the original task's credentials. */
+ old_cred = override_creds(heap_file->cred);
+ err = kernel_read_file(file, start, &buffer, size, &heap_file->fsize,
+ READING_POLICY);
+ if (err < 0)
+ WRITE_ONCE(*heap_fwork->failp, true);
+ /* Restore our own credentials. */
+ revert_creds(old_cred);
+}
+
+static int dma_heap_file_work_thread(void *data)
+{
+ struct dma_heap_file_control *heap_fctl =
+ (struct dma_heap_file_control *)data;
+ struct dma_heap_file_work *worker, *tmp;
+ int nr_work;
+
+ LIST_HEAD(pages);
+ LIST_HEAD(workers);
+
+ while (true) {
+ wait_event_freezable(heap_fctl->threadwq,
+ atomic_read(&heap_fctl->nr_work) > 0);
+recheck:
+ spin_lock(&heap_fctl->lock);
+ list_splice_init(&heap_fctl->works, &workers);
+ spin_unlock(&heap_fctl->lock);
+
+ if (unlikely(kthread_should_stop())) {
+ list_for_each_entry_safe(worker, tmp, &workers, list) {
+ list_del(&worker->list);
+ deinit_file_work(worker);
+ }
+ break;
+ }
+
+ nr_work = 0;
+ list_for_each_entry_safe(worker, tmp, &workers, list) {
+ ++nr_work;
+ list_del(&worker->list);
+ __work_this_io(worker);
+
+ deinit_file_work(worker);
+ }
+
+ if (atomic_sub_return(nr_work, &heap_fctl->nr_work) > 0)
+ goto recheck;
+ }
+ return 0;
+}
+
+size_t dma_heap_file_size(struct dma_heap_file *heap_file)
+{
+ return heap_file->fsize;
+}
+
static int init_dma_heap_file(struct dma_heap_file *heap_file, int file_fd)
{
struct file *file;
size_t fsz;
+ int ret;
file = fget(file_fd);
if (!file)
return -EINVAL;
- // Direct I/O only support PAGE_SIZE aligned files.
fsz = i_size_read(file_inode(file));
- if (file->f_flags & O_DIRECT && !PAGE_ALIGNED(fsz))
- return -EINVAL;
- heap_file->fsize = fsz;
+ /*
+ * SELinux may block the kthread's read even though we are reading on
+ * behalf of the caller. So save the caller's credentials here,
+ * override ours with them around the read, and revert afterwards.
+ */
+ heap_file->cred = prepare_kernel_cred(current);
+ if (unlikely(!heap_file->cred)) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
heap_file->file = file;
+#define DEFAULT_DMA_BUF_HEAPS_GATHER_LIMIT (128 << 20)
+ heap_file->glimit = min_t(size_t, PAGE_ALIGN(fsz),
+ DEFAULT_DMA_BUF_HEAPS_GATHER_LIMIT);
+ heap_file->fsize = fsz;
+
+ heap_file->direct = file->f_flags & O_DIRECT;
return 0;
+
+err:
+ fput(file);
+ return ret;
}
static void deinit_dma_heap_file(struct dma_heap_file *heap_file)
{
fput(heap_file->file);
+ put_cred(heap_file->cred);
}
/**
@@ -443,11 +821,44 @@ static int dma_heap_init(void)
dma_heap_class = class_create(DEVNAME);
if (IS_ERR(dma_heap_class)) {
- unregister_chrdev_region(dma_heap_devt, NUM_HEAP_MINORS);
- return PTR_ERR(dma_heap_class);
+ ret = PTR_ERR(dma_heap_class);
+ goto fail_class;
}
dma_heap_class->devnode = dma_heap_devnode;
+ heap_fctl = kzalloc(sizeof(*heap_fctl), GFP_KERNEL);
+ if (unlikely(!heap_fctl)) {
+ ret = -ENOMEM;
+ goto fail_alloc;
+ }
+
+ INIT_LIST_HEAD(&heap_fctl->works);
+ init_waitqueue_head(&heap_fctl->threadwq);
+ init_waitqueue_head(&heap_fctl->workwq);
+
+ heap_fctl->work_thread = kthread_run(dma_heap_file_work_thread,
+ heap_fctl, "heap_fwork_t");
+ if (IS_ERR(heap_fctl->work_thread)) {
+ ret = PTR_ERR(heap_fctl->work_thread);
+ goto fail_thread;
+ }
+
+ heap_fctl->heap_fwork_cachep = KMEM_CACHE(dma_heap_file_work, 0);
+ if (unlikely(!heap_fctl->heap_fwork_cachep)) {
+ ret = -ENOMEM;
+ goto fail_cache;
+ }
+
return 0;
+
+fail_cache:
+ kthread_stop(heap_fctl->work_thread);
+fail_thread:
+ kfree(heap_fctl);
+fail_alloc:
+ class_destroy(dma_heap_class);
+fail_class:
+ unregister_chrdev_region(dma_heap_devt, NUM_HEAP_MINORS);
+ return ret;
}
subsys_initcall(dma_heap_init);
diff --git a/include/linux/dma-heap.h b/include/linux/dma-heap.h
index 824acbf5a1bc..3becbd08963a 100644
--- a/include/linux/dma-heap.h
+++ b/include/linux/dma-heap.h
@@ -14,6 +14,7 @@
struct dma_heap;
struct dma_heap_file;
+struct dma_heap_file_task;
/**
* struct dma_heap_ops - ops to operate on a given heap
@@ -69,4 +71,47 @@ const char *dma_heap_get_name(struct dma_heap *heap);
*/
struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info);
+/**
+ * dma_heap_wait_for_file_read - wait for a file read to complete
+ * @heap_ftask: the file read task to wait for
+ *
+ * Some users need to call this before freeing the pages, to make sure all
+ * file works have completed and avoid use-after-free. Note that this does
+ * not destroy the ftask itself; dma_heap_end_file_read() must still be
+ * called to tear it down.
+ *
+ * Returns: 0 on success, -EIO if any file work failed
+ */
+int dma_heap_wait_for_file_read(struct dma_heap_file_task *heap_ftask);
+
+/**
+ * dma_heap_end_file_read - wait for a file read to complete, then destroy it
+ * @heap_ftask: the file read task to finish and free
+ *
+ * Returns: 0 on success, -EIO if any file work failed
+ */
+int dma_heap_end_file_read(struct dma_heap_file_task *heap_ftask);
+
+/**
+ * dma_heap_declare_file_read - declare a task to read a file while
+ * allocating pages
+ * @heap_file: target file to read
+ *
+ * Returns: the new task on success, NULL on failure.
+ */
+struct dma_heap_file_task *
+dma_heap_declare_file_read(struct dma_heap_file *heap_file);
+
+/**
+ * dma_heap_gather_file_page - gather an allocated page into the read task
+ * @heap_ftask: the file read task being prepared
+ * @page: the page just allocated; any order is fine
+ *
+ * Gathers every allocated page and automatically submits a work once the
+ * gathered amount reaches the limit. Submission packages the pages,
+ * prepares the data needed for the file read, and hands the work to the
+ * async read thread.
+ *
+ * Returns: 0 on success, a negative errno on failure.
+ */
+int dma_heap_gather_file_page(struct dma_heap_file_task *heap_ftask,
+ struct page *page);
+size_t dma_heap_file_size(struct dma_heap_file *heap_file);
+
#endif /* _DMA_HEAPS_H */
--
2.45.2
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v2 4/5] dma-buf: heaps: system_heap alloc support async read
2024-07-30 7:57 [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag Huan Yang
` (2 preceding siblings ...)
2024-07-30 7:57 ` [PATCH v2 3/5] dma-buf: heaps: support alloc async read file Huan Yang
@ 2024-07-30 7:57 ` Huan Yang
2024-07-30 7:57 ` [PATCH v2 5/5] dma-buf: heaps: configurable async read gather limit Huan Yang
` (2 subsequent siblings)
6 siblings, 0 replies; 26+ messages in thread
From: Huan Yang @ 2024-07-30 7:57 UTC (permalink / raw)
To: Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Christian König, linux-media, dri-devel,
linaro-mm-sig, linux-kernel
Cc: opensource.kernel, Huan Yang
The system heap cyclically allocates pages and then places these pages
into a scatter_list, which is then managed by the dma-buf.
This process can parallelize memory allocation and I/O read operations:
Gather each allocated page and trigger a submit once the limit is reached.
Once the memory allocation is complete, there is no need to wait
immediately for the file read to finish. Instead, continue preparing the
dma-buf until it is necessary to return the dma-buf, at which point
waiting for the file content to be fully read is required.
Note that a page's content must not be modified after it is allocated by
the heap, as that would conflict with the file read accessing the page.
The system_heap currently has no such conflict.
The formula for the time taken for system_heap buffer allocation and
file reading through async_read is as follows:
T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
Compared to the synchronous read:
T(total) = T(alloc) + T(I/O)
Whichever of allocation and I/O takes longer dominates the total time;
the cost of the shorter one is hidden behind it.
Therefore, the larger the size of the file that needs to be read, the
greater the corresponding benefits will be.
Signed-off-by: Huan Yang <link@vivo.com>
---
drivers/dma-buf/heaps/system_heap.c | 70 +++++++++++++++++++++++++++--
1 file changed, 66 insertions(+), 4 deletions(-)
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index d78cdb9d01e5..ba0c3d8ce090 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -331,10 +331,10 @@ static struct page *alloc_largest_available(unsigned long size,
return NULL;
}
-static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
- unsigned long len,
- u32 fd_flags,
- u64 heap_flags)
+static struct dma_buf *__system_heap_allocate(struct dma_heap *heap,
+ struct dma_heap_file *heap_file,
+ unsigned long len, u32 fd_flags,
+ u64 heap_flags)
{
struct system_heap_buffer *buffer;
DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
@@ -346,6 +346,7 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
struct list_head pages;
struct page *page, *tmp_page;
int i, ret = -ENOMEM;
+ struct dma_heap_file_task *heap_ftask;
buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
if (!buffer)
@@ -357,6 +358,15 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
buffer->len = len;
INIT_LIST_HEAD(&pages);
+
+ if (heap_file) {
+ heap_ftask = dma_heap_declare_file_read(heap_file);
+ if (!heap_ftask) {
+ kfree(buffer);
+ return ERR_PTR(-ENOMEM);
+ }
+ }
+
i = 0;
while (size_remaining > 0) {
/*
@@ -372,6 +382,13 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
if (!page)
goto free_buffer;
+ /*
+ * When allocating and reading, gather each page into the read task.
+ * On error, free the buffer and return the error.
+ */
+ if (heap_file && dma_heap_gather_file_page(heap_ftask, page))
+ goto free_buffer;
+
list_add_tail(&page->lru, &pages);
size_remaining -= page_size(page);
max_order = compound_order(page);
@@ -400,9 +417,29 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
ret = PTR_ERR(dmabuf);
goto free_pages;
}
+
+ /*
+ * Allocation and dma-buf export are done, but the file read may
+ * still be in flight, so wait for it and then destroy the file
+ * task. If the read failed, abandon the dma-buf and return an
+ * error.
+ */
+ if (heap_file && dma_heap_end_file_read(heap_ftask)) {
+ dma_buf_put(dmabuf);
+ dmabuf = ERR_PTR(-EIO);
+ }
+
return dmabuf;
free_pages:
+ /*
+ * File reads may already be in flight, so wait for all running
+ * read works to finish before releasing the pages.
+ */
+ if (heap_file)
+ dma_heap_wait_for_file_read(heap_ftask);
+
for_each_sgtable_sg(table, sg, i) {
struct page *p = sg_page(sg);
@@ -410,6 +447,13 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
}
sg_free_table(table);
free_buffer:
+ /*
+ * File reads may already be in flight, so destroy the file task,
+ * including any running works, before releasing the pages.
+ */
+ if (heap_file)
+ dma_heap_end_file_read(heap_ftask);
+
list_for_each_entry_safe(page, tmp_page, &pages, lru)
__free_pages(page, compound_order(page));
kfree(buffer);
@@ -417,8 +461,26 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
return ERR_PTR(ret);
}
+static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
+ unsigned long len, u32 fd_flags,
+ u64 heap_flags)
+{
+ return __system_heap_allocate(heap, NULL, len, fd_flags, heap_flags);
+}
+
+static struct dma_buf *
+system_heap_allocate_async_read_file(struct dma_heap *heap,
+ struct dma_heap_file *heap_file,
+ u32 fd_flags, u64 heap_flags)
+{
+ return __system_heap_allocate(heap, heap_file,
+ PAGE_ALIGN(dma_heap_file_size(heap_file)),
+ fd_flags, heap_flags);
+}
+
static const struct dma_heap_ops system_heap_ops = {
.allocate = system_heap_allocate,
+ .allocate_async_read = system_heap_allocate_async_read_file,
};
static int system_heap_create(void)
--
2.45.2
* [PATCH v2 5/5] dma-buf: heaps: configurable async read gather limit
2024-07-30 7:57 [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag Huan Yang
` (3 preceding siblings ...)
2024-07-30 7:57 ` [PATCH v2 4/5] dma-buf: heaps: system_heap alloc support async read Huan Yang
@ 2024-07-30 7:57 ` Huan Yang
2024-07-30 8:03 ` [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag Christian König
2024-07-30 8:56 ` Daniel Vetter
6 siblings, 0 replies; 26+ messages in thread
From: Huan Yang @ 2024-07-30 7:57 UTC (permalink / raw)
To: Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Christian König, linux-media, dri-devel,
linaro-mm-sig, linux-kernel
Cc: opensource.kernel, Huan Yang
The gather limit currently defaults to 128MB, an empirically good value
for file I/O. However, system administrators should be free to tune it to
their system's needs.
This patch exposes the limit as a module parameter of dma-heap.
Signed-off-by: Huan Yang <link@vivo.com>
---
drivers/dma-buf/dma-heap.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index df1b2518f126..2b69cf3ca570 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -417,6 +417,11 @@ size_t dma_heap_file_size(struct dma_heap_file *heap_file)
return heap_file->fsize;
}
+#define DEFAULT_DMA_BUF_HEAPS_GATHER_LIMIT (128 << 20)
+static int dma_buf_heaps_gather_limit = DEFAULT_DMA_BUF_HEAPS_GATHER_LIMIT;
+module_param_named(gather_limit, dma_buf_heaps_gather_limit, int, 0644);
+MODULE_PARM_DESC(gather_limit, "Asynchronous file reading, with a maximum limit on the amount to be gathered");
+
static int init_dma_heap_file(struct dma_heap_file *heap_file, int file_fd)
{
struct file *file;
@@ -442,9 +447,8 @@ static int init_dma_heap_file(struct dma_heap_file *heap_file, int file_fd)
}
heap_file->file = file;
-#define DEFAULT_DMA_BUF_HEAPS_GATHER_LIMIT (128 << 20)
heap_file->glimit = min_t(size_t, PAGE_ALIGN(fsz),
- DEFAULT_DMA_BUF_HEAPS_GATHER_LIMIT);
+ PAGE_ALIGN(dma_buf_heaps_gather_limit));
heap_file->fsize = fsz;
heap_file->direct = file->f_flags & O_DIRECT;
--
2.45.2
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 7:57 [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag Huan Yang
` (4 preceding siblings ...)
2024-07-30 7:57 ` [PATCH v2 5/5] dma-buf: heaps: configurable async read gather limit Huan Yang
@ 2024-07-30 8:03 ` Christian König
2024-07-30 8:14 ` Huan Yang
2024-07-30 8:56 ` Daniel Vetter
6 siblings, 1 reply; 26+ messages in thread
From: Christian König @ 2024-07-30 8:03 UTC (permalink / raw)
To: Huan Yang, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, linux-media, dri-devel, linaro-mm-sig,
linux-kernel
Cc: opensource.kernel
Am 30.07.24 um 09:57 schrieb Huan Yang:
> Background
> ====
> Some user may need load file into dma-buf, current way is:
> 1. allocate a dma-buf, get dma-buf fd
> 2. mmap dma-buf fd into user vaddr
> 3. read(file_fd, vaddr, fsz)
> Due to dma-buf user map can't support direct I/O[1], the file read
> must be buffer I/O.
>
> This means that during the process of reading the file into dma-buf,
> page cache needs to be generated, and the corresponding content needs to
> be first copied to the page cache before being copied to the dma-buf.
>
> This way worked well when reading relatively small files before, as
> the page cache can cache the file content, thus improving performance.
>
> However, there are new challenges currently, especially as AI models are
> becoming larger and need to be shared between DMA devices and the CPU
> via dma-buf.
>
> For example, our 7B model file size is around 3.4GB. Using the
> previous approach would mean generating a total of 3.4GB of page cache
> (even if it will be reclaimed), and also requiring the copying of 3.4GB
> of content between page cache and dma-buf.
>
> Due to the limited resources of system memory, files in the gigabyte range
> cannot persist in memory indefinitely, so this portion of page cache may
> not provide much assistance for subsequent reads. Additionally, the
> existence of page cache will consume additional system resources due to
> the extra copying required by the CPU.
>
> Therefore, I think it is necessary for dma-buf to support direct I/O.
>
> However, direct I/O file reads cannot be performed using the buffer
> mmaped by the user space for the dma-buf.[1]
>
> Here are some discussions on implementing direct I/O using dma-buf:
>
> mmap[1]
> ---
> dma-buf has never supported direct I/O on the user-mapped vaddr.
>
> udmabuf[2]
> ---
> Currently, udmabuf can use the memfd method to read files into
> dma-buf in direct I/O mode.
>
> However, for large sizes the current udmabuf needs its size_limit
> (default 64MB) adjusted. Using udmabuf for files at the 3GB level is not
> a great fit; it needs some internal adjustments to handle this[3], or
> else creation fails.
>
> But, it is indeed a viable way to enable dma-buf to support direct I/O.
> However, it is necessary to initiate the file read after the memory allocation
> is completed, and handle race conditions carefully.
>
> sendfile/splice[4]
> ---
> Another way to enable dma-buf to support direct I/O is by implementing
> splice_write/write_iter in the dma-buf file operations (fops) to adapt
> to the sendfile method.
> However, the current sendfile/splice calls are based on pipe. When using
> direct I/O to read a file, the content needs to be copied to the buffer
> allocated by the pipe (default 64KB), and then the dma-buf fops'
> splice_write needs to be called to write the content into the dma-buf.
> This approach has to serially read a pipe-buffer-sized chunk of the file
> into the pipe and wait for it to be written into the dma-buf before
> reading the next chunk. (I/O performance is relatively weak under direct
> I/O.)
> Moreover, due to the existence of the pipe buffer, even when using
> direct I/O and not needing to generate additional page cache,
> there still needs to be a CPU copy.
>
> copy_file_range[5]
> ---
> As for copy_file_range, it only supports copying within the same file
> system, so it is similarly impractical.
>
>
> So, currently, there is no particularly suitable solution on VFS to
> allow dma-buf to support direct I/O for large file reads.
>
> This patchset provides an idea to complete file reads when requesting a
> dma-buf.
>
> Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
> ===
> This patch provides a method to immediately read the file content after
> the dma-buf is allocated, and only returns the dma-buf file descriptor
> after the file is fully read.
>
> Since the dma-buf file descriptor is not returned, no other thread can
> access it except for the current thread, so we don't need to worry about
> race conditions.
That is a completely false assumption.
>
> Map the dma-buf to the vmalloc area and initiate file reads in kernel
> space, supporting both buffer I/O and direct I/O.
>
> This patch adds the DMA_HEAP_ALLOC_AND_READ heap_flag for user.
> When a user needs to allocate a dma-buf and read a file, they should
> pass this heap flag. As the size of the file being read is fixed, there is no
> need to pass the 'len' parameter. Instead, The file_fd needs to be passed to
> indicate to the kernel the file that needs to be read.
>
> The file open flag determines the mode of file reading.
> But, please note that if direct I/O(O_DIRECT) is needed to read the file,
> the file size must be page aligned. (with patch 2-5, no need)
>
> Therefore, for the user, len and file_fd are mutually exclusive,
> and they are combined using a union.
>
> Once the user obtains the dma-buf fd, the dma-buf directly contains the
> file content.
And I'm repeating myself, but this is a complete NAK from my side to
this approach.
We pointed out multiple ways of how to implement this cleanly and not by
hacking functionality into the kernel which absolutely doesn't belong there.
Regards,
Christian.
>
> Patch 1 implement it.
>
> Patch 2-5 provides an approach for performance improvement.
>
> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
> synchronously read files using direct I/O.
>
> This approach helps to save CPU copying and avoid a certain degree of
> memory thrashing (page cache generation and reclamation)
>
> When dealing with large file sizes, the benefits of this approach become
> particularly significant.
>
> However, there are currently some methods that can improve performance,
> not just save system resources:
>
> Due to the large file size, for example, a AI 7B model of around 3.4GB, the
> time taken to allocate DMA-BUF memory will be relatively long. Waiting
> for the allocation to complete before reading the file will add to the
> overall time consumption. Therefore, the total time for DMA-BUF
> allocation and file read can be calculated using the formula
> T(total) = T(alloc) + T(I/O)
>
> However, if we change our approach, we don't necessarily need to wait
> for the DMA-BUF allocation to complete before initiating I/O. In fact,
> during the allocation process, we already hold a portion of the page,
> which means that waiting for subsequent page allocations to complete
> before carrying out file reads is actually unfair to the pages that have
> already been allocated.
>
> The allocation of pages is sequential, and the reading of the file is
> also sequential, with the content and size corresponding to the file.
> This means that the memory location for each page, which holds the
> content of a specific position in the file, can be determined at the
> time of allocation.
>
> However, to fully leverage I/O performance, it is best to wait and
> gather a certain number of pages before initiating batch processing.
>
> The default gather size is 128MB. Each gathered batch can be seen as one
> file read work: it maps the gathered pages into the vmalloc area to obtain
> a continuous virtual address, used as the buffer holding the corresponding
> part of the file. So when direct I/O reads the file, the content is
> written directly into the dma-buf memory without any additional copy
> (compared to a pipe buffer).
>
> Consider the other ways of reading into a dma-buf. Reading after mmap
> requires mapping the dma-buf pages into the user virtual address space,
> and udmabuf's memfd needs the same mapping. Even with sendfile support,
> the file copy still needs a buffer to be set up. So mapping the pages
> into the vmalloc area incurs no additional performance overhead compared
> to the other methods.[6]
>
> Certainly, the administrator can also modify the gather size through patch5.
>
> The formula for the time taken for system_heap buffer allocation and
> file reading through async_read is as follows:
>
> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
>
> Compared to the synchronous read:
> T(total) = T(alloc) + T(I/O)
>
> Whichever of allocation and I/O takes longer dominates the total time;
> the cost of the shorter one is hidden behind it.
>
> Therefore, the larger the size of the file that needs to be read, the
> greater the corresponding benefits will be.
>
> How to use
> ===
> Consider the current pathway for loading model files into DMA-BUF:
> 1. open dma-heap, get heap fd
> 2. open file, get file_fd(can't use O_DIRECT)
> 3. use file len to allocate dma-buf, get dma-buf fd
> 4. mmap dma-buf fd, get vaddr
> 5. read(file_fd, vaddr, file_size) into dma-buf pages
> 6. share, attach, whatever you want
>
> Use DMA_HEAP_ALLOC_AND_READ_FILE JUST a little change:
> 1. open dma-heap, get heap fd
> 2. open file, get file_fd(buffer/direct)
> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag, set file_fd
> instead of len. get dma-buf fd(contains file content)
> 4. share, attach, whatever you want
>
> So, test it is easy.
>
> How to test
> ===
> The performance comparison will be conducted for the following scenarios:
> 1. normal
> 2. udmabuf with [3] patch
> 3. sendfile
> 4. only patch 1
> 5. patch1 - patch4.
>
> normal:
> 1. open dma-heap, get heap fd
> 2. open file, get file_fd(can't use O_DIRECT)
> 3. use file len to allocate dma-buf, get dma-buf fd
> 4. mmap dma-buf fd, get vaddr
> 5. read(file_fd, vaddr, file_size) into dma-buf pages
> 6. share, attach, whatever you want
>
> UDMA-BUF step:
> 1. memfd_create
> 2. open file(buffer/direct)
> 3. udmabuf create
> 4. mmap memfd
> 5. read file into memfd vaddr
>
> Sendfile steps (requires suitable splice_write/write_iter support; used only for comparison):
> 1. open dma-heap, get heap fd
> 2. open file, get file_fd(buffer/direct)
> 3. use file len to allocate dma-buf, get dma-buf fd
> 4. sendfile file_fd to dma-buf fd
> 5. share, attach, whatever you want
>
> patch1/patch1-4:
> 1. open dma-heap, get heap fd
> 2. open file, get file_fd(buffer/direct)
> 3. allocate dma-buf with the DMA_HEAP_ALLOC_AND_READ_FILE heap flag, passing
>    file_fd instead of len; get a dma-buf fd (already containing the file content)
> 4. share, attach, whatever you want
>
> You can create a file to test it and compare the performance gap between
> the scenarios. It is best to compare file sizes ranging from KB to MB to GB.
>
> The following test data will compare the performance differences between 512KB,
> 8MB, 1GB, and 3GB under various scenarios.
>
> Performance Test
> ===
> 12GB RAM phone
> UFS 4.0 (maximum speed is 4GB/s),
> f2fs
> kernel 6.1 with patch[7] (otherwise kvec direct I/O reads are not supported)
> no memory pressure.
> drop_cache is used for each test.
>
> The average of 5 test results:
> | scheme-size | 512KB(ns) | 8MB(ns) | 1GB(ns) | 3GB(ns) |
> | ------------------- | ---------- | ---------- | ------------- | ------------- |
> | normal | 2,790,861 | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
> | udmabuf buffer I/O | 1,704,046 | 11,313,476 | 821,348,000 | 2,108,419,923 |
> | sendfile buffer I/O | 3,261,261 | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
> | patch1-4 buffer I/O | 2,064,538 | 10,771,474 | 986,338,800 | 2,187,570,861 |
> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
> | patch1 direct I/O | 813,215 | 6,962,092 | 2,364,211,877 | 5,648,897,554 |
> | udmabuf direct I/O | 1,289,554 | 8,968,138 | 921,480,784 | 2,158,305,738 |
> | patch1-4 direct I/O | 1,957,661 | 6,581,999 | 520,003,538 | 1,400,006,107 |
>
> So, based on the test results:
>
> When the file is large, the full patchset has the highest performance:
> compared to normal, the patchset is roughly a 50% improvement, while
> patch1 alone shows a degradation of about 41%.
> A typical performance breakdown for patch1 is as follows:
> 1. alloc cost 188,802,693 ns
> 2. vmap cost 42,491,385 ns
> 3. file read cost 4,180,876,702 ns
> Therefore, a single direct I/O read of a large file may not be the
> optimal approach for performance.
>
> The performance of direct I/O implemented by the sendfile method is the worst.
>
> When the file size is small, the performance difference is not
> significant, which is consistent with expectations.
>
>
>
> Suggested use cases
> ===
> 1. When there is a need to read large files and system resources are scarce,
>    especially when memory is limited (GB level). In this scenario, using
>    direct I/O for file reads can even bring performance improvements.
>    (may need patch2-3)
> 2. For embedded devices with limited RAM, direct I/O saves system
>    resources and avoids unnecessary data copying. So even though
>    performance is lower when reading small files, it can still be used
>    effectively.
> 3. If there is sufficient memory, pinning the page cache of the model files
>    in memory and placing the files in the EROFS file system for read-only
>    access may be better. (EROFS does not support direct I/O)
>
>
> Changelog
> ===
> v1 [8]
> v1->v2:
> Use a heap flag for alloc-and-read instead of adding a new
> DMA-buf ioctl command. [9]
> Split the patchset to facilitate review and testing:
> patch 1 implements alloc-and-read and adds the heap flag.
> patches 2-4 add async read.
> patch 5 makes the gather limit configurable.
>
> Reference
> ===
> [1] https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
> [2] https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
> [3] https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
> [4] https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
> [5] https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
> [6] https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
> [7] https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230209102954.528942-7-dhowells@redhat.com/
> [8] https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
> [9] https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
>
> Huan Yang (5):
> dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
> dma-buf: heaps: Introduce async alloc read ops
> dma-buf: heaps: support alloc async read file
> dma-buf: heaps: system_heap alloc support async read
> dma-buf: heaps: configurable async read gather limit
>
> drivers/dma-buf/dma-heap.c | 552 +++++++++++++++++++++++++++-
> drivers/dma-buf/heaps/system_heap.c | 70 +++-
> include/linux/dma-heap.h | 53 ++-
> include/uapi/linux/dma-heap.h | 11 +-
> 4 files changed, 673 insertions(+), 13 deletions(-)
>
>
> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 8:03 ` [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag Christian König
@ 2024-07-30 8:14 ` Huan Yang
2024-07-30 8:37 ` Christian König
2024-07-30 17:19 ` T.J. Mercier
0 siblings, 2 replies; 26+ messages in thread
From: Huan Yang @ 2024-07-30 8:14 UTC (permalink / raw)
To: Christian König, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, T.J. Mercier, linux-media, dri-devel,
linaro-mm-sig, linux-kernel
Cc: opensource.kernel
On 2024/7/30 16:03, Christian König wrote:
> On 30.07.24 at 09:57, Huan Yang wrote:
>> Background
>> ====
>> Some user may need load file into dma-buf, current way is:
>> 1. allocate a dma-buf, get dma-buf fd
>> 2. mmap dma-buf fd into user vaddr
>> 3. read(file_fd, vaddr, fsz)
>> Due to dma-buf user map can't support direct I/O[1], the file read
>> must be buffer I/O.
>>
>> This means that during the process of reading the file into dma-buf,
>> page cache needs to be generated, and the corresponding content needs to
>> be first copied to the page cache before being copied to the dma-buf.
>>
>> This way worked well when reading relatively small files before, as
>> the page cache can cache the file content, thus improving performance.
>>
>> However, there are new challenges currently, especially as AI models are
>> becoming larger and need to be shared between DMA devices and the CPU
>> via dma-buf.
>>
>> For example, our 7B model file size is around 3.4GB. Using the
>> previous would mean generating a total of 3.4GB of page cache
>> (even if it will be reclaimed), and also requiring the copying of 3.4GB
>> of content between page cache and dma-buf.
>>
>> Due to the limited resources of system memory, files in the gigabyte
>> range
>> cannot persist in memory indefinitely, so this portion of page cache may
>> not provide much assistance for subsequent reads. Additionally, the
>> existence of page cache will consume additional system resources due to
>> the extra copying required by the CPU.
>>
>> Therefore, I think it is necessary for dma-buf to support direct I/O.
>>
>> However, direct I/O file reads cannot be performed using the buffer
>> mmaped by the user space for the dma-buf.[1]
>>
>> Here are some discussions on implementing direct I/O using dma-buf:
>>
>> mmap[1]
>> ---
>> dma-buf never support user map vaddr use of direct I/O.
>>
>> udmabuf[2]
>> ---
>> Currently, udmabuf can use the memfd method to read files into
>> dma-buf in direct I/O mode.
>>
>> However, if the size is large, the current udmabuf needs to adjust the
>> corresponding size_limit(default 64MB).
>> But using udmabuf for files at the 3GB level is not a very good
>> approach.
>> It needs to make some adjustments internally to handle this.[3] Or else,
>> fail create.
>>
>> But, it is indeed a viable way to enable dma-buf to support direct I/O.
>> However, it is necessary to initiate the file read after the memory
>> allocation
>> is completed, and handle race conditions carefully.
>>
>> sendfile/splice[4]
>> ---
>> Another way to enable dma-buf to support direct I/O is by implementing
>> splice_write/write_iter in the dma-buf file operations (fops) to adapt
>> to the sendfile method.
>> However, the current sendfile/splice calls are based on pipe. When using
>> direct I/O to read a file, the content needs to be copied to the buffer
>> allocated by the pipe (default 64KB), and then the dma-buf fops'
>> splice_write needs to be called to write the content into the dma-buf.
>> This approach requires serially reading the content of file pipe size
>> into the pipe buffer and then waiting for the dma-buf to be written
>> before reading the next one.(The I/O performance is relatively weak
>> under direct I/O.)
>> Moreover, due to the existence of the pipe buffer, even when using
>> direct I/O and not needing to generate additional page cache,
>> there still needs to be a CPU copy.
>>
>> copy_file_range[5]
>> ---
>> Consider of copy_file_range, It only supports copying files within the
>> same file system. Similarly, it is not very practical.
>>
>>
>> So, currently, there is no particularly suitable solution on VFS to
>> allow dma-buf to support direct I/O for large file reads.
>>
>> This patchset provides an idea to complete file reads when requesting a
>> dma-buf.
>>
>> Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>> ===
>> This patch provides a method to immediately read the file content after
>> the dma-buf is allocated, and only returns the dma-buf file descriptor
>> after the file is fully read.
>>
>> Since the dma-buf file descriptor is not returned, no other thread can
>> access it except for the current thread, so we don't need to worry about
>> race conditions.
>
> That is a completely false assumption.
Can you provide a detailed explanation of why this assumption is
incorrect? Thanks.
>
>>
>> Map the dma-buf to the vmalloc area and initiate file reads in kernel
>> space, supporting both buffer I/O and direct I/O.
>>
>> This patch adds the DMA_HEAP_ALLOC_AND_READ heap_flag for user.
>> When a user needs to allocate a dma-buf and read a file, they should
>> pass this heap flag. As the size of the file being read is fixed,
>> there is no
>> need to pass the 'len' parameter. Instead, The file_fd needs to be
>> passed to
>> indicate to the kernel the file that needs to be read.
>>
>> The file open flag determines the mode of file reading.
>> But, please note that if direct I/O(O_DIRECT) is needed to read the
>> file,
>> the file size must be page aligned. (with patch 2-5, no need)
>>
>> Therefore, for the user, len and file_fd are mutually exclusive,
>> and they are combined using a union.
>>
>> Once the user obtains the dma-buf fd, the dma-buf directly contains the
>> file content.
>
> And I'm repeating myself, but this is a complete NAK from my side to
> this approach.
>
> We pointed out multiple ways of how to implement this cleanly and not
> by hacking functionality into the kernel which absolutely doesn't
> belong there.
In this patchset, I have provided performance comparisons of each of
these methods. Can you please provide more opinions?
>
> Regards,
> Christian.
>
>>
>> Patch 1 implement it.
>>
>> Patch 2-5 provides an approach for performance improvement.
>>
>> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
>> synchronously read files using direct I/O.
>>
>> This approach helps to save CPU copying and avoid a certain degree of
>> memory thrashing (page cache generation and reclamation)
>>
>> When dealing with large file sizes, the benefits of this approach become
>> particularly significant.
>>
>> However, there are currently some methods that can improve performance,
>> not just save system resources:
>>
>> Due to the large file size, for example, a AI 7B model of around
>> 3.4GB, the
>> time taken to allocate DMA-BUF memory will be relatively long. Waiting
>> for the allocation to complete before reading the file will add to the
>> overall time consumption. Therefore, the total time for DMA-BUF
>> allocation and file read can be calculated using the formula
>> T(total) = T(alloc) + T(I/O)
>>
>> However, if we change our approach, we don't necessarily need to wait
>> for the DMA-BUF allocation to complete before initiating I/O. In fact,
>> during the allocation process, we already hold a portion of the page,
>> which means that waiting for subsequent page allocations to complete
>> before carrying out file reads is actually unfair to the pages that have
>> already been allocated.
>>
>> The allocation of pages is sequential, and the reading of the file is
>> also sequential, with the content and size corresponding to the file.
>> This means that the memory location for each page, which holds the
>> content of a specific position in the file, can be determined at the
>> time of allocation.
>>
>> However, to fully leverage I/O performance, it is best to wait and
>> gather a certain number of pages before initiating batch processing.
>>
>> The default gather size is 128MB. So, ever gathered can see as a file
>> read
>> work, it maps the gather page to the vmalloc area to obtain a continuous
>> virtual address, which is used as a buffer to store the contents of the
>> corresponding file. So, if using direct I/O to read a file, the file
>> content will be written directly to the corresponding dma-buf buffer
>> memory
>> without any additional copying.(compare to pipe buffer.)
>>
>> Consider other ways to read into dma-buf. If we assume reading after
>> mmap
>> dma-buf, we need to map the pages of the dma-buf to the user virtual
>> address space. Also, udmabuf memfd need do this operations too.
>> Even if we support sendfile, the file copy also need buffer, you must
>> setup it.
>> So, mapping pages to the vmalloc area does not incur any additional
>> performance overhead compared to other methods.[6]
>>
>> Certainly, the administrator can also modify the gather size through
>> patch5.
>>
>> The formula for the time taken for system_heap buffer allocation and
>> file reading through async_read is as follows:
>>
>> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
>>
>> Compared to the synchronous read:
>> T(total) = T(alloc) + T(I/O)
>>
>> If the allocation time or I/O time is long, the time difference will be
>> covered by the maximum value between the allocation and I/O. The other
>> party will be concealed.
>>
>> Therefore, the larger the size of the file that needs to be read, the
>> greater the corresponding benefits will be.
>>
>> How to use
>> ===
>> Consider the current pathway for loading model files into DMA-BUF:
>> 1. open dma-heap, get heap fd
>> 2. open file, get file_fd(can't use O_DIRECT)
>> 3. use file len to allocate dma-buf, get dma-buf fd
>> 4. mmap dma-buf fd, get vaddr
>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>> 6. share, attach, whatever you want
>>
>> Use DMA_HEAP_ALLOC_AND_READ_FILE JUST a little change:
>> 1. open dma-heap, get heap fd
>> 2. open file, get file_fd(buffer/direct)
>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
>> set file_fd
>> instead of len. get dma-buf fd(contains file content)
>> 4. share, attach, whatever you want
>>
>> So, test it is easy.
>>
>> How to test
>> ===
>> The performance comparison will be conducted for the following
>> scenarios:
>> 1. normal
>> 2. udmabuf with [3] patch
>> 3. sendfile
>> 4. only patch 1
>> 5. patch1 - patch4.
>>
>> normal:
>> 1. open dma-heap, get heap fd
>> 2. open file, get file_fd(can't use O_DIRECT)
>> 3. use file len to allocate dma-buf, get dma-buf fd
>> 4. mmap dma-buf fd, get vaddr
>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>> 6. share, attach, whatever you want
>>
>> UDMA-BUF step:
>> 1. memfd_create
>> 2. open file(buffer/direct)
>> 3. udmabuf create
>> 4. mmap memfd
>> 5. read file into memfd vaddr
>>
>> Sendfile step(need suit splice_write/write_iter, just use to compare):
>> 1. open dma-heap, get heap fd
>> 2. open file, get file_fd(buffer/direct)
>> 3. use file len to allocate dma-buf, get dma-buf fd
>> 4. sendfile file_fd to dma-buf fd
>> 6. share, attach, whatever you want
>>
>> patch1/patch1-4:
>> 1. open dma-heap, get heap fd
>> 2. open file, get file_fd(buffer/direct)
>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
>> set file_fd
>> instead of len. get dma-buf fd(contains file content)
>> 4. share, attach, whatever you want
>>
>> You can create a file to test it. Compare the performance gap between
>> the two.
>> It is best to compare the differences in file size from KB to MB to GB.
>>
>> The following test data will compare the performance differences
>> between 512KB,
>> 8MB, 1GB, and 3GB under various scenarios.
>>
>> Performance Test
>> ===
>> 12G RAM phone
>> UFS4.0(the maximum speed is 4GB/s. ),
>> f2fs
>> kernel 6.1 with patch[7] (or else, can't support kvec direct I/O
>> read.)
>> no memory pressure.
>> drop_cache is used for each test.
>>
>> The average of 5 test results:
>> | scheme-size | 512KB(ns) | 8MB(ns) | 1GB(ns) |
>> 3GB(ns) |
>> | ------------------- | ---------- | ---------- | ------------- |
>> ------------- |
>> | normal | 2,790,861 | 14,535,784 | 1,520,790,492 |
>> 3,332,438,754 |
>> | udmabuf buffer I/O | 1,704,046 | 11,313,476 | 821,348,000 |
>> 2,108,419,923 |
>> | sendfile buffer I/O | 3,261,261 | 12,112,292 | 1,565,939,938 |
>> 3,062,052,984 |
>> | patch1-4 buffer I/O | 2,064,538 | 10,771,474 | 986,338,800 |
>> 2,187,570,861 |
>> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 |
>> 9,777,661,077 |
>> | patch1 direct I/O | 813,215 | 6,962,092 | 2,364,211,877 |
>> 5,648,897,554 |
>> | udmabuf direct I/O | 1,289,554 | 8,968,138 | 921,480,784 |
>> 2,158,305,738 |
>> | patch1-4 direct I/O | 1,957,661 | 6,581,999 | 520,003,538 |
>> 1,400,006,107 |
With this test, sendfile does not help much because it is based on the
pipe buffer. udmabuf is good, but I think our OEM driver cannot adopt it.
(And AOSP does not enable this feature.)
Anyway, I am sending this patchset in the hope of further discussion.
Thanks.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 8:14 ` Huan Yang
@ 2024-07-30 8:37 ` Christian König
2024-07-30 8:46 ` Huan Yang
2024-07-30 17:19 ` T.J. Mercier
1 sibling, 1 reply; 26+ messages in thread
From: Christian König @ 2024-07-30 8:37 UTC (permalink / raw)
To: Huan Yang, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, linux-media, dri-devel, linaro-mm-sig,
linux-kernel
Cc: opensource.kernel
On 30.07.24 at 10:14, Huan Yang wrote:
> On 2024/7/30 16:03, Christian König wrote:
>> On 30.07.24 at 09:57, Huan Yang wrote:
>>> Background
>>> ====
>>> Some user may need load file into dma-buf, current way is:
>>> 1. allocate a dma-buf, get dma-buf fd
>>> 2. mmap dma-buf fd into user vaddr
>>> 3. read(file_fd, vaddr, fsz)
>>> Due to dma-buf user map can't support direct I/O[1], the file read
>>> must be buffer I/O.
>>>
>>> This means that during the process of reading the file into dma-buf,
>>> page cache needs to be generated, and the corresponding content
>>> needs to
>>> be first copied to the page cache before being copied to the dma-buf.
>>>
>>> This way worked well when reading relatively small files before, as
>>> the page cache can cache the file content, thus improving performance.
>>>
>>> However, there are new challenges currently, especially as AI models
>>> are
>>> becoming larger and need to be shared between DMA devices and the CPU
>>> via dma-buf.
>>>
>>> For example, our 7B model file size is around 3.4GB. Using the
>>> previous would mean generating a total of 3.4GB of page cache
>>> (even if it will be reclaimed), and also requiring the copying of 3.4GB
>>> of content between page cache and dma-buf.
>>>
>>> Due to the limited resources of system memory, files in the gigabyte
>>> range
>>> cannot persist in memory indefinitely, so this portion of page cache
>>> may
>>> not provide much assistance for subsequent reads. Additionally, the
>>> existence of page cache will consume additional system resources due to
>>> the extra copying required by the CPU.
>>>
>>> Therefore, I think it is necessary for dma-buf to support direct I/O.
>>>
>>> However, direct I/O file reads cannot be performed using the buffer
>>> mmaped by the user space for the dma-buf.[1]
>>>
>>> Here are some discussions on implementing direct I/O using dma-buf:
>>>
>>> mmap[1]
>>> ---
>>> dma-buf never support user map vaddr use of direct I/O.
>>>
>>> udmabuf[2]
>>> ---
>>> Currently, udmabuf can use the memfd method to read files into
>>> dma-buf in direct I/O mode.
>>>
>>> However, if the size is large, the current udmabuf needs to adjust the
>>> corresponding size_limit(default 64MB).
>>> But using udmabuf for files at the 3GB level is not a very good
>>> approach.
>>> It needs to make some adjustments internally to handle this.[3] Or
>>> else,
>>> fail create.
>>>
>>> But, it is indeed a viable way to enable dma-buf to support direct I/O.
>>> However, it is necessary to initiate the file read after the memory
>>> allocation
>>> is completed, and handle race conditions carefully.
>>>
>>> sendfile/splice[4]
>>> ---
>>> Another way to enable dma-buf to support direct I/O is by implementing
>>> splice_write/write_iter in the dma-buf file operations (fops) to adapt
>>> to the sendfile method.
>>> However, the current sendfile/splice calls are based on pipe. When
>>> using
>>> direct I/O to read a file, the content needs to be copied to the buffer
>>> allocated by the pipe (default 64KB), and then the dma-buf fops'
>>> splice_write needs to be called to write the content into the dma-buf.
>>> This approach requires serially reading the content of file pipe size
>>> into the pipe buffer and then waiting for the dma-buf to be written
>>> before reading the next one.(The I/O performance is relatively weak
>>> under direct I/O.)
>>> Moreover, due to the existence of the pipe buffer, even when using
>>> direct I/O and not needing to generate additional page cache,
>>> there still needs to be a CPU copy.
>>>
>>> copy_file_range[5]
>>> ---
>>> Consider of copy_file_range, It only supports copying files within the
>>> same file system. Similarly, it is not very practical.
>>>
>>>
>>> So, currently, there is no particularly suitable solution on VFS to
>>> allow dma-buf to support direct I/O for large file reads.
>>>
>>> This patchset provides an idea to complete file reads when requesting a
>>> dma-buf.
>>>
>>> Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>> ===
>>> This patch provides a method to immediately read the file content after
>>> the dma-buf is allocated, and only returns the dma-buf file descriptor
>>> after the file is fully read.
>>>
>>> Since the dma-buf file descriptor is not returned, no other thread can
>>> access it except for the current thread, so we don't need to worry
>>> about
>>> race conditions.
>>
>> That is a completely false assumption.
> Can you provide a detailed explanation as to why this assumption is
> incorrect? thanks.
File descriptors can be guessed and are available to userspace as soon as
dma_buf_fd() is called.
What could potentially work is to call system_heap_allocate() without
calling dma_buf_fd(), but I'm not sure whether you can then perform I/O
on the underlying pages.
>>
>>>
>>> Map the dma-buf to the vmalloc area and initiate file reads in kernel
>>> space, supporting both buffer I/O and direct I/O.
>>>
>>> This patch adds the DMA_HEAP_ALLOC_AND_READ heap_flag for user.
>>> When a user needs to allocate a dma-buf and read a file, they should
>>> pass this heap flag. As the size of the file being read is fixed,
>>> there is no
>>> need to pass the 'len' parameter. Instead, The file_fd needs to be
>>> passed to
>>> indicate to the kernel the file that needs to be read.
>>>
>>> The file open flag determines the mode of file reading.
>>> But, please note that if direct I/O(O_DIRECT) is needed to read the
>>> file,
>>> the file size must be page aligned. (with patch 2-5, no need)
>>>
>>> Therefore, for the user, len and file_fd are mutually exclusive,
>>> and they are combined using a union.
>>>
>>> Once the user obtains the dma-buf fd, the dma-buf directly contains the
>>> file content.
>>
>> And I'm repeating myself, but this is a complete NAK from my side to
>> this approach.
>>
>> We pointed out multiple ways of how to implement this cleanly and not
>> by hacking functionality into the kernel which absolutely doesn't
>> belong there.
> In this patchset, I have provided performance comparisons of each of
> these methods. Can you please provide more opinions?
Either drop the whole approach or change udmabuf to do what you want to do.
Apart from that I don't see a feasible way that could be accepted into
the kernel.
Regards,
Christian.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Patch 1 implement it.
>>>
>>> Patch 2-5 provides an approach for performance improvement.
>>>
>>> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
>>> synchronously read files using direct I/O.
>>>
>>> This approach helps to save CPU copying and avoid a certain degree of
>>> memory thrashing (page cache generation and reclamation)
>>>
>>> When dealing with large file sizes, the benefits of this approach
>>> become
>>> particularly significant.
>>>
>>> However, there are currently some methods that can improve performance,
>>> not just save system resources:
>>>
>>> Due to the large file size, for example, a AI 7B model of around
>>> 3.4GB, the
>>> time taken to allocate DMA-BUF memory will be relatively long. Waiting
>>> for the allocation to complete before reading the file will add to the
>>> overall time consumption. Therefore, the total time for DMA-BUF
>>> allocation and file read can be calculated using the formula
>>> T(total) = T(alloc) + T(I/O)
>>>
>>> However, if we change our approach, we don't necessarily need to wait
>>> for the DMA-BUF allocation to complete before initiating I/O. In fact,
>>> during the allocation process, we already hold a portion of the page,
>>> which means that waiting for subsequent page allocations to complete
>>> before carrying out file reads is actually unfair to the pages that
>>> have
>>> already been allocated.
>>>
>>> The allocation of pages is sequential, and the reading of the file is
>>> also sequential, with the content and size corresponding to the file.
>>> This means that the memory location for each page, which holds the
>>> content of a specific position in the file, can be determined at the
>>> time of allocation.
>>>
>>> However, to fully leverage I/O performance, it is best to wait and
>>> gather a certain number of pages before initiating batch processing.
>>>
>>> The default gather size is 128MB. So, ever gathered can see as a
>>> file read
>>> work, it maps the gather page to the vmalloc area to obtain a
>>> continuous
>>> virtual address, which is used as a buffer to store the contents of the
>>> corresponding file. So, if using direct I/O to read a file, the file
>>> content will be written directly to the corresponding dma-buf buffer
>>> memory
>>> without any additional copying.(compare to pipe buffer.)
>>>
>>> Consider other ways to read into dma-buf. If we assume reading after
>>> mmap
>>> dma-buf, we need to map the pages of the dma-buf to the user virtual
>>> address space. Also, udmabuf memfd need do this operations too.
>>> Even if we support sendfile, the file copy also need buffer, you must
>>> setup it.
>>> So, mapping pages to the vmalloc area does not incur any additional
>>> performance overhead compared to other methods.[6]
>>>
>>> Certainly, the administrator can also modify the gather size through
>>> patch5.
>>>
>>> The formula for the time taken for system_heap buffer allocation and
>>> file reading through async_read is as follows:
>>>
>>> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
>>>
>>> Compared to the synchronous read:
>>> T(total) = T(alloc) + T(I/O)
>>>
>>> If the allocation time or I/O time is long, the time difference will be
>>> covered by the maximum value between the allocation and I/O. The other
>>> party will be concealed.
>>>
>>> Therefore, the larger the size of the file that needs to be read, the
>>> greater the corresponding benefits will be.
>>>
>>> How to use
>>> ===
>>> Consider the current pathway for loading model files into DMA-BUF:
>>> 1. open dma-heap, get heap fd
>>> 2. open file, get file_fd(can't use O_DIRECT)
>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>> 4. mmap dma-buf fd, get vaddr
>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>> 6. share, attach, whatever you want
>>>
>>> Using DMA_HEAP_ALLOC_AND_READ_FILE needs just a small change:
>>> 1. open dma-heap, get heap fd
>>> 2. open file, get file_fd(buffer/direct)
>>> 3. allocate dma-buf with the DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
>>>    passing file_fd instead of len; get a dma-buf fd that already
>>>    contains the file content
>>> 4. share, attach, whatever you want
>>>
>>> So, it is easy to test.
>>>
>>> How to test
>>> ===
>>> The performance comparison will be conducted for the following
>>> scenarios:
>>> 1. normal
>>> 2. udmabuf with [3] patch
>>> 3. sendfile
>>> 4. only patch 1
>>> 5. patch1 - patch4.
>>>
>>> normal:
>>> 1. open dma-heap, get heap fd
>>> 2. open file, get file_fd(can't use O_DIRECT)
>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>> 4. mmap dma-buf fd, get vaddr
>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>> 6. share, attach, whatever you want
>>>
>>> UDMA-BUF step:
>>> 1. memfd_create
>>> 2. open file(buffer/direct)
>>> 3. udmabuf create
>>> 4. mmap memfd
>>> 5. read file into memfd vaddr
>>>
>>> Sendfile steps (requires suitable splice_write/write_iter support;
>>> used only for comparison):
>>> 1. open dma-heap, get heap fd
>>> 2. open file, get file_fd(buffer/direct)
>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>> 4. sendfile file_fd to dma-buf fd
>>> 5. share, attach, whatever you want
>>>
>>> patch1/patch1-4:
>>> 1. open dma-heap, get heap fd
>>> 2. open file, get file_fd(buffer/direct)
>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
>>> set file_fd
>>> instead of len. get dma-buf fd(contains file content)
>>> 4. share, attach, whatever you want
>>>
>>> You can create a file to test it and compare the performance gap
>>> between the schemes. It is best to compare file sizes ranging from
>>> KB to MB to GB.
>>>
>>> The following test data compares the performance differences between
>>> 512KB, 8MB, 1GB, and 3GB files under the various scenarios.
>>>
>>> Performance Test
>>> ===
>>> 12G RAM phone,
>>> UFS 4.0 (maximum speed 4GB/s),
>>> f2fs,
>>> kernel 6.1 with patch[7] (otherwise kvec direct I/O reads are not
>>> supported),
>>> no memory pressure.
>>> drop_caches is used before each test.
>>>
>>> The average of 5 test results:
>>> | scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns)       | 3GB(ns)       |
>>> | ------------------- | ---------- | ---------- | ------------- | ------------- |
>>> | normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
>>> | udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000   | 2,108,419,923 |
>>> | sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
>>> | patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800   | 2,187,570,861 |
>>> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
>>> | patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 5,648,897,554 |
>>> | udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784   | 2,158,305,738 |
>>> | patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538   | 1,400,006,107 |
>
> With this test, sendfile, being based on the pipe buffer, does not
> help much.
>
> udmabuf is good, but I think our OEM driver can't adapt to it. (Also,
> AOSP does not enable this feature.)
>
>
> Anyway, I am sending this patchset in the hope of further discussion.
>
> Thanks.
>
>>>
>>> So, based on the test results:
>>>
>>> When the file is large, the full patchset has the highest performance.
>>> Compared to normal, the patchset is about a 50% improvement, while
>>> patch1 alone shows a degradation of about 41%.
>>> A typical patch1 performance breakdown is as follows:
>>> 1. alloc cost 188,802,693 ns
>>> 2. vmap cost 42,491,385 ns
>>> 3. file read cost 4,180,876,702 ns
>>> Therefore, a single direct I/O read of a large file may not be the
>>> optimal approach for performance.
>>>
>>> The direct I/O implementation based on sendfile performs the worst.
>>>
>>> When the file size is small, the difference in performance is not
>>> significant. This is consistent with expectations.
>>>
>>>
>>>
>>> Suggested use cases
>>> ===
>>> 1. When there is a need to read large files and system resources
>>> are scarce, especially when memory is limited (GB level). In this
>>> scenario, using direct I/O for the file read can even bring a
>>> performance improvement. (may need patch2-3)
>>> 2. For embedded devices with limited RAM, direct I/O saves system
>>> resources and avoids unnecessary data copying. So even though
>>> performance is lower when reading small files, it can still be
>>> used effectively.
>>> 3. If there is sufficient memory, pinning the page cache of the
>>> model files in memory and placing the files on an EROFS file
>>> system for read-only access may be better. (EROFS does not
>>> support direct I/O)
>>>
>>>
>>> Changelog
>>> ===
>>> v1 [8]
>>> v1->v2:
>>> Use a heap flag for alloc-and-read instead of adding a new
>>> DMA-buf ioctl command. [9]
>>> Split the patchset to facilitate review and testing.
>>> patch 1 implements alloc-and-read, adding the heap flag.
>>> patch 2-4 add async read.
>>> patch 5 makes the gather limit configurable.
>>>
>>> Reference
>>> ===
>>> [1]
>>> https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
>>> [2] https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
>>> [3] https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>>> [4] https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
>>> [5] https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
>>> [6]
>>> https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
>>> [7]
>>> https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230209102954.528942-7-dhowells@redhat.com/
>>> [8] https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
>>> [9]
>>> https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
>>>
>>> Huan Yang (5):
>>> dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>> dma-buf: heaps: Introduce async alloc read ops
>>> dma-buf: heaps: support alloc async read file
>>> dma-buf: heaps: system_heap alloc support async read
>>> dma-buf: heaps: configurable async read gather limit
>>>
>>> drivers/dma-buf/dma-heap.c | 552
>>> +++++++++++++++++++++++++++-
>>> drivers/dma-buf/heaps/system_heap.c | 70 +++-
>>> include/linux/dma-heap.h | 53 ++-
>>> include/uapi/linux/dma-heap.h | 11 +-
>>> 4 files changed, 673 insertions(+), 13 deletions(-)
>>>
>>>
>>> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
>>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 8:37 ` Christian König
@ 2024-07-30 8:46 ` Huan Yang
2024-07-30 10:43 ` Christian König
0 siblings, 1 reply; 26+ messages in thread
From: Huan Yang @ 2024-07-30 8:46 UTC (permalink / raw)
To: Christian König, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, T.J. Mercier, linux-media, dri-devel,
linaro-mm-sig, linux-kernel
Cc: opensource.kernel
On 2024/7/30 16:37, Christian König wrote:
> On 30.07.24 10:14, Huan Yang wrote:
>> On 2024/7/30 16:03, Christian König wrote:
>>> On 30.07.24 09:57, Huan Yang wrote:
>>>> Background
>>>> ====
>>>> Some users may need to load a file into a dma-buf; the current way is:
>>>> 1. allocate a dma-buf, get dma-buf fd
>>>> 2. mmap dma-buf fd into user vaddr
>>>> 3. read(file_fd, vaddr, fsz)
>>>> Since the dma-buf user mapping cannot support direct I/O[1], the file
>>>> read must use buffered I/O.
>>>>
>>>> This means that during the process of reading the file into dma-buf,
>>>> page cache needs to be generated, and the corresponding content
>>>> needs to
>>>> be first copied to the page cache before being copied to the dma-buf.
>>>>
>>>> This way worked well when reading relatively small files before, as
>>>> the page cache can cache the file content, thus improving performance.
>>>>
>>>> However, there are new challenges currently, especially as AI
>>>> models are
>>>> becoming larger and need to be shared between DMA devices and the CPU
>>>> via dma-buf.
>>>>
>>>> For example, our 7B model file size is around 3.4GB. Using the
>>>> previous would mean generating a total of 3.4GB of page cache
>>>> (even if it will be reclaimed), and also requiring the copying of
>>>> 3.4GB
>>>> of content between page cache and dma-buf.
>>>>
>>>> Due to the limited resources of system memory, files in the
>>>> gigabyte range
>>>> cannot persist in memory indefinitely, so this portion of page
>>>> cache may
>>>> not provide much assistance for subsequent reads. Additionally, the
>>>> existence of page cache will consume additional system resources
>>>> due to
>>>> the extra copying required by the CPU.
>>>>
>>>> Therefore, I think it is necessary for dma-buf to support direct I/O.
>>>>
>>>> However, direct I/O file reads cannot be performed using the buffer
>>>> mmaped by the user space for the dma-buf.[1]
>>>>
>>>> Here are some discussions on implementing direct I/O using dma-buf:
>>>>
>>>> mmap[1]
>>>> ---
>>>> A dma-buf's user-mapped vaddr has never supported direct I/O.
>>>>
>>>> udmabuf[2]
>>>> ---
>>>> Currently, udmabuf can use the memfd method to read files into
>>>> dma-buf in direct I/O mode.
>>>>
>>>> However, if the size is large, udmabuf needs its corresponding
>>>> size_limit (default 64MB) adjusted. Using udmabuf for files at the
>>>> 3GB level is not a very good approach; it needs some internal
>>>> adjustments to handle this[3], or else creation fails.
>>>>
>>>> But, it is indeed a viable way to enable dma-buf to support direct
>>>> I/O.
>>>> However, it is necessary to initiate the file read after the memory
>>>> allocation
>>>> is completed, and handle race conditions carefully.
>>>>
>>>> sendfile/splice[4]
>>>> ---
>>>> Another way to enable dma-buf to support direct I/O is by implementing
>>>> splice_write/write_iter in the dma-buf file operations (fops) to adapt
>>>> to the sendfile method.
>>>> However, the current sendfile/splice calls are based on pipe. When
>>>> using
>>>> direct I/O to read a file, the content needs to be copied to the
>>>> buffer
>>>> allocated by the pipe (default 64KB), and then the dma-buf fops'
>>>> splice_write needs to be called to write the content into the dma-buf.
>>>> This approach requires serially reading pipe-buffer-sized chunks of
>>>> the file into the pipe buffer and waiting for each chunk to be
>>>> written to the dma-buf before reading the next one. (The I/O
>>>> performance is relatively weak under direct I/O.)
>>>> Moreover, due to the existence of the pipe buffer, even when using
>>>> direct I/O and not needing to generate additional page cache,
>>>> there still needs to be a CPU copy.
>>>>
>>>> copy_file_range[5]
>>>> ---
>>>> As for copy_file_range, it only supports copying files within the
>>>> same file system, so it is similarly impractical.
>>>>
>>>>
>>>> So, currently, there is no particularly suitable solution on VFS to
>>>> allow dma-buf to support direct I/O for large file reads.
>>>>
>>>> This patchset provides an idea to complete file reads when
>>>> requesting a
>>>> dma-buf.
>>>>
>>>> Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>> ===
>>>> This patch provides a method to immediately read the file content
>>>> after
>>>> the dma-buf is allocated, and only returns the dma-buf file descriptor
>>>> after the file is fully read.
>>>>
>>>> Since the dma-buf file descriptor is not returned, no other thread can
>>>> access it except for the current thread, so we don't need to worry
>>>> about
>>>> race conditions.
>>>
>>> That is a completely false assumption.
>> Can you provide a detailed explanation as to why this assumption is
>> incorrect? thanks.
>
> File descriptors can be guessed and is available to userspace as soon
> as dma_buf_fd() is called.
>
> What could potentially work is to call system_heap_allocate() without
> calling dma_buf_fd(), but I'm not sure if you can then make I/O to the
> underlying pages.
Actually, the dma-buf file descriptor is obtained only after a
successful file read in the code, so there is no issue.
If you are interested, you can take a look at the
dma_heap_buffer_alloc_and_read function in patch1.
>
>>>
>>>>
>>>> Map the dma-buf to the vmalloc area and initiate file reads in kernel
>>>> space, supporting both buffer I/O and direct I/O.
>>>>
>>>> This patch adds the DMA_HEAP_ALLOC_AND_READ heap_flag for user.
>>>> When a user needs to allocate a dma-buf and read a file, they should
>>>> pass this heap flag. As the size of the file being read is fixed,
>>>> there is no need to pass the 'len' parameter. Instead, the file_fd
>>>> needs to be passed to indicate to the kernel the file that needs to
>>>> be read.
>>>>
>>>> The file open flags determine the file read mode.
>>>> Please note that if direct I/O (O_DIRECT) is needed to read the
>>>> file, the file size must be page aligned. (with patches 2-5, this
>>>> is not needed)
>>>>
>>>> Therefore, for the user, len and file_fd are mutually exclusive,
>>>> and they are combined using a union.
>>>>
>>>> Once the user obtains the dma-buf fd, the dma-buf directly contains
>>>> the
>>>> file content.
>>>
>>> And I'm repeating myself, but this is a complete NAK from my side to
>>> this approach.
>>>
>>> We pointed out multiple ways of how to implement this cleanly and
>>> not by hacking functionality into the kernel which absolutely
>>> doesn't belong there.
>> In this patchset, I have provided performance comparisons of each of
>> these methods. Can you please provide more opinions?
>
> Either drop the whole approach or change udmabuf to do what you want
> to do.
OK, if so, do I need to send a patch to make dma-buf support sendfile?
>
> Apart from that I don't see a doable way which can be accepted into
> the kernel.
Thanks for your suggestion.
>
> Regards,
> Christian.
>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Patch 1 implement it.
>>>>
>>>> Patch 2-5 provides an approach for performance improvement.
>>>>
>>>> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
>>>> synchronously read files using direct I/O.
>>>>
>>>> This approach helps to save CPU copying and avoid a certain degree of
>>>> memory thrashing (page cache generation and reclamation)
>>>>
>>>> When dealing with large file sizes, the benefits of this approach
>>>> become
>>>> particularly significant.
>>>>
>>>> However, there are currently some methods that can improve
>>>> performance,
>>>> not just save system resources:
>>>>
>>>> Due to the large file size, for example an AI 7B model of around
>>>> 3.4GB, the
>>>> time taken to allocate DMA-BUF memory will be relatively long. Waiting
>>>> for the allocation to complete before reading the file will add to the
>>>> overall time consumption. Therefore, the total time for DMA-BUF
>>>> allocation and file read can be calculated using the formula
>>>> T(total) = T(alloc) + T(I/O)
>>>>
>>>> However, if we change our approach, we don't necessarily need to wait
>>>> for the DMA-BUF allocation to complete before initiating I/O. In fact,
>>>> during the allocation process, we already hold a portion of the page,
>>>> which means that waiting for subsequent page allocations to complete
>>>> before carrying out file reads is actually unfair to the pages that
>>>> have
>>>> already been allocated.
>>>>
>>>> The allocation of pages is sequential, and the reading of the file is
>>>> also sequential, with the content and size corresponding to the file.
>>>> This means that the memory location for each page, which holds the
>>>> content of a specific position in the file, can be determined at the
>>>> time of allocation.
>>>>
>>>> However, to fully leverage I/O performance, it is best to wait and
>>>> gather a certain number of pages before initiating batch processing.
>>>>
>>>> The default gather size is 128MB. Each gathered batch can be seen as
>>>> one file-read work item: it maps the gathered pages into the vmalloc
>>>> area to obtain a contiguous virtual address, which is used as the
>>>> buffer for the corresponding part of the file. So, when using direct
>>>> I/O to read a file, the file content is written directly into the
>>>> dma-buf's backing memory without any additional copying (compare to
>>>> the pipe buffer).
>>>>
>>>> Consider the other ways to read into a dma-buf. If we read after
>>>> mmap-ing the dma-buf, we need to map the dma-buf's pages into the
>>>> user virtual address space; the udmabuf memfd needs this operation
>>>> too. Even if we supported sendfile, the file copy would still need a
>>>> buffer that must be set up. So, mapping pages into the vmalloc area
>>>> does not incur any additional performance overhead compared to the
>>>> other methods.[6]
>>>>
>>>> Certainly, the administrator can also modify the gather size
>>>> through patch5.
>>>>
>>>> The formula for the time taken for system_heap buffer allocation and
>>>> file reading through async_read is as follows:
>>>>
>>>> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
>>>>
>>>> Compared to the synchronous read:
>>>> T(total) = T(alloc) + T(I/O)
>>>>
>>>> Whichever of allocation and I/O takes longer dominates the total
>>>> time; the cost of the shorter one is hidden behind it.
>>>>
>>>> Therefore, the larger the size of the file that needs to be read, the
>>>> greater the corresponding benefits will be.
>>>>
>>>> How to use
>>>> ===
>>>> Consider the current pathway for loading model files into DMA-BUF:
>>>> 1. open dma-heap, get heap fd
>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>> 4. mmap dma-buf fd, get vaddr
>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>> 6. share, attach, whatever you want
>>>>
>>>> Using DMA_HEAP_ALLOC_AND_READ_FILE needs just a small change:
>>>> 1. open dma-heap, get heap fd
>>>> 2. open file, get file_fd(buffer/direct)
>>>> 3. allocate dma-buf with the DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
>>>>    passing file_fd instead of len; get a dma-buf fd that already
>>>>    contains the file content
>>>> 4. share, attach, whatever you want
>>>>
>>>> So, it is easy to test.
>>>>
>>>> How to test
>>>> ===
>>>> The performance comparison will be conducted for the following
>>>> scenarios:
>>>> 1. normal
>>>> 2. udmabuf with [3] patch
>>>> 3. sendfile
>>>> 4. only patch 1
>>>> 5. patch1 - patch4.
>>>>
>>>> normal:
>>>> 1. open dma-heap, get heap fd
>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>> 4. mmap dma-buf fd, get vaddr
>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>> 6. share, attach, whatever you want
>>>>
>>>> UDMA-BUF step:
>>>> 1. memfd_create
>>>> 2. open file(buffer/direct)
>>>> 3. udmabuf create
>>>> 4. mmap memfd
>>>> 5. read file into memfd vaddr
>>>>
>>>> Sendfile steps (requires suitable splice_write/write_iter support;
>>>> used only for comparison):
>>>> 1. open dma-heap, get heap fd
>>>> 2. open file, get file_fd(buffer/direct)
>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>> 4. sendfile file_fd to dma-buf fd
>>>> 5. share, attach, whatever you want
>>>>
>>>> patch1/patch1-4:
>>>> 1. open dma-heap, get heap fd
>>>> 2. open file, get file_fd(buffer/direct)
>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
>>>> set file_fd
>>>> instead of len. get dma-buf fd(contains file content)
>>>> 4. share, attach, whatever you want
>>>>
>>>> You can create a file to test it and compare the performance gap
>>>> between the schemes. It is best to compare file sizes ranging from
>>>> KB to MB to GB.
>>>>
>>>> The following test data compares the performance differences between
>>>> 512KB, 8MB, 1GB, and 3GB files under the various scenarios.
>>>>
>>>> Performance Test
>>>> ===
>>>> 12G RAM phone,
>>>> UFS 4.0 (maximum speed 4GB/s),
>>>> f2fs,
>>>> kernel 6.1 with patch[7] (otherwise kvec direct I/O reads are not
>>>> supported),
>>>> no memory pressure.
>>>> drop_caches is used before each test.
>>>>
>>>> The average of 5 test results:
>>>> | scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns)       | 3GB(ns)       |
>>>> | ------------------- | ---------- | ---------- | ------------- | ------------- |
>>>> | normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
>>>> | udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000   | 2,108,419,923 |
>>>> | sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
>>>> | patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800   | 2,187,570,861 |
>>>> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
>>>> | patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 5,648,897,554 |
>>>> | udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784   | 2,158,305,738 |
>>>> | patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538   | 1,400,006,107 |
>>
>> With this test, sendfile, being based on the pipe buffer, does not
>> help much.
>>
>> udmabuf is good, but I think our OEM driver can't adapt to it. (Also,
>> AOSP does not enable this feature.)
>>
>>
>> Anyway, I am sending this patchset in the hope of further discussion.
>>
>> Thanks.
>>
>>>>
>>>> So, based on the test results:
>>>>
>>>> When the file is large, the full patchset has the highest performance.
>>>> Compared to normal, the patchset is about a 50% improvement, while
>>>> patch1 alone shows a degradation of about 41%.
>>>> A typical patch1 performance breakdown is as follows:
>>>> 1. alloc cost 188,802,693 ns
>>>> 2. vmap cost 42,491,385 ns
>>>> 3. file read cost 4,180,876,702 ns
>>>> Therefore, a single direct I/O read of a large file may not be the
>>>> optimal approach for performance.
>>>>
>>>> The direct I/O implementation based on sendfile performs the worst.
>>>>
>>>> When the file size is small, the difference in performance is not
>>>> significant. This is consistent with expectations.
>>>>
>>>>
>>>>
>>>> Suggested use cases
>>>> ===
>>>> 1. When there is a need to read large files and system resources
>>>> are scarce, especially when memory is limited (GB level). In
>>>> this scenario, using direct I/O for the file read can even bring
>>>> a performance improvement. (may need patch2-3)
>>>> 2. For embedded devices with limited RAM, direct I/O saves system
>>>> resources and avoids unnecessary data copying. So even though
>>>> performance is lower when reading small files, it can still be
>>>> used effectively.
>>>> 3. If there is sufficient memory, pinning the page cache of the
>>>> model files in memory and placing the files on an EROFS file
>>>> system for read-only access may be better. (EROFS does not
>>>> support direct I/O)
>>>>
>>>>
>>>> Changelog
>>>> ===
>>>> v1 [8]
>>>> v1->v2:
>>>> Use a heap flag for alloc-and-read instead of adding a new
>>>> DMA-buf ioctl command. [9]
>>>> Split the patchset to facilitate review and testing.
>>>> patch 1 implements alloc-and-read, adding the heap flag.
>>>> patch 2-4 add async read.
>>>> patch 5 makes the gather limit configurable.
>>>>
>>>> Reference
>>>> ===
>>>> [1]
>>>> https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
>>>> [2]
>>>> https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
>>>> [3]
>>>> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>>>> [4]
>>>> https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
>>>> [5]
>>>> https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
>>>> [6]
>>>> https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
>>>> [7]
>>>> https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230209102954.528942-7-dhowells@redhat.com/
>>>> [8]
>>>> https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
>>>> [9]
>>>> https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
>>>>
>>>> Huan Yang (5):
>>>> dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>> dma-buf: heaps: Introduce async alloc read ops
>>>> dma-buf: heaps: support alloc async read file
>>>> dma-buf: heaps: system_heap alloc support async read
>>>> dma-buf: heaps: configurable async read gather limit
>>>>
>>>> drivers/dma-buf/dma-heap.c | 552
>>>> +++++++++++++++++++++++++++-
>>>> drivers/dma-buf/heaps/system_heap.c | 70 +++-
>>>> include/linux/dma-heap.h | 53 ++-
>>>> include/uapi/linux/dma-heap.h | 11 +-
>>>> 4 files changed, 673 insertions(+), 13 deletions(-)
>>>>
>>>>
>>>> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
>>>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 7:57 [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag Huan Yang
` (5 preceding siblings ...)
2024-07-30 8:03 ` [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag Christian König
@ 2024-07-30 8:56 ` Daniel Vetter
2024-07-30 9:05 ` Huan Yang
6 siblings, 1 reply; 26+ messages in thread
From: Daniel Vetter @ 2024-07-30 8:56 UTC (permalink / raw)
To: Huan Yang
Cc: Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Christian König, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, opensource.kernel
On Tue, Jul 30, 2024 at 03:57:44PM +0800, Huan Yang wrote:
> UDMA-BUF step:
> 1. memfd_create
> 2. open file(buffer/direct)
> 3. udmabuf create
> 4. mmap memfd
> 5. read file into memfd vaddr
Yeah this is really slow and the worst way to do it. You absolutely want
to start _all_ the io before you start creating the dma-buf, ideally with
everything running in parallel. But just starting the direct I/O with
async and then creating the udmabuf should be a lot faster and avoid
needless serialization operations.
The other issue is that the mmap has some overhead, but might not be too
bad.
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 8:56 ` Daniel Vetter
@ 2024-07-30 9:05 ` Huan Yang
2024-07-30 10:42 ` Christian König
2024-07-30 12:04 ` Huan Yang
0 siblings, 2 replies; 26+ messages in thread
From: Huan Yang @ 2024-07-30 9:05 UTC (permalink / raw)
To: Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Christian König, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, opensource.kernel
On 2024/7/30 16:56, Daniel Vetter wrote:
>
> On Tue, Jul 30, 2024 at 03:57:44PM +0800, Huan Yang wrote:
>> UDMA-BUF step:
>> 1. memfd_create
>> 2. open file(buffer/direct)
>> 3. udmabuf create
>> 4. mmap memfd
>> 5. read file into memfd vaddr
> Yeah this is really slow and the worst way to do it. You absolutely want
> to start _all_ the io before you start creating the dma-buf, ideally with
> everything running in parallel. But just starting the direct I/O with
> async and then creating the udmabuf should be a lot faster and avoid
That's great. Let me rephrase it, and please correct me if I'm wrong.
UDMA-BUF steps:
1. memfd_create
2. mmap memfd
3. open file(buffer/direct)
4. start thread to async read
5. udmabuf create
With this, performance can improve.
> needless serialization operations.
>
> The other issue is that the mmap has some overhead, but might not be too
> bad.
Yes, the time spent on page faults in mmap should be negligible compared
to the time spent on the file read.
> -Sima
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 9:05 ` Huan Yang
@ 2024-07-30 10:42 ` Christian König
2024-07-30 11:33 ` Huan Yang
2024-07-30 12:04 ` Huan Yang
1 sibling, 1 reply; 26+ messages in thread
From: Christian König @ 2024-07-30 10:42 UTC (permalink / raw)
To: Huan Yang, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, linux-media, dri-devel, linaro-mm-sig,
linux-kernel, opensource.kernel
On 30.07.24 11:05, Huan Yang wrote:
>
> On 2024/7/30 16:56, Daniel Vetter wrote:
>>
>> On Tue, Jul 30, 2024 at 03:57:44PM +0800, Huan Yang wrote:
>>> UDMA-BUF step:
>>> 1. memfd_create
>>> 2. open file(buffer/direct)
>>> 3. udmabuf create
>>> 4. mmap memfd
>>> 5. read file into memfd vaddr
>> Yeah this is really slow and the worst way to do it. You absolutely want
>> to start _all_ the io before you start creating the dma-buf, ideally
>> with
>> everything running in parallel. But just starting the direct I/O with
>> async and then creating the udmabuf should be a lot faster and avoid
> That's great. Let me rephrase it, and please correct me if I'm wrong.
>
> UDMA-BUF steps:
> 1. memfd_create
> 2. mmap memfd
> 3. open file(buffer/direct)
> 4. start thread to async read
> 5. udmabuf create
>
> With this, performance can improve.
>
>> needless serialization operations.
>>
>> The other issue is that the mmap has some overhead, but might not be too
>> bad.
> Yes, the time spent on page fault in mmap should be negligible
> compared to the time spent on file read.
You should try to avoid mmap as much as possible. Especially the TLB
invalidation overhead is really huge on platforms with a large number of
CPUs.
Regards,
Christian.
>> -Sima
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 8:46 ` Huan Yang
@ 2024-07-30 10:43 ` Christian König
2024-07-30 11:36 ` Huan Yang
0 siblings, 1 reply; 26+ messages in thread
From: Christian König @ 2024-07-30 10:43 UTC (permalink / raw)
To: Huan Yang, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, linux-media, dri-devel, linaro-mm-sig,
linux-kernel
Cc: opensource.kernel
On 30.07.24 10:46, Huan Yang wrote:
>
> On 2024/7/30 16:37, Christian König wrote:
>> On 30.07.24 10:14, Huan Yang wrote:
>>> On 2024/7/30 16:03, Christian König wrote:
>>>> On 30.07.24 09:57, Huan Yang wrote:
>>>>> Background
>>>>> ====
>>>>> Some users may need to load a file into a dma-buf; the current way is:
>>>>> 1. allocate a dma-buf, get dma-buf fd
>>>>> 2. mmap dma-buf fd into user vaddr
>>>>> 3. read(file_fd, vaddr, fsz)
>>>>> Since the dma-buf user mapping cannot support direct I/O[1], the file
>>>>> read must use buffered I/O.
>>>>>
>>>>> This means that during the process of reading the file into dma-buf,
>>>>> page cache needs to be generated, and the corresponding content
>>>>> needs to
>>>>> be first copied to the page cache before being copied to the dma-buf.
>>>>>
>>>>> This way worked well when reading relatively small files before, as
>>>>> the page cache can cache the file content, thus improving
>>>>> performance.
>>>>>
>>>>> However, there are new challenges currently, especially as AI
>>>>> models are
>>>>> becoming larger and need to be shared between DMA devices and the CPU
>>>>> via dma-buf.
>>>>>
>>>>> For example, our 7B model file size is around 3.4GB. Using the
>>>>> previous would mean generating a total of 3.4GB of page cache
>>>>> (even if it will be reclaimed), and also requiring the copying of
>>>>> 3.4GB
>>>>> of content between page cache and dma-buf.
>>>>>
>>>>> Due to the limited resources of system memory, files in the
>>>>> gigabyte range
>>>>> cannot persist in memory indefinitely, so this portion of page
>>>>> cache may
>>>>> not provide much assistance for subsequent reads. Additionally, the
>>>>> existence of page cache will consume additional system resources
>>>>> due to
>>>>> the extra copying required by the CPU.
>>>>>
>>>>> Therefore, I think it is necessary for dma-buf to support direct I/O.
>>>>>
>>>>> However, direct I/O file reads cannot be performed using the buffer
>>>>> mmaped by the user space for the dma-buf.[1]
>>>>>
>>>>> Here are some discussions on implementing direct I/O using dma-buf:
>>>>>
>>>>> mmap[1]
>>>>> ---
>>>>> dma-buf has never supported direct I/O on the user-mapped vaddr.
>>>>>
>>>>> udmabuf[2]
>>>>> ---
>>>>> Currently, udmabuf can use the memfd method to read files into
>>>>> dma-buf in direct I/O mode.
>>>>>
>>>>> However, if the size is large, the current udmabuf needs to adjust the
>>>>> corresponding size_limit (default 64MB).
>>>>> But using udmabuf for files at the 3GB level is not a very good
>>>>> approach: it needs some internal adjustments to handle this[3], or else
>>>>> creation fails.
>>>>>
>>>>> But, it is indeed a viable way to enable dma-buf to support direct
>>>>> I/O.
>>>>> However, it is necessary to initiate the file read after the
>>>>> memory allocation
>>>>> is completed, and handle race conditions carefully.
>>>>>
>>>>> sendfile/splice[4]
>>>>> ---
>>>>> Another way to enable dma-buf to support direct I/O is by
>>>>> implementing
>>>>> splice_write/write_iter in the dma-buf file operations (fops) to
>>>>> adapt
>>>>> to the sendfile method.
>>>>> However, the current sendfile/splice calls are based on pipe. When
>>>>> using
>>>>> direct I/O to read a file, the content needs to be copied to the
>>>>> buffer
>>>>> allocated by the pipe (default 64KB), and then the dma-buf fops'
>>>>> splice_write needs to be called to write the content into the
>>>>> dma-buf.
>>>>> This approach requires serially reading pipe-sized chunks of the file
>>>>> into the pipe buffer and then waiting for the dma-buf to be written
>>>>> before reading the next one. (The I/O performance is relatively weak
>>>>> under direct I/O.)
>>>>> Moreover, due to the existence of the pipe buffer, even when using
>>>>> direct I/O and not needing to generate additional page cache,
>>>>> a CPU copy is still required.
>>>>>
>>>>> copy_file_range[5]
>>>>> ---
>>>>> As for copy_file_range, it only supports copying files within the
>>>>> same file system. Similarly, it is not very practical.
>>>>>
>>>>>
>>>>> So, currently, there is no particularly suitable solution on VFS to
>>>>> allow dma-buf to support direct I/O for large file reads.
>>>>>
>>>>> This patchset provides an idea to complete file reads when
>>>>> requesting a
>>>>> dma-buf.
>>>>>
>>>>> Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>>> ===
>>>>> This patch provides a method to immediately read the file content
>>>>> after
>>>>> the dma-buf is allocated, and only returns the dma-buf file
>>>>> descriptor
>>>>> after the file is fully read.
>>>>>
>>>>> Since the dma-buf file descriptor is not returned, no other thread
>>>>> can
>>>>> access it except for the current thread, so we don't need to worry
>>>>> about
>>>>> race conditions.
>>>>
>>>> That is a completely false assumption.
>>> Can you provide a detailed explanation as to why this assumption is
>>> incorrect? thanks.
>>
>> File descriptors can be guessed and are available to userspace as soon
>> as dma_buf_fd() is called.
>>
>> What could potentially work is to call system_heap_allocate() without
>> calling dma_buf_fd(), but I'm not sure if you can then make I/O to
>> the underlying pages.
>
> Actually, the dma-buf file descriptor is obtained only after a
> successful file read in the code, so there is no issue.
>
> If you are interested, you can take a look at the
> dma_heap_buffer_alloc_and_read function in patch1.
>
>>
>>>>
>>>>>
>>>>> Map the dma-buf to the vmalloc area and initiate file reads in kernel
>>>>> space, supporting both buffer I/O and direct I/O.
>>>>>
>>>>> This patch adds the DMA_HEAP_ALLOC_AND_READ heap_flag for users.
>>>>> When a user needs to allocate a dma-buf and read a file, they should
>>>>> pass this heap flag. As the size of the file being read is fixed, there
>>>>> is no need to pass the 'len' parameter. Instead, the file_fd needs to
>>>>> be passed to indicate to the kernel which file needs to be read.
>>>>>
>>>>> The file open flags determine the mode of file reading.
>>>>> But please note that if direct I/O (O_DIRECT) is needed to read the
>>>>> file, the file size must be page aligned. (With patches 2-5, this is
>>>>> not needed.)
>>>>>
>>>>> Therefore, for the user, len and file_fd are mutually exclusive,
>>>>> and they are combined using a union.
>>>>>
>>>>> Once the user obtains the dma-buf fd, the dma-buf directly
>>>>> contains the
>>>>> file content.
>>>>
>>>> And I'm repeating myself, but this is a complete NAK from my side
>>>> to this approach.
>>>>
>>>> We pointed out multiple ways of how to implement this cleanly and
>>>> not by hacking functionality into the kernel which absolutely
>>>> doesn't belong there.
>>> In this patchset, I have provided performance comparisons of each of
>>> these methods. Can you please provide more opinions?
>>
>> Either drop the whole approach or change udmabuf to do what you want
>> to do.
> OK, if so, do I need to send a patch to make dma-buf support sendfile?
Well, the udmabuf approach doesn't need to use sendfile, so no.
Regards,
Christian.
>
>>
>> Apart from that I don't see a doable way which can be accepted into
>> the kernel.
> Thanks for your suggestion.
>>
>> Regards,
>> Christian.
>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Patch 1 implement it.
>>>>>
>>>>> Patch 2-5 provides an approach for performance improvement.
>>>>>
>>>>> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
>>>>> synchronously read files using direct I/O.
>>>>>
>>>>> This approach helps to save CPU copying and avoid a certain degree of
>>>>> memory thrashing (page cache generation and reclamation)
>>>>>
>>>>> When dealing with large file sizes, the benefits of this approach
>>>>> become
>>>>> particularly significant.
>>>>>
>>>>> However, there are currently some methods that can improve
>>>>> performance,
>>>>> not just save system resources:
>>>>>
>>>>> Due to the large file size, for example, an AI 7B model of around
>>>>> 3.4GB, the time taken to allocate DMA-BUF memory will be relatively
>>>>> long. Waiting for the allocation to complete before reading the file
>>>>> adds to the overall time consumption. Therefore, the total time for
>>>>> DMA-BUF allocation and file read can be calculated using the formula
>>>>> T(total) = T(alloc) + T(I/O)
>>>>>
>>>>> However, if we change our approach, we don't necessarily need to wait
>>>>> for the DMA-BUF allocation to complete before initiating I/O. In
>>>>> fact,
>>>>> during the allocation process, we already hold a portion of the page,
>>>>> which means that waiting for subsequent page allocations to complete
>>>>> before carrying out file reads is actually unfair to the pages
>>>>> that have
>>>>> already been allocated.
>>>>>
>>>>> The allocation of pages is sequential, and the reading of the file is
>>>>> also sequential, with the content and size corresponding to the file.
>>>>> This means that the memory location for each page, which holds the
>>>>> content of a specific position in the file, can be determined at the
>>>>> time of allocation.
>>>>>
>>>>> However, to fully leverage I/O performance, it is best to wait and
>>>>> gather a certain number of pages before initiating batch processing.
>>>>>
>>>>> The default gather size is 128MB. Each gathered batch can be treated
>>>>> as one file-read work item: it maps the gathered pages into the vmalloc
>>>>> area to obtain a contiguous virtual address, which is used as the
>>>>> buffer to store the corresponding part of the file. So, when using
>>>>> direct I/O to read a file, the file content is written directly into
>>>>> the corresponding dma-buf memory without any additional copying.
>>>>> (Compare to the pipe buffer.)
>>>>>
>>>>> Consider the other ways to read into a dma-buf. If we read after
>>>>> mmapping the dma-buf, we need to map the pages of the dma-buf into the
>>>>> user virtual address space; the udmabuf memfd needs this operation too.
>>>>> Even if we support sendfile, the file copy still needs a buffer that
>>>>> must be set up.
>>>>> So, mapping pages to the vmalloc area does not incur any additional
>>>>> performance overhead compared to other methods.[6]
>>>>>
>>>>> Certainly, the administrator can also modify the gather size
>>>>> through patch5.
>>>>>
>>>>> The formula for the time taken for system_heap buffer allocation and
>>>>> file reading through async_read is as follows:
>>>>>
>>>>> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
>>>>>
>>>>> Compared to the synchronous read:
>>>>> T(total) = T(alloc) + T(I/O)
>>>>>
>>>>> Whichever of allocation and I/O takes longer dominates the total
>>>>> time; the shorter one is hidden behind it.
>>>>>
>>>>> Therefore, the larger the size of the file that needs to be read, the
>>>>> greater the corresponding benefits will be.
>>>>>
>>>>> How to use
>>>>> ===
>>>>> Consider the current pathway for loading model files into DMA-BUF:
>>>>> 1. open dma-heap, get heap fd
>>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>> 4. mmap dma-buf fd, get vaddr
>>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>> 6. share, attach, whatever you want
>>>>>
>>>>> Use DMA_HEAP_ALLOC_AND_READ_FILE JUST a little change:
>>>>> 1. open dma-heap, get heap fd
>>>>> 2. open file, get file_fd(buffer/direct)
>>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>> flag, set file_fd
>>>>> instead of len. get dma-buf fd(contains file content)
>>>>> 4. share, attach, whatever you want
>>>>>
>>>>> So, testing it is easy.
>>>>>
>>>>> How to test
>>>>> ===
>>>>> The performance comparison will be conducted for the following
>>>>> scenarios:
>>>>> 1. normal
>>>>> 2. udmabuf with [3] patch
>>>>> 3. sendfile
>>>>> 4. only patch 1
>>>>> 5. patch1 - patch4.
>>>>>
>>>>> normal:
>>>>> 1. open dma-heap, get heap fd
>>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>> 4. mmap dma-buf fd, get vaddr
>>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>> 6. share, attach, whatever you want
>>>>>
>>>>> UDMA-BUF step:
>>>>> 1. memfd_create
>>>>> 2. open file(buffer/direct)
>>>>> 3. udmabuf create
>>>>> 4. mmap memfd
>>>>> 5. read file into memfd vaddr
>>>>>
>>>>> Sendfile step(need suit splice_write/write_iter, just use to
>>>>> compare):
>>>>> 1. open dma-heap, get heap fd
>>>>> 2. open file, get file_fd(buffer/direct)
>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>> 4. sendfile file_fd to dma-buf fd
>>>>> 5. share, attach, whatever you want
>>>>>
>>>>> patch1/patch1-4:
>>>>> 1. open dma-heap, get heap fd
>>>>> 2. open file, get file_fd(buffer/direct)
>>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>> flag, set file_fd
>>>>> instead of len. get dma-buf fd(contains file content)
>>>>> 4. share, attach, whatever you want
>>>>>
>>>>> You can create a file to test it. Compare the performance gap
>>>>> between the two.
>>>>> It is best to compare the differences in file size from KB to MB
>>>>> to GB.
>>>>>
>>>>> The following test data will compare the performance differences
>>>>> between 512KB,
>>>>> 8MB, 1GB, and 3GB under various scenarios.
>>>>>
>>>>> Performance Test
>>>>> ===
>>>>> 12G RAM phone
>>>>> UFS4.0(the maximum speed is 4GB/s. ),
>>>>> f2fs
>>>>> kernel 6.1 with patch[7] (or else, can't support kvec direct
>>>>> I/O read.)
>>>>> no memory pressure.
>>>>> drop_cache is used for each test.
>>>>>
>>>>> The average of 5 test results:
>>>>> | scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns)       | 3GB(ns)       |
>>>>> | ------------------- | ---------- | ---------- | ------------- | ------------- |
>>>>> | normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
>>>>> | udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000   | 2,108,419,923 |
>>>>> | sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
>>>>> | patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800   | 2,187,570,861 |
>>>>> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
>>>>> | patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 5,648,897,554 |
>>>>> | udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784   | 2,158,305,738 |
>>>>> | patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538   | 1,400,006,107 |
>>>
>>> In this test, sendfile based on the pipe buffer does not help much.
>>>
>>> udmabuf is good, but I think our OEM driver can't easily adopt it. (And
>>> AOSP does not enable this feature.)
>>>
>>>
>>> Anyway, I am sending this patchset in the hope of further discussion.
>>>
>>> Thanks.
>>>
>>>>>
>>>>> So, based on the test results:
>>>>>
>>>>> When the file is large, the patchset has the highest performance:
>>>>> compared to normal, the patchset is a 50% improvement, while
>>>>> patch1 alone showed a degradation of 41%.
>>>>> patch1 typical performance breakdown is as follows:
>>>>> 1. alloc cost 188,802,693 ns
>>>>> 2. vmap cost 42,491,385 ns
>>>>> 3. file read cost 4,180,876,702 ns
>>>>> Therefore, directly performing a single direct I/O read on a large
>>>>> file
>>>>> may not be the most optimal way for performance.
>>>>>
>>>>> The direct I/O implementation based on sendfile performs the worst.
>>>>>
>>>>> When the file size is small, the difference in performance is not
>>>>> significant. This is consistent with expectations.
>>>>>
>>>>>
>>>>>
>>>>> Suggested use cases
>>>>> ===
>>>>>    1. When there is a need to read large (GB-level) files and system
>>>>>       resources are scarce, especially when memory is limited. In this
>>>>>       scenario, using direct I/O for file reading can even bring
>>>>>       performance improvements. (May need patches 2-3.)
>>>>>    2. For embedded devices with limited RAM, using direct I/O can save
>>>>>       system resources and avoid unnecessary data copying. So even if
>>>>>       performance is lower when reading small files, it can still be
>>>>>       used effectively.
>>>>>    3. If there is sufficient memory, pinning the page cache of the
>>>>>       model files in memory and placing the files in an EROFS file
>>>>>       system for read-only access may be better. (EROFS does not
>>>>>       support direct I/O.)
>>>>>
>>>>>
>>>>> Changelog
>>>>> ===
>>>>> v1 [8]
>>>>> v1->v2:
>>>>> Use the heap flag method for alloc-and-read instead of adding a new
>>>>> DMA-buf ioctl command. [9]
>>>>> Split the patchset to facilitate review and testing:
>>>>> patch 1 implements alloc-and-read via the new heap flag.
>>>>> patches 2-4 add async read.
>>>>> patch 5 makes the gather limit configurable.
>>>>>
>>>>> Reference
>>>>> ===
>>>>> [1]
>>>>> https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
>>>>> [2] https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
>>>>> [3]
>>>>> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>>>>> [4] https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
>>>>> [5] https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
>>>>> [6]
>>>>> https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
>>>>> [7]
>>>>> https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230209102954.528942-7-dhowells@redhat.com/
>>>>> [8]
>>>>> https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
>>>>> [9]
>>>>> https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
>>>>>
>>>>> Huan Yang (5):
>>>>> dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>>> dma-buf: heaps: Introduce async alloc read ops
>>>>> dma-buf: heaps: support alloc async read file
>>>>> dma-buf: heaps: system_heap alloc support async read
>>>>> dma-buf: heaps: configurable async read gather limit
>>>>>
>>>>> drivers/dma-buf/dma-heap.c | 552
>>>>> +++++++++++++++++++++++++++-
>>>>> drivers/dma-buf/heaps/system_heap.c | 70 +++-
>>>>> include/linux/dma-heap.h | 53 ++-
>>>>> include/uapi/linux/dma-heap.h | 11 +-
>>>>> 4 files changed, 673 insertions(+), 13 deletions(-)
>>>>>
>>>>>
>>>>> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
>>>>
>>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 10:42 ` Christian König
@ 2024-07-30 11:33 ` Huan Yang
0 siblings, 0 replies; 26+ messages in thread
From: Huan Yang @ 2024-07-30 11:33 UTC (permalink / raw)
To: Christian König, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, T.J. Mercier, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, opensource.kernel
On 2024/7/30 18:42, Christian König wrote:
> On 30.07.24 at 11:05, Huan Yang wrote:
>>
>> On 2024/7/30 16:56, Daniel Vetter wrote:
>>>
>>> On Tue, Jul 30, 2024 at 03:57:44PM +0800, Huan Yang wrote:
>>>> UDMA-BUF step:
>>>> 1. memfd_create
>>>> 2. open file(buffer/direct)
>>>> 3. udmabuf create
>>>> 4. mmap memfd
>>>> 5. read file into memfd vaddr
>>> Yeah this is really slow and the worst way to do it. You absolutely
>>> want
>>> to start _all_ the io before you start creating the dma-buf, ideally
>>> with
>>> everything running in parallel. But just starting the direct I/O with
>>> async and then creating the udmabuf should be a lot faster and avoid
>> That's great. Let me rephrase, and please correct me if I'm
>> wrong.
>>
>> UDMA-BUF step:
>> 1. memfd_create
>> 2. mmap memfd
>> 3. open file(buffer/direct)
>> 4. start thread to async read
>> 5. udmabuf create
>>
>> With this, performance can improve.
>>
>>> needless serialization operations.
>>>
>>> The other issue is that the mmap has some overhead, but might not be
>>> too
>>> bad.
>> Yes, the time spent on page fault in mmap should be negligible
>> compared to the time spent on file read.
>
> You should try to avoid mmap as much as possible. Especially the TLB
> invalidation overhead is really huge on platforms with a large number
> of CPUs.
But if not mmap, how can we read from fd to fd? sendfile may help
(shmemfs may support it; I have not tested it), but going through the
pipe buffer is not good, per my tests.
>
> Regards,
> Christian.
>
>>> -Sima
>>> --
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch/
>>>
>
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 10:43 ` Christian König
@ 2024-07-30 11:36 ` Huan Yang
2024-07-30 13:11 ` Christian König
0 siblings, 1 reply; 26+ messages in thread
From: Huan Yang @ 2024-07-30 11:36 UTC (permalink / raw)
To: Christian König, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, T.J. Mercier, linux-media, dri-devel,
linaro-mm-sig, linux-kernel
Cc: opensource.kernel
On 2024/7/30 18:43, Christian König wrote:
> On 30.07.24 at 10:46, Huan Yang wrote:
>>
>> On 2024/7/30 16:37, Christian König wrote:
>>> On 30.07.24 at 10:14, Huan Yang wrote:
>>>> On 2024/7/30 16:03, Christian König wrote:
>>>>> On 30.07.24 at 09:57, Huan Yang wrote:
>>>>>> Background
>>>>>> ====
>>>>>> Some users may need to load a file into a dma-buf; the current way is:
>>>>>> 1. allocate a dma-buf, get the dma-buf fd
>>>>>> 2. mmap the dma-buf fd into a user vaddr
>>>>>> 3. read(file_fd, vaddr, fsz)
>>>>>> Because the dma-buf user mapping can't support direct I/O[1], the file
>>>>>> read must be buffer I/O.
>>>>>>
>>>>>> This means that during the process of reading the file into dma-buf,
>>>>>> page cache needs to be generated, and the corresponding content
>>>>>> needs to
>>>>>> be first copied to the page cache before being copied to the
>>>>>> dma-buf.
>>>>>>
>>>>>> This way worked well when reading relatively small files before, as
>>>>>> the page cache can cache the file content, thus improving
>>>>>> performance.
>>>>>>
>>>>>> However, there are new challenges currently, especially as AI
>>>>>> models are
>>>>>> becoming larger and need to be shared between DMA devices and the
>>>>>> CPU
>>>>>> via dma-buf.
>>>>>>
>>>>>> For example, our 7B model file size is around 3.4GB. Using the
>>>>>> previous would mean generating a total of 3.4GB of page cache
>>>>>> (even if it will be reclaimed), and also requiring the copying of
>>>>>> 3.4GB
>>>>>> of content between page cache and dma-buf.
>>>>>>
>>>>>> Due to the limited resources of system memory, files in the
>>>>>> gigabyte range
>>>>>> cannot persist in memory indefinitely, so this portion of page
>>>>>> cache may
>>>>>> not provide much assistance for subsequent reads. Additionally, the
>>>>>> existence of page cache will consume additional system resources
>>>>>> due to
>>>>>> the extra copying required by the CPU.
>>>>>>
>>>>>> Therefore, I think it is necessary for dma-buf to support direct
>>>>>> I/O.
>>>>>>
>>>>>> However, direct I/O file reads cannot be performed using the buffer
>>>>>> mmaped by the user space for the dma-buf.[1]
>>>>>>
>>>>>> Here are some discussions on implementing direct I/O using dma-buf:
>>>>>>
>>>>>> mmap[1]
>>>>>> ---
>>>>>> dma-buf never support user map vaddr use of direct I/O.
>>>>>>
>>>>>> udmabuf[2]
>>>>>> ---
>>>>>> Currently, udmabuf can use the memfd method to read files into
>>>>>> dma-buf in direct I/O mode.
>>>>>>
>>>>>> However, if the size is large, the current udmabuf needs to
>>>>>> adjust the
>>>>>> corresponding size_limit(default 64MB).
>>>>>> But using udmabuf for files at the 3GB level is not a very good
>>>>>> approach.
>>>>>> It needs to make some adjustments internally to handle this.[3]
>>>>>> Or else,
>>>>>> fail create.
>>>>>>
>>>>>> But, it is indeed a viable way to enable dma-buf to support
>>>>>> direct I/O.
>>>>>> However, it is necessary to initiate the file read after the
>>>>>> memory allocation
>>>>>> is completed, and handle race conditions carefully.
>>>>>>
>>>>>> sendfile/splice[4]
>>>>>> ---
>>>>>> Another way to enable dma-buf to support direct I/O is by
>>>>>> implementing
>>>>>> splice_write/write_iter in the dma-buf file operations (fops) to
>>>>>> adapt
>>>>>> to the sendfile method.
>>>>>> However, the current sendfile/splice calls are based on pipe.
>>>>>> When using
>>>>>> direct I/O to read a file, the content needs to be copied to the
>>>>>> buffer
>>>>>> allocated by the pipe (default 64KB), and then the dma-buf fops'
>>>>>> splice_write needs to be called to write the content into the
>>>>>> dma-buf.
>>>>>> This approach requires serially reading the content of file pipe
>>>>>> size
>>>>>> into the pipe buffer and then waiting for the dma-buf to be written
>>>>>> before reading the next one.(The I/O performance is relatively weak
>>>>>> under direct I/O.)
>>>>>> Moreover, due to the existence of the pipe buffer, even when using
>>>>>> direct I/O and not needing to generate additional page cache,
>>>>>> there still needs to be a CPU copy.
>>>>>>
>>>>>> copy_file_range[5]
>>>>>> ---
>>>>>> Consider of copy_file_range, It only supports copying files
>>>>>> within the
>>>>>> same file system. Similarly, it is not very practical.
>>>>>>
>>>>>>
>>>>>> So, currently, there is no particularly suitable solution on VFS to
>>>>>> allow dma-buf to support direct I/O for large file reads.
>>>>>>
>>>>>> This patchset provides an idea to complete file reads when
>>>>>> requesting a
>>>>>> dma-buf.
>>>>>>
>>>>>> Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>>>> ===
>>>>>> This patch provides a method to immediately read the file content
>>>>>> after
>>>>>> the dma-buf is allocated, and only returns the dma-buf file
>>>>>> descriptor
>>>>>> after the file is fully read.
>>>>>>
>>>>>> Since the dma-buf file descriptor is not returned, no other
>>>>>> thread can
>>>>>> access it except for the current thread, so we don't need to
>>>>>> worry about
>>>>>> race conditions.
>>>>>
>>>>> That is a completely false assumption.
>>>> Can you provide a detailed explanation as to why this assumption is
>>>> incorrect? thanks.
>>>
>>> File descriptors can be guessed and are available to userspace as
>>> soon as dma_buf_fd() is called.
>>>
>>> What could potentially work is to call system_heap_allocate()
>>> without calling dma_buf_fd(), but I'm not sure if you can then make
>>> I/O to the underlying pages.
>>
>> Actually, the dma-buf file descriptor is obtained only after a
>> successful file read in the code, so there is no issue.
>>
>> If you are interested, you can take a look at the
>> dma_heap_buffer_alloc_and_read function in patch1.
>>
>>>
>>>>>
>>>>>>
>>>>>> Map the dma-buf to the vmalloc area and initiate file reads in
>>>>>> kernel
>>>>>> space, supporting both buffer I/O and direct I/O.
>>>>>>
>>>>>> This patch adds the DMA_HEAP_ALLOC_AND_READ heap_flag for user.
>>>>>> When a user needs to allocate a dma-buf and read a file, they should
>>>>>> pass this heap flag. As the size of the file being read is fixed,
>>>>>> there is no
>>>>>> need to pass the 'len' parameter. Instead, The file_fd needs to
>>>>>> be passed to
>>>>>> indicate to the kernel the file that needs to be read.
>>>>>>
>>>>>> The file open flag determines the mode of file reading.
>>>>>> But, please note that if direct I/O(O_DIRECT) is needed to read
>>>>>> the file,
>>>>>> the file size must be page aligned. (with patch 2-5, no need)
>>>>>>
>>>>>> Therefore, for the user, len and file_fd are mutually exclusive,
>>>>>> and they are combined using a union.
>>>>>>
>>>>>> Once the user obtains the dma-buf fd, the dma-buf directly
>>>>>> contains the
>>>>>> file content.
>>>>>
>>>>> And I'm repeating myself, but this is a complete NAK from my side
>>>>> to this approach.
>>>>>
>>>>> We pointed out multiple ways of how to implement this cleanly and
>>>>> not by hacking functionality into the kernel which absolutely
>>>>> doesn't belong there.
>>>> In this patchset, I have provided performance comparisons of each
>>>> of these methods. Can you please provide more opinions?
>>>
>>> Either drop the whole approach or change udmabuf to do what you want
>>> to do.
>> OK, if so, do I need to send a patch to make dma-buf support sendfile?
>
> Well the udmabuf approach doesn't need to use sendfile, so no.
Got it, I won't send it then.
Regarding udmabuf, my testing found it can't support large file reads due
to the page array allocation.
I already uploaded this patch, but have not received an answer.
https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
Is there anything wrong with my understanding of it?
>
> Regards,
> Christian.
>
>>
>>>
>>> Apart from that I don't see a doable way which can be accepted into
>>> the kernel.
>> Thanks for your suggestion.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Patch 1 implement it.
>>>>>>
>>>>>> Patch 2-5 provides an approach for performance improvement.
>>>>>>
>>>>>> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
>>>>>> synchronously read files using direct I/O.
>>>>>>
>>>>>> This approach helps to save CPU copying and avoid a certain
>>>>>> degree of
>>>>>> memory thrashing (page cache generation and reclamation)
>>>>>>
>>>>>> When dealing with large file sizes, the benefits of this approach
>>>>>> become
>>>>>> particularly significant.
>>>>>>
>>>>>> However, there are currently some methods that can improve
>>>>>> performance,
>>>>>> not just save system resources:
>>>>>>
>>>>>> Due to the large file size, for example, a AI 7B model of around
>>>>>> 3.4GB, the
>>>>>> time taken to allocate DMA-BUF memory will be relatively long.
>>>>>> Waiting
>>>>>> for the allocation to complete before reading the file will add
>>>>>> to the
>>>>>> overall time consumption. Therefore, the total time for DMA-BUF
>>>>>> allocation and file read can be calculated using the formula
>>>>>> T(total) = T(alloc) + T(I/O)
>>>>>>
>>>>>> However, if we change our approach, we don't necessarily need to
>>>>>> wait
>>>>>> for the DMA-BUF allocation to complete before initiating I/O. In
>>>>>> fact,
>>>>>> during the allocation process, we already hold a portion of the
>>>>>> page,
>>>>>> which means that waiting for subsequent page allocations to complete
>>>>>> before carrying out file reads is actually unfair to the pages
>>>>>> that have
>>>>>> already been allocated.
>>>>>>
>>>>>> The allocation of pages is sequential, and the reading of the
>>>>>> file is
>>>>>> also sequential, with the content and size corresponding to the
>>>>>> file.
>>>>>> This means that the memory location for each page, which holds the
>>>>>> content of a specific position in the file, can be determined at the
>>>>>> time of allocation.
>>>>>>
>>>>>> However, to fully leverage I/O performance, it is best to wait and
>>>>>> gather a certain number of pages before initiating batch processing.
>>>>>>
>>>>>> The default gather size is 128MB. So, ever gathered can see as a
>>>>>> file read
>>>>>> work, it maps the gather page to the vmalloc area to obtain a
>>>>>> continuous
>>>>>> virtual address, which is used as a buffer to store the contents
>>>>>> of the
>>>>>> corresponding file. So, if using direct I/O to read a file, the file
>>>>>> content will be written directly to the corresponding dma-buf
>>>>>> buffer memory
>>>>>> without any additional copying.(compare to pipe buffer.)
>>>>>>
>>>>>> Consider other ways to read into dma-buf. If we assume reading
>>>>>> after mmap
>>>>>> dma-buf, we need to map the pages of the dma-buf to the user virtual
>>>>>> address space. Also, udmabuf memfd need do this operations too.
>>>>>> Even if we support sendfile, the file copy also need buffer, you
>>>>>> must
>>>>>> setup it.
>>>>>> So, mapping pages to the vmalloc area does not incur any additional
>>>>>> performance overhead compared to other methods.[6]
>>>>>>
>>>>>> Certainly, the administrator can also modify the gather size
>>>>>> through patch5.
>>>>>>
>>>>>> The formula for the time taken for system_heap buffer allocation and
>>>>>> file reading through async_read is as follows:
>>>>>>
>>>>>> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
>>>>>>
>>>>>> Compared to the synchronous read:
>>>>>> T(total) = T(alloc) + T(I/O)
>>>>>>
>>>>>> If the allocation time or I/O time is long, the time difference
>>>>>> will be
>>>>>> covered by the maximum value between the allocation and I/O. The
>>>>>> other
>>>>>> party will be concealed.
>>>>>>
>>>>>> Therefore, the larger the size of the file that needs to be read,
>>>>>> the
>>>>>> greater the corresponding benefits will be.
>>>>>>
>>>>>> How to use
>>>>>> ===
>>>>>> Consider the current pathway for loading model files into DMA-BUF:
>>>>>> 1. open dma-heap, get heap fd
>>>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>>> 4. mmap dma-buf fd, get vaddr
>>>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>>> 6. share, attach, whatever you want
>>>>>>
>>>>>> Use DMA_HEAP_ALLOC_AND_READ_FILE JUST a little change:
>>>>>> 1. open dma-heap, get heap fd
>>>>>> 2. open file, get file_fd(buffer/direct)
>>>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>>> flag, set file_fd
>>>>>> instead of len. get dma-buf fd(contains file content)
>>>>>> 4. share, attach, whatever you want
>>>>>>
>>>>>> So, test it is easy.
>>>>>>
>>>>>> How to test
>>>>>> ===
>>>>>> The performance comparison will be conducted for the following
>>>>>> scenarios:
>>>>>> 1. normal
>>>>>> 2. udmabuf with [3] patch
>>>>>> 3. sendfile
>>>>>> 4. only patch 1
>>>>>> 5. patch1 - patch4.
>>>>>>
>>>>>> normal:
>>>>>> 1. open dma-heap, get heap fd
>>>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>>> 4. mmap dma-buf fd, get vaddr
>>>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>>> 6. share, attach, whatever you want
>>>>>>
>>>>>> UDMA-BUF step:
>>>>>> 1. memfd_create
>>>>>> 2. open file(buffer/direct)
>>>>>> 3. udmabuf create
>>>>>> 4. mmap memfd
>>>>>> 5. read file into memfd vaddr
>>>>>>
>>>>>> Sendfile steps (requires suitable splice_write/write_iter support;
>>>>>> used only for comparison):
>>>>>> 1. open dma-heap, get heap fd
>>>>>> 2. open file, get file_fd(buffer/direct)
>>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>>> 4. sendfile file_fd to dma-buf fd
>>>>>> 5. share, attach, whatever you want
>>>>>>
>>>>>> patch1/patch1-4:
>>>>>> 1. open dma-heap, get heap fd
>>>>>> 2. open file, get file_fd(buffer/direct)
>>>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>>> flag, set file_fd
>>>>>> instead of len. get dma-buf fd(contains file content)
>>>>>> 4. share, attach, whatever you want
>>>>>>
>>>>>> You can create a file to test it and compare the performance gap
>>>>>> between the schemes. It is best to compare file sizes ranging from
>>>>>> KB to MB to GB.
>>>>>>
>>>>>> The following test data will compare the performance differences
>>>>>> between 512KB,
>>>>>> 8MB, 1GB, and 3GB under various scenarios.
>>>>>>
>>>>>> Performance Test
>>>>>> ===
>>>>>> 12GB RAM phone
>>>>>> UFS 4.0 (maximum speed 4GB/s)
>>>>>> f2fs
>>>>>> kernel 6.1 with patch [7] (otherwise kvec direct I/O reads are not
>>>>>> supported)
>>>>>> no memory pressure
>>>>>> drop_caches is used before each test
>>>>>>
>>>>>> The average of 5 test results:
>>>>>> | scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns)       | 3GB(ns)       |
>>>>>> | ------------------- | ---------- | ---------- | ------------- | ------------- |
>>>>>> | normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
>>>>>> | udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000   | 2,108,419,923 |
>>>>>> | sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
>>>>>> | patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800   | 2,187,570,861 |
>>>>>> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
>>>>>> | patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 5,648,897,554 |
>>>>>> | udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784   | 2,158,305,738 |
>>>>>> | patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538   | 1,400,006,107 |
>>>>
>>>> In this test, sendfile, being based on the pipe buffer, does not
>>>> help much.
>>>>
>>>> udmabuf performs well, but I think our OEM driver can't adapt to it.
>>>> (Also, AOSP does not enable this feature.)
>>>>
>>>>
>>>> Anyway, I am sending this patchset in the hope of further discussion.
>>>>
>>>> Thanks.
>>>>
>>>>>>
>>>>>> So, based on the test results:
>>>>>>
>>>>>> When the file is large, the full patchset has the highest performance:
>>>>>> compared to normal, it is a 50% improvement, while patch 1 alone
>>>>>> shows a 41% degradation.
>>>>>> A typical performance breakdown for patch 1 is as follows:
>>>>>> 1. alloc cost 188,802,693 ns
>>>>>> 2. vmap cost 42,491,385 ns
>>>>>> 3. file read cost 4,180,876,702 ns
>>>>>> Therefore, a single direct I/O read of a large file may not be the
>>>>>> most performant approach.
>>>>>>
>>>>>> Direct I/O implemented via the sendfile method performs worst.
>>>>>>
>>>>>> When the file is small, the performance differences are not
>>>>>> significant, which is consistent with expectations.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Suggested use cases
>>>>>> ===
>>>>>> 1. When there is a need to read large files (GB level) and system
>>>>>> resources are scarce, especially when memory is limited. In this
>>>>>> scenario, using direct I/O for the file read can even bring
>>>>>> performance improvements. (may need patches 2-3)
>>>>>> 2. For embedded devices with limited RAM, direct I/O can save
>>>>>> system resources and avoid unnecessary data copying. So even
>>>>>> though performance is lower when reading small files, it can
>>>>>> still be used effectively.
>>>>>> 3. If there is sufficient memory, pinning the page cache of the
>>>>>> model files in memory and placing the files in an EROFS file
>>>>>> system for read-only access may be better. (EROFS does not
>>>>>> support direct I/O)
>>>>>>
>>>>>>
>>>>>> Changelog
>>>>>> ===
>>>>>> v1 [8]
>>>>>> v1->v2:
>>>>>> Use a heap flag for alloc-and-read instead of adding a new
>>>>>> DMA-buf ioctl command. [9]
>>>>>> Split the patchset to facilitate review and testing:
>>>>>> patch 1 implements alloc-and-read and adds the heap flag.
>>>>>> patches 2-4 add async read.
>>>>>> patch 5 makes the gather limit configurable.
>>>>>>
>>>>>> Reference
>>>>>> ===
>>>>>> [1]
>>>>>> https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
>>>>>> [2]
>>>>>> https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
>>>>>> [3]
>>>>>> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>>>>>> [4]
>>>>>> https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
>>>>>> [5]
>>>>>> https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
>>>>>> [6]
>>>>>> https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
>>>>>> [7]
>>>>>> https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230209102954.528942-7-dhowells@redhat.com/
>>>>>> [8]
>>>>>> https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
>>>>>> [9]
>>>>>> https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
>>>>>>
>>>>>> Huan Yang (5):
>>>>>> dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>>>> dma-buf: heaps: Introduce async alloc read ops
>>>>>> dma-buf: heaps: support alloc async read file
>>>>>> dma-buf: heaps: system_heap alloc support async read
>>>>>> dma-buf: heaps: configurable async read gather limit
>>>>>>
>>>>>> drivers/dma-buf/dma-heap.c | 552
>>>>>> +++++++++++++++++++++++++++-
>>>>>> drivers/dma-buf/heaps/system_heap.c | 70 +++-
>>>>>> include/linux/dma-heap.h | 53 ++-
>>>>>> include/uapi/linux/dma-heap.h | 11 +-
>>>>>> 4 files changed, 673 insertions(+), 13 deletions(-)
>>>>>>
>>>>>>
>>>>>> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
>>>>>
>>>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 9:05 ` Huan Yang
2024-07-30 10:42 ` Christian König
@ 2024-07-30 12:04 ` Huan Yang
2024-07-31 20:46 ` Daniel Vetter
1 sibling, 1 reply; 26+ messages in thread
From: Huan Yang @ 2024-07-30 12:04 UTC (permalink / raw)
To: Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Christian König, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, opensource.kernel
On 2024/7/30 17:05, Huan Yang wrote:
>
> On 2024/7/30 16:56, Daniel Vetter wrote:
>>
>> On Tue, Jul 30, 2024 at 03:57:44PM +0800, Huan Yang wrote:
>>> UDMA-BUF step:
>>> 1. memfd_create
>>> 2. open file(buffer/direct)
>>> 3. udmabuf create
>>> 4. mmap memfd
>>> 5. read file into memfd vaddr
>> Yeah this is really slow and the worst way to do it. You absolutely want
>> to start _all_ the io before you start creating the dma-buf, ideally
>> with
>> everything running in parallel. But just starting the direct I/O
>> asynchronously and then creating the udmabuf should be a lot faster and avoid
> That's great. Let me rephrase it, and please correct me if I'm wrong.
>
> UDMA-BUF step:
> 1. memfd_create
> 2. mmap memfd
> 3. open file(buffer/direct)
> 4. start thread to async read
> 5. udmabuf create
>
> With this, performance can improve.
I just tested it. The steps are:
UDMA-BUF steps:
1. memfd_create
2. mmap memfd
3. open file(buffer/direct)
4. start thread to async read
5. udmabuf create
6. join wait
Reading the 3GB file through all of these steps costs 1,527,103,431 ns, which is great.
>
>> needless serialization of operations.
>>
>> The other issue is that the mmap has some overhead, but might not be too
>> bad.
> Yes, the time spent on page faults after mmap should be negligible
> compared to the time spent on the file read.
>> -Sima
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 11:36 ` Huan Yang
@ 2024-07-30 13:11 ` Christian König
2024-07-31 1:48 ` Huan Yang
0 siblings, 1 reply; 26+ messages in thread
From: Christian König @ 2024-07-30 13:11 UTC (permalink / raw)
To: Huan Yang, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, linux-media, dri-devel, linaro-mm-sig,
linux-kernel
Cc: opensource.kernel
On 30.07.24 13:36, Huan Yang wrote:
>>>> Either drop the whole approach or change udmabuf to do what you
>>>> want to do.
>>> OK, if so, do I need to send a patch to make dma-buf support sendfile?
>>
>> Well the udmabuf approach doesn't need to use sendfile, so no.
>
> Got it, I won't send it again.
>
> About udmabuf: in testing I found it can't support large file reads
> due to the page array allocation.
>
> I already uploaded this patch, but did not receive an answer.
>
> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>
> Is there anything wrong with my understanding of it?
No, that patch was totally fine. Not getting a response is usually
something good.
In other words, when maintainers see something that won't work at all
they react immediately, but when nobody complains it usually means you
are on the right track.
As long as nobody has any good arguments against it I'm happy to take
that one upstream through drm-misc-next immediately, since it's clearly a
standalone improvement on its own.
Regards,
Christian.
>
>>
>> Regards,
>> Christian.
>>
>>>
>>>>
>>>> Apart from that I don't see a doable way which can be accepted into
>>>> the kernel.
>>> Thanks for your suggestion.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Patch 1 implements it.
>>>>>>>
>>>>>>> Patches 2-5 provide an approach for performance improvement.
>>>>>>>
>>>>>>> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
>>>>>>> synchronously read files using direct I/O.
>>>>>>>
>>>>>>> This approach helps to save CPU copying and avoid a certain
>>>>>>> degree of
>>>>>>> memory thrashing (page cache generation and reclamation)
>>>>>>>
>>>>>>> When dealing with large file sizes, the benefits of this
>>>>>>> approach become
>>>>>>> particularly significant.
>>>>>>>
>>>>>>> However, there are currently some methods that can improve
>>>>>>> performance,
>>>>>>> not just save system resources:
>>>>>>>
>>>>>>> Due to the large file size, for example, an AI 7B model of around
>>>>>>> 3.4GB, the
>>>>>>> time taken to allocate DMA-BUF memory will be relatively long.
>>>>>>> Waiting
>>>>>>> for the allocation to complete before reading the file will add
>>>>>>> to the
>>>>>>> overall time consumption. Therefore, the total time for DMA-BUF
>>>>>>> allocation and file read can be calculated using the formula
>>>>>>> T(total) = T(alloc) + T(I/O)
>>>>>>>
>>>>>>> However, if we change our approach, we don't necessarily need to
>>>>>>> wait
>>>>>>> for the DMA-BUF allocation to complete before initiating I/O. In
>>>>>>> fact,
>>>>>>> during the allocation process, we already hold a portion of the
>>>>>>> page,
>>>>>>> which means that waiting for subsequent page allocations to
>>>>>>> complete
>>>>>>> before carrying out file reads is actually unfair to the pages
>>>>>>> that have
>>>>>>> already been allocated.
>>>>>>>
>>>>>>> The allocation of pages is sequential, and the reading of the
>>>>>>> file is
>>>>>>> also sequential, with the content and size corresponding to the
>>>>>>> file.
>>>>>>> This means that the memory location for each page, which holds the
>>>>>>> content of a specific position in the file, can be determined at
>>>>>>> the
>>>>>>> time of allocation.
>>>>>>>
>>>>>>> However, to fully leverage I/O performance, it is best to wait and
>>>>>>> gather a certain number of pages before initiating batch
>>>>>>> processing.
>>>>>>>
>>>>>>> The default gather size is 128MB. Each gathered batch can be seen
>>>>>>> as one file-read work item: it maps the gathered pages into the
>>>>>>> vmalloc area to obtain a contiguous virtual address, which is used
>>>>>>> as the buffer holding the contents of the corresponding part of the
>>>>>>> file. So, when using direct I/O to read the file, the file content
>>>>>>> is written directly into the corresponding dma-buf memory without
>>>>>>> any additional copying. (compared to the pipe buffer)
>>>>>>>
>>>>>>> Consider other ways to read into dma-buf. If we assume reading
>>>>>>> after mmap
>>>>>>> dma-buf, we need to map the pages of the dma-buf to the user
>>>>>>> virtual
>>>>>>> address space. Also, the udmabuf memfd needs to do this operation too.
>>>>>>> Even if we support sendfile, the file copy also needs a buffer that
>>>>>>> you must set up.
>>>>>>> So, mapping pages to the vmalloc area does not incur any additional
>>>>>>> performance overhead compared to other methods.[6]
>>>>>>>
>>>>>>> Certainly, the administrator can also modify the gather size
>>>>>>> through patch5.
>>>>>>>
>>>>>>> The formula for the time taken for system_heap buffer allocation
>>>>>>> and
>>>>>>> file reading through async_read is as follows:
>>>>>>>
>>>>>>> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
>>>>>>>
>>>>>>> Compared to the synchronous read:
>>>>>>> T(total) = T(alloc) + T(I/O)
>>>>>>>
>>>>>>> Whichever of the allocation and the I/O takes longer dominates the
>>>>>>> total time, and the shorter of the two is hidden behind it.
>>>>>>>
>>>>>>> Therefore, the larger the file to be read, the greater the benefit.
>>>>>>>
>>>>>>> How to use
>>>>>>> ===
>>>>>>> Consider the current pathway for loading model files into DMA-BUF:
>>>>>>> 1. open dma-heap, get heap fd
>>>>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>>>> 4. mmap dma-buf fd, get vaddr
>>>>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>>>> 6. share, attach, whatever you want
>>>>>>>
>>>>>>> Using DMA_HEAP_ALLOC_AND_READ_FILE requires only a small change:
>>>>>>> 1. open dma-heap, get heap fd
>>>>>>> 2. open file, get file_fd(buffer/direct)
>>>>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>>>> flag, set file_fd
>>>>>>> instead of len. get dma-buf fd(contains file content)
>>>>>>> 4. share, attach, whatever you want
>>>>>>>
>>>>>>> So, it is easy to test.
>>>>>>>
>>>>>>> How to test
>>>>>>> ===
>>>>>>> The performance comparison will be conducted for the following
>>>>>>> scenarios:
>>>>>>> 1. normal
>>>>>>> 2. udmabuf with [3] patch
>>>>>>> 3. sendfile
>>>>>>> 4. only patch 1
>>>>>>> 5. patch1 - patch4.
>>>>>>>
>>>>>>> normal:
>>>>>>> 1. open dma-heap, get heap fd
>>>>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>>>> 4. mmap dma-buf fd, get vaddr
>>>>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>>>> 6. share, attach, whatever you want
>>>>>>>
>>>>>>> UDMA-BUF step:
>>>>>>> 1. memfd_create
>>>>>>> 2. open file(buffer/direct)
>>>>>>> 3. udmabuf create
>>>>>>> 4. mmap memfd
>>>>>>> 5. read file into memfd vaddr
>>>>>>>
>>>>>>> Sendfile steps (requires suitable splice_write/write_iter support;
>>>>>>> used only for comparison):
>>>>>>> 1. open dma-heap, get heap fd
>>>>>>> 2. open file, get file_fd(buffer/direct)
>>>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>>>> 4. sendfile file_fd to dma-buf fd
>>>>>>> 5. share, attach, whatever you want
>>>>>>>
>>>>>>> patch1/patch1-4:
>>>>>>> 1. open dma-heap, get heap fd
>>>>>>> 2. open file, get file_fd(buffer/direct)
>>>>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>>>> flag, set file_fd
>>>>>>> instead of len. get dma-buf fd(contains file content)
>>>>>>> 4. share, attach, whatever you want
>>>>>>>
>>>>>>> You can create a file to test it and compare the performance gap
>>>>>>> between the schemes. It is best to compare file sizes ranging from
>>>>>>> KB to MB to GB.
>>>>>>>
>>>>>>> The following test data will compare the performance differences
>>>>>>> between 512KB,
>>>>>>> 8MB, 1GB, and 3GB under various scenarios.
>>>>>>>
>>>>>>> Performance Test
>>>>>>> ===
>>>>>>> 12GB RAM phone
>>>>>>> UFS 4.0 (maximum speed 4GB/s)
>>>>>>> f2fs
>>>>>>> kernel 6.1 with patch [7] (otherwise kvec direct I/O reads are not
>>>>>>> supported)
>>>>>>> no memory pressure
>>>>>>> drop_caches is used before each test
>>>>>>>
>>>>>>> The average of 5 test results:
>>>>>>> | scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns)       | 3GB(ns)       |
>>>>>>> | ------------------- | ---------- | ---------- | ------------- | ------------- |
>>>>>>> | normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
>>>>>>> | udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000   | 2,108,419,923 |
>>>>>>> | sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
>>>>>>> | patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800   | 2,187,570,861 |
>>>>>>> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
>>>>>>> | patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 5,648,897,554 |
>>>>>>> | udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784   | 2,158,305,738 |
>>>>>>> | patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538   | 1,400,006,107 |
>>>>>
>>>>> In this test, sendfile, being based on the pipe buffer, does not
>>>>> help much.
>>>>>
>>>>> udmabuf performs well, but I think our OEM driver can't adapt to it.
>>>>> (Also, AOSP does not enable this feature.)
>>>>>
>>>>>
>>>>> Anyway, I am sending this patchset in the hope of further discussion.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>>>
>>>>>>> So, based on the test results:
>>>>>>>
>>>>>>> When the file is large, the full patchset has the highest performance:
>>>>>>> compared to normal, it is a 50% improvement, while patch 1 alone
>>>>>>> shows a 41% degradation.
>>>>>>> A typical performance breakdown for patch 1 is as follows:
>>>>>>> 1. alloc cost 188,802,693 ns
>>>>>>> 2. vmap cost 42,491,385 ns
>>>>>>> 3. file read cost 4,180,876,702 ns
>>>>>>> Therefore, a single direct I/O read of a large file may not be the
>>>>>>> most performant approach.
>>>>>>>
>>>>>>> Direct I/O implemented via the sendfile method performs worst.
>>>>>>>
>>>>>>> When the file is small, the performance differences are not
>>>>>>> significant, which is consistent with expectations.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Suggested use cases
>>>>>>> ===
>>>>>>> 1. When there is a need to read large files (GB level) and system
>>>>>>> resources are scarce, especially when memory is limited. In this
>>>>>>> scenario, using direct I/O for the file read can even bring
>>>>>>> performance improvements. (may need patches 2-3)
>>>>>>> 2. For embedded devices with limited RAM, direct I/O can save
>>>>>>> system resources and avoid unnecessary data copying. So even
>>>>>>> though performance is lower when reading small files, it can
>>>>>>> still be used effectively.
>>>>>>> 3. If there is sufficient memory, pinning the page cache of the
>>>>>>> model files in memory and placing the files in an EROFS file
>>>>>>> system for read-only access may be better. (EROFS does not
>>>>>>> support direct I/O)
>>>>>>>
>>>>>>>
>>>>>>> Changelog
>>>>>>> ===
>>>>>>> v1 [8]
>>>>>>> v1->v2:
>>>>>>> Use a heap flag for alloc-and-read instead of adding a new
>>>>>>> DMA-buf ioctl command. [9]
>>>>>>> Split the patchset to facilitate review and testing:
>>>>>>> patch 1 implements alloc-and-read and adds the heap flag.
>>>>>>> patches 2-4 add async read.
>>>>>>> patch 5 makes the gather limit configurable.
>>>>>>>
>>>>>>> Reference
>>>>>>> ===
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
>>>>>>> [2]
>>>>>>> https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
>>>>>>> [3]
>>>>>>> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>>>>>>> [4] https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
>>>>>>> [5] https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
>>>>>>> [6]
>>>>>>> https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
>>>>>>> [7]
>>>>>>> https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230209102954.528942-7-dhowells@redhat.com/
>>>>>>> [8]
>>>>>>> https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
>>>>>>> [9]
>>>>>>> https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
>>>>>>>
>>>>>>> Huan Yang (5):
>>>>>>> dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>>>>> dma-buf: heaps: Introduce async alloc read ops
>>>>>>> dma-buf: heaps: support alloc async read file
>>>>>>> dma-buf: heaps: system_heap alloc support async read
>>>>>>> dma-buf: heaps: configurable async read gather limit
>>>>>>>
>>>>>>> drivers/dma-buf/dma-heap.c | 552
>>>>>>> +++++++++++++++++++++++++++-
>>>>>>> drivers/dma-buf/heaps/system_heap.c | 70 +++-
>>>>>>> include/linux/dma-heap.h | 53 ++-
>>>>>>> include/uapi/linux/dma-heap.h | 11 +-
>>>>>>> 4 files changed, 673 insertions(+), 13 deletions(-)
>>>>>>>
>>>>>>>
>>>>>>> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
>>>>>>
>>>>
>>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 8:14 ` Huan Yang
2024-07-30 8:37 ` Christian König
@ 2024-07-30 17:19 ` T.J. Mercier
2024-07-31 1:47 ` Huan Yang
1 sibling, 1 reply; 26+ messages in thread
From: T.J. Mercier @ 2024-07-30 17:19 UTC (permalink / raw)
To: Huan Yang
Cc: Christian König, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, linux-media, dri-devel, linaro-mm-sig,
linux-kernel, opensource.kernel
On Tue, Jul 30, 2024 at 1:14 AM Huan Yang <link@vivo.com> wrote:
>
>
> 在 2024/7/30 16:03, Christian König 写道:
> > On 30.07.24 09:57, Huan Yang wrote:
> >> Background
> >> ====
> >> Some user may need load file into dma-buf, current way is:
> >> 1. allocate a dma-buf, get dma-buf fd
> >> 2. mmap dma-buf fd into user vaddr
> >> 3. read(file_fd, vaddr, fsz)
> >> Due to dma-buf user map can't support direct I/O[1], the file read
> >> must be buffer I/O.
> >>
> >> This means that during the process of reading the file into dma-buf,
> >> page cache needs to be generated, and the corresponding content needs to
> >> be first copied to the page cache before being copied to the dma-buf.
> >>
> >> This way worked well when reading relatively small files before, as
> >> the page cache can cache the file content, thus improving performance.
> >>
> >> However, there are new challenges currently, especially as AI models are
> >> becoming larger and need to be shared between DMA devices and the CPU
> >> via dma-buf.
> >>
> >> For example, our 7B model file size is around 3.4GB. Using the
> >> previous would mean generating a total of 3.4GB of page cache
> >> (even if it will be reclaimed), and also requiring the copying of 3.4GB
> >> of content between page cache and dma-buf.
> >>
> >> Due to the limited resources of system memory, files in the gigabyte
> >> range
> >> cannot persist in memory indefinitely, so this portion of page cache may
> >> not provide much assistance for subsequent reads. Additionally, the
> >> existence of page cache will consume additional system resources due to
> >> the extra copying required by the CPU.
> >>
> >> Therefore, I think it is necessary for dma-buf to support direct I/O.
> >>
> >> However, direct I/O file reads cannot be performed using the buffer
> >> mmaped by the user space for the dma-buf.[1]
> >>
> >> Here are some discussions on implementing direct I/O using dma-buf:
> >>
> >> mmap[1]
> >> ---
> >> dma-buf never support user map vaddr use of direct I/O.
> >>
> >> udmabuf[2]
> >> ---
> >> Currently, udmabuf can use the memfd method to read files into
> >> dma-buf in direct I/O mode.
> >>
> >> However, if the size is large, the current udmabuf needs to adjust the
> >> corresponding size_limit (default 64MB). Using udmabuf for files at
> >> the 3GB level is not a very good approach; it needs some internal
> >> adjustments to handle this [3], or else creation fails.
> >>
> >> But, it is indeed a viable way to enable dma-buf to support direct I/O.
> >> However, it is necessary to initiate the file read after the memory
> >> allocation
> >> is completed, and handle race conditions carefully.
> >>
> >> sendfile/splice[4]
> >> ---
> >> Another way to enable dma-buf to support direct I/O is by implementing
> >> splice_write/write_iter in the dma-buf file operations (fops) to adapt
> >> to the sendfile method.
> >> However, the current sendfile/splice calls are based on pipe. When using
> >> direct I/O to read a file, the content needs to be copied to the buffer
> >> allocated by the pipe (default 64KB), and then the dma-buf fops'
> >> splice_write needs to be called to write the content into the dma-buf.
> >> This approach requires serially reading a pipe-buffer-sized chunk of
> >> the file into the pipe buffer and then waiting for it to be written
> >> to the dma-buf before reading the next one. (I/O performance is
> >> relatively weak under direct I/O.)
> >> Moreover, due to the existence of the pipe buffer, even when using
> >> direct I/O and not needing to generate additional page cache,
> >> there still needs to be a CPU copy.
> >>
> >> copy_file_range[5]
> >> ---
> >> Consider copy_file_range: it only supports copying files within the
> >> same file system. Similarly, it is not very practical.
> >>
> >>
> >> So, currently, there is no particularly suitable solution on VFS to
> >> allow dma-buf to support direct I/O for large file reads.
> >>
> >> This patchset provides an idea to complete file reads when requesting a
> >> dma-buf.
> >>
> >> Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
> >> ===
> >> This patch provides a method to immediately read the file content after
> >> the dma-buf is allocated, and only returns the dma-buf file descriptor
> >> after the file is fully read.
> >>
> >> Since the dma-buf file descriptor is not returned, no other thread can
> >> access it except for the current thread, so we don't need to worry about
> >> race conditions.
> >
> > That is a completely false assumption.
> Can you provide a detailed explanation as to why this assumption is
> incorrect? thanks.
> >
> >>
> >> Map the dma-buf to the vmalloc area and initiate file reads in kernel
> >> space, supporting both buffer I/O and direct I/O.
> >>
> >> This patch adds the DMA_HEAP_ALLOC_AND_READ heap_flag for user.
> >> When a user needs to allocate a dma-buf and read a file, they should
> >> pass this heap flag. As the size of the file being read is fixed,
> >> there is no
> >> need to pass the 'len' parameter. Instead, The file_fd needs to be
> >> passed to
> >> indicate to the kernel the file that needs to be read.
> >>
> >> The file open flag determines the mode of file reading.
> >> But please note that if direct I/O (O_DIRECT) is needed to read the
> >> file, the file size must be page aligned. (not needed with patches
> >> 2-5)
> >>
> >> Therefore, for the user, len and file_fd are mutually exclusive,
> >> and they are combined using a union.
> >>
> >> Once the user obtains the dma-buf fd, the dma-buf directly contains the
> >> file content.
> >
> > And I'm repeating myself, but this is a complete NAK from my side to
> > this approach.
> >
> > We pointed out multiple ways of how to implement this cleanly and not
> > by hacking functionality into the kernel which absolutely doesn't
> > belong there.
> In this patchset, I have provided performance comparisons of each of
> these methods. Can you please provide more opinions?
> >
> > Regards,
> > Christian.
> >
> >>
> >> Patch 1 implements it.
> >>
> >> Patches 2-5 provide an approach for performance improvement.
> >>
> >> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
> >> synchronously read files using direct I/O.
> >>
> >> This approach helps to save CPU copying and avoid a certain degree of
> >> memory thrashing (page cache generation and reclamation)
> >>
> >> When dealing with large file sizes, the benefits of this approach become
> >> particularly significant.
> >>
> >> However, there are currently some methods that can improve performance,
> >> not just save system resources:
> >>
> >> Due to the large file size, for example, an AI 7B model of around
> >> 3.4GB, the
> >> time taken to allocate DMA-BUF memory will be relatively long. Waiting
> >> for the allocation to complete before reading the file will add to the
> >> overall time consumption. Therefore, the total time for DMA-BUF
> >> allocation and file read can be calculated using the formula
> >> T(total) = T(alloc) + T(I/O)
> >>
> >> However, if we change our approach, we don't necessarily need to wait
> >> for the DMA-BUF allocation to complete before initiating I/O. In fact,
> >> during the allocation process, we already hold a portion of the page,
> >> which means that waiting for subsequent page allocations to complete
> >> before carrying out file reads is actually unfair to the pages that have
> >> already been allocated.
> >>
> >> The allocation of pages is sequential, and the reading of the file is
> >> also sequential, with the content and size corresponding to the file.
> >> This means that the memory location for each page, which holds the
> >> content of a specific position in the file, can be determined at the
> >> time of allocation.
> >>
> >> However, to fully leverage I/O performance, it is best to wait and
> >> gather a certain number of pages before initiating batch processing.
> >>
> >> The default gather size is 128MB. Each gathered batch can be seen as
> >> one file-read work item: it maps the gathered pages into the vmalloc
> >> area to obtain a contiguous virtual address, which is used as the
> >> buffer holding the contents of the corresponding part of the file.
> >> So, when using direct I/O to read the file, the file content is
> >> written directly into the corresponding dma-buf memory without any
> >> additional copying. (compared to the pipe buffer)
> >>
> >> Consider other ways to read into dma-buf. If we assume reading after
> >> mmap
> >> dma-buf, we need to map the pages of the dma-buf to the user virtual
> >> address space. Also, the udmabuf memfd needs to do this operation too.
> >> Even if we support sendfile, the file copy also needs a buffer that
> >> you must set up.
> >> So, mapping pages to the vmalloc area does not incur any additional
> >> performance overhead compared to other methods.[6]
> >>
> >> Certainly, the administrator can also modify the gather size through
> >> patch5.
> >>
> >> The formula for the time taken for system_heap buffer allocation and
> >> file reading through async_read is as follows:
> >>
> >> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
> >>
> >> Compared to the synchronous read:
> >> T(total) = T(alloc) + T(I/O)
> >>
> >> Whichever of the allocation and the I/O takes longer dominates the
> >> total time, and the shorter of the two is hidden behind it.
> >>
> >> Therefore, the larger the file to be read, the greater the benefit.
> >>
> >> How to use
> >> ===
> >> Consider the current pathway for loading model files into DMA-BUF:
> >> 1. open dma-heap, get heap fd
> >> 2. open file, get file_fd(can't use O_DIRECT)
> >> 3. use file len to allocate dma-buf, get dma-buf fd
> >> 4. mmap dma-buf fd, get vaddr
> >> 5. read(file_fd, vaddr, file_size) into dma-buf pages
> >> 6. share, attach, whatever you want
> >>
> >> Using DMA_HEAP_ALLOC_AND_READ_FILE requires only a small change:
> >> 1. open dma-heap, get heap fd
> >> 2. open file, get file_fd(buffer/direct)
> >> 3. allocate dma-buf with the DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
> >> passing file_fd instead of len; get a dma-buf fd that already
> >> contains the file content
> >> 4. share, attach, whatever you want
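The proposed flow can be sketched as below. Note that DMA_HEAP_ALLOC_AND_READ_FILE and the len/file_fd union are introduced by this patchset and are not upstream, so the flag value, the locally defined macro, and the helper name here are illustrative assumptions only.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/* NOT upstream: this flag is proposed by the patchset; the value here
 * is purely illustrative. */
#ifndef DMA_HEAP_ALLOC_AND_READ_FILE
#define DMA_HEAP_ALLOC_AND_READ_FILE (1ULL << 0)
#endif

int alloc_and_read(const char *path)
{
	int heap_fd = open("/dev/dma_heap/system", O_RDWR);
	/* either buffered or O_DIRECT opens would be allowed with the flag */
	int file_fd = open(path, O_RDONLY);

	if (heap_fd < 0 || file_fd < 0)
		return -1;

	struct dma_heap_allocation_data data = {
		/* proposed union: file_fd is passed where len normally goes */
		.len = (unsigned long long)file_fd,
		.fd_flags = O_RDWR | O_CLOEXEC,
		.heap_flags = DMA_HEAP_ALLOC_AND_READ_FILE,
	};
	if (ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data) < 0) {
		close(heap_fd);
		close(file_fd);
		return -1;
	}
	close(heap_fd);
	close(file_fd);
	return data.fd; /* dma-buf already contains the file content */
}
```

On a kernel without the patchset the ioctl simply fails, since the flag is unknown; the sketch only shows the intended calling convention.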
> >>
> >> So it is easy to test.
> >>
> >> How to test
> >> ===
> >> The performance comparison will be conducted for the following
> >> scenarios:
> >> 1. normal
> >> 2. udmabuf with [3] patch
> >> 3. sendfile
> >> 4. only patch 1
> >> 5. patch1 - patch4.
> >>
> >> normal:
> >> 1. open dma-heap, get heap fd
> >> 2. open file, get file_fd(can't use O_DIRECT)
> >> 3. use file len to allocate dma-buf, get dma-buf fd
> >> 4. mmap dma-buf fd, get vaddr
> >> 5. read(file_fd, vaddr, file_size) into dma-buf pages
> >> 6. share, attach, whatever you want
> >>
> >> UDMA-BUF step:
> >> 1. memfd_create
> >> 2. open file(buffer/direct)
> >> 3. udmabuf create
> >> 4. mmap memfd
> >> 5. read file into memfd vaddr
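The udmabuf steps above can be sketched with the existing /dev/udmabuf uapi. Error handling is trimmed and the helper names are illustrative; the page-size alignment and the F_SEAL_SHRINK seal are requirements of the udmabuf driver.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

/* udmabuf requires a page-aligned size */
static uint64_t align_up(uint64_t sz, uint64_t align)
{
	return (sz + align - 1) & ~(align - 1);
}

int file_to_udmabuf(int file_fd, uint64_t file_size)
{
	uint64_t size = align_up(file_size, (uint64_t)getpagesize());
	int memfd = memfd_create("udmabuf-src", MFD_ALLOW_SEALING);

	if (memfd < 0 || ftruncate(memfd, size) < 0)
		return -1;
	/* the memfd must be sealed against shrinking before create */
	if (fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK) < 0)
		return -1;

	int dev = open("/dev/udmabuf", O_RDWR);
	struct udmabuf_create create = {
		.memfd = memfd, .offset = 0, .size = size,
	};
	int buf_fd = dev < 0 ? -1 : ioctl(dev, UDMABUF_CREATE, &create);
	if (buf_fd < 0)
		return -1;

	/* the read (buffered or O_DIRECT) lands directly in the memfd
	 * pages, which back the dma-buf */
	void *vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_SHARED, memfd, 0);
	if (vaddr == MAP_FAILED || read(file_fd, vaddr, file_size) < 0)
		return -1;
	munmap(vaddr, size);
	return buf_fd;
}
```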
> >>
> >> Sendfile steps (needs suitable splice_write/write_iter; used only for
> >> comparison):
> >> 1. open dma-heap, get heap fd
> >> 2. open file, get file_fd(buffer/direct)
> >> 3. use file len to allocate dma-buf, get dma-buf fd
> >> 4. sendfile file_fd to dma-buf fd
> >> 5. share, attach, whatever you want
> >>
> >> patch1/patch1-4:
> >> 1. open dma-heap, get heap fd
> >> 2. open file, get file_fd(buffer/direct)
> >> 3. allocate dma-buf with the DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
> >> passing file_fd instead of len; get a dma-buf fd that already
> >> contains the file content
> >> 4. share, attach, whatever you want
> >>
> >> You can create a file to test it and compare the performance gap
> >> between the schemes. It is best to compare file sizes ranging from KB
> >> to MB to GB.
> >>
> >> The following test data compares the performance differences between
> >> 512KB, 8MB, 1GB, and 3GB files under the various scenarios.
> >>
> >> Performance Test
> >> ===
> >> 12G RAM phone
> >> UFS4.0 (maximum speed 4GB/s)
> >> f2fs
> >> kernel 6.1 with patch[7] (otherwise kvec direct I/O reads are not
> >> supported)
> >> no memory pressure
> >> drop_caches is used before each test
> >>
> >> The average of 5 test results:
> >> | scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns)       | 3GB(ns)       |
> >> | ------------------- | ---------- | ---------- | ------------- | ------------- |
> >> | normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
> >> | udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000   | 2,108,419,923 |
> >> | sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
> >> | patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800   | 2,187,570,861 |
> >> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
> >> | patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 5,648,897,554 |
> >> | udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784   | 2,158,305,738 |
> >> | patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538   | 1,400,006,107 |
>
> With this test, sendfile, being based on the pipe buffer, does not
> help much.
>
> udmabuf is good, but I think our OEM drivers can't suit it. (And AOSP
> does not enable this feature.)
Hi Huan,
We should be able to turn on udmabuf for the Android kernels. We don't
have CONFIG_UDMABUF because nobody has wanted it so far. It's
encouraging to see your latest results!
-T.J.
>
> Anyway, I am sending this patchset in the hope of further discussion.
>
> Thanks.
>
> >>
> >> So, based on the test results:
> >>
> >> When the file is large, the full patchset has the highest
> >> performance: compared to normal, it is a 50% improvement, while
> >> patch1 alone shows a 41% degradation.
> >> A typical performance breakdown for patch1 is as follows:
> >> 1. alloc cost 188,802,693 ns
> >> 2. vmap cost 42,491,385 ns
> >> 3. file read cost 4,180,876,702 ns
> >> Therefore, a single direct I/O read of a large file may not be the
> >> optimal approach for performance.
> >>
> >> The performance of direct I/O implemented by the sendfile method is
> >> the worst.
> >>
> >> When the file size is small, the difference in performance is not
> >> significant, which is consistent with expectations.
> >>
> >>
> >>
> >> Suggested use cases
> >> ===
> >> 1. When large files need to be read and system resources are scarce,
> >>    especially when memory is limited (GB level). In this scenario,
> >>    using direct I/O for file reading can even bring performance
> >>    improvements. (may need patch2-3)
> >> 2. For embedded devices with limited RAM, direct I/O saves system
> >>    resources and avoids unnecessary data copying. So even if
> >>    performance is lower when reading small files, it can still be
> >>    used effectively.
> >> 3. If there is sufficient memory, pinning the page cache of the model
> >>    files in memory and placing the files on an EROFS file system for
> >>    read-only access may be better. (EROFS does not support direct I/O)
> >>
> >>
> >> Changelog
> >> ===
> >> v1 [8]
> >> v1->v2:
> >> Use the heap flag method for alloc-and-read instead of adding a new
> >> DMA-buf ioctl command. [9]
> >> Split the patchset to facilitate review and testing:
> >>   patch 1 implements alloc-and-read and adds the heap flag.
> >>   patch 2-4 add async read.
> >>   patch 5 makes the gather limit configurable.
> >>
> >> Reference
> >> ===
> >> [1]
> >> https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
> >> [2]
> >> https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
> >> [3]
> >> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
> >> [4]
> >> https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
> >> [5]
> >> https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
> >> [6]
> >> https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
> >> [7]
> >> https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230209102954.528942-7-dhowells@redhat.com/
> >> [8]
> >> https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
> >> [9]
> >> https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
> >>
> >> Huan Yang (5):
> >> dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
> >> dma-buf: heaps: Introduce async alloc read ops
> >> dma-buf: heaps: support alloc async read file
> >> dma-buf: heaps: system_heap alloc support async read
> >> dma-buf: heaps: configurable async read gather limit
> >>
> >> drivers/dma-buf/dma-heap.c | 552 +++++++++++++++++++++++++++-
> >> drivers/dma-buf/heaps/system_heap.c | 70 +++-
> >> include/linux/dma-heap.h | 53 ++-
> >> include/uapi/linux/dma-heap.h | 11 +-
> >> 4 files changed, 673 insertions(+), 13 deletions(-)
> >>
> >>
> >> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
> >
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 17:19 ` T.J. Mercier
@ 2024-07-31 1:47 ` Huan Yang
0 siblings, 0 replies; 26+ messages in thread
From: Huan Yang @ 2024-07-31 1:47 UTC (permalink / raw)
To: T.J. Mercier
Cc: Christian König, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, linux-media, dri-devel, linaro-mm-sig,
linux-kernel, opensource.kernel
On 2024/7/31 1:19, T.J. Mercier wrote:
> On Tue, Jul 30, 2024 at 1:14 AM Huan Yang <link@vivo.com> wrote:
>>
>> On 2024/7/30 16:03, Christian König wrote:
>>> On 30.07.24 09:57, Huan Yang wrote:
>>>> Background
>>>> ====
>>>> Some user may need load file into dma-buf, current way is:
>>>> 1. allocate a dma-buf, get dma-buf fd
>>>> 2. mmap dma-buf fd into user vaddr
>>>> 3. read(file_fd, vaddr, fsz)
>>>> Due to dma-buf user map can't support direct I/O[1], the file read
>>>> must be buffer I/O.
>>>>
>>>> This means that during the process of reading the file into dma-buf,
>>>> page cache needs to be generated, and the corresponding content needs to
>>>> be first copied to the page cache before being copied to the dma-buf.
>>>>
>>>> This way worked well when reading relatively small files before, as
>>>> the page cache can cache the file content, thus improving performance.
>>>>
>>>> However, there are new challenges currently, especially as AI models are
>>>> becoming larger and need to be shared between DMA devices and the CPU
>>>> via dma-buf.
>>>>
>>>> For example, our 7B model file size is around 3.4GB. Using the
>>>> previous would mean generating a total of 3.4GB of page cache
>>>> (even if it will be reclaimed), and also requiring the copying of 3.4GB
>>>> of content between page cache and dma-buf.
>>>>
>>>> Due to the limited resources of system memory, files in the gigabyte
>>>> range
>>>> cannot persist in memory indefinitely, so this portion of page cache may
>>>> not provide much assistance for subsequent reads. Additionally, the
>>>> existence of page cache will consume additional system resources due to
>>>> the extra copying required by the CPU.
>>>>
>>>> Therefore, I think it is necessary for dma-buf to support direct I/O.
>>>>
>>>> However, direct I/O file reads cannot be performed using the buffer
>>>> mmaped by the user space for the dma-buf.[1]
>>>>
>>>> Here are some discussions on implementing direct I/O using dma-buf:
>>>>
>>>> mmap[1]
>>>> ---
>>>> dma-buf never support user map vaddr use of direct I/O.
>>>>
>>>> udmabuf[2]
>>>> ---
>>>> Currently, udmabuf can use the memfd method to read files into
>>>> dma-buf in direct I/O mode.
>>>>
>>>> However, if the size is large, the current udmabuf needs its
>>>> size_limit adjusted (default 64MB). Using udmabuf for files at the
>>>> 3GB level is not a very good approach: it needs some internal
>>>> adjustments to handle this[3], or else creation fails.
>>>>
>>>> But, it is indeed a viable way to enable dma-buf to support direct I/O.
>>>> However, it is necessary to initiate the file read after the memory
>>>> allocation
>>>> is completed, and handle race conditions carefully.
>>>>
>>>> sendfile/splice[4]
>>>> ---
>>>> Another way to enable dma-buf to support direct I/O is by implementing
>>>> splice_write/write_iter in the dma-buf file operations (fops) to adapt
>>>> to the sendfile method.
>>>> However, the current sendfile/splice calls are based on pipe. When using
>>>> direct I/O to read a file, the content needs to be copied to the buffer
>>>> allocated by the pipe (default 64KB), and then the dma-buf fops'
>>>> splice_write needs to be called to write the content into the dma-buf.
>>>> This approach serially reads one pipe-buffer's worth of file content
>>>> into the pipe and then waits for it to be written into the dma-buf
>>>> before reading the next chunk. (I/O performance is relatively weak
>>>> under direct I/O.)
>>>> Moreover, due to the existence of the pipe buffer, even when using
>>>> direct I/O and not needing to generate additional page cache,
>>>> there still needs to be a CPU copy.
>>>>
>>>> copy_file_range[5]
>>>> ---
>>>> As for copy_file_range, it only supports copying files within the
>>>> same file system, so it is likewise not very practical.
>>>>
>>>>
>>>> So, currently, there is no particularly suitable solution on VFS to
>>>> allow dma-buf to support direct I/O for large file reads.
>>>>
>>>> This patchset provides an idea to complete file reads when requesting a
>>>> dma-buf.
>>>>
>>>> Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>> ===
>>>> This patch provides a method to immediately read the file content after
>>>> the dma-buf is allocated, and only returns the dma-buf file descriptor
>>>> after the file is fully read.
>>>>
>>>> Since the dma-buf file descriptor is not returned, no other thread can
>>>> access it except for the current thread, so we don't need to worry about
>>>> race conditions.
>>> That is a completely false assumption.
>> Can you provide a detailed explanation as to why this assumption is
>> incorrect? thanks.
>>>> Map the dma-buf to the vmalloc area and initiate file reads in kernel
>>>> space, supporting both buffer I/O and direct I/O.
>>>>
>>>> This patch adds the DMA_HEAP_ALLOC_AND_READ_FILE heap flag for users.
>>>> When a user needs to allocate a dma-buf and read a file, they should
>>>> pass this heap flag. As the size of the file being read is fixed,
>>>> there is no need to pass the 'len' parameter. Instead, the file_fd
>>>> must be passed to indicate to the kernel which file needs to be read.
>>>>
>>>> The file open flag determines the mode of file reading.
>>>> But, please note that if direct I/O(O_DIRECT) is needed to read the
>>>> file,
>>>> the file size must be page aligned. (with patch 2-5, no need)
>>>>
>>>> Therefore, for the user, len and file_fd are mutually exclusive,
>>>> and they are combined using a union.
>>>>
>>>> Once the user obtains the dma-buf fd, the dma-buf directly contains the
>>>> file content.
>>> And I'm repeating myself, but this is a complete NAK from my side to
>>> this approach.
>>>
>>> We pointed out multiple ways of how to implement this cleanly and not
>>> by hacking functionality into the kernel which absolutely doesn't
>>> belong there.
>> In this patchset, I have provided performance comparisons of each of
>> these methods. Can you please provide more opinions?
>>> Regards,
>>> Christian.
>>>
>>>> Patch 1 implement it.
>>>>
>>>> Patch 2-5 provides an approach for performance improvement.
>>>>
>>>> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
>>>> synchronously read files using direct I/O.
>>>>
>>>> This approach helps to save CPU copying and avoid a certain degree of
>>>> memory thrashing (page cache generation and reclamation)
>>>>
>>>> When dealing with large file sizes, the benefits of this approach become
>>>> particularly significant.
>>>>
>>>> However, there are currently some methods that can improve performance,
>>>> not just save system resources:
>>>>
>>>> Due to the large file size, for example, an AI 7B model of around
>>>> 3.4GB, the
>>>> time taken to allocate DMA-BUF memory will be relatively long. Waiting
>>>> for the allocation to complete before reading the file will add to the
>>>> overall time consumption. Therefore, the total time for DMA-BUF
>>>> allocation and file read can be calculated using the formula
>>>> T(total) = T(alloc) + T(I/O)
>>>>
>>>> However, if we change our approach, we don't necessarily need to wait
>>>> for the DMA-BUF allocation to complete before initiating I/O. In fact,
>>>> during the allocation process, we already hold a portion of the page,
>>>> which means that waiting for subsequent page allocations to complete
>>>> before carrying out file reads is actually unfair to the pages that have
>>>> already been allocated.
>>>>
>>>> The allocation of pages is sequential, and the reading of the file is
>>>> also sequential, with the content and size corresponding to the file.
>>>> This means that the memory location for each page, which holds the
>>>> content of a specific position in the file, can be determined at the
>>>> time of allocation.
>>>>
>>>> However, to fully leverage I/O performance, it is best to wait and
>>>> gather a certain number of pages before initiating batch processing.
>>>>
>>>> The default gather size is 128MB, so each gathered batch can be seen
>>>> as one file-read work item: it maps the gathered pages into the
>>>> vmalloc area to obtain a contiguous virtual address, which is used as
>>>> the buffer holding the corresponding part of the file. So, when using
>>>> direct I/O to read a file, the file content is written directly into
>>>> the dma-buf's backing memory without any additional copy (compare
>>>> this to the pipe buffer).
>>>>
>>>> Consider the other ways to read into a dma-buf. Reading after
>>>> mmap'ing the dma-buf requires mapping its pages into the user virtual
>>>> address space, and the udmabuf memfd needs the same operation. Even
>>>> with sendfile support, the file copy still needs a buffer that must
>>>> be set up. So, mapping pages into the vmalloc area does not incur any
>>>> additional performance overhead compared to the other methods.[6]
>>>>
>>>> Certainly, the administrator can also modify the gather size through
>>>> patch5.
>>>>
>>>> The formula for the time taken for system_heap buffer allocation and
>>>> file reading through async_read is as follows:
>>>>
>>>> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
>>>>
>>>> Compared to the synchronous read:
>>>> T(total) = T(alloc) + T(I/O)
>>>>
>>>> Whichever of allocation and I/O takes longer dominates the total
>>>> time; the shorter of the two is hidden behind it.
>>>>
>>>> Therefore, the larger the size of the file that needs to be read, the
>>>> greater the corresponding benefits will be.
>>>>
>>>> How to use
>>>> ===
>>>> Consider the current pathway for loading model files into DMA-BUF:
>>>> 1. open dma-heap, get heap fd
>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>> 4. mmap dma-buf fd, get vaddr
>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>> 6. share, attach, whatever you want
>>>>
>>>> Using DMA_HEAP_ALLOC_AND_READ_FILE requires just a small change:
>>>> 1. open dma-heap, get heap fd
>>>> 2. open file, get file_fd(buffer/direct)
>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
>>>> set file_fd
>>>> instead of len. get dma-buf fd(contains file content)
>>>> 4. share, attach, whatever you want
>>>>
>>>> So it is easy to test.
>>>>
>>>> How to test
>>>> ===
>>>> The performance comparison will be conducted for the following
>>>> scenarios:
>>>> 1. normal
>>>> 2. udmabuf with [3] patch
>>>> 3. sendfile
>>>> 4. only patch 1
>>>> 5. patch1 - patch4.
>>>>
>>>> normal:
>>>> 1. open dma-heap, get heap fd
>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>> 4. mmap dma-buf fd, get vaddr
>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>> 6. share, attach, whatever you want
>>>>
>>>> UDMA-BUF step:
>>>> 1. memfd_create
>>>> 2. open file(buffer/direct)
>>>> 3. udmabuf create
>>>> 4. mmap memfd
>>>> 5. read file into memfd vaddr
>>>>
>>>> Sendfile steps (needs suitable splice_write/write_iter; used only
>>>> for comparison):
>>>> 1. open dma-heap, get heap fd
>>>> 2. open file, get file_fd(buffer/direct)
>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>> 4. sendfile file_fd to dma-buf fd
>>>> 5. share, attach, whatever you want
>>>>
>>>> patch1/patch1-4:
>>>> 1. open dma-heap, get heap fd
>>>> 2. open file, get file_fd(buffer/direct)
>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
>>>> set file_fd
>>>> instead of len. get dma-buf fd(contains file content)
>>>> 4. share, attach, whatever you want
>>>>
>>>> You can create a file to test it. Compare the performance gap between
>>>> the two.
>>>> It is best to compare the differences in file size from KB to MB to GB.
>>>>
>>>> The following test data will compare the performance differences
>>>> between 512KB,
>>>> 8MB, 1GB, and 3GB under various scenarios.
>>>>
>>>> Performance Test
>>>> ===
>>>> 12G RAM phone
>>>> UFS4.0(the maximum speed is 4GB/s. ),
>>>> f2fs
>>>> kernel 6.1 with patch[7] (or else, can't support kvec direct I/O
>>>> read.)
>>>> no memory pressure.
>>>> drop_cache is used for each test.
>>>>
>>>> The average of 5 test results:
>>>> | scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns)       | 3GB(ns)       |
>>>> | ------------------- | ---------- | ---------- | ------------- | ------------- |
>>>> | normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
>>>> | udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000   | 2,108,419,923 |
>>>> | sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
>>>> | patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800   | 2,187,570,861 |
>>>> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
>>>> | patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 5,648,897,554 |
>>>> | udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784   | 2,158,305,738 |
>>>> | patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538   | 1,400,006,107 |
>> With this test, sendfile, being based on the pipe buffer, does not
>> help much.
>>
>> udmabuf is good, but I think our OEM drivers can't suit it. (And AOSP
>> does not enable this feature.)
> Hi Huan,
>
> We should be able to turn on udmabuf for the Android kernels. We don't
> have CONFIG_UDMABUF because nobody has wanted it so far. It's
> encouraging to see your latest results!
OK, that's great. I will study udmabuf further, notify our partners,
and make every effort to encourage them to adopt udmabuf.
>
> -T.J.
>
>
>> Anyway, I am sending this patchset in the hope of further discussion.
>>
>> Thanks.
>>
>>>> So, based on the test results:
>>>>
>>>> When the file is large, the full patchset has the highest
>>>> performance: compared to normal, it is a 50% improvement, while
>>>> patch1 alone shows a 41% degradation.
>>>> A typical performance breakdown for patch1 is as follows:
>>>> 1. alloc cost 188,802,693 ns
>>>> 2. vmap cost 42,491,385 ns
>>>> 3. file read cost 4,180,876,702 ns
>>>> Therefore, a single direct I/O read of a large file may not be the
>>>> optimal approach for performance.
>>>>
>>>> The performance of direct I/O implemented by the sendfile method is
>>>> the worst.
>>>>
>>>> When the file size is small, the difference in performance is not
>>>> significant, which is consistent with expectations.
>>>>
>>>>
>>>>
>>>> Suggested use cases
>>>> ===
>>>> 1. When large files need to be read and system resources are scarce,
>>>>    especially when memory is limited (GB level). In this scenario,
>>>>    using direct I/O for file reading can even bring performance
>>>>    improvements. (may need patch2-3)
>>>> 2. For embedded devices with limited RAM, direct I/O saves system
>>>>    resources and avoids unnecessary data copying. So even if
>>>>    performance is lower when reading small files, it can still be
>>>>    used effectively.
>>>> 3. If there is sufficient memory, pinning the page cache of the model
>>>>    files in memory and placing the files on an EROFS file system for
>>>>    read-only access may be better. (EROFS does not support direct I/O)
>>>>
>>>>
>>>> Changelog
>>>> ===
>>>> v1 [8]
>>>> v1->v2:
>>>> Use the heap flag method for alloc-and-read instead of adding a new
>>>> DMA-buf ioctl command. [9]
>>>> Split the patchset to facilitate review and testing:
>>>>   patch 1 implements alloc-and-read and adds the heap flag.
>>>>   patch 2-4 add async read.
>>>>   patch 5 makes the gather limit configurable.
>>>>
>>>> Reference
>>>> ===
>>>> [1]
>>>> https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
>>>> [2]
>>>> https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
>>>> [3]
>>>> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>>>> [4]
>>>> https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
>>>> [5]
>>>> https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
>>>> [6]
>>>> https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
>>>> [7]
>>>> https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230209102954.528942-7-dhowells@redhat.com/
>>>> [8]
>>>> https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
>>>> [9]
>>>> https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
>>>>
>>>> Huan Yang (5):
>>>> dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>> dma-buf: heaps: Introduce async alloc read ops
>>>> dma-buf: heaps: support alloc async read file
>>>> dma-buf: heaps: system_heap alloc support async read
>>>> dma-buf: heaps: configurable async read gather limit
>>>>
>>>> drivers/dma-buf/dma-heap.c | 552 +++++++++++++++++++++++++++-
>>>> drivers/dma-buf/heaps/system_heap.c | 70 +++-
>>>> include/linux/dma-heap.h | 53 ++-
>>>> include/uapi/linux/dma-heap.h | 11 +-
>>>> 4 files changed, 673 insertions(+), 13 deletions(-)
>>>>
>>>>
>>>> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 13:11 ` Christian König
@ 2024-07-31 1:48 ` Huan Yang
0 siblings, 0 replies; 26+ messages in thread
From: Huan Yang @ 2024-07-31 1:48 UTC (permalink / raw)
To: Christian König, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, T.J. Mercier, linux-media, dri-devel,
linaro-mm-sig, linux-kernel
Cc: opensource.kernel
On 2024/7/30 21:11, Christian König wrote:
> On 30.07.24 13:36, Huan Yang wrote:
>>>>> Either drop the whole approach or change udmabuf to do what you
>>>>> want to do.
>>>> OK, if so, do I need to send a patch to make dma-buf support sendfile?
>>>
>>> Well the udmabuf approach doesn't need to use sendfile, so no.
>>
>> Get it, I'll not send again.
>>
>> About udmabuf, my testing found that it can't support large file
>> reads due to the page array allocation.
>>
>> I already uploaded a patch for this, but have not received an answer.
>>
>> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>>
>>
>> Is there anything wrong with my understanding of it?
>
> No, that patch was totally fine. Not getting a response is usually
> something good.
>
> In other words, when maintainers see something which won't work at all
> they react immediately, but when nobody complains it usually means you
> are on the right track.
Thank you for your answer.
>
> As long as nobody has any good arguments against it I'm happy to take
> that one upstream through drm-misc-next immediately since it's clearly
> a stand-alone improvement on its own.
OK, good to know.
Thank you
>
> Regards,
> Christian.
>
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>>>
>>>>> Apart from that I don't see a doable way which can be accepted
>>>>> into the kernel.
>>>> Thanks for your suggestion.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>> Patch 1 implement it.
>>>>>>>>
>>>>>>>> Patch 2-5 provides an approach for performance improvement.
>>>>>>>>
>>>>>>>> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
>>>>>>>> synchronously read files using direct I/O.
>>>>>>>>
>>>>>>>> This approach helps to save CPU copying and avoid a certain
>>>>>>>> degree of
>>>>>>>> memory thrashing (page cache generation and reclamation)
>>>>>>>>
>>>>>>>> When dealing with large file sizes, the benefits of this
>>>>>>>> approach become
>>>>>>>> particularly significant.
>>>>>>>>
>>>>>>>> However, there are currently some methods that can improve
>>>>>>>> performance,
>>>>>>>> not just save system resources:
>>>>>>>>
>>>>>>>> Due to the large file size, for example, a AI 7B model of
>>>>>>>> around 3.4GB, the
>>>>>>>> time taken to allocate DMA-BUF memory will be relatively long.
>>>>>>>> Waiting
>>>>>>>> for the allocation to complete before reading the file will add
>>>>>>>> to the
>>>>>>>> overall time consumption. Therefore, the total time for DMA-BUF
>>>>>>>> allocation and file read can be calculated using the formula
>>>>>>>> T(total) = T(alloc) + T(I/O)
>>>>>>>>
>>>>>>>> However, if we change our approach, we don't necessarily need
>>>>>>>> to wait
>>>>>>>> for the DMA-BUF allocation to complete before initiating I/O.
>>>>>>>> In fact,
>>>>>>>> during the allocation process, we already hold a portion of the
>>>>>>>> page,
>>>>>>>> which means that waiting for subsequent page allocations to
>>>>>>>> complete
>>>>>>>> before carrying out file reads is actually unfair to the pages
>>>>>>>> that have
>>>>>>>> already been allocated.
>>>>>>>>
>>>>>>>> The allocation of pages is sequential, and the reading of the
>>>>>>>> file is
>>>>>>>> also sequential, with the content and size corresponding to the
>>>>>>>> file.
>>>>>>>> This means that the memory location for each page, which holds the
>>>>>>>> content of a specific position in the file, can be determined
>>>>>>>> at the
>>>>>>>> time of allocation.
>>>>>>>>
>>>>>>>> However, to fully leverage I/O performance, it is best to wait and
>>>>>>>> gather a certain number of pages before initiating batch
>>>>>>>> processing.
>>>>>>>>
>>>>>>>> The default gather size is 128MB, so each gathered batch can be
>>>>>>>> seen as one file-read work item: it maps the gathered pages into
>>>>>>>> the vmalloc area to obtain a contiguous virtual address, which is
>>>>>>>> used as the buffer holding the corresponding part of the file.
>>>>>>>> So, when using direct I/O to read a file, the file content is
>>>>>>>> written directly into the dma-buf's backing memory without any
>>>>>>>> additional copy (compare this to the pipe buffer).
>>>>>>>>
>>>>>>>> Consider the other ways to read into a dma-buf. Reading after
>>>>>>>> mmap'ing the dma-buf requires mapping its pages into the user
>>>>>>>> virtual address space, and the udmabuf memfd needs the same
>>>>>>>> operation. Even with sendfile support, the file copy still needs
>>>>>>>> a buffer that must be set up. So, mapping pages into the vmalloc
>>>>>>>> area does not incur any additional performance overhead compared
>>>>>>>> to the other methods.[6]
>>>>>>>>
>>>>>>>> Certainly, the administrator can also modify the gather size
>>>>>>>> through patch5.
>>>>>>>>
>>>>>>>> The formula for the time taken for system_heap buffer
>>>>>>>> allocation and
>>>>>>>> file reading through async_read is as follows:
>>>>>>>>
>>>>>>>> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
>>>>>>>>
>>>>>>>> Compared to the synchronous read:
>>>>>>>> T(total) = T(alloc) + T(I/O)
>>>>>>>>
>>>>>>>> Whichever of allocation and I/O takes longer dominates the total
>>>>>>>> time; the shorter of the two is hidden behind it.
>>>>>>>>
>>>>>>>> Therefore, the larger the size of the file that needs to be
>>>>>>>> read, the
>>>>>>>> greater the corresponding benefits will be.
>>>>>>>>
>>>>>>>> How to use
>>>>>>>> ===
>>>>>>>> Consider the current pathway for loading model files into DMA-BUF:
>>>>>>>> 1. open dma-heap, get heap fd
>>>>>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>>>>> 4. mmap dma-buf fd, get vaddr
>>>>>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>>>>> 6. share, attach, whatever you want
>>>>>>>>
>>>>>>>> Using DMA_HEAP_ALLOC_AND_READ_FILE requires just a small change:
>>>>>>>> 1. open dma-heap, get heap fd
>>>>>>>> 2. open file, get file_fd(buffer/direct)
>>>>>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>>>>> flag, set file_fd
>>>>>>>> instead of len. get dma-buf fd(contains file content)
>>>>>>>> 4. share, attach, whatever you want
>>>>>>>>
>>>>>>>> So it is easy to test.
>>>>>>>>
>>>>>>>> How to test
>>>>>>>> ===
>>>>>>>> The performance comparison will be conducted for the following
>>>>>>>> scenarios:
>>>>>>>> 1. normal
>>>>>>>> 2. udmabuf with [3] patch
>>>>>>>> 3. sendfile
>>>>>>>> 4. only patch 1
>>>>>>>> 5. patch1 - patch4.
>>>>>>>>
>>>>>>>> normal:
>>>>>>>> 1. open dma-heap, get heap fd
>>>>>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>>>>> 4. mmap dma-buf fd, get vaddr
>>>>>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>>>>> 6. share, attach, whatever you want
>>>>>>>>
>>>>>>>> UDMA-BUF step:
>>>>>>>> 1. memfd_create
>>>>>>>> 2. open file(buffer/direct)
>>>>>>>> 3. udmabuf create
>>>>>>>> 4. mmap memfd
>>>>>>>> 5. read file into memfd vaddr
>>>>>>>>
>>>>>>>> Sendfile steps (needs suitable splice_write/write_iter; used
>>>>>>>> only for comparison):
>>>>>>>> 1. open dma-heap, get heap fd
>>>>>>>> 2. open file, get file_fd(buffer/direct)
>>>>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>>>>> 4. sendfile file_fd to dma-buf fd
>>>>>>>> 6. share, attach, whatever you want
>>>>>>>>
>>>>>>>> patch1/patch1-4:
>>>>>>>> 1. open dma-heap, get heap fd
>>>>>>>> 2. open file, get file_fd(buffer/direct)
>>>>>>>> 3. allocate the dma-buf with the DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>>>>> flag, passing file_fd instead of a length; the returned dma-buf
>>>>>>>> fd already contains the file content
>>>>>>>> 4. share, attach, whatever you want
>>>>>>>>
>>>>>>>> You can create a file to test it and compare the performance of
>>>>>>>> the different schemes. It is best to compare file sizes ranging
>>>>>>>> from KB to MB to GB.
>>>>>>>>
>>>>>>>> The following test data will compare the performance
>>>>>>>> differences between 512KB,
>>>>>>>> 8MB, 1GB, and 3GB under various scenarios.
>>>>>>>>
>>>>>>>> Performance Test
>>>>>>>> ===
>>>>>>>> 12G RAM phone
>>>>>>>> UFS 4.0 (maximum speed 4GB/s)
>>>>>>>> f2fs
>>>>>>>> kernel 6.1 with patch [7] (without it, kvec direct I/O reads are
>>>>>>>> not supported)
>>>>>>>> no memory pressure.
>>>>>>>> drop_caches is used before each test.
>>>>>>>>
>>>>>>>> The average of 5 test results:
>>>>>>>> | scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns)       | 3GB(ns)       |
>>>>>>>> | ------------------- | ---------- | ---------- | ------------- | ------------- |
>>>>>>>> | normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
>>>>>>>> | udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000   | 2,108,419,923 |
>>>>>>>> | sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
>>>>>>>> | patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800   | 2,187,570,861 |
>>>>>>>> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
>>>>>>>> | patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 5,648,897,554 |
>>>>>>>> | udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784   | 2,158,305,738 |
>>>>>>>> | patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538   | 1,400,006,107 |
>>>>>>
>>>>>> In this test, sendfile does not help much, since it goes through
>>>>>> the pipe buffer.
>>>>>>
>>>>>> udmabuf is good, but I think our OEM driver can't use it. (And AOSP
>>>>>> does not enable this feature.)
>>>>>>
>>>>>>
>>>>>> Anyway, I am sending this patchset in the hope of further
>>>>>> discussion.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>>>
>>>>>>>> So, based on the test results:
>>>>>>>>
>>>>>>>> When the file is large, the patchset has the highest performance:
>>>>>>>> compared to normal, the full patchset is a ~50% improvement, while
>>>>>>>> patch1 alone shows a degradation of 41%.
>>>>>>>> A typical performance breakdown for patch1 is as follows:
>>>>>>>> 1. alloc cost 188,802,693 ns
>>>>>>>> 2. vmap cost 42,491,385 ns
>>>>>>>> 3. file read cost 4,180,876,702 ns
>>>>>>>> Therefore, a single direct I/O read of a large file may not be the
>>>>>>>> optimal approach for performance.
>>>>>>>>
>>>>>>>> The direct I/O implementation based on sendfile performs worst.
>>>>>>>>
>>>>>>>> When the file size is small, the difference in performance is not
>>>>>>>> significant. This is consistent with expectations.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Suggested use cases
>>>>>>>> ===
>>>>>>>> 1. When there is a need to read large files and system resources
>>>>>>>>    are scarce, especially when memory is limited (GB-level files).
>>>>>>>>    In this scenario, using direct I/O for the file read can even
>>>>>>>>    bring performance improvements. (may need patch 2-3)
>>>>>>>> 2. For embedded devices with limited RAM, direct I/O can save
>>>>>>>>    system resources and avoid unnecessary data copying. So even
>>>>>>>>    if performance is lower when reading small files, it can still
>>>>>>>>    be used effectively.
>>>>>>>> 3. If there is sufficient memory, pinning the page cache of the
>>>>>>>>    model files in memory and placing the files in an EROFS file
>>>>>>>>    system for read-only access may be better. (EROFS does not
>>>>>>>>    support direct I/O)
>>>>>>>>
>>>>>>>>
>>>>>>>> Changelog
>>>>>>>> ===
>>>>>>>> v1 [8]
>>>>>>>> v1->v2:
>>>>>>>>    Use a heap flag for alloc-and-read instead of adding a new
>>>>>>>>    dma-buf ioctl command. [9]
>>>>>>>>    Split the patchset to facilitate review and testing:
>>>>>>>>    patch 1 implements alloc-and-read and adds the heap flag.
>>>>>>>>    patch 2-4 add async read.
>>>>>>>>    patch 5 makes the gather limit configurable.
>>>>>>>>
>>>>>>>> Reference
>>>>>>>> ===
>>>>>>>> [1]
>>>>>>>> https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
>>>>>>>> [2]
>>>>>>>> https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
>>>>>>>> [3]
>>>>>>>> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>>>>>>>> [4]
>>>>>>>> https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
>>>>>>>> [5]
>>>>>>>> https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
>>>>>>>> [6]
>>>>>>>> https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
>>>>>>>> [7]
>>>>>>>> https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230209102954.528942-7-dhowells@redhat.com/
>>>>>>>> [8]
>>>>>>>> https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
>>>>>>>> [9]
>>>>>>>> https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
>>>>>>>>
>>>>>>>> Huan Yang (5):
>>>>>>>> dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>>>>> flag
>>>>>>>> dma-buf: heaps: Introduce async alloc read ops
>>>>>>>> dma-buf: heaps: support alloc async read file
>>>>>>>> dma-buf: heaps: system_heap alloc support async read
>>>>>>>> dma-buf: heaps: configurable async read gather limit
>>>>>>>>
>>>>>>>> drivers/dma-buf/dma-heap.c | 552
>>>>>>>> +++++++++++++++++++++++++++-
>>>>>>>> drivers/dma-buf/heaps/system_heap.c | 70 +++-
>>>>>>>> include/linux/dma-heap.h | 53 ++-
>>>>>>>> include/uapi/linux/dma-heap.h | 11 +-
>>>>>>>> 4 files changed, 673 insertions(+), 13 deletions(-)
>>>>>>>>
>>>>>>>>
>>>>>>>> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
>>>>>>>
>>>>>
>>>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 1/5] dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 7:57 ` [PATCH v2 1/5] dma-buf: heaps: " Huan Yang
@ 2024-07-31 11:08 ` kernel test robot
0 siblings, 0 replies; 26+ messages in thread
From: kernel test robot @ 2024-07-31 11:08 UTC (permalink / raw)
To: Huan Yang, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Christian König, linux-media,
dri-devel, linaro-mm-sig, linux-kernel
Cc: oe-kbuild-all, opensource.kernel, Huan Yang
Hi Huan,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 931a3b3bccc96e7708c82b30b2b5fa82dfd04890]
url: https://github.com/intel-lab-lkp/linux/commits/Huan-Yang/dma-buf-heaps-Introduce-DMA_HEAP_ALLOC_AND_READ_FILE-heap-flag/20240730-170340
base: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
patch link: https://lore.kernel.org/r/20240730075755.10941-2-link%40vivo.com
patch subject: [PATCH v2 1/5] dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
config: xtensa-allyesconfig (https://download.01.org/0day-ci/archive/20240731/202407311822.ZneNMq5I-lkp@intel.com/config)
compiler: xtensa-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240731/202407311822.ZneNMq5I-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407311822.ZneNMq5I-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/dma-buf/dma-heap.c:44: warning: Function parameter or struct member 'priv' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:44: warning: Function parameter or struct member 'heap_devt' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:44: warning: Function parameter or struct member 'list' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:44: warning: Function parameter or struct member 'heap_cdev' not described in 'dma_heap'
>> drivers/dma-buf/dma-heap.c:104: warning: expecting prototype for Trigger sync file read, read into dma(). Prototype was for dma_heap_read_file_sync() instead
vim +104 drivers/dma-buf/dma-heap.c
86
87 /**
88 * Trigger sync file read, read into dma-buf.
89 *
90 * @dmabuf: which we done alloced and export.
91 * @heap_file: file info wrapper to read from.
92 *
93 * Whether to use buffer I/O or direct I/O depends on the mode when the
94 * file is opened.
95 * Remember, if use direct I/O, file must be page aligned.
96 * Since the buffer used for file reading is provided by dma-buf, when
97 * using direct I/O, the file content will be directly filled into
98 * dma-buf without the need for additional CPU copying.
99 *
100 * 0 on success, negative if anything wrong.
101 */
102 static int dma_heap_read_file_sync(struct dma_buf *dmabuf,
103 struct dma_heap_file *heap_file)
> 104 {
105 struct iosys_map map;
106 ssize_t bytes;
107 int ret;
108
109 ret = dma_buf_vmap(dmabuf, &map);
110 if (ret)
111 return ret;
112
113 /**
114 * The kernel_read_file function can handle file reading effectively,
115 * and if the return value does not match the file size,
116 * then it indicates an error.
117 */
118 bytes = kernel_read_file(heap_file->file, 0, &map.vaddr, dmabuf->size,
119 &heap_file->fsize, READING_POLICY);
120 if (bytes != heap_file->fsize)
121 ret = -EIO;
122
123 dma_buf_vunmap(dmabuf, &map);
124
125 return ret;
126 }
127
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v2 3/5] dma-buf: heaps: support alloc async read file
2024-07-30 7:57 ` [PATCH v2 3/5] dma-buf: heaps: support alloc async read file Huan Yang
@ 2024-07-31 14:44 ` kernel test robot
0 siblings, 0 replies; 26+ messages in thread
From: kernel test robot @ 2024-07-31 14:44 UTC (permalink / raw)
To: Huan Yang, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, T.J. Mercier, Christian König, linux-media,
dri-devel, linaro-mm-sig, linux-kernel
Cc: oe-kbuild-all, opensource.kernel, Huan Yang
Hi Huan,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 931a3b3bccc96e7708c82b30b2b5fa82dfd04890]
url: https://github.com/intel-lab-lkp/linux/commits/Huan-Yang/dma-buf-heaps-Introduce-DMA_HEAP_ALLOC_AND_READ_FILE-heap-flag/20240730-170340
base: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
patch link: https://lore.kernel.org/r/20240730075755.10941-4-link%40vivo.com
patch subject: [PATCH v2 3/5] dma-buf: heaps: support alloc async read file
config: xtensa-allyesconfig (https://download.01.org/0day-ci/archive/20240731/202407312202.LhLTLEhX-lkp@intel.com/config)
compiler: xtensa-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240731/202407312202.LhLTLEhX-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407312202.LhLTLEhX-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'priv' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'heap_devt' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'list' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'heap_cdev' not described in 'dma_heap'
>> drivers/dma-buf/dma-heap.c:158: warning: Function parameter or struct member 'lock' not described in 'dma_heap_file_control'
drivers/dma-buf/dma-heap.c:482: warning: expecting prototype for Trigger sync file read, read into dma(). Prototype was for dma_heap_read_file_sync() instead
vim +158 drivers/dma-buf/dma-heap.c
133
134 /**
135 * struct dma_heap_file_control - global control of dma_heap file read.
136 * @works: @dma_heap_file_work's list head.
137 *
138 * @threadwq: wait queue for @work_thread, if commit work, @work_thread
139 * wakeup and read this work's file contains.
140 *
141 * @workwq: used for main thread wait for file read end, if allocation
142 * end before file read. @dma_heap_file_task ref effect this.
143 *
144 * @work_thread: file read kthread. the dma_heap_file_task work's consumer.
145 *
146 * @heap_fwork_cachep: @dma_heap_file_work's cachep, it's alloc/free frequently.
147 *
148 * @nr_work: global number of how many work committed.
149 */
150 struct dma_heap_file_control {
151 struct list_head works;
152 spinlock_t lock; // only lock for @works.
153 wait_queue_head_t threadwq;
154 wait_queue_head_t workwq;
155 struct task_struct *work_thread;
156 struct kmem_cache *heap_fwork_cachep;
157 atomic_t nr_work;
> 158 };
159
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-30 12:04 ` Huan Yang
@ 2024-07-31 20:46 ` Daniel Vetter
2024-08-01 2:53 ` Huan Yang
0 siblings, 1 reply; 26+ messages in thread
From: Daniel Vetter @ 2024-07-31 20:46 UTC (permalink / raw)
To: Huan Yang
Cc: Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Christian König, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, opensource.kernel
On Tue, Jul 30, 2024 at 08:04:04PM +0800, Huan Yang wrote:
>
> 在 2024/7/30 17:05, Huan Yang 写道:
> >
> > 在 2024/7/30 16:56, Daniel Vetter 写道:
> > >
> > > On Tue, Jul 30, 2024 at 03:57:44PM +0800, Huan Yang wrote:
> > > > UDMA-BUF step:
> > > > 1. memfd_create
> > > > 2. open file(buffer/direct)
> > > > 3. udmabuf create
> > > > 4. mmap memfd
> > > > 5. read file into memfd vaddr
> > > Yeah this is really slow and the worst way to do it. You absolutely want
> > > to start _all_ the io before you start creating the dma-buf, ideally
> > > with
> > > everything running in parallel. But just starting the direct I/O with
> > > async and then creating the umdabuf should be a lot faster and avoid
> > That's great. Let me rephrase that, and please correct me if I'm wrong.
> >
> > UDMA-BUF step:
> > 1. memfd_create
> > 2. mmap memfd
> > 3. open file(buffer/direct)
> > 4. start thread to async read
> > 3. udmabuf create
> >
> > With this, performance can improve.
>
> I just test with it. Step is:
>
> UDMA-BUF step:
> 1. memfd_create
> 2. mmap memfd
> 3. open file(buffer/direct)
> 4. start thread to async read
> 5. udmabuf create
>
> 6. join wait
>
> Reading the 3G file, all steps cost 1,527,103,431 ns. That's great.
Ok that's almost the throughput of your patch set, which I think is close
enough. The remaining difference is probably just the mmap overhead, not
sure whether/how we can do direct i/o to an fd directly ... in principle
it's possible for any file that uses the standard pagecache.
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-07-31 20:46 ` Daniel Vetter
@ 2024-08-01 2:53 ` Huan Yang
2024-08-05 17:53 ` Daniel Vetter
0 siblings, 1 reply; 26+ messages in thread
From: Huan Yang @ 2024-08-01 2:53 UTC (permalink / raw)
To: Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Christian König, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, opensource.kernel
在 2024/8/1 4:46, Daniel Vetter 写道:
> On Tue, Jul 30, 2024 at 08:04:04PM +0800, Huan Yang wrote:
>> 在 2024/7/30 17:05, Huan Yang 写道:
>>> 在 2024/7/30 16:56, Daniel Vetter 写道:
>>>>
>>>> On Tue, Jul 30, 2024 at 03:57:44PM +0800, Huan Yang wrote:
>>>>> UDMA-BUF step:
>>>>> 1. memfd_create
>>>>> 2. open file(buffer/direct)
>>>>> 3. udmabuf create
>>>>> 4. mmap memfd
>>>>> 5. read file into memfd vaddr
>>>> Yeah this is really slow and the worst way to do it. You absolutely want
>>>> to start _all_ the io before you start creating the dma-buf, ideally
>>>> with
>>>> everything running in parallel. But just starting the direct I/O with
>>>> async and then creating the umdabuf should be a lot faster and avoid
>>> That's great. Let me rephrase that, and please correct me if I'm wrong.
>>>
>>> UDMA-BUF step:
>>> 1. memfd_create
>>> 2. mmap memfd
>>> 3. open file(buffer/direct)
>>> 4. start thread to async read
>>> 3. udmabuf create
>>>
>>> With this, performance can improve.
>> I just test with it. Step is:
>>
>> UDMA-BUF step:
>> 1. memfd_create
>> 2. mmap memfd
>> 3. open file(buffer/direct)
>> 4. start thread to async read
>> 5. udmabuf create
>>
>> 6. join wait
>>
>> Reading the 3G file, all steps cost 1,527,103,431 ns. That's great.
> Ok that's almost the throughput of your patch set, which I think is close
> enough. The remaining difference is probably just the mmap overhead, not
> sure whether/how we can do direct i/o to an fd directly ... in principle
> it's possible for any file that uses the standard pagecache.
Yes. For mmap, IMO: we now get all the folios and pin them, which means
all pfns are already known when the udmabuf is created.
So I think faulting the mmap in lazily does not help save memory; it only
increases the mmap access cost. (maybe it can save a little page-table
memory)
I want to send a patchset that removes it, makes the code more suitable
for folio operations (and removes the unpin list), and includes some
fixes.
I'll send it once it tests well.
About an fd-based direct I/O operation, maybe use sendfile or
copy_file_range?
sendfile is based on the pipe buffer; its performance was low when I
tested it.
copy_file_range can't work because the two fds are not on the same file
system.
So I can't find another way to do it. Can someone give some suggestions?
> -Sima
* Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
2024-08-01 2:53 ` Huan Yang
@ 2024-08-05 17:53 ` Daniel Vetter
0 siblings, 0 replies; 26+ messages in thread
From: Daniel Vetter @ 2024-08-05 17:53 UTC (permalink / raw)
To: Huan Yang
Cc: Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T.J. Mercier, Christian König, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, opensource.kernel
On Thu, Aug 01, 2024 at 10:53:45AM +0800, Huan Yang wrote:
>
> 在 2024/8/1 4:46, Daniel Vetter 写道:
> > On Tue, Jul 30, 2024 at 08:04:04PM +0800, Huan Yang wrote:
> > > 在 2024/7/30 17:05, Huan Yang 写道:
> > > > 在 2024/7/30 16:56, Daniel Vetter 写道:
> > > > >
> > > > > On Tue, Jul 30, 2024 at 03:57:44PM +0800, Huan Yang wrote:
> > > > > > UDMA-BUF step:
> > > > > > 1. memfd_create
> > > > > > 2. open file(buffer/direct)
> > > > > > 3. udmabuf create
> > > > > > 4. mmap memfd
> > > > > > 5. read file into memfd vaddr
> > > > > Yeah this is really slow and the worst way to do it. You absolutely want
> > > > > to start _all_ the io before you start creating the dma-buf, ideally
> > > > > with
> > > > > everything running in parallel. But just starting the direct I/O with
> > > > > async and then creating the umdabuf should be a lot faster and avoid
> > > > That's great. Let me rephrase that, and please correct me if I'm wrong.
> > > >
> > > > UDMA-BUF step:
> > > > 1. memfd_create
> > > > 2. mmap memfd
> > > > 3. open file(buffer/direct)
> > > > 4. start thread to async read
> > > > 3. udmabuf create
> > > >
> > > > With this, performance can improve.
> > > I just test with it. Step is:
> > >
> > > UDMA-BUF step:
> > > 1. memfd_create
> > > 2. mmap memfd
> > > 3. open file(buffer/direct)
> > > 4. start thread to async read
> > > 5. udmabuf create
> > >
> > > 6. join wait
> > >
> > > Reading the 3G file, all steps cost 1,527,103,431 ns. That's great.
> > Ok that's almost the throughput of your patch set, which I think is close
> > enough. The remaining difference is probably just the mmap overhead, not
> > sure whether/how we can do direct i/o to an fd directly ... in principle
> > it's possible for any file that uses the standard pagecache.
>
> Yes. For mmap, IMO: we now get all the folios and pin them, which
> means all pfns are already known when the udmabuf is created.
>
> So I think faulting the mmap in lazily does not help save memory; it
> only increases the mmap access cost. (maybe it can save a little
> page-table memory)
>
> I want to send a patchset that removes it, makes the code more
> suitable for folio operations (and removes the unpin list), and
> includes some fixes.
>
> I'll send it once it tests well.
>
> About an fd-based direct I/O operation, maybe use sendfile or
> copy_file_range?
>
> sendfile is based on the pipe buffer; its performance was low when I
> tested it.
>
> copy_file_range can't work because the two fds are not on the same
> file system.
>
> So I can't find another way to do it. Can someone give some
> suggestions?
Yeah direct I/O to pagecache without an mmap might be too niche to be
supported. Maybe io_uring has something, but I guess as unlikely as
anything else.
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2024-08-05 17:53 UTC | newest]
Thread overview: 26+ messages
2024-07-30 7:57 [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag Huan Yang
2024-07-30 7:57 ` [PATCH v2 1/5] dma-buf: heaps: " Huan Yang
2024-07-31 11:08 ` kernel test robot
2024-07-30 7:57 ` [PATCH v2 2/5] dma-buf: heaps: Introduce async alloc read ops Huan Yang
2024-07-30 7:57 ` [PATCH v2 3/5] dma-buf: heaps: support alloc async read file Huan Yang
2024-07-31 14:44 ` kernel test robot
2024-07-30 7:57 ` [PATCH v2 4/5] dma-buf: heaps: system_heap alloc support async read Huan Yang
2024-07-30 7:57 ` [PATCH v2 5/5] dma-buf: heaps: configurable async read gather limit Huan Yang
2024-07-30 8:03 ` [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag Christian König
2024-07-30 8:14 ` Huan Yang
2024-07-30 8:37 ` Christian König
2024-07-30 8:46 ` Huan Yang
2024-07-30 10:43 ` Christian König
2024-07-30 11:36 ` Huan Yang
2024-07-30 13:11 ` Christian König
2024-07-31 1:48 ` Huan Yang
2024-07-30 17:19 ` T.J. Mercier
2024-07-31 1:47 ` Huan Yang
2024-07-30 8:56 ` Daniel Vetter
2024-07-30 9:05 ` Huan Yang
2024-07-30 10:42 ` Christian König
2024-07-30 11:33 ` Huan Yang
2024-07-30 12:04 ` Huan Yang
2024-07-31 20:46 ` Daniel Vetter
2024-08-01 2:53 ` Huan Yang
2024-08-05 17:53 ` Daniel Vetter