From: wangtao <tao.wangtao@honor.com>
To: "T.J. Mercier" <tjmercier@google.com>,
	"Christian König" <christian.koenig@amd.com>
Cc: "sumit.semwal@linaro.org" <sumit.semwal@linaro.org>,
	"benjamin.gaignard@collabora.com"
	<benjamin.gaignard@collabora.com>,
	"Brian.Starkey@arm.com" <Brian.Starkey@arm.com>,
	"jstultz@google.com" <jstultz@google.com>,
	"linux-media@vger.kernel.org" <linux-media@vger.kernel.org>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"linaro-mm-sig@lists.linaro.org" <linaro-mm-sig@lists.linaro.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"wangbintian(BintianWang)" <bintian.wang@honor.com>,
	yipengxiang <yipengxiang@honor.com>,
	liulu 00013167 <liulu.liu@honor.com>,
	hanfeng 00012985 <feng.han@honor.com>
Subject: RE: [PATCH 2/2] dmabuf/heaps: implement DMA_BUF_IOCTL_RW_FILE for system_heap
Date: Mon, 19 May 2025 04:37:17 +0000
Message-ID: <64aa801ccf4e4e74b8d699a9330ecb2a@honor.com>
In-Reply-To: <CABdmKX30c_5N34FYMre6Qx5LLLWicsi_XdUdu0QtsOmQ=RcYxQ@mail.gmail.com>



> -----Original Message-----
> From: T.J. Mercier <tjmercier@google.com>
> Sent: Saturday, May 17, 2025 2:37 AM
> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement
> DMA_BUF_IOCTL_RW_FILE for system_heap
> 
> On Fri, May 16, 2025 at 1:36 AM Christian König <christian.koenig@amd.com>
> wrote:
> >
> > On 5/16/25 09:40, wangtao wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: Christian König <christian.koenig@amd.com>
> > >> Sent: Thursday, May 15, 2025 10:26 PM
> > >> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement
> > >> DMA_BUF_IOCTL_RW_FILE for system_heap
> > >>
> > >> On 5/15/25 16:03, wangtao wrote:
> > >>> [wangtao] My Test Configuration (CPU 1GHz, 5-test average):
> > >>> Allocation: 32x32MB buffer creation
> > >>> - dmabuf 53ms vs. udmabuf 694ms (10X slower)
> > >>> - Note: shmem shows excessive allocation time
> > >>
> > >> Yeah, that is something already noted by others as well. But that
> > >> is orthogonal.
> > >>
> > >>>
> > >>> Read 1024MB File:
> > >>> - dmabuf direct 326ms vs. udmabuf direct 461ms (40% slower)
> > >>> - Note: pin_user_pages_fast consumes majority CPU cycles
> > >>>
> > >>> Key function call timing: See details below.
> > >>
> > >> Those aren't valid, you are comparing different functionalities here.
> > >>
> > >> Please try using udmabuf with sendfile() as confirmed to be working by T.J.
> > > [wangtao] Using buffer IO with dmabuf file read/write requires one memory copy.
> > > Direct IO removes this copy to enable zero-copy. The sendfile system
> > > call reduces memory copies from two (read/write) to one. However,
> > > with udmabuf, sendfile still keeps at least one copy, failing zero-copy.
> >
> >
> > Then please work on fixing this.
> >
> > Regards,
> > Christian.
> >
> >
> > >
> > > If udmabuf sendfile uses buffer IO (file page cache), read latency
> > > matches dmabuf buffer read, but allocation time is much longer.
> > > With Direct IO, the default 16-page pipe size makes it slower than buffer IO.
> > >
> > > Test data shows:
> > > udmabuf direct read is much faster than udmabuf sendfile.
> > > dmabuf direct read outperforms udmabuf direct read by a large margin.
> > >
> > > Issue: After udmabuf is mapped via map_dma_buf, apps using memfd or
> > > udmabuf for Direct IO might cause errors, but there are no
> > > safeguards to prevent this.
> > >
> > > Allocate 32x32MB buffer and read 1024 MB file Test:
> > > Metric                 | alloc (ms) | read (ms) | total (ms)
> > > -----------------------|------------|-----------|-----------
> > > udmabuf buffer read    | 539        | 2017      | 2555
> > > udmabuf direct read    | 522        | 658       | 1179
> 
> I can't reproduce the part where udmabuf direct reads are faster than
> buffered reads. That's the opposite of what I'd expect. Something seems
> wrong with those buffered reads.
> 
[wangtao] A buffered read requires an extra CPU memory copy. Our device's low
CPU performance makes that copy expensive, which leads to longer latency. On
high-performance 3.5GHz CPUs, buffered read shows better ratios but still lags
behind direct I/O.

The tests used single-threaded programs with 32MB readahead to minimize
latency (embedded mobile devices usually use <= 2MB).

Test results (time in ms; RD = read(), SF = sendfile()):
|                   |     little core @1GHz     |      big core @3.5GHz     |
|                   | alloc             | read  | alloc             | read  |
|-------------------|-------------------|-------|-------------------|-------|
| udmabuf buffer RD | 543               | 2078  | 135               | 549   |
| udmabuf direct RD | 543               | 640   | 163               | 291   |
| udmabuf buffer SF | 494               | 1058  | 137               | 315   |
| udmabuf direct SF | 529               | 2335  | 143               | 909   |
| dmabuf buffer  RD | 39                | 1077  | 23                | 349   |
| patch direct RD   | 51                | 306   | 30                | 267   |
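
To make the read column easier to compare, here is a rough sketch that
converts it to throughput. It assumes the same 1024 MB file as in the earlier
test; the dictionary keys and the helper name throughput_mb_s are only
illustrative and not part of the test program.

```python
# Read times in ms, copied from the table above: (little core, big core).
READ_MS = {
    "udmabuf buffer RD": (2078, 549),
    "udmabuf direct RD": (640, 291),
    "udmabuf buffer SF": (1058, 315),
    "udmabuf direct SF": (2335, 909),
    "dmabuf buffer RD": (1077, 349),
    "patch direct RD": (306, 267),
}

FILE_MB = 1024  # assumption: same 1024 MB file as in the earlier test


def throughput_mb_s(read_ms, file_mb=FILE_MB):
    """Convert a read time in milliseconds to MB/s."""
    return file_mb * 1000.0 / read_ms


def speedup(baseline, candidate, core=0):
    """How many times faster `candidate` is than `baseline` (core 0 = little, 1 = big)."""
    return READ_MS[baseline][core] / READ_MS[candidate][core]


if __name__ == "__main__":
    for name, (little, big) in READ_MS.items():
        print(f"{name:18s} {throughput_mb_s(little):8.0f} MB/s {throughput_mb_s(big):8.0f} MB/s")
```

For example, speedup("udmabuf buffer RD", "patch direct RD") gives the
little-core ratio between the slowest read path and the patch.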

> > > udmabuf buffer sendfile| 505        | 1040      | 1546
> > > udmabuf direct sendfile| 510        | 2269      | 2780
> 
> I can reproduce the 3.5x slower udmabuf direct sendfile compared to
> udmabuf direct read. It's a pretty disappointing result, so it seems like
> something could be improved there.
> 
> 1G from ext4 on 6.12.17 | read/sendfile (ms)
> ------------------------|-------------------
> udmabuf buffer read     | 351
> udmabuf direct read     | 540
> udmabuf buffer sendfile | 255
> udmabuf direct sendfile | 1990
> 
[wangtao] Key observations:
1. Direct sendfile underperforms because the pipe buffers and the memory
   file's pages are small, so it needs more DMA operations.
2. ext4 vs. f2fs: ext4 supports hugepages/larger folios, while f2fs does not.
   Mobile devices mostly use f2fs, which affects performance.
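
As a rough illustration of observation 1, the sketch below counts how many
pipe fills a transfer needs. It takes the default 16-page pipe size mentioned
earlier in the thread and assumes 4 KiB pages (an assumption; kernels with
larger pages would differ) and that every pipe fill is a separate transfer
round. pipe_rounds is a hypothetical helper, not a kernel function.

```python
def pipe_rounds(total_bytes, pipe_pages=16, page_size=4096):
    """Number of pipe fills needed to move total_bytes through a pipe.

    pipe_pages=16 is the default pipe size quoted in the thread;
    page_size=4096 is an assumption for illustration.
    """
    chunk = pipe_pages * page_size          # bytes moved per pipe fill
    return -(-total_bytes // chunk)         # ceiling division


if __name__ == "__main__":
    MiB = 1 << 20
    print("one 32 MiB buffer:", pipe_rounds(32 * MiB), "rounds")
    print("1024 MiB file:    ", pipe_rounds(1024 * MiB), "rounds")
```

Under these assumptions each round moves only 64 KiB, so a single 32 MiB
buffer already takes 512 rounds.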

I/O path comparison:
- Buffer read: [DISK] → DMA → [page cache] → CPU copy → [memory file]
- Direct read: [DISK] → DMA → [memory file]
- Buffer sendfile: [DISK] → DMA → [page cache] → CPU copy → [memory file]
- Direct sendfile: [DISK] → DMA → [pipe buffer] → CPU copy → [memory file]

The extra CPU copy and pipe limitations explain the performance gap.

> 
> > > dmabuf buffer read     | 51         | 1068      | 1118
> > > dmabuf direct read     | 52         | 297       | 349
> > >
> > > udmabuf sendfile test steps:
> > > 1. Open data file (1024MB), get back_fd
> > > 2. Create memfd (32MB) # Loop steps 2-6
> > > 3. Allocate udmabuf with memfd
> > > 4. Call sendfile(memfd, back_fd)
> > > 5. Close memfd after sendfile
> > > 6. Close udmabuf
> > > 7. Close back_fd
> > >
> > >>
> > >> Regards,
> > >> Christian.
> > >
> >

