* Direct I/O performance problems with 1GB pages
From: Matthew Wilcox @ 2025-01-26 0:46 UTC (permalink / raw)
To: linux-mm; +Cc: linux-block, Muchun Song, Jane Chu
Postgres are experimenting with doing direct I/O to 1GB hugetlb pages.
Andres has gathered some performance data showing significantly worse
performance with 1GB pages compared to 2MB pages. I sent a patch
recently which improves matters [1], but problems remain.
The primary problem we've identified is contention of folio->_refcount
with a strong secondary contention on folio->_pincount. This is coming
from the call chain:
iov_iter_extract_pages ->
gup_fast_fallback ->
try_grab_folio_fast
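For reference, the expensive part of try_grab_folio_fast() boils down to a
couple of atomic RMWs on the shared folio. A simplified sketch (the function
name here is made up and the real mm/gup.c code differs in detail):

static struct folio *grab_folio_fast_sketch(struct page *page, int refs,
                                            unsigned int flags)
{
        struct folio *folio = page_folio(page);

        /* one atomic RMW on folio->_refcount per GUP-fast call
         * (the real code adds 'refs' references, not just one) */
        if (!folio_try_get(folio))
                return NULL;

        /* for FOLL_PIN on large folios such as 1GB hugetlb, a second
         * atomic RMW on folio->_pincount */
        if (flags & FOLL_PIN)
                atomic_add(refs, &folio->_pincount);

        return folio;
}

With a single 1GB folio backing the whole buffer pool, every CPU doing
direct I/O ends up hammering those same fields.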
Obviously we can fix this by sharding the counts. We could do that by
address, since there's no observed performance problem with 2MB pages.
But I think we'd do better to shard by CPU. We have percpu-refcount.h
already, and I think it'll work.
The key to percpu refcounts is knowing at what point you need to start
caring about whether the refcount has hit zero (we don't care if the
refcount oscillates between 1 and 2, but we very much care about when
we hit 0).
I think the point at which we call percpu_ref_kill() is when we remove a
folio from the page cache. Before that point, the refcount is guaranteed
to always be positive. After that point, once the refcount hits zero,
we must free the folio.
It's pretty rare to remove a hugetlb page from the page cache while it's
still mapped. So we don't need to worry about scalability at that point.
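To make the idea concrete, here is a minimal sketch of the lifecycle using
the existing percpu_ref API. All names below are hypothetical; this is only
meant to show where get/put/kill would sit, not a proposed patch:

#include <linux/percpu-refcount.h>

struct hfolio {                         /* hypothetical wrapper */
        struct percpu_ref ref;          /* would replace folio->_refcount */
};

static void hfolio_free(struct hfolio *h);      /* placeholder */

static void hfolio_release(struct percpu_ref *ref)
{
        /* runs once the count hits zero after percpu_ref_kill() */
        hfolio_free(container_of(ref, struct hfolio, ref));
}

static int hfolio_init(struct hfolio *h)
{
        /* starts in percpu mode: gets/puts only touch per-CPU counters */
        return percpu_ref_init(&h->ref, hfolio_release, 0, GFP_KERNEL);
}

/* GUP / DIO paths: no shared cacheline is written while in percpu mode */
static inline void hfolio_get(struct hfolio *h) { percpu_ref_get(&h->ref); }
static inline void hfolio_put(struct hfolio *h) { percpu_ref_put(&h->ref); }

/* removal from the page cache: switch to atomic mode; the release
 * callback fires when the last reference is dropped */
static void hfolio_remove_from_page_cache(struct hfolio *h)
{
        percpu_ref_kill(&h->ref);
}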
Any volunteers to prototype this? Andres is a delight to work with,
but I just don't have time to take on this project right now.
[1] https://lore.kernel.org/linux-block/20250124225104.326613-1-willy@infradead.org/
* Re: Direct I/O performance problems with 1GB pages
From: David Hildenbrand @ 2025-01-27 14:09 UTC (permalink / raw)
To: Matthew Wilcox, linux-mm; +Cc: linux-block, Muchun Song, Jane Chu
On 26.01.25 01:46, Matthew Wilcox wrote:
> Postgres are experimenting with doing direct I/O to 1GB hugetlb pages.
> Andres has gathered some performance data showing significantly worse
> performance with 1GB pages compared to 2MB pages. I sent a patch
> recently which improves matters [1], but problems remain.
>
> The primary problem we've identified is contention of folio->_refcount
> with a strong secondary contention on folio->_pincount. This is coming
> from the call chain:
>
> iov_iter_extract_pages ->
> gup_fast_fallback ->
> try_grab_folio_fast
>
> Obviously we can fix this by sharding the counts. We could do that by
> address, since there's no observed performance problem with 2MB pages.
> But I think we'd do better to shard by CPU. We have percpu-refcount.h
> already, and I think it'll work.
>
> The key to percpu refcounts is knowing at what point you need to start
> caring about whether the refcount has hit zero (we don't care if the
> refcount oscillates between 1 and 2, but we very much care about when
> we hit 0).
>
> I think the point at which we call percpu_ref_kill() is when we remove a
> folio from the page cache. Before that point, the refcount is guaranteed
> to always be positive. After that point, once the refcount hits zero,
> we must free the folio.
>
> It's pretty rare to remove a hugetlb page from the page cache while it's
> still mapped. So we don't need to worry about scalability at that point.
>
> Any volunteers to prototype this? Andres is a delight to work with,
> but I just don't have time to take on this project right now.
Hmmm ... do we really want to make refcounting more complicated, and
more importantly, hugetlb-refcounting more special ?! :)
If the workload is doing a lot of single-page try_grab_folio_fast(), could
it do so on a larger area (multiple pages at once -> single refcount
update)?
Maybe there is a link to the report you could share, thanks.
--
Cheers,
David / dhildenb
* Re: Direct I/O performance problems with 1GB pages
From: Matthew Wilcox @ 2025-01-27 16:02 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-mm, linux-block, Muchun Song, Jane Chu, Andres Freund
[Adding Andres to the cc. Sorry for leaving you off in the initial mail]
On Mon, Jan 27, 2025 at 03:09:23PM +0100, David Hildenbrand wrote:
> On 26.01.25 01:46, Matthew Wilcox wrote:
> > Postgres are experimenting with doing direct I/O to 1GB hugetlb pages.
> > Andres has gathered some performance data showing significantly worse
> > performance with 1GB pages compared to 2MB pages. I sent a patch
> > recently which improves matters [1], but problems remain.
> >
> > The primary problem we've identified is contention of folio->_refcount
> > with a strong secondary contention on folio->_pincount. This is coming
> > from the call chain:
> >
> > iov_iter_extract_pages ->
> > gup_fast_fallback ->
> > try_grab_folio_fast
> >
> > Obviously we can fix this by sharding the counts. We could do that by
> > address, since there's no observed performance problem with 2MB pages.
> > But I think we'd do better to shard by CPU. We have percpu-refcount.h
> > already, and I think it'll work.
> >
> > The key to percpu refcounts is knowing at what point you need to start
> > caring about whether the refcount has hit zero (we don't care if the
> > refcount oscillates between 1 and 2, but we very much care about when
> > we hit 0).
> >
> > I think the point at which we call percpu_ref_kill() is when we remove a
> > folio from the page cache. Before that point, the refcount is guaranteed
> > to always be positive. After that point, once the refcount hits zero,
> > we must free the folio.
> >
> > It's pretty rare to remove a hugetlb page from the page cache while it's
> > still mapped. So we don't need to worry about scalability at that point.
> >
> > Any volunteers to prototype this? Andres is a delight to work with,
> > but I just don't have time to take on this project right now.
>
> Hmmm ... do we really want to make refcounting more complicated, and more
> importantly, hugetlb-refcounting more special ?! :)
No, I really don't. But I've always been mildly concerned about extra
contention on folio locks, folio refcounts, etc. I don't know if 2MB
page performance might be improved by a scheme like this, and we might
even want to cut over for sizes larger than, say, 64kB. That would be
something interesting to investigate.
> If the workload is doing a lot of single-page try_grab_folio_fast(), could it
> do so on a larger area (multiple pages at once -> single refcount update)?
Not really. This is memory that's being used as the buffer cache, so
every thread in your database is hammering on it and pulling in exactly
the data that it needs for the SQL query that it's processing.
> Maybe there is a link to the report you could share, thanks.
Andres shared some gists, but I don't want to send those to a
mailing list without permission. Here's the kernel part of the
perf report:
14.04% postgres [kernel.kallsyms] [k] try_grab_folio_fast
|
--14.04%--try_grab_folio_fast
gup_fast_fallback
|
--13.85%--iov_iter_extract_pages
bio_iov_iter_get_pages
iomap_dio_bio_iter
__iomap_dio_rw
iomap_dio_rw
xfs_file_dio_read
xfs_file_read_iter
__io_read
io_read
io_issue_sqe
io_submit_sqes
__do_sys_io_uring_enter
do_syscall_64
Now, since postgres is using io_uring, perhaps there could be a path
which registers the memory with the iouring (doing the refcount/pincount
dance once), and then use that pinned memory for each I/O. Maybe that
already exists; I'm not keeping up with io_uring development and I can't
seem to find any documentation on what things like io_provide_buffers()
actually do.
* Re: Direct I/O performance problems with 1GB pages
From: David Hildenbrand @ 2025-01-27 16:09 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, linux-block, Muchun Song, Jane Chu, Andres Freund
>
>> If the workload is doing a lot of single-page try_grab_folio_fast(), could it
>> do so on a larger area (multiple pages at once -> single refcount update)?
>
> Not really. This is memory that's being used as the buffer cache, so
> every thread in your database is hammering on it and pulling in exactly
> the data that it needs for the SQL query that it's processing.
Ouch.
>
>> Maybe there is a link to the report you could share, thanks.
>
> Andres shared some gists, but I don't want to send those to a
> mailing list without permission. Here's the kernel part of the
> perf report:
>
> 14.04% postgres [kernel.kallsyms] [k] try_grab_folio_fast
> |
> --14.04%--try_grab_folio_fast
> gup_fast_fallback
> |
> --13.85%--iov_iter_extract_pages
> bio_iov_iter_get_pages
> iomap_dio_bio_iter
> __iomap_dio_rw
> iomap_dio_rw
> xfs_file_dio_read
> xfs_file_read_iter
> __io_read
> io_read
> io_issue_sqe
> io_submit_sqes
> __do_sys_io_uring_enter
> do_syscall_64
>
> Now, since postgres is using io_uring, perhaps there could be a path
> which registers the memory with the iouring (doing the refcount/pincount
> dance once), and then use that pinned memory for each I/O. Maybe that
> already exists; I'm not keeping up with io_uring development and I can't
> seem to find any documentation on what things like io_provide_buffers()
> actually do.
That's precisely what io-uring fixed buffers do :)
--
Cheers,
David / dhildenb
* Re: Direct I/O performance problems with 1GB pages
From: David Hildenbrand @ 2025-01-27 16:20 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, linux-block, Muchun Song, Jane Chu, Andres Freund
On 27.01.25 17:09, David Hildenbrand wrote:
>>
>>> If the workload is doing a lot of single-page try_grab_folio_fast(), could it
>>> do so on a larger area (multiple pages at once -> single refcount update)?
>>
>> Not really. This is memory that's being used as the buffer cache, so
>> every thread in your database is hammering on it and pulling in exactly
>> the data that it needs for the SQL query that it's processing.
>
> Ouch.
>
>>
>>> Maybe there is a link to the report you could share, thanks.
>>
>> Andres shared some gists, but I don't want to send those to a
>> mailing list without permission. Here's the kernel part of the
>> perf report:
>>
>> 14.04% postgres [kernel.kallsyms] [k] try_grab_folio_fast
>> |
>> --14.04%--try_grab_folio_fast
>> gup_fast_fallback
>> |
>> --13.85%--iov_iter_extract_pages
>> bio_iov_iter_get_pages
>> iomap_dio_bio_iter
>> __iomap_dio_rw
>> iomap_dio_rw
>> xfs_file_dio_read
>> xfs_file_read_iter
>> __io_read
>> io_read
>> io_issue_sqe
>> io_submit_sqes
>> __do_sys_io_uring_enter
>> do_syscall_64
BTW, two things that come to mind:
(1) We always fallback to GUP-slow, I wonder why. GUP-fast would go via
try_grab_folio_fast().
(2) During GUP slow, we must take the PT lock of the PUD table. So the
folio refcount/pincount/whatever is actually sync'ed by the ... PT lock
here?
See assert_spin_locked(pud_lockptr(mm, pudp)); in follow_huge_pud().
Note that that PUD table lock is likely a per-MM lock ... and yes, it
indeed is. We don't have split PUD locks.
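For reference, the helpers look roughly like this today (paraphrased from
include/linux/mm.h, details may differ):

static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
{
        /* no split PUD locks: every PUD-mapped folio in this MM shares it */
        return &mm->page_table_lock;
}

static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
{
        spinlock_t *ptl = pud_lockptr(mm, pud);

        spin_lock(ptl);
        return ptl;
}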
--
Cheers,
David / dhildenb
* Re: Direct I/O performance problems with 1GB pages
From: Keith Busch @ 2025-01-27 16:24 UTC (permalink / raw)
To: Matthew Wilcox
Cc: David Hildenbrand, linux-mm, linux-block, Muchun Song, Jane Chu,
Andres Freund
On Mon, Jan 27, 2025 at 04:02:41PM +0000, Matthew Wilcox wrote:
> Now, since postgres is using io_uring, perhaps there could be a path
> which registers the memory with the iouring (doing the refcount/pincount
> dance once), and then use that pinned memory for each I/O. Maybe that
> already exists; I'm not keeping up with io_uring development and I can't
> seem to find any documentation on what things like io_provide_buffers()
> actually do.
Yes, io_uring does have this capability. Here's a doc describing the
liburing function that sets it up, and explains how to reference it in
subsequent IO:
https://unixism.net/loti/ref-liburing/advanced_usage.html#c.io_uring_register_buffers
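For a rough idea, here is a minimal liburing sketch of that flow
(illustrative only: error handling elided, and the file name and the
hugetlb mapping are just example assumptions):

#define _GNU_SOURCE
#include <liburing.h>
#include <sys/mman.h>
#include <fcntl.h>

#define BUF_SZ  (1UL << 21)

int main(void)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        struct iovec iov;
        int fd = open("datafile", O_RDONLY | O_DIRECT);

        io_uring_queue_init(64, &ring, 0);

        /* e.g. carve the buffer out of a hugetlb mapping */
        iov.iov_base = mmap(NULL, BUF_SZ, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        iov.iov_len = BUF_SZ;

        /* pins the memory once up front: the per-I/O GUP (and the
         * folio->_refcount/_pincount traffic) goes away */
        io_uring_register_buffers(&ring, &iov, 1);

        /* subsequent I/O references the buffer by index */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read_fixed(sqe, fd, iov.iov_base, 8192, 0, 0);
        io_uring_submit(&ring);
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_unregister_buffers(&ring);
        io_uring_queue_exit(&ring);
        return 0;
}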
* Re: Direct I/O performance problems with 1GB pages
From: Matthew Wilcox @ 2025-01-27 16:56 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-mm, linux-block, Muchun Song, Jane Chu, Andres Freund
On Mon, Jan 27, 2025 at 05:20:25PM +0100, David Hildenbrand wrote:
> > > 14.04% postgres [kernel.kallsyms] [k] try_grab_folio_fast
> > > |
> > > --14.04%--try_grab_folio_fast
> > > gup_fast_fallback
> > > |
> > > --13.85%--iov_iter_extract_pages
> > > bio_iov_iter_get_pages
> > > iomap_dio_bio_iter
> > > __iomap_dio_rw
> > > iomap_dio_rw
> > > xfs_file_dio_read
> > > xfs_file_read_iter
> > > __io_read
> > > io_read
> > > io_issue_sqe
> > > io_submit_sqes
> > > __do_sys_io_uring_enter
> > > do_syscall_64
>
> BTW, two things that come to mind:
>
>
> (1) We always fallback to GUP-slow, I wonder why. GUP-fast would go via
> try_grab_folio_fast().
I don't think we do?
iov_iter_extract_pages() calls
iov_iter_extract_user_pages() calls
pin_user_pages_fast() calls
gup_fast_fallback() calls
gup_fast() calls
gup_fast_pgd_range() calls
gup_fast_p4d_range() calls
gup_fast_pud_range() calls
gup_fast_pud_leaf() calls
try_grab_folio_fast() which is where we see the contention.
If that were to fail, we'd see contention in __get_user_pages_locked(),
right?
* Re: Direct I/O performance problems with 1GB pages
From: David Hildenbrand @ 2025-01-27 16:59 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, linux-block, Muchun Song, Jane Chu, Andres Freund
On 27.01.25 17:56, Matthew Wilcox wrote:
> On Mon, Jan 27, 2025 at 05:20:25PM +0100, David Hildenbrand wrote:
>>>> 14.04% postgres [kernel.kallsyms] [k] try_grab_folio_fast
>>>> |
>>>> --14.04%--try_grab_folio_fast
>>>> gup_fast_fallback
>>>> |
>>>> --13.85%--iov_iter_extract_pages
>>>> bio_iov_iter_get_pages
>>>> iomap_dio_bio_iter
>>>> __iomap_dio_rw
>>>> iomap_dio_rw
>>>> xfs_file_dio_read
>>>> xfs_file_read_iter
>>>> __io_read
>>>> io_read
>>>> io_issue_sqe
>>>> io_submit_sqes
>>>> __do_sys_io_uring_enter
>>>> do_syscall_64
>>
>> BTW, two things that come to mind:
>>
>>
>> (1) We always fallback to GUP-slow, I wonder why. GUP-fast would go via
>> try_grab_folio_fast().
>
> I don't think we do?
>
> iov_iter_extract_pages() calls
> iov_iter_extract_user_pages() calls
> pin_user_pages_fast() calls
> gup_fast_fallback() calls
> gup_fast() calls
> gup_fast_pgd_range() calls
> gup_fast_p4d_range() calls
> gup_fast_pud_range() calls
> gup_fast_pud_leaf() calls
> try_grab_folio_fast() which is where we see the contention.
>
> If that were to fail, we'd see contention in __get_user_pages_locked(),
> right?
Sorry, my brain stripped the "_fast" in the callchain above :( -- it
*is* try_grab_folio_fast() :)
So yes, we are in GUP-fast as one would expect.
--
Cheers,
David / dhildenb
* Re: Direct I/O performance problems with 1GB pages
From: Andres Freund @ 2025-01-27 17:25 UTC (permalink / raw)
To: David Hildenbrand
Cc: Matthew Wilcox, linux-mm, linux-block, Muchun Song, Jane Chu
Hi,
On 2025-01-27 15:09:23 +0100, David Hildenbrand wrote:
> Hmmm ... do we really want to make refcounting more complicated, and more
> importantly, hugetlb-refcounting more special ?! :)
I don't know the answer to that - I mainly wanted to report the issue because
it was pretty nasty to debug and initially surprising (to me).
> If the workload is doing a lot of single-page try_grab_folio_fast(), could it
> do so on a larger area (multiple pages at once -> single refcount update)?
In the original case where I hit this (a VM with 10 PCIe 3.x NVMes JBODed
together), the IO size averaged something like ~240kB (most 256kB, with some
smaller ones thrown in). Increasing the IO size further than that starts to
hurt latency and thus requires even deeper IO queues...
Unfortunately for the VMs with those disks I don't have access to hardware
performance counters :(.
> Maybe there is a link to the report you could share, thanks.
A profile of the "original" case where I hit this, without the patch that
Willy linked to:
Note this is a profile *not* using hardware perf counters, thus likely to be
rather skewed:
https://gist.github.com/anarazel/304aa6b81d05feb3f4990b467d02dabc
(this was on Debian Sid's 6.12.6)
Without the patch I achieved ~18GB/s with 1GB pages and ~35GB/s with 2MB
pages.
After applying the patch to add an unlocked already-dirty check to
bio_set_pages_dirty(), performance improves to ~20GB/s when using 1GB pages.
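For context, the gist of that patch is roughly the following (a sketch of
the idea, not the committed code):

void bio_set_pages_dirty(struct bio *bio)
{
        struct folio_iter fi;

        bio_for_each_folio_all(fi, bio) {
                /* unlocked check: skip folios that are already dirty, so a
                 * shared 1GB folio isn't re-dirtied (and its cacheline
                 * bounced) once per bio */
                if (folio_test_dirty(fi.folio))
                        continue;
                folio_lock(fi.folio);
                folio_mark_dirty(fi.folio);
                folio_unlock(fi.folio);
        }
}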
A differential profile comparing 2MB and 1GB pages with the patch applied
(again, without hardware perf counters):
https://gist.github.com/anarazel/f993c238ea7d2c34f44440336d90ad8f
Willy then asked me for perf annotate of where in gup_fast_fallback() time is
spent. I didn't have access to the VM at that point, and tried to repro the
problem with local hardware.
As I don't have quite enough IO throughput available locally, I couldn't repro
the problem quite as easily. But after lowering the average IO size (which is
not unrealistic, far from every workload is just a bulk sequential scan), it
showed up when just using two PCIe 4 NVMe SSDs.
Here are profiles of the 2MB and 1GB cases, with the bio_set_pages_dirty()
patch applied:
https://gist.github.com/anarazel/f0d0a884c55ee18851dc9f15f03f7583
2MB pages get ~12.5GB/s, 1GB pages ~7GB/s, with a *lot* of variance.
This time it's actual hardware perf counters...
Relevant details about the c2c report, excerpted from IRC:
andres | willy: Looking at a bit more detail into the c2c report, it looks
like the dirtying is due to folio->_pincount and folio->_refcount in
about equal measure and folio->flags being modified in
gup_fast_fallback(). The modifications then, unsurprisingly, cause a
lot of cache misses for reads (like in bio_set_pages_dirty() and
bio_check_pages_dirty()).
willy | andres: that makes perfect sense, thanks
willy | really, the only way to fix that is to split it up
willy | and either we can split it per-cpu or per-physical-address-range
andres | willy: Yea, that's probably the only fundamental fix. I guess there
might be some around-the-edges improvements by colocating the write
heavy data on a separate cache line from flags and whatever is at
0x8, which are read more often than written. But I really don't know
enough about how all this is used.
willy | 0x8 is compound_head which is definitely read more often than written
Greetings,
Andres Freund
* Re: Direct I/O performance problems with 1GB pages
From: Andres Freund @ 2025-01-27 18:21 UTC (permalink / raw)
To: David Hildenbrand
Cc: Matthew Wilcox, linux-mm, linux-block, Muchun Song, Jane Chu,
Jens Axboe
Hi,
On 2025-01-27 17:09:57 +0100, David Hildenbrand wrote:
> > Andres shared some gists, but I don't want to send those to a
> > mailing list without permission. Here's the kernel part of the
> > perf report:
> >
> > 14.04% postgres [kernel.kallsyms] [k] try_grab_folio_fast
> > |
> > --14.04%--try_grab_folio_fast
> > gup_fast_fallback
> > |
> > --13.85%--iov_iter_extract_pages
> > bio_iov_iter_get_pages
> > iomap_dio_bio_iter
> > __iomap_dio_rw
> > iomap_dio_rw
> > xfs_file_dio_read
> > xfs_file_read_iter
> > __io_read
> > io_read
> > io_issue_sqe
> > io_submit_sqes
> > __do_sys_io_uring_enter
> > do_syscall_64
> >
> > Now, since postgres is using io_uring, perhaps there could be a path
> > which registers the memory with the iouring (doing the refcount/pincount
> > dance once), and then use that pinned memory for each I/O. Maybe that
> > already exists; I'm not keeping up with io_uring development and I can't
> > seem to find any documentation on what things like io_provide_buffers()
> > actually do.
Worth noting that we'll not always use io_uring. Partially for portability to
other platforms, partially because it turns out that io_uring is disabled in
enough environments that we can't rely on it. The generic fallback
implementation is a pool of worker processes connected via shared memory. The
worker process approach did run into this issue, fwiw.
That's not to say that a legit answer to this scalability issue can't be "use
fixed bufs with io_uring", just wanted to give context.
> That's precisely what io-uring fixed buffers do :)
I looked at using them at some point - unfortunately it seems that there is
just {READ,WRITE}_FIXED not {READV,WRITEV}_FIXED. It's *exceedingly* common
for us to do reads/writes where source/target buffers aren't wholly
contiguous. Thus - unless I am misunderstanding something, entirely plausible
- using fixed buffers would unfortunately increase the number of IOs
noticeably.
Should have sent an email about that...
I guess we could add some heuristic to use _FIXED if it doesn't require
splitting an IO into too many sub-ios. But that seems pretty gnarly.
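As a purely hypothetical illustration (names and sizes made up): a single
16kB read whose destination spans two registered-buffer segments has to be
issued as two READ_FIXED SQEs today, instead of one READV:

#include <liburing.h>

static void queue_split_read(struct io_uring *ring, int fd, off_t off,
                             void *seg0, void *seg1, int buf_index)
{
        struct io_uring_sqe *sqe;

        /* first 8kB into the first buffer segment */
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_read_fixed(sqe, fd, seg0, 8192, off, buf_index);

        /* second 8kB into the second, non-contiguous segment */
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_read_fixed(sqe, fd, seg1, 8192, off + 8192, buf_index);
}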
I dimly recall that I also ran into some issues around using fixed buffers as a
non-root user. It might just be the accounting of registered buffers as
mlocked memory and the difficulty of configuring that across
distributions. But I unfortunately don't remember any details anymore.
Greetings,
Andres Freund
* Re: Direct I/O performance problems with 1GB pages
From: Jens Axboe @ 2025-01-27 18:54 UTC (permalink / raw)
To: Andres Freund, David Hildenbrand
Cc: Matthew Wilcox, linux-mm, linux-block, Muchun Song, Jane Chu,
Pavel Begunkov
On 1/27/25 11:21 AM, Andres Freund wrote:
>> That's precisely what io-uring fixed buffers do :)
>
> I looked at using them at some point - unfortunately it seems that there is
> just {READ,WRITE}_FIXED not {READV,WRITEV}_FIXED. It's *exceedingly* common
> for us to do reads/writes where source/target buffers aren't wholly
> contiguous. Thus - unless I am misunderstanding something, entirely plausible
> - using fixed buffers would unfortunately increase the number of IOs
> noticeably.
>
> Should have sent an email about that...
>
> I guess we could add some heuristic to use _FIXED if it doesn't require
> splitting an IO into too many sub-ios. But that seems pretty gnarly.
Adding Pavel, he's been working on supporting registered buffers with
readv/writev.
> I dimly recall that I also ran into some issues around using fixed buffers as a
> non-root user. It might just be the accounting of registered buffers as
> mlocked memory and the difficulty of configuring that across
> distributions. But I unfortunately don't remember any details anymore.
Should just be accounting.
--
Jens Axboe
* Re: Direct I/O performance problems with 1GB pages
From: David Hildenbrand @ 2025-01-27 19:07 UTC (permalink / raw)
To: Jens Axboe, Andres Freund
Cc: Matthew Wilcox, linux-mm, linux-block, Muchun Song, Jane Chu,
Pavel Begunkov
On 27.01.25 19:54, Jens Axboe wrote:
> On 1/27/25 11:21 AM, Andres Freund wrote:
>>> That's precisely what io-uring fixed buffers do :)
>>
>> I looked at using them at some point - unfortunately it seems that there is
>> just {READ,WRITE}_FIXED not {READV,WRITEV}_FIXED. It's *exceedingly* common
>> for us to do reads/writes where source/target buffers aren't wholly
>> contiguous. Thus - unless I am misunderstanding something, entirely plausible
>> - using fixed buffers would unfortunately increase the number of IOs
>> noticeably.
>>
>> Should have sent an email about that...
>>
>> I guess we could add some heuristic to use _FIXED if it doesn't require
>> splitting an IO into too many sub-ios. But that seems pretty gnarly.
>
> Adding Pavel, he's been working on supporting registered buffers with
> readv/writev.
>
>> I dimly recall that I also ran into some issues around using fixed buffers as a
>> non-root user. It might just be the accounting of registered buffers as
>> mlocked memory and the difficulty of configuring that across
>> distributions. But I unfortunately don't remember any details anymore.
>
> Should just be accounting.
For hugetlb we might be able to relax the accounting restriction. It's
effectively mlocked already, and setting up hugetlb requires privileges.
--
Cheers,
David / dhildenb
* Re: Direct I/O performance problems with 1GB pages
From: David Hildenbrand @ 2025-01-27 19:20 UTC (permalink / raw)
To: Andres Freund
Cc: Matthew Wilcox, linux-mm, linux-block, Muchun Song, Jane Chu
On 27.01.25 18:25, Andres Freund wrote:
> Hi,
>
> On 2025-01-27 15:09:23 +0100, David Hildenbrand wrote:
>> Hmmm ... do we really want to make refcounting more complicated, and more
>> importantly, hugetlb-refcounting more special ?! :)
>
> I don't know the answer to that - I mainly wanted to report the issue because
> it was pretty nasty to debug and initially surprising (to me).
Thanks for reporting!
>
>
>> If the workload is doing a lot of single-page try_grab_folio_fast(), could it
>> do so on a larger area (multiple pages at once -> single refcount update)?
>
> In the original case where I hit this (a VM with 10 PCIe 3.x NVMes JBODed
> together), the IO size averaged something like ~240kB (most 256kB, with some
> smaller ones thrown in). Increasing the IO size further than that starts to
> hurt latency and thus requires even deeper IO queues...
Makes sense.
>
> Unfortunately for the VMs with those disks I don't have access to hardware
> performance counters :(.
> >
>> Maybe there is a link to the report you could share, thanks.
>
> A profile of the "original" case where I hit this, without the patch that
> Willy linked to:
>
> Note this is a profile *not* using hardware perf counters, thus likely to be
> rather skewed:
> https://gist.github.com/anarazel/304aa6b81d05feb3f4990b467d02dabc
> (this was on Debian Sid's 6.12.6)
>
> Without the patch I achieved ~18GB/s with 1GB pages and ~35GB/s with 2MB
> pages.
Out of interest, did you ever compare it to 4k?
>
> After applying the patch to add an unlocked already-dirty check to
> bio_set_pages_dirty() performance improves to ~20GB/s when using 1GB pages.
>
> A differential profile comparing 2MB and 1GB pages with the patch applied
> (again, without hardware perf counters):
> https://gist.github.com/anarazel/f993c238ea7d2c34f44440336d90ad8f
>
>
> Willy then asked me for perf annotate of where in gup_fast_fallback() time is
> spent. I didn't have access to the VM at that point, and tried to repro the
> problem with local hardware.
>
>
> As I don't have quite enough IO throughput available locally, I couldn't repro
> the problem quite as easily. But after lowering the average IO size (which is
> not unrealistic, far from every workload is just a bulk sequential scan), it
> showed up when just using two PCIe 4 NVMe SSDs.
>
> Here are profiles of the 2MB and 1GB cases, with the bio_set_pages_dirty()
> patch applied:
> https://gist.github.com/anarazel/f0d0a884c55ee18851dc9f15f03f7583
>
> 2MB pages get ~12.5GB/s, 1GB pages ~7GB/s, with a *lot* of variance.
Thanks!
>
> This time it's actual hardware perf counters...
>
> Relevant details about the c2c report, excerpted from IRC:
>
> andres | willy: Looking at a bit more detail into the c2c report, it looks
> like the dirtying is due to folio->_pincount and folio->_refcount in
> about equal measure and folio->flags being modified in
> gup_fast_fallback(). The modifications then, unsurprisingly, cause a
> lot of cache misses for reads (like in bio_set_pages_dirty() and
> bio_check_pages_dirty()).
>
> willy | andres: that makes perfect sense, thanks
> willy | really, the only way to fix that is to split it up
> willy | and either we can split it per-cpu or per-physical-address-range
As discussed, even better is "not repeatedly pinning/unpinning" at all :)
I'm curious, are multiple processes involved, or is this all within a
single process?
--
Cheers,
David / dhildenb
* Re: Direct I/O performance problems with 1GB pages
From: Andres Freund @ 2025-01-27 19:36 UTC (permalink / raw)
To: David Hildenbrand
Cc: Matthew Wilcox, linux-mm, linux-block, Muchun Song, Jane Chu
Hi,
On 2025-01-27 20:20:41 +0100, David Hildenbrand wrote:
> On 27.01.25 18:25, Andres Freund wrote:
> > On 2025-01-27 15:09:23 +0100, David Hildenbrand wrote:
> > Unfortunately for the VMs with those disks I don't have access to hardware
> > performance counters :(.
> > >
> > > Maybe there is a link to the report you could share, thanks.
> >
> > A profile of the "original" case where I hit this, without the patch that
> > Willy linked to:
> >
> > Note this is a profile *not* using hardware perf counters, thus likely to be
> > rather skewed:
> > https://gist.github.com/anarazel/304aa6b81d05feb3f4990b467d02dabc
> > (this was on Debian Sid's 6.12.6)
> >
> > Without the patch I achieved ~18GB/s with 1GB pages and ~35GB/s with 2MB
> > pages.
>
> Out of interest, did you ever compare it to 4k?
I didn't. Postgres will always do at least 8kB (unless compiled with
non-default settings). But I also don't think I tested just doing 8kB on that
VM. I doubt I'd have gotten close to the max, even with 2MB huge pages. At
least not without block-layer-level merging of IOs.
If it's particularly interesting, I can bring a similar VM up and run that
comparison.
> > This time it's actual hardware perf counters...
> >
> > Relevant details about the c2c report, excerpted from IRC:
> >
> > andres | willy: Looking at a bit more detail into the c2c report, it looks
> > like the dirtying is due to folio->_pincount and folio->_refcount in
> > about equal measure and folio->flags being modified in
> > gup_fast_fallback(). The modifications then, unsurprisingly, cause a
> > lot of cache misses for reads (like in bio_set_pages_dirty() and
> > bio_check_pages_dirty()).
> >
> > willy | andres: that makes perfect sense, thanks
> > willy | really, the only way to fix that is to split it up
> > willy | and either we can split it per-cpu or per-physical-address-range
>
> As discussed, even better is "not repeatedly pinning/unpinning" at all :)
Indeed ;)
> I'm curious, are multiple processes involved, or is this all within a single
> process?
In the test case here multiple processes are involved; I was testing a
parallel sequential scan with a high limit on the parallelism.
There are cases in which a fair bit of read IO is done from a single process
(e.g. to prewarm the buffer pool after a restart, which is currently done by a
single process), but it's more common for high throughput to happen across
multiple processes. With modern drives a single task won't be able to execute
non-trivial queries at full disk speed.
Greetings,
Andres Freund
* Re: Direct I/O performance problems with 1GB pages
From: Pavel Begunkov @ 2025-01-27 21:32 UTC (permalink / raw)
To: Jens Axboe, Andres Freund, David Hildenbrand
Cc: Matthew Wilcox, linux-mm, linux-block, Muchun Song, Jane Chu
On 1/27/25 18:54, Jens Axboe wrote:
> On 1/27/25 11:21 AM, Andres Freund wrote:
>>> That's precisely what io-uring fixed buffers do :)
>>
>> I looked at using them at some point - unfortunately it seems that there is
>> just {READ,WRITE}_FIXED not {READV,WRITEV}_FIXED. It's *exceedingly* common
>> for us to do reads/writes where source/target buffers aren't wholly
>> contiguous. Thus - unless I am misunderstanding something, entirely plausible
>> - using fixed buffers would unfortunately increase the number of IOs
>> noticeably.
>>
>> Should have sent an email about that...
>>
>> I guess we could add some heuristic to use _FIXED if it doesn't require
>> splitting an IO into too many sub-ios. But that seems pretty gnarly.
>
> Adding Pavel, he's been working on supporting registered buffers with
> readv/writev.
Yes, there are patches from last year out there; I'm going to pick
them up, hopefully for 6.15. I'll keep you in the loop.
>> I dimly recall that I also ran into some issues around using fixed buffers as a
>> non-root user. It might just be the accounting of registered buffers as
>> mlocked memory and the difficulty of configuring that across
>> distributions. But I unfortunately don't remember any details anymore.
>
> Should just be accounting.
>
--
Pavel Begunkov
* Re: Direct I/O performance problems with 1GB pages
From: Christoph Hellwig @ 2025-01-28 5:56 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, linux-block, Muchun Song, Jane Chu, David Hildenbrand
I read through this and unfortunately have nothing useful to contribute
to the actual lock sharding. Just two semi-related bits:
On Sun, Jan 26, 2025 at 12:46:45AM +0000, Matthew Wilcox wrote:
> Postgres are experimenting with doing direct I/O to 1GB hugetlb pages.
> Andres has gathered some performance data showing significantly worse
> performance with 1GB pages compared to 2MB pages. I sent a patch
> recently which improves matters [1], but problems remain.
>
> The primary problem we've identified is contention of folio->_refcount
> with a strong secondary contention on folio->_pincount. This is coming
> from the call chain:
>
> iov_iter_extract_pages ->
> gup_fast_fallback ->
> try_grab_folio_fast
Eww, gup_fast_fallback sent me down the wrong path, as it suggests that
it is the fallback slow path, but it's not. It got renamed from
internal_get_user_pages_fast in commit 23babe1934d7 ("mm/gup:
consistently name GUP-fast functions"). While the old name wasn't all
that great, the new one including "fallback" is just horrible. Can we
please fix this up?
The other thing is that the whole GUP machinery takes a reference
per page fragment it touches and not just per folio fragment.
I'm not sure how fair access to the atomics is, but I suspect the
multiple calls to increment/decrement them per operation probably
don't make the lock contention any better.
* Re: Direct I/O performance problems with 1GB pages
From: David Hildenbrand @ 2025-01-28 9:47 UTC (permalink / raw)
To: Christoph Hellwig, Matthew Wilcox
Cc: linux-mm, linux-block, Muchun Song, Jane Chu
On 28.01.25 06:56, Christoph Hellwig wrote:
> I read through this and unfortunately have nothing useful to contribute
> to the actual lock sharding. Just two semi-related bits:
>
> On Sun, Jan 26, 2025 at 12:46:45AM +0000, Matthew Wilcox wrote:
>> Postgres are experimenting with doing direct I/O to 1GB hugetlb pages.
>> Andres has gathered some performance data showing significantly worse
>> performance with 1GB pages compared to 2MB pages. I sent a patch
>> recently which improves matters [1], but problems remain.
>>
>> The primary problem we've identified is contention of folio->_refcount
>> with a strong secondary contention on folio->_pincount. This is coming
>> from the call chain:
>>
>> iov_iter_extract_pages ->
>> gup_fast_fallback ->
>> try_grab_folio_fast
>
> Eww, gup_fast_fallback sent me down the wrong path, as it suggests that
> it is the fallback slow path, but it's not. It got renamed from
> internal_get_user_pages_fast in commit 23babe1934d7 ("mm/gup:
> consistently name GUP-fast functions"). While the old name wasn't all
> that great, the new one including "fallback" is just horrible. Can we
> please fix this up?
Yes, I did that renaming as part of that series after the name was
suggested during review. I got confused myself reading this report.
internal_get_user_pages_fast() -> gup_fast_fallback()
was certainly an improvement. Naming is hard; we want to express "try
fast, but fallback to slow if it fails".
Maybe "gup_fast_with_fallback" might be clearer, not sure.
>
> The other thing is that the whole GUP machinery takes a reference
> per page fragment it touches and not just per folio fragment.
Right, that's what calling code expects: hands out pages, one ref per page.
> I'm not sure how fair access to the atomics is, but I suspect the
> multiple calls to increment/decrement them per operation probably
> don't make the lock contention any better.
The "advanced" logic in __get_user_pages() makes sure that if you GUP
multiple pages covered by a single PMD/PUD, that we will adjust the
refcount only twice (once deep down in the page table walker, then in
__get_user_pages()).
There is certainly a lot of room for improvement in these page table
walkers ...
--
Cheers,
David / dhildenb
* Re: Direct I/O performance problems with 1GB pages
From: Christoph Hellwig @ 2025-01-29 6:03 UTC (permalink / raw)
To: David Hildenbrand
Cc: Christoph Hellwig, Matthew Wilcox, linux-mm, linux-block,
Muchun Song, Jane Chu
On Tue, Jan 28, 2025 at 10:47:12AM +0100, David Hildenbrand wrote:
> Yes, I did that renaming as part of that series after the name was suggested
> during review. I got confused myself reading this report.
>
> internal_get_user_pages_fast() -> gup_fast_fallback()
>
> Was certainly an improvement. Naming is hard, we want to express "try fast,
> but fallback to slow if it fails".
>
> Maybe "gup_fast_with_fallback" might be clearer, not sure.
gup_common?