* Direct I/O performance problems with 1GB pages

From: Matthew Wilcox @ 2025-01-26 0:46 UTC
  To: linux-mm; +Cc: linux-block, Muchun Song, Jane Chu

Postgres are experimenting with doing direct I/O to 1GB hugetlb pages.
Andres has gathered some performance data showing significantly worse
performance with 1GB pages compared to 2MB pages.  I sent a patch
recently which improves matters [1], but problems remain.

The primary problem we've identified is contention of folio->_refcount
with a strong secondary contention on folio->_pincount.  This is coming
from the call chain:

iov_iter_extract_pages ->
  gup_fast_fallback ->
    try_grab_folio_fast

Obviously we can fix this by sharding the counts.  We could do that by
address, since there's no observed performance problem with 2MB pages.
But I think we'd do better to shard by CPU.  We have percpu-refcount.h
already, and I think it'll work.

The key to percpu refcounts is knowing at what point you need to start
caring about whether the refcount has hit zero (we don't care if the
refcount oscillates between 1 and 2, but we very much care about when
we hit 0).

I think the point at which we call percpu_ref_kill() is when we remove a
folio from the page cache.  Before that point, the refcount is guaranteed
to always be positive.  After that point, once the refcount hits zero,
we must free the folio.

It's pretty rare to remove a hugetlb page from the page cache while it's
still mapped.  So we don't need to worry about scalability at that point.

Any volunteers to prototype this?  Andres is a delight to work with,
but I just don't have time to take on this project right now.

[1] https://lore.kernel.org/linux-block/20250124225104.326613-1-willy@infradead.org/
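[Illustrative sketch: a rough idea of how the existing percpu_ref API
(include/linux/percpu-refcount.h) could back the lifecycle described above.
The folio_percpu_ref()/folio_from_percpu_ref() helpers and the hook points
are hypothetical names invented for the sketch; only the percpu_ref_* calls
themselves are existing API.]

#include <linux/percpu-refcount.h>
#include <linux/mm.h>

/* Release callback: runs once the ref drops to zero after percpu_ref_kill() */
static void hugetlb_folio_release(struct percpu_ref *ref)
{
	struct folio *folio = folio_from_percpu_ref(ref);	/* hypothetical */

	free_huge_folio(folio);
}

/* At folio allocation: start in (cheap) per-cpu mode */
static int hugetlb_folio_ref_init(struct folio *folio)
{
	return percpu_ref_init(folio_percpu_ref(folio),		/* hypothetical */
			       hugetlb_folio_release, 0, GFP_KERNEL);
}

/* GUP / direct I/O hot path: per-cpu increments, no shared cacheline */
static void hugetlb_folio_get(struct folio *folio)
{
	percpu_ref_get(folio_percpu_ref(folio));
}

static void hugetlb_folio_put(struct folio *folio)
{
	percpu_ref_put(folio_percpu_ref(folio));
}

/*
 * On removal from the page cache: from here on the count may legitimately
 * reach zero, so collapse to atomic mode and start watching for it.
 */
static void hugetlb_folio_removed_from_cache(struct folio *folio)
{
	percpu_ref_kill(folio_percpu_ref(folio));
}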
* Re: Direct I/O performance problems with 1GB pages

From: David Hildenbrand @ 2025-01-27 14:09 UTC
  To: Matthew Wilcox, linux-mm; +Cc: linux-block, Muchun Song, Jane Chu

On 26.01.25 01:46, Matthew Wilcox wrote:
> Postgres are experimenting with doing direct I/O to 1GB hugetlb pages.
> Andres has gathered some performance data showing significantly worse
> performance with 1GB pages compared to 2MB pages.  I sent a patch
> recently which improves matters [1], but problems remain.
> 
> The primary problem we've identified is contention of folio->_refcount
> with a strong secondary contention on folio->_pincount.  This is coming
> from the call chain:
> 
> iov_iter_extract_pages ->
>   gup_fast_fallback ->
>     try_grab_folio_fast
> 
> Obviously we can fix this by sharding the counts.  We could do that by
> address, since there's no observed performance problem with 2MB pages.
> But I think we'd do better to shard by CPU.  We have percpu-refcount.h
> already, and I think it'll work.
> 
> The key to percpu refcounts is knowing at what point you need to start
> caring about whether the refcount has hit zero (we don't care if the
> refcount oscillates between 1 and 2, but we very much care about when
> we hit 0).
> 
> I think the point at which we call percpu_ref_kill() is when we remove a
> folio from the page cache.  Before that point, the refcount is guaranteed
> to always be positive.  After that point, once the refcount hits zero,
> we must free the folio.
> 
> It's pretty rare to remove a hugetlb page from the page cache while it's
> still mapped.  So we don't need to worry about scalability at that point.
> 
> Any volunteers to prototype this?  Andres is a delight to work with,
> but I just don't have time to take on this project right now.

Hmmm ... do we really want to make refcounting more complicated, and more
importantly, hugetlb-refcounting more special ?! :)

If the workload is doing a lot of single-page try_grab_folio_fast(), could it
do so on a larger area (multiple pages at once -> single refcount update)?

Maybe there is a link to the report you could share, thanks.

-- 
Cheers,

David / dhildenb
* Re: Direct I/O performance problems with 1GB pages

From: Matthew Wilcox @ 2025-01-27 16:02 UTC
  To: David Hildenbrand
  Cc: linux-mm, linux-block, Muchun Song, Jane Chu, Andres Freund

[Adding Andres to the cc.  Sorry for leaving you off in the initial mail]

On Mon, Jan 27, 2025 at 03:09:23PM +0100, David Hildenbrand wrote:
> On 26.01.25 01:46, Matthew Wilcox wrote:
> > Postgres are experimenting with doing direct I/O to 1GB hugetlb pages.
> > Andres has gathered some performance data showing significantly worse
> > performance with 1GB pages compared to 2MB pages.  I sent a patch
> > recently which improves matters [1], but problems remain.
> > 
> > The primary problem we've identified is contention of folio->_refcount
> > with a strong secondary contention on folio->_pincount.  This is coming
> > from the call chain:
> > 
> > iov_iter_extract_pages ->
> >   gup_fast_fallback ->
> >     try_grab_folio_fast
> > 
> > Obviously we can fix this by sharding the counts.  We could do that by
> > address, since there's no observed performance problem with 2MB pages.
> > But I think we'd do better to shard by CPU.  We have percpu-refcount.h
> > already, and I think it'll work.
> > 
> > The key to percpu refcounts is knowing at what point you need to start
> > caring about whether the refcount has hit zero (we don't care if the
> > refcount oscillates between 1 and 2, but we very much care about when
> > we hit 0).
> > 
> > I think the point at which we call percpu_ref_kill() is when we remove a
> > folio from the page cache.  Before that point, the refcount is guaranteed
> > to always be positive.  After that point, once the refcount hits zero,
> > we must free the folio.
> > 
> > It's pretty rare to remove a hugetlb page from the page cache while it's
> > still mapped.  So we don't need to worry about scalability at that point.
> > 
> > Any volunteers to prototype this?  Andres is a delight to work with,
> > but I just don't have time to take on this project right now.
> 
> Hmmm ... do we really want to make refcounting more complicated, and more
> importantly, hugetlb-refcounting more special ?! :)

No, I really don't.  But I've always been mildly concerned about extra
contention on folio locks, folio refcounts, etc.  I don't know if 2MB
page performance might be improved by a scheme like this, and we might
even want to cut over for sizes larger than, say, 64kB.  That would be
something interesting to investigate.

> If the workload is doing a lot of single-page try_grab_folio_fast(), could it
> do so on a larger area (multiple pages at once -> single refcount update)?

Not really.  This is memory that's being used as the buffer cache, so
every thread in your database is hammering on it and pulling in exactly
the data that it needs for the SQL query that it's processing.

> Maybe there is a link to the report you could share, thanks.

Andres shared some gists, but I don't want to send those to a
mailing list without permission.  Here's the kernel part of the
perf report:

   14.04%  postgres  [kernel.kallsyms]  [k] try_grab_folio_fast
            |
            --14.04%--try_grab_folio_fast
                      gup_fast_fallback
                      |
                      --13.85%--iov_iter_extract_pages
                                bio_iov_iter_get_pages
                                iomap_dio_bio_iter
                                __iomap_dio_rw
                                iomap_dio_rw
                                xfs_file_dio_read
                                xfs_file_read_iter
                                __io_read
                                io_read
                                io_issue_sqe
                                io_submit_sqes
                                __do_sys_io_uring_enter
                                do_syscall_64

Now, since postgres is using io_uring, perhaps there could be a path
which registers the memory with the iouring (doing the refcount/pincount
dance once), and then use that pinned memory for each I/O.  Maybe that
already exists; I'm not keeping up with io_uring development and I can't
seem to find any documentation on what things like io_provide_buffers()
actually do.
* Re: Direct I/O performance problems with 1GB pages

From: David Hildenbrand @ 2025-01-27 16:09 UTC
  To: Matthew Wilcox
  Cc: linux-mm, linux-block, Muchun Song, Jane Chu, Andres Freund

> 
>> If the workload is doing a lot of single-page try_grab_folio_fast(), could it
>> do so on a larger area (multiple pages at once -> single refcount update)?
> 
> Not really.  This is memory that's being used as the buffer cache, so
> every thread in your database is hammering on it and pulling in exactly
> the data that it needs for the SQL query that it's processing.

Ouch.

> 
>> Maybe there is a link to the report you could share, thanks.
> 
> Andres shared some gists, but I don't want to send those to a
> mailing list without permission.  Here's the kernel part of the
> perf report:
> 
>    14.04%  postgres  [kernel.kallsyms]  [k] try_grab_folio_fast
>             |
>             --14.04%--try_grab_folio_fast
>                       gup_fast_fallback
>                       |
>                       --13.85%--iov_iter_extract_pages
>                                 bio_iov_iter_get_pages
>                                 iomap_dio_bio_iter
>                                 __iomap_dio_rw
>                                 iomap_dio_rw
>                                 xfs_file_dio_read
>                                 xfs_file_read_iter
>                                 __io_read
>                                 io_read
>                                 io_issue_sqe
>                                 io_submit_sqes
>                                 __do_sys_io_uring_enter
>                                 do_syscall_64
> 
> Now, since postgres is using io_uring, perhaps there could be a path
> which registers the memory with the iouring (doing the refcount/pincount
> dance once), and then use that pinned memory for each I/O.  Maybe that
> already exists; I'm not keeping up with io_uring development and I can't
> seem to find any documentation on what things like io_provide_buffers()
> actually do.

That's precisely what io-uring fixed buffers do :)

-- 
Cheers,

David / dhildenb
* Re: Direct I/O performance problems with 1GB pages

From: David Hildenbrand @ 2025-01-27 16:20 UTC
  To: Matthew Wilcox
  Cc: linux-mm, linux-block, Muchun Song, Jane Chu, Andres Freund

On 27.01.25 17:09, David Hildenbrand wrote:
>> 
>>> If the workload is doing a lot of single-page try_grab_folio_fast(), could it
>>> do so on a larger area (multiple pages at once -> single refcount update)?
>> 
>> Not really.  This is memory that's being used as the buffer cache, so
>> every thread in your database is hammering on it and pulling in exactly
>> the data that it needs for the SQL query that it's processing.
> 
> Ouch.
> 
>> 
>>> Maybe there is a link to the report you could share, thanks.
>> 
>> Andres shared some gists, but I don't want to send those to a
>> mailing list without permission.  Here's the kernel part of the
>> perf report:
>> 
>>    14.04%  postgres  [kernel.kallsyms]  [k] try_grab_folio_fast
>>             |
>>             --14.04%--try_grab_folio_fast
>>                       gup_fast_fallback
>>                       |
>>                       --13.85%--iov_iter_extract_pages
>>                                 bio_iov_iter_get_pages
>>                                 iomap_dio_bio_iter
>>                                 __iomap_dio_rw
>>                                 iomap_dio_rw
>>                                 xfs_file_dio_read
>>                                 xfs_file_read_iter
>>                                 __io_read
>>                                 io_read
>>                                 io_issue_sqe
>>                                 io_submit_sqes
>>                                 __do_sys_io_uring_enter
>>                                 do_syscall_64

BTW, two things that come to mind:

(1) We always fallback to GUP-slow, I wonder why.  GUP-fast would go via
try_grab_folio_fast().

(2) During GUP slow, we must take the PT lock of the PUD table.  So the
folio refcount/pincount/whatever is actually sync'ed by the ... PT lock
here?

See assert_spin_locked(pud_lockptr(mm, pudp)); in follow_huge_pud().

Note that that PUD table lock is likely a per-MM lock ... and yes, it
indeed is.  We don't have split PUD locks.

-- 
Cheers,

David / dhildenb
* Re: Direct I/O performance problems with 1GB pages

From: Matthew Wilcox @ 2025-01-27 16:56 UTC
  To: David Hildenbrand
  Cc: linux-mm, linux-block, Muchun Song, Jane Chu, Andres Freund

On Mon, Jan 27, 2025 at 05:20:25PM +0100, David Hildenbrand wrote:
> > >    14.04%  postgres  [kernel.kallsyms]  [k] try_grab_folio_fast
> > >             |
> > >             --14.04%--try_grab_folio_fast
> > >                       gup_fast_fallback
> > >                       |
> > >                       --13.85%--iov_iter_extract_pages
> > >                                 bio_iov_iter_get_pages
> > >                                 iomap_dio_bio_iter
> > >                                 __iomap_dio_rw
> > >                                 iomap_dio_rw
> > >                                 xfs_file_dio_read
> > >                                 xfs_file_read_iter
> > >                                 __io_read
> > >                                 io_read
> > >                                 io_issue_sqe
> > >                                 io_submit_sqes
> > >                                 __do_sys_io_uring_enter
> > >                                 do_syscall_64
> 
> BTW, two things that come to mind:
> 
> (1) We always fallback to GUP-slow, I wonder why.  GUP-fast would go via
> try_grab_folio_fast().

I don't think we do?

iov_iter_extract_pages() calls
iov_iter_extract_user_pages() calls
pin_user_pages_fast() calls
gup_fast_fallback() calls
gup_fast() calls
gup_fast_pgd_range() calls
gup_fast_p4d_range() calls
gup_fast_pud_range() calls
gup_fast_pud_leaf() calls
try_grab_folio_fast() which is where we see the contention.

If that were to fail, we'd see contention in __get_user_pages_locked(),
right?
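[Illustrative sketch: a heavily simplified paraphrase, from memory, of what
the FOLL_PIN path reached above does for a large folio -- see
try_grab_folio_fast() in mm/gup.c for the real code.  It shows why every
pinner of a 1GB folio, on every CPU, ends up doing atomics on the same two
adjacent fields.]

#include <linux/mm.h>

static struct folio *pin_large_folio_sketch(struct page *page, int refs)
{
	struct folio *folio = page_folio(page);

	/* every pin from every CPU bounces this cacheline ... */
	folio_ref_add(folio, refs);

	/* ... and, for large folios, the pin counter right next to it */
	atomic_add(refs, &folio->_pincount);

	return folio;
}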
* Re: Direct I/O performance problems with 1GB pages

From: David Hildenbrand @ 2025-01-27 16:59 UTC
  To: Matthew Wilcox
  Cc: linux-mm, linux-block, Muchun Song, Jane Chu, Andres Freund

On 27.01.25 17:56, Matthew Wilcox wrote:
> On Mon, Jan 27, 2025 at 05:20:25PM +0100, David Hildenbrand wrote:
>>>>    14.04%  postgres  [kernel.kallsyms]  [k] try_grab_folio_fast
>>>>             |
>>>>             --14.04%--try_grab_folio_fast
>>>>                       gup_fast_fallback
>>>>                       |
>>>>                       --13.85%--iov_iter_extract_pages
>>>>                                 bio_iov_iter_get_pages
>>>>                                 iomap_dio_bio_iter
>>>>                                 __iomap_dio_rw
>>>>                                 iomap_dio_rw
>>>>                                 xfs_file_dio_read
>>>>                                 xfs_file_read_iter
>>>>                                 __io_read
>>>>                                 io_read
>>>>                                 io_issue_sqe
>>>>                                 io_submit_sqes
>>>>                                 __do_sys_io_uring_enter
>>>>                                 do_syscall_64
>>
>> BTW, two things that come to mind:
>>
>> (1) We always fallback to GUP-slow, I wonder why.  GUP-fast would go via
>> try_grab_folio_fast().
> 
> I don't think we do?
> 
> iov_iter_extract_pages() calls
> iov_iter_extract_user_pages() calls
> pin_user_pages_fast() calls
> gup_fast_fallback() calls
> gup_fast() calls
> gup_fast_pgd_range() calls
> gup_fast_p4d_range() calls
> gup_fast_pud_range() calls
> gup_fast_pud_leaf() calls
> try_grab_folio_fast() which is where we see the contention.
> 
> If that were to fail, we'd see contention in __get_user_pages_locked(),
> right?

Sorry, my brain stripped the "_fast" in the callchain above :( -- it
*is* try_grab_folio_fast() :)

So yes, we are in GUP-fast as one would expect.

-- 
Cheers,

David / dhildenb
* Re: Direct I/O performance problems with 1GB pages

From: Andres Freund @ 2025-01-27 18:21 UTC
  To: David Hildenbrand
  Cc: Matthew Wilcox, linux-mm, linux-block, Muchun Song, Jane Chu, Jens Axboe

Hi,

On 2025-01-27 17:09:57 +0100, David Hildenbrand wrote:
> > Andres shared some gists, but I don't want to send those to a
> > mailing list without permission.  Here's the kernel part of the
> > perf report:
> > 
> >    14.04%  postgres  [kernel.kallsyms]  [k] try_grab_folio_fast
> >             |
> >             --14.04%--try_grab_folio_fast
> >                       gup_fast_fallback
> >                       |
> >                       --13.85%--iov_iter_extract_pages
> >                                 bio_iov_iter_get_pages
> >                                 iomap_dio_bio_iter
> >                                 __iomap_dio_rw
> >                                 iomap_dio_rw
> >                                 xfs_file_dio_read
> >                                 xfs_file_read_iter
> >                                 __io_read
> >                                 io_read
> >                                 io_issue_sqe
> >                                 io_submit_sqes
> >                                 __do_sys_io_uring_enter
> >                                 do_syscall_64
> > 
> > Now, since postgres is using io_uring, perhaps there could be a path
> > which registers the memory with the iouring (doing the refcount/pincount
> > dance once), and then use that pinned memory for each I/O.  Maybe that
> > already exists; I'm not keeping up with io_uring development and I can't
> > seem to find any documentation on what things like io_provide_buffers()
> > actually do.

Worth noting that we'll not always use io_uring.  Partially for
portability to other platforms, partially because it turns out that
io_uring is disabled in enough environments that we can't rely on it.
The generic fallback implementation is a pool of worker processes
connected via shared memory.  The worker process approach did run into
this issue, fwiw.

That's not to say that a legit answer to this scalability issue can't be
"use fixed bufs with io_uring", just wanted to give context.

> That's precisely what io-uring fixed buffers do :)

I looked at using them at some point - unfortunately it seems that there is
just {READ,WRITE}_FIXED not {READV,WRITEV}_FIXED.  It's *exceedingly* common
for us to do reads/writes where source/target buffers aren't wholly
contiguous.  Thus - unless I am misunderstanding something, entirely plausible
- using fixed buffers would unfortunately increase the number of IOs
noticeably.

Should have sent an email about that...

I guess we could add some heuristic to use _FIXED if it doesn't require
splitting an IO into too many sub-ios.  But that seems pretty gnarly.

I dimly recall that I also ran into some issues around using fixed buffers
as a non-root user.  It might just be the accounting of registered buffers
as mlocked memory and the difficulty of configuring that across
distributions.  But I unfortunately don't remember any details anymore.

Greetings,

Andres Freund
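[Illustrative sketch: a minimal liburing example of the gap described above.
A scattered buffer can go out as one vectored SQE, but with registered
buffers (READ_FIXED only, at the time of this thread) each contiguous
fragment needs its own SQE.  The fd, offsets and three-fragment layout are
made-up example values.]

#include <liburing.h>
#include <sys/uio.h>

/* Unregistered path: one vectored SQE covers all three fragments. */
static void queue_readv(struct io_uring *ring, int fd, off_t off,
			struct iovec *frags /* 3 entries */)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_readv(sqe, fd, frags, 3, off);
}

/*
 * Registered-buffer path: READ_FIXED takes a single contiguous buffer, so a
 * scattered read becomes three SQEs (and three IOs unless the block layer
 * merges them) -- the cost increase mentioned above.  buf_index selects the
 * iovec previously handed to io_uring_register_buffers(); here fragment i is
 * assumed to have been registered at index i.
 */
static void queue_read_fixed_split(struct io_uring *ring, int fd, off_t off,
				   struct iovec *frags /* 3 entries */)
{
	for (int i = 0; i < 3; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_read_fixed(sqe, fd, frags[i].iov_base,
					 frags[i].iov_len, off, i);
		off += frags[i].iov_len;
	}
}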
* Re: Direct I/O performance problems with 1GB pages

From: Jens Axboe @ 2025-01-27 18:54 UTC
  To: Andres Freund, David Hildenbrand
  Cc: Matthew Wilcox, linux-mm, linux-block, Muchun Song, Jane Chu, Pavel Begunkov

On 1/27/25 11:21 AM, Andres Freund wrote:
>> That's precisely what io-uring fixed buffers do :)
> 
> I looked at using them at some point - unfortunately it seems that there is
> just {READ,WRITE}_FIXED not {READV,WRITEV}_FIXED.  It's *exceedingly* common
> for us to do reads/writes where source/target buffers aren't wholly
> contiguous.  Thus - unless I am misunderstanding something, entirely plausible
> - using fixed buffers would unfortunately increase the number of IOs
> noticeably.
> 
> Should have sent an email about that...
> 
> I guess we could add some heuristic to use _FIXED if it doesn't require
> splitting an IO into too many sub-ios.  But that seems pretty gnarly.

Adding Pavel, he's been working on supporting registered buffers with
readv/writev.

> I dimly recall that I also ran into some issues around using fixed buffers
> as a non-root user.  It might just be the accounting of registered buffers
> as mlocked memory and the difficulty of configuring that across
> distributions.  But I unfortunately don't remember any details anymore.

Should just be accounting.

-- 
Jens Axboe
* Re: Direct I/O performance problems with 1GB pages

From: David Hildenbrand @ 2025-01-27 19:07 UTC
  To: Jens Axboe, Andres Freund
  Cc: Matthew Wilcox, linux-mm, linux-block, Muchun Song, Jane Chu, Pavel Begunkov

On 27.01.25 19:54, Jens Axboe wrote:
> On 1/27/25 11:21 AM, Andres Freund wrote:
>>> That's precisely what io-uring fixed buffers do :)
>>
>> I looked at using them at some point - unfortunately it seems that there is
>> just {READ,WRITE}_FIXED not {READV,WRITEV}_FIXED.  It's *exceedingly* common
>> for us to do reads/writes where source/target buffers aren't wholly
>> contiguous.  Thus - unless I am misunderstanding something, entirely plausible
>> - using fixed buffers would unfortunately increase the number of IOs
>> noticeably.
>>
>> Should have sent an email about that...
>>
>> I guess we could add some heuristic to use _FIXED if it doesn't require
>> splitting an IO into too many sub-ios.  But that seems pretty gnarly.
> 
> Adding Pavel, he's been working on supporting registered buffers with
> readv/writev.
> 
>> I dimly recall that I also ran into some issues around using fixed buffers
>> as a non-root user.  It might just be the accounting of registered buffers
>> as mlocked memory and the difficulty of configuring that across
>> distributions.  But I unfortunately don't remember any details anymore.
> 
> Should just be accounting.

For hugetlb we might be able to relax the accounting restriction.  It's
effectively mlocked already, and setting up hugetlb requires privileges.

-- 
Cheers,

David / dhildenb
* Re: Direct I/O performance problems with 1GB pages

From: Pavel Begunkov @ 2025-01-27 21:32 UTC
  To: Jens Axboe, Andres Freund, David Hildenbrand
  Cc: Matthew Wilcox, linux-mm, linux-block, Muchun Song, Jane Chu

On 1/27/25 18:54, Jens Axboe wrote:
> On 1/27/25 11:21 AM, Andres Freund wrote:
>>> That's precisely what io-uring fixed buffers do :)
>>
>> I looked at using them at some point - unfortunately it seems that there is
>> just {READ,WRITE}_FIXED not {READV,WRITEV}_FIXED.  It's *exceedingly* common
>> for us to do reads/writes where source/target buffers aren't wholly
>> contiguous.  Thus - unless I am misunderstanding something, entirely plausible
>> - using fixed buffers would unfortunately increase the number of IOs
>> noticeably.
>>
>> Should have sent an email about that...
>>
>> I guess we could add some heuristic to use _FIXED if it doesn't require
>> splitting an IO into too many sub-ios.  But that seems pretty gnarly.
> 
> Adding Pavel, he's been working on supporting registered buffers with
> readv/writev.

Yes, there are patches from last year out there.  I'm going to pick them
up, hopefully for 6.15.  I'll keep you in the loop.

>> I dimly recall that I also ran into some issues around using fixed buffers
>> as a non-root user.  It might just be the accounting of registered buffers
>> as mlocked memory and the difficulty of configuring that across
>> distributions.  But I unfortunately don't remember any details anymore.
> 
> Should just be accounting.

-- 
Pavel Begunkov
* Re: Direct I/O performance problems with 1GB pages

From: Keith Busch @ 2025-01-27 16:24 UTC
  To: Matthew Wilcox
  Cc: David Hildenbrand, linux-mm, linux-block, Muchun Song, Jane Chu, Andres Freund

On Mon, Jan 27, 2025 at 04:02:41PM +0000, Matthew Wilcox wrote:
> Now, since postgres is using io_uring, perhaps there could be a path
> which registers the memory with the iouring (doing the refcount/pincount
> dance once), and then use that pinned memory for each I/O.  Maybe that
> already exists; I'm not keeping up with io_uring development and I can't
> seem to find any documentation on what things like io_provide_buffers()
> actually do.

Yes, io_uring does have this capability.  Here's a doc describing the
liburing function that sets it up, and explains how to reference it in
subsequent IO:

  https://unixism.net/loti/ref-liburing/advanced_usage.html#c.io_uring_register_buffers
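[Illustrative sketch: a minimal example of the register-once,
reference-by-index pattern the linked page describes.  The mmap flags, the
1GB size and the buffer index are assumptions made for the example;
registering a buffer this large also needs an adequate RLIMIT_MEMLOCK or
privileges, per the accounting discussion elsewhere in this thread.]

#include <liburing.h>
#include <sys/mman.h>

/* One-time setup: pin the hugetlb buffer with the ring so that later I/Os
 * skip the per-I/O pin/unpin of the folio entirely. */
static int setup_registered_buffer(struct io_uring *ring, void **buf_out)
{
	size_t len = 1UL << 30;	/* one 1GB huge page; may additionally need
				 * MAP_HUGE_1GB and reserved 1GB pages */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	struct iovec iov = { .iov_base = buf, .iov_len = len };
	int ret = io_uring_register_buffers(ring, &iov, 1);
	if (ret < 0)
		return ret;

	*buf_out = buf;
	return 0;
}

/* Each subsequent I/O references the registered buffer by index (0 here)
 * instead of re-pinning the pages. */
static void queue_fixed_read(struct io_uring *ring, int fd, void *buf,
			     unsigned len, off_t file_off)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_read_fixed(sqe, fd, buf, len, file_off, 0);
}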
* Re: Direct I/O performance problems with 1GB pages

From: Andres Freund @ 2025-01-27 17:25 UTC
  To: David Hildenbrand
  Cc: Matthew Wilcox, linux-mm, linux-block, Muchun Song, Jane Chu

Hi,

On 2025-01-27 15:09:23 +0100, David Hildenbrand wrote:
> Hmmm ... do we really want to make refcounting more complicated, and more
> importantly, hugetlb-refcounting more special ?! :)

I don't know the answer to that - I mainly wanted to report the issue because
it was pretty nasty to debug and initially surprising (to me).

> If the workload is doing a lot of single-page try_grab_folio_fast(), could it
> do so on a larger area (multiple pages at once -> single refcount update)?

In the original case where I hit this (a VM with 10 PCIe 3x NVMEs JBODed
together), the IO size averaged something like ~240kB (most 256kB, with some
smaller ones thrown in).  Increasing the IO size further than that starts to
hurt latency and thus requires even deeper IO queues...

Unfortunately for the VMs with those disks I don't have access to hardware
performance counters :(.

> Maybe there is a link to the report you could share, thanks.

A profile of the "original" case where I hit this, without the patch that
Willy linked to:

Note this is a profile *not* using hardware perf counters, thus likely to be
rather skewed:
https://gist.github.com/anarazel/304aa6b81d05feb3f4990b467d02dabc
(this was on Debian Sid's 6.12.6)

Without the patch I achieved ~18GB/s with 1GB pages and ~35GB/s with 2MB
pages.

After applying the patch to add an unlocked already-dirty check to
bio_set_pages_dirty(), performance improves to ~20GB/s when using 1GB pages.

A differential profile comparing 2MB and 1GB pages with the patch applied
(again, without hardware perf counters):
https://gist.github.com/anarazel/f993c238ea7d2c34f44440336d90ad8f

Willy then asked me for perf annotate of where in gup_fast_fallback() time is
spent.  I didn't have access to the VM at that point, and tried to repro the
problem with local hardware.

As I don't have quite enough IO throughput available locally, I couldn't repro
the problem quite as easily.  But after lowering the average IO size (which is
not unrealistic, far from every workload is just a bulk sequential scan), it
showed up when just using two PCIe 4 NVMe SSDs.

Here are profiles of the 2MB and 1GB cases, with the bio_set_pages_dirty()
patch applied:
https://gist.github.com/anarazel/f0d0a884c55ee18851dc9f15f03f7583

2MB pages get ~12.5GB/s, 1GB pages ~7GB/s, with a *lot* of variance.

This time it's actual hardware perf counters...

Relevant details about the c2c report, excerpted from IRC:

andres | willy: Looking at a bit more detail into the c2c report, it looks
         like the dirtying is due to folio->_pincount and folio->_refcount in
         about equal measure and folio->flags being modified in
         gup_fast_fallback().  The modifications then, unsurprisingly, cause a
         lot of cache misses for reads (like in bio_set_pages_dirty() and
         bio_check_pages_dirty()).
 willy | andres: that makes perfect sense, thanks
 willy | really, the only way to fix that is to split it up
 willy | and either we can split it per-cpu or per-physical-address-range
andres | willy: Yea, that's probably the only fundamental fix.  I guess there
         might be some around-the-edges improvements by colocating the write
         heavy data on a separate cache line from flags and whatever is at
         0x8, which are read more often than written.  But I really don't know
         enough about how all this is used.
 willy | 0x8 is compound_head which is definitely read more often than written

Greetings,

Andres Freund
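[Illustrative sketch: what "colocating the write heavy data on a separate
cache line" usually looks like in kernel code.  The structure below is made
up purely to show the idiom; struct page/folio itself is fixed at 64 bytes
per page, so this exact trick does not apply there -- which is one reason
the discussion above heads towards sharding instead.]

#include <linux/cache.h>
#include <linux/atomic.h>

struct example_counters {
	/* read-mostly fields: looked at on every access, rarely written */
	unsigned long	flags;
	void		*owner;

	/* write-hot counters: bounced between CPUs, kept on their own line */
	atomic_t	refs ____cacheline_aligned_in_smp;
	atomic_t	pins;
};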
* Re: Direct I/O performance problems with 1GB pages

From: David Hildenbrand @ 2025-01-27 19:20 UTC
  To: Andres Freund
  Cc: Matthew Wilcox, linux-mm, linux-block, Muchun Song, Jane Chu

On 27.01.25 18:25, Andres Freund wrote:
> Hi,
> 
> On 2025-01-27 15:09:23 +0100, David Hildenbrand wrote:
>> Hmmm ... do we really want to make refcounting more complicated, and more
>> importantly, hugetlb-refcounting more special ?! :)
> 
> I don't know the answer to that - I mainly wanted to report the issue because
> it was pretty nasty to debug and initially surprising (to me).

Thanks for reporting!

> 
>> If the workload is doing a lot of single-page try_grab_folio_fast(), could it
>> do so on a larger area (multiple pages at once -> single refcount update)?
> 
> In the original case where I hit this (a VM with 10 PCIe 3x NVMEs JBODed
> together), the IO size averaged something like ~240kB (most 256kB, with some
> smaller ones thrown in).  Increasing the IO size further than that starts to
> hurt latency and thus requires even deeper IO queues...

Makes sense.

> 
> Unfortunately for the VMs with those disks I don't have access to hardware
> performance counters :(.
> 
>> Maybe there is a link to the report you could share, thanks.
> 
> A profile of the "original" case where I hit this, without the patch that
> Willy linked to:
> 
> Note this is a profile *not* using hardware perf counters, thus likely to be
> rather skewed:
> https://gist.github.com/anarazel/304aa6b81d05feb3f4990b467d02dabc
> (this was on Debian Sid's 6.12.6)
> 
> Without the patch I achieved ~18GB/s with 1GB pages and ~35GB/s with 2MB
> pages.

Out of interest, did you ever compare it to 4k?

> 
> After applying the patch to add an unlocked already-dirty check to
> bio_set_pages_dirty(), performance improves to ~20GB/s when using 1GB pages.
> 
> A differential profile comparing 2MB and 1GB pages with the patch applied
> (again, without hardware perf counters):
> https://gist.github.com/anarazel/f993c238ea7d2c34f44440336d90ad8f
> 
> Willy then asked me for perf annotate of where in gup_fast_fallback() time is
> spent.  I didn't have access to the VM at that point, and tried to repro the
> problem with local hardware.
> 
> As I don't have quite enough IO throughput available locally, I couldn't repro
> the problem quite as easily.  But after lowering the average IO size (which is
> not unrealistic, far from every workload is just a bulk sequential scan), it
> showed up when just using two PCIe 4 NVMe SSDs.
> 
> Here are profiles of the 2MB and 1GB cases, with the bio_set_pages_dirty()
> patch applied:
> https://gist.github.com/anarazel/f0d0a884c55ee18851dc9f15f03f7583
> 
> 2MB pages get ~12.5GB/s, 1GB pages ~7GB/s, with a *lot* of variance.

Thanks!

> 
> This time it's actual hardware perf counters...
> 
> Relevant details about the c2c report, excerpted from IRC:
> 
> andres | willy: Looking at a bit more detail into the c2c report, it looks
>          like the dirtying is due to folio->_pincount and folio->_refcount in
>          about equal measure and folio->flags being modified in
>          gup_fast_fallback().  The modifications then, unsurprisingly, cause a
>          lot of cache misses for reads (like in bio_set_pages_dirty() and
>          bio_check_pages_dirty()).
> 
>  willy | andres: that makes perfect sense, thanks
>  willy | really, the only way to fix that is to split it up
>  willy | and either we can split it per-cpu or per-physical-address-range

As discussed, even better is "not repeatedly pinning/unpinning" at all :)

I'm curious, are multiple processes involved, or is this all within a single
process?

-- 
Cheers,

David / dhildenb
* Re: Direct I/O performance problems with 1GB pages

From: Andres Freund @ 2025-01-27 19:36 UTC
  To: David Hildenbrand
  Cc: Matthew Wilcox, linux-mm, linux-block, Muchun Song, Jane Chu

Hi,

On 2025-01-27 20:20:41 +0100, David Hildenbrand wrote:
> On 27.01.25 18:25, Andres Freund wrote:
> > On 2025-01-27 15:09:23 +0100, David Hildenbrand wrote:
> > Unfortunately for the VMs with those disks I don't have access to hardware
> > performance counters :(.
> > 
> > > Maybe there is a link to the report you could share, thanks.
> > 
> > A profile of the "original" case where I hit this, without the patch that
> > Willy linked to:
> > 
> > Note this is a profile *not* using hardware perf counters, thus likely to be
> > rather skewed:
> > https://gist.github.com/anarazel/304aa6b81d05feb3f4990b467d02dabc
> > (this was on Debian Sid's 6.12.6)
> > 
> > Without the patch I achieved ~18GB/s with 1GB pages and ~35GB/s with 2MB
> > pages.
> 
> Out of interest, did you ever compare it to 4k?

I didn't.  Postgres will always do at least 8kB (unless compiled with
non-default settings).  But I also don't think I tested just doing 8kB on
that VM.  I doubt I'd have gotten close to the max, even with 2MB huge
pages.  At least not without block-layer-level merging of IOs.

If it's particularly interesting, I can bring a similar VM up and run that
comparison.

> > This time it's actual hardware perf counters...
> > 
> > Relevant details about the c2c report, excerpted from IRC:
> > 
> > andres | willy: Looking at a bit more detail into the c2c report, it looks
> >          like the dirtying is due to folio->_pincount and folio->_refcount in
> >          about equal measure and folio->flags being modified in
> >          gup_fast_fallback().  The modifications then, unsurprisingly, cause a
> >          lot of cache misses for reads (like in bio_set_pages_dirty() and
> >          bio_check_pages_dirty()).
> > 
> >  willy | andres: that makes perfect sense, thanks
> >  willy | really, the only way to fix that is to split it up
> >  willy | and either we can split it per-cpu or per-physical-address-range
> 
> As discussed, even better is "not repeatedly pinning/unpinning" at all :)

Indeed ;)

> I'm curious, are multiple processes involved, or is this all within a single
> process?

In the test case here multiple processes are involved, I was testing a
parallel sequential scan, with a high limit to the parallelism.

There are cases in which a fair bit of read IO is done from a single process
(e.g. to prewarm the buffer pool after a restart, that's currently done by a
single process), but it's more common for high throughput to happen across
multiple processes.  With modern drives a single task won't be able to
execute non-trivial queries at full disk speed.

Greetings,

Andres Freund
* Re: Direct I/O performance problems with 1GB pages

From: Christoph Hellwig @ 2025-01-28 5:56 UTC
  To: Matthew Wilcox
  Cc: linux-mm, linux-block, Muchun Song, Jane Chu, David Hildenbrand

I read through this and unfortunately have nothing useful to contribute
to the actual lock sharding.  Just two semi-related bits:

On Sun, Jan 26, 2025 at 12:46:45AM +0000, Matthew Wilcox wrote:
> Postgres are experimenting with doing direct I/O to 1GB hugetlb pages.
> Andres has gathered some performance data showing significantly worse
> performance with 1GB pages compared to 2MB pages.  I sent a patch
> recently which improves matters [1], but problems remain.
> 
> The primary problem we've identified is contention of folio->_refcount
> with a strong secondary contention on folio->_pincount.  This is coming
> from the call chain:
> 
> iov_iter_extract_pages ->
>   gup_fast_fallback ->
>     try_grab_folio_fast

Eww, gup_fast_fallback sent me down the wrong path, as it suggests that
it is the fallback slow path, but it's not.  It got renamed from
internal_get_user_pages_fast in commit 23babe1934d7 ("mm/gup:
consistently name GUP-fast functions").  While the old name wasn't all
that great, the new one including "fallback" is just horrible.  Can we
please fix this up?

The other thing is that the whole GUP machinery takes a reference per
page fragment it touches and not just per folio fragment.  I'm not sure
how fair access to the atomics is, but I suspect the multiple calls to
increment/decrement them per operation probably don't make the lock
contention any better.
* Re: Direct I/O performance problems with 1GB pages

From: David Hildenbrand @ 2025-01-28 9:47 UTC
  To: Christoph Hellwig, Matthew Wilcox
  Cc: linux-mm, linux-block, Muchun Song, Jane Chu

On 28.01.25 06:56, Christoph Hellwig wrote:
> I read through this and unfortunately have nothing useful to contribute
> to the actual lock sharding.  Just two semi-related bits:
> 
> On Sun, Jan 26, 2025 at 12:46:45AM +0000, Matthew Wilcox wrote:
>> Postgres are experimenting with doing direct I/O to 1GB hugetlb pages.
>> Andres has gathered some performance data showing significantly worse
>> performance with 1GB pages compared to 2MB pages.  I sent a patch
>> recently which improves matters [1], but problems remain.
>>
>> The primary problem we've identified is contention of folio->_refcount
>> with a strong secondary contention on folio->_pincount.  This is coming
>> from the call chain:
>>
>> iov_iter_extract_pages ->
>>   gup_fast_fallback ->
>>     try_grab_folio_fast
> 
> Eww, gup_fast_fallback sent me down the wrong path, as it suggests that
> it is the fallback slow path, but it's not.  It got renamed from
> internal_get_user_pages_fast in commit 23babe1934d7 ("mm/gup:
> consistently name GUP-fast functions").  While the old name wasn't all
> that great, the new one including "fallback" is just horrible.  Can we
> please fix this up?

Yes, I did that renaming as part of that series after the name was suggested
during review.  I got confused myself reading this report.

internal_get_user_pages_fast() -> gup_fast_fallback()

Was certainly an improvement.  Naming is hard, we want to express "try fast,
but fallback to slow if it fails".

Maybe "gup_fast_with_fallback" might be clearer, not sure.

> 
> The other thing is that the whole GUP machinery takes a reference per
> page fragment it touches and not just per folio fragment.

Right, that's what calling code expects: hands out pages, one ref per page.

> I'm not sure how fair access to the atomics is, but I suspect the
> multiple calls to increment/decrement them per operation probably
> don't make the lock contention any better.

The "advanced" logic in __get_user_pages() makes sure that if you GUP
multiple pages covered by a single PMD/PUD, we will adjust the refcount
only twice (once deep down in the page table walker, then in
__get_user_pages()).

There is certainly a lot of room for improvement in these page table
walkers ...

-- 
Cheers,

David / dhildenb
* Re: Direct I/O performance problems with 1GB pages

From: Christoph Hellwig @ 2025-01-29 6:03 UTC
  To: David Hildenbrand
  Cc: Christoph Hellwig, Matthew Wilcox, linux-mm, linux-block, Muchun Song, Jane Chu

On Tue, Jan 28, 2025 at 10:47:12AM +0100, David Hildenbrand wrote:
> Yes, I did that renaming as part of that series after the name was suggested
> during review.  I got confused myself reading this report.
> 
> internal_get_user_pages_fast() -> gup_fast_fallback()
> 
> Was certainly an improvement.  Naming is hard, we want to express "try fast,
> but fallback to slow if it fails".
> 
> Maybe "gup_fast_with_fallback" might be clearer, not sure.

gup_common?