From: Salvatore Dipietro <dipiets@amazon.it>
To: <ritesh.list@gmail.com>
Cc: <abuehaze@amazon.com>, <alisaidi@amazon.com>,
<blakgeof@amazon.com>, <brauner@kernel.org>,
<dipietro.salvatore@gmail.com>, <dipiets@amazon.it>,
<djwong@kernel.org>, <linux-fsdevel@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
<linux-xfs@vger.kernel.org>, <stable@vger.kernel.org>,
<willy@infradead.org>
Subject: Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
Date: Tue, 28 Apr 2026 15:02:38 +0000 [thread overview]
Message-ID: <20260428150240.3009-1-dipiets@amazon.it> (raw)
In-Reply-To: <cxztt8nw.ritesh.list@gmail.com>
On 4/21/26 00:43, Ritesh Harjani wrote:
> Also, given that the Maintainers (willy, Christoph, Dave) have shown their
> disinterest in taking the patch in its current form, the right way is
> to get back with performance data with both the approaches (which we
> were discussing) and first get the consensus from everyone, before
> proposing this as a patch :).
Thank you for the follow-up and the additional context, Ritesh.
I might have misunderstood the previous request and will make sure to
link back to previous patch versions in the future.
Here are the performance results (PGBench transactions per second; higher
is better) that we have collected on our end with the proposed patches:
| Patch | Run 1 | Run 2 | Run 3 | Average | % vs Baseline |
|----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
| Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | — |
| Proposed patch | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20 | +38.45% |
| Ritesh's suggestion | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61 | +36.50% |
| Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82 | +41.07% |
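For anyone following only this thread: all three variants relax how hard the
page allocator works for costly folio orders on the buffered-write path. As a
rough sketch only (the helper name below is made up, and the exact gfp-flag
choice and call site are precisely what is under discussion), the common idea
looks like:

  #include <linux/gfp.h>
  #include <linux/mmzone.h>

  /*
   * Illustrative sketch only, not the literal patch: for costly
   * orders, drop direct reclaim (and with it direct compaction) so
   * that a failed large-folio allocation falls back to a smaller
   * order instead of stalling the writer in the allocator slow path.
   */
  static inline gfp_t costly_folio_gfp(gfp_t gfp, unsigned int order)
  {
  	if (order > PAGE_ALLOC_COSTLY_ORDER)
  		gfp = (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN;
  	return gfp;
  }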
On 4/21/26 00:43, Ritesh Harjani wrote:
> In that context, I wanted to understand your setup a bit from
> memory fragmentation perspective. Are you trying to simulate memory
> fragmentation and then benchmarking? Or was this problem hitting when
> you simply run the reproduction steps mentioned in your cover
> letter?
All results were collected on fresh AWS instances as described in the
cover letter. Patch [1] was applied on all instances to avoid the other
regression, and each instance was restarted to pick up the patched kernel
and to ensure clean memory before installing and starting the PostgreSQL
benchmark via repro-collection [2].
We do not use any tool to fragment the memory in advance. While collecting
memory metrics on this system, we noticed that ~40% of memory is used by
PageTables, since PostgreSQL spawns a new process for each client, which
significantly limits the memory available for caching and leaves little
free memory.
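This is easy to see from /proc/meminfo while the benchmark is running, for
example:

  $ grep -E 'MemFree|Cached|PageTables' /proc/meminfo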
The PostgreSQL write pattern consists mostly of 8/16 KB writes, but during
database checkpoints, which run every 5 minutes by default, it flushes the
write-ahead logs to disk using large folios. At that point the system tries
to satisfy the high-order folio allocation request, triggering the
regression and falling into the slow path, as shown by the Linux perf
profile below:
`-0.26%-__arm64_sys_pwrite64
   `-0.26%-vfs_write
      `-0.26%-xfs_file_write_iter
         `-0.26%-xfs_file_buffered_write
            `-0.26%-iomap_file_buffered_write
               `-0.26%-iomap_write_iter
                  `-0.22%-iomap_write_begin
                     `-0.22%-iomap_get_folio
                        `-0.22%-__filemap_get_folio
                           `-0.21%-filemap_alloc_folio->alloc_pages
                              `-0.20%-__alloc_pages_slowpath
                                 |-0.12%-__alloc_pages_direct_compact
                                 |  `-0.12%-try_to_compact_pages
                                 |     `-0.11%-compact_zone
                                 |        `-0.11%-isolate_migratepages
                                 `-0.07%-__drain_all_pages
                                    `-0.07%-drain_pages_zone
                                       `-0.07%-free_pcppages_bulk
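(A profile like this can be captured with a plain system-wide call-graph
recording, e.g. "perf record -a -g -- sleep 30" followed by
"perf report --stdio"; our exact sampling options may have differed.)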
This is also visible in the intermediate PGBench results, which drop
significantly while a checkpoint is running:
# Normal execution:
[20260421.141505] [INFO] progress: 660.0 s, 138828.2 tps, lat 7.509 ms stddev 16.985, 0 failed
[20260421.141515] [INFO] progress: 670.0 s, 151505.1 tps, lat 6.708 ms stddev 8.308, 0 failed
[20260421.141525] [INFO] progress: 680.0 s, 166558.7 tps, lat 6.190 ms stddev 6.537, 0 failed
[20260421.141535] [INFO] progress: 690.0 s, 141267.1 tps, lat 7.246 ms stddev 5.951, 0 failed
# During checkpoints:
[20260421.141605] [INFO] progress: 720.0 s, 54119.8 tps, lat 18.894 ms stddev 81.816, 0 failed
[20260421.141615] [INFO] progress: 730.0 s, 55184.7 tps, lat 18.564 ms stddev 12.729, 0 failed
[20260421.141625] [INFO] progress: 740.0 s, 37334.0 tps, lat 27.302 ms stddev 25.060, 0 failed
[20260421.141635] [INFO] progress: 750.0 s, 53387.6 tps, lat 19.259 ms stddev 18.313, 0 failed
[20260421.141645] [INFO] progress: 760.0 s, 41247.3 tps, lat 24.805 ms stddev 24.116, 0 failed
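The correlation with checkpoints is straightforward to verify, since
checkpoint_timeout defaults to 5min and a checkpoint can also be forced by
hand to trigger the same throughput dip (standard PostgreSQL commands; the
timing numbers will of course vary):

  $ psql -c 'SHOW checkpoint_timeout;'
  $ psql -c 'CHECKPOINT;'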
On 4/21/26 00:43, Ritesh Harjani wrote:
> BTW - I was following the other thread too where PREEMPT_LAZY problem
> was getting discussed. And from what I understood, you mentioned [1]
> enabling THP on the system made that problem go away. Also it looks like
> enabling THP is the right thing to do for this kind of workload. Does
> that also mean enabling THP fixed this problem too? Do you still hit
> memory fragmentation and/or similar throughput drop w/o this fix after
> you enable THP? It will be good to know those details too please.
We have run more baseline benchmarks combining the PostgreSQL huge_pages
option (on, off), pre-allocating the shared-buffer memory with
"vm.nr_hugepages" (~25% of total memory, 2 MB pages), with the Transparent
Huge Pages (THP) modes (always, madvise, never). PostgreSQL performance
(PGBench TPS below) improves only when the PostgreSQL huge_pages option is
enabled with pre-allocated memory; THP has no significant effect on
PostgreSQL or system performance in this case.
| PG huge_pages + pre-alloc mem | THP | Run 1 | Run 2 | Run 3 | Average |
|-------------------------------|---------|--------:|--------:|--------:|--------:|
| on | never | 189,418 | 187,764 | 188,207 | 188,463 |
| on | always | 188,813 | 189,798 | 190,032 | 189,548 |
| on | madvise | 187,405 | 192,234 | 189,201 | 189,613 |
| off | never | 102,609 | 109,394 | 100,868 | 104,290 |
| off | always | 90,274 | 103,831 | 102,515 | 98,874 |
| off | madvise | 90,508 | 103,855 | 96,574 | 96,979 |
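For completeness, these are the knobs involved; the hugepage count below is
illustrative (it depends on the instance memory size), while the paths and
option names are the standard ones:

  # Reserve 2 MB hugepages for the PostgreSQL shared buffers
  $ sudo sysctl -w vm.nr_hugepages=8192
  # Select the THP mode under test
  $ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
  # postgresql.conf
  huge_pages = on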
[1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#m8baeeaf48aa7ae5342c8c2db8f4e1c27e03c1368
[2] https://github.com/aws/repro-collection.git