[LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes


* [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
From: Hannes Reinecke @ 2026-02-19  9:54 UTC (permalink / raw)
  To: lsf-pc, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, linux-mm

Hi all,

I (together with the Czech Technical University) did some experiments 
trying to measure memory fragmentation with large block sizes.
Testbed used was an nvme setup talking to a nvmet storage over
the network.

Doing so raised some challenges:

- How do you _generate_ memory fragmentation? The MM subsystem is
   precisely geared up to avoid it, so you would need to come up
   with some idea how to defeat it. With the help from Willy I managed
   to come up with something, but I really would like to discuss
   what would be the best option here.
- What is acceptable memory fragmentation? Are we good enough if the
   measured fragmentation does not grow during the test runs?
- Do we have better visibility into memory fragmentation other than
   just reading /proc/buddyinfo?
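
One way to get more out of the buddyinfo numbers is to condense each row into a single figure per target order. The following is only a minimal sketch, assuming the per-order free block counts have already been read from one /proc/buddyinfo row; the function name and the sample counts are illustrative and not part of the experiments described here. It computes the fraction of free memory held in blocks too small to satisfy a request of the given order, which, as far as I know, is the same idea as the kernel's "unusable free space index".

```python
def unusable_free_index(counts, order):
    """Fraction of free memory sitting in blocks smaller than `order`.

    counts[i] is the number of free blocks of order i, as found in one
    /proc/buddyinfo row. 0.0 means all free memory is usable for a
    request of this order; 1.0 means none of it is.
    """
    # Total free memory in base pages: each order-o block holds 2**o pages.
    free_pages = sum(n << o for o, n in enumerate(counts))
    if free_pages == 0:
        return 1.0
    # Pages held in blocks that are large enough for the request.
    usable_pages = sum(n << o for o, n in enumerate(counts) if o >= order)
    return (free_pages - usable_pages) / free_pages


# Hypothetical counts for orders 0..3 (not measured data).
sample = [8, 4, 2, 1]
for order in range(len(sample)):
    print(order, unusable_free_index(sample, order))
```

Tracking this value for the order that matches the block size (for example order 2 for 16k blocks on 4k pages) over a test run would show whether fragmentation grows, rather than just how many free blocks exist.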

And, of course, I would like to present (and discuss) the results
of the testruns done on 4k, 8k, and 16k blocksizes.

Not sure if this should be a storage or MM topic; I'll let the
lsf-pc decide.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
From: Theodore Tso @ 2026-02-19 14:32 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: lsf-pc, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, linux-mm

On Thu, Feb 19, 2026 at 10:54:48AM +0100, Hannes Reinecke wrote:
> Hi all,
> 
> I (together with the Czech Technical University) did some experiments trying
> to measure memory fragmentation with large block sizes.
> Testbed used was an nvme setup talking to a nvmet storage over
> the network.
> 
> Doing so raised some challenges:
> 
> - How do you _generate_ memory fragmentation? The MM subsystem is
>   precisely geared up to avoid it, so you would need to come up
>   with some idea how to defeat it. With the help from Willy I managed
>   to come up with something, but I really would like to discuss
>   what would be the best option here.

I'm trying to understand the goal of the experiment.  I'm guessing
that the goal was to see how much memory fragmentation would result
from using large block sizes with the control being to use, say, 4k
blocks.  Is that correct?

So I guess the question here is what realistic workloads people
would have in real-world situations, so we can do A-B experiments
to see what using LBS results in?

> - What is acceptable memory fragmentation? Are we good enough if the
>   measured fragmentation does not grow during the test runs?

I can think of two possible metrics.  The first is whether it results
in degradation of performance given certain real world workloads.

The second is whether given a particular memory pressure, the memory
fragmentation results in more jobs getting OOM killed.

                                        - Ted


* Re: [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
From: Bart Van Assche @ 2026-02-19 14:53 UTC (permalink / raw)
  To: Hannes Reinecke, lsf-pc, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, linux-mm

On 2/19/26 1:54 AM, Hannes Reinecke wrote:
> I (together with the Czech Technical University) did some experiments 
> trying to measure memory fragmentation with large block sizes.
> Testbed used was an nvme setup talking to a nvmet storage over
> the network.
> 
> Doing so raised some challenges:
> 
> - How do you _generate_ memory fragmentation? The MM subsystem is
>    precisely geared up to avoid it, so you would need to come up
>    with some idea how to defeat it. With the help from Willy I managed
>    to come up with something, but I really would like to discuss
>    what would be the best option here.
> - What is acceptable memory fragmentation? Are we good enough if the
>    measured fragmentation does not grow during the test runs?
> - Do we have better visibility into memory fragmentation other than
>    just reading /proc/buddyinfo?

The larger the block size, the higher the write amplification factor
(WAF), isn't it? Why increase the block size when there is a solution
available that doesn't increase WAF, namely zoned storage?

Additionally, why is contiguous memory required for block sizes
larger than the page size? Does this perhaps come from the VFS layer?
If so, is this something that can be fixed?

Thanks,

Bart.


* Re: [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
From: Matthew Wilcox @ 2026-02-19 15:00 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Hannes Reinecke, lsf-pc, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, linux-mm

On Thu, Feb 19, 2026 at 06:53:28AM -0800, Bart Van Assche wrote:
> Additionally, why is contiguous memory required for block sizes
> larger than the page size? Does this perhaps come from the VFS layer?
> If so, is this something that can be fixed?

No.


* Re: [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
From: Hannes Reinecke @ 2026-02-20  7:44 UTC (permalink / raw)
  To: Theodore Tso
  Cc: lsf-pc, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, linux-mm

On 2/19/26 15:32, Theodore Tso wrote:
> On Thu, Feb 19, 2026 at 10:54:48AM +0100, Hannes Reinecke wrote:
>> Hi all,
>>
>> I (together with the Czech Technical University) did some experiments trying
>> to measure memory fragmentation with large block sizes.
>> Testbed used was an nvme setup talking to a nvmet storage over
>> the network.
>>
>> Doing so raised some challenges:
>>
>> - How do you _generate_ memory fragmentation? The MM subsystem is
>>    precisely geared up to avoid it, so you would need to come up
>>    with some idea how to defeat it. With the help from Willy I managed
>>    to come up with something, but I really would like to discuss
>>    what would be the best option here.
> 
> I'm trying to understand the goal of the experiment.  I'm guessing
> that the goal was to see how much memory fragmentation would result
> from using large block sizes with the control being to use, say, 4k
> blocks.  Is that correct?
> 
The main goal was to figure out if we have increased memory 
fragmentation when using LBS.
Clearly, most (internal) allocations still work on page-sized
objects, so one can argue that using LBS might increase fragmentation.
On the other hand, all _filesystem_ objects will be in LBS sizes,
so we won't increase fragmentation if we only allocate in LBS sizes.
So which is it?

> So I guess the question here is what realistic workloads people
> would have in real-world situations, so we can do A-B experiments
> to see what using LBS results in?
> 
Yes.

>> - What is acceptable memory fragmentation? Are we good enough if the
>>    measured fragmentation does not grow during the test runs?
> 
> I can think of two possible metrics.  The first is whether it results
> in degradation of performance given certain real world workloads.
> 
> The second is whether given a particular memory pressure, the memory
> fragmentation results in more jobs getting OOM killed.
> 
That would be ideal, but we first need to have a program exerting
memory pressure...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
From: Bart Van Assche @ 2026-03-16 23:26 UTC (permalink / raw)
  To: Hannes Reinecke, lsf-pc, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, linux-mm, Theodore Ts'o

On 2/19/26 6:53 AM, Bart Van Assche wrote:
> On 2/19/26 1:54 AM, Hannes Reinecke wrote:
>> I (together with the Czech Technical University) did some experiments 
>> trying to measure memory fragmentation with large block sizes.
>> Testbed used was an nvme setup talking to a nvmet storage over
>> the network.
>>
>> Doing so raised some challenges:
>>
>> - How do you _generate_ memory fragmentation? The MM subsystem is
>>    precisely geared up to avoid it, so you would need to come up
>>    with some idea how to defeat it. With the help from Willy I managed
>>    to come up with something, but I really would like to discuss
>>    what would be the best option here.
>> - What is acceptable memory fragmentation? Are we good enough if the
>>    measured fragmentation does not grow during the test runs?
>> - Do we have better visibility into memory fragmentation other than
>>    just reading /proc/buddyinfo?
> 
> The larger the block size, the higher the write amplification factor
> (WAF), isn't it? Why increase the block size when there is a solution
> available that doesn't increase WAF, namely zoned storage?

(replying to my own email)

The following paper shows that it is possible to achieve great
performance with filesystems like ext4 and ZNS SSDs by implementing
an FTL in software (ZTL). This could be a more interesting approach
than optimizing host software for large indirection units. See also
Sass, Jan, André Brinkmann, Matias Bjørling, Xubin He, and Reza
Salkhordeh. "ZTL: A block layer ZNS driver." Journal of Systems
Architecture (2026): 103757.
(https://www.sciencedirect.com/science/article/pii/S1383762126000755).

Bart.


* Re: [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
From: Matthew Wilcox @ 2026-05-01 14:33 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: lsf-pc, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, linux-mm

On Thu, Feb 19, 2026 at 10:54:48AM +0100, Hannes Reinecke wrote:
> I (together with the Czech Technical University) did some experiments trying
> to measure memory fragmentation with large block sizes.
> 
> Doing so raised some challenges:
> 
> - How do you _generate_ memory fragmentation? The MM subsystem is
>   precisely geared up to avoid it, so you would need to come up
>   with some idea how to defeat it. With the help from Willy I managed
>   to come up with something, but I really would like to discuss
>   what would be the best option here.
> - What is acceptable memory fragmentation? Are we good enough if the
>   measured fragmentation does not grow during the test runs?
> - Do we have better visibility into memory fragmentation other than
>   just reading /proc/buddyinfo?
> 
> And, of course, I would like to present (and discuss) the results
> of the testruns done on 4k, 8k, and 16k blocksizes.

I think that Rik's recent work is going to affect discussion of this
topic (summary: with a "small amount" of work, reliable allocation of
1GB folios is possible):

https://lore.kernel.org/linux-mm/20260430202233.111010-1-riel@surriel.com/

but another aspect to it is the recent performance problem reported by
Amazon (summary: compaction takes too long):

https://lore.kernel.org/linux-mm/20260428150240.3009-1-dipiets@amazon.it/

Anyway, I'm putting you on notice that I may hijack this session to talk
about how GFP flags suck.  I may even have a proposal for a replacement,
depending how inspired I am over the next few days.

I still think this discussion is useful because we wouldn't want an
attacker to be able to make Linux unreliable.  So it's useful to think
about how userspace can make memory unreclaimable and if large folios
make the problem worse in any meaningful way.


* Re: [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
From: Christoph Hellwig @ 2026-06-09  7:28 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: lsf-pc, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, linux-mm

Hannes,

can you share your results on the mailing list?

On Thu, Feb 19, 2026 at 10:54:48AM +0100, Hannes Reinecke wrote:
> Hi all,
> 
> I (together with the Czech Technical University) did some experiments trying
> to measure memory fragmentation with large block sizes.
> Testbed used was an nvme setup talking to a nvmet storage over
> the network.
> 
> Doing so raised some challenges:
> 
> - How do you _generate_ memory fragmentation? The MM subsystem is
>   precisely geared up to avoid it, so you would need to come up
>   with some idea how to defeat it. With the help from Willy I managed
>   to come up with something, but I really would like to discuss
>   what would be the best option here.
> - What is acceptable memory fragmentation? Are we good enough if the
>   measured fragmentation does not grow during the test runs?
> - Do we have better visibility into memory fragmentation other than
>   just reading /proc/buddyinfo?
> 
> And, of course, I would like to present (and discuss) the results
> of the testruns done on 4k, 8k, and 16k blocksizes.
> 
> Not sure if this should be a storage or MM topic; I'll let the
> lsf-pc decide.
> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
> 
> 
---end quoted text---



* Re: [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
From: Hannes Reinecke @ 2026-06-09  8:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: lsf-pc, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, linux-mm


On 6/9/26 09:28, Christoph Hellwig wrote:
> Hannes,
> 
> can you share your results on the mailing list?
> 
I sure can.

We ran a simple test case with one fio job on an LBS-enabled device,
and another job continuously allocating and freeing arrays of pages
of various lengths.

We then took snapshots of /proc/buddyinfo to track memory pressure
over time.
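
For anyone wanting to reproduce the snapshotting, here is a minimal sketch of how the per-order counts could be collected and compared. It assumes the usual /proc/buddyinfo row layout ("Node N, zone NAME" followed by one free block count per order); the function names are illustrative, and this is not the script used for the runs described here.

```python
from pathlib import Path


def parse_buddyinfo(text):
    """Return {(node, zone): [free block count per order]} from buddyinfo text."""
    zones = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: "Node 0, zone   Normal  c0 c1 c2 ..."
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node = int(fields[1].rstrip(","))
        zones[(node, fields[3])] = [int(c) for c in fields[4:]]
    return zones


def read_buddyinfo(path="/proc/buddyinfo"):
    """Take one snapshot of the running system (Linux only)."""
    return parse_buddyinfo(Path(path).read_text())


def delta(before, after):
    """Per-order change in free block counts between two snapshots."""
    return {
        key: [b - a for a, b in zip(before[key], after[key])]
        for key in before.keys() & after.keys()
    }
```

Calling read_buddyinfo() at a fixed interval while the fio job runs, and feeding consecutive snapshots to delta(), gives a per-zone, per-order time series that can be plotted directly.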

Results are visualized in the attached plot.

With 4k block sizes we have seen a high number of order-0 and order-1
pages, and then the expected decline towards higher orders.

With 8k and 16k block sizes a noticeable 'bump' in free pages was
developing in order-2 and order-3 pages, which we think is down to
compaction trying to merge pages together.
The number of order-0 pages increased slightly, but only half of the
maximum number of pages in the 'bump'.

With 32k block sizes the picture changed completely; the 'bump'
vanished, and there was only a pronounced spike in order-0 pages
(about four times the size of the spike with 4k block sizes).

This led me to assume that compaction broke down at 32k block sizes;
this assumption was confirmed by Vlastimil Babka who pointed out that
there is a maximum order to which page compaction is attempted:

include/linux/mmzone.h:
/*
  * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
  * costly to service.  That is between allocation orders which should
  * coalesce naturally under reasonable reclaim pressure and those which
  * will not.
  */
#define PAGE_ALLOC_COSTLY_ORDER 3

and its main usage is 'order > PAGE_ALLOC_COSTLY_ORDER'.
Which ties in directly with what we're seeing.

It will probably make sense to align the maximum block size which we
currently support (ie 64k) with this value to ensure that compaction
works with larger block sizes. Or maybe even the other way round;
tie the maximum block size which we support to PAGE_ALLOC_COSTLY_ORDER.
But that would mean to restrict the blocksize to 16k, whereas xfs
works happily with 32k. So we might want to raise PAGE_ALLOC_COSTLY_ORDER.

Question is, though, how could we measure the impact?
This particular value has been in since 2007 (commit 5ad333eb66ff1 
'lumpy reclaim V4'), and it might well be that the original
reasoning doesn't apply anymore.

At the same time, this value is tied to a _LOT_ of things
(not to mention the page allocator itself), so increasing it
to '4' has an extremely high chance of impacting mm performance.

I'll probably run mmtests and see what I get.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

[-- Attachment #2: fragmentation.png --]
[-- Type: image/png, Size: 7250 bytes --]


* increasing PAGE_ALLOC_COSTLY_ORDER, was Re: [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
From: Christoph Hellwig @ 2026-06-09  9:02 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Andy Whitcroft, linux-fsdevel,
	linux-mm

On Tue, Jun 09, 2026 at 10:39:14AM +0200, Hannes Reinecke wrote:
> With 32k block sizes the picture changed completely; the 'bump'
> vanished, and there was only a pronounced spike in order-0 pages
> (about four times the size of the spike with 4k block sizes).
> 
> This led me to assume that compaction broke down at 32k block sizes;
> this assumption was confirmed by Vlastimil Babka who pointed out that
> there is a maximum order to which page compaction is attempted:
> 
> include/linux/mmzone.h:
> /*
>  * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
>  * costly to service.  That is between allocation orders which should
>  * coalesce naturally under reasonable reclaim pressure and those which
>  * will not.
>  */
> #define PAGE_ALLOC_COSTLY_ORDER 3
> 
> and its main usage is 'order > PAGE_ALLOC_COSTLY_ORDER'.
> Which ties in directly with what we're seeing.

Yes, we've also seen similar issues with memory allocations beyond
PAGE_ALLOC_COSTLY_ORDER.  Right now the maximum block size supported
for XFS is 64k, and there are various workloads for which we'd like to
make use of that.  Increasing PAGE_ALLOC_COSTLY_ORDER to cover 64k
allocations would be extremely helpful to make it perform well.

How has PAGE_ALLOC_COSTLY_ORDER been chosen?  And what would it take
to increase it?

> It will probably make sense to align the maximum block size which we
> currently support (ie 64k) with this value to ensure that compaction
> works with larger block sizes. Or maybe even the other way round;
> tie the maximum block size which we support to PAGE_ALLOC_COSTLY_ORDER.
> But that would mean to restrict the blocksize to 16k, whereas xfs
> works happily with 32k. So we might want to raise PAGE_ALLOC_COSTLY_ORDER.

Block sizes larger than PAGE_ALLOC_COSTLY_ORDER work, but they don't
perform great.  So I don't think a hard limit would be good.

> Question is, though, how could we measure the impact?
> This particular value has been in since 2007 (commit 5ad333eb66ff1 'lumpy
> reclaim V4'), and it might well be that the original
> reasoning doesn't apply anymore.

I've not even seen much of a reasoning for picking the value in that
commit.  Or did I miss something?




* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
From: Vlastimil Babka @ 2026-06-09  9:37 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: lsf-pc, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, linux-mm

On 6/9/26 10:39, Hannes Reinecke wrote:
> On 6/9/26 09:28, Christoph Hellwig wrote:
>> Hannes,
>> 
>> can you share your results on the mailing list?
>> 
> I sure can.
> 
> We ran a simple test case with one fio job on an LBS-enabled device,
> and another job continuously allocating and freeing arrays of pages
> of various lengths.
> 
> We then took snapshots of /proc/buddyinfo to track memory pressure
> over time.
> 
> Results are visualized in the attached plot.
> 
> With 4k block sizes we have seen a high number of order-0 and order-1
> pages, and then the expected decline towards higher orders.
> 
> With 8k and 16k block sizes a noticeable 'bump' in free pages was
> developing in order-2 and order-3 pages, which we think is down to
> compaction trying to merge pages together.
> The number of order-0 pages increased slightly, but only half of the
> maximum number of pages in the 'bump'.
> 
> With 32k block sizes the picture changed completely; the 'bump'
> vanished, and there was only a pronounced spike in order-0 pages
> (about four times the size of the spike with 4k block sizes).
> 
> This led me to assume that compaction broke down at 32k block sizes;
> this assumption was confirmed by Vlastimil Babka who pointed out that
> there is a maximum order to which page compaction is attempted:

Yep, but after the LSF/MM session I've realized I made an off-by-one error
thanks to the misleading name of the define.

> include/linux/mmzone.h:
> /*
>   * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
>   * costly to service.  That is between allocation orders which should
>   * coalesce naturally under reasonable reclaim pressure and those which
>   * will not.
>   */
> #define PAGE_ALLOC_COSTLY_ORDER 3

Which is 32k

> and its main usage is 'order > PAGE_ALLOC_COSTLY_ORDER'.

And indeed, this means any changes in behavior due to this should only
happen at 64k (order 4) or more. The value of 3 is in fact something like
"PAGE_ALLOC_MAX_CHEAP_ORDER". I'd send a rename patch (Kiryl fixed a similar
off-by-one gotcha with MAX_ORDER), but I suspect we'll be making more
involved changes here so I'd wait for that first.
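
To make the off-by-one concrete, here is a small sketch of the arithmetic, assuming 4 KiB base pages (an assumption; other page sizes shift every row) and the value of 3 quoted from mmzone.h. The helper names are made up for illustration.

```python
PAGE_SIZE = 4096                 # assumption: 4 KiB base pages
PAGE_ALLOC_COSTLY_ORDER = 3      # value quoted from include/linux/mmzone.h


def order_for_block_size(block_size, page_size=PAGE_SIZE):
    """Smallest order whose 2**order pages hold one block."""
    pages = -(-block_size // page_size)   # ceiling division
    return (pages - 1).bit_length()


def is_costly(order):
    """Mirror of the 'order > PAGE_ALLOC_COSTLY_ORDER' comparison."""
    return order > PAGE_ALLOC_COSTLY_ORDER


for kib in (4, 8, 16, 32, 64):
    order = order_for_block_size(kib * 1024)
    print(f"{kib:>2}k block -> order {order}, costly: {is_costly(order)}")
```

With these assumptions a 32k block maps to order 3, which the strict comparison still treats as non-costly; 64k (order 4) is the first size on the other side of the check.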

> Which ties in directly with what we're seeing.

So it's probably not that straightforward. We should investigate more first?

> It will probably make sense to align the maximum block size which we
> currently support (ie 64k) with this value to ensure that compaction
> works with larger block sizes. Or maybe even the other way round;
> tie the maximum block size which we support to PAGE_ALLOC_COSTLY_ORDER.
> But that would mean to restrict the blocksize to 16k, whereas xfs
> works happily with 32k. So we might want to raise PAGE_ALLOC_COSTLY_ORDER.
> 
> Question is, though, how could we measure the impact?
> This particular value has been in since 2007 (commit 5ad333eb66ff1 
> 'lumpy reclaim V4'), and it might well be that the original
> reasoning doesn't apply anymore.
> 
> At the same time, this value is tied to a _LOT_ of things
> (not to mention the page allocator itself), so increasing it
> to '4' has an extremely high chance of impacting mm performance.
> 
> I'll probably run mmtests and see what I get.
> 
> Cheers,
> 
> Hannes
> 
> 
> fragmentation.png
> 
> 
> _______________________________________________
> Lsf-pc mailing list
> Lsf-pc@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/lsf-pc




* Re: increasing PAGE_ALLOC_COSTLY_ORDER, was Re: [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
From: Hannes Reinecke @ 2026-06-09  9:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Andy Whitcroft, linux-fsdevel,
	linux-mm

On 6/9/26 11:02, Christoph Hellwig wrote:
> On Tue, Jun 09, 2026 at 10:39:14AM +0200, Hannes Reinecke wrote:
>> With 32k block sizes the picture changed completely; the 'bump'
>> vanished, and there was only a pronounced spike in order-0 pages
>> (about four times the size of the spike with 4k block sizes).
>>
>> This led me to assume that compaction broke down at 32k block sizes;
>> this assumption was confirmed by Vlastimil Babka who pointed out that
>> there is a maximum order to which page compaction is attempted:
>>
>> include/linux/mmzone.h:
>> /*
>>   * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
>>   * costly to service.  That is between allocation orders which should
>>   * coalesce naturally under reasonable reclaim pressure and those which
>>   * will not.
>>   */
>> #define PAGE_ALLOC_COSTLY_ORDER 3
>>
>> and its main usage is 'order > PAGE_ALLOC_COSTLY_ORDER'.
>> Which ties in directly with what we're seeing.
> 
> Yes, we've also seen similar issues with memory allocations beyond
> PAGE_ALLOC_COSTLY_ORDER.  Right now the maximum block size supported
> for XFS is 64k, and there are various workloads for which we'd like to
> make use of that.  Increasing PAGE_ALLOC_COSTLY_ORDER to cover 64k
> allocations would be extremely helpful to make it perform well.
> 
> How has PAGE_ALLOC_COSTLY_ORDER been chosen?  And what would it take
> to increase it?
> 
Not much; it's basically a setting which we can change.
Measuring the _impact_ of the change would be the tricky thing.

>> It will probably make sense to align the maximum block size which we
>> currently support (ie 64k) with this value to ensure that compaction
>> works with larger block sizes. Or maybe even the other way round;
>> tie the maximum block size which we support to PAGE_ALLOC_COSTLY_ORDER.
>> But that would mean to restrict the blocksize to 16k, whereas xfs
>> works happily with 32k. So we might want to raise PAGE_ALLOC_COSTLY_ORDER.
> 
> Block sizes larger than PAGE_ALLOC_COSTLY_ORDER work, but they don't
> perform great.  So I don't think a hard limit would be good.
> 
I was thinking of tying the block size to PAGE_ALLOC_COSTLY_ORDER, and
making PAGE_ALLOC_COSTLY_ORDER a Kconfig setting.
That should give us enough flexibility to test things out.

But sure, just making PAGE_ALLOC_COSTLY_ORDER configurable would be
enough here for now. Until someone comes along wanting to have larger
block sizes.

>> Question is, though, how could we measure the impact?
>> This particular value has been in since 2007 (commit 5ad333eb66ff1 'lumpy
>> reclaim V4'), and it might well be that the original
>> reasoning doesn't apply anymore.
> 
> I've not even seen much of a reasoning for picking the value in that
> commit.  Or did I miss something?
> 
I haven't seen anything. It was merged in 2007, and I would suspect
that with compaction (which went in 2010) most of its reasoning will
be obsolete anyway.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
From: Karim Manaouil @ 2026-06-10 22:27 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: lsf-pc, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, linux-mm

On Thu, Feb 19, 2026 at 10:54:48AM +0100, Hannes Reinecke wrote:
> Hi all,
> 
> I (together with the Czech Technical University) did some experiments trying
> to measure memory fragmentation with large block sizes.
> Testbed used was an nvme setup talking to a nvmet storage over
> the network.
> 
> Doing so raised some challenges:
> 
> - How do you _generate_ memory fragmentation? The MM subsystem is
>   precisely geared up to avoid it, so you would need to come up
>   with some idea how to defeat it. With the help from Willy I managed
>   to come up with something, but I really would like to discuss
>   what would be the best option here.

thpchallenge from mmtests has been a staple for the
compaction/anti-fragmentation folks.

And check this https://patchwork.freedesktop.org/patch/716404/?series=164353&rev=1

Btw, do you mind sharing what workloads you discussed with Matthew?

> - What is acceptable memory fragmentation? Are we good enough if the
>   measured fragmentation does not grow during the test runs?
> - Do we have better visibility into memory fragmentation other than
>   just reading /proc/buddyinfo?
> 
> And, of course, I would like to present (and discuss) the results
> of the testruns done on 4k, 8k, and 16k blocksizes.
> 
> Not sure if this should be a storage or MM topic; I'll let the
> lsf-pc decide.
> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
> 

-- 
~karim
