All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm/page_alloc: drop flag-conversion "optimisation"
@ 2026-06-12 14:15 Brendan Jackman
  2026-06-12 14:23 ` Brendan Jackman
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Brendan Jackman @ 2026-06-12 14:15 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Johannes Weiner, Zi Yan
  Cc: Harry Yoo (Oracle), Gregory Price, linux-mm, linux-kernel,
	Brendan Jackman

This code uses flag equivalences to try to optimise conversion from
GFP_ to ALLOC_ but there's no clear reason to believe it makes things
faster. Even if it gets rid of conditional branches, it just trades them
for a data dependency.

CPUs are pretty good at conditional branches. But, in my GCC x86 build
it doesn't look like there are any branches anyway, the compiler found
some conditional instruction tricks. (Caveat: This was extracted &
annotated by Gemini AI, I did not actually read the disasm myself)

Old code:

        ae50:    8b 04 24                 mov    (%rsp),%eax           # Load gfp_mask
        ...
        ae5d:    41 89 c4                 mov    %eax,%r12d
        ae64:    41 81 e4 20 08 00 00     and    $0x820,%r12d          # Mask both flags at once
        ...
        ae6f:    44 89 e1                 mov    %r12d,%ecx
        ae77:    83 c9 40                 or     $0x40,%ecx            # OR with ALLOC_CPUSET (0x40)
        ae7a:    89 4c 24 60              mov    %ecx,0x60(%rsp)       # Store to alloc_flags

New code:

  For  __GFP_HIGH  ( 0x20 ):
  It uses the Carry Flag (via  sbb ) to conditionally add  0x20  to the base  0x40  ( ALLOC_CPUSET ) flag:

        ae63:    83 e0 20                 and    $0x20,%eax            # Test __GFP_HIGH
        ...
        ae6a:    83 f8 01                 cmp    $0x1,%eax             # Set carry flag if 0
        ae6f:    45 19 e4                 sbb    %r12d,%r12d           # %r12d = (gfp & 0x20) ? 0 : -1
        ae80:    41 83 e4 e0              and    $0xffffffe0,%r12d     # %r12d = (gfp & 0x20) ? 0 : -32
        ae87:    41 83 c4 60              add    $0x60,%r12d           # %r12d = (gfp & 0x20) ? 0x60 : 0x40

  For  __GFP_KSWAPD_RECLAIM  ( 0x800 ):
  It uses a conditional move ( cmov ) later in the function to set the  ALLOC_KSWAPD  ( 0x800 ) bit:

        ae72:    25 00 08 00 00           and    $0x800,%eax           # Test __GFP_KSWAPD_RECLAIM
        ae77:    89 44 24 30              mov    %eax,0x30(%rsp)       # Store result
        ...
        af2c:    80 cf 08                 or     $0x8,%bh              # Set ALLOC_KSWAPD (0x800) in temp reg
        af2f:    45 85 c9                 test   %r9d,%r9d             # Check if __GFP_KSWAPD_RECLAIM was set
        af32:    0f 44 d8                 cmove  %eax,%ebx             # If not, revert to flags without it

Testing with a modified version[0] of lib/free_pages_test.c (adding
printks with timing)...

[0] https://github.com/bjackman/aethelred/blob/2ccdc84ef087c2a631914f58e106e99e19bd3b98/page-alloc-test/page-alloc-test.c

Old results from a Sapphire Rapids consumer CPU:

[   67.157118] page_alloc_test: Testing with GFP_KERNEL
[   67.157122] page_alloc_test: Starting 1,000,000 allocations...
[   70.704446] page_alloc_test: Completed. Time: 3543002 us (Avg: 3543.00 ns per alloc+free loop)
[   70.704456] page_alloc_test: Testing with GFP_KERNEL | __GFP_COMP
[   70.704460] page_alloc_test: Starting 1,000,000 allocations...
[   70.944672] page_alloc_test: Completed. Time: 239980 us (Avg: 239.98 ns per alloc+free loop)
[   70.944675] page_alloc_test: Test completed

New results:

[   70.079015] page_alloc_test: Testing with GFP_KERNEL
[   70.079020] page_alloc_test: Starting 1,000,000 allocations...
[   73.669396] page_alloc_test: Completed. Time: 3586954 us (Avg: 3586.95 ns per alloc+free loop)
[   73.669402] page_alloc_test: Testing with GFP_KERNEL | __GFP_COMP
[   73.669405] page_alloc_test: Starting 1,000,000 allocations...
[   73.905084] page_alloc_test: Completed. Time: 235496 us (Avg: 235.49 ns per alloc+free loop)
[   73.905086] page_alloc_test: Test completed

Seems like a wash.

So, drop the flag value coupling here and let the compiler and CPU do
their job. Superscalar CPUs are pretty neat after all.

(Used AI for the disasm but the rest is all manual).

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 mm/page_alloc.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee902a468c2f5..9e1949ea13a6d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4478,22 +4478,16 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 {
 	unsigned int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
 
-	/*
-	 * __GFP_HIGH is assumed to be the same as ALLOC_MIN_RESERVE
-	 * and __GFP_KSWAPD_RECLAIM is assumed to be the same as ALLOC_KSWAPD
-	 * to save two branches.
-	 */
-	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_MIN_RESERVE);
-	BUILD_BUG_ON(__GFP_KSWAPD_RECLAIM != (__force gfp_t) ALLOC_KSWAPD);
-
 	/*
 	 * The caller may dip into page reserves a bit more if the caller
 	 * cannot run direct reclaim, or if the caller has realtime scheduling
 	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
 	 * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
 	 */
-	alloc_flags |= (__force int)
-		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
+	if (gfp_mask & __GFP_HIGH)
+		alloc_flags |= ALLOC_MIN_RESERVE;
+	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+		alloc_flags |= ALLOC_KSWAPD;
 
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
 		/*

---
base-commit: ca2351ac6da277a470d4fcf122b53267e02b2716
change-id: 20260612-gfp-pessimisation-b258a4bb5ebd

Best regards,
-- 
Brendan Jackman <jackmanb@google.com>



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm/page_alloc: drop flag-conversion "optimisation"
  2026-06-12 14:15 [PATCH] mm/page_alloc: drop flag-conversion "optimisation" Brendan Jackman
@ 2026-06-12 14:23 ` Brendan Jackman
  2026-06-12 14:24 ` Vlastimil Babka (SUSE)
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Brendan Jackman @ 2026-06-12 14:23 UTC (permalink / raw)
  To: Brendan Jackman, Andrew Morton, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Johannes Weiner, Zi Yan
  Cc: Harry Yoo (Oracle), Gregory Price, linux-mm, linux-kernel

On Fri Jun 12, 2026 at 2:15 PM UTC, Brendan Jackman wrote:
> [0] https://github.com/bjackman/aethelred/blob/2ccdc84ef087c2a631914f58e106e99e19bd3b98/page-alloc-test/page-alloc-test.c
[...]
> (Used AI for the disasm but the rest is all manual).

Oh sorry, for full transparency: AI also set up the "benchmark" module
above based on mm/test_free_pages.c.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm/page_alloc: drop flag-conversion "optimisation"
  2026-06-12 14:15 [PATCH] mm/page_alloc: drop flag-conversion "optimisation" Brendan Jackman
  2026-06-12 14:23 ` Brendan Jackman
@ 2026-06-12 14:24 ` Vlastimil Babka (SUSE)
  2026-06-12 14:53 ` Brendan Jackman
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-06-12 14:24 UTC (permalink / raw)
  To: Brendan Jackman, Andrew Morton, Suren Baghdasaryan, Michal Hocko,
	Johannes Weiner, Zi Yan
  Cc: Harry Yoo (Oracle), Gregory Price, linux-mm, linux-kernel

On 6/12/26 16:15, Brendan Jackman wrote:
> This code uses flag equivalences to try to optimise conversion from
> GFP_ to ALLOC_ but there's no clear reason to believe it makes things
> faster. Even if it gets rid of conditional branches, it just trades them
> for a data dependency.
> 
> CPUs are pretty good at conditional branches. But, in my GCC x86 build
> it doesn't look like there are any branches anyway, the compiler found
> some conditional instruction tricks. (Caveat: This was extracted &
> annotated by Gemini AI, I did not actually read the disasm myself)
> 
> Old code:
> 
>         ae50:    8b 04 24                 mov    (%rsp),%eax           # Load gfp_mask
>         ...
>         ae5d:    41 89 c4                 mov    %eax,%r12d
>         ae64:    41 81 e4 20 08 00 00     and    $0x820,%r12d          # Mask both flags at once
>         ...
>         ae6f:    44 89 e1                 mov    %r12d,%ecx
>         ae77:    83 c9 40                 or     $0x40,%ecx            # OR with ALLOC_CPUSET (0x40)
>         ae7a:    89 4c 24 60              mov    %ecx,0x60(%rsp)       # Store to alloc_flags
> 
> New code:
> 
>   For  __GFP_HIGH  ( 0x20 ):
>   It uses the Carry Flag (via  sbb ) to conditionally add  0x20  to the base  0x40  ( ALLOC_CPUSET ) flag:
> 
>         ae63:    83 e0 20                 and    $0x20,%eax            # Test __GFP_HIGH
>         ...
>         ae6a:    83 f8 01                 cmp    $0x1,%eax             # Set carry flag if 0
>         ae6f:    45 19 e4                 sbb    %r12d,%r12d           # %r12d = (gfp & 0x20) ? 0 : -1
>         ae80:    41 83 e4 e0              and    $0xffffffe0,%r12d     # %r12d = (gfp & 0x20) ? 0 : -32
>         ae87:    41 83 c4 60              add    $0x60,%r12d           # %r12d = (gfp & 0x20) ? 0x60 : 0x40
> 
>   For  __GFP_KSWAPD_RECLAIM  ( 0x800 ):
>   It uses a conditional move ( cmov ) later in the function to set the  ALLOC_KSWAPD  ( 0x800 ) bit:
> 
>         ae72:    25 00 08 00 00           and    $0x800,%eax           # Test __GFP_KSWAPD_RECLAIM
>         ae77:    89 44 24 30              mov    %eax,0x30(%rsp)       # Store result
>         ...
>         af2c:    80 cf 08                 or     $0x8,%bh              # Set ALLOC_KSWAPD (0x800) in temp reg
>         af2f:    45 85 c9                 test   %r9d,%r9d             # Check if __GFP_KSWAPD_RECLAIM was set
>         af32:    0f 44 d8                 cmove  %eax,%ebx             # If not, revert to flags without it
> 
> Testing with a modified version[0] of lib/free_pages_test.c (adding
> printks with timing)...
> 
> [0] https://github.com/bjackman/aethelred/blob/2ccdc84ef087c2a631914f58e106e99e19bd3b98/page-alloc-test/page-alloc-test.c
> 
> Old results from a Sapphire Rapids consumer CPU:
> 
> [   67.157118] page_alloc_test: Testing with GFP_KERNEL
> [   67.157122] page_alloc_test: Starting 1,000,000 allocations...
> [   70.704446] page_alloc_test: Completed. Time: 3543002 us (Avg: 3543.00 ns per alloc+free loop)
> [   70.704456] page_alloc_test: Testing with GFP_KERNEL | __GFP_COMP
> [   70.704460] page_alloc_test: Starting 1,000,000 allocations...
> [   70.944672] page_alloc_test: Completed. Time: 239980 us (Avg: 239.98 ns per alloc+free loop)
> [   70.944675] page_alloc_test: Test completed
> 
> New results:
> 
> [   70.079015] page_alloc_test: Testing with GFP_KERNEL
> [   70.079020] page_alloc_test: Starting 1,000,000 allocations...
> [   73.669396] page_alloc_test: Completed. Time: 3586954 us (Avg: 3586.95 ns per alloc+free loop)
> [   73.669402] page_alloc_test: Testing with GFP_KERNEL | __GFP_COMP
> [   73.669405] page_alloc_test: Starting 1,000,000 allocations...
> [   73.905084] page_alloc_test: Completed. Time: 235496 us (Avg: 235.49 ns per alloc+free loop)
> [   73.905086] page_alloc_test: Test completed
> 
> Seems like a wash.
> 
> So, drop the flag value coupling here and let the compiler and CPU do
> their job. Superscalar CPUs are pretty neat after all.

True

> (Used AI for the disasm but the rest is all manual).
> 
> Signed-off-by: Brendan Jackman <jackmanb@google.com>

Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

Thanks.

> ---
>  mm/page_alloc.c | 14 ++++----------
>  1 file changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ee902a468c2f5..9e1949ea13a6d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4478,22 +4478,16 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
>  {
>  	unsigned int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
>  
> -	/*
> -	 * __GFP_HIGH is assumed to be the same as ALLOC_MIN_RESERVE
> -	 * and __GFP_KSWAPD_RECLAIM is assumed to be the same as ALLOC_KSWAPD
> -	 * to save two branches.
> -	 */
> -	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_MIN_RESERVE);
> -	BUILD_BUG_ON(__GFP_KSWAPD_RECLAIM != (__force gfp_t) ALLOC_KSWAPD);
> -
>  	/*
>  	 * The caller may dip into page reserves a bit more if the caller
>  	 * cannot run direct reclaim, or if the caller has realtime scheduling
>  	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
>  	 * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
>  	 */
> -	alloc_flags |= (__force int)
> -		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
> +	if (gfp_mask & __GFP_HIGH)
> +		alloc_flags |= ALLOC_MIN_RESERVE;
> +	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> +		alloc_flags |= ALLOC_KSWAPD;
>  
>  	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
>  		/*
> 
> ---
> base-commit: ca2351ac6da277a470d4fcf122b53267e02b2716
> change-id: 20260612-gfp-pessimisation-b258a4bb5ebd
> 
> Best regards,


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm/page_alloc: drop flag-conversion "optimisation"
  2026-06-12 14:15 [PATCH] mm/page_alloc: drop flag-conversion "optimisation" Brendan Jackman
  2026-06-12 14:23 ` Brendan Jackman
  2026-06-12 14:24 ` Vlastimil Babka (SUSE)
@ 2026-06-12 14:53 ` Brendan Jackman
  2026-06-12 15:06 ` Zi Yan
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Brendan Jackman @ 2026-06-12 14:53 UTC (permalink / raw)
  To: Brendan Jackman, Andrew Morton, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Johannes Weiner, Zi Yan
  Cc: Harry Yoo (Oracle), Gregory Price, linux-mm, linux-kernel

On Fri Jun 12, 2026 at 2:15 PM UTC, Brendan Jackman wrote:
> -	/*
> -	 * __GFP_HIGH is assumed to be the same as ALLOC_MIN_RESERVE
> -	 * and __GFP_KSWAPD_RECLAIM is assumed to be the same as ALLOC_KSWAPD
> -	 * to save two branches.
> -	 */
> -	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_MIN_RESERVE);
> -	BUILD_BUG_ON(__GFP_KSWAPD_RECLAIM != (__force gfp_t) ALLOC_KSWAPD);

Agh, alloc_flags_nofragment() needs to be updated too:

diff --git c/mm/page_alloc.c i/mm/page_alloc.c
index 9e1949ea13a6d..0111cdbdb5321 100644
--- c/mm/page_alloc.c
+++ i/mm/page_alloc.c
@@ -3739,13 +3739,10 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 static inline unsigned int
 alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
 {
-       unsigned int alloc_flags;
+       unsigned int alloc_flags = 0;

-       /*
-        * __GFP_KSWAPD_RECLAIM is assumed to be the same as ALLOC_KSWAPD
-        * to save a branch.
-        */
-       alloc_flags = (__force int) (gfp_mask & __GFP_KSWAPD_RECLAIM);
+       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+               alloc_flags |= ALLOC_KSWAPD;

        if (defrag_mode) {
                alloc_flags |= ALLOC_NOFRAGMENT;

I guess I will just send a v2 once I've tested since this is pretty easy
to review anyway, any objections?


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm/page_alloc: drop flag-conversion "optimisation"
  2026-06-12 14:15 [PATCH] mm/page_alloc: drop flag-conversion "optimisation" Brendan Jackman
                   ` (2 preceding siblings ...)
  2026-06-12 14:53 ` Brendan Jackman
@ 2026-06-12 15:06 ` Zi Yan
  2026-06-12 16:04 ` Gregory Price
  2026-06-12 16:33 ` Johannes Weiner
  5 siblings, 0 replies; 7+ messages in thread
From: Zi Yan @ 2026-06-12 15:06 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Johannes Weiner, Harry Yoo (Oracle), Gregory Price, linux-mm,
	linux-kernel

On 12 Jun 2026, at 10:15, Brendan Jackman wrote:

> This code uses flag equivalences to try to optimise conversion from
> GFP_ to ALLOC_ but there's no clear reason to believe it makes things
> faster. Even if it gets rid of conditional branches, it just trades them
> for a data dependency.
>
> CPUs are pretty good at conditional branches. But, in my GCC x86 build
> it doesn't look like there are any branches anyway, the compiler found
> some conditional instruction tricks. (Caveat: This was extracted &
> annotated by Gemini AI, I did not actually read the disasm myself)
>
> Old code:
>
>         ae50:    8b 04 24                 mov    (%rsp),%eax           # Load gfp_mask
>         ...
>         ae5d:    41 89 c4                 mov    %eax,%r12d
>         ae64:    41 81 e4 20 08 00 00     and    $0x820,%r12d          # Mask both flags at once
>         ...
>         ae6f:    44 89 e1                 mov    %r12d,%ecx
>         ae77:    83 c9 40                 or     $0x40,%ecx            # OR with ALLOC_CPUSET (0x40)
>         ae7a:    89 4c 24 60              mov    %ecx,0x60(%rsp)       # Store to alloc_flags
>
> New code:
>
>   For  __GFP_HIGH  ( 0x20 ):
>   It uses the Carry Flag (via  sbb ) to conditionally add  0x20  to the base  0x40  ( ALLOC_CPUSET ) flag:
>
>         ae63:    83 e0 20                 and    $0x20,%eax            # Test __GFP_HIGH
>         ...
>         ae6a:    83 f8 01                 cmp    $0x1,%eax             # Set carry flag if 0
>         ae6f:    45 19 e4                 sbb    %r12d,%r12d           # %r12d = (gfp & 0x20) ? 0 : -1
>         ae80:    41 83 e4 e0              and    $0xffffffe0,%r12d     # %r12d = (gfp & 0x20) ? 0 : -32
>         ae87:    41 83 c4 60              add    $0x60,%r12d           # %r12d = (gfp & 0x20) ? 0x60 : 0x40
>
>   For  __GFP_KSWAPD_RECLAIM  ( 0x800 ):
>   It uses a conditional move ( cmov ) later in the function to set the  ALLOC_KSWAPD  ( 0x800 ) bit:
>
>         ae72:    25 00 08 00 00           and    $0x800,%eax           # Test __GFP_KSWAPD_RECLAIM
>         ae77:    89 44 24 30              mov    %eax,0x30(%rsp)       # Store result
>         ...
>         af2c:    80 cf 08                 or     $0x8,%bh              # Set ALLOC_KSWAPD (0x800) in temp reg
>         af2f:    45 85 c9                 test   %r9d,%r9d             # Check if __GFP_KSWAPD_RECLAIM was set
>         af32:    0f 44 d8                 cmove  %eax,%ebx             # If not, revert to flags without it
>
> Testing with a modified version[0] of lib/free_pages_test.c (adding
> printks with timing)...
>
> [0] https://github.com/bjackman/aethelred/blob/2ccdc84ef087c2a631914f58e106e99e19bd3b98/page-alloc-test/page-alloc-test.c
>
> Old results from a Sapphire Rapids consumer CPU:
>
> [   67.157118] page_alloc_test: Testing with GFP_KERNEL
> [   67.157122] page_alloc_test: Starting 1,000,000 allocations...
> [   70.704446] page_alloc_test: Completed. Time: 3543002 us (Avg: 3543.00 ns per alloc+free loop)
> [   70.704456] page_alloc_test: Testing with GFP_KERNEL | __GFP_COMP
> [   70.704460] page_alloc_test: Starting 1,000,000 allocations...
> [   70.944672] page_alloc_test: Completed. Time: 239980 us (Avg: 239.98 ns per alloc+free loop)
> [   70.944675] page_alloc_test: Test completed
>
> New results:
>
> [   70.079015] page_alloc_test: Testing with GFP_KERNEL
> [   70.079020] page_alloc_test: Starting 1,000,000 allocations...
> [   73.669396] page_alloc_test: Completed. Time: 3586954 us (Avg: 3586.95 ns per alloc+free loop)
> [   73.669402] page_alloc_test: Testing with GFP_KERNEL | __GFP_COMP
> [   73.669405] page_alloc_test: Starting 1,000,000 allocations...
> [   73.905084] page_alloc_test: Completed. Time: 235496 us (Avg: 235.49 ns per alloc+free loop)
> [   73.905086] page_alloc_test: Test completed
>
> Seems like a wash.
>
> So, drop the flag value coupling here and let the compiler and CPU do
> their job. Superscalar CPUs are pretty neat after all.
>
> (Used AI for the disasm but the rest is all manual).
>
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
>  mm/page_alloc.c | 14 ++++----------
>  1 file changed, 4 insertions(+), 10 deletions(-)
>
With your own fixup, it looks good to me.

Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm/page_alloc: drop flag-conversion "optimisation"
  2026-06-12 14:15 [PATCH] mm/page_alloc: drop flag-conversion "optimisation" Brendan Jackman
                   ` (3 preceding siblings ...)
  2026-06-12 15:06 ` Zi Yan
@ 2026-06-12 16:04 ` Gregory Price
  2026-06-12 16:33 ` Johannes Weiner
  5 siblings, 0 replies; 7+ messages in thread
From: Gregory Price @ 2026-06-12 16:04 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Johannes Weiner, Zi Yan, Harry Yoo (Oracle), linux-mm,
	linux-kernel

On Fri, Jun 12, 2026 at 02:15:44PM +0000, Brendan Jackman wrote:
> This code uses flag equivalences to try to optimise conversion from
> GFP_ to ALLOC_ but there's no clear reason to believe it makes things
> faster. Even if it gets rid of conditional branches, it just trades them
> for a data dependency.
> 
> CPUs are pretty good at conditional branches. But, in my GCC x86 build
> it doesn't look like there are any branches anyway, the compiler found
> some conditional instruction tricks. (Caveat: This was extracted &
> annotated by Gemini AI, I did not actually read the disasm myself)
> 
> Old code:
> 
>         ae50:    8b 04 24                 mov    (%rsp),%eax           # Load gfp_mask
>         ...
>         ae5d:    41 89 c4                 mov    %eax,%r12d
>         ae64:    41 81 e4 20 08 00 00     and    $0x820,%r12d          # Mask both flags at once
>         ...
>         ae6f:    44 89 e1                 mov    %r12d,%ecx
>         ae77:    83 c9 40                 or     $0x40,%ecx            # OR with ALLOC_CPUSET (0x40)
>         ae7a:    89 4c 24 60              mov    %ecx,0x60(%rsp)       # Store to alloc_flags
> 
> New code:
> 
>   For  __GFP_HIGH  ( 0x20 ):
>   It uses the Carry Flag (via  sbb ) to conditionally add  0x20  to the base  0x40  ( ALLOC_CPUSET ) flag:
> 
>         ae63:    83 e0 20                 and    $0x20,%eax            # Test __GFP_HIGH
>         ...
>         ae6a:    83 f8 01                 cmp    $0x1,%eax             # Set carry flag if 0
>         ae6f:    45 19 e4                 sbb    %r12d,%r12d           # %r12d = (gfp & 0x20) ? 0 : -1
>         ae80:    41 83 e4 e0              and    $0xffffffe0,%r12d     # %r12d = (gfp & 0x20) ? 0 : -32
>         ae87:    41 83 c4 60              add    $0x60,%r12d           # %r12d = (gfp & 0x20) ? 0x60 : 0x40
> 
>   For  __GFP_KSWAPD_RECLAIM  ( 0x800 ):
>   It uses a conditional move ( cmov ) later in the function to set the  ALLOC_KSWAPD  ( 0x800 ) bit:
> 
>         ae72:    25 00 08 00 00           and    $0x800,%eax           # Test __GFP_KSWAPD_RECLAIM
>         ae77:    89 44 24 30              mov    %eax,0x30(%rsp)       # Store result
>         ...
>         af2c:    80 cf 08                 or     $0x8,%bh              # Set ALLOC_KSWAPD (0x800) in temp reg
>         af2f:    45 85 c9                 test   %r9d,%r9d             # Check if __GFP_KSWAPD_RECLAIM was set
>         af32:    0f 44 d8                 cmove  %eax,%ebx             # If not, revert to flags without it
> 
> Testing with a modified version[0] of lib/free_pages_test.c (adding
> printks with timing)...
> 
> [0] https://github.com/bjackman/aethelred/blob/2ccdc84ef087c2a631914f58e106e99e19bd3b98/page-alloc-test/page-alloc-test.c
> 
> Old results from a Sapphire Rapids consumer CPU:
> 
> [   67.157118] page_alloc_test: Testing with GFP_KERNEL
> [   67.157122] page_alloc_test: Starting 1,000,000 allocations...
> [   70.704446] page_alloc_test: Completed. Time: 3543002 us (Avg: 3543.00 ns per alloc+free loop)
> [   70.704456] page_alloc_test: Testing with GFP_KERNEL | __GFP_COMP
> [   70.704460] page_alloc_test: Starting 1,000,000 allocations...
> [   70.944672] page_alloc_test: Completed. Time: 239980 us (Avg: 239.98 ns per alloc+free loop)
> [   70.944675] page_alloc_test: Test completed
> 
> New results:
> 
> [   70.079015] page_alloc_test: Testing with GFP_KERNEL
> [   70.079020] page_alloc_test: Starting 1,000,000 allocations...
> [   73.669396] page_alloc_test: Completed. Time: 3586954 us (Avg: 3586.95 ns per alloc+free loop)
> [   73.669402] page_alloc_test: Testing with GFP_KERNEL | __GFP_COMP
> [   73.669405] page_alloc_test: Starting 1,000,000 allocations...
> [   73.905084] page_alloc_test: Completed. Time: 235496 us (Avg: 235.49 ns per alloc+free loop)
> [   73.905086] page_alloc_test: Test completed
> 
> Seems like a wash.
> 
> So, drop the flag value coupling here and let the compiler and CPU do
> their job. Superscalar CPUs are pretty neat after all.
> 
> (Used AI for the disasm but the rest is all manual).
> 
> Signed-off-by: Brendan Jackman <jackmanb@google.com>

Love me some readability improvements and code reduction :]

Reviewed-by: Gregory Price <gourry@gourry.net>

> ---
>  mm/page_alloc.c | 14 ++++----------
>  1 file changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ee902a468c2f5..9e1949ea13a6d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4478,22 +4478,16 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
>  {
>  	unsigned int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
>  
> -	/*
> -	 * __GFP_HIGH is assumed to be the same as ALLOC_MIN_RESERVE
> -	 * and __GFP_KSWAPD_RECLAIM is assumed to be the same as ALLOC_KSWAPD
> -	 * to save two branches.
> -	 */
> -	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_MIN_RESERVE);
> -	BUILD_BUG_ON(__GFP_KSWAPD_RECLAIM != (__force gfp_t) ALLOC_KSWAPD);
> -
>  	/*
>  	 * The caller may dip into page reserves a bit more if the caller
>  	 * cannot run direct reclaim, or if the caller has realtime scheduling
>  	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
>  	 * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
>  	 */
> -	alloc_flags |= (__force int)
> -		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
> +	if (gfp_mask & __GFP_HIGH)
> +		alloc_flags |= ALLOC_MIN_RESERVE;
> +	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> +		alloc_flags |= ALLOC_KSWAPD;
>  
>  	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
>  		/*
> 
> ---
> base-commit: ca2351ac6da277a470d4fcf122b53267e02b2716
> change-id: 20260612-gfp-pessimisation-b258a4bb5ebd
> 
> Best regards,
> -- 
> Brendan Jackman <jackmanb@google.com>
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm/page_alloc: drop flag-conversion "optimisation"
  2026-06-12 14:15 [PATCH] mm/page_alloc: drop flag-conversion "optimisation" Brendan Jackman
                   ` (4 preceding siblings ...)
  2026-06-12 16:04 ` Gregory Price
@ 2026-06-12 16:33 ` Johannes Weiner
  5 siblings, 0 replies; 7+ messages in thread
From: Johannes Weiner @ 2026-06-12 16:33 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Harry Yoo (Oracle), Gregory Price, linux-mm, linux-kernel

On Fri, Jun 12, 2026 at 02:15:44PM +0000, Brendan Jackman wrote:
> This code uses flag equivalences to try to optimise conversion from
> GFP_ to ALLOC_ but there's no clear reason to believe it makes things
> faster. Even if it gets rid of conditional branches, it just trades them
> for a data dependency.
> 
> CPUs are pretty good at conditional branches. But, in my GCC x86 build
> it doesn't look like there are any branches anyway, the compiler found
> some conditional instruction tricks. (Caveat: This was extracted &
> annotated by Gemini AI, I did not actually read the disasm myself)
> 
> Old code:
> 
>         ae50:    8b 04 24                 mov    (%rsp),%eax           # Load gfp_mask
>         ...
>         ae5d:    41 89 c4                 mov    %eax,%r12d
>         ae64:    41 81 e4 20 08 00 00     and    $0x820,%r12d          # Mask both flags at once
>         ...
>         ae6f:    44 89 e1                 mov    %r12d,%ecx
>         ae77:    83 c9 40                 or     $0x40,%ecx            # OR with ALLOC_CPUSET (0x40)
>         ae7a:    89 4c 24 60              mov    %ecx,0x60(%rsp)       # Store to alloc_flags
> 
> New code:
> 
>   For  __GFP_HIGH  ( 0x20 ):
>   It uses the Carry Flag (via  sbb ) to conditionally add  0x20  to the base  0x40  ( ALLOC_CPUSET ) flag:
> 
>         ae63:    83 e0 20                 and    $0x20,%eax            # Test __GFP_HIGH
>         ...
>         ae6a:    83 f8 01                 cmp    $0x1,%eax             # Set carry flag if 0
>         ae6f:    45 19 e4                 sbb    %r12d,%r12d           # %r12d = (gfp & 0x20) ? 0 : -1
>         ae80:    41 83 e4 e0              and    $0xffffffe0,%r12d     # %r12d = (gfp & 0x20) ? 0 : -32
>         ae87:    41 83 c4 60              add    $0x60,%r12d           # %r12d = (gfp & 0x20) ? 0x60 : 0x40
> 
>   For  __GFP_KSWAPD_RECLAIM  ( 0x800 ):
>   It uses a conditional move ( cmov ) later in the function to set the  ALLOC_KSWAPD  ( 0x800 ) bit:
> 
>         ae72:    25 00 08 00 00           and    $0x800,%eax           # Test __GFP_KSWAPD_RECLAIM
>         ae77:    89 44 24 30              mov    %eax,0x30(%rsp)       # Store result
>         ...
>         af2c:    80 cf 08                 or     $0x8,%bh              # Set ALLOC_KSWAPD (0x800) in temp reg
>         af2f:    45 85 c9                 test   %r9d,%r9d             # Check if __GFP_KSWAPD_RECLAIM was set
>         af32:    0f 44 d8                 cmove  %eax,%ebx             # If not, revert to flags without it
> 
> Testing with a modified version[0] of lib/free_pages_test.c (adding
> printks with timing)...
> 
> [0] https://github.com/bjackman/aethelred/blob/2ccdc84ef087c2a631914f58e106e99e19bd3b98/page-alloc-test/page-alloc-test.c
> 
> Old results from a Sapphire Rapids consumer CPU:
> 
> [   67.157118] page_alloc_test: Testing with GFP_KERNEL
> [   67.157122] page_alloc_test: Starting 1,000,000 allocations...
> [   70.704446] page_alloc_test: Completed. Time: 3543002 us (Avg: 3543.00 ns per alloc+free loop)
> [   70.704456] page_alloc_test: Testing with GFP_KERNEL | __GFP_COMP
> [   70.704460] page_alloc_test: Starting 1,000,000 allocations...
> [   70.944672] page_alloc_test: Completed. Time: 239980 us (Avg: 239.98 ns per alloc+free loop)
> [   70.944675] page_alloc_test: Test completed
> 
> New results:
> 
> [   70.079015] page_alloc_test: Testing with GFP_KERNEL
> [   70.079020] page_alloc_test: Starting 1,000,000 allocations...
> [   73.669396] page_alloc_test: Completed. Time: 3586954 us (Avg: 3586.95 ns per alloc+free loop)
> [   73.669402] page_alloc_test: Testing with GFP_KERNEL | __GFP_COMP
> [   73.669405] page_alloc_test: Starting 1,000,000 allocations...
> [   73.905084] page_alloc_test: Completed. Time: 235496 us (Avg: 235.49 ns per alloc+free loop)
> [   73.905086] page_alloc_test: Test completed
> 
> Seems like a wash.
> 
> So, drop the flag value coupling here and let the compiler and CPU do
> their job. Superscalar CPUs are pretty neat after all.
> 
> (Used AI for the disasm but the rest is all manual).
> 
> Signed-off-by: Brendan Jackman <jackmanb@google.com>

Nice. With the _nofragment() bits folded:

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-06-12 16:33 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-12 14:15 [PATCH] mm/page_alloc: drop flag-conversion "optimisation" Brendan Jackman
2026-06-12 14:23 ` Brendan Jackman
2026-06-12 14:24 ` Vlastimil Babka (SUSE)
2026-06-12 14:53 ` Brendan Jackman
2026-06-12 15:06 ` Zi Yan
2026-06-12 16:04 ` Gregory Price
2026-06-12 16:33 ` Johannes Weiner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.