linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm: Inline vma_needs_copy
@ 2025-06-18  1:42 Yunshui Jiang
  2025-06-18  9:39 ` Vlastimil Babka
  2025-06-18 10:35 ` Lorenzo Stoakes
  0 siblings, 2 replies; 6+ messages in thread
From: Yunshui Jiang @ 2025-06-18  1:42 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, jiangyunshui

From: jiangyunshui <jiangyunshui@kylinos.cn>

Since commit bcd51a3c679d ("hugetlb: lazy page table copies
in fork()"), the logic about judging whether to copy
page table inside func copy_page_range has been extracted
into a separate func vma_needs_copy. While this change
improves code readability, it also incurs more function call
overhead, especially where fork() were frequently called.

Inline func vma_needs_copy to optimize the copy_page_range
performance. Given that func vma_needs_copy is only called
by copy_page_range, inlining it would not cause unacceptable
code bloat.

Testing was done with the byte-unixbench spawn benchmark
(which frequently calls fork). I measured 1.7% improvement
on x86 and 1.8% improvement on arm64.

Signed-off-by: jiangyunshui <jiangyunshui@kylinos.cn>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8eba595056fe..d15b07f96ab1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1337,7 +1337,7 @@ copy_p4d_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
  * false when we can speed up fork() by allowing lazy page faults later until
  * when the child accesses the memory range.
  */
-static bool
+static __always_inline bool
 vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 {
 	/*
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: Inline vma_needs_copy
  2025-06-18  1:42 [PATCH] mm: Inline vma_needs_copy Yunshui Jiang
@ 2025-06-18  9:39 ` Vlastimil Babka
  2025-06-18 10:02   ` David Hildenbrand
  2025-06-18 10:35 ` Lorenzo Stoakes
  1 sibling, 1 reply; 6+ messages in thread
From: Vlastimil Babka @ 2025-06-18  9:39 UTC (permalink / raw)
  To: Yunshui Jiang, linux-kernel, linux-mm
  Cc: akpm, david, lorenzo.stoakes, Liam.Howlett

On 6/18/25 3:42 AM, Yunshui Jiang wrote:
> From: jiangyunshui <jiangyunshui@kylinos.cn>
> 
> Since commit bcd51a3c679d ("hugetlb: lazy page table copies
> in fork()"), the logic about judging whether to copy
> page table inside func copy_page_range has been extracted
> into a separate func vma_needs_copy. While this change
> improves code readability, it also incurs more function call
> overhead, especially where fork() were frequently called.
> 
> Inline func vma_needs_copy to optimize the copy_page_range
> performance. Given that func vma_needs_copy is only called
> by copy_page_range, inlining it would not cause unacceptable
> code bloat.

I'm surprised the compiler doesn't inline it already, if there's a
single caller. In fact, mine (gcc-14.3 on x86) already does.

So I wonder to which extent should we force override wrong compiler
heuristics? Maybe just inline instead of __always_inline would be OK? Is
that enough of a hint for your compiler?

> Testing was done with the byte-unixbench spawn benchmark
> (which frequently calls fork). I measured 1.7% improvement
> on x86 and 1.8% improvement on arm64.
> 
> Signed-off-by: jiangyunshui <jiangyunshui@kylinos.cn>
> ---
>  mm/memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 8eba595056fe..d15b07f96ab1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1337,7 +1337,7 @@ copy_p4d_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>   * false when we can speed up fork() by allowing lazy page faults later until
>   * when the child accesses the memory range.
>   */
> -static bool
> +static __always_inline bool
>  vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  {
>  	/*


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: Inline vma_needs_copy
  2025-06-18  9:39 ` Vlastimil Babka
@ 2025-06-18 10:02   ` David Hildenbrand
  2025-06-18 14:54     ` Liam R. Howlett
  0 siblings, 1 reply; 6+ messages in thread
From: David Hildenbrand @ 2025-06-18 10:02 UTC (permalink / raw)
  To: Vlastimil Babka, Yunshui Jiang, linux-kernel, linux-mm
  Cc: akpm, lorenzo.stoakes, Liam.Howlett

On 18.06.25 11:39, Vlastimil Babka wrote:
> On 6/18/25 3:42 AM, Yunshui Jiang wrote:
>> From: jiangyunshui <jiangyunshui@kylinos.cn>
>>
>> Since commit bcd51a3c679d ("hugetlb: lazy page table copies
>> in fork()"), the logic about judging whether to copy
>> page table inside func copy_page_range has been extracted
>> into a separate func vma_needs_copy. While this change
>> improves code readability, it also incurs more function call
>> overhead, especially where fork() were frequently called.
>>
>> Inline func vma_needs_copy to optimize the copy_page_range
>> performance. Given that func vma_needs_copy is only called
>> by copy_page_range, inlining it would not cause unacceptable
>> code bloat.
> 
> I'm surprised the compiler doesn't inline it already, if there's a
> single caller. In fact, mine (gcc-14.3 on x86) already does.

Same here

	gcc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2)
> 
> So I wonder to which extent should we force override wrong compiler
> heuristics?

I think only where we (a) really know better (b) it's crucial for 
performance and (c) modern compilers don't do the right thing.

> Maybe just inline instead of __always_inline would be OK? Is
> that enough of a hint for your compiler?

I would do the same thing (inline as a hint), but some people apparently 
recently argued that newer compiler mostly ignore it either way. At 
least that's what I recall.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: Inline vma_needs_copy
  2025-06-18  1:42 [PATCH] mm: Inline vma_needs_copy Yunshui Jiang
  2025-06-18  9:39 ` Vlastimil Babka
@ 2025-06-18 10:35 ` Lorenzo Stoakes
  1 sibling, 0 replies; 6+ messages in thread
From: Lorenzo Stoakes @ 2025-06-18 10:35 UTC (permalink / raw)
  To: Yunshui Jiang; +Cc: linux-kernel, linux-mm, akpm, david, Liam.Howlett, vbabka

On Wed, Jun 18, 2025 at 09:42:09AM +0800, Yunshui Jiang wrote:
> From: jiangyunshui <jiangyunshui@kylinos.cn>
>
> Since commit bcd51a3c679d ("hugetlb: lazy page table copies
> in fork()"), the logic about judging whether to copy
> page table inside func copy_page_range has been extracted
> into a separate func vma_needs_copy. While this change
> improves code readability, it also incurs more function call
> overhead, especially where fork() were frequently called.
>
> Inline func vma_needs_copy to optimize the copy_page_range
> performance. Given that func vma_needs_copy is only called
> by copy_page_range, inlining it would not cause unacceptable
> code bloat.
>
> Testing was done with the byte-unixbench spawn benchmark
> (which frequently calls fork). I measured 1.7% improvement
> on x86 and 1.8% improvement on arm64.

As per others you are going to need to provide details of your compiler
setup because modern compilers are inlining this already.

if it's ye olde compiler I'm not sure this is justified...

>
> Signed-off-by: jiangyunshui <jiangyunshui@kylinos.cn>
> ---
>  mm/memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 8eba595056fe..d15b07f96ab1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1337,7 +1337,7 @@ copy_p4d_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>   * false when we can speed up fork() by allowing lazy page faults later until
>   * when the child accesses the memory range.
>   */
> -static bool
> +static __always_inline bool
>  vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  {

This needs to live in vma.h probably... todo++...

>  	/*
> --
> 2.47.1
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: Inline vma_needs_copy
  2025-06-18 10:02   ` David Hildenbrand
@ 2025-06-18 14:54     ` Liam R. Howlett
  2025-06-19  3:30       ` Yunshui Jiang
  0 siblings, 1 reply; 6+ messages in thread
From: Liam R. Howlett @ 2025-06-18 14:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vlastimil Babka, Yunshui Jiang, linux-kernel, linux-mm, akpm,
	lorenzo.stoakes

* David Hildenbrand <david@redhat.com> [250618 06:03]:
> On 18.06.25 11:39, Vlastimil Babka wrote:
> > On 6/18/25 3:42 AM, Yunshui Jiang wrote:
> > > From: jiangyunshui <jiangyunshui@kylinos.cn>
> > > 
> > > Since commit bcd51a3c679d ("hugetlb: lazy page table copies
> > > in fork()"), the logic about judging whether to copy
> > > page table inside func copy_page_range has been extracted
> > > into a separate func vma_needs_copy. While this change
> > > improves code readability, it also incurs more function call
> > > overhead, especially where fork() were frequently called.
> > > 
> > > Inline func vma_needs_copy to optimize the copy_page_range
> > > performance. Given that func vma_needs_copy is only called
> > > by copy_page_range, inlining it would not cause unacceptable
> > > code bloat.
> > 
> > I'm surprised the compiler doesn't inline it already, if there's a
> > single caller. In fact, mine (gcc-14.3 on x86) already does.
> 
> Same here
> 
> 	gcc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2)
> > 
> > So I wonder to which extent should we force override wrong compiler
> > heuristics?
> 
> I think only where we (a) really know better (b) it's crucial for
> performance and (c) modern compilers don't do the right thing.
> 
> > Maybe just inline instead of __always_inline would be OK? Is
> > that enough of a hint for your compiler?
> 
> I would do the same thing (inline as a hint), but some people apparently
> recently argued that newer compiler mostly ignore it either way. At least
> that's what I recall.

I have seen inline ignored, but not with a single caller.  Perhaps
godbolt could find a reason this is necessary, otherwise we should just
leave it, IMO.

Cheers,
Liam


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: Inline vma_needs_copy
  2025-06-18 14:54     ` Liam R. Howlett
@ 2025-06-19  3:30       ` Yunshui Jiang
  0 siblings, 0 replies; 6+ messages in thread
From: Yunshui Jiang @ 2025-06-19  3:30 UTC (permalink / raw)
  To: liam.howlett
  Cc: akpm, david, jiangyunshui, linux-kernel, linux-mm,
	lorenzo.stoakes, vbabka

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=y, Size: 746 bytes --]

Initially, I added explicit inline hints in the code precisely because some
newer compilers might ignore such suggestions.
 

Based on everyone's feedback, I rechecked my compiler configurations

—ARM64 uses GCC 12.3, while x86 uses GCC 12.4. By analyzing the

disassembly of vmlinux via objdump, I confirmed that vma_needs_copy

is indeed inlined in both cases. In this case the score difference in spawn

might just be due to fluctuations.

 

Leave it unchanged might be better. If we later encounter issues where

older compilers fail to inline single-caller cases or multiple calls are

incurred due to code changes, we could re-discuss  whether manual

intervention is needed. 

 

Best regards,
Yunshui Jiang

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-06-19  3:30 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-18  1:42 [PATCH] mm: Inline vma_needs_copy Yunshui Jiang
2025-06-18  9:39 ` Vlastimil Babka
2025-06-18 10:02   ` David Hildenbrand
2025-06-18 14:54     ` Liam R. Howlett
2025-06-19  3:30       ` Yunshui Jiang
2025-06-18 10:35 ` Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).