From: Kefeng Wang <wangkefeng.wang@huawei.com>
To: Peter Xu <peterx@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, <linux-mm@kvack.org>
Subject: Re: [PATCH v2] mm: memory: check userfaultfd_wp() in vmf_orig_pte_uffd_wp()
Date: Sat, 20 Apr 2024 12:05:04 +0800 [thread overview]
Message-ID: <073f3d13-5656-4de0-a62b-cee96f2b0eaa@huawei.com> (raw)
In-Reply-To: <ZiKLBxS8D_jzD6F9@x1n>
On 2024/4/19 23:17, Peter Xu wrote:
> On Fri, Apr 19, 2024 at 11:00:46AM +0800, Kefeng Wang wrote:
>>
>>
>> On 2024/4/19 0:32, Peter Xu wrote:
>>> Hi, Kefeng,
>>>
>>> On Thu, Apr 18, 2024 at 08:06:41PM +0800, Kefeng Wang wrote:
>>>> Add userfaultfd_wp() check in vmf_orig_pte_uffd_wp() to avoid the
>>>> unnecessary pte_marker_entry_uffd_wp() in most pagefault, difference
>>>> as shows below from perf data of lat_pagefault, note, the function
>>>> vmf_orig_pte_uffd_wp() is not inlined in the two kernel versions.
>>>>
>>>> perf report -i perf.data.before | grep vmf
>>>> 0.17% 0.13% lat_pagefault [kernel.kallsyms] [k] vmf_orig_pte_uffd_wp.part.0.isra.0
>>>> perf report -i perf.data.after | grep vmf
>>>
>>> Any real number to share too besides the perf greps? I meant, even if perf
>>> report will not report such function anymore, it doesn't mean it'll be
>>> faster, and how much it improves?
>>
>> dd if=/dev/zero of=/tmp/XXX bs=512M count=1
>> ./lat_pagefault -W 5 -N 5 /tmp/XXX
>>
>> before after
>> 1 0.2623 0.2605
>> 2 0.2622 0.2598
>> 3 0.2621 0.2595
>> 4 0.2622 0.2600
>> 5 0.2651 0.2598
>> 6 0.2624 0.2594
>> 7 0.2624 0.2605
>> 8 0.2627 0.2608
>> average 0.262675 0.2600375 -0.0026375
>>
>> The lat_pagefault does show some improvement(also I reboot and retest,
>> the results are same).
>
> Thanks. Could you replace the perf report with these real data? Or just
> append to it.
Sure, will append it.
>
> I had a look at the asm and indeed the current code will generate two
> jumps when without this patch, and I don't know why..
>
> 0x0000000000006ac4 <+52>: test $0x8,%ah <---- check FAULT_FLAG_ORIG_PTE_VALID
> 0x0000000000006ac7 <+55>: jne 0x6bcf <set_pte_range+319>
> 0x0000000000006acd <+61>: mov 0x18(%rbp),%rsi
>
> ...
>
> 0x0000000000006bcf <+319>: mov 0x40(%rdi),%rdi
> 0x0000000000006bd3 <+323>: test $0xffffffffffffff9f,%rdi <---- pte_none() check
> 0x0000000000006bda <+330>: je 0x6acd <set_pte_range+61>
> 0x0000000000006be0 <+336>: test $0x101,%edi <---- pte_present() check
> 0x0000000000006be6 <+342>: jne 0x6acd <set_pte_range+61>
> 0x0000000000006bec <+348>: call 0x1c50 <pte_to_swp_entry>
> 0x0000000000006bf1 <+353>: mov 0x0(%rip),%rdx # 0x6bf8 <set_pte_range+360>
> 0x0000000000006bf8 <+360>: mov %rax,%r15
> 0x0000000000006bfb <+363>: shr $0x3a,%rax
> 0x0000000000006bff <+367>: cmp $0x1f,%rax
> 0x0000000000006c03 <+371>: mov $0x0,%eax
> 0x0000000000006c08 <+376>: cmovne %rax,%r15
> 0x0000000000006c0c <+380>: mov 0x28(%rbx),%eax
> 0x0000000000006c0f <+383>: and $0x1,%r15d
> 0x0000000000006c13 <+387>: jmp 0x6acd <set_pte_range+61>
>
> I also don't know why the compiler cannot already merge the none+present
> check into one shot, I thought it could. Also surprised me that
> pte_to_swp_entry() is a function call.. but not involved in this context.
>
> So I think I was right it should bypass this when seeing it pte_none,
> however that includes two jumps.
>
> And with your patch applied the two jumps are not there:
>
> 0x0000000000006b0c <+124>: testb $0x8,0x29(%r14) <--- FAULT_FLAG_ORIG_PTE_VALID
> 0x0000000000006b11 <+129>: je 0x6b6a <set_pte_range+218>
> 0x0000000000006b13 <+131>: mov (%r14),%rax
> 0x0000000000006b16 <+134>: testb $0x10,0x21(%rax) <--- userfaultfd_wp(vmf->vma) check
> 0x0000000000006b1a <+138>: je 0x6b6a <set_pte_range+218>
>
> Maybe that's what contributes to that 0.x% extra time of a fault.
>
> So if we do care about this 0.x% and we're doing this anyway, perhaps move
The latency of lat_pagefault increased a lot than the old kernel(vs
5.10), except mm counter updating, the another obvious difference
shown from perf graph is the new vmf_orig_pte_uffd_wp().
> the vma check upper? Because afaict FAULT_FLAG_ORIG_PTE_VALID should
> always hit in set_pte_range(), so we can avoid two more insts in the common
> paths.
Moving it upper is better, and maybe add __always_inline to
vmf_orig_pte_uffd_wp() to make set_pte_range() only check VM_UFFD_WP
from vm_flags?
>
> I'll leave that to you too if you want to mention some details in above and
> add that into the commit log.
Will update the changelog, thanks.
>
> Thanks,
>
next prev parent reply other threads:[~2024-04-20 4:05 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-04-18 12:06 [PATCH v2] mm: memory: check userfaultfd_wp() in vmf_orig_pte_uffd_wp() Kefeng Wang
2024-04-18 16:32 ` Peter Xu
2024-04-19 3:00 ` Kefeng Wang
2024-04-19 15:17 ` Peter Xu
2024-04-20 4:05 ` Kefeng Wang [this message]
2024-04-21 13:53 ` Peter Xu
2024-04-22 2:13 ` Kefeng Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=073f3d13-5656-4de0-a62b-cee96f2b0eaa@huawei.com \
--to=wangkefeng.wang@huawei.com \
--cc=akpm@linux-foundation.org \
--cc=linux-mm@kvack.org \
--cc=peterx@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox