From: Catalin Marinas <catalin.marinas@arm.com>
To: Yang Shi <yang@os.amperecomputing.com>
Cc: will@kernel.org, scott@os.amperecomputing.com, cl@gentwo.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH] arm64: mm: force write fault for atomic RMW instructions
Date: Fri, 10 May 2024 13:11:30 +0100
Message-ID: <Zj4O8q9-bliXE435@arm.com>
In-Reply-To: <20240507223558.3039562-1-yang@os.amperecomputing.com>
On Tue, May 07, 2024 at 03:35:58PM -0700, Yang Shi wrote:
> The atomic RMW instructions, for example, ldadd, actually does load +
> add + store in one instruction, it may trigger two page faults, the
> first fault is a read fault, the second fault is a write fault.
>
> Some applications use atomic RMW instructions to populate memory, for
> example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
> at launch time) between v18 and v22.
I'd also argue that this should be optimised in openjdk. Is an LDADD
more efficient on your hardware than a plain STR? I hope it only does
one operation per page rather than per long. There's also MAP_POPULATE
that openjdk can use to pre-fault the pages with no additional fault.
This would be even more efficient than any store or atomic operation.
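
As a rough illustration of the two suggestions above (a minimal sketch, not
taken from openjdk; the helper names map_prefaulted() and pretouch_pages()
are made up for this example), pre-faulting with MAP_POPULATE and a
one-store-per-page pretouch could look like this in userspace:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for MAP_ANONYMOUS and MAP_POPULATE */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map 'len' bytes of anonymous private memory and
 * ask the kernel to populate it at mmap() time. Returns NULL on failure.
 */
static void *map_prefaulted(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Hypothetical fallback: one plain store per page rather than one per
 * long. Assumes the range has not been used yet, since it overwrites
 * the first byte of every page with zero.
 */
static void pretouch_pages(volatile char *base, size_t len, size_t page_size)
{
	for (size_t off = 0; off < len; off += page_size)
		base[off] = 0;
}
```

With MAP_POPULATE the pretouch loop should not be needed at all; it is
only shown for the case where the mapping already exists.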

Not sure why the architecture reports only a read fault for atomics.
Looking at the pseudocode, it checks for both but the read permission
takes priority. Also in case of a translation fault (which is what we
get on the first fault), I think the syndrome write bit is populated as
(!read && write), so 0 since 'read' is 1 for atomics.
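
To spell out that reading of the pseudocode (a paraphrase under the
assumption stated above, not the architecture text; the function name is
made up):

```c
#include <stdbool.h>

/*
 * Model of the syndrome write bit as described above: it is set only
 * for accesses that write without also reading.
 */
static bool syndrome_write_bit(bool read, bool write)
{
	return !read && write;
}
```

A plain store (read = 0, write = 1) gives 1, while an atomic RMW
(read = 1, write = 1) gives 0, so under this reading the first fault
looks like a read fault to the kernel.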
> But the double page fault has some problems:
>
> 1. Noticeable TLB overhead. The kernel actually installs zero page with
> readonly PTE for the read fault. The write fault will trigger a
> write-protection fault (CoW). The CoW will allocate a new page and
> make the PTE point to the new page, this needs TLB invalidations. The
> tlb invalidation and the mandatory memory barriers may incur
> significant overhead, particularly on the machines with many cores.
I can see why the current behaviour is not ideal but I can't tell why
openjdk does it this way either.

A bigger hammer would be to implement mm_forbids_zeropage() but this may
affect some workloads that rely on sparsely populated large arrays.
> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> index db1aeacd4cd9..5d5a3fbeecc0 100644
> --- a/arch/arm64/include/asm/insn.h
> +++ b/arch/arm64/include/asm/insn.h
> @@ -319,6 +319,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void) \
> * "-" means "don't care"
> */
> __AARCH64_INSN_FUNCS(class_branch_sys, 0x1c000000, 0x14000000)
> +__AARCH64_INSN_FUNCS(class_atomic, 0x3b200c00, 0x38200000)
This looks correct; it covers the LDADD and SWP instructions. However,
one concern is whether future architecture versions will add some
instructions in this space that are allowed to do a read-only operation
(e.g. skip writing if the value is the same or fails some comparison).
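
For reference, the mask/value pair above boils down to the following
check. This is a userspace model of what __AARCH64_INSN_FUNCS() generates;
the macro and function names below are chosen for this sketch only:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask and value from the proposed class_atomic entry. */
#define CLASS_ATOMIC_MASK	0x3b200c00u
#define CLASS_ATOMIC_VALUE	0x38200000u

/* An instruction is in the class if all masked bits equal the value. */
static bool insn_is_class_atomic(uint32_t insn)
{
	return (insn & CLASS_ATOMIC_MASK) == CLASS_ATOMIC_VALUE;
}
```

With encodings worked out by hand (worth double-checking against an
assembler), 0xf8210002 (ldadd x1, x2, [x0]) and 0xf8218002
(swp x1, x2, [x0]) match, while 0xf9000002 (str x2, [x0]) does not.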
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 8251e2fea9c7..f7bceedf5ef3 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -529,6 +529,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> unsigned int mm_flags = FAULT_FLAG_DEFAULT;
> unsigned long addr = untagged_addr(far);
> struct vm_area_struct *vma;
> + unsigned int insn;
>
> if (kprobe_page_fault(regs, esr))
> return 0;
> @@ -586,6 +587,24 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> if (!vma)
> goto lock_mmap;
>
> + if (mm_flags & (FAULT_FLAG_WRITE | FAULT_FLAG_INSTRUCTION))
> + goto continue_fault;
I'd avoid the goto if possible. Even better, move this higher up into
the block of if/else statements building the vm_flags and mm_flags.
Factor out the checks into a different function - is_el0_atomic_instr()
or something.
> +
> + pagefault_disable();
This prevents recursively entering do_page_fault() but it may be worth
testing it with an execute-only permission.
> +
> + if (get_user(insn, (unsigned int __user *) instruction_pointer(regs))) {
> + pagefault_enable();
> + goto continue_fault;
> + }
> +
> + if (aarch64_insn_is_class_atomic(insn)) {
> + vm_flags = VM_WRITE;
> + mm_flags |= FAULT_FLAG_WRITE;
> + }
The above would need to check if the fault is coming from a 64-bit user
mode, otherwise the decoding wouldn't make sense:

	if (!user_mode(regs) || compat_user_mode(regs))
		return false;

(assuming a separate function that checks the above and returns a bool;
you'd need to re-enable the page faults)

You also need to take care of endianness since the instructions are
always little-endian. We use a similar pattern in user_insn_read():

	u32 instr;
	__le32 instr_le;

	if (get_user(instr_le, (__le32 __user *)instruction_pointer(regs)))
		return false;
	instr = le32_to_cpu(instr_le);
	...
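
Putting the 64-bit check, the little-endian read and the class check
together, the logic of such a helper can be modelled in userspace as
below. This is only a sketch: struct fault_ctx and both function names
are hypothetical stand-ins, and it leaves out the get_user() failure path
and the pagefault_disable()/pagefault_enable() pairing that the real
function would need.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the bits of pt_regs the check needs. */
struct fault_ctx {
	bool user_mode;			/* fault taken from EL0 */
	bool compat;			/* 32-bit (AArch32) task */
	const uint8_t *pc_bytes;	/* 4 instruction bytes as stored in memory */
};

/* Assemble a little-endian 32-bit word regardless of host endianness. */
static uint32_t read_le32(const uint8_t *b)
{
	return (uint32_t)b[0] | (uint32_t)b[1] << 8 |
	       (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

static bool is_el0_atomic_instr_model(const struct fault_ctx *ctx)
{
	uint32_t insn;

	/* A64 decoding only makes sense for 64-bit user mode. */
	if (!ctx->user_mode || ctx->compat)
		return false;

	insn = read_le32(ctx->pc_bytes);

	/* Mask and value from the proposed class_atomic entry. */
	return (insn & 0x3b200c00u) == 0x38200000u;
}
```

Decoding from bytes keeps the result the same on a big-endian host, which
is the point of the le32_to_cpu() conversion in the kernel version.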

That said, I'm not keen on this kernel workaround. If openjdk decides to
tighten security and goes for PROT_EXEC-only mappings of its text
sections, the above trick will no longer work because get_user() would
not be able to read the faulting instruction.
--
Catalin