linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Can people please check this patch out for cross-arch issues
@ 2018-07-09 23:57 Linus Torvalds
  2018-07-10  4:49 ` Nicholas Piggin
  2018-07-10  6:12 ` Martin Schwidefsky
  0 siblings, 2 replies; 4+ messages in thread
From: Linus Torvalds @ 2018-07-09 23:57 UTC (permalink / raw)
  To: linux-arch

[-- Attachment #1: Type: text/plain, Size: 2400 bytes --]

We have a funny new bugzilla entry for an issue that is 12 years old,
and not really all that critical, but one that *can* become a problem
for people who do very specific things.

What happens is that 'fork()' will cause a re-try (with
ERESTARTNOINTR) if a signal has come in between the point where the
fork() started, but before we add the process to the process table.
The reason is that the signal possibly *should* affect the new child
process too, but it was never queued to it because we were obviously
in the process of creating it.

That's normally entirely a non-issue, and I don't think anybody ever
imagined it could matter in practice, but apparently there are loads
where this becomes problematic.

See kernel bugzilla at

    https://bugzilla.kernel.org/show_bug.cgi?id=200447

which has a trial balloon patch for this issue already, to at least
limit that retry to only signals that actually might affect the child
(ie not any random signals sent explicitly and directly only to the
parent process).

HOWEVER.

The very first thing I noticed while looking at this was that one of
the more expensive parts of the fork() retry is actually marking all
the parent page tables read-only. Now, it's one of _many_ expensive
parts, and it's not nearly as expensive as all the reference counting
we do for each page, but it's actually very easy to avoid. When we
have repeated fork() calls, there's just no point in repeatedly
marking pages read-only.

This the attached one-liner patch.

The reason I'm sending it to the arch people is that while this is a
totally trivial patch on x86 ("pte_write()" literally tests exactly
the same bit that "pte_wrprotect()" and "ptep_set_wrprotect()"
clears), the same is not necessarily always true on other
architectures.

Some other architectures make "ptep_set_wrprotect()" do more than just
clear the one bit we test with "pte_write()".

Honestly, I don't think it could possibly matter: if "pte_write()"
returns false, then whatever "ptep_set_writeprotect()" does can not
really matter (at least for a COW mapping). But I thought I'd send
this out for comments anyway, despite how trivial it looks.

So. Comments?

It doesn't make a huge difference, honestly, and the real fix for the
"retry too eagerly" is completely different, but at the same time this
one-liner trivial fix does feel like the RightThing(tm) regardless.

                 Linus

[-- Attachment #2: 0001-mm-cow-don-t-bother-write-protecting-already-write-p.patch --]
[-- Type: text/x-patch, Size: 1674 bytes --]

From 9c1b28c73cac29925511b6b5cf76e7f6860858f8 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 9 Jul 2018 13:19:49 -0700
Subject: [PATCH] mm/cow: don't bother write protecting already write-protected
 pages

This is not normally noticeable, but repeated forks (either because the
process itself uses fork() a lot, or because of the kernel retry due to
signal handling) are unnecessarily expensive because they repeatedly
dirty the parent page tables during the page table copy operation.

It's trivial to just avoid write protecting the page table entry if it
was already not writable.

This patch was inspired by

    https://bugzilla.kernel.org/show_bug.cgi?id=200447

which points to an ancient "waste time re-doing fork" issue in the
presence of lots of signals.  It doesn't _fix_ the issue, but it avoids
at least part of the unnecessary overhead when you do end up repeating a
fork().

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..57e61d50535f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1017,7 +1017,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * If it's a COW mapping, write protect it both
 	 * in the parent and the child
 	 */
-	if (is_cow_mapping(vm_flags)) {
+	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
 		ptep_set_wrprotect(src_mm, addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
-- 
2.18.0.132.g95eda3d86


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: Can people please check this patch out for cross-arch issues
  2018-07-09 23:57 Can people please check this patch out for cross-arch issues Linus Torvalds
@ 2018-07-10  4:49 ` Nicholas Piggin
  2018-07-10  4:49   ` Nicholas Piggin
  2018-07-10  6:12 ` Martin Schwidefsky
  1 sibling, 1 reply; 4+ messages in thread
From: Nicholas Piggin @ 2018-07-10  4:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linuxppc-dev

On Mon, 9 Jul 2018 16:57:46 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> We have a funny new bugzilla entry for an issue that is 12 years old,
> and not really all that critical, but one that *can* become a problem
> for people who do very specific things.
> 
> What happens is that 'fork()' will cause a re-try (with
> ERESTARTNOINTR) if a signal has come in between the point where the
> fork() started, but before we add the process to the process table.
> The reason is that the signal possibly *should* affect the new child
> process too, but it was never queued to it because we were obviously
> in the process of creating it.
> 
> That's normally entirely a non-issue, and I don't think anybody ever
> imagined it could matter in practice, but apparently there are loads
> where this becomes problematic.
> 
> See kernel bugzilla at
> 
>     https://bugzilla.kernel.org/show_bug.cgi?id=200447
> 
> which has a trial balloon patch for this issue already, to at least
> limit that retry to only signals that actually might affect the child
> (ie not any random signals sent explicitly and directly only to the
> parent process).
> 
> HOWEVER.
> 
> The very first thing I noticed while looking at this was that one of
> the more expensive parts of the fork() retry is actually marking all
> the parent page tables read-only. Now, it's one of _many_ expensive
> parts, and it's not nearly as expensive as all the reference counting
> we do for each page, but it's actually very easy to avoid. When we
> have repeated fork() calls, there's just no point in repeatedly
> marking pages read-only.
> 
> This the attached one-liner patch.
> 
> The reason I'm sending it to the arch people is that while this is a
> totally trivial patch on x86 ("pte_write()" literally tests exactly
> the same bit that "pte_wrprotect()" and "ptep_set_wrprotect()"
> clears), the same is not necessarily always true on other
> architectures.
> 
> Some other architectures make "ptep_set_wrprotect()" do more than just
> clear the one bit we test with "pte_write()".
> 
> Honestly, I don't think it could possibly matter: if "pte_write()"
> returns false, then whatever "ptep_set_writeprotect()" does can not
> really matter (at least for a COW mapping). But I thought I'd send
> this out for comments anyway, despite how trivial it looks.
> 
> So. Comments?

Looks good to me (after the huge page bits are done?)

If pte_write returns false here, surely any later write access has to
fault otherwise we've got bigger problems right? powerpc/64/radix is
pretty unsurprising so no problems there (it just modifies the pte, so
shouldn't change anything in this case). Hash will actually schedule
the hash table entry to be invalidated, but it can't be writable.

> It doesn't make a huge difference, honestly, and the real fix for the
> "retry too eagerly" is completely different, but at the same time this
> one-liner trivial fix does feel like the RightThing(tm) regardless.

Acked-by: Nicholas Piggin <npiggin@gmail.com>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Can people please check this patch out for cross-arch issues
  2018-07-10  4:49 ` Nicholas Piggin
@ 2018-07-10  4:49   ` Nicholas Piggin
  0 siblings, 0 replies; 4+ messages in thread
From: Nicholas Piggin @ 2018-07-10  4:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linuxppc-dev

On Mon, 9 Jul 2018 16:57:46 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> We have a funny new bugzilla entry for an issue that is 12 years old,
> and not really all that critical, but one that *can* become a problem
> for people who do very specific things.
> 
> What happens is that 'fork()' will cause a re-try (with
> ERESTARTNOINTR) if a signal has come in between the point where the
> fork() started, but before we add the process to the process table.
> The reason is that the signal possibly *should* affect the new child
> process too, but it was never queued to it because we were obviously
> in the process of creating it.
> 
> That's normally entirely a non-issue, and I don't think anybody ever
> imagined it could matter in practice, but apparently there are loads
> where this becomes problematic.
> 
> See kernel bugzilla at
> 
>     https://bugzilla.kernel.org/show_bug.cgi?id=200447
> 
> which has a trial balloon patch for this issue already, to at least
> limit that retry to only signals that actually might affect the child
> (ie not any random signals sent explicitly and directly only to the
> parent process).
> 
> HOWEVER.
> 
> The very first thing I noticed while looking at this was that one of
> the more expensive parts of the fork() retry is actually marking all
> the parent page tables read-only. Now, it's one of _many_ expensive
> parts, and it's not nearly as expensive as all the reference counting
> we do for each page, but it's actually very easy to avoid. When we
> have repeated fork() calls, there's just no point in repeatedly
> marking pages read-only.
> 
> This the attached one-liner patch.
> 
> The reason I'm sending it to the arch people is that while this is a
> totally trivial patch on x86 ("pte_write()" literally tests exactly
> the same bit that "pte_wrprotect()" and "ptep_set_wrprotect()"
> clears), the same is not necessarily always true on other
> architectures.
> 
> Some other architectures make "ptep_set_wrprotect()" do more than just
> clear the one bit we test with "pte_write()".
> 
> Honestly, I don't think it could possibly matter: if "pte_write()"
> returns false, then whatever "ptep_set_writeprotect()" does can not
> really matter (at least for a COW mapping). But I thought I'd send
> this out for comments anyway, despite how trivial it looks.
> 
> So. Comments?

Looks good to me (after the huge page bits are done?)

If pte_write returns false here, surely any later write access has to
fault otherwise we've got bigger problems right? powerpc/64/radix is
pretty unsurprising so no problems there (it just modifies the pte, so
shouldn't change anything in this case). Hash will actually schedule
the hash table entry to be invalidated, but it can't be writable.

> It doesn't make a huge difference, honestly, and the real fix for the
> "retry too eagerly" is completely different, but at the same time this
> one-liner trivial fix does feel like the RightThing(tm) regardless.

Acked-by: Nicholas Piggin <npiggin@gmail.com>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Can people please check this patch out for cross-arch issues
  2018-07-09 23:57 Can people please check this patch out for cross-arch issues Linus Torvalds
  2018-07-10  4:49 ` Nicholas Piggin
@ 2018-07-10  6:12 ` Martin Schwidefsky
  1 sibling, 0 replies; 4+ messages in thread
From: Martin Schwidefsky @ 2018-07-10  6:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch

On Mon, 9 Jul 2018 16:57:46 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> We have a funny new bugzilla entry for an issue that is 12 years old,
> and not really all that critical, but one that *can* become a problem
> for people who do very specific things.
> 
> What happens is that 'fork()' will cause a re-try (with
> ERESTARTNOINTR) if a signal has come in between the point where the
> fork() started, but before we add the process to the process table.
> The reason is that the signal possibly *should* affect the new child
> process too, but it was never queued to it because we were obviously
> in the process of creating it.
> 
> That's normally entirely a non-issue, and I don't think anybody ever
> imagined it could matter in practice, but apparently there are loads
> where this becomes problematic.
> 
> See kernel bugzilla at
> 
>     https://bugzilla.kernel.org/show_bug.cgi?id=200447
> 
> which has a trial balloon patch for this issue already, to at least
> limit that retry to only signals that actually might affect the child
> (ie not any random signals sent explicitly and directly only to the
> parent process).
> 
> HOWEVER.
> 
> The very first thing I noticed while looking at this was that one of
> the more expensive parts of the fork() retry is actually marking all
> the parent page tables read-only. Now, it's one of _many_ expensive
> parts, and it's not nearly as expensive as all the reference counting
> we do for each page, but it's actually very easy to avoid. When we
> have repeated fork() calls, there's just no point in repeatedly
> marking pages read-only.
> 
> This the attached one-liner patch.
> 
> The reason I'm sending it to the arch people is that while this is a
> totally trivial patch on x86 ("pte_write()" literally tests exactly
> the same bit that "pte_wrprotect()" and "ptep_set_wrprotect()"
> clears), the same is not necessarily always true on other
> architectures.
> 
> Some other architectures make "ptep_set_wrprotect()" do more than just
> clear the one bit we test with "pte_write()".
> 
> Honestly, I don't think it could possibly matter: if "pte_write()"
> returns false, then whatever "ptep_set_writeprotect()" does can not
> really matter (at least for a COW mapping). But I thought I'd send
> this out for comments anyway, despite how trivial it looks.
> 
> So. Comments?
> 
> It doesn't make a huge difference, honestly, and the real fix for the
> "retry too eagerly" is completely different, but at the same time this
> one-liner trivial fix does feel like the RightThing(tm) regardless.

s390 does software dirty-bit tracking which makes the logic for the
page protection more bit more complicated. But !pte_write() always
implies not writable.

As a follow up patch we could remove the pte_write() check in the
s390 version of ptep_set_wrprotect.

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 5ab636089c60..e8f9f3cc676a 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1069,8 +1069,7 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm,
 {
        pte_t pte = *ptep;
 
-       if (pte_write(pte))
-               ptep_xchg_lazy(mm, addr, ptep, pte_wrprotect(pte));
+       ptep_xchg_lazy(mm, addr, ptep, pte_wrprotect(pte));
 }
 
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-07-10  6:12 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-07-09 23:57 Can people please check this patch out for cross-arch issues Linus Torvalds
2018-07-10  4:49 ` Nicholas Piggin
2018-07-10  4:49   ` Nicholas Piggin
2018-07-10  6:12 ` Martin Schwidefsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).