From: Ingo Molnar <mingo@kernel.org>
To: Song Liu <songliubraving@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
	the arch/x86 maintainers <x86@kernel.org>
Subject: Re: [GIT pull] x86/pti for 5.4-rc1
Date: Wed, 25 Sep 2019 08:23:23 +0200
Message-ID: <20190925062323.GA65860@gmail.com>
In-Reply-To: <C6FC577A-A589-46FD-92FE-5C441BDB922D@fb.com>


* Song Liu <songliubraving@fb.com> wrote:

> 
> 
> > On Sep 17, 2019, at 4:35 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > 
> > On Tue, Sep 17, 2019 at 4:29 PM Song Liu <songliubraving@fb.com> wrote:
> >> 
> >> How about we just do:
> >> 
> >> diff --git i/arch/x86/mm/pti.c w/arch/x86/mm/pti.c
> >> index b196524759ec..0437f65250db 100644
> >> --- i/arch/x86/mm/pti.c
> >> +++ w/arch/x86/mm/pti.c
> >> @@ -341,6 +341,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
> >>                }
> >> 
> >>                if (pmd_large(*pmd) || level == PTI_CLONE_PMD) {
> >> +                       WARN_ON_ONCE(addr & ~PMD_MASK);
> >>                        target_pmd = pti_user_pagetable_walk_pmd(addr);
> >>                        if (WARN_ON(!target_pmd))
> >>                                return;
> >> 
> >> So it is a "warn and continue" check just for unaligned PMD address.
> > 
> > The problem there is that the "continue" part can be wrong.
> > 
> > Admittedly it requires a pretty crazy setup: you first hit a
> > pmd_large() entry, but the *next* pmd is regular, so you start doing
> > the per-page cloning.
> > 
> > And that per-page cloning will be wrong, because it will start in the
> > middle of the next pmd, because addr wasn't aligned, and the previous
> > pmd-only clone did
> > 
> >                        addr += PMD_SIZE;
> > 
> > to go to the next case.
> > 
> > See?
> 
> I see. This is tricky. 
> 
> Maybe we should skip clone of the first unaligned large pmd?
> 
> diff --git i/arch/x86/mm/pti.c w/arch/x86/mm/pti.c
> index 7f2140414440..1dfa69f8196b 100644
> --- i/arch/x86/mm/pti.c
> +++ w/arch/x86/mm/pti.c
> @@ -343,6 +343,11 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>                 }
> 
>                 if (pmd_large(*pmd) || level == PTI_CLONE_PMD) {
> +                       if (WARN_ON_ONCE(addr & ~PMD_MASK)) {
> +                               addr = round_up(addr, PMD_SIZE);
> +                               continue;
> +                       }
> +
>                         target_pmd = pti_user_pagetable_walk_pmd(addr);
>                         if (WARN_ON(!target_pmd))
>                                 return;

No, we should do a proper iteration of the page table structures.

> Or we can round_down the addr and copy the whole PMD properly:
> 
> diff --git i/arch/x86/mm/pti.c w/arch/x86/mm/pti.c
> index 7f2140414440..bee9881f2e85 100644
> --- i/arch/x86/mm/pti.c
> +++ w/arch/x86/mm/pti.c
> @@ -343,6 +343,9 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>                 }
> 
>                 if (pmd_large(*pmd) || level == PTI_CLONE_PMD) {
> +                       if (WARN_ON_ONCE(addr & ~PMD_MASK))
> +                               addr &= PMD_MASK;
> +
>                         target_pmd = pti_user_pagetable_walk_pmd(addr);
>                         if (WARN_ON(!target_pmd))
>                                 return;
> 
> I think the latter is better, but I am not sure. 

While this works, it's the wrong iterator pattern I believe.

In this function we iterate by passing in a 'random' [start,end) virtual 
memory address range with no particular alignment assumptions, then look 
up all pagetable entries covered by that range.

The iteration's principle is straightforward: we look up the first 
address (byte granular) then continue iterating according to the observed 
structure of the kernel pagetables, by skipping the range we have just 
looked up:

- If the current PUD is not mapped, then we set 'addr' to the first byte 
  after the virtual memory range represented by the current PUD entry:

    addr = round_up(addr + 1, PUD_SIZE);

- If the current PMD is not mapped, then the next byte is:

    addr = round_up(addr + 1, PMD_SIZE);
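
A minimal userspace sketch of the two skip rules above, assuming the
x86-64 4-level paging sizes (2 MiB per PMD entry, 1 GiB per PUD entry).
round_up_pow2() and skip_unmapped() are hypothetical stand-ins written
for illustration, not the kernel's own helpers:

```c
/* Assumed x86-64 4-level paging sizes (illustration only). */
#define PMD_SIZE (1UL << 21)   /* 2 MiB covered by one PMD entry */
#define PUD_SIZE (1UL << 30)   /* 1 GiB covered by one PUD entry */

/* Hypothetical helper: round x up to a multiple of size (a power of two). */
static unsigned long round_up_pow2(unsigned long x, unsigned long size)
{
	return (x + size - 1) & ~(size - 1);
}

/*
 * Skip the unmapped entry that covers addr: return the first byte after
 * the size-aligned range containing addr.
 */
static unsigned long skip_unmapped(unsigned long addr, unsigned long size)
{
	return round_up_pow2(addr + 1, size);
}
```

For example, skip_unmapped(0x201000, PMD_SIZE) yields 0x400000, the
start of the next PMD-sized range.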

The part that, as Linus correctly pointed out, is still iterating 
incorrectly and might be fragile is:

    addr += PMD_SIZE;

This is buggy because it doesn't step to the next byte after the current 
mapped PMD, but potentially somewhere into the middle of the next 
PMD-sized range of virtual memory (which might or might not be covered by 
a PMD entry). The iterations after that might be similarly offset and 
buggy as well.

The right fix is to *fix the address iterator*, to use the basic 
principle of the function, with the same general calculation pattern 
we use in the other cases:

    addr = round_down(addr, PMD_SIZE) + PMD_SIZE;
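
A small userspace sketch comparing the two stepping rules for a mapped
PMD, again assuming a 2 MiB PMD_SIZE. The names round_down_pow2(),
step_plain() and step_aligned() are hypothetical and exist only for
this illustration:

```c
/* Assumed x86-64 PMD size (illustration only). */
#define PMD_SIZE (1UL << 21)   /* 2 MiB covered by one PMD entry */

/* Hypothetical helper: round x down to a multiple of size (a power of two). */
static unsigned long round_down_pow2(unsigned long x, unsigned long size)
{
	return x & ~(size - 1);
}

/* Stepping as in 'addr += PMD_SIZE': keeps addr's offset within its PMD range. */
static unsigned long step_plain(unsigned long addr)
{
	return addr + PMD_SIZE;
}

/* Stepping with the round_down() pattern: always lands on a PMD boundary. */
static unsigned long step_aligned(unsigned long addr)
{
	return round_down_pow2(addr, PMD_SIZE) + PMD_SIZE;
}
```

With addr = 0x201000 (4 KiB into a PMD range), step_plain() returns
0x401000, which is 4 KiB into the next range, while step_aligned()
returns 0x400000, the first byte of the next range. For an already
aligned addr both return the same value.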

BTW, I'd also suggest using this new round_down() pattern in the other 
two cases as well:

    addr = round_down(addr, PUD_SIZE) + PUD_SIZE;
    ...
    addr = round_down(addr, PMD_SIZE) + PMD_SIZE;

Why? Because this:

    addr = round_up(addr + 1, PUD_SIZE);

Will iterate incorrectly if 'addr' (which is byte granular) is the last 
*byte* of a PUD range, it will incorrectly skip the next PUD range...

Is a page-unaligned address likely to be passed in to this function? With 
the current users I really hope it won't happen, but it costs nothing to 
use clean iterators and think through all cases - it also makes the code 
more readable.

Three random nits about the pti_clone_pgtable() function:

- Could we please also fix all WARN()'s in that function to be 
  WARN_ONCE()? Any warning from that function is probably fatal to the 
  bootup anyway, and it doesn't help if we potentially spam many 
  warnings.

- Please add a comment explaining why the 'BUG();' case is 
  unrecoverable and needs us to crash the kernel.

- Please add a comment about what the 'level' parameter does. It's non-obvious.

Thanks,

	Ingo
