linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing
       [not found]   ` <CA+55aFwa41fzvx8EZG_gODvw7hSpr+iP+w5fXp6jUcQh-4nFgQ@mail.gmail.com>
@ 2012-06-27 23:01     ` Peter Zijlstra
  2012-06-27 23:42       ` Linus Torvalds
                         ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Peter Zijlstra @ 2012-06-27 23:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Thomas Gleixner, Ingo Molnar,
	akpm, Rik van Riel, Hugh Dickins, Mel Gorman, Nick Piggin,
	Alex Shi, Nikunj A. Dadhania, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, David Miller, Russell King,
	Catalin Marinas, Chris Metcalf, Martin Schwidefsky, Tony Luck,
	Paul Mundt, Jeff Dike, Richard Weinberger, Ralf Baechle,
	Kyle McMartin, James Bottomley, Chris Zankel

On Wed, 2012-06-27 at 15:23 -0700, Linus Torvalds wrote:

> Plus it really isn't about hardware page table walkers at all. It's
> more about the possibility of speculative TLB fils, it has nothing to
> do with *how* they are done. Sure, it's likely that a software
> pagetable walker wouldn't be something that gets called speculatively,
> but it's not out of the question.
> 
Hmm, I would call gup_fast() as speculative as we can get in software.
It does a lock-less walk of the page-tables. That's what the RCU free'd
page-table stuff is for to begin with.
> 
> IOW, if Sparc/PPC really want to guarantee that they never fill TLB
> entries speculatively, and that if we are in a kernel thread they will
> *never* fill the TLB with anything else, then make them enable
> CONFIG_STRICT_TLB_FILL or something in their architecture Kconfig
> files. 

Since we've dealt with the speculative software side by using RCU-ish
stuff, the only thing that's left is hardware, now neither sparc64 nor
ppc actually know about the linux page-tables from what I understood,
they only look at their hash-table thing.

So even if the hardware did do speculative tlb fills, it would do them
from the hash-table, but that's already cleared out.


How about something like this

---
Subject: mm: Add missing TLB invalidate to RCU page-table freeing
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Thu Jun 28 00:49:33 CEST 2012

For normal systems we need a TLB invalidate before freeing the
page-tables, the generic RCU based page-table freeing code lacked
this.

This is because this code originally came from ppc where the hardware
never walks the linux page-tables and thus this invalidate is not
required.

Others, notably s390 which ran into this problem in cd94154cc6a
("[S390] fix tlb flushing for page table pages"), do very much need
this TLB invalidation.

Therefore add it, with a Kconfig option to disable it so as to not
unduly slow down PPC and SPARC64 which neither of them need it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/Kconfig         |    3 +++
 arch/powerpc/Kconfig |    1 +
 arch/sparc/Kconfig   |    1 +
 mm/memory.c          |   18 ++++++++++++++++++
 4 files changed, 23 insertions(+)

--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -231,6 +231,9 @@ config HAVE_ARCH_MUTEX_CPU_RELAX
 config HAVE_RCU_TABLE_FREE
 	bool
 
+config STRICT_TLB_FILL
+	bool
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
 
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -127,6 +127,7 @@ config PPC
 	select GENERIC_IRQ_SHOW_LEVEL
 	select IRQ_FORCED_THREADING
 	select HAVE_RCU_TABLE_FREE if SMP
+	select STRICT_TLB_FILL
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_BPF_JIT if PPC64
 	select HAVE_ARCH_JUMP_LABEL
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -52,6 +52,7 @@ config SPARC64
 	select HAVE_KRETPROBES
 	select HAVE_KPROBES
 	select HAVE_RCU_TABLE_FREE if SMP
+	select STRICT_TLB_FILL
 	select HAVE_MEMBLOCK
 	select HAVE_MEMBLOCK_NODE_MAP
 	select HAVE_SYSCALL_WRAPPERS
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -329,11 +329,27 @@ static void tlb_remove_table_rcu(struct 
 	free_page((unsigned long)batch);
 }
 
+#ifdef CONFIG_STRICT_TLB_FILL
+/*
+ * Some archictures (sparc64, ppc) cannot refill TLBs after the they've removed
+ * the PTE entries from their hash-table. Their hardware never looks at the
+ * linux page-table structures, so they don't need a hardware TLB invalidate
+ * when tearing down the page-table structure itself.
+ */
+static inline void tlb_table_flush_mmu(struct mmu_gather *tlb) { }
+#else
+static inline void tlb_table_flush_mmu(struct mmu_gather *tlb)
+{
+	tlb_flush_mmu(tlb);
+}
+#endif
+
 void tlb_table_flush(struct mmu_gather *tlb)
 {
 	struct mmu_table_batch **batch = &tlb->batch;
 
 	if (*batch) {
+		tlb_table_flush_mmu(tlb);
 		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
 		*batch = NULL;
 	}
@@ -345,6 +361,7 @@ void tlb_remove_table(struct mmu_gather 
 
 	tlb->need_flush = 1;
 
+#ifdef CONFIG_STRICT_TLB_FILL
 	/*
 	 * When there's less then two users of this mm there cannot be a
 	 * concurrent page-table walk.
@@ -353,6 +370,7 @@ void tlb_remove_table(struct mmu_gather 
 		__tlb_remove_table(table);
 		return;
 	}
+#endif
 
 	if (*batch == NULL) {
 		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing
       [not found]     ` <1340838154.10063.86.camel@twins>
@ 2012-06-27 23:13       ` Peter Zijlstra
  2012-06-27 23:23         ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2012-06-27 23:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Thomas Gleixner, Ingo Molnar,
	akpm, Rik van Riel, Hugh Dickins, Mel Gorman, Nick Piggin,
	Alex Shi, Nikunj A. Dadhania, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, David Miller, Russell King,
	Catalin Marinas, Chris Metcalf, Martin Schwidefsky, Tony Luck,
	Paul Mundt, Jeff Dike, Richard Weinberger, Ralf Baechle,
	Kyle McMartin, James Bottomley, Chris Zankel

On Thu, 2012-06-28 at 01:02 +0200, Peter Zijlstra wrote:
> On Wed, 2012-06-27 at 15:26 -0700, Linus Torvalds wrote:
> > On Wed, Jun 27, 2012 at 2:15 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > This originated from s390 which does something similar and would allow
> > > s390 to use the generic TLB flushing code.
> > >
> > > The idea is to flush the mm wide cache and tlb a priory and not bother
> > > with multiple flushes if the batching isn't large enough.
> > >
> > > This can be safely done since there cannot be any concurrency on this
> > > mm, its either after the process died (exit) or in the middle of
> > > execve where the thread switched to the new mm.
> > 
> > I think we actually *used* to do the final TLB flush from within the
> > context of the process that died. That doesn't seem to ever be the
> > case any more, but it does worry me a bit. Maybe a
> > 
> >    VM_BUG_ON(current->active_mm == mm);
> > 
> > or something for the fullmm case?
> 
> OK, added it and am rebooting the test box..

That triggered.. is this a problem though, at this point userspace is
very dead so it shouldn't matter, right?

Will have to properly think about it tomorrow, its been 1am, brain is
mostly sleeping already.

------------[ cut here ]------------
kernel BUG at /home/root/src/linux-2.6/mm/memory.c:221!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU 13 
Pid: 132, comm: modprobe Not tainted 3.5.0-rc4-01507-g912ca15-dirty #180 Supermicro X8DTN/X8DTN
RIP: 0010:[<ffffffff811511bf>]  [<ffffffff811511bf>] tlb_gather_mmu+0x9f/0xb0
RSP: 0018:ffff880235b2bd78  EFLAGS: 00010246
RAX: ffff880235b18000 RBX: ffff880235b2bdc0 RCX: ffff880235b18000
RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000000
RBP: ffff880235b2bd98 R08: 0000000000000018 R09: 0000000000000004
R10: ffffffff81eedfc0 R11: 0000000000000084 R12: ffff8804356b8000
R13: 0000000000000001 R14: ffff880235b185f0 R15: ffff880235b18000
FS:  0000000000000000(0000) GS:ffff880237ce0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000038ce8ae150 CR3: 0000000436ad6000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 132, threadinfo ffff880235b2a000, task ffff880235b18000)
Stack:
 ffff880235b2bd98 0000000000000000 ffff8804356b8000 ffff8804356b8060
 ffff880235b2be38 ffffffff8115ad38 ffff880235b2be38 ffff880235b4e000
 ffff880235b4e630 ffff8804356b8000 0000000100000000 ffff880235b2bdd8
Call Trace:
 [<ffffffff8115ad38>] exit_mmap+0x98/0x150
 [<ffffffff810bf98e>] ? exit_numa+0xae/0xe0
 [<ffffffff81078b74>] mmput+0x84/0x120
 [<ffffffff81080ce8>] exit_mm+0x108/0x130
 [<ffffffff81081388>] do_exit+0x678/0x950
 [<ffffffff811a3ad6>] ? alloc_fd+0xd6/0x120
 [<ffffffff811791c0>] ? kmem_cache_free+0x20/0x130
 [<ffffffff810819af>] do_group_exit+0x3f/0xa0
 [<ffffffff81081a27>] sys_exit_group+0x17/0x20
 [<ffffffff81980ed2>] system_call_fastpath+0x16/0x1b
Code: 10 74 1a 65 48 8b 04 25 80 ba 00 00 4c 3b a0 90 02 00 00 74 16 4c 89 e7 e8 5f 39 f2 ff 48 8b 5d e8 4c 8b 65 f0 4c 8b 6d f8 c9 c3 <0f> 0b 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 
RIP  [<ffffffff811511bf>] tlb_gather_mmu+0x9f/0xb0
 RSP <ffff880235b2bd78>
---[ end trace f99f121b09c974f8 ]---


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing
  2012-06-27 23:13       ` [PATCH 08/20] mm: Optimize fullmm TLB flushing Peter Zijlstra
@ 2012-06-27 23:23         ` Linus Torvalds
  2012-06-27 23:33           ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2012-06-27 23:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-arch, linux-mm, Thomas Gleixner, Ingo Molnar,
	akpm, Rik van Riel, Hugh Dickins, Mel Gorman, Nick Piggin,
	Alex Shi, Nikunj A. Dadhania, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, David Miller, Russell King,
	Catalin Marinas, Chris Metcalf, Martin Schwidefsky, Tony Luck,
	Paul Mundt, Jeff Dike, Richard Weinberger, Ralf Baechle,
	Kyle McMartin, James Bottomley, Chris Zankel

On Wed, Jun 27, 2012 at 4:13 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> That triggered.. is this a problem though, at this point userspace is
> very dead so it shouldn't matter, right?

It still matters. Even if user space is dead, kernel space accesses
can result in TLB fills in user space. Exactly because of things like
speculative fills etc.

So what can happen - for example - is that the kernel does a indirect
jump, and the CPU predicts the destination of the jump that using the
branch prediction tables.

But the branch prediction tables are obviously just predictions, and
they easily contain user addresses etc in them. So the kernel may well
end up speculatively doing a TLB fill on a user access.

And your whole optimization depends on this not happening, unless I
read the logic wrong. The whole "invalidate the TLB just once
up-front" approach is *only* valid if you know that nothing is going
to ever fill that TLB again. But see above - if we're still running
within that TLB context, we have no idea what speculative execution
may or may not end up filling.

That said, maybe I misread your patch?

                   Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing
  2012-06-27 23:23         ` Linus Torvalds
@ 2012-06-27 23:33           ` Linus Torvalds
  2012-06-28 10:55             ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2012-06-27 23:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-arch, linux-mm, Thomas Gleixner, Ingo Molnar,
	akpm, Rik van Riel, Hugh Dickins, Mel Gorman, Nick Piggin,
	Alex Shi, Nikunj A. Dadhania, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, David Miller, Russell King,
	Catalin Marinas, Chris Metcalf, Martin Schwidefsky, Tony Luck,
	Paul Mundt, Jeff Dike, Richard Weinberger, Ralf Baechle,
	Kyle McMartin, James Bottomley, Chris Zankel

On Wed, Jun 27, 2012 at 4:23 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But the branch prediction tables are obviously just predictions, and
> they easily contain user addresses etc in them. So the kernel may well
> end up speculatively doing a TLB fill on a user access.

That should be ".. on a user *address*", hopefully that was clear from
the context, if not from the text.

IOW, the point I'm trying to make is that even if there are zero
*actual* accesses of user space (because user space is dead, and the
kernel hopefully does no "get_user()/put_user()" stuff at this point
any more), the CPU may speculatively use user addresses for the
bog-standard kernel addresses that happen.

Taking a user address from the BTB is just one example. Speculative
memory accesses might happen after a mis-predicted branch, where we
test a pointer against NULL, and after the branch we access it. So
doing a speculative TLB walk of the NULL address would not necessarily
even be unusual. Obviously normally nothing is actually mapped there,
but these kinds of things can *easily* result in the page tables
themselves being cached, even if the final page doesn't exist.

Also, all of this obviously depends on how aggressive the speculation
is. It's entirely possible that effects like these are really hard to
see in practice, and you'll almost never hit it. But stale TLB
contents (or stale page directory caches) are *really* nasty when they
do happen, and almost impossible to debug. So we want to be insanely
anal in this area.

               Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing
  2012-06-27 23:01     ` [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing Peter Zijlstra
@ 2012-06-27 23:42       ` Linus Torvalds
  2012-06-28  7:09       ` Benjamin Herrenschmidt
  2012-07-24  5:12       ` Nikunj A Dadhania
  2 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2012-06-27 23:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-arch, linux-mm, Thomas Gleixner, Ingo Molnar,
	akpm, Rik van Riel, Hugh Dickins, Mel Gorman, Nick Piggin,
	Alex Shi, Nikunj A. Dadhania, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, David Miller, Russell King,
	Catalin Marinas, Chris Metcalf, Martin Schwidefsky, Tony Luck,
	Paul Mundt, Jeff Dike, Richard Weinberger, Ralf Baechle,
	Kyle McMartin, James Bottomley, Chris Zankel

On Wed, Jun 27, 2012 at 4:01 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> How about something like this

Looks better.

I'd be even happier if you made the whole

  "When there's less then two users.."

(There's a misspelling there, btw, I didn't notice until I
cut-and-pasted that) logic be a helper function, and have that helper
function be inside that same #ifdef CONFIG_STRICT_TLB_FILL block
together witht he tlb_table_flush_mmu() function.

IOW, something like

  static int tlb_remove_table_quick( struct mmu_gather *tlb, void *table)
  {
        if (atomic_read(&tlb->mm->mm_users) < 2) {
            __tlb_remove_table(table);
            return 1;
        }
        return 0;
  }

for the CONFIG_STRICT_TLB_FILL case, and then the default case just
does an unconditional "return 0".

So that the actual code can avoid having #ifdef's in the middle of a
function, and could just do

    if (tlb_remove_table_quick(tlb, table))
        return;

instead.

Maybe it's just me, but I detest seeing #ifdef's in the middle of
code. I'd much rather have the #ifdef's *outside* the code and have
these kinds of helper functions that sometimes end up becoming empty.

                   Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing
  2012-06-27 23:01     ` [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing Peter Zijlstra
  2012-06-27 23:42       ` Linus Torvalds
@ 2012-06-28  7:09       ` Benjamin Herrenschmidt
  2012-06-28 11:05         ` Peter Zijlstra
  2012-07-24  5:12       ` Nikunj A Dadhania
  2 siblings, 1 reply; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2012-06-28  7:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, linux-kernel, linux-arch, linux-mm,
	Thomas Gleixner, Ingo Molnar, akpm, Rik van Riel, Hugh Dickins,
	Mel Gorman, Nick Piggin, Alex Shi, Nikunj A. Dadhania,
	Konrad Rzeszutek Wilk, David Miller, Russell King,
	Catalin Marinas, Chris Metcalf, Martin Schwidefsky, Tony Luck,
	Paul Mundt, Jeff Dike, Richard Weinberger, Ralf Baechle,
	Kyle McMartin, James Bottomley, Chris Zankel

On Thu, 2012-06-28 at 01:01 +0200, Peter Zijlstra wrote:
> On Wed, 2012-06-27 at 15:23 -0700, Linus Torvalds wrote:
> 
> > Plus it really isn't about hardware page table walkers at all. It's
> > more about the possibility of speculative TLB fils, it has nothing to
> > do with *how* they are done. Sure, it's likely that a software
> > pagetable walker wouldn't be something that gets called speculatively,
> > but it's not out of the question.
> > 
> Hmm, I would call gup_fast() as speculative as we can get in software.
> It does a lock-less walk of the page-tables. That's what the RCU free'd
> page-table stuff is for to begin with.

Strictly speaking it's not :-) To *begin with* (as in the origin of that
code) it comes from powerpc hash table code which walks the linux page
tables locklessly :-) It then came in handy with gup_fast :-)

> > IOW, if Sparc/PPC really want to guarantee that they never fill TLB
> > entries speculatively, and that if we are in a kernel thread they will
> > *never* fill the TLB with anything else, then make them enable
> > CONFIG_STRICT_TLB_FILL or something in their architecture Kconfig
> > files. 
> 
> Since we've dealt with the speculative software side by using RCU-ish
> stuff, the only thing that's left is hardware, now neither sparc64 nor
> ppc actually know about the linux page-tables from what I understood,
> they only look at their hash-table thing.

Some embedded ppc's know about the lowest level (SW loaded PMD) but
that's not an issue here. We flush these special TLB entries
specifically and synchronously in __pte_free_tlb().

> So even if the hardware did do speculative tlb fills, it would do them
> from the hash-table, but that's already cleared out.

Right,

Cheers,
Ben.

> 
> How about something like this
> 
> ---
> Subject: mm: Add missing TLB invalidate to RCU page-table freeing
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Thu Jun 28 00:49:33 CEST 2012
> 
> For normal systems we need a TLB invalidate before freeing the
> page-tables, the generic RCU based page-table freeing code lacked
> this.
> 
> This is because this code originally came from ppc where the hardware
> never walks the linux page-tables and thus this invalidate is not
> required.
> 
> Others, notably s390 which ran into this problem in cd94154cc6a
> ("[S390] fix tlb flushing for page table pages"), do very much need
> this TLB invalidation.
> 
> Therefore add it, with a Kconfig option to disable it so as to not
> unduly slow down PPC and SPARC64 which neither of them need it.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  arch/Kconfig         |    3 +++
>  arch/powerpc/Kconfig |    1 +
>  arch/sparc/Kconfig   |    1 +
>  mm/memory.c          |   18 ++++++++++++++++++
>  4 files changed, 23 insertions(+)
> 
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -231,6 +231,9 @@ config HAVE_ARCH_MUTEX_CPU_RELAX
>  config HAVE_RCU_TABLE_FREE
>  	bool
>  
> +config STRICT_TLB_FILL
> +	bool
> +
>  config ARCH_HAVE_NMI_SAFE_CMPXCHG
>  	bool
>  
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -127,6 +127,7 @@ config PPC
>  	select GENERIC_IRQ_SHOW_LEVEL
>  	select IRQ_FORCED_THREADING
>  	select HAVE_RCU_TABLE_FREE if SMP
> +	select STRICT_TLB_FILL
>  	select HAVE_SYSCALL_TRACEPOINTS
>  	select HAVE_BPF_JIT if PPC64
>  	select HAVE_ARCH_JUMP_LABEL
> --- a/arch/sparc/Kconfig
> +++ b/arch/sparc/Kconfig
> @@ -52,6 +52,7 @@ config SPARC64
>  	select HAVE_KRETPROBES
>  	select HAVE_KPROBES
>  	select HAVE_RCU_TABLE_FREE if SMP
> +	select STRICT_TLB_FILL
>  	select HAVE_MEMBLOCK
>  	select HAVE_MEMBLOCK_NODE_MAP
>  	select HAVE_SYSCALL_WRAPPERS
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -329,11 +329,27 @@ static void tlb_remove_table_rcu(struct 
>  	free_page((unsigned long)batch);
>  }
>  
> +#ifdef CONFIG_STRICT_TLB_FILL
> +/*
> + * Some archictures (sparc64, ppc) cannot refill TLBs after the they've removed
> + * the PTE entries from their hash-table. Their hardware never looks at the
> + * linux page-table structures, so they don't need a hardware TLB invalidate
> + * when tearing down the page-table structure itself.
> + */
> +static inline void tlb_table_flush_mmu(struct mmu_gather *tlb) { }
> +#else
> +static inline void tlb_table_flush_mmu(struct mmu_gather *tlb)
> +{
> +	tlb_flush_mmu(tlb);
> +}
> +#endif
> +
>  void tlb_table_flush(struct mmu_gather *tlb)
>  {
>  	struct mmu_table_batch **batch = &tlb->batch;
>  
>  	if (*batch) {
> +		tlb_table_flush_mmu(tlb);
>  		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
>  		*batch = NULL;
>  	}
> @@ -345,6 +361,7 @@ void tlb_remove_table(struct mmu_gather 
>  
>  	tlb->need_flush = 1;
>  
> +#ifdef CONFIG_STRICT_TLB_FILL
>  	/*
>  	 * When there's less then two users of this mm there cannot be a
>  	 * concurrent page-table walk.
> @@ -353,6 +370,7 @@ void tlb_remove_table(struct mmu_gather 
>  		__tlb_remove_table(table);
>  		return;
>  	}
> +#endif
>  
>  	if (*batch == NULL) {
>  		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing
  2012-06-27 23:33           ` Linus Torvalds
@ 2012-06-28 10:55             ` Peter Zijlstra
  2012-06-28 11:19               ` Martin Schwidefsky
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2012-06-28 10:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Thomas Gleixner, Ingo Molnar,
	akpm, Rik van Riel, Hugh Dickins, Mel Gorman, Nick Piggin,
	Alex Shi, Nikunj A. Dadhania, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, David Miller, Russell King,
	Catalin Marinas, Chris Metcalf, Martin Schwidefsky, Tony Luck,
	Paul Mundt, Jeff Dike, Richard Weinberger, Ralf Baechle,
	Kyle McMartin, James Bottomley, Chris Zankel

On Wed, 2012-06-27 at 16:33 -0700, Linus Torvalds wrote:
> IOW, the point I'm trying to make is that even if there are zero
> *actual* accesses of user space (because user space is dead, and the
> kernel hopefully does no "get_user()/put_user()" stuff at this point
> any more), the CPU may speculatively use user addresses for the
> bog-standard kernel addresses that happen. 

Right.. and s390 having done this only says that s390 appears to be ok
with it. Martin, does s390 hardware guarantee no speculative stuff like
Linus explained, or might there even be a latent issue on s390?

But it looks like we cannot do this in general, and esp. ARM (as already
noted by Catalin) has very aggressive speculative behaviour.

The alternative is that we do a switch_mm() to init_mm instead of the
TLB flush. On x86 that should be about the same cost, but I've not
looked at other architectures yet.

The second and least favourite alternative is of course special casing
this for s390 if it turns out its a safe thing to do for them.

/me goes look through arch code.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing
  2012-06-28  7:09       ` Benjamin Herrenschmidt
@ 2012-06-28 11:05         ` Peter Zijlstra
  2012-06-28 12:00           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2012-06-28 11:05 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, linux-kernel, linux-arch, linux-mm,
	Thomas Gleixner, Ingo Molnar, akpm, Rik van Riel, Hugh Dickins,
	Mel Gorman, Nick Piggin, Alex Shi, Nikunj A. Dadhania,
	Konrad Rzeszutek Wilk, David Miller, Russell King,
	Catalin Marinas, Chris Metcalf, Martin Schwidefsky, Tony Luck,
	Paul Mundt, Jeff Dike, Richard Weinberger, Ralf Baechle,
	Kyle McMartin, James Bottomley, Chris Zankel

On Thu, 2012-06-28 at 17:09 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2012-06-28 at 01:01 +0200, Peter Zijlstra wrote:
> > On Wed, 2012-06-27 at 15:23 -0700, Linus Torvalds wrote:
> > 
> > > Plus it really isn't about hardware page table walkers at all. It's
> > > more about the possibility of speculative TLB fils, it has nothing to
> > > do with *how* they are done. Sure, it's likely that a software
> > > pagetable walker wouldn't be something that gets called speculatively,
> > > but it's not out of the question.
> > > 
> > Hmm, I would call gup_fast() as speculative as we can get in software.
> > It does a lock-less walk of the page-tables. That's what the RCU free'd
> > page-table stuff is for to begin with.
> 
> Strictly speaking it's not :-) To *begin with* (as in the origin of that
> code) it comes from powerpc hash table code which walks the linux page
> tables locklessly :-) It then came in handy with gup_fast :-)

Ah, ok my bad.

> > > IOW, if Sparc/PPC really want to guarantee that they never fill TLB
> > > entries speculatively, and that if we are in a kernel thread they will
> > > *never* fill the TLB with anything else, then make them enable
> > > CONFIG_STRICT_TLB_FILL or something in their architecture Kconfig
> > > files. 
> > 
> > Since we've dealt with the speculative software side by using RCU-ish
> > stuff, the only thing that's left is hardware, now neither sparc64 nor
> > ppc actually know about the linux page-tables from what I understood,
> > they only look at their hash-table thing.
> 
> Some embedded ppc's know about the lowest level (SW loaded PMD) but
> that's not an issue here. We flush these special TLB entries
> specifically and synchronously in __pte_free_tlb().

OK, I missed that.. is that
arch/powerpc/mm/tlb_nohash.c:tlb_flush_pgtable() ?

> > So even if the hardware did do speculative tlb fills, it would do them
> > from the hash-table, but that's already cleared out.
> 
> Right,

Phew at least I got the important thing right ;-)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing
  2012-06-28 10:55             ` Peter Zijlstra
@ 2012-06-28 11:19               ` Martin Schwidefsky
  2012-06-28 11:30                 ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Martin Schwidefsky @ 2012-06-28 11:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, linux-kernel, linux-arch, linux-mm,
	Thomas Gleixner, Ingo Molnar, akpm, Rik van Riel, Hugh Dickins,
	Mel Gorman, Nick Piggin, Alex Shi, Nikunj A. Dadhania,
	Konrad Rzeszutek Wilk, Benjamin Herrenschmidt, David Miller,
	Russell King, Catalin Marinas, Chris Metcalf, Tony Luck,
	Paul Mundt, Jeff Dike, Richard Weinberger, Ralf Baechle,
	Kyle McMartin, James Bottomley, Chris Zankel

On Thu, 28 Jun 2012 12:55:04 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, 2012-06-27 at 16:33 -0700, Linus Torvalds wrote:
> > IOW, the point I'm trying to make is that even if there are zero
> > *actual* accesses of user space (because user space is dead, and the
> > kernel hopefully does no "get_user()/put_user()" stuff at this point
> > any more), the CPU may speculatively use user addresses for the
> > bog-standard kernel addresses that happen. 
> 
> Right.. and s390 having done this only says that s390 appears to be ok
> with it. Martin, does s390 hardware guarantee no speculative stuff like
> Linus explained, or might there even be a latent issue on s390?

The cpu can create speculative TLB entries, but only if it runs in the
mode that uses the respective mm. We have two mm's active at the same
time, the kernel mm (init_mm) and the user mm. While the cpu runs only
in kernel mode it is not allowed to create TLBs for the user mm.
While running in user mode it is allowed to speculatively create TLBs.
 
> But it looks like we cannot do this in general, and esp. ARM (as already
> noted by Catalin) has very aggressive speculative behaviour.
> 
> The alternative is that we do a switch_mm() to init_mm instead of the
> TLB flush. On x86 that should be about the same cost, but I've not
> looked at other architectures yet.
> 
> The second and least favourite alternative is of course special casing
> this for s390 if it turns out its a safe thing to do for them.
> 
> /me goes look through arch code.

Basically we have two special requirements on s390:
1) do not modify ptes while attached to another cpu except with the
   special IPTE / IDTE instructions
2) do a TLB flush before freeing any kind of page table page, s390
   needs a flush for pud, pmd & pte tables.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 08/20] mm: Optimize fullmm TLB flushing
  2012-06-28 11:19               ` Martin Schwidefsky
@ 2012-06-28 11:30                 ` Peter Zijlstra
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2012-06-28 11:30 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Linus Torvalds, linux-kernel, linux-arch, linux-mm,
	Thomas Gleixner, Ingo Molnar, akpm, Rik van Riel, Hugh Dickins,
	Mel Gorman, Nick Piggin, Alex Shi, Nikunj A. Dadhania,
	Konrad Rzeszutek Wilk, Benjamin Herrenschmidt, David Miller,
	Russell King, Catalin Marinas, Chris Metcalf, Tony Luck,
	Paul Mundt, Jeff Dike, Richard Weinberger, Ralf Baechle,
	Kyle McMartin, James Bottomley, Chris Zankel

On Thu, 2012-06-28 at 13:19 +0200, Martin Schwidefsky wrote:

> The cpu can create speculative TLB entries, but only if it runs in the
> mode that uses the respective mm. We have two mm's active at the same
> time, the kernel mm (init_mm) and the user mm. While the cpu runs only
> in kernel mode it is not allowed to create TLBs for the user mm.
> While running in user mode it is allowed to speculatively create TLBs.

OK, that's neat.

> Basically we have two special requirements on s390:
> 1) do not modify ptes while attached to another cpu except with the
>    special IPTE / IDTE instructions

Right, and your fullmm case works by doing a global invalidate after all
threads have ceased userspace execution, this allows you to do away with
the IPTE/IDTE instructions since there's no other active cpus on the
userspace mm anymore.


> 2) do a TLB flush before freeing any kind of page table page, s390
>    needs a flush for pud, pmd & pte tables. 

Right, we do that (now)..

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing
  2012-06-28 11:05         ` Peter Zijlstra
@ 2012-06-28 12:00           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2012-06-28 12:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, linux-kernel, linux-arch, linux-mm,
	Thomas Gleixner, Ingo Molnar, akpm, Rik van Riel, Hugh Dickins,
	Mel Gorman, Nick Piggin, Alex Shi, Nikunj A. Dadhania,
	Konrad Rzeszutek Wilk, David Miller, Russell King,
	Catalin Marinas, Chris Metcalf, Martin Schwidefsky, Tony Luck,
	Paul Mundt, Jeff Dike, Richard Weinberger, Ralf Baechle,
	Kyle McMartin, James Bottomley, Chris Zankel

On Thu, 2012-06-28 at 13:05 +0200, Peter Zijlstra wrote:
> 
> > Some embedded ppc's know about the lowest level (SW loaded PMD) but
> > that's not an issue here. We flush these special TLB entries
> > specifically and synchronously in __pte_free_tlb().
> 
> OK, I missed that.. is that
> arch/powerpc/mm/tlb_nohash.c:tlb_flush_pgtable() ?

Yup.

> > > So even if the hardware did do speculative tlb fills, it would do
> them
> > > from the hash-table, but that's already cleared out.
> > 
> > Right,
> 
> Phew at least I got the important thing right ;-)

Yeah as long as we have that hash :-) The day we move on (if ever) it
will be as bad as ARM :-)

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing
  2012-06-27 23:01     ` [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing Peter Zijlstra
  2012-06-27 23:42       ` Linus Torvalds
  2012-06-28  7:09       ` Benjamin Herrenschmidt
@ 2012-07-24  5:12       ` Nikunj A Dadhania
  2 siblings, 0 replies; 12+ messages in thread
From: Nikunj A Dadhania @ 2012-07-24  5:12 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Thomas Gleixner, Ingo Molnar,
	akpm, Rik van Riel, Hugh Dickins, Mel Gorman, Nick Piggin,
	Alex Shi, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	David Miller, Russell King, Catalin Marinas, Chris Metcalf,
	Martin Schwidefsky, Tony Luck, Paul Mundt, Jeff Dike,
	Richard Weinberger, Ralf Baechle, Kyle McMartin, James Bottomley,
	Chris Zankel

On Thu, 28 Jun 2012 01:01:46 +0200, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
  
> +#ifdef CONFIG_STRICT_TLB_FILL
> +/*
> + * Some archictures (sparc64, ppc) cannot refill TLBs after the they've removed
> + * the PTE entries from their hash-table. Their hardware never looks at the
> + * linux page-table structures, so they don't need a hardware TLB invalidate
> + * when tearing down the page-table structure itself.
> + */
> +static inline void tlb_table_flush_mmu(struct mmu_gather *tlb) { }
> +#else
> +static inline void tlb_table_flush_mmu(struct mmu_gather *tlb)
> +{
> +	tlb_flush_mmu(tlb);
> +}
> +#endif
> +
>  void tlb_table_flush(struct mmu_gather *tlb)
>  {
>  	struct mmu_table_batch **batch = &tlb->batch;
>  
>  	if (*batch) {
> +		tlb_table_flush_mmu(tlb);
>  		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
>  		*batch = NULL;
>  	}

Hi Peter,

When running munmap(https://lkml.org/lkml/2012/5/17/59) test with KVM
and pvflush patches I got a crash. I have verified that the crash
happens on the base(non virt) as well when I have
CONFIG_HAVE_RCU_TABLE_FREE defined. Here is the crash details and my
analysis below:

-----------------------------------------------------------------------

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff810d31d9>] __call_rcu+0x29/0x1c0
PGD 0 
Oops: 0002 [#1] SMP 
CPU 24 
Modules linked in: kvm_intel kvm [last unloaded: scsi_wait_scan]


Pid: 32643, comm: munmap Not tainted 3.5.0-rc7+ #46 IBM System x3850 X5 -[7042CR6]-[root@mx3850x5 ~/Node 1, Processor Card]# 
RIP: 0010:[<ffffffff810d31d9>]  [<ffffffff810d31d9>] __call_rcu+0x29/0x1c0
RSP: 0018:ffff88203164fc28  EFLAGS: 00010246
RAX: ffff88203164fba8 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffffffff81e34280 RSI: ffffffff81130330 RDI: 0000000000000000
RBP: ffff88203164fc58 R08: ffffea00d2680340 R09: 0000000000000000
R10: ffff883c7fbd4ef8 R11: 0000000000000078 R12: ffffffff81130330
R13: 00007f09ee803000 R14: ffff883c2fa5bab0 R15: ffff88203164fe08
FS:  00007f09ee7ee700(0000) GS:ffff883c7fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000001e0b000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process munmap (pid: 32643, threadinfo ffff88203164e000, task ffff882030458a70)
Stack:
 ffff883c2fa5bab0 ffff88203164fe08 ffff88203164fc68 ffff88203164fe08
 ffff88203164fe08 00007f09ee803000 ffff88203164fc68 ffffffff810d33c7
 ffff88203164fc88 ffffffff81130e0d ffff88203164fc88 ffffea00d28e54f8
Call Trace:
 [<ffffffff810d33c7>] call_rcu_sched+0x17/0x20
 [<ffffffff81130e0d>] tlb_table_flush+0x2d/0x40
 [<ffffffff81130e80>] tlb_remove_table+0x60/0xc0
 [<ffffffff8103a5e3>] ___pte_free_tlb+0x63/0x70
 [<ffffffff81131b38>] free_pgd_range+0x298/0x4b0
 [<ffffffff81131e1e>] free_pgtables+0xce/0x120
 [<ffffffff81137247>] exit_mmap+0xa7/0x160
 [<ffffffff81043fdf>] mmput+0x6f/0xf0
 [<ffffffff8104c3f5>] exit_mm+0x105/0x130
 [<ffffffff810d6c7d>] ? taskstats_exit+0x17d/0x240
 [<ffffffff8104c596>] do_exit+0x176/0x480
 [<ffffffff8104c8f5>] do_group_exit+0x55/0xd0
 [<ffffffff8104c987>] sys_exit_group+0x17/0x20
 [<ffffffff818a3829>] system_call_fastpath+0x16/0x1b
Code: ff ff 55 48 89 e5 48 83 ec 30 48 89 5d e8 4c 89 65 f0 4c 89 6d f8 66 66 66 66 90 40 f6 c7 03 48 89 fb 49 89 f4 0f 85 19 01 00 00 <4c> 89 63 08 48 c7 03 00 00 00 00 0f ae f0 9c 58 66 66 90 66 90 
RIP  [<ffffffff810d31d9>] __call_rcu+0x29/0x1c0
 RSP <ffff88203164fc28>
CR2: 0000000000000008
---[ end trace 3ed30a91ea7cb375 ]---

----------------------------------------------------------------------------

I think this is what is happening:

___pte_free_tlb
   tlb_remove_table
      tlb_table_flush
         tlb_table_flush_mmu
            tlb_flush_mmu
                Sets need_flush = 0
                tlb_table_flush (if CONFIG_HAVE_RCU_TABLE_FREE)
                    [Gets called twice with same *tlb!]

                    tlb_table_flush_mmu
                        tlb_flush_mmu(nop as need_flush is 0)
                    call_rcu_sched(&(*batch)->rcu,...);
                    *batch = NULL;
         call_rcu_sched(&(*batch)->rcu,...); <---- *batch would be NULL

I verified this by putting following fix and do not see the crash
anymore:

diff --git a/mm/memory.c b/mm/memory.c
index 1797bc1..329fcb9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -367,7 +367,8 @@ void tlb_table_flush(struct mmu_gather *tlb)
 
 	if (*batch) {
 		tlb_table_flush_mmu(tlb);
-		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
+		if(*batch)
+			call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
 		*batch = NULL;
 	}
 }

Thanks
Nikunj


^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2012-07-24  5:13 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20120627211540.459910855@chello.nl>
     [not found] ` <20120627212830.693232452@chello.nl>
     [not found]   ` <CA+55aFwa41fzvx8EZG_gODvw7hSpr+iP+w5fXp6jUcQh-4nFgQ@mail.gmail.com>
2012-06-27 23:01     ` [PATCH 02/20] mm: Add optional TLB flush to generic RCU page-table freeing Peter Zijlstra
2012-06-27 23:42       ` Linus Torvalds
2012-06-28  7:09       ` Benjamin Herrenschmidt
2012-06-28 11:05         ` Peter Zijlstra
2012-06-28 12:00           ` Benjamin Herrenschmidt
2012-07-24  5:12       ` Nikunj A Dadhania
     [not found] ` <20120627212831.137126018@chello.nl>
     [not found]   ` <CA+55aFwZoVK76ue7tFveV0XZpPUmoCVXJx8550OxPm+XKCSSZA@mail.gmail.com>
     [not found]     ` <1340838154.10063.86.camel@twins>
2012-06-27 23:13       ` [PATCH 08/20] mm: Optimize fullmm TLB flushing Peter Zijlstra
2012-06-27 23:23         ` Linus Torvalds
2012-06-27 23:33           ` Linus Torvalds
2012-06-28 10:55             ` Peter Zijlstra
2012-06-28 11:19               ` Martin Schwidefsky
2012-06-28 11:30                 ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).