[PATCH] mm: fix lazy vmap purging (use-after-free error)

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm: fix lazy vmap purging (use-after-free error)
@ 2009-02-20 13:41 Vegard Nossum
  2009-02-20 13:50 ` Ingo Molnar
  0 siblings, 1 reply; 46+ messages in thread
From: Vegard Nossum @ 2009-02-20 13:41 UTC (permalink / raw)
  To: Andrew Morton, Nick Piggin; +Cc: Ingo Molnar, Pekka Enberg, linux-kernel

>From f90fae887ed82bc9369e9f95960e175fef3e5d97 Mon Sep 17 00:00:00 2001
From: Vegard Nossum <vegard.nossum@gmail.com>
Date: Fri, 20 Feb 2009 14:35:09 +0100
Subject: [PATCH] mm: fix lazy vmap purging (use-after-free error)

I just got this new warning from kmemcheck:

    WARNING: kmemcheck: Caught 32-bit read from freed memory (c7806a60)
    a06a80c7ecde70c1a04080c700000000a06709c1000000000000000000000000
     f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f
     ^

    Pid: 0, comm: swapper Not tainted (2.6.29-rc4 #230)
    EIP: 0060:[<c1096df7>] EFLAGS: 00000286 CPU: 0
    EIP is at __purge_vmap_area_lazy+0x117/0x140
    EAX: 00070f43 EBX: c7806a40 ECX: c1677080 EDX: 00027b66
    ESI: 00002001 EDI: c170df0c EBP: c170df00 ESP: c178830c
     DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    CR0: 80050033 CR2: c7806b14 CR3: 01775000 CR4: 00000690
    DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    DR6: 00004000 DR7: 00000000
     [<c1096f3e>] free_unmap_vmap_area_noflush+0x6e/0x70
     [<c1096f6a>] remove_vm_area+0x2a/0x70
     [<c1097025>] __vunmap+0x45/0xe0
     [<c10970de>] vunmap+0x1e/0x30
     [<c1008ba5>] text_poke+0x95/0x150
     [<c1008ca9>] alternatives_smp_unlock+0x49/0x60
     [<c171ef47>] alternative_instructions+0x11b/0x124
     [<c171f991>] check_bugs+0xbd/0xdc
     [<c17148c5>] start_kernel+0x2ed/0x360
     [<c171409e>] __init_begin+0x9e/0xa9
     [<ffffffff>] 0xffffffff

It happened here:

    $ addr2line -e vmlinux -i c1096df7
    mm/vmalloc.c:540

Code:

	list_for_each_entry(va, &valist, purge_list)
		__free_vmap_area(va);

It's this instruction:

    mov    0x20(%ebx),%edx

Which corresponds to a dereference of va->purge_list.next:

    (gdb) p ((struct vmap_area *) 0)->purge_list.next
    Cannot access memory at address 0x20

It seems that we should use "safe" list traversal here, as the element
is freed inside the loop. Please verify that this is the right fix.

Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
---
 mm/vmalloc.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 75f49d3..cda20b2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -498,6 +498,7 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
 	static DEFINE_SPINLOCK(purge_lock);
 	LIST_HEAD(valist);
 	struct vmap_area *va;
+	struct vmap_area *n_va;
 	int nr = 0;
 
 	/*
@@ -537,7 +538,7 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
 
 	if (nr) {
 		spin_lock(&vmap_area_lock);
-		list_for_each_entry(va, &valist, purge_list)
+		list_for_each_entry_safe(va, n_va, &valist, purge_list)
 			__free_vmap_area(va);
 		spin_unlock(&vmap_area_lock);
 	}
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 13:41 [PATCH] mm: fix lazy vmap purging (use-after-free error) Vegard Nossum
@ 2009-02-20 13:50 ` Ingo Molnar
  2009-02-20 13:58   ` Pekka Enberg
  2009-02-20 14:01   ` Ingo Molnar
  0 siblings, 2 replies; 46+ messages in thread
From: Ingo Molnar @ 2009-02-20 13:50 UTC (permalink / raw)
  To: Vegard Nossum, stable
  Cc: Andrew Morton, Nick Piggin, Pekka Enberg, linux-kernel


* Vegard Nossum <vegard.nossum@gmail.com> wrote:

> >From f90fae887ed82bc9369e9f95960e175fef3e5d97 Mon Sep 17 00:00:00 2001
> From: Vegard Nossum <vegard.nossum@gmail.com>
> Date: Fri, 20 Feb 2009 14:35:09 +0100
> Subject: [PATCH] mm: fix lazy vmap purging (use-after-free error)
> 
> I just got this new warning from kmemcheck:
> 
>     WARNING: kmemcheck: Caught 32-bit read from freed memory (c7806a60)
>     a06a80c7ecde70c1a04080c700000000a06709c1000000000000000000000000
>      f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f
>      ^
> 
>     Pid: 0, comm: swapper Not tainted (2.6.29-rc4 #230)
>     EIP: 0060:[<c1096df7>] EFLAGS: 00000286 CPU: 0
>     EIP is at __purge_vmap_area_lazy+0x117/0x140
>     EAX: 00070f43 EBX: c7806a40 ECX: c1677080 EDX: 00027b66
>     ESI: 00002001 EDI: c170df0c EBP: c170df00 ESP: c178830c
>      DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
>     CR0: 80050033 CR2: c7806b14 CR3: 01775000 CR4: 00000690
>     DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
>     DR6: 00004000 DR7: 00000000
>      [<c1096f3e>] free_unmap_vmap_area_noflush+0x6e/0x70
>      [<c1096f6a>] remove_vm_area+0x2a/0x70
>      [<c1097025>] __vunmap+0x45/0xe0
>      [<c10970de>] vunmap+0x1e/0x30
>      [<c1008ba5>] text_poke+0x95/0x150
>      [<c1008ca9>] alternatives_smp_unlock+0x49/0x60
>      [<c171ef47>] alternative_instructions+0x11b/0x124
>      [<c171f991>] check_bugs+0xbd/0xdc
>      [<c17148c5>] start_kernel+0x2ed/0x360
>      [<c171409e>] __init_begin+0x9e/0xa9
>      [<ffffffff>] 0xffffffff
> 
> It happened here:
> 
>     $ addr2line -e vmlinux -i c1096df7
>     mm/vmalloc.c:540
> 
> Code:
> 
> 	list_for_each_entry(va, &valist, purge_list)
> 		__free_vmap_area(va);
> 
> It's this instruction:
> 
>     mov    0x20(%ebx),%edx
> 
> Which corresponds to a dereference of va->purge_list.next:
> 
>     (gdb) p ((struct vmap_area *) 0)->purge_list.next
>     Cannot access memory at address 0x20
> 
> It seems that we should use "safe" list traversal here, as the element
> is freed inside the loop. Please verify that this is the right fix.
> 
> Cc: Nick Piggin <npiggin@suse.de>
> Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
> ---
>  mm/vmalloc.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 75f49d3..cda20b2 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -498,6 +498,7 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>  	static DEFINE_SPINLOCK(purge_lock);
>  	LIST_HEAD(valist);
>  	struct vmap_area *va;
> +	struct vmap_area *n_va;
>  	int nr = 0;
>  
>  	/*
> @@ -537,7 +538,7 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>  
>  	if (nr) {
>  		spin_lock(&vmap_area_lock);
> -		list_for_each_entry(va, &valist, purge_list)
> +		list_for_each_entry_safe(va, n_va, &valist, purge_list)
>  			__free_vmap_area(va);
>  		spin_unlock(&vmap_area_lock);

ah, indeed:

        list_del_rcu(&va->list);

i suspect it could be hit big time in a workload that opens more 
than 512 files, as expand_files() uses a vmalloc()+vfree() pair 
in that case.

Nice catch! .29 must-have. The bug was introduced in 
v2.6.27-5616-gdb64fe0:

  db64fe0: mm: rewrite vmap layer

So 2.6.28 is affected too.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 13:50 ` Ingo Molnar
@ 2009-02-20 13:58   ` Pekka Enberg
  2009-02-20 14:01   ` Ingo Molnar
  1 sibling, 0 replies; 46+ messages in thread
From: Pekka Enberg @ 2009-02-20 13:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vegard Nossum, stable, Andrew Morton, Nick Piggin, linux-kernel

On Fri, Feb 20, 2009 at 3:50 PM, Ingo Molnar <mingo@elte.hu> wrote:
>> @@ -537,7 +538,7 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>>
>>       if (nr) {
>>               spin_lock(&vmap_area_lock);
>> -             list_for_each_entry(va, &valist, purge_list)
>> +             list_for_each_entry_safe(va, n_va, &valist, purge_list)
>>                       __free_vmap_area(va);
>>               spin_unlock(&vmap_area_lock);
>
> ah, indeed:
>
>        list_del_rcu(&va->list);
>
> i suspect it could be hit big time in a workload that opens more
> than 512 files, as expand_files() uses a vmalloc()+vfree() pair
> in that case.
>
> Nice catch! .29 must-have. The bug was introduced in
> v2.6.27-5616-gdb64fe0:
>
>  db64fe0: mm: rewrite vmap layer
>
> So 2.6.28 is affected too.
>
> Acked-by: Ingo Molnar <mingo@elte.hu>

Oh, I wish more people would run their code under kmemcheck... ;-)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 13:50 ` Ingo Molnar
  2009-02-20 13:58   ` Pekka Enberg
@ 2009-02-20 14:01   ` Ingo Molnar
  2009-02-20 14:18     ` Pekka Enberg
                       ` (2 more replies)
  1 sibling, 3 replies; 46+ messages in thread
From: Ingo Molnar @ 2009-02-20 14:01 UTC (permalink / raw)
  To: Vegard Nossum, stable
  Cc: Andrew Morton, Nick Piggin, Pekka Enberg, linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> ah, indeed:
> 
>         list_del_rcu(&va->list);
> 
> i suspect it could be hit big time in a workload that opens 
> more than 512 files, as expand_files() uses a 
> vmalloc()+vfree() pair in that case.

hm, perhaps it's not a problem after all. The freeing is done 
via rcu, and list_del_rcu() leaves the forward pointer intact.

So how did it happen that the entry got kfree()d before the loop 
was done? We are in a spinlocked section so the CPU should not 
have entered rcu processing.

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 14:01   ` Ingo Molnar
@ 2009-02-20 14:18     ` Pekka Enberg
  2009-02-20 15:41       ` Paul E. McKenney
  2009-02-20 14:51     ` Vegard Nossum
  2009-02-20 15:56     ` Paul E. McKenney
  2 siblings, 1 reply; 46+ messages in thread
From: Pekka Enberg @ 2009-02-20 14:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vegard Nossum, stable, Andrew Morton, Nick Piggin, linux-kernel,
	paulmck

On Fri, 2009-02-20 at 15:01 +0100, Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > ah, indeed:
> > 
> >         list_del_rcu(&va->list);
> > 
> > i suspect it could be hit big time in a workload that opens 
> > more than 512 files, as expand_files() uses a 
> > vmalloc()+vfree() pair in that case.
> 
> hm, perhaps it's not a problem after all. The freeing is done 
> via rcu, and list_del_rcu() leaves the forward pointer intact.
> 
> So how did it happen that the entry got kfree()d before the loop 
> was done? We are in a spinlocked section so the CPU should not 
> have entered rcu processing.

RCU. Lets CC Paul.

Looking at it, Documentation/RCU/checklist.txt states that:

7.      If the updater uses call_rcu(), then the corresponding readers
        must use rcu_read_lock() and rcu_read_unlock().

which we don't do for this loop. I fail to see how it could be a
kmemcheck false positive, so it's probably a real bug.

			Pekka


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 14:01   ` Ingo Molnar
  2009-02-20 14:18     ` Pekka Enberg
@ 2009-02-20 14:51     ` Vegard Nossum
  2009-02-20 15:46       ` Paul E. McKenney
  2009-02-20 16:01       ` Ingo Molnar
  2009-02-20 15:56     ` Paul E. McKenney
  2 siblings, 2 replies; 46+ messages in thread
From: Vegard Nossum @ 2009-02-20 14:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: stable, Andrew Morton, Nick Piggin, Pekka Enberg, linux-kernel

2009/2/20 Ingo Molnar <mingo@elte.hu>:
>
> * Ingo Molnar <mingo@elte.hu> wrote:
>
>> ah, indeed:
>>
>>         list_del_rcu(&va->list);
>>
>> i suspect it could be hit big time in a workload that opens
>> more than 512 files, as expand_files() uses a
>> vmalloc()+vfree() pair in that case.
>
> hm, perhaps it's not a problem after all. The freeing is done
> via rcu, and list_del_rcu() leaves the forward pointer intact.

Well, it's not the particular line that you posted, in any case.
That's &va->list, but the traversed list is &va->purge_list.

I thought it would be the line:

        call_rcu(&va->rcu_head, rcu_free_va);

(which does kfree() in the callback) that was the problem.

>
> So how did it happen that the entry got kfree()d before the loop
> was done? We are in a spinlocked section so the CPU should not
> have entered rcu processing.

I added some printks to __free_vmap_area() and rcu_free_va(), and it
shows that the kfree() is being called immediately (inside the list
traversal). So the call_rcu() is happening immediately (or almost
immediately).

If I've understood correctly, the RCU processing can happen inside a
spinlock, as long as interrupts are enabled. (Won't the timer IRQ
trigger softirq processing, which triggers RCU callback processing,
for example?)

And interrupts are enabled when this happens: EFLAGS: 00000292

Please correct me if I am wrong!

Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 14:18     ` Pekka Enberg
@ 2009-02-20 15:41       ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-20 15:41 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Ingo Molnar, Vegard Nossum, stable, Andrew Morton, Nick Piggin,
	linux-kernel

On Fri, Feb 20, 2009 at 04:18:12PM +0200, Pekka Enberg wrote:
> On Fri, 2009-02-20 at 15:01 +0100, Ingo Molnar wrote:
> > * Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > ah, indeed:
> > > 
> > >         list_del_rcu(&va->list);
> > > 
> > > i suspect it could be hit big time in a workload that opens 
> > > more than 512 files, as expand_files() uses a 
> > > vmalloc()+vfree() pair in that case.
> > 
> > hm, perhaps it's not a problem after all. The freeing is done 
> > via rcu, and list_del_rcu() leaves the forward pointer intact.
> > 
> > So how did it happen that the entry got kfree()d before the loop 
> > was done? We are in a spinlocked section so the CPU should not 
> > have entered rcu processing.
> 
> RCU. Lets CC Paul.
> 
> Looking at it, Documentation/RCU/checklist.txt states that:
> 
> 7.      If the updater uses call_rcu(), then the corresponding readers
>         must use rcu_read_lock() and rcu_read_unlock().
> 
> which we don't do for this loop. I fail to see how it could be a
> kmemcheck false positive, so it's probably a real bug.

And checklist.txt rule #7 is even more important in preemptable kernels,
especially if CONFIG_PREEMPT_RT.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 14:51     ` Vegard Nossum
@ 2009-02-20 15:46       ` Paul E. McKenney
  2009-02-20 16:04         ` Ingo Molnar
  2009-02-20 23:51         ` Vegard Nossum
  2009-02-20 16:01       ` Ingo Molnar
  1 sibling, 2 replies; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-20 15:46 UTC (permalink / raw)
  To: Vegard Nossum
  Cc: Ingo Molnar, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

On Fri, Feb 20, 2009 at 03:51:28PM +0100, Vegard Nossum wrote:
> 2009/2/20 Ingo Molnar <mingo@elte.hu>:
> >
> > * Ingo Molnar <mingo@elte.hu> wrote:
> >
> >> ah, indeed:
> >>
> >>         list_del_rcu(&va->list);
> >>
> >> i suspect it could be hit big time in a workload that opens
> >> more than 512 files, as expand_files() uses a
> >> vmalloc()+vfree() pair in that case.
> >
> > hm, perhaps it's not a problem after all. The freeing is done
> > via rcu, and list_del_rcu() leaves the forward pointer intact.
> 
> Well, it's not the particular line that you posted, in any case.
> That's &va->list, but the traversed list is &va->purge_list.
> 
> I thought it would be the line:
> 
>         call_rcu(&va->rcu_head, rcu_free_va);
> 
> (which does kfree() in the callback) that was the problem.
> 
> > So how did it happen that the entry got kfree()d before the loop
> > was done? We are in a spinlocked section so the CPU should not
> > have entered rcu processing.
> 
> I added some printks to __free_vmap_area() and rcu_free_va(), and it
> shows that the kfree() is being called immediately (inside the list
> traversal). So the call_rcu() is happening immediately (or almost
> immediately).
> 
> If I've understood correctly, the RCU processing can happen inside a
> spinlock, as long as interrupts are enabled. (Won't the timer IRQ
> trigger softirq processing, which triggers RCU callback processing,
> for example?)
> 
> And interrupts are enabled when this happens: EFLAGS: 00000292
> 
> Please correct me if I am wrong!

If you are using preemptable RCU, and if the read side accesses are not
protected by rcu_read_lock(), this can happen.  At least for values of
"immediately" in the millisecond range.

If you were using classic or hierarchical RCU, the fact that the
call_rcu() is within a spinlock (as opposed to mutex) critical section
should prevent the grace period from ending.

So, what flavor of RCU were you using?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 14:01   ` Ingo Molnar
  2009-02-20 14:18     ` Pekka Enberg
  2009-02-20 14:51     ` Vegard Nossum
@ 2009-02-20 15:56     ` Paul E. McKenney
  2 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-20 15:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vegard Nossum, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

On Fri, Feb 20, 2009 at 03:01:57PM +0100, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > ah, indeed:
> > 
> >         list_del_rcu(&va->list);
> > 
> > i suspect it could be hit big time in a workload that opens 
> > more than 512 files, as expand_files() uses a 
> > vmalloc()+vfree() pair in that case.
> 
> hm, perhaps it's not a problem after all. The freeing is done 
> via rcu, and list_del_rcu() leaves the forward pointer intact.
> 
> So how did it happen that the entry got kfree()d before the loop 
> was done? We are in a spinlocked section so the CPU should not 
> have entered rcu processing.

Good point -- is the .config file available?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 14:51     ` Vegard Nossum
  2009-02-20 15:46       ` Paul E. McKenney
@ 2009-02-20 16:01       ` Ingo Molnar
  2009-02-20 16:49         ` Paul E. McKenney
  1 sibling, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2009-02-20 16:01 UTC (permalink / raw)
  To: Vegard Nossum
  Cc: stable, Andrew Morton, Nick Piggin, Pekka Enberg, linux-kernel


* Vegard Nossum <vegard.nossum@gmail.com> wrote:

> 2009/2/20 Ingo Molnar <mingo@elte.hu>:
> >
> > * Ingo Molnar <mingo@elte.hu> wrote:
> >
> >> ah, indeed:
> >>
> >>         list_del_rcu(&va->list);
> >>
> >> i suspect it could be hit big time in a workload that opens
> >> more than 512 files, as expand_files() uses a
> >> vmalloc()+vfree() pair in that case.
> >
> > hm, perhaps it's not a problem after all. The freeing is done
> > via rcu, and list_del_rcu() leaves the forward pointer intact.
> 
> Well, it's not the particular line that you posted, in any case.
> That's &va->list, but the traversed list is &va->purge_list.
> 
> I thought it would be the line:
> 
>         call_rcu(&va->rcu_head, rcu_free_va);
> 
> (which does kfree() in the callback) that was the problem.
> 
> >
> > So how did it happen that the entry got kfree()d before the loop
> > was done? We are in a spinlocked section so the CPU should not
> > have entered rcu processing.
> 
> I added some printks to __free_vmap_area() and rcu_free_va(), and it
> shows that the kfree() is being called immediately (inside the list
> traversal). So the call_rcu() is happening immediately (or almost
> immediately).
> 
> If I've understood correctly, the RCU processing can happen inside a
> spinlock, as long as interrupts are enabled. (Won't the timer IRQ
> trigger softirq processing, which triggers RCU callback processing,
> for example?)
> 
> And interrupts are enabled when this happens: EFLAGS: 00000292
> 
> Please correct me if I am wrong!

The timer irq will do RCU garbage-collection - but only of 
entries where a grace period has passed.

Otherwise there would be no point in using RCU at all, if the 
kfree() can happen immediately. RCU is about delaying action, 
and doing it at a point in time when we sure are in a quiescent 
state. (we have done a context-switch or scheduled to idle)

The question is, is this piece of loop traversal code 
preemptible? I dont think it is since it's embedded in a 
spinlock:

                spin_lock(&vmap_area_lock);
-               list_for_each_entry(va, &valist, purge_list)
+               list_for_each_entry_safe(va, n_va, &valist, purge_list)
                       	__free_vmap_area(va);
                spin_unlock(&vmap_area_lock);

[ on -rt this could be preemptible and this would be a real fix 
  there. I just dont see how the kfree() can execute on 
  mainline. Obviously it did, since you got the kmemcheck 
  assert. ]

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 15:46       ` Paul E. McKenney
@ 2009-02-20 16:04         ` Ingo Molnar
  2009-02-20 16:44           ` Paul E. McKenney
  2009-02-20 23:51         ` Vegard Nossum
  1 sibling, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2009-02-20 16:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Vegard Nossum, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel


* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> On Fri, Feb 20, 2009 at 03:51:28PM +0100, Vegard Nossum wrote:
> > 2009/2/20 Ingo Molnar <mingo@elte.hu>:
> > >
> > > * Ingo Molnar <mingo@elte.hu> wrote:
> > >
> > >> ah, indeed:
> > >>
> > >>         list_del_rcu(&va->list);
> > >>
> > >> i suspect it could be hit big time in a workload that opens
> > >> more than 512 files, as expand_files() uses a
> > >> vmalloc()+vfree() pair in that case.
> > >
> > > hm, perhaps it's not a problem after all. The freeing is done
> > > via rcu, and list_del_rcu() leaves the forward pointer intact.
> > 
> > Well, it's not the particular line that you posted, in any case.
> > That's &va->list, but the traversed list is &va->purge_list.
> > 
> > I thought it would be the line:
> > 
> >         call_rcu(&va->rcu_head, rcu_free_va);
> > 
> > (which does kfree() in the callback) that was the problem.
> > 
> > > So how did it happen that the entry got kfree()d before the loop
> > > was done? We are in a spinlocked section so the CPU should not
> > > have entered rcu processing.
> > 
> > I added some printks to __free_vmap_area() and rcu_free_va(), and it
> > shows that the kfree() is being called immediately (inside the list
> > traversal). So the call_rcu() is happening immediately (or almost
> > immediately).
> > 
> > If I've understood correctly, the RCU processing can happen inside a
> > spinlock, as long as interrupts are enabled. (Won't the timer IRQ
> > trigger softirq processing, which triggers RCU callback processing,
> > for example?)
> > 
> > And interrupts are enabled when this happens: EFLAGS: 00000292
> > 
> > Please correct me if I am wrong!
> 
> If you are using preemptable RCU, and if the read side 
> accesses are not protected by rcu_read_lock(), this can 
> happen.  At least for values of "immediately" in the 
> millisecond range.
> 
> If you were using classic or hierarchical RCU, the fact that 
> the call_rcu() is within a spinlock (as opposed to mutex) 
> critical section should prevent the grace period from ending.
> 
> So, what flavor of RCU were you using?

well, even in preemptible RCU the grace period should be 
extended as long as we are non-preempt (which we are here), 
correct?

Preemptible RCU does make an rcu_read_lock() critical section 
preemptible, so if this were protected by rcu_read_lock() it 
would be a bug. But it does not make spin_lock() section 
preemptible, and this is a spin_lock(&vmap_area_lock) section:

                spin_lock(&vmap_area_lock);
-               list_for_each_entry(va, &valist, purge_list)
+               list_for_each_entry_safe(va, n_va, &valist, purge_list)
                       	__free_vmap_area(va);
                spin_unlock(&vmap_area_lock);

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 16:04         ` Ingo Molnar
@ 2009-02-20 16:44           ` Paul E. McKenney
  2009-02-20 17:14             ` Ingo Molnar
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-20 16:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vegard Nossum, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

On Fri, Feb 20, 2009 at 05:04:09PM +0100, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Fri, Feb 20, 2009 at 03:51:28PM +0100, Vegard Nossum wrote:
> > > 2009/2/20 Ingo Molnar <mingo@elte.hu>:
> > > >
> > > > * Ingo Molnar <mingo@elte.hu> wrote:
> > > >
> > > >> ah, indeed:
> > > >>
> > > >>         list_del_rcu(&va->list);
> > > >>
> > > >> i suspect it could be hit big time in a workload that opens
> > > >> more than 512 files, as expand_files() uses a
> > > >> vmalloc()+vfree() pair in that case.
> > > >
> > > > hm, perhaps it's not a problem after all. The freeing is done
> > > > via rcu, and list_del_rcu() leaves the forward pointer intact.
> > > 
> > > Well, it's not the particular line that you posted, in any case.
> > > That's &va->list, but the traversed list is &va->purge_list.
> > > 
> > > I thought it would be the line:
> > > 
> > >         call_rcu(&va->rcu_head, rcu_free_va);
> > > 
> > > (which does kfree() in the callback) that was the problem.
> > > 
> > > > So how did it happen that the entry got kfree()d before the loop
> > > > was done? We are in a spinlocked section so the CPU should not
> > > > have entered rcu processing.
> > > 
> > > I added some printks to __free_vmap_area() and rcu_free_va(), and it
> > > shows that the kfree() is being called immediately (inside the list
> > > traversal). So the call_rcu() is happening immediately (or almost
> > > immediately).
> > > 
> > > If I've understood correctly, the RCU processing can happen inside a
> > > spinlock, as long as interrupts are enabled. (Won't the timer IRQ
> > > trigger softirq processing, which triggers RCU callback processing,
> > > for example?)
> > > 
> > > And interrupts are enabled when this happens: EFLAGS: 00000292
> > > 
> > > Please correct me if I am wrong!
> > 
> > If you are using preemptable RCU, and if the read side 
> > accesses are not protected by rcu_read_lock(), this can 
> > happen.  At least for values of "immediately" in the 
> > millisecond range.
> > 
> > If you were using classic or hierarchical RCU, the fact that 
> > the call_rcu() is within a spinlock (as opposed to mutex) 
> > critical section should prevent the grace period from ending.
> > 
> > So, what flavor of RCU were you using?
> 
> well, even in preemptible RCU the grace period should be 
> extended as long as we are non-preempt (which we are here), 
> correct?

Given Classic RCU, you are quite correct.  With preemptable RCU, if
there are no readers, and if irqs are enabled, the grace period could
end within the spinlock's critical section.  Of course, the spinlock
would need to be held for an improbably long time -- many milliseconds.

The only thing that is -guaranteed- to block RCU read-side critical
sections is some task being in an RCU read-side critical section.

The way a preemptable RCU grace period could end, even with some CPU
having preemption disabled, is as follows:

o	Because there are no RCU readers anywhere in the system,
	both sides of the per-CPU rcu_flipctr[] array sum to zero.

o	Because irqs are enabled, scheduling-clock interrupts
	still happen.  The state machine sequences as follows:

	o	rcu_try_flip_idle() on the CPU that did the call_rcu()
		sees that rcu_pending() returns 1, and therefore
		increments the rcu_ctrlblk.completed counter, starting a
		new grace period.  It also sets each CPU's rcu_flip_flag
		to rcu_flipped, which requests acknowledgement of the
		counter increment.

	o	Each CPU's scheduling-clock interrupt will note that
		the rcu_ctrlblk.completed counter has advanced, and
		will therefore invoke __rcu_advance_callbacks, which
		will set that CPU's rcu_flip_flag to rcu_flip_seen.

	o	The scheduling-clock interrupt (on any CPU) following
		the last __rcu_advance_callbacks() will now invoke
		rcu_try_flip_waitack(), which will find that all CPUs'
		rcu_mb_flag values are rcu_mb_done.  It will return
		non-zero indicating that the state machine should advance.

	o	The next scheduling-clock interrupt will therefore invoke
		rcu_try_flip_waitzero(), which, because there are no
		RCU readers, will find that the sum of the rcu_flipctr[]
		entries will be zero.  This function will therefore
		return non-zero indicating that the state machine should
		advance.  Before returning, it will set each CPU's
		rcu_mb_flag to rcu_mb_needed to indicate that each CPU
		must execute a memory 

	o	Each CPU's rcu_check_mb(), also invoked during the
		scheduling-clock interrupt, executes a memory barrier
		and sets the per-CPU rcu_mb_flag to rcu_mb_done.

	o	The scheduling-clock interrupt (on any CPU) following the
		last rcu_check_mb() will now invoke rcu_try_flip_waitmb(),
		which iwll find that all CPUs' rcu_mb_flag variables are
		equal to rcu_mb_done, and will therefore return non-zero.

	This process will repeat, and then the callback will be invoked,
	freeing the element.

Unlike Classic RCU, disabled preemption does not block RCU grace periods.
Disabled irqs currently happen to block preemptable RCU grace periods,
but that is simply a side-effect of the implementation.  It is not
guaranteed, and any code that disables irqs to block RCU grace periods
should be viewed with great suspicion, if not viewed as buggy.

> Preemptible RCU does make an rcu_read_lock() critical section 
> preemptible, so if this were protected by rcu_read_lock() it 
> would be a bug. But it does not make spin_lock() section 
> preemptible, and this is a spin_lock(&vmap_area_lock) section:
> 
>                 spin_lock(&vmap_area_lock);
> -               list_for_each_entry(va, &valist, purge_list)
> +               list_for_each_entry_safe(va, n_va, &valist, purge_list)
>                        	__free_vmap_area(va);
>                 spin_unlock(&vmap_area_lock);

Again, the only thing -guaranteed- to block preemptable RCU grace
periods is some task residing in an RCU read-side critical section.

						Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 16:01       ` Ingo Molnar
@ 2009-02-20 16:49         ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-20 16:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vegard Nossum, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

On Fri, Feb 20, 2009 at 05:01:57PM +0100, Ingo Molnar wrote:
> 
> * Vegard Nossum <vegard.nossum@gmail.com> wrote:
> 
> > 2009/2/20 Ingo Molnar <mingo@elte.hu>:
> > >
> > > * Ingo Molnar <mingo@elte.hu> wrote:
> > >
> > >> ah, indeed:
> > >>
> > >>         list_del_rcu(&va->list);
> > >>
> > >> i suspect it could be hit big time in a workload that opens
> > >> more than 512 files, as expand_files() uses a
> > >> vmalloc()+vfree() pair in that case.
> > >
> > > hm, perhaps it's not a problem after all. The freeing is done
> > > via rcu, and list_del_rcu() leaves the forward pointer intact.
> > 
> > Well, it's not the particular line that you posted, in any case.
> > That's &va->list, but the traversed list is &va->purge_list.
> > 
> > I thought it would be the line:
> > 
> >         call_rcu(&va->rcu_head, rcu_free_va);
> > 
> > (which does kfree() in the callback) that was the problem.
> > 
> > >
> > > So how did it happen that the entry got kfree()d before the loop
> > > was done? We are in a spinlocked section so the CPU should not
> > > have entered rcu processing.
> > 
> > I added some printks to __free_vmap_area() and rcu_free_va(), and it
> > shows that the kfree() is being called immediately (inside the list
> > traversal). So the call_rcu() is happening immediately (or almost
> > immediately).
> > 
> > If I've understood correctly, the RCU processing can happen inside a
> > spinlock, as long as interrupts are enabled. (Won't the timer IRQ
> > trigger softirq processing, which triggers RCU callback processing,
> > for example?)
> > 
> > And interrupts are enabled when this happens: EFLAGS: 00000292
> > 
> > Please correct me if I am wrong!
> 
> The timer irq will do RCU garbage-collection - but only of 
> entries where a grace period has passed.
> 
> Otherwise there would be no point in using RCU at all, if the 
> kfree() can happen immediately. RCU is about delaying action, 
> and doing it at a point in time when we sure are in a quiescent 
> state. (we have done a context-switch or scheduled to idle)

Indeed, preemptable RCU grace periods extend for many milliseconds.

So, Vegard, when you say "immediately", do you mean "within
microseconds" or "faster than my human reflexes can react to"?

						Thanx, Paul

> The question is, is this piece of loop traversal code 
> preemptible? I dont think it is since it's embedded in a 
> spinlock:
> 
>                 spin_lock(&vmap_area_lock);
> -               list_for_each_entry(va, &valist, purge_list)
> +               list_for_each_entry_safe(va, n_va, &valist, purge_list)
>                        	__free_vmap_area(va);
>                 spin_unlock(&vmap_area_lock);
> 
> [ on -rt this could be preemptible and this would be a real fix 
>   there. I just dont see how the kfree() can execute on 
>   mainline. Obviously it did, since you got the kmemcheck 
>   assert. ]
> 
> 	Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 16:44           ` Paul E. McKenney
@ 2009-02-20 17:14             ` Ingo Molnar
  2009-02-20 17:25               ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2009-02-20 17:14 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Vegard Nossum, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel


* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> > > So, what flavor of RCU were you using?
> > 
> > well, even in preemptible RCU the grace period should be 
> > extended as long as we are non-preempt (which we are here), 
> > correct?
> 
> Given Classic RCU, you are quite correct.  With preemptable 
> RCU, if there are no readers, and if irqs are enabled, the 
> grace period could end within the spinlock's critical section.  
> Of course, the spinlock would need to be held for an 
> improbably long time -- many milliseconds.

ah. But we _can_ get unlucky there and get into 
multiple-milliseconds of blockage: for example if some heavy 
interrupt source or softirq processes stuff for many 
milliseconds.

So it's a real fix needed both for mainline and for 2.6.28. 
And kudos to kmemcheck ;-)

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 17:14             ` Ingo Molnar
@ 2009-02-20 17:25               ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-20 17:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vegard Nossum, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

On Fri, Feb 20, 2009 at 06:14:27PM +0100, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > > > So, what flavor of RCU were you using?
> > > 
> > > well, even in preemptible RCU the grace period should be 
> > > extended as long as we are non-preempt (which we are here), 
> > > correct?
> > 
> > Given Classic RCU, you are quite correct.  With preemptable 
> > RCU, if there are no readers, and if irqs are enabled, the 
> > grace period could end within the spinlock's critical section.  
> > Of course, the spinlock would need to be held for an 
> > improbably long time -- many milliseconds.
> 
> ah. But we _can_ get unlucky there and get into 
> multiple-milliseconds of blockage: for example if some heavy 
> interrupt source or softirq processes stuff for many 
> milliseconds.
> 
> So it's a real fix needed both for mainline and for 2.6.28. 
> And kudos to kmemcheck ;-)

Kudos from me as well!!!  Sure beats the heck out of tracking down memory
corruption by hand!!!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 15:46       ` Paul E. McKenney
  2009-02-20 16:04         ` Ingo Molnar
@ 2009-02-20 23:51         ` Vegard Nossum
  2009-02-21  1:40           ` Paul E. McKenney
  1 sibling, 1 reply; 46+ messages in thread
From: Vegard Nossum @ 2009-02-20 23:51 UTC (permalink / raw)
  To: paulmck
  Cc: Ingo Molnar, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

2009/2/20 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> On Fri, Feb 20, 2009 at 03:51:28PM +0100, Vegard Nossum wrote:
>>
>> I added some printks to __free_vmap_area() and rcu_free_va(), and it
>> shows that the kfree() is being called immediately (inside the list
>> traversal). So the call_rcu() is happening immediately (or almost
>> immediately).
>>
>> If I've understood correctly, the RCU processing can happen inside a
>> spinlock, as long as interrupts are enabled. (Won't the timer IRQ
>> trigger softirq processing, which triggers RCU callback processing,
>> for example?)
>>
>> And interrupts are enabled when this happens: EFLAGS: 00000292
>>
>> Please correct me if I am wrong!
>
> If you are using preemptable RCU, and if the read side accesses are not
> protected by rcu_read_lock(), this can happen.  At least for values of
> "immediately" in the millisecond range.
>
> If you were using classic or hierarchical RCU, the fact that the
> call_rcu() is within a spinlock (as opposed to mutex) critical section
> should prevent the grace period from ending.
>
> So, what flavor of RCU were you using?

$ grep RCU .config
# RCU Subsystem
# CONFIG_CLASSIC_RCU is not set
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
# CONFIG_RCU_TRACE is not set
CONFIG_RCU_FANOUT=32
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_CPU_STALL_DETECTOR is not set

And at boot:

[    0.000000] Initializing CPU#0
[    0.000000] Experimental hierarchical RCU implementation.
[    0.000000] Experimental hierarchical RCU init done.

What I did for this list traversal was to put one print-out in front
of the traversal, one after the traversal, one inside (so it would be
called on each iteration), and one in the RCU callback. It looks
something like this:

[  449.670460] __purge_vmap_area_lazy() list:
[  449.671332] __free_vmap_area(c7806a40)
[  449.674736] __free_vmap_area(c7806a80)
[  449.675441] rcu_free_va(c7806a40)
[  449.677407] __free_vmap_area(c7806ac0)
[  449.680113] rcu_free_va(c7806a80)
[  449.682821] __free_vmap_area(c7806b00)
[  449.684264] rcu_free_va(c7806ac0)
[  449.686525] __free_vmap_area(c7806b40)
[  449.688205] rcu_free_va(c7806b00)

...and goes on for a long time, until something triggers this:

[  449.902253] rcu_free_va(c7839d00)
[  449.903247] WARNING: kmemcheck: Caught 32-bit read from freed
memory (c7839d20)

...and finally:

[  457.580253] __purge_vmap_area_lazy() end
[  457.581201] rcu_free_va(c78974c0)

So this is also what I meant by "immediately": The RCU callbacks are
getting called inside the loop, and they're almost always paired with
the list removal, or lagging one object behind.

My guess is that this code posts "too many callbacks", which would
"force the grace period" according to __call_rcu() in
kernel/rcutree.c. What do you think about this?


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-20 23:51         ` Vegard Nossum
@ 2009-02-21  1:40           ` Paul E. McKenney
  2009-02-21  9:30             ` Vegard Nossum
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-21  1:40 UTC (permalink / raw)
  To: Vegard Nossum
  Cc: Ingo Molnar, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

On Sat, Feb 21, 2009 at 12:51:23AM +0100, Vegard Nossum wrote:
> 2009/2/20 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> > On Fri, Feb 20, 2009 at 03:51:28PM +0100, Vegard Nossum wrote:
> >>
> >> I added some printks to __free_vmap_area() and rcu_free_va(), and it
> >> shows that the kfree() is being called immediately (inside the list
> >> traversal). So the call_rcu() is happening immediately (or almost
> >> immediately).
> >>
> >> If I've understood correctly, the RCU processing can happen inside a
> >> spinlock, as long as interrupts are enabled. (Won't the timer IRQ
> >> trigger softirq processing, which triggers RCU callback processing,
> >> for example?)
> >>
> >> And interrupts are enabled when this happens: EFLAGS: 00000292
> >>
> >> Please correct me if I am wrong!
> >
> > If you are using preemptable RCU, and if the read side accesses are not
> > protected by rcu_read_lock(), this can happen.  At least for values of
> > "immediately" in the millisecond range.
> >
> > If you were using classic or hierarchical RCU, the fact that the
> > call_rcu() is within a spinlock (as opposed to mutex) critical section
> > should prevent the grace period from ending.
> >
> > So, what flavor of RCU were you using?
> 
> $ grep RCU .config
> # RCU Subsystem
> # CONFIG_CLASSIC_RCU is not set
> CONFIG_TREE_RCU=y

OK, for this RCU implementation, disabling preemption should prevent
grace periods from completing.

Hmmm...

> # CONFIG_PREEMPT_RCU is not set
> # CONFIG_RCU_TRACE is not set
> CONFIG_RCU_FANOUT=32
> # CONFIG_RCU_FANOUT_EXACT is not set
> # CONFIG_TREE_RCU_TRACE is not set
> # CONFIG_PREEMPT_RCU_TRACE is not set
> # CONFIG_RCU_TORTURE_TEST is not set
> # CONFIG_RCU_CPU_STALL_DETECTOR is not set
> 
> And at boot:
> 
> [    0.000000] Initializing CPU#0
> [    0.000000] Experimental hierarchical RCU implementation.
> [    0.000000] Experimental hierarchical RCU init done.
> 
> What I did for this list traversal was to put one print-out in front
> of the traversal, one after the traversal, one inside (so it would be
> called on each iteration), and one in the RCU callback. It looks
> something like this:
> 
> [  449.670460] __purge_vmap_area_lazy() list:
> [  449.671332] __free_vmap_area(c7806a40)
> [  449.674736] __free_vmap_area(c7806a80)
> [  449.675441] rcu_free_va(c7806a40)

This is 4.1 milliseconds, so is quite plausible.  Is the code -really-
disabling preemption for 4.1 milliseconds?

> [  449.677407] __free_vmap_area(c7806ac0)
> [  449.680113] rcu_free_va(c7806a80)

5.4 milliseconds...

> [  449.682821] __free_vmap_area(c7806b00)
> [  449.684264] rcu_free_va(c7806ac0)

6.9 milliseconds...

> [  449.686525] __free_vmap_area(c7806b40)
> [  449.688205] rcu_free_va(c7806b00)

5.4 milliseconds...

> ...and goes on for a long time, until something triggers this:
> 
> [  449.902253] rcu_free_va(c7839d00)
> [  449.903247] WARNING: kmemcheck: Caught 32-bit read from freed
> memory (c7839d20)
> 
> ...and finally:
> 
> [  457.580253] __purge_vmap_area_lazy() end
> [  457.581201] rcu_free_va(c78974c0)

And I don't see the corresponding __free_vmap_area() for either of the
above rcu_free_va() calls.  Would you be willing to forward the
timestamp for the __free_vmap_area() for c7839d20?

> So this is also what I meant by "immediately": The RCU callbacks are
> getting called inside the loop, and they're almost always paired with
> the list removal, or lagging one object behind.
> 
> My guess is that this code posts "too many callbacks", which would
> "force the grace period" according to __call_rcu() in
> kernel/rcutree.c. What do you think about this?

If the code really suppresses preemption across the whole loop, then
any attempt to force the grace period should fail.  Is it possible that
preemption is momentarily enabled somewhere within the loop?  Or that
we are seeing multiple passes through the loop rather than one big long
pass through the loop?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-21  1:40           ` Paul E. McKenney
@ 2009-02-21  9:30             ` Vegard Nossum
  2009-02-21 17:47               ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Vegard Nossum @ 2009-02-21  9:30 UTC (permalink / raw)
  To: paulmck
  Cc: Ingo Molnar, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

2009/2/21 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> This is 4.1 milliseconds, so is quite plausible.  Is the code -really-
> disabling preemption for 4.1 milliseconds?
>
>> [  449.677407] __free_vmap_area(c7806ac0)
>> [  449.680113] rcu_free_va(c7806a80)
>
> 5.4 milliseconds...
>
>> [  449.682821] __free_vmap_area(c7806b00)
>> [  449.684264] rcu_free_va(c7806ac0)
>
> 6.9 milliseconds...
>
>> [  449.686525] __free_vmap_area(c7806b40)
>> [  449.688205] rcu_free_va(c7806b00)
>
> 5.4 milliseconds...
>
>> ...and goes on for a long time, until something triggers this:
>>
>> [  449.902253] rcu_free_va(c7839d00)
>> [  449.903247] WARNING: kmemcheck: Caught 32-bit read from freed
>> memory (c7839d20)
>>
>> ...and finally:
>>
>> [  457.580253] __purge_vmap_area_lazy() end
>> [  457.581201] rcu_free_va(c78974c0)
>
> And I don't see the corresponding __free_vmap_area() for either of the
> above rcu_free_va() calls.  Would you be willing to forward the
> timestamp for the __free_vmap_area() for c7839d20?

I'm sorry. The numbers in paranthesis are the struct vmap_area
pointers, not the actual parameters being passed to the function. So
when we have this:

[  449.696775] __free_vmap_area(c7806c00)
[  449.697274] rcu_free_va(c7806bc0)
[  449.699543] __free_vmap_area(c7806c40)
[  449.701104] rcu_free_va(c7806c00)
[  449.703353] __free_vmap_area(c7806c80)
[  449.704247] rcu_free_va(c7806c40)

rcu_free_va(c7806c00) corresponds to rcu_free_va(c7806c00).

>
>> So this is also what I meant by "immediately": The RCU callbacks are
>> getting called inside the loop, and they're almost always paired with
>> the list removal, or lagging one object behind.
>>
>> My guess is that this code posts "too many callbacks", which would
>> "force the grace period" according to __call_rcu() in
>> kernel/rcutree.c. What do you think about this?
>
> If the code really suppresses preemption across the whole loop, then
> any attempt to force the grace period should fail.  Is it possible that
> preemption is momentarily enabled somewhere within the loop?  Or that
> we are seeing multiple passes through the loop rather than one big long
> pass through the loop?

Multiple passes? No. We have a print-out at the beginning and at the
end, and there's nothing else happening in-between. It doesn't leave
the function __purge_vmap_area_lazy. I don't see preemption being
enabled anywhere in __free_vmap_area (or its calls).

I single-stepped __free_vmap_area, and it will get interrupted. I got
(among other things) this chain of calls:

do_IRQ -> handle_irq -> handle_level_irq -> handle_IRQ_event ->
timer_interrupt -> ... -> run_local_timers -> raise_softirq (nr=1)

And at one point it also calls into RCU machinery:

update_process_times (user_tick=0) at kernel/timer.c:1033
1033            if (rcu_pending(cpu))
(gdb)
rcu_pending (cpu=0) at kernel/rcutree.c:1288
1288            return __rcu_pending(&rcu_state, &per_cpu(rcu_data, cpu)) ||
...
update_process_times (user_tick=0) at kernel/timer.c:1034
1034                    rcu_check_callbacks(cpu, user_tick);
(gdb)
rcu_check_callbacks (cpu=0, user=0) at kernel/rcutree.c:949
949     {
...
rcu_check_callbacks (cpu=0, user=-1049147360) at kernel/rcutree.c:967
967                     rcu_qsctr_inc(cpu);
(gdb)
rcu_qsctr_inc () at include/linux/rcutree.h:253
253             struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
(gdb)
255             rdp->passed_quiesc_completed = rdp->completed;
(gdb)
254             rdp->passed_quiesc = 1;
(gdb)
255             rdp->passed_quiesc_completed = rdp->completed;
(gdb)
rcu_check_callbacks (cpu=0, user=<value optimized out>) at kernel/rcutree.c:979
979                     rcu_bh_qsctr_inc(cpu);
(gdb)
rcu_bh_qsctr_inc () at include/linux/rcutree.h:259
259             struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
(gdb)
261             rdp->passed_quiesc_completed = rdp->completed;
(gdb)
260             rdp->passed_quiesc = 1;
(gdb)
261             rdp->passed_quiesc_completed = rdp->completed;
(gdb)
rcu_check_callbacks (cpu=0, user=<value optimized out>) at kernel/rcutree.c:981
981             raise_softirq(RCU_SOFTIRQ);
(gdb)
raise_softirq (nr=8) at kernel/softirq.c:313
313     {

We also get into __rcu_check_callbacks after a while, with this stacktrace:

#0  do_IRQ (regs=0xc170ddc0) at arch/x86/kernel/irq.c:201
#1  <signal handler called>
#2  0xc1073b7a in __rcu_process_callbacks (rsp=0xc1677060, rdp=0xc1941320)
    at kernel/rcutree.c:1127
#3  0xc1073dbf in rcu_process_callbacks (unused=<value optimized out>)
    at kernel/rcutree.c:1162
#4  0xc103741f in __do_softirq () at kernel/softirq.c:198
#5  0xc10374fd in do_softirq () at kernel/softirq.c:244
#6  0xc1037655 in irq_exit () at kernel/softirq.c:281
#7  0xc100529f in do_IRQ (regs=0xc170de98) at arch/x86/kernel/irq.c:223
#8  <signal handler called>
#9  __free_vmap_area (va=0xc7803a40) at mm/vmalloc.c:411
#10 0xc1096dd9 in __purge_vmap_area_lazy (start=0xc170df10, end=0xc170df0c,
    sync=<value optimized out>, force_flush=0) at mm/vmalloc.c:542
#11 0xc1096f1e in try_purge_vmap_area_lazy () at mm/vmalloc.c:556
#12 free_unmap_vmap_area_noflush (va=<value optimized out>) at mm/vmalloc.c:578
#13 0xc1096f4a in free_unmap_vmap_area () at mm/vmalloc.c:587
#14 remove_vm_area (addr=<value optimized out>) at mm/vmalloc.c:1168
#15 0xc1097005 in __vunmap (addr=0xc7803a40, deallocate_pages=0)
    at mm/vmalloc.c:1194
#16 0xc10970be in vunmap (addr=0xc7803aa0) at mm/vmalloc.c:1253
#17 0xc1008ba5 in text_poke (addr=0xc127ba4f, opcode=0xc170df8f, len=1)
    at arch/x86/kernel/alternative.c:523
#18 0xc1008ca9 in alternatives_smp_unlock (start=<value optimized out>,
    end=0xc1713360, text=0xc1000000 "�\206\021\002", text_end=0xc150250f "")
    at arch/x86/kernel/alternative.c:252
#19 0xc171ef47 in alternative_instructions ()
    at arch/x86/kernel/alternative.c:438
#20 0xc171f991 in check_bugs () at arch/x86/kernel/cpu/bugs.c:168
#21 0xc17148c5 in start_kernel () at init/main.c:687
#22 0xc171409e in i386_start_kernel () at arch/x86/kernel/head32.c:43
#23 0x00000000 in ?? ()

(On a side note, it strikes me that __do_softirq() does
local_irq_enable(), which means that __rcu_process_callbacks() can be
interrupted and the interrupt handler can call rcu_process_callbacks()
again...)

Does this ring any bells or make any sense at all? What else can I do
to help understand what's going on?

This, by the way, happens regardless of whether kmemcheck is used or
not. The only thing I do to trigger this particular behaviour is to
run an SMP kernel on a UP machine, since it will call
alternative_instructions() like we see in the stack trace above.


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-21  9:30             ` Vegard Nossum
@ 2009-02-21 17:47               ` Paul E. McKenney
  2009-02-21 18:08                 ` Vegard Nossum
  2009-02-21 19:21                 ` Vegard Nossum
  0 siblings, 2 replies; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-21 17:47 UTC (permalink / raw)
  To: Vegard Nossum
  Cc: Ingo Molnar, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

On Sat, Feb 21, 2009 at 10:30:58AM +0100, Vegard Nossum wrote:
> 2009/2/21 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> > This is 4.1 milliseconds, so is quite plausible.  Is the code -really-
> > disabling preemption for 4.1 milliseconds?
> >
> >> [  449.677407] __free_vmap_area(c7806ac0)
> >> [  449.680113] rcu_free_va(c7806a80)
> >
> > 5.4 milliseconds...
> >
> >> [  449.682821] __free_vmap_area(c7806b00)
> >> [  449.684264] rcu_free_va(c7806ac0)
> >
> > 6.9 milliseconds...
> >
> >> [  449.686525] __free_vmap_area(c7806b40)
> >> [  449.688205] rcu_free_va(c7806b00)
> >
> > 5.4 milliseconds...
> >
> >> ...and goes on for a long time, until something triggers this:
> >>
> >> [  449.902253] rcu_free_va(c7839d00)
> >> [  449.903247] WARNING: kmemcheck: Caught 32-bit read from freed
> >> memory (c7839d20)
> >>
> >> ...and finally:
> >>
> >> [  457.580253] __purge_vmap_area_lazy() end
> >> [  457.581201] rcu_free_va(c78974c0)
> >
> > And I don't see the corresponding __free_vmap_area() for either of the
> > above rcu_free_va() calls.  Would you be willing to forward the
> > timestamp for the __free_vmap_area() for c7839d20?
> 
> I'm sorry. The numbers in paranthesis are the struct vmap_area
> pointers, not the actual parameters being passed to the function. So
> when we have this:
> 
> [  449.696775] __free_vmap_area(c7806c00)
> [  449.697274] rcu_free_va(c7806bc0)
> [  449.699543] __free_vmap_area(c7806c40)
> [  449.701104] rcu_free_va(c7806c00)
> [  449.703353] __free_vmap_area(c7806c80)
> [  449.704247] rcu_free_va(c7806c40)
> 
> rcu_free_va(c7806c00) corresponds to rcu_free_va(c7806c00).

OK, so 449.701104-449.696775=0.004329, or about four milliseconds,
correct?

> >> So this is also what I meant by "immediately": The RCU callbacks are
> >> getting called inside the loop, and they're almost always paired with
> >> the list removal, or lagging one object behind.
> >>
> >> My guess is that this code posts "too many callbacks", which would
> >> "force the grace period" according to __call_rcu() in
> >> kernel/rcutree.c. What do you think about this?
> >
> > If the code really suppresses preemption across the whole loop, then
> > any attempt to force the grace period should fail.  Is it possible that
> > preemption is momentarily enabled somewhere within the loop?  Or that
> > we are seeing multiple passes through the loop rather than one big long
> > pass through the loop?
> 
> Multiple passes? No. We have a print-out at the beginning and at the
> end, and there's nothing else happening in-between. It doesn't leave
> the function __purge_vmap_area_lazy. I don't see preemption being
> enabled anywhere in __free_vmap_area (or its calls).
> 
> I single-stepped __free_vmap_area, and it will get interrupted. I got
> (among other things) this chain of calls:
> 
> do_IRQ -> handle_irq -> handle_level_irq -> handle_IRQ_event ->
> timer_interrupt -> ... -> run_local_timers -> raise_softirq (nr=1)
> 
> And at one point it also calls into RCU machinery:
> 
> update_process_times (user_tick=0) at kernel/timer.c:1033
> 1033            if (rcu_pending(cpu))
> (gdb)
> rcu_pending (cpu=0) at kernel/rcutree.c:1288
> 1288            return __rcu_pending(&rcu_state, &per_cpu(rcu_data, cpu)) ||
> ...
> update_process_times (user_tick=0) at kernel/timer.c:1034
> 1034                    rcu_check_callbacks(cpu, user_tick);
> (gdb)
> rcu_check_callbacks (cpu=0, user=0) at kernel/rcutree.c:949
> 949     {
> ...
> rcu_check_callbacks (cpu=0, user=-1049147360) at kernel/rcutree.c:967
> 967                     rcu_qsctr_inc(cpu);

????  Are the argument values trustworthy?  If so, I don't see how
the variable user transitioned from zero to non-zero.

The value user!=0 tells RCU that we were interrupted from a user process,
but this immediately follows user==0.  If we really were interrupted
from kernel code, (including from an irq handler) we should have user==0.

The user!=0 causes RCU to conclude that we are in a quiescent state.

RCU is then within its rights to process callbacks, which would result
in the behavior you saw.

> (gdb)
> rcu_qsctr_inc () at include/linux/rcutree.h:253
> 253             struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
> (gdb)
> 255             rdp->passed_quiesc_completed = rdp->completed;
> (gdb)
> 254             rdp->passed_quiesc = 1;
> (gdb)
> 255             rdp->passed_quiesc_completed = rdp->completed;
> (gdb)
> rcu_check_callbacks (cpu=0, user=<value optimized out>) at kernel/rcutree.c:979
> 979                     rcu_bh_qsctr_inc(cpu);
> (gdb)
> rcu_bh_qsctr_inc () at include/linux/rcutree.h:259
> 259             struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
> (gdb)
> 261             rdp->passed_quiesc_completed = rdp->completed;
> (gdb)
> 260             rdp->passed_quiesc = 1;
> (gdb)
> 261             rdp->passed_quiesc_completed = rdp->completed;
> (gdb)
> rcu_check_callbacks (cpu=0, user=<value optimized out>) at kernel/rcutree.c:981
> 981             raise_softirq(RCU_SOFTIRQ);
> (gdb)
> raise_softirq (nr=8) at kernel/softirq.c:313
> 313     {
> 
> We also get into __rcu_check_callbacks after a while, with this stacktrace:
> 
> #0  do_IRQ (regs=0xc170ddc0) at arch/x86/kernel/irq.c:201
> #1  <signal handler called>
> #2  0xc1073b7a in __rcu_process_callbacks (rsp=0xc1677060, rdp=0xc1941320)
>     at kernel/rcutree.c:1127
> #3  0xc1073dbf in rcu_process_callbacks (unused=<value optimized out>)
>     at kernel/rcutree.c:1162
> #4  0xc103741f in __do_softirq () at kernel/softirq.c:198
> #5  0xc10374fd in do_softirq () at kernel/softirq.c:244
> #6  0xc1037655 in irq_exit () at kernel/softirq.c:281
> #7  0xc100529f in do_IRQ (regs=0xc170de98) at arch/x86/kernel/irq.c:223
> #8  <signal handler called>
> #9  __free_vmap_area (va=0xc7803a40) at mm/vmalloc.c:411
> #10 0xc1096dd9 in __purge_vmap_area_lazy (start=0xc170df10, end=0xc170df0c,
>     sync=<value optimized out>, force_flush=0) at mm/vmalloc.c:542
> #11 0xc1096f1e in try_purge_vmap_area_lazy () at mm/vmalloc.c:556
> #12 free_unmap_vmap_area_noflush (va=<value optimized out>) at mm/vmalloc.c:578
> #13 0xc1096f4a in free_unmap_vmap_area () at mm/vmalloc.c:587
> #14 remove_vm_area (addr=<value optimized out>) at mm/vmalloc.c:1168
> #15 0xc1097005 in __vunmap (addr=0xc7803a40, deallocate_pages=0)
>     at mm/vmalloc.c:1194
> #16 0xc10970be in vunmap (addr=0xc7803aa0) at mm/vmalloc.c:1253
> #17 0xc1008ba5 in text_poke (addr=0xc127ba4f, opcode=0xc170df8f, len=1)
>     at arch/x86/kernel/alternative.c:523
> #18 0xc1008ca9 in alternatives_smp_unlock (start=<value optimized out>,
>     end=0xc1713360, text=0xc1000000 "�\206\021\002", text_end=0xc150250f "")
>     at arch/x86/kernel/alternative.c:252
> #19 0xc171ef47 in alternative_instructions ()
>     at arch/x86/kernel/alternative.c:438
> #20 0xc171f991 in check_bugs () at arch/x86/kernel/cpu/bugs.c:168
> #21 0xc17148c5 in start_kernel () at init/main.c:687
> #22 0xc171409e in i386_start_kernel () at arch/x86/kernel/head32.c:43
> #23 0x00000000 in ?? ()
> 
> (On a side note, it strikes me that __do_softirq() does
> local_irq_enable(), which means that __rcu_process_callbacks() can be
> interrupted and the interrupt handler can call rcu_process_callbacks()
> again...)
> 
> Does this ring any bells or make any sense at all? What else can I do
> to help understand what's going on?

I will look into what might have caused rcu_check_callbacks()'s
"user" argument to suddenly become non-zero.

> This, by the way, happens regardless of whether kmemcheck is used or
> not. The only thing I do to trigger this particular behaviour is to
> run an SMP kernel on a UP machine, since it will call
> alternative_instructions() like we see in the stack trace above.

OK, good to know, thank you!

Is this behavior recent, or does it apply to earlier 2.6.29-rc kernels
as well?  Does it happen with CONFIG_CLASSIC_RCU as well?  (From the
trace above, I suspect that it might well do so, but if not, that will
be valuable information as well.)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-21 17:47               ` Paul E. McKenney
@ 2009-02-21 18:08                 ` Vegard Nossum
  2009-02-21 18:33                   ` Paul E. McKenney
  2009-02-21 18:37                   ` Vegard Nossum
  2009-02-21 19:21                 ` Vegard Nossum
  1 sibling, 2 replies; 46+ messages in thread
From: Vegard Nossum @ 2009-02-21 18:08 UTC (permalink / raw)
  To: paulmck
  Cc: Ingo Molnar, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

2009/2/21 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
>> rcu_check_callbacks (cpu=0, user=0) at kernel/rcutree.c:949
>> 949     {
>> ...
>> rcu_check_callbacks (cpu=0, user=-1049147360) at kernel/rcutree.c:967
>> 967                     rcu_qsctr_inc(cpu);
>
> ????  Are the argument values trustworthy?  If so, I don't see how
> the variable user transitioned from zero to non-zero.
>
> The value user!=0 tells RCU that we were interrupted from a user process,
> but this immediately follows user==0.  If we really were interrupted
> from kernel code, (including from an irq handler) we should have user==0.
>
> The user!=0 causes RCU to conclude that we are in a quiescent state.
>
> RCU is then within its rights to process callbacks, which would result
> in the behavior you saw.

Ah, curious. Thanks for the explanation.

I tried again, just to be sure:

Breakpoint 1, rcu_check_callbacks (cpu=0, user=0) at kernel/rcutree.c:949
949     {
(gdb) p &user
Address requested for identifier "user" which is in register $edx
(gdb) p user
$1 = 0
(gdb) s
950             if (user ||
(gdb)
949     {
(gdb)
950             if (user ||
(gdb)
idle_cpu (cpu=0) at kernel/sched.c:5196
5196            return cpu_curr(cpu) == cpu_rq(cpu)->idle;
(gdb)
5197    }
(gdb)
idle_cpu (cpu=<value optimized out>) at kernel/sched.c:5196
5196            return cpu_curr(cpu) == cpu_rq(cpu)->idle;
(gdb)
5197    }
(gdb)
rcu_check_callbacks (cpu=0, user=-1049147360) at kernel/rcutree.c:967
967                     rcu_qsctr_inc(cpu);

Could that be a missing "d" clobber in some inline assembly? Or a
miscompilation?

Here's the disassembly (I hope it won't wrap):

0xc1073ec0 <rcu_check_callbacks+0>:     push   %ebp
0xc1073ec1 <rcu_check_callbacks+1>:     test   %edx,%edx
0xc1073ec3 <rcu_check_callbacks+3>:     mov    %esp,%ebp
0xc1073ec5 <rcu_check_callbacks+5>:     push   %ebx
0xc1073ec6 <rcu_check_callbacks+6>:     mov    %eax,%ebx
0xc1073ec8 <rcu_check_callbacks+8>:     je     0xc1073f08
<rcu_check_callbacks+72>
0xc1073eca <rcu_qsctr_inc+0>:   mov    $0xc1771320,%eax
0xc1073ecf <rcu_qsctr_inc+5>:   add    -0x3e8fa900(,%ebx,4),%eax
0xc1073ed6 <rcu_qsctr_inc+12>:  mov    (%eax),%edx
0xc1073ed8 <rcu_qsctr_inc+14>:  movb   $0x1,0xc(%eax)
0xc1073edc <rcu_qsctr_inc+18>:  mov    %edx,0x8(%eax)
0xc1073edf <rcu_bh_qsctr_inc+0>:        mov    $0xc1771380,%eax
0xc1073ee4 <rcu_bh_qsctr_inc+5>:        add    -0x3e8fa900(,%ebx,4),%eax
0xc1073eeb <rcu_bh_qsctr_inc+12>:       mov    (%eax),%edx
0xc1073eed <rcu_bh_qsctr_inc+14>:       movb   $0x1,0xc(%eax)
0xc1073ef1 <rcu_bh_qsctr_inc+18>:       mov    %edx,0x8(%eax)
0xc1073ef4 <rcu_check_callbacks+52>:    mov    $0x8,%eax

Seems to be rcu_qsctr_inc() that reloads %edx. If I'd guess, I'd say
x86's per_cpu macros. But it seems so strange that the corruption
would not manifest in other ways too.

Stand by for further investigations :-)


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-21 18:08                 ` Vegard Nossum
@ 2009-02-21 18:33                   ` Paul E. McKenney
  2009-02-21 18:37                   ` Vegard Nossum
  1 sibling, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-21 18:33 UTC (permalink / raw)
  To: Vegard Nossum
  Cc: Ingo Molnar, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

On Sat, Feb 21, 2009 at 07:08:55PM +0100, Vegard Nossum wrote:
> 2009/2/21 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> >> rcu_check_callbacks (cpu=0, user=0) at kernel/rcutree.c:949
> >> 949     {
> >> ...
> >> rcu_check_callbacks (cpu=0, user=-1049147360) at kernel/rcutree.c:967
> >> 967                     rcu_qsctr_inc(cpu);
> >
> > ????  Are the argument values trustworthy?  If so, I don't see how
> > the variable user transitioned from zero to non-zero.
> >
> > The value user!=0 tells RCU that we were interrupted from a user process,
> > but this immediately follows user==0.  If we really were interrupted
> > from kernel code, (including from an irq handler) we should have user==0.
> >
> > The user!=0 causes RCU to conclude that we are in a quiescent state.
> >
> > RCU is then within its rights to process callbacks, which would result
> > in the behavior you saw.
> 
> Ah, curious. Thanks for the explanation.
> 
> I tried again, just to be sure:
> 
> Breakpoint 1, rcu_check_callbacks (cpu=0, user=0) at kernel/rcutree.c:949
> 949     {
> (gdb) p &user
> Address requested for identifier "user" which is in register $edx
> (gdb) p user
> $1 = 0
> (gdb) s
> 950             if (user ||
> (gdb)
> 949     {
> (gdb)
> 950             if (user ||
> (gdb)
> idle_cpu (cpu=0) at kernel/sched.c:5196
> 5196            return cpu_curr(cpu) == cpu_rq(cpu)->idle;
> (gdb)
> 5197    }
> (gdb)
> idle_cpu (cpu=<value optimized out>) at kernel/sched.c:5196
> 5196            return cpu_curr(cpu) == cpu_rq(cpu)->idle;
> (gdb)
> 5197    }
> (gdb)
> rcu_check_callbacks (cpu=0, user=-1049147360) at kernel/rcutree.c:967
> 967                     rcu_qsctr_inc(cpu);
> 
> Could that be a missing "d" clobber in some inline assembly? Or a
> miscompilation?

Hmmm...  cpu_rq() does invoke per_cpu()...

> Here's the disassembly (I hope it won't wrap):
> 
> 0xc1073ec0 <rcu_check_callbacks+0>:     push   %ebp
> 0xc1073ec1 <rcu_check_callbacks+1>:     test   %edx,%edx
> 0xc1073ec3 <rcu_check_callbacks+3>:     mov    %esp,%ebp
> 0xc1073ec5 <rcu_check_callbacks+5>:     push   %ebx
> 0xc1073ec6 <rcu_check_callbacks+6>:     mov    %eax,%ebx
> 0xc1073ec8 <rcu_check_callbacks+8>:     je     0xc1073f08
> <rcu_check_callbacks+72>
> 0xc1073eca <rcu_qsctr_inc+0>:   mov    $0xc1771320,%eax
> 0xc1073ecf <rcu_qsctr_inc+5>:   add    -0x3e8fa900(,%ebx,4),%eax
> 0xc1073ed6 <rcu_qsctr_inc+12>:  mov    (%eax),%edx
> 0xc1073ed8 <rcu_qsctr_inc+14>:  movb   $0x1,0xc(%eax)
> 0xc1073edc <rcu_qsctr_inc+18>:  mov    %edx,0x8(%eax)
> 0xc1073edf <rcu_bh_qsctr_inc+0>:        mov    $0xc1771380,%eax
> 0xc1073ee4 <rcu_bh_qsctr_inc+5>:        add    -0x3e8fa900(,%ebx,4),%eax
> 0xc1073eeb <rcu_bh_qsctr_inc+12>:       mov    (%eax),%edx
> 0xc1073eed <rcu_bh_qsctr_inc+14>:       movb   $0x1,0xc(%eax)
> 0xc1073ef1 <rcu_bh_qsctr_inc+18>:       mov    %edx,0x8(%eax)
> 0xc1073ef4 <rcu_check_callbacks+52>:    mov    $0x8,%eax
> 
> Seems to be rcu_qsctr_inc() that reloads %edx. If I'd guess, I'd say
> x86's per_cpu macros. But it seems so strange that the corruption
> would not manifest in other ways too.
> 
> Stand by for further investigations :-)

I will look into this, but it will take a bit.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-21 18:08                 ` Vegard Nossum
  2009-02-21 18:33                   ` Paul E. McKenney
@ 2009-02-21 18:37                   ` Vegard Nossum
  2009-02-22  3:00                     ` Paul E. McKenney
  1 sibling, 1 reply; 46+ messages in thread
From: Vegard Nossum @ 2009-02-21 18:37 UTC (permalink / raw)
  To: paulmck
  Cc: Ingo Molnar, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

2009/2/21 Vegard Nossum <vegard.nossum@gmail.com>:
> Here's the disassembly (I hope it won't wrap):
>
> 0xc1073ec0 <rcu_check_callbacks+0>:     push   %ebp
> 0xc1073ec1 <rcu_check_callbacks+1>:     test   %edx,%edx
> 0xc1073ec3 <rcu_check_callbacks+3>:     mov    %esp,%ebp
> 0xc1073ec5 <rcu_check_callbacks+5>:     push   %ebx
> 0xc1073ec6 <rcu_check_callbacks+6>:     mov    %eax,%ebx
> 0xc1073ec8 <rcu_check_callbacks+8>:     je     0xc1073f08
> <rcu_check_callbacks+72>
> 0xc1073eca <rcu_qsctr_inc+0>:   mov    $0xc1771320,%eax
> 0xc1073ecf <rcu_qsctr_inc+5>:   add    -0x3e8fa900(,%ebx,4),%eax
> 0xc1073ed6 <rcu_qsctr_inc+12>:  mov    (%eax),%edx
> 0xc1073ed8 <rcu_qsctr_inc+14>:  movb   $0x1,0xc(%eax)
> 0xc1073edc <rcu_qsctr_inc+18>:  mov    %edx,0x8(%eax)
> 0xc1073edf <rcu_bh_qsctr_inc+0>:        mov    $0xc1771380,%eax
> 0xc1073ee4 <rcu_bh_qsctr_inc+5>:        add    -0x3e8fa900(,%ebx,4),%eax
> 0xc1073eeb <rcu_bh_qsctr_inc+12>:       mov    (%eax),%edx
> 0xc1073eed <rcu_bh_qsctr_inc+14>:       movb   $0x1,0xc(%eax)
> 0xc1073ef1 <rcu_bh_qsctr_inc+18>:       mov    %edx,0x8(%eax)
> 0xc1073ef4 <rcu_check_callbacks+52>:    mov    $0x8,%eax
>
> Seems to be rcu_qsctr_inc() that reloads %edx. If I'd guess, I'd say
> x86's per_cpu macros. But it seems so strange that the corruption
> would not manifest in other ways too.
>

Okay, I don't really think it's an error. The if (user) test happens
at the very beginning and gcc decides to reuse %edx. GDB doesn't know
this, so it thinks the parameter changed, but at this point the
parameter simply won't be used anymore.

So you're right: The value can't be trusted (after entry, anyway).


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-21 17:47               ` Paul E. McKenney
  2009-02-21 18:08                 ` Vegard Nossum
@ 2009-02-21 19:21                 ` Vegard Nossum
  1 sibling, 0 replies; 46+ messages in thread
From: Vegard Nossum @ 2009-02-21 19:21 UTC (permalink / raw)
  To: paulmck
  Cc: Ingo Molnar, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

2009/2/21 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> Is this behavior recent, or does it apply to earlier 2.6.29-rc kernels
> as well?  Does it happen with CONFIG_CLASSIC_RCU as well?  (From the
> trace above, I suspect that it might well do so, but if not, that will
> be valuable information as well.)

I don't know -- I haven't tried earlier kernels.

I just compiled a CONFIG_CLASSIC_RCU kernel, and it does happen there too:

[    1.061138] traversing list {
[    1.062281]     __free_vmap_area(c7803a40)
[    1.063177]     __free_vmap_area(c7803a80)
[    1.065217]     __free_vmap_area(c7803ac0)
[    1.066255]     __free_vmap_area(c7803b00)
[    1.067156]     kfree(c7803a40)
[    1.068292]     __free_vmap_area(c7803b40)
[    1.069094]     __free_vmap_area(c7803b80)
[    1.071051]     kfree(c7803a80)
[    1.072138]     __free_vmap_area(c7803bc0)
[    1.073101]     __free_vmap_area(c7803c00)
[    1.074065]     kfree(c7803ac0)
[    1.077126]     kfree(c7803b00)
[    1.078124]     __free_vmap_area(c7803c40)
[    1.079079]     __free_vmap_area(c7803c80)
...
[    4.645118]     __free_vmap_area(c7853440)
[    4.646099]     __free_vmap_area(c7853480)
[    4.646507]     __free_vmap_area(c78534c0)
[    4.647049]     kfree(c7853340)
[    4.648058]     kfree(c7853380)
[    4.650090] };
[    4.652061]     kfree(c78533c0)
[    4.653055]     kfree(c7853400)
[    4.656109]     kfree(c7853440)
[    4.656482]     kfree(c7853480)
[    4.656760]     kfree(c78534c0)
[    5.541885] traversing list {
[    5.543081]     __free_vmap_area(c7853380)
[    5.544071]     __free_vmap_area(c7853340)
...
[    8.977412]     __free_vmap_area(c7822400)
[    8.977723]     __free_vmap_area(c7822440)
[    8.978045]     kfree(c7822200)
[    8.978569]     kfree(c7822240)
[    8.979054]     kfree(c7822280)
[    8.979389] };
[    8.981051]     kfree(c78222c0)
[    8.983055]     kfree(c7822300)
[    8.984089]     kfree(c7822340)
[    8.984640] Freeing SMP alternatives: 20k freed
[    8.985146] ACPI: Core revision 20081204
[    8.986060]     kfree(c7822380)
[    8.987057]     kfree(c78223c0)
[    8.988052]     kfree(c7822400)
[    8.990045]     kfree(c7822440)
[    9.021526] ..TIMER: vector=0x30 apic1=0 pin1=0 apic2=-1 pin2=-1

The only difference I can see is that they're clumped together more
closely (4-5 calls of one kind, then 4-5 calls of the other) than with
the Tree RCU.


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-21 18:37                   ` Vegard Nossum
@ 2009-02-22  3:00                     ` Paul E. McKenney
  2009-02-23  5:17                       ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-22  3:00 UTC (permalink / raw)
  To: Vegard Nossum
  Cc: Ingo Molnar, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

On Sat, Feb 21, 2009 at 07:37:20PM +0100, Vegard Nossum wrote:
> 2009/2/21 Vegard Nossum <vegard.nossum@gmail.com>:
> > Here's the disassembly (I hope it won't wrap):
> >
> > 0xc1073ec0 <rcu_check_callbacks+0>:     push   %ebp
> > 0xc1073ec1 <rcu_check_callbacks+1>:     test   %edx,%edx
> > 0xc1073ec3 <rcu_check_callbacks+3>:     mov    %esp,%ebp
> > 0xc1073ec5 <rcu_check_callbacks+5>:     push   %ebx
> > 0xc1073ec6 <rcu_check_callbacks+6>:     mov    %eax,%ebx
> > 0xc1073ec8 <rcu_check_callbacks+8>:     je     0xc1073f08
> > <rcu_check_callbacks+72>
> > 0xc1073eca <rcu_qsctr_inc+0>:   mov    $0xc1771320,%eax
> > 0xc1073ecf <rcu_qsctr_inc+5>:   add    -0x3e8fa900(,%ebx,4),%eax
> > 0xc1073ed6 <rcu_qsctr_inc+12>:  mov    (%eax),%edx
> > 0xc1073ed8 <rcu_qsctr_inc+14>:  movb   $0x1,0xc(%eax)
> > 0xc1073edc <rcu_qsctr_inc+18>:  mov    %edx,0x8(%eax)
> > 0xc1073edf <rcu_bh_qsctr_inc+0>:        mov    $0xc1771380,%eax
> > 0xc1073ee4 <rcu_bh_qsctr_inc+5>:        add    -0x3e8fa900(,%ebx,4),%eax
> > 0xc1073eeb <rcu_bh_qsctr_inc+12>:       mov    (%eax),%edx
> > 0xc1073eed <rcu_bh_qsctr_inc+14>:       movb   $0x1,0xc(%eax)
> > 0xc1073ef1 <rcu_bh_qsctr_inc+18>:       mov    %edx,0x8(%eax)
> > 0xc1073ef4 <rcu_check_callbacks+52>:    mov    $0x8,%eax
> >
> > Seems to be rcu_qsctr_inc() that reloads %edx. If I'd guess, I'd say
> > x86's per_cpu macros. But it seems so strange that the corruption
> > would not manifest in other ways too.
> >
> 
> Okay, I don't really think it's an error. The if (user) test happens
> at the very beginning and gcc decides to reuse %edx. GDB doesn't know
> this, so it thinks the parameter changed, but at this point the
> parameter simply won't be used anymore.
> 
> So you're right: The value can't be trusted (after entry, anyway).

OK.  So at least the compiler is sane.  ;-)

And the fact that RCU Classic behaves the same as hierarchical RCU
pretty clearly points at some issue with the quiescent-state check code:

void rcu_check_callbacks(int cpu, int user)
{
	if (user ||
	    (idle_cpu(cpu) && !in_softirq() &&
				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
		rcu_qsctr_inc(cpu);
		rcu_bh_qsctr_inc(cpu);
	} else if (!in_softirq()) {
		rcu_bh_qsctr_inc(cpu);
	}
	raise_softirq(RCU_SOFTIRQ);
}

In the case you traced earlier, we interrupted out of kernel code, yet
somehow arrived at rcu_qsctr_inc().  We know that "user" really was 0,
thanks to your careful analysis, so the issue must be in the other
clause.  Since we interrupted out of mainline kernel code, in_softirq()
should have returned 0, and hardirq_count() should also have met the
above condition.

You mentioned some concern about idle_cpu() separately, and if idle_cpu()
was returning 1, then RCU would most certainly decide that it was in a
quiescent state and that it could end the current grace period.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-22  3:00                     ` Paul E. McKenney
@ 2009-02-23  5:17                       ` Paul E. McKenney
  2009-02-23  8:24                         ` Vegard Nossum
                                           ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-23  5:17 UTC (permalink / raw)
  To: Vegard Nossum
  Cc: Ingo Molnar, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

On Sat, Feb 21, 2009 at 07:00:30PM -0800, Paul E. McKenney wrote:
> On Sat, Feb 21, 2009 at 07:37:20PM +0100, Vegard Nossum wrote:
> > 2009/2/21 Vegard Nossum <vegard.nossum@gmail.com>:

[ . . . ]

> > Okay, I don't really think it's an error. The if (user) test happens
> > at the very beginning and gcc decides to reuse %edx. GDB doesn't know
> > this, so it thinks the parameter changed, but at this point the
> > parameter simply won't be used anymore.
> > 
> > So you're right: The value can't be trusted (after entry, anyway).
> 
> OK.  So at least the compiler is sane.  ;-)
> 
> And the fact that RCU Classic behaves the same as hierarchical RCU
> pretty clearly points at some issue with the quiescent-state check code:
> 
> void rcu_check_callbacks(int cpu, int user)
> {
> 	if (user ||
> 	    (idle_cpu(cpu) && !in_softirq() &&
> 				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
> 		rcu_qsctr_inc(cpu);
> 		rcu_bh_qsctr_inc(cpu);
> 	} else if (!in_softirq()) {
> 		rcu_bh_qsctr_inc(cpu);
> 	}
> 	raise_softirq(RCU_SOFTIRQ);
> }
> 
> In the case you traced earlier, we interrupted out of kernel code, yet
> somehow arrived at rcu_qsctr_inc().  We know that "user" really was 0,
> thanks to your careful analysis, so the issue must be in the other
> clause.  Since we interrupted out of mainline kernel code, in_softirq()
> should have returned 0, and hardirq_count() should also have met the
> above condition.
> 
> You mentioned some concern about idle_cpu() separately, and if idle_cpu()
> was returning 1, then RCU would most certainly decide that it was in a
> quiescent state and that it could end the current grace period.

Hello, Vegard,

Could you please try out the following patch?  I am not 100% confident
of it on non-x86 architectures, nor during the time that non-boot CPUs
start up (though this patch should not break non-boot CPUs any more than
they might already be broken).

							Thanx, Paul

------------------------------------------------------------------------

The boot CPU runs in the context of its idle thread during boot-up.
During this time, idle_cpu(0) will always return nonzero, which will
fool Classic and Hierarchical RCU into deciding that a large chunk of
the boot-up sequence is a big long quiescent state.  This in turn causes
RCU to prematurely end grace periods during this time.

This patch creates a new global variable that is set to 1 just before
the boot CPU first enters the scheduler, after which the idle task
really is idle.

Located-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 init/main.c         |    3 +++
 kernel/rcuclassic.c |    4 +++-
 kernel/rcutree.c    |    4 +++-
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/init/main.c b/init/main.c
index 8442094..51f4b71 100644
--- a/init/main.c
+++ b/init/main.c
@@ -121,6 +121,8 @@ static char *static_command_line;
 static char *execute_command;
 static char *ramdisk_execute_command;
 
+int idle_task_is_really_idle;	/* set to 1 late in boot. */
+
 #ifdef CONFIG_SMP
 /* Setup configured maximum number of CPUs to activate */
 unsigned int __initdata setup_max_cpus = NR_CPUS;
@@ -463,6 +465,7 @@ static noinline void __init_refok rest_init(void)
 	 * at least once to get things moving:
 	 */
 	init_idle_bootup_task(current);
+	idle_task_is_really_idle = 1;
 	preempt_enable_no_resched();
 	schedule();
 	preempt_disable();
diff --git a/kernel/rcuclassic.c b/kernel/rcuclassic.c
index bd5a900..a758fa6 100644
--- a/kernel/rcuclassic.c
+++ b/kernel/rcuclassic.c
@@ -678,8 +678,10 @@ int rcu_needs_cpu(int cpu)
  */
 void rcu_check_callbacks(int cpu, int user)
 {
+	extern int idle_task_is_really_idle;
+
 	if (user ||
-	    (idle_cpu(cpu) && !in_softirq() &&
+	    (idle_cpu(cpu) && idle_task_is_really_idle && !in_softirq() &&
 				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
 
 		/*
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index b2fd602..e996d85 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -947,8 +947,10 @@ static void rcu_do_batch(struct rcu_data *rdp)
  */
 void rcu_check_callbacks(int cpu, int user)
 {
+	extern int idle_task_is_really_idle;
+
 	if (user ||
-	    (idle_cpu(cpu) && !in_softirq() &&
+	    (idle_cpu(cpu) && idle_task_is_really_idle && !in_softirq() &&
 				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
 
 		/*

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23  5:17                       ` Paul E. McKenney
@ 2009-02-23  8:24                         ` Vegard Nossum
  2009-02-23 15:39                           ` Paul E. McKenney
  2009-02-23  9:07                         ` Ingo Molnar
  2009-02-23 13:29                         ` Nick Piggin
  2 siblings, 1 reply; 46+ messages in thread
From: Vegard Nossum @ 2009-02-23  8:24 UTC (permalink / raw)
  To: paulmck
  Cc: Ingo Molnar, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

2009/2/23 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> Hello, Vegard,
>
> Could you please try out the following patch?  I am not 100% confident
> of it on non-x86 architectures, nor during the time that non-boot CPUs
> start up (though this patch should not break non-boot CPUs any more than
> they might already be broken).

Hi!

This patch fixes it for me. Now I see the huge stream of call_rcu()s,
followed by a huge stream of kfrees() (after the list traversal has
completed).


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23  5:17                       ` Paul E. McKenney
  2009-02-23  8:24                         ` Vegard Nossum
@ 2009-02-23  9:07                         ` Ingo Molnar
  2009-02-23  9:17                           ` Andrew Morton
  2009-02-23 13:29                         ` Nick Piggin
  2 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2009-02-23  9:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Vegard Nossum, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel


* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> On Sat, Feb 21, 2009 at 07:00:30PM -0800, Paul E. McKenney wrote:
> > On Sat, Feb 21, 2009 at 07:37:20PM +0100, Vegard Nossum wrote:
> > > 2009/2/21 Vegard Nossum <vegard.nossum@gmail.com>:
> 
> [ . . . ]
> 
> > > Okay, I don't really think it's an error. The if (user) test happens
> > > at the very beginning and gcc decides to reuse %edx. GDB doesn't know
> > > this, so it thinks the parameter changed, but at this point the
> > > parameter simply won't be used anymore.
> > > 
> > > So you're right: The value can't be trusted (after entry, anyway).
> > 
> > OK.  So at least the compiler is sane.  ;-)
> > 
> > And the fact that RCU Classic behaves the same as hierarchical RCU
> > pretty clearly points at some issue with the quiescent-state check code:
> > 
> > void rcu_check_callbacks(int cpu, int user)
> > {
> > 	if (user ||
> > 	    (idle_cpu(cpu) && !in_softirq() &&
> > 				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
> > 		rcu_qsctr_inc(cpu);
> > 		rcu_bh_qsctr_inc(cpu);
> > 	} else if (!in_softirq()) {
> > 		rcu_bh_qsctr_inc(cpu);
> > 	}
> > 	raise_softirq(RCU_SOFTIRQ);
> > }
> > 
> > In the case you traced earlier, we interrupted out of kernel code, yet
> > somehow arrived at rcu_qsctr_inc().  We know that "user" really was 0,
> > thanks to your careful analysis, so the issue must be in the other
> > clause.  Since we interrupted out of mainline kernel code, in_softirq()
> > should have returned 0, and hardirq_count() should also have met the
> > above condition.
> > 
> > You mentioned some concern about idle_cpu() separately, and if idle_cpu()
> > was returning 1, then RCU would most certainly decide that it was in a
> > quiescent state and that it could end the current grace period.
> 
> Hello, Vegard,
> 
> Could you please try out the following patch?  I am not 100% 
> confident of it on non-x86 architectures, nor during the time 
> that non-boot CPUs start up (though this patch should not 
> break non-boot CPUs any more than they might already be 
> broken).
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> The boot CPU runs in the context of its idle thread during 
> boot-up. During this time, idle_cpu(0) will always return 
> nonzero, which will fool Classic and Hierarchical RCU into 
> deciding that a large chunk of the boot-up sequence is a big 
> long quiescent state.  This in turn causes RCU to prematurely 
> end grace periods during this time.

ah, that makes a lot of sense and explains it all! What a nasty 
little bug we had all along ...

> This patch creates a new global variable that is set to 1 just 
> before the boot CPU first enters the scheduler, after which 
> the idle task really is idle.
> 
> Located-by: Vegard Nossum <vegard.nossum@gmail.com>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Please also add kmemcheck to the changelog while at it ;-)

> ---
> 
>  init/main.c         |    3 +++
>  kernel/rcuclassic.c |    4 +++-
>  kernel/rcutree.c    |    4 +++-
>  3 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/init/main.c b/init/main.c
> index 8442094..51f4b71 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -121,6 +121,8 @@ static char *static_command_line;
>  static char *execute_command;
>  static char *ramdisk_execute_command;
>  
> +int idle_task_is_really_idle;	/* set to 1 late in boot. */
> +
>  #ifdef CONFIG_SMP
>  /* Setup configured maximum number of CPUs to activate */
>  unsigned int __initdata setup_max_cpus = NR_CPUS;
> @@ -463,6 +465,7 @@ static noinline void __init_refok rest_init(void)
>  	 * at least once to get things moving:
>  	 */
>  	init_idle_bootup_task(current);
> +	idle_task_is_really_idle = 1;
>  	preempt_enable_no_resched();
>  	schedule();
>  	preempt_disable();

Could you please use system_state instead? We could insert a new 
stage - or just use SYSTEM_RUNNING as the trigger.

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23  9:07                         ` Ingo Molnar
@ 2009-02-23  9:17                           ` Andrew Morton
  2009-02-23  9:27                             ` Ingo Molnar
  0 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2009-02-23  9:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul E. McKenney, Vegard Nossum, stable, Nick Piggin,
	Pekka Enberg, linux-kernel

On Mon, 23 Feb 2009 10:07:35 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> > ---
> > 
> >  init/main.c         |    3 +++
> >  kernel/rcuclassic.c |    4 +++-
> >  kernel/rcutree.c    |    4 +++-
> >  3 files changed, 9 insertions(+), 2 deletions(-)
> > 
> > diff --git a/init/main.c b/init/main.c
> > index 8442094..51f4b71 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -121,6 +121,8 @@ static char *static_command_line;
> >  static char *execute_command;
> >  static char *ramdisk_execute_command;
> >  
> > +int idle_task_is_really_idle;	/* set to 1 late in boot. */
> > +
> >  #ifdef CONFIG_SMP
> >  /* Setup configured maximum number of CPUs to activate */
> >  unsigned int __initdata setup_max_cpus = NR_CPUS;
> > @@ -463,6 +465,7 @@ static noinline void __init_refok rest_init(void)
> >  	 * at least once to get things moving:
> >  	 */
> >  	init_idle_bootup_task(current);
> > +	idle_task_is_really_idle = 1;
> >  	preempt_enable_no_resched();
> >  	schedule();
> >  	preempt_disable();
> 
> Could you please use system_state instead? We could insert a new 
> stage - or just use SYSTEM_RUNNING as the trigger.

I think the standalone flag is better (once those extern-decls-in-C get
fixed).

system_state's semantics have, err, evolved over time.  If this happens
again (and the patch sneaks past my attention) then there's a risk that
code which depends upon system_state will break - this has happened in
the past.  Plus piling more dependencies on system_state of course
makes any evolution of its semantics harder to do...

It seems safer/saner/more-robust to create the additional
application-specific global as Paul has done.  One more word of bss
won't kill us.  It adds a few more words of .text, but it's in __init.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23  9:17                           ` Andrew Morton
@ 2009-02-23  9:27                             ` Ingo Molnar
  2009-02-23 15:56                               ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2009-02-23  9:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul E. McKenney, Vegard Nossum, stable, Nick Piggin,
	Pekka Enberg, linux-kernel


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 23 Feb 2009 10:07:35 +0100 Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > ---
> > > 
> > >  init/main.c         |    3 +++
> > >  kernel/rcuclassic.c |    4 +++-
> > >  kernel/rcutree.c    |    4 +++-
> > >  3 files changed, 9 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/init/main.c b/init/main.c
> > > index 8442094..51f4b71 100644
> > > --- a/init/main.c
> > > +++ b/init/main.c
> > > @@ -121,6 +121,8 @@ static char *static_command_line;
> > >  static char *execute_command;
> > >  static char *ramdisk_execute_command;
> > >  
> > > +int idle_task_is_really_idle;	/* set to 1 late in boot. */
> > > +
> > >  #ifdef CONFIG_SMP
> > >  /* Setup configured maximum number of CPUs to activate */
> > >  unsigned int __initdata setup_max_cpus = NR_CPUS;
> > > @@ -463,6 +465,7 @@ static noinline void __init_refok rest_init(void)
> > >  	 * at least once to get things moving:
> > >  	 */
> > >  	init_idle_bootup_task(current);
> > > +	idle_task_is_really_idle = 1;
> > >  	preempt_enable_no_resched();
> > >  	schedule();
> > >  	preempt_disable();
> > 
> > Could you please use system_state instead? We could insert a new 
> > stage - or just use SYSTEM_RUNNING as the trigger.
> 
> I think the standalone flag is better (once those 
> extern-decls-in-C get fixed).
> 
> system_state's semantics have, err, evolved over time.  If 
> this happens again (and the patch sneaks past my attention) 
> then there's a risk that code which depends upon system_state 
> will break - this has happened in the past.  Plus piling more 
> dependencies on system_state of course makes any evolution of 
> its semantics harder to do...

All we need is a SYSTEM_BOOTING_EARLY boundary - prior which 
there's no real scheduling yet. I used SYSTEM_SCHEDULER_BOOTING 
state before and it worked well and wasnt fragile.

Our system_state semantics problems were more rooted in the fact 
that the SYSTEM_BOOTING stage wasnt well defined. But if we did 
a SYSTEM_SCHEDULER_BOOTING stage that would be pretty 
bit-rot-safe.

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23  5:17                       ` Paul E. McKenney
  2009-02-23  8:24                         ` Vegard Nossum
  2009-02-23  9:07                         ` Ingo Molnar
@ 2009-02-23 13:29                         ` Nick Piggin
  2009-02-23 16:17                           ` Paul E. McKenney
  2 siblings, 1 reply; 46+ messages in thread
From: Nick Piggin @ 2009-02-23 13:29 UTC (permalink / raw)
  To: paulmck
  Cc: Vegard Nossum, Ingo Molnar, stable, Andrew Morton, Nick Piggin,
	Pekka Enberg, linux-kernel

On Monday 23 February 2009 16:17:09 Paul E. McKenney wrote:

> The boot CPU runs in the context of its idle thread during boot-up.
> During this time, idle_cpu(0) will always return nonzero, which will
> fool Classic and Hierarchical RCU into deciding that a large chunk of
> the boot-up sequence is a big long quiescent state.  This in turn causes
> RCU to prematurely end grace periods during this time.
>
> This patch creates a new global variable that is set to 1 just before
> the boot CPU first enters the scheduler, after which the idle task
> really is idle.

Nice work all (btw. if this patch goes in rather than using system_state,
then please make the variable __read_mostly).

Vegard, I would like to still use your patch in vmalloc.c as well. It
solves a possible use-after-free with preemptible RCU, and also helps with
a patch I have to conditionally disable lazy vmap unmapping (for Xen).

We _could_ disable RCU there instead to solve the preemptible RCU bug, but
your patch I think is less overhead.

So with appropriate changelog update, please also resend your patch
(with Acked-by: Nick Piggin <npiggin@suse.de>)

Thanks,
Nick

> Located-by: Vegard Nossum <vegard.nossum@gmail.com>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>
>  init/main.c         |    3 +++
>  kernel/rcuclassic.c |    4 +++-
>  kernel/rcutree.c    |    4 +++-
>  3 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/init/main.c b/init/main.c
> index 8442094..51f4b71 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -121,6 +121,8 @@ static char *static_command_line;
>  static char *execute_command;
>  static char *ramdisk_execute_command;
>
> +int idle_task_is_really_idle;	/* set to 1 late in boot. */
> +
>  #ifdef CONFIG_SMP
>  /* Setup configured maximum number of CPUs to activate */
>  unsigned int __initdata setup_max_cpus = NR_CPUS;
> @@ -463,6 +465,7 @@ static noinline void __init_refok rest_init(void)
>  	 * at least once to get things moving:
>  	 */
>  	init_idle_bootup_task(current);
> +	idle_task_is_really_idle = 1;
>  	preempt_enable_no_resched();
>  	schedule();
>  	preempt_disable();
> diff --git a/kernel/rcuclassic.c b/kernel/rcuclassic.c
> index bd5a900..a758fa6 100644
> --- a/kernel/rcuclassic.c
> +++ b/kernel/rcuclassic.c
> @@ -678,8 +678,10 @@ int rcu_needs_cpu(int cpu)
>   */
>  void rcu_check_callbacks(int cpu, int user)
>  {
> +	extern int idle_task_is_really_idle;
> +
>  	if (user ||
> -	    (idle_cpu(cpu) && !in_softirq() &&
> +	    (idle_cpu(cpu) && idle_task_is_really_idle && !in_softirq() &&
>  				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
>
>  		/*
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index b2fd602..e996d85 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -947,8 +947,10 @@ static void rcu_do_batch(struct rcu_data *rdp)
>   */
>  void rcu_check_callbacks(int cpu, int user)
>  {
> +	extern int idle_task_is_really_idle;
> +
>  	if (user ||
> -	    (idle_cpu(cpu) && !in_softirq() &&
> +	    (idle_cpu(cpu) && idle_task_is_really_idle && !in_softirq() &&
>  				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
>
>  		/*
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23  8:24                         ` Vegard Nossum
@ 2009-02-23 15:39                           ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-23 15:39 UTC (permalink / raw)
  To: Vegard Nossum
  Cc: Ingo Molnar, stable, Andrew Morton, Nick Piggin, Pekka Enberg,
	linux-kernel

On Mon, Feb 23, 2009 at 09:24:39AM +0100, Vegard Nossum wrote:
> 2009/2/23 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> > Hello, Vegard,
> >
> > Could you please try out the following patch?  I am not 100% confident
> > of it on non-x86 architectures, nor during the time that non-boot CPUs
> > start up (though this patch should not break non-boot CPUs any more than
> > they might already be broken).
> 
> Hi!
> 
> This patch fixes it for me. Now I see the huge stream of call_rcu()s,
> followed by a huge stream of kfrees() (after the list traversal has
> completed).

Much better!!!  Thank you for checking this out!  To say nothing of all
your work finding and isolating the problem!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23  9:27                             ` Ingo Molnar
@ 2009-02-23 15:56                               ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-23 15:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Vegard Nossum, stable, Nick Piggin, Pekka Enberg,
	linux-kernel

On Mon, Feb 23, 2009 at 10:27:44AM +0100, Ingo Molnar wrote:
> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Mon, 23 Feb 2009 10:07:35 +0100 Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > > ---
> > > > 
> > > >  init/main.c         |    3 +++
> > > >  kernel/rcuclassic.c |    4 +++-
> > > >  kernel/rcutree.c    |    4 +++-
> > > >  3 files changed, 9 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/init/main.c b/init/main.c
> > > > index 8442094..51f4b71 100644
> > > > --- a/init/main.c
> > > > +++ b/init/main.c
> > > > @@ -121,6 +121,8 @@ static char *static_command_line;
> > > >  static char *execute_command;
> > > >  static char *ramdisk_execute_command;
> > > >  
> > > > +int idle_task_is_really_idle;	/* set to 1 late in boot. */
> > > > +
> > > >  #ifdef CONFIG_SMP
> > > >  /* Setup configured maximum number of CPUs to activate */
> > > >  unsigned int __initdata setup_max_cpus = NR_CPUS;
> > > > @@ -463,6 +465,7 @@ static noinline void __init_refok rest_init(void)
> > > >  	 * at least once to get things moving:
> > > >  	 */
> > > >  	init_idle_bootup_task(current);
> > > > +	idle_task_is_really_idle = 1;
> > > >  	preempt_enable_no_resched();
> > > >  	schedule();
> > > >  	preempt_disable();
> > > 
> > > Could you please use system_state instead? We could insert a new 
> > > stage - or just use SYSTEM_RUNNING as the trigger.
> > 
> > I think the standalone flag is better (once those 
> > extern-decls-in-C get fixed).
> > 
> > system_state's semantics have, err, evolved over time.  If 
> > this happens again (and the patch sneaks past my attention) 
> > then there's a risk that code which depends upon system_state 
> > will break - this has happened in the past.  Plus piling more 
> > dependencies on system_state of course makes any evolution of 
> > its semantics harder to do...
> 
> All we need is a SYSTEM_BOOTING_EARLY boundary - prior which 
> there's no real scheduling yet. I used SYSTEM_SCHEDULER_BOOTING 
> state before and it worked well and wasnt fragile.
> 
> Our system_state semantics problems were more rooted in the fact 
> that the SYSTEM_BOOTING stage wasnt well defined. But if we did 
> a SYSTEM_SCHEDULER_BOOTING stage that would be pretty 
> bit-rot-safe.

Actually, I should be able to simply use (system_state != SYSTEM_BOOTING),
because the system state transitions to SYSTEM_RUNNING at essentially
the same place that I set my new global variable.  Especially given that
this approach saves others from any bad guesses I might make about the
existing uses of SYSTEM_BOOTING.  ;-)

I will also credit kmemcheck on the resulting patch.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23 13:29                         ` Nick Piggin
@ 2009-02-23 16:17                           ` Paul E. McKenney
  2009-02-23 17:20                             ` Ingo Molnar
  2009-02-23 19:10                             ` Andrew Morton
  0 siblings, 2 replies; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-23 16:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Vegard Nossum, Ingo Molnar, stable, Andrew Morton, Nick Piggin,
	Pekka Enberg, linux-kernel

On Tue, Feb 24, 2009 at 12:29:36AM +1100, Nick Piggin wrote:
> On Monday 23 February 2009 16:17:09 Paul E. McKenney wrote:
> 
> > The boot CPU runs in the context of its idle thread during boot-up.
> > During this time, idle_cpu(0) will always return nonzero, which will
> > fool Classic and Hierarchical RCU into deciding that a large chunk of
> > the boot-up sequence is a big long quiescent state.  This in turn causes
> > RCU to prematurely end grace periods during this time.
> >
> > This patch creates a new global variable that is set to 1 just before
> > the boot CPU first enters the scheduler, after which the idle task
> > really is idle.
> 
> Nice work all (btw. if this patch goes in rather than using system_state,
> then please make the variable __read_mostly).

Hmmm...  I misread this and made system_state be __read_mostly.  Let
me know if this is bad, easy to fix if needed.

							Thanx, Paul

> Vegard, I would like to still use your patch in vmalloc.c as well. It
> solves a possible use-after-free with preemptible RCU, and also helps with
> a patch I have to conditionally disable lazy vmap unmapping (for Xen).
> 
> We _could_ disable RCU there instead to solve the preemptible RCU bug, but
> your patch I think is less overhead.
> 
> So with appropriate changelog update, please also resend your patch
> (with Acked-by: Nick Piggin <npiggin@suse.de>)
> 
> Thanks,
> Nick
> 
> > Located-by: Vegard Nossum <vegard.nossum@gmail.com>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> >
> >  init/main.c         |    3 +++
> >  kernel/rcuclassic.c |    4 +++-
> >  kernel/rcutree.c    |    4 +++-
> >  3 files changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/init/main.c b/init/main.c
> > index 8442094..51f4b71 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -121,6 +121,8 @@ static char *static_command_line;
> >  static char *execute_command;
> >  static char *ramdisk_execute_command;
> >
> > +int idle_task_is_really_idle;	/* set to 1 late in boot. */
> > +
> >  #ifdef CONFIG_SMP
> >  /* Setup configured maximum number of CPUs to activate */
> >  unsigned int __initdata setup_max_cpus = NR_CPUS;
> > @@ -463,6 +465,7 @@ static noinline void __init_refok rest_init(void)
> >  	 * at least once to get things moving:
> >  	 */
> >  	init_idle_bootup_task(current);
> > +	idle_task_is_really_idle = 1;
> >  	preempt_enable_no_resched();
> >  	schedule();
> >  	preempt_disable();
> > diff --git a/kernel/rcuclassic.c b/kernel/rcuclassic.c
> > index bd5a900..a758fa6 100644
> > --- a/kernel/rcuclassic.c
> > +++ b/kernel/rcuclassic.c
> > @@ -678,8 +678,10 @@ int rcu_needs_cpu(int cpu)
> >   */
> >  void rcu_check_callbacks(int cpu, int user)
> >  {
> > +	extern int idle_task_is_really_idle;
> > +
> >  	if (user ||
> > -	    (idle_cpu(cpu) && !in_softirq() &&
> > +	    (idle_cpu(cpu) && idle_task_is_really_idle && !in_softirq() &&
> >  				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
> >
> >  		/*
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index b2fd602..e996d85 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -947,8 +947,10 @@ static void rcu_do_batch(struct rcu_data *rdp)
> >   */
> >  void rcu_check_callbacks(int cpu, int user)
> >  {
> > +	extern int idle_task_is_really_idle;
> > +
> >  	if (user ||
> > -	    (idle_cpu(cpu) && !in_softirq() &&
> > +	    (idle_cpu(cpu) && idle_task_is_really_idle && !in_softirq() &&
> >  				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
> >
> >  		/*
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23 16:17                           ` Paul E. McKenney
@ 2009-02-23 17:20                             ` Ingo Molnar
  2009-02-23 19:10                             ` Andrew Morton
  1 sibling, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2009-02-23 17:20 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Nick Piggin, Vegard Nossum, stable, Andrew Morton, Nick Piggin,
	Pekka Enberg, linux-kernel


* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> On Tue, Feb 24, 2009 at 12:29:36AM +1100, Nick Piggin wrote:
> > On Monday 23 February 2009 16:17:09 Paul E. McKenney wrote:
> > 
> > > The boot CPU runs in the context of its idle thread during boot-up.
> > > During this time, idle_cpu(0) will always return nonzero, which will
> > > fool Classic and Hierarchical RCU into deciding that a large chunk of
> > > the boot-up sequence is a big long quiescent state.  This in turn causes
> > > RCU to prematurely end grace periods during this time.
> > >
> > > This patch creates a new global variable that is set to 1 just before
> > > the boot CPU first enters the scheduler, after which the idle task
> > > really is idle.
> > 
> > Nice work all (btw. if this patch goes in rather than using system_state,
> > then please make the variable __read_mostly).
> 
> Hmmm...  I misread this and made system_state be 
> __read_mostly.  Let me know if this is bad, easy to fix if 
> needed.

no, it's the right thing to do. I'd not be surprised if we were 
already running into it as a performance problem - a number of 
common things check this flag.

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23 16:17                           ` Paul E. McKenney
  2009-02-23 17:20                             ` Ingo Molnar
@ 2009-02-23 19:10                             ` Andrew Morton
  2009-02-23 19:30                               ` Paul E. McKenney
  2009-02-23 19:33                               ` Ingo Molnar
  1 sibling, 2 replies; 46+ messages in thread
From: Andrew Morton @ 2009-02-23 19:10 UTC (permalink / raw)
  To: paulmck
  Cc: Nick Piggin, Vegard Nossum, Ingo Molnar, stable, Nick Piggin,
	Pekka Enberg, linux-kernel

On Mon, 23 Feb 2009 08:17:26 -0800 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Tue, Feb 24, 2009 at 12:29:36AM +1100, Nick Piggin wrote:
> > On Monday 23 February 2009 16:17:09 Paul E. McKenney wrote:
> > 
> > > The boot CPU runs in the context of its idle thread during boot-up.
> > > During this time, idle_cpu(0) will always return nonzero, which will
> > > fool Classic and Hierarchical RCU into deciding that a large chunk of
> > > the boot-up sequence is a big long quiescent state.  This in turn causes
> > > RCU to prematurely end grace periods during this time.
> > >
> > > This patch creates a new global variable that is set to 1 just before
> > > the boot CPU first enters the scheduler, after which the idle task
> > > really is idle.
> > 
> > Nice work all (btw. if this patch goes in rather than using system_state,
> > then please make the variable __read_mostly).
> 
> Hmmm...  I misread this and made system_state be __read_mostly.  Let
> me know if this is bad, easy to fix if needed.

Please don't use system_state.  The whole thing is just bad design. 
It's a global variable, breaks encapsulation, creates interactions etc.
CS-101 stuff.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23 19:10                             ` Andrew Morton
@ 2009-02-23 19:30                               ` Paul E. McKenney
  2009-02-23 19:59                                 ` Andrew Morton
  2009-02-23 19:33                               ` Ingo Molnar
  1 sibling, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-23 19:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Vegard Nossum, Ingo Molnar, stable, Nick Piggin,
	Pekka Enberg, linux-kernel

On Mon, Feb 23, 2009 at 11:10:09AM -0800, Andrew Morton wrote:
> On Mon, 23 Feb 2009 08:17:26 -0800 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Tue, Feb 24, 2009 at 12:29:36AM +1100, Nick Piggin wrote:
> > > On Monday 23 February 2009 16:17:09 Paul E. McKenney wrote:
> > > 
> > > > The boot CPU runs in the context of its idle thread during boot-up.
> > > > During this time, idle_cpu(0) will always return nonzero, which will
> > > > fool Classic and Hierarchical RCU into deciding that a large chunk of
> > > > the boot-up sequence is a big long quiescent state.  This in turn causes
> > > > RCU to prematurely end grace periods during this time.
> > > >
> > > > This patch creates a new global variable that is set to 1 just before
> > > > the boot CPU first enters the scheduler, after which the idle task
> > > > really is idle.
> > > 
> > > Nice work all (btw. if this patch goes in rather than using system_state,
> > > then please make the variable __read_mostly).
> > 
> > Hmmm...  I misread this and made system_state be __read_mostly.  Let
> > me know if this is bad, easy to fix if needed.
> 
> Please don't use system_state.  The whole thing is just bad design. 
> It's a global variable, breaks encapsulation, creates interactions etc.
> CS-101 stuff.

OK.  Would it help if I wrapped an accessor function around system_state?

I do need some sort of global state if I am to solve this problem.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23 19:10                             ` Andrew Morton
  2009-02-23 19:30                               ` Paul E. McKenney
@ 2009-02-23 19:33                               ` Ingo Molnar
  2009-02-23 20:04                                 ` Andrew Morton
  2009-02-23 20:43                                 ` Paul E. McKenney
  1 sibling, 2 replies; 46+ messages in thread
From: Ingo Molnar @ 2009-02-23 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: paulmck, Nick Piggin, Vegard Nossum, stable, Nick Piggin,
	Pekka Enberg, linux-kernel


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 23 Feb 2009 08:17:26 -0800 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Tue, Feb 24, 2009 at 12:29:36AM +1100, Nick Piggin wrote:
> > > On Monday 23 February 2009 16:17:09 Paul E. McKenney wrote:
> > > 
> > > > The boot CPU runs in the context of its idle thread during boot-up.
> > > > During this time, idle_cpu(0) will always return nonzero, which will
> > > > fool Classic and Hierarchical RCU into deciding that a large chunk of
> > > > the boot-up sequence is a big long quiescent state.  This in turn causes
> > > > RCU to prematurely end grace periods during this time.
> > > >
> > > > This patch creates a new global variable that is set to 1 just before
> > > > the boot CPU first enters the scheduler, after which the idle task
> > > > really is idle.
> > > 
> > > Nice work all (btw. if this patch goes in rather than using system_state,
> > > then please make the variable __read_mostly).
> > 
> > Hmmm...  I misread this and made system_state be __read_mostly.  Let
> > me know if this is bad, easy to fix if needed.
> 
> Please don't use system_state.  The whole thing is just bad 
> design. It's a global variable, breaks encapsulation, creates 
> interactions etc. CS-101 stuff.

ok, i've removed the patch - Paul, would you mind to re-send 
your original flag solution, with it marked __read_mostly and 
with the extern declarations put into a suitable header file?

Paul, incidentally, this very minute i tracked down that the 
patch is also causing boot lockups in -tip testing. I havent yet 
fully debugged it, but a question comes immediately: if there's 
no grace periods during bootup, wont rcu_sync() & friends just 
hang indefinitely?

More thought is needed.

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23 19:30                               ` Paul E. McKenney
@ 2009-02-23 19:59                                 ` Andrew Morton
  2009-02-23 20:12                                   ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2009-02-23 19:59 UTC (permalink / raw)
  To: paulmck
  Cc: nickpiggin, vegard.nossum, mingo, stable, npiggin, penberg,
	linux-kernel

On Mon, 23 Feb 2009 11:30:57 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Mon, Feb 23, 2009 at 11:10:09AM -0800, Andrew Morton wrote:
> > On Mon, 23 Feb 2009 08:17:26 -0800 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > > On Tue, Feb 24, 2009 at 12:29:36AM +1100, Nick Piggin wrote:
> > > > On Monday 23 February 2009 16:17:09 Paul E. McKenney wrote:
> > > > 
> > > > > The boot CPU runs in the context of its idle thread during boot-up.
> > > > > During this time, idle_cpu(0) will always return nonzero, which will
> > > > > fool Classic and Hierarchical RCU into deciding that a large chunk of
> > > > > the boot-up sequence is a big long quiescent state.  This in turn causes
> > > > > RCU to prematurely end grace periods during this time.
> > > > >
> > > > > This patch creates a new global variable that is set to 1 just before
> > > > > the boot CPU first enters the scheduler, after which the idle task
> > > > > really is idle.
> > > > 
> > > > Nice work all (btw. if this patch goes in rather than using system_state,
> > > > then please make the variable __read_mostly).
> > > 
> > > Hmmm...  I misread this and made system_state be __read_mostly.  Let
> > > me know if this is bad, easy to fix if needed.
> > 
> > Please don't use system_state.  The whole thing is just bad design. 
> > It's a global variable, breaks encapsulation, creates interactions etc.
> > CS-101 stuff.
> 
> OK.  Would it help if I wrapped an accessor function around system_state?

That doesn't help the core problem: system_state creates interactions
between unrelated subsystems.  And the number of interactions grows
exponentially with the number of subsystems which use system_state.

> I do need some sort of global state if I am to solve this problem.  ;-)

As I suggested before, please add a new state variable which is private
to your subsystem.  That way it remains isolated from other subsystems.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23 19:33                               ` Ingo Molnar
@ 2009-02-23 20:04                                 ` Andrew Morton
  2009-02-23 20:09                                   ` Ingo Molnar
  2009-02-23 20:44                                   ` Paul E. McKenney
  2009-02-23 20:43                                 ` Paul E. McKenney
  1 sibling, 2 replies; 46+ messages in thread
From: Andrew Morton @ 2009-02-23 20:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: paulmck, nickpiggin, vegard.nossum, stable, npiggin, penberg,
	linux-kernel

On Mon, 23 Feb 2009 20:33:51 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> > > > Nice work all (btw. if this patch goes in rather than using system_state,
> > > > then please make the variable __read_mostly).
> > > 
> > > Hmmm...  I misread this and made system_state be __read_mostly.  Let
> > > me know if this is bad, easy to fix if needed.
> > 
> > Please don't use system_state.  The whole thing is just bad 
> > design. It's a global variable, breaks encapsulation, creates 
> > interactions etc. CS-101 stuff.
> 
> ok, i've removed the patch

Thanks.

> - Paul, would you mind to re-send 
> your original flag solution, with it marked __read_mostly and 
> with the extern declarations put into a suitable header file?

It would still make sense to mark system_state __read_mostly.

(That's assuming that __read_mostly makes sense.  I still wanna
rename it to __pack_all_write_often_variables_into_the_same_cacheline
and see what happens then)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23 20:04                                 ` Andrew Morton
@ 2009-02-23 20:09                                   ` Ingo Molnar
  2009-02-23 20:44                                   ` Paul E. McKenney
  1 sibling, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2009-02-23 20:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: paulmck, nickpiggin, vegard.nossum, stable, npiggin, penberg,
	linux-kernel


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 23 Feb 2009 20:33:51 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > > > Nice work all (btw. if this patch goes in rather than using system_state,
> > > > > then please make the variable __read_mostly).
> > > > 
> > > > Hmmm...  I misread this and made system_state be __read_mostly.  Let
> > > > me know if this is bad, easy to fix if needed.
> > > 
> > > Please don't use system_state.  The whole thing is just bad 
> > > design. It's a global variable, breaks encapsulation, creates 
> > > interactions etc. CS-101 stuff.
> > 
> > ok, i've removed the patch
> 
> Thanks.
> 
> > - Paul, would you mind to re-send your original flag 
> > solution, with it marked __read_mostly and with the extern 
> > declarations put into a suitable header file?
> 
> It would still make sense to mark system_state __read_mostly.
> 
> (That's assuming that __read_mostly makes sense.  I still 
> wanna rename it to 
> __pack_all_write_often_variables_into_the_same_cacheline and 
> see what happens then)

the write often variables should be on separate cachelines. 
Putting them on the same cacheline as a read-mostly variable is 
not helpful.

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23 19:59                                 ` Andrew Morton
@ 2009-02-23 20:12                                   ` Paul E. McKenney
  2009-02-23 20:30                                     ` Andrew Morton
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-23 20:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: nickpiggin, vegard.nossum, mingo, stable, npiggin, penberg,
	linux-kernel

On Mon, Feb 23, 2009 at 11:59:26AM -0800, Andrew Morton wrote:
> On Mon, 23 Feb 2009 11:30:57 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Mon, Feb 23, 2009 at 11:10:09AM -0800, Andrew Morton wrote:
> > > On Mon, 23 Feb 2009 08:17:26 -0800 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > 
> > > > On Tue, Feb 24, 2009 at 12:29:36AM +1100, Nick Piggin wrote:
> > > > > On Monday 23 February 2009 16:17:09 Paul E. McKenney wrote:
> > > > > 
> > > > > > The boot CPU runs in the context of its idle thread during boot-up.
> > > > > > During this time, idle_cpu(0) will always return nonzero, which will
> > > > > > fool Classic and Hierarchical RCU into deciding that a large chunk of
> > > > > > the boot-up sequence is a big long quiescent state.  This in turn causes
> > > > > > RCU to prematurely end grace periods during this time.
> > > > > >
> > > > > > This patch creates a new global variable that is set to 1 just before
> > > > > > the boot CPU first enters the scheduler, after which the idle task
> > > > > > really is idle.
> > > > > 
> > > > > Nice work all (btw. if this patch goes in rather than using system_state,
> > > > > then please make the variable __read_mostly).
> > > > 
> > > > Hmmm...  I misread this and made system_state be __read_mostly.  Let
> > > > me know if this is bad, easy to fix if needed.
> > > 
> > > Please don't use system_state.  The whole thing is just bad design. 
> > > It's a global variable, breaks encapsulation, creates interactions etc.
> > > CS-101 stuff.
> > 
> > OK.  Would it help if I wrapped an accessor function around system_state?
> 
> That doesn't help the core problem: system_state creates interactions
> between unrelated subsystems.  And the number of interactions grows
> exponentially with the number of subsystems which use system_state.
> 
> > I do need some sort of global state if I am to solve this problem.  ;-)
> 
> As I suggested before, please add a new state variable which is private
> to your subsystem.  That way it remains isolated from other subsystems.

OK, so my thought would be to have a function call into RCU from either
init_post() or from rest_init().  This function would then set a variable
private to RCU.

Seem reasonable?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23 20:12                                   ` Paul E. McKenney
@ 2009-02-23 20:30                                     ` Andrew Morton
  0 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2009-02-23 20:30 UTC (permalink / raw)
  To: paulmck
  Cc: nickpiggin, vegard.nossum, mingo, stable, npiggin, penberg,
	linux-kernel

On Mon, 23 Feb 2009 12:12:36 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > As I suggested before, please add a new state variable which is private
> > to your subsystem.  That way it remains isolated from other subsystems.
> 
> OK, so my thought would be to have a function call into RCU from either
> init_post() or from rest_init().  This function would then set a variable
> private to RCU.
> 
> Seem reasonable?
> 

If that flag will only be used by one flavour of RCU then that would be
a good way to go.

Otherwise, heck, just make it a global variable with a well-chosen name
and write to it directly.  Simple, perfectly clear and understandable..


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23 19:33                               ` Ingo Molnar
  2009-02-23 20:04                                 ` Andrew Morton
@ 2009-02-23 20:43                                 ` Paul E. McKenney
  2009-02-24  3:23                                   ` Nick Piggin
  1 sibling, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-23 20:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Nick Piggin, Vegard Nossum, stable, Nick Piggin,
	Pekka Enberg, linux-kernel

On Mon, Feb 23, 2009 at 08:33:51PM +0100, Ingo Molnar wrote:
> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Mon, 23 Feb 2009 08:17:26 -0800 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > > On Tue, Feb 24, 2009 at 12:29:36AM +1100, Nick Piggin wrote:
> > > > On Monday 23 February 2009 16:17:09 Paul E. McKenney wrote:
> > > > 
> > > > > The boot CPU runs in the context of its idle thread during boot-up.
> > > > > During this time, idle_cpu(0) will always return nonzero, which will
> > > > > fool Classic and Hierarchical RCU into deciding that a large chunk of
> > > > > the boot-up sequence is a big long quiescent state.  This in turn causes
> > > > > RCU to prematurely end grace periods during this time.
> > > > >
> > > > > This patch creates a new global variable that is set to 1 just before
> > > > > the boot CPU first enters the scheduler, after which the idle task
> > > > > really is idle.
> > > > 
> > > > Nice work all (btw. if this patch goes in rather than using system_state,
> > > > then please make the variable __read_mostly).
> > > 
> > > Hmmm...  I misread this and made system_state be __read_mostly.  Let
> > > me know if this is bad, easy to fix if needed.
> > 
> > Please don't use system_state.  The whole thing is just bad 
> > design. It's a global variable, breaks encapsulation, creates 
> > interactions etc. CS-101 stuff.
> 
> ok, i've removed the patch - Paul, would you mind to re-send 
> your original flag solution, with it marked __read_mostly and 
> with the extern declarations put into a suitable header file?
> 
> Paul, incidentally, this very minute i tracked down that the 
> patch is also causing boot lockups in -tip testing. I havent yet 
> fully debugged it, but a question comes immediately: if there's 
> no grace periods during bootup, wont rcu_sync() & friends just 
> hang indefinitely?

Ouch!!!  Indeed they would.

> More thought is needed.

One fix would be to sprinkle calls to rcu_qsctr_inc() through the
boot process.  But a much better approach would be for me to make
synchronize_rcu() check this same flag, and simply return if called
during early boot.  The rationale for this is that there is but a single
CPU during early boot, so tinyrcu.c's optimization can be used.  ;-)

Out of both paranoia and self defense, I would check num_online_cpus()
in my proposed call into RCU.  ;-)

Seem reasonable?  And does synchronize_sched() also need the UP-only
optimization?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23 20:04                                 ` Andrew Morton
  2009-02-23 20:09                                   ` Ingo Molnar
@ 2009-02-23 20:44                                   ` Paul E. McKenney
  1 sibling, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-23 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, nickpiggin, vegard.nossum, stable, npiggin, penberg,
	linux-kernel

On Mon, Feb 23, 2009 at 12:04:01PM -0800, Andrew Morton wrote:
> On Mon, 23 Feb 2009 20:33:51 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > > > Nice work all (btw. if this patch goes in rather than using system_state,
> > > > > then please make the variable __read_mostly).
> > > > 
> > > > Hmmm...  I misread this and made system_state be __read_mostly.  Let
> > > > me know if this is bad, easy to fix if needed.
> > > 
> > > Please don't use system_state.  The whole thing is just bad 
> > > design. It's a global variable, breaks encapsulation, creates 
> > > interactions etc. CS-101 stuff.
> > 
> > ok, i've removed the patch
> 
> Thanks.
> 
> > - Paul, would you mind to re-send 
> > your original flag solution, with it marked __read_mostly and 
> > with the extern declarations put into a suitable header file?
> 
> It would still make sense to mark system_state __read_mostly.
> 
> (That's assuming that __read_mostly makes sense.  I still wanna
> rename it to __pack_all_write_often_variables_into_the_same_cacheline
> and see what happens then)

The patch I just posted retains this change.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-23 20:43                                 ` Paul E. McKenney
@ 2009-02-24  3:23                                   ` Nick Piggin
  2009-02-24  3:37                                     ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Nick Piggin @ 2009-02-24  3:23 UTC (permalink / raw)
  To: paulmck
  Cc: Ingo Molnar, Andrew Morton, Vegard Nossum, stable, Nick Piggin,
	Pekka Enberg, linux-kernel

On Tuesday 24 February 2009 07:43:59 Paul E. McKenney wrote:
> On Mon, Feb 23, 2009 at 08:33:51PM +0100, Ingo Molnar wrote:
> > * Andrew Morton <akpm@linux-foundation.org> wrote:
> > > On Mon, 23 Feb 2009 08:17:26 -0800 "Paul E. McKenney" 
<paulmck@linux.vnet.ibm.com> wrote:
> > > > On Tue, Feb 24, 2009 at 12:29:36AM +1100, Nick Piggin wrote:
> > > > > On Monday 23 February 2009 16:17:09 Paul E. McKenney wrote:
> > > > > > The boot CPU runs in the context of its idle thread during
> > > > > > boot-up. During this time, idle_cpu(0) will always return
> > > > > > nonzero, which will fool Classic and Hierarchical RCU into
> > > > > > deciding that a large chunk of the boot-up sequence is a big long
> > > > > > quiescent state.  This in turn causes RCU to prematurely end
> > > > > > grace periods during this time.
> > > > > >
> > > > > > This patch creates a new global variable that is set to 1 just
> > > > > > before the boot CPU first enters the scheduler, after which the
> > > > > > idle task really is idle.
> > > > >
> > > > > Nice work all (btw. if this patch goes in rather than using
> > > > > system_state, then please make the variable __read_mostly).
> > > >
> > > > Hmmm...  I misread this and made system_state be __read_mostly.  Let
> > > > me know if this is bad, easy to fix if needed.
> > >
> > > Please don't use system_state.  The whole thing is just bad
> > > design. It's a global variable, breaks encapsulation, creates
> > > interactions etc. CS-101 stuff.
> >
> > ok, i've removed the patch - Paul, would you mind to re-send
> > your original flag solution, with it marked __read_mostly and
> > with the extern declarations put into a suitable header file?
> >
> > Paul, incidentally, this very minute i tracked down that the
> > patch is also causing boot lockups in -tip testing. I havent yet
> > fully debugged it, but a question comes immediately: if there's
> > no grace periods during bootup, wont rcu_sync() & friends just
> > hang indefinitely?
>
> Ouch!!!  Indeed they would.
>
> > More thought is needed.
>
> One fix would be to sprinkle calls to rcu_qsctr_inc() through the
> boot process.  But a much better approach would be for me to make
> synchronize_rcu() check this same flag, and simply return if called
> during early boot.  The rationale for this is that there is but a single
> CPU during early boot, so tinyrcu.c's optimization can be used.  ;-)

Well can you simply return if called if num_online_cpus() == 1, regardless
of the state of boot?

> Out of both paranoia and self defense, I would check num_online_cpus()
> in my proposed call into RCU.  ;-)
>
> Seem reasonable?  And does synchronize_sched() also need the UP-only
> optimization?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] mm: fix lazy vmap purging (use-after-free error)
  2009-02-24  3:23                                   ` Nick Piggin
@ 2009-02-24  3:37                                     ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2009-02-24  3:37 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Andrew Morton, Vegard Nossum, stable, Nick Piggin,
	Pekka Enberg, linux-kernel

On Tue, Feb 24, 2009 at 02:23:19PM +1100, Nick Piggin wrote:
> On Tuesday 24 February 2009 07:43:59 Paul E. McKenney wrote:
> > On Mon, Feb 23, 2009 at 08:33:51PM +0100, Ingo Molnar wrote:
> > > * Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > On Mon, 23 Feb 2009 08:17:26 -0800 "Paul E. McKenney" 
> <paulmck@linux.vnet.ibm.com> wrote:
> > > > > On Tue, Feb 24, 2009 at 12:29:36AM +1100, Nick Piggin wrote:
> > > > > > On Monday 23 February 2009 16:17:09 Paul E. McKenney wrote:
> > > > > > > The boot CPU runs in the context of its idle thread during
> > > > > > > boot-up. During this time, idle_cpu(0) will always return
> > > > > > > nonzero, which will fool Classic and Hierarchical RCU into
> > > > > > > deciding that a large chunk of the boot-up sequence is a big long
> > > > > > > quiescent state.  This in turn causes RCU to prematurely end
> > > > > > > grace periods during this time.
> > > > > > >
> > > > > > > This patch creates a new global variable that is set to 1 just
> > > > > > > before the boot CPU first enters the scheduler, after which the
> > > > > > > idle task really is idle.
> > > > > >
> > > > > > Nice work all (btw. if this patch goes in rather than using
> > > > > > system_state, then please make the variable __read_mostly).
> > > > >
> > > > > Hmmm...  I misread this and made system_state be __read_mostly.  Let
> > > > > me know if this is bad, easy to fix if needed.
> > > >
> > > > Please don't use system_state.  The whole thing is just bad
> > > > design. It's a global variable, breaks encapsulation, creates
> > > > interactions etc. CS-101 stuff.
> > >
> > > ok, i've removed the patch - Paul, would you mind to re-send
> > > your original flag solution, with it marked __read_mostly and
> > > with the extern declarations put into a suitable header file?
> > >
> > > Paul, incidentally, this very minute i tracked down that the
> > > patch is also causing boot lockups in -tip testing. I havent yet
> > > fully debugged it, but a question comes immediately: if there's
> > > no grace periods during bootup, wont rcu_sync() & friends just
> > > hang indefinitely?
> >
> > Ouch!!!  Indeed they would.
> >
> > > More thought is needed.
> >
> > One fix would be to sprinkle calls to rcu_qsctr_inc() through the
> > boot process.  But a much better approach would be for me to make
> > synchronize_rcu() check this same flag, and simply return if called
> > during early boot.  The rationale for this is that there is but a single
> > CPU during early boot, so tinyrcu.c's optimization can be used.  ;-)
> 
> Well can you simply return if called if num_online_cpus() == 1, regardless
> of the state of boot?

Yep!

And that is indeed what I do in http://lkml.org/lkml/2009/2/23/305.

						Thanx, Paul

> > Out of both paranoia and self defense, I would check num_online_cpus()
> > in my proposed call into RCU.  ;-)
> >
> > Seem reasonable?  And does synchronize_sched() also need the UP-only
> > optimization?
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2009-02-24  3:37 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-02-20 13:41 [PATCH] mm: fix lazy vmap purging (use-after-free error) Vegard Nossum
2009-02-20 13:50 ` Ingo Molnar
2009-02-20 13:58   ` Pekka Enberg
2009-02-20 14:01   ` Ingo Molnar
2009-02-20 14:18     ` Pekka Enberg
2009-02-20 15:41       ` Paul E. McKenney
2009-02-20 14:51     ` Vegard Nossum
2009-02-20 15:46       ` Paul E. McKenney
2009-02-20 16:04         ` Ingo Molnar
2009-02-20 16:44           ` Paul E. McKenney
2009-02-20 17:14             ` Ingo Molnar
2009-02-20 17:25               ` Paul E. McKenney
2009-02-20 23:51         ` Vegard Nossum
2009-02-21  1:40           ` Paul E. McKenney
2009-02-21  9:30             ` Vegard Nossum
2009-02-21 17:47               ` Paul E. McKenney
2009-02-21 18:08                 ` Vegard Nossum
2009-02-21 18:33                   ` Paul E. McKenney
2009-02-21 18:37                   ` Vegard Nossum
2009-02-22  3:00                     ` Paul E. McKenney
2009-02-23  5:17                       ` Paul E. McKenney
2009-02-23  8:24                         ` Vegard Nossum
2009-02-23 15:39                           ` Paul E. McKenney
2009-02-23  9:07                         ` Ingo Molnar
2009-02-23  9:17                           ` Andrew Morton
2009-02-23  9:27                             ` Ingo Molnar
2009-02-23 15:56                               ` Paul E. McKenney
2009-02-23 13:29                         ` Nick Piggin
2009-02-23 16:17                           ` Paul E. McKenney
2009-02-23 17:20                             ` Ingo Molnar
2009-02-23 19:10                             ` Andrew Morton
2009-02-23 19:30                               ` Paul E. McKenney
2009-02-23 19:59                                 ` Andrew Morton
2009-02-23 20:12                                   ` Paul E. McKenney
2009-02-23 20:30                                     ` Andrew Morton
2009-02-23 19:33                               ` Ingo Molnar
2009-02-23 20:04                                 ` Andrew Morton
2009-02-23 20:09                                   ` Ingo Molnar
2009-02-23 20:44                                   ` Paul E. McKenney
2009-02-23 20:43                                 ` Paul E. McKenney
2009-02-24  3:23                                   ` Nick Piggin
2009-02-24  3:37                                     ` Paul E. McKenney
2009-02-21 19:21                 ` Vegard Nossum
2009-02-20 16:01       ` Ingo Molnar
2009-02-20 16:49         ` Paul E. McKenney
2009-02-20 15:56     ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).