[PATCH 1/2] net: dont leave active on stack LIST

public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/2] net: dont leave active on stack LIST_HEAD
       [not found]                         ` <1298014191.2642.11.camel@edumazet-laptop>
@ 2011-02-18  8:54                           ` Eric Dumazet
  2011-02-18 20:14                             ` David Miller
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2011-02-18  8:54 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, ebiederm, opurdila, mingo, mhocko, linux-mm,
	linux-kernel, netdev

From: Linus Torvalds <torvalds@linux-foundation.org>

Eric W. Biderman and Michal Hocko reported various memory corruptions
that we suspected to be related to a LIST head located on stack, that
was manipulated after thread left function frame (and eventually exited,
so its stack was freed and reused).

Eric Dumazet suggested the problem was probably coming from commit
443457242beb (net: factorize
sync-rcu call in unregister_netdevice_many)

This patch fixes __dev_close() and dev_close() to properly deinit their
respective LIST_HEAD(single) before exiting.

References: https://lkml.org/lkml/2011/2/16/304
References: https://lkml.org/lkml/2011/2/14/223

Reported-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Eric W. Biderman <ebiderman@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Octavian Purdila <opurdila@ixiacom.com>
CC: Eric W. Biderman <ebiderman@xmission.com>
---
 net/core/dev.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8e726cb..a18c164 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1280,10 +1280,13 @@ static int __dev_close_many(struct list_head *head)
 
 static int __dev_close(struct net_device *dev)
 {
+	int retval;
 	LIST_HEAD(single);
 
 	list_add(&dev->unreg_list, &single);
-	return __dev_close_many(&single);
+	retval = __dev_close_many(&single);
+	list_del(&single);
+	return retval;
 }
 
 int dev_close_many(struct list_head *head)
@@ -1325,7 +1328,7 @@ int dev_close(struct net_device *dev)
 
 	list_add(&dev->unreg_list, &single);
 	dev_close_many(&single);
-
+	list_del(&single);
 	return 0;
 }
 EXPORT_SYMBOL(dev_close);


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] net: dont leave active on stack LIST_HEAD
  2011-02-18  8:54                           ` [PATCH 1/2] net: dont leave active on stack LIST_HEAD Eric Dumazet
@ 2011-02-18 20:14                             ` David Miller
  0 siblings, 0 replies; 12+ messages in thread
From: David Miller @ 2011-02-18 20:14 UTC (permalink / raw)
  To: eric.dumazet
  Cc: torvalds, ebiederm, opurdila, mingo, mhocko, linux-mm,
	linux-kernel, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 18 Feb 2011 09:54:38 +0100

> From: Linus Torvalds <torvalds@linux-foundation.org>
> 
> Eric W. Biderman and Michal Hocko reported various memory corruptions
> that we suspected to be related to a LIST head located on stack, that
> was manipulated after thread left function frame (and eventually exited,
> so its stack was freed and reused).
> 
> Eric Dumazet suggested the problem was probably coming from commit
> 443457242beb (net: factorize
> sync-rcu call in unregister_netdevice_many)
> 
> This patch fixes __dev_close() and dev_close() to properly deinit their
> respective LIST_HEAD(single) before exiting.
> 
> References: https://lkml.org/lkml/2011/2/16/304
> References: https://lkml.org/lkml/2011/2/14/223
> 
> Reported-by: Michal Hocko <mhocko@suse.cz>
> Reported-by: Eric W. Biderman <ebiderman@xmission.com>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

[parent not found: <m17hcx7wca.fsf@fess.ebiederm.org>]

* [PATCH 2/2] net: deinit automatic LIST_HEAD
       [not found]                     ` <m17hcx7wca.fsf@fess.ebiederm.org>
@ 2011-02-18  8:59                       ` Eric Dumazet
  2011-02-18 20:14                         ` David Miller
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2011-02-18  8:59 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, ebiederm, mingo, opurdila, linux-kernel, linux-mm,
	mhocko, netdev, stable

commit 9b5e383c11b08784 (net: Introduce
unregister_netdevice_many()) left an active LIST_HEAD() in
rollback_registered(), with possible memory corruption.

Even if device is freed without touching its unreg_list (and therefore
touching the previous memory location holding LISTE_HEAD(single), better
close the bug for good, since its really subtle.

(Same fix for default_device_exit_batch() for completeness)

Reported-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Eric W. Biderman <ebiderman@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Octavian Purdila <opurdila@ixiacom.com>
CC: Eric W. Biderman <ebiderman@xmission.com>
CC: stable <stable@kernel.org> [.33+]
---
 net/core/dev.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index a18c164..8ae6631 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5066,6 +5066,7 @@ static void rollback_registered(struct net_device *dev)
 
 	list_add(&dev->unreg_list, &single);
 	rollback_registered_many(&single);
+	list_del(&single);
 }
 
 unsigned long netdev_fix_features(unsigned long features, const char *name)
@@ -6219,6 +6220,7 @@ static void __net_exit default_device_exit_batch(struct list_head *net_list)
 		}
 	}
 	unregister_netdevice_many(&dev_kill_list);
+	list_del(&dev_kill_list);
 	rtnl_unlock();
 }
 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] net: deinit automatic LIST_HEAD
  2011-02-18  8:59                       ` [PATCH 2/2] net: deinit automatic LIST_HEAD Eric Dumazet
@ 2011-02-18 20:14                         ` David Miller
  0 siblings, 0 replies; 12+ messages in thread
From: David Miller @ 2011-02-18 20:14 UTC (permalink / raw)
  To: eric.dumazet
  Cc: torvalds, ebiederm, mingo, opurdila, linux-kernel, linux-mm,
	mhocko, netdev, stable

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 18 Feb 2011 09:59:19 +0100

> commit 9b5e383c11b08784 (net: Introduce
> unregister_netdevice_many()) left an active LIST_HEAD() in
> rollback_registered(), with possible memory corruption.
> 
> Even if device is freed without touching its unreg_list (and therefore
> touching the previous memory location holding LISTE_HEAD(single), better
> close the bug for good, since its really subtle.
> 
> (Same fix for default_device_exit_batch() for completeness)
> 
> Reported-by: Michal Hocko <mhocko@suse.cz>
> Reported-by: Eric W. Biderman <ebiderman@xmission.com>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

[parent not found: <20110218122938.GB26779@tiehlicka.suse.cz>]

[parent not found: <20110218162623.GD4862@tiehlicka.suse.cz>]

[parent not found: <AANLkTimO=M5xG_mnDBSxPKwSOTrp3JhHVBa8=wHsiVHY@mail.gmail.com>]

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
       [not found]                     ` <AANLkTimO=M5xG_mnDBSxPKwSOTrp3JhHVBa8=wHsiVHY@mail.gmail.com>
@ 2011-02-18 18:08                       ` Eric W. Biederman
  2011-02-18 18:48                         ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2011-02-18 18:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michal Hocko, Ingo Molnar, linux-mm, LKML, David Miller,
	Eric Dumazet, netdev

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, Feb 18, 2011 at 8:26 AM, Michal Hocko <mhocko@suse.cz> wrote:
>>> Now, I will try with the 2 patches patches in this thread. I will also
>>> turn on DEBUG_LIST and DEBUG_PAGEALLOC.
>>
>> I am not able to reproduce with those 2 patches applied.
>
> Thanks for verifying. Davem/EricD - you can add Michal's tested-by to
> the patches too.
>
> And I think we can consider this whole thing solved. It hopefully also
> explains all the other random crashes that EricB saw - just random
> memory corruption in other datastructures.
>
> EricB - do all your stress-testers run ok now?

Things are looking better and PAGEALLOC debug isn't firing.
So this looks like one bug down.  I have not seen the bad page map
symptom.

I am still getting programs segfaulting but that is happening on other
machines running on older kernels so I am going to chalk that up to a
buggy test and a false positive.

I am have OOM problems getting my tests run to complete.  On a good
day that happens about 1 time in 3 right now.  I'm guess I will have
to turn off DEBUG_PAGEALLOC to get everything to complete.
DEBUG_PAGEALLOC causes us to use more memory doesn't it?

The most interesting thing I have right now is a networking lockdep
issue.  Does anyone know what is going on there?

Eric


=================================
[ INFO: inconsistent lock state ]
2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
---------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
kworker/u:1/10833 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (tcp_death_row.death_lock){+.?...}, at: [<ffffffff81460e69>] inet_twsk_deschedule+0x29/0xa0
{IN-SOFTIRQ-W} state was registered at:
  [<ffffffff810840ce>] __lock_acquire+0x70e/0x1d30
  [<ffffffff81085cff>] lock_acquire+0x9f/0x120
  [<ffffffff814deb6c>] _raw_spin_lock+0x2c/0x40
  [<ffffffff8146066b>] inet_twsk_schedule+0x3b/0x1e0
  [<ffffffff8147bf7d>] tcp_time_wait+0x20d/0x380
  [<ffffffff8146b46e>] tcp_fin.clone.39+0x10e/0x1c0
  [<ffffffff8146c628>] tcp_data_queue+0x798/0xd50
  [<ffffffff8146fdd9>] tcp_rcv_state_process+0x799/0xbb0
  [<ffffffff814786d8>] tcp_v4_do_rcv+0x238/0x500
  [<ffffffff8147a90a>] tcp_v4_rcv+0x86a/0xbe0
  [<ffffffff81455d4d>] ip_local_deliver_finish+0x10d/0x380
  [<ffffffff81456180>] ip_local_deliver+0x80/0x90
  [<ffffffff81455832>] ip_rcv_finish+0x192/0x5a0
  [<ffffffff814563c4>] ip_rcv+0x234/0x300
  [<ffffffff81420e83>] __netif_receive_skb+0x443/0x700
  [<ffffffff81421d68>] netif_receive_skb+0xb8/0xf0
  [<ffffffff81421ed8>] napi_skb_finish+0x48/0x60
  [<ffffffff81422d35>] napi_gro_receive+0xb5/0xc0
  [<ffffffffa006b4cf>] igb_poll+0x89f/0xd20 [igb]
  [<ffffffff81422279>] net_rx_action+0x149/0x270
  [<ffffffff81054bc0>] __do_softirq+0xc0/0x1f0
  [<ffffffff81003d1c>] call_softirq+0x1c/0x30
  [<ffffffff81005825>] do_softirq+0xa5/0xe0
  [<ffffffff81054dfd>] irq_exit+0x8d/0xa0
  [<ffffffff81005391>] do_IRQ+0x61/0xe0
  [<ffffffff814df793>] ret_from_intr+0x0/0x1a
  [<ffffffff810ec9ed>] ____pagevec_lru_add+0x16d/0x1a0
  [<ffffffff810ed073>] lru_add_drain+0x73/0xe0
  [<ffffffff8110a44c>] exit_mmap+0x5c/0x180
  [<ffffffff8104aad2>] mmput+0x52/0xe0
  [<ffffffff810513c0>] exit_mm+0x120/0x150
  [<ffffffff81051522>] do_exit+0x132/0x8c0
  [<ffffffff81051f39>] do_group_exit+0x59/0xd0
  [<ffffffff81051fc2>] sys_exit_group+0x12/0x20
  [<ffffffff81002d92>] system_call_fastpath+0x16/0x1b
irq event stamp: 187417
hardirqs last  enabled at (187417): [<ffffffff81127db5>] kmem_cache_free+0x125/0x160
hardirqs last disabled at (187416): [<ffffffff81127d02>] kmem_cache_free+0x72/0x160
softirqs last  enabled at (187410): [<ffffffff81411c52>] sk_common_release+0x62/0xc0
softirqs last disabled at (187408): [<ffffffff814dec41>] _raw_write_lock_bh+0x11/0x40
other info that might help us debug this:
3 locks held by kworker/u:1/10833:
 #0:  (netns){.+.+.+}, at: [<ffffffff81068be1>] process_one_work+0x121/0x4b0
 #1:  (net_cleanup_work){+.+.+.}, at: [<ffffffff81068be1>] process_one_work+0x121/0x4b0
 #2:  (net_mutex){+.+.+.}, at: [<ffffffff8141c4a0>] cleanup_net+0x80/0x1b0

stack backtrace:
Pid: 10833, comm: kworker/u:1 Not tainted 2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
Call Trace:
 [<ffffffff810835b0>] ? print_usage_bug+0x170/0x180
 [<ffffffff8108393f>] ? mark_lock+0x37f/0x400
 [<ffffffff81084150>] ? __lock_acquire+0x790/0x1d30
 [<ffffffff81083d8f>] ? __lock_acquire+0x3cf/0x1d30
 [<ffffffff81124acf>] ? check_object+0xaf/0x270
 [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
 [<ffffffff81085cff>] ? lock_acquire+0x9f/0x120
 [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
 [<ffffffff81410f49>] ? __sk_free+0xd9/0x160
 [<ffffffff814deb6c>] ? _raw_spin_lock+0x2c/0x40
 [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
 [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
 [<ffffffff81460fd6>] ? inet_twsk_purge+0xf6/0x180
 [<ffffffff81460f10>] ? inet_twsk_purge+0x30/0x180
 [<ffffffff814760fc>] ? tcp_sk_exit_batch+0x1c/0x20
 [<ffffffff8141c1d3>] ? ops_exit_list.clone.0+0x53/0x60
 [<ffffffff8141c520>] ? cleanup_net+0x100/0x1b0
 [<ffffffff81068c47>] ? process_one_work+0x187/0x4b0
 [<ffffffff81068be1>] ? process_one_work+0x121/0x4b0
 [<ffffffff8141c420>] ? cleanup_net+0x0/0x1b0
 [<ffffffff8106a65c>] ? worker_thread+0x15c/0x330
 [<ffffffff8106a500>] ? worker_thread+0x0/0x330
 [<ffffffff8106f226>] ? kthread+0xb6/0xc0
 [<ffffffff8108678d>] ? trace_hardirqs_on_caller+0x13d/0x180
 [<ffffffff81003c24>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff814df854>] ? restore_args+0x0/0x30
 [<ffffffff8106f170>] ? kthread+0x0/0xc0
 [<ffffffff81003c20>] ? kernel_thread_helper+0x0/0x10

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
  2011-02-18 18:08                       ` BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4 Eric W. Biederman
@ 2011-02-18 18:48                         ` Linus Torvalds
  2011-02-18 19:01                           ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2011-02-18 18:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michal Hocko, Ingo Molnar, linux-mm, LKML, David Miller,
	Eric Dumazet, netdev, Arnaldo Carvalho de Melo

On Fri, Feb 18, 2011 at 10:08 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> I am still getting programs segfaulting but that is happening on other
> machines running on older kernels so I am going to chalk that up to a
> buggy test and a false positive.

Ok.

> I am have OOM problems getting my tests run to complete.  On a good
> day that happens about 1 time in 3 right now.  I'm guess I will have
> to turn off DEBUG_PAGEALLOC to get everything to complete.
> DEBUG_PAGEALLOC causes us to use more memory doesn't it?

It does use a bit more memory, but it shouldn't be _that_ noticeable.
The real cost of DEBUG_PAGEALLOC is all the crazy page table
operations and TLB flushes we do for each allocation/deallocation. So
DEBUG_PAGEALLOC is very CPU-intensive, but it shouldn't have _that_
much of a memory overhead - just some trivial overhead due to not
being able to use largepages for the normal kernel identity mappings.

But there might be some other interaction with OOM that I haven't thought about.

> The most interesting thing I have right now is a networking lockdep
> issue.  Does anyone know what is going on there?

This seems to be a fairly straightforward bug.

In net/ipv4/inet_timewait_sock.c we have this:

  /* These are always called from BH context.  See callers in
   * tcp_input.c to verify this.
   */

  /* This is for handling early-kills of TIME_WAIT sockets. */
  void inet_twsk_deschedule(struct inet_timewait_sock *tw,
                            struct inet_timewait_death_row *twdr)
  {
          spin_lock(&twdr->death_lock);
          ..

and the intention is clearly that that spin_lock is BH-safe because
it's called from BH context.

Except that clearly isn't true. It's called from a worker thread:

> stack backtrace:
> Pid: 10833, comm: kworker/u:1 Not tainted 2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
> Call Trace:
>  [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
>  [<ffffffff81460fd6>] ? inet_twsk_purge+0xf6/0x180
>  [<ffffffff81460f10>] ? inet_twsk_purge+0x30/0x180
>  [<ffffffff814760fc>] ? tcp_sk_exit_batch+0x1c/0x20
>  [<ffffffff8141c1d3>] ? ops_exit_list.clone.0+0x53/0x60
>  [<ffffffff8141c520>] ? cleanup_net+0x100/0x1b0
>  [<ffffffff81068c47>] ? process_one_work+0x187/0x4b0
>  [<ffffffff81068be1>] ? process_one_work+0x121/0x4b0
>  [<ffffffff8141c420>] ? cleanup_net+0x0/0x1b0
>  [<ffffffff8106a65c>] ? worker_thread+0x15c/0x330

so it can deadlock with a BH happening at the same time, afaik.

The code (and comment) is all from 2005, it looks like the BH->worker
thread has broken the code. But somebody who knows that code better
should take a deeper look at it.

Added acme to the cc, since the code is attributed to him back in 2005
;). Although I don't know how active he's been in networking lately
(seems to be all perf-related). Whatever, it can't hurt.

                   Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
  2011-02-18 18:48                         ` Linus Torvalds
@ 2011-02-18 19:01                           ` Arnaldo Carvalho de Melo
  2011-02-18 19:11                             ` Arnaldo Carvalho de Melo
  2011-02-18 19:13                             ` BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4 Eric Dumazet
  0 siblings, 2 replies; 12+ messages in thread
From: Arnaldo Carvalho de Melo @ 2011-02-18 19:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Michal Hocko, Ingo Molnar, linux-mm, LKML,
	David Miller, Eric Dumazet, netdev

Em Fri, Feb 18, 2011 at 10:48:18AM -0800, Linus Torvalds escreveu:
> On Fri, Feb 18, 2011 at 10:08 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> > I am still getting programs segfaulting but that is happening on other
> > machines running on older kernels so I am going to chalk that up to a
> > buggy test and a false positive.
> 
> Ok.
> 
> > I am have OOM problems getting my tests run to complete.  On a good
> > day that happens about 1 time in 3 right now.  I'm guess I will have
> > to turn off DEBUG_PAGEALLOC to get everything to complete.
> > DEBUG_PAGEALLOC causes us to use more memory doesn't it?
> 
> It does use a bit more memory, but it shouldn't be _that_ noticeable.
> The real cost of DEBUG_PAGEALLOC is all the crazy page table
> operations and TLB flushes we do for each allocation/deallocation. So
> DEBUG_PAGEALLOC is very CPU-intensive, but it shouldn't have _that_
> much of a memory overhead - just some trivial overhead due to not
> being able to use largepages for the normal kernel identity mappings.
> 
> But there might be some other interaction with OOM that I haven't thought about.
> 
> > The most interesting thing I have right now is a networking lockdep
> > issue.  Does anyone know what is going on there?
> 
> This seems to be a fairly straightforward bug.
> 
> In net/ipv4/inet_timewait_sock.c we have this:
> 
>   /* These are always called from BH context.  See callers in
>    * tcp_input.c to verify this.
>    */
> 
>   /* This is for handling early-kills of TIME_WAIT sockets. */
>   void inet_twsk_deschedule(struct inet_timewait_sock *tw,
>                             struct inet_timewait_death_row *twdr)
>   {
>           spin_lock(&twdr->death_lock);
>           ..
> 
> and the intention is clearly that that spin_lock is BH-safe because
> it's called from BH context.
> 
> Except that clearly isn't true. It's called from a worker thread:
> 
> > stack backtrace:
> > Pid: 10833, comm: kworker/u:1 Not tainted 2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
> > Call Trace:
> >  [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
> >  [<ffffffff81460fd6>] ? inet_twsk_purge+0xf6/0x180
> >  [<ffffffff81460f10>] ? inet_twsk_purge+0x30/0x180
> >  [<ffffffff814760fc>] ? tcp_sk_exit_batch+0x1c/0x20
> >  [<ffffffff8141c1d3>] ? ops_exit_list.clone.0+0x53/0x60
> >  [<ffffffff8141c520>] ? cleanup_net+0x100/0x1b0
> >  [<ffffffff81068c47>] ? process_one_work+0x187/0x4b0
> >  [<ffffffff81068be1>] ? process_one_work+0x121/0x4b0
> >  [<ffffffff8141c420>] ? cleanup_net+0x0/0x1b0
> >  [<ffffffff8106a65c>] ? worker_thread+0x15c/0x330
> 
> so it can deadlock with a BH happening at the same time, afaik.
> 
> The code (and comment) is all from 2005, it looks like the BH->worker
> thread has broken the code. But somebody who knows that code better
> should take a deeper look at it.
> 
> Added acme to the cc, since the code is attributed to him back in 2005
> ;). Although I don't know how active he's been in networking lately
> (seems to be all perf-related). Whatever, it can't hurt.

Original code is ANK's, I just made it possible to use with DCCP, and
yeah, the smiley is appropriate, something 6 years old and the world
around it changing continually... well, thanks for the git blame ;-)

- Arnaldo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
  2011-02-18 19:01                           ` Arnaldo Carvalho de Melo
@ 2011-02-18 19:11                             ` Arnaldo Carvalho de Melo
  2011-02-18 20:38                               ` Eric W. Biederman
  2011-02-18 19:13                             ` BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4 Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Arnaldo Carvalho de Melo @ 2011-02-18 19:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Michal Hocko, Ingo Molnar, linux-mm, LKML,
	David Miller, Eric Dumazet, netdev, Pavel Emelyanov

Em Fri, Feb 18, 2011 at 05:01:28PM -0200, Arnaldo Carvalho de Melo escreveu:
> Em Fri, Feb 18, 2011 at 10:48:18AM -0800, Linus Torvalds escreveu:
> > This seems to be a fairly straightforward bug.
> > 
> > In net/ipv4/inet_timewait_sock.c we have this:
> > 
> >   /* These are always called from BH context.  See callers in
> >    * tcp_input.c to verify this.
> >    */
> > 
> >   /* This is for handling early-kills of TIME_WAIT sockets. */
> >   void inet_twsk_deschedule(struct inet_timewait_sock *tw,
> >                             struct inet_timewait_death_row *twdr)
> >   {
> >           spin_lock(&twdr->death_lock);
> >           ..
> > 
> > and the intention is clearly that that spin_lock is BH-safe because
> > it's called from BH context.
> > 
> > Except that clearly isn't true. It's called from a worker thread:
> > 
> > > stack backtrace:
> > > Pid: 10833, comm: kworker/u:1 Not tainted 2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
> > > Call Trace:
> > >  [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
> > >  [<ffffffff81460fd6>] ? inet_twsk_purge+0xf6/0x180
> > >  [<ffffffff81460f10>] ? inet_twsk_purge+0x30/0x180
> > >  [<ffffffff814760fc>] ? tcp_sk_exit_batch+0x1c/0x20
> > >  [<ffffffff8141c1d3>] ? ops_exit_list.clone.0+0x53/0x60
> > >  [<ffffffff8141c520>] ? cleanup_net+0x100/0x1b0
> > >  [<ffffffff81068c47>] ? process_one_work+0x187/0x4b0
> > >  [<ffffffff81068be1>] ? process_one_work+0x121/0x4b0
> > >  [<ffffffff8141c420>] ? cleanup_net+0x0/0x1b0
> > >  [<ffffffff8106a65c>] ? worker_thread+0x15c/0x330
> > 
> > so it can deadlock with a BH happening at the same time, afaik.
> > 
> > The code (and comment) is all from 2005, it looks like the BH->worker
> > thread has broken the code. But somebody who knows that code better
> > should take a deeper look at it.
> > 
> > Added acme to the cc, since the code is attributed to him back in 2005
> > ;). Although I don't know how active he's been in networking lately
> > (seems to be all perf-related). Whatever, it can't hurt.
> 
> Original code is ANK's, I just made it possible to use with DCCP, and
> yeah, the smiley is appropriate, something 6 years old and the world
> around it changing continually... well, thanks for the git blame ;-)

But yeah, your analisys seems correct, with the bug being introduced by
one of these world around it changing continually issues, networking
namespaces broke the rules of the game on its cleanup_net() routine,
adding Pavel to the CC list since it doesn't hurt ;-)

- Arnaldo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
  2011-02-18 19:11                             ` Arnaldo Carvalho de Melo
@ 2011-02-18 20:38                               ` Eric W. Biederman
  2011-02-19  8:35                                 ` [PATCH] tcp: fix inet_twsk_deschedule() Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2011-02-18 20:38 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Linus Torvalds, Michal Hocko, Ingo Molnar, linux-mm, LKML,
	David Miller, Eric Dumazet, netdev, Pavel Emelyanov,
	Daniel Lezcano

Arnaldo Carvalho de Melo <acme@redhat.com> writes:

> Em Fri, Feb 18, 2011 at 05:01:28PM -0200, Arnaldo Carvalho de Melo escreveu:
>> Em Fri, Feb 18, 2011 at 10:48:18AM -0800, Linus Torvalds escreveu:
>> > This seems to be a fairly straightforward bug.
>> > 
>> > In net/ipv4/inet_timewait_sock.c we have this:
>> > 
>> >   /* These are always called from BH context.  See callers in
>> >    * tcp_input.c to verify this.
>> >    */
>> > 
>> >   /* This is for handling early-kills of TIME_WAIT sockets. */
>> >   void inet_twsk_deschedule(struct inet_timewait_sock *tw,
>> >                             struct inet_timewait_death_row *twdr)
>> >   {
>> >           spin_lock(&twdr->death_lock);
>> >           ..
>> > 
>> > and the intention is clearly that that spin_lock is BH-safe because
>> > it's called from BH context.
>> > 
>> > Except that clearly isn't true. It's called from a worker thread:
>> > 
>> > > stack backtrace:
>> > > Pid: 10833, comm: kworker/u:1 Not tainted 2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
>> > > Call Trace:
>> > >  [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
>> > >  [<ffffffff81460fd6>] ? inet_twsk_purge+0xf6/0x180
>> > >  [<ffffffff81460f10>] ? inet_twsk_purge+0x30/0x180
>> > >  [<ffffffff814760fc>] ? tcp_sk_exit_batch+0x1c/0x20
>> > >  [<ffffffff8141c1d3>] ? ops_exit_list.clone.0+0x53/0x60
>> > >  [<ffffffff8141c520>] ? cleanup_net+0x100/0x1b0
>> > >  [<ffffffff81068c47>] ? process_one_work+0x187/0x4b0
>> > >  [<ffffffff81068be1>] ? process_one_work+0x121/0x4b0
>> > >  [<ffffffff8141c420>] ? cleanup_net+0x0/0x1b0
>> > >  [<ffffffff8106a65c>] ? worker_thread+0x15c/0x330
>> > 
>> > so it can deadlock with a BH happening at the same time, afaik.
>> > 
>> > The code (and comment) is all from 2005, it looks like the BH->worker
>> > thread has broken the code. But somebody who knows that code better
>> > should take a deeper look at it.
>> > 
>> > Added acme to the cc, since the code is attributed to him back in 2005
>> > ;). Although I don't know how active he's been in networking lately
>> > (seems to be all perf-related). Whatever, it can't hurt.
>> 
>> Original code is ANK's, I just made it possible to use with DCCP, and
>> yeah, the smiley is appropriate, something 6 years old and the world
>> around it changing continually... well, thanks for the git blame ;-)
>
> But yeah, your analisys seems correct, with the bug being introduced by
> one of these world around it changing continually issues, networking
> namespaces broke the rules of the game on its cleanup_net() routine,
> adding Pavel to the CC list since it doesn't hurt ;-)

Which probably gets the bug back around to me.

I guess this must be one of those ipv4 cases that where the cleanup
simply did not exist in the rmmod sense that we had to invent.

I think that was Daniel who did the time wait sockets.  I do remember
they were a real pain.

Would a bh_disable be sufficient?  I guess I should stop remembering and
look at the code now.

Eric

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] tcp: fix inet_twsk_deschedule()
  2011-02-18 20:38                               ` Eric W. Biederman
@ 2011-02-19  8:35                                 ` Eric Dumazet
  2011-02-20  2:59                                   ` David Miller
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2011-02-19  8:35 UTC (permalink / raw)
  To: Eric W. Biederman, David Miller
  Cc: Arnaldo Carvalho de Melo, Linus Torvalds, Michal Hocko,
	Ingo Molnar, linux-mm, LKML, netdev, Pavel Emelyanov,
	Daniel Lezcano

Le vendredi 18 février 2011 à 12:38 -0800, Eric W. Biederman a écrit :
> Arnaldo Carvalho de Melo <acme@redhat.com> writes:
> 
> > Em Fri, Feb 18, 2011 at 05:01:28PM -0200, Arnaldo Carvalho de Melo escreveu:
> >> Em Fri, Feb 18, 2011 at 10:48:18AM -0800, Linus Torvalds escreveu:
> >> > This seems to be a fairly straightforward bug.
> >> > 
> >> > In net/ipv4/inet_timewait_sock.c we have this:
> >> > 
> >> >   /* These are always called from BH context.  See callers in
> >> >    * tcp_input.c to verify this.
> >> >    */
> >> > 
> >> >   /* This is for handling early-kills of TIME_WAIT sockets. */
> >> >   void inet_twsk_deschedule(struct inet_timewait_sock *tw,
> >> >                             struct inet_timewait_death_row *twdr)
> >> >   {
> >> >           spin_lock(&twdr->death_lock);
> >> >           ..
> >> > 
> >> > and the intention is clearly that that spin_lock is BH-safe because
> >> > it's called from BH context.
> >> > 
> >> > Except that clearly isn't true. It's called from a worker thread:
> >> > 
> >> > > stack backtrace:
> >> > > Pid: 10833, comm: kworker/u:1 Not tainted 2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
> >> > > Call Trace:
> >> > >  [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
> >> > >  [<ffffffff81460fd6>] ? inet_twsk_purge+0xf6/0x180
> >> > >  [<ffffffff81460f10>] ? inet_twsk_purge+0x30/0x180
> >> > >  [<ffffffff814760fc>] ? tcp_sk_exit_batch+0x1c/0x20
> >> > >  [<ffffffff8141c1d3>] ? ops_exit_list.clone.0+0x53/0x60
> >> > >  [<ffffffff8141c520>] ? cleanup_net+0x100/0x1b0
> >> > >  [<ffffffff81068c47>] ? process_one_work+0x187/0x4b0
> >> > >  [<ffffffff81068be1>] ? process_one_work+0x121/0x4b0
> >> > >  [<ffffffff8141c420>] ? cleanup_net+0x0/0x1b0
> >> > >  [<ffffffff8106a65c>] ? worker_thread+0x15c/0x330
> >> > 
> >> > so it can deadlock with a BH happening at the same time, afaik.
> >> > 
> >> > The code (and comment) is all from 2005, it looks like the BH->worker
> >> > thread has broken the code. But somebody who knows that code better
> >> > should take a deeper look at it.
> >> > 
> >> > Added acme to the cc, since the code is attributed to him back in 2005
> >> > ;). Although I don't know how active he's been in networking lately
> >> > (seems to be all perf-related). Whatever, it can't hurt.
> >> 
> >> Original code is ANK's, I just made it possible to use with DCCP, and
> >> yeah, the smiley is appropriate, something 6 years old and the world
> >> around it changing continually... well, thanks for the git blame ;-)
> >
> > But yeah, your analisys seems correct, with the bug being introduced by
> > one of these world around it changing continually issues, networking
> > namespaces broke the rules of the game on its cleanup_net() routine,
> > adding Pavel to the CC list since it doesn't hurt ;-)
> 
> Which probably gets the bug back around to me.
> 
> I guess this must be one of those ipv4 cases that where the cleanup
> simply did not exist in the rmmod sense that we had to invent.
> 
> I think that was Daniel who did the time wait sockets.  I do remember
> they were a real pain.
> 
> Would a bh_disable be sufficient?  I guess I should stop remembering and
> look at the code now.
> 

Here is the patch to fix the problem

Daniel commit (d315492b1a6ba29d (netns : fix kernel panic in timewait
socket destruction) was OK (it did use local_bh_disable())

Problem comes from commit 575f4cd5a5b6394577
(net: Use rcu lookups in inet_twsk_purge.) added in 2.6.33

Thanks !

[PATCH] tcp: fix inet_twsk_deschedule()

Eric W. Biederman reported a lockdep splat in inet_twsk_deschedule()

This is caused by inet_twsk_purge(), run from process context,
and commit 575f4cd5a5b6394577 (net: Use rcu lookups in inet_twsk_purge.)
removed the BH disabling that was necessary.

Add the BH disabling but fine grained, right before calling
inet_twsk_deschedule(), instead of whole function.

With help from Linus Torvalds and Eric W. Biederman

Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Daniel Lezcano <daniel.lezcano@free.fr>
CC: Pavel Emelyanov <xemul@openvz.org>
CC: Arnaldo Carvalho de Melo <acme@redhat.com>
CC: stable <stable@kernel.org> (# 2.6.33+)
---
 net/ipv4/inet_timewait_sock.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index c5af909..3c8dfa1 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -505,7 +505,9 @@ restart:
 			}
 
 			rcu_read_unlock();
+			local_bh_disable();
 			inet_twsk_deschedule(tw, twdr);
+			local_bh_enable();
 			inet_twsk_put(tw);
 			goto restart_rcu;
 		}


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] tcp: fix inet_twsk_deschedule()
  2011-02-19  8:35                                 ` [PATCH] tcp: fix inet_twsk_deschedule() Eric Dumazet
@ 2011-02-20  2:59                                   ` David Miller
  0 siblings, 0 replies; 12+ messages in thread
From: David Miller @ 2011-02-20  2:59 UTC (permalink / raw)
  To: eric.dumazet
  Cc: ebiederm, acme, torvalds, mhocko, mingo, linux-mm, linux-kernel,
	netdev, xemul, daniel.lezcano

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 19 Feb 2011 09:35:56 +0100

> [PATCH] tcp: fix inet_twsk_deschedule()
> 
> Eric W. Biederman reported a lockdep splat in inet_twsk_deschedule()
> 
> This is caused by inet_twsk_purge(), run from process context,
> and commit 575f4cd5a5b6394577 (net: Use rcu lookups in inet_twsk_purge.)
> removed the BH disabling that was necessary.
> 
> Add the BH disabling but fine grained, right before calling
> inet_twsk_deschedule(), instead of whole function.
> 
> With help from Linus Torvalds and Eric W. Biederman
> 
> Reported-by: Eric W. Biederman <ebiederm@xmission.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Daniel Lezcano <daniel.lezcano@free.fr>
> CC: Pavel Emelyanov <xemul@openvz.org>
> CC: Arnaldo Carvalho de Melo <acme@redhat.com>
> CC: stable <stable@kernel.org> (# 2.6.33+)

Applied, thanks Eric.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
  2011-02-18 19:01                           ` Arnaldo Carvalho de Melo
  2011-02-18 19:11                             ` Arnaldo Carvalho de Melo
@ 2011-02-18 19:13                             ` Eric Dumazet
  1 sibling, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2011-02-18 19:13 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Linus Torvalds, Eric W. Biederman, Michal Hocko, Ingo Molnar,
	linux-mm, LKML, David Miller, netdev

Le vendredi 18 février 2011 à 17:01 -0200, Arnaldo Carvalho de Melo a
écrit :

> Original code is ANK's, I just made it possible to use with DCCP, and
> yeah, the smiley is appropriate, something 6 years old and the world
> around it changing continually... well, thanks for the git blame ;-)
> 

At that time, net namespaces did not exist.

I would blame people implementing them ;)




^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2011-02-20  2:59 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20110216185234.GA11636@tiehlicka.suse.cz>
     [not found] ` <20110216193700.GA6377@elte.hu>
     [not found]   ` <AANLkTinDxxbVjrUViCs=UaMD9Wg9PR7b0ShNud5zKE3w@mail.gmail.com>
     [not found]     ` <AANLkTi=xnbcs5BKj3cNE_aLtBO7W5m+2uaUacu7M8g_S@mail.gmail.com>
     [not found]       ` <20110217090910.GA3781@tiehlicka.suse.cz>
     [not found]         ` <AANLkTikPKpNHxDQAYBd3fiQsmVozLtCVDsNn=+eF_q2r@mail.gmail.com>
     [not found]           ` <20110217163531.GF14168@elte.hu>
     [not found]             ` <m1pqqqfpzh.fsf@fess.ebiederm.org>
     [not found]               ` <AANLkTinB=EgDGNv-v-qD-MvHVAmstfP_CyyLNhhotkZx@mail.gmail.com>
     [not found]                 ` <m1sjvm822m.fsf@fess.ebiederm.org>
     [not found]                   ` <AANLkTimzP0UNRXutkt1zJ+OGhmeg6ga87HFyMuZQmpMj@mail.gmail.com>
     [not found]                     ` <20110217.203647.193696765.davem@davemloft.net>
     [not found]                       ` <1298010320.2642.7.camel@edumazet-laptop>
     [not found]                         ` <1298014191.2642.11.camel@edumazet-laptop>
2011-02-18  8:54                           ` [PATCH 1/2] net: dont leave active on stack LIST_HEAD Eric Dumazet
2011-02-18 20:14                             ` David Miller
     [not found]                     ` <m17hcx7wca.fsf@fess.ebiederm.org>
2011-02-18  8:59                       ` [PATCH 2/2] net: deinit automatic LIST_HEAD Eric Dumazet
2011-02-18 20:14                         ` David Miller
     [not found]                 ` <20110218122938.GB26779@tiehlicka.suse.cz>
     [not found]                   ` <20110218162623.GD4862@tiehlicka.suse.cz>
     [not found]                     ` <AANLkTimO=M5xG_mnDBSxPKwSOTrp3JhHVBa8=wHsiVHY@mail.gmail.com>
2011-02-18 18:08                       ` BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4 Eric W. Biederman
2011-02-18 18:48                         ` Linus Torvalds
2011-02-18 19:01                           ` Arnaldo Carvalho de Melo
2011-02-18 19:11                             ` Arnaldo Carvalho de Melo
2011-02-18 20:38                               ` Eric W. Biederman
2011-02-19  8:35                                 ` [PATCH] tcp: fix inet_twsk_deschedule() Eric Dumazet
2011-02-20  2:59                                   ` David Miller
2011-02-18 19:13                             ` BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4 Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox