Re: tc related lockdep warning.

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Patrick McHardy <kaber@trash.net>
To: Dave Jones <davej@redhat.com>
Cc: hadi@cyberus.ca, Jarek Poplawski <jarkao2@o2.pl>,
	netdev@vger.kernel.org, davem@davemloft.net
Subject: Re: tc related lockdep warning.
Date: Wed, 27 Sep 2006 14:07:04 +0200	[thread overview]
Message-ID: <451A6968.2090607@trash.net> (raw)
In-Reply-To: <20060926212034.GA3134@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 655 bytes --]

Dave Jones wrote:
> With this patch, I get no lockdep warnings, but the machine locks up completely.
> I hooked up a serial console, and found this..
> 
> 
> u32 classifier
>     Performance counters on
>     input device check on 
>     Actions configured 
> BUG: warning at net/sched/sch_htb.c:395/htb_safe_rb_erase()
> 
> Call Trace:
>  [<ffffffff8026f79b>] show_trace+0xae/0x336
>  [<ffffffff8026fa38>] dump_stack+0x15/0x17
>  [<ffffffff8860a171>] :sch_htb:htb_safe_rb_erase+0x3b/0x55

I found the reason for this, it was an unrelated bug. I've attached
the latest version of the locking fixes and the fix for the HTB bug.
Can you please try again?



[-- Attachment #2: 01.diff --]
[-- Type: text/plain, Size: 1252 bytes --]

[NET_SCHED]: HTB: fix incorrect use of RB_EMPTY_NODE

Fix incorrect use of RB_EMPTY_NODE in htb_safe_rb_erase, which makes it
skip nodes within the rbtree instead of nodes not in the tree, resulting
in crashes later on.

The root cause for this seems to be the very counter-intuitive behaviour
of the RB_EMPTY_NODE macro, which returns _false_ when the node is empty.

Signed-off-by: Patrick McHardy <kaber@trash.net>

---
commit 9a0cd6d60280d88c38791844c87548d45cf6f2c2
tree fdf4f4a46fb088d957322006828af557b8ce594a
parent 7e4720201ad44ace85a443f41d668a62a737e7d0
author Patrick McHardy <kaber@trash.net> Wed, 27 Sep 2006 13:27:24 +0200
committer Patrick McHardy <kaber@trash.net> Wed, 27 Sep 2006 13:27:24 +0200

 net/sched/sch_htb.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index bb3ddd4..6c058e3 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -391,7 +391,7 @@ static inline void htb_add_class_to_row(
 /* If this triggers, it is a bug in this code, but it need not be fatal */
 static void htb_safe_rb_erase(struct rb_node *rb, struct rb_root *root)
 {
-	if (RB_EMPTY_NODE(rb)) {
+	if (!RB_EMPTY_NODE(rb)) {
 		WARN_ON(1);
 	} else {
 		rb_erase(rb, root);


[-- Attachment #3: 02.diff --]
[-- Type: text/plain, Size: 8540 bytes --]

[NET_SCHED]: Fix fallout from dev->qdisc RCU change

The move of qdisc destruction to a rcu callback broke locking in the
entire qdisc layer by invalidating previously valid assumptions about
the context in which changes to the qdisc tree occur.

The two assumptions were:

- since changes only happen in process context, read_lock doesn't need
  bottem half protection. Now invalid since destruction of inner qdiscs,
  classifiers, actions and estimators happens in the RCU callback unless
  they're manually deleted, resulting in dead-locks when read_lock in
  process context is interrupted by write_lock_bh in bottem half context.

- since changes only happen under the RTNL, no additional locking is
  necessary for data not used during packet processing (f.e. u32_list).
  Again, since destruction now happens in the RCU callback, this assumption
  is not valid anymore, causing races while using this data, which can
  result in corruption or use-after-free.

Instead of "fixing" this by disabling bottem halfs everywhere and adding
new locks/refcounting, this patch makes these assumptions valid again by
moving destruction back to process context. Since only the dev->qdisc
pointer is protected by RCU, but ->enqueue and the qdisc tree are still
protected by dev->qdisc_lock, destruction of the tree can be performed
immediately and only the final free needs to happen in the rcu callback
to make sure dev_queue_xmit doesn't access already freed memory.

Signed-off-by: Patrick McHardy <kaber@trash.net>

---
commit fe5b95bfcde98ca2e32b4274c93889cdd1fbc040
tree 951a0d83d91b15dbe55a41366e0fe01966fec7ed
parent 9a0cd6d60280d88c38791844c87548d45cf6f2c2
author Patrick McHardy <kaber@trash.net> Wed, 27 Sep 2006 13:57:18 +0200
committer Patrick McHardy <kaber@trash.net> Wed, 27 Sep 2006 13:57:18 +0200

 net/core/dev.c          |   14 ++++++----
 net/sched/cls_api.c     |    4 +--
 net/sched/sch_api.c     |   16 ++++++-----
 net/sched/sch_generic.c |   66 +++++++++++++++--------------------------------
 4 files changed, 39 insertions(+), 61 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 14de297..4d891be 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1480,14 +1480,16 @@ #endif
 	if (q->enqueue) {
 		/* Grab device queue */
 		spin_lock(&dev->queue_lock);
+		q = dev->qdisc;
+		if (q->enqueue) {
+			rc = q->enqueue(skb, q);
+			qdisc_run(dev);
+			spin_unlock(&dev->queue_lock);
 
-		rc = q->enqueue(skb, q);
-
-		qdisc_run(dev);
-
+			rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
+			goto out;
+		}
 		spin_unlock(&dev->queue_lock);
-		rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
-		goto out;
 	}
 
 	/* The device has no queue. Common case for software devices:
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 7e14f14..37a1840 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -401,7 +401,7 @@ static int tc_dump_tfilter(struct sk_buf
 	if ((dev = dev_get_by_index(tcm->tcm_ifindex)) == NULL)
 		return skb->len;
 
-	read_lock_bh(&qdisc_tree_lock);
+	read_lock(&qdisc_tree_lock);
 	if (!tcm->tcm_parent)
 		q = dev->qdisc_sleeping;
 	else
@@ -458,7 +458,7 @@ errout:
 	if (cl)
 		cops->put(q, cl);
 out:
-	read_unlock_bh(&qdisc_tree_lock);
+	read_unlock(&qdisc_tree_lock);
 	dev_put(dev);
 	return skb->len;
 }
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index a19eff1..0b64892 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -195,14 +195,14 @@ struct Qdisc *qdisc_lookup(struct net_de
 {
 	struct Qdisc *q;
 
-	read_lock_bh(&qdisc_tree_lock);
+	read_lock(&qdisc_tree_lock);
 	list_for_each_entry(q, &dev->qdisc_list, list) {
 		if (q->handle == handle) {
-			read_unlock_bh(&qdisc_tree_lock);
+			read_unlock(&qdisc_tree_lock);
 			return q;
 		}
 	}
-	read_unlock_bh(&qdisc_tree_lock);
+	read_unlock(&qdisc_tree_lock);
 	return NULL;
 }
 
@@ -837,7 +837,7 @@ static int tc_dump_qdisc(struct sk_buff 
 			continue;
 		if (idx > s_idx)
 			s_q_idx = 0;
-		read_lock_bh(&qdisc_tree_lock);
+		read_lock(&qdisc_tree_lock);
 		q_idx = 0;
 		list_for_each_entry(q, &dev->qdisc_list, list) {
 			if (q_idx < s_q_idx) {
@@ -846,12 +846,12 @@ static int tc_dump_qdisc(struct sk_buff 
 			}
 			if (tc_fill_qdisc(skb, q, q->parent, NETLINK_CB(cb->skb).pid,
 					  cb->nlh->nlmsg_seq, NLM_F_MULTI, RTM_NEWQDISC) <= 0) {
-				read_unlock_bh(&qdisc_tree_lock);
+				read_unlock(&qdisc_tree_lock);
 				goto done;
 			}
 			q_idx++;
 		}
-		read_unlock_bh(&qdisc_tree_lock);
+		read_unlock(&qdisc_tree_lock);
 	}
 
 done:
@@ -1074,7 +1074,7 @@ static int tc_dump_tclass(struct sk_buff
 	s_t = cb->args[0];
 	t = 0;
 
-	read_lock_bh(&qdisc_tree_lock);
+	read_lock(&qdisc_tree_lock);
 	list_for_each_entry(q, &dev->qdisc_list, list) {
 		if (t < s_t || !q->ops->cl_ops ||
 		    (tcm->tcm_parent &&
@@ -1096,7 +1096,7 @@ static int tc_dump_tclass(struct sk_buff
 			break;
 		t++;
 	}
-	read_unlock_bh(&qdisc_tree_lock);
+	read_unlock(&qdisc_tree_lock);
 
 	cb->args[0] = t;
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 6f91518..88c6a99 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -45,11 +45,10 @@ #include <net/pkt_sched.h>
    The idea is the following:
    - enqueue, dequeue are serialized via top level device
      spinlock dev->queue_lock.
-   - tree walking is protected by read_lock_bh(qdisc_tree_lock)
+   - tree walking is protected by read_lock(qdisc_tree_lock)
      and this lock is used only in process context.
-   - updates to tree are made under rtnl semaphore or
-     from softirq context (__qdisc_destroy rcu-callback)
-     hence this lock needs local bh disabling.
+   - updates to tree are made only under rtnl semaphore,
+     hence this lock may be made without local bh disabling.
 
    qdisc_tree_lock must be grabbed BEFORE dev->queue_lock!
  */
@@ -57,14 +56,14 @@ DEFINE_RWLOCK(qdisc_tree_lock);
 
 void qdisc_lock_tree(struct net_device *dev)
 {
-	write_lock_bh(&qdisc_tree_lock);
+	write_lock(&qdisc_tree_lock);
 	spin_lock_bh(&dev->queue_lock);
 }
 
 void qdisc_unlock_tree(struct net_device *dev)
 {
 	spin_unlock_bh(&dev->queue_lock);
-	write_unlock_bh(&qdisc_tree_lock);
+	write_unlock(&qdisc_tree_lock);
 }
 
 /* 
@@ -483,20 +482,6 @@ void qdisc_reset(struct Qdisc *qdisc)
 static void __qdisc_destroy(struct rcu_head *head)
 {
 	struct Qdisc *qdisc = container_of(head, struct Qdisc, q_rcu);
-	struct Qdisc_ops  *ops = qdisc->ops;
-
-#ifdef CONFIG_NET_ESTIMATOR
-	gen_kill_estimator(&qdisc->bstats, &qdisc->rate_est);
-#endif
-	write_lock(&qdisc_tree_lock);
-	if (ops->reset)
-		ops->reset(qdisc);
-	if (ops->destroy)
-		ops->destroy(qdisc);
-	write_unlock(&qdisc_tree_lock);
-	module_put(ops->owner);
-
-	dev_put(qdisc->dev);
 	kfree((char *) qdisc - qdisc->padded);
 }
 
@@ -504,32 +489,23 @@ #endif
 
 void qdisc_destroy(struct Qdisc *qdisc)
 {
-	struct list_head cql = LIST_HEAD_INIT(cql);
-	struct Qdisc *cq, *q, *n;
+	struct Qdisc_ops  *ops = qdisc->ops;
 
 	if (qdisc->flags & TCQ_F_BUILTIN ||
-		!atomic_dec_and_test(&qdisc->refcnt))
+	    !atomic_dec_and_test(&qdisc->refcnt))
 		return;
 
-	if (!list_empty(&qdisc->list)) {
-		if (qdisc->ops->cl_ops == NULL)
-			list_del(&qdisc->list);
-		else
-			list_move(&qdisc->list, &cql);
-	}
-
-	/* unlink inner qdiscs from dev->qdisc_list immediately */
-	list_for_each_entry(cq, &cql, list)
-		list_for_each_entry_safe(q, n, &qdisc->dev->qdisc_list, list)
-			if (TC_H_MAJ(q->parent) == TC_H_MAJ(cq->handle)) {
-				if (q->ops->cl_ops == NULL)
-					list_del_init(&q->list);
-				else
-					list_move_tail(&q->list, &cql);
-			}
-	list_for_each_entry_safe(cq, n, &cql, list)
-		list_del_init(&cq->list);
+	list_del(&qdisc->list);
+#ifdef CONFIG_NET_ESTIMATOR
+	gen_kill_estimator(&qdisc->bstats, &qdisc->rate_est);
+#endif
+	if (ops->reset)
+		ops->reset(qdisc);
+	if (ops->destroy)
+		ops->destroy(qdisc);
 
+	module_put(ops->owner);
+	dev_put(qdisc->dev);
 	call_rcu(&qdisc->q_rcu, __qdisc_destroy);
 }
 
@@ -549,15 +525,15 @@ void dev_activate(struct net_device *dev
 				printk(KERN_INFO "%s: activation failed\n", dev->name);
 				return;
 			}
-			write_lock_bh(&qdisc_tree_lock);
+			write_lock(&qdisc_tree_lock);
 			list_add_tail(&qdisc->list, &dev->qdisc_list);
-			write_unlock_bh(&qdisc_tree_lock);
+			write_unlock(&qdisc_tree_lock);
 		} else {
 			qdisc =  &noqueue_qdisc;
 		}
-		write_lock_bh(&qdisc_tree_lock);
+		write_lock(&qdisc_tree_lock);
 		dev->qdisc_sleeping = qdisc;
-		write_unlock_bh(&qdisc_tree_lock);
+		write_unlock(&qdisc_tree_lock);
 	}
 
 	if (!netif_carrier_ok(dev))

next prev parent reply	other threads:[~2006-09-27 12:07 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-09-24 21:29 tc related lockdep warning Dave Jones
2006-09-25 12:43 ` Jarek Poplawski
2006-09-25 12:47   ` jamal
2006-09-25 13:05     ` Jarek Poplawski
2006-09-25 13:29     ` Patrick McHardy
2006-09-26 16:15       ` Patrick McHardy
2006-09-26 21:20         ` Dave Jones
2006-09-27  8:54           ` Jarek Poplawski
2006-09-27  9:57             ` Patrick McHardy
2006-09-28 12:17             ` Patrick McHardy
2006-09-28 13:13               ` Jarek Poplawski
2006-09-28 14:20                 ` Stephen Hemminger
2006-09-29  6:28                   ` Jarek Poplawski
2006-09-27 10:14           ` Patrick McHardy
2006-09-27 14:41             ` Ismail Donmez
2006-09-27 12:07           ` Patrick McHardy [this message]
2006-09-27 17:26             ` Ismail Donmez
2006-09-27 17:33             ` Dave Jones
2006-09-27 23:53             ` David Miller
2006-09-28  9:07               ` Jarek Poplawski
2006-09-28  8:17             ` Jarek Poplawski

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:bb3ddd4 dfblob:6c058e3 dfblob:14de297 dfblob:4d891be
dfblob:7e14f14 dfblob:37a1840 dfblob:a19eff1 dfblob:0b64892
dfblob:6f91518 dfblob:88c6a99 )
 OR (
bs:"Re: tc related lockdep warning." )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=451A6968.2090607@trash.net \
    --to=kaber@trash.net \
    --cc=davej@redhat.com \
    --cc=davem@davemloft.net \
    --cc=hadi@cyberus.ca \
    --cc=jarkao2@o2.pl \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).