public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Greg KH <gregkh@suse.de>
To: linux-kernel@vger.kernel.org, stable@kernel.org
Cc: Justin Forbes <jmforbes@linuxtx.org>,
	Zwane Mwaikambo <zwane@arm.linux.org.uk>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Randy Dunlap <rdunlap@xenotime.net>,
	Dave Jones <davej@redhat.com>,
	Chuck Wolber <chuckw@quantumlinux.com>,
	Chris Wedgwood <reviews@ml.cw.f00f.org>,
	Michael Krufky <mkrufky@linuxtv.org>,
	Chuck Ebbert <cebbert@redhat.com>,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	alan@lxorguk.ukuu.org.uk, bunk@stusta.de,
	"David S. Miller" <davem@davemloft.net>
Subject: [patch 12/37] IPV6: Fix ipv6 round-robin locking.
Date: Fri, 30 Mar 2007 14:04:35 -0700	[thread overview]
Message-ID: <20070330210435.GN29450@kroah.com> (raw)
In-Reply-To: <20070330210334.GA29450@kroah.com>

[-- Attachment #1: ipv6-fix-ipv6-round-robin-locking.patch --]
[-- Type: text/plain, Size: 6567 bytes --]

-stable review patch.  If anyone has any objections, please let us know.

------------------
From: David Miller <davem@davemloft.net>

[IPV6]: Fix routing round-robin locking.

As per RFC2461, section 6.3.6, item #2, when no routers on the
matching list are known to be reachable or probably reachable we
do round robin on those available routes so that we make sure
to probe as many of them as possible to detect when one becomes
reachable faster.

Each routing table has a rwlock protecting the tree and the linked
list of routes at each leaf.  The round robin code executes during
lookup and thus with the rwlock taken as a reader.  A small local
spinlock tries to provide protection but this does not work at all
for two reasons:

1) The round-robin list manipulation, as coded, goes like this (with
   read lock held):

	walk routes finding head and tail

	spin_lock();
	rotate list using head and tail
	spin_unlock();

   While one thread is rotating the list, another thread can
   end up with stale values of head and tail and then proceed
   to corrupt the list when it gets the lock.  This ends up causing
   the OOPS in fib6_add() later onthat many people have been hitting.

2) All the other code paths that run with the rwlock held as
   a reader do not expect the list to change on them, they
   expect it to remain completely fixed while they hold the
   lock in that way.

So, simply stated, it is impossible to implement this correctly using
a manipulation of the list without violating the rwlock locking
semantics.

Reimplement using a per-fib6_node round-robin pointer.  This way we
don't need to manipulate the list at all, and since the round-robin
pointer can only ever point to real existing entries we don't need
to perform any locking on the changing of the round-robin pointer
itself.  We only need to reset the round-robin pointer to NULL when
the entry it is pointing to is removed.

The idea is from Thomas Graf and it is very similar to how this
was implemented before the advanced router selection code when in.

Signed-off-by: David S. Miller <davem@davemloft.net>

---
 include/net/ip6_fib.h |    1 
 net/ipv6/ip6_fib.c    |    8 ++++
 net/ipv6/route.c      |   97 ++++++++++++++++++++++++++++++--------------------
 3 files changed, 68 insertions(+), 38 deletions(-)

--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -58,6 +58,7 @@ struct fib6_node
 	__u16			fn_bit;		/* bit key */
 	__u16			fn_flags;
 	__u32			fn_sernum;
+	struct rt6_info		*rr_ptr;
 };
 
 #ifndef CONFIG_IPV6_SUBTREES
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -659,6 +659,10 @@ static int fib6_add_rt2node(struct fib6_
 		ins = &iter->u.next;
 	}
 
+	/* Reset round-robin state, if necessary */
+	if (ins == &fn->leaf)
+		fn->rr_ptr = NULL;
+
 	/*
 	 *	insert node
 	 */
@@ -1110,6 +1114,10 @@ static void fib6_del_route(struct fib6_n
 	rt6_stats.fib_rt_entries--;
 	rt6_stats.fib_discarded_routes++;
 
+	/* Reset round-robin state, if necessary */
+	if (fn->rr_ptr == rt)
+		fn->rr_ptr = NULL;
+
 	/* Adjust walkers */
 	read_lock(&fib6_walker_lock);
 	FOR_WALKERS(w) {
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -354,55 +354,76 @@ static int rt6_score_route(struct rt6_in
 	return m;
 }
 
-static struct rt6_info *rt6_select(struct rt6_info **head, int oif,
-				   int strict)
+static struct rt6_info *find_match(struct rt6_info *rt, int oif, int strict,
+				   int *mpri, struct rt6_info *match)
 {
-	struct rt6_info *match = NULL, *last = NULL;
-	struct rt6_info *rt, *rt0 = *head;
-	u32 metric;
+	int m;
+
+	if (rt6_check_expired(rt))
+		goto out;
+
+	m = rt6_score_route(rt, oif, strict);
+	if (m < 0)
+		goto out;
+
+	if (m > *mpri) {
+		if (strict & RT6_LOOKUP_F_REACHABLE)
+			rt6_probe(match);
+		*mpri = m;
+		match = rt;
+	} else if (strict & RT6_LOOKUP_F_REACHABLE) {
+		rt6_probe(rt);
+	}
+
+out:
+	return match;
+}
+
+static struct rt6_info *find_rr_leaf(struct fib6_node *fn,
+				     struct rt6_info *rr_head,
+				     u32 metric, int oif, int strict)
+{
+	struct rt6_info *rt, *match;
 	int mpri = -1;
 
-	RT6_TRACE("%s(head=%p(*head=%p), oif=%d)\n",
-		  __FUNCTION__, head, head ? *head : NULL, oif);
+	match = NULL;
+	for (rt = rr_head; rt && rt->rt6i_metric == metric;
+	     rt = rt->u.next)
+		match = find_match(rt, oif, strict, &mpri, match);
+	for (rt = fn->leaf; rt && rt != rr_head && rt->rt6i_metric == metric;
+	     rt = rt->u.next)
+		match = find_match(rt, oif, strict, &mpri, match);
 
-	for (rt = rt0, metric = rt0->rt6i_metric;
-	     rt && rt->rt6i_metric == metric && (!last || rt != rt0);
-	     rt = rt->u.next) {
-		int m;
+	return match;
+}
 
-		if (rt6_check_expired(rt))
-			continue;
+static struct rt6_info *rt6_select(struct fib6_node *fn, int oif, int strict)
+{
+	struct rt6_info *match, *rt0;
 
-		last = rt;
+	RT6_TRACE("%s(fn->leaf=%p, oif=%d)\n",
+		  __FUNCTION__, fn->leaf, oif);
 
-		m = rt6_score_route(rt, oif, strict);
-		if (m < 0)
-			continue;
+	rt0 = fn->rr_ptr;
+	if (!rt0)
+		fn->rr_ptr = rt0 = fn->leaf;
 
-		if (m > mpri) {
-			if (strict & RT6_LOOKUP_F_REACHABLE)
-				rt6_probe(match);
-			match = rt;
-			mpri = m;
-		} else if (strict & RT6_LOOKUP_F_REACHABLE) {
-			rt6_probe(rt);
-		}
-	}
+	match = find_rr_leaf(fn, rt0, rt0->rt6i_metric, oif, strict);
 
 	if (!match &&
-	    (strict & RT6_LOOKUP_F_REACHABLE) &&
-	    last && last != rt0) {
+	    (strict & RT6_LOOKUP_F_REACHABLE)) {
+		struct rt6_info *next = rt0->u.next;
+
 		/* no entries matched; do round-robin */
-		static DEFINE_SPINLOCK(lock);
-		spin_lock(&lock);
-		*head = rt0->u.next;
-		rt0->u.next = last->u.next;
-		last->u.next = rt0;
-		spin_unlock(&lock);
+		if (!next || next->rt6i_metric != rt0->rt6i_metric)
+			next = fn->leaf;
+
+		if (next != rt0)
+			fn->rr_ptr = next;
 	}
 
-	RT6_TRACE("%s() => %p, score=%d\n",
-		  __FUNCTION__, match, mpri);
+	RT6_TRACE("%s() => %p\n",
+		  __FUNCTION__, match);
 
 	return (match ? match : &ip6_null_entry);
 }
@@ -648,7 +669,7 @@ restart_2:
 	fn = fib6_lookup(&table->tb6_root, &fl->fl6_dst, &fl->fl6_src);
 
 restart:
-	rt = rt6_select(&fn->leaf, fl->iif, strict | reachable);
+	rt = rt6_select(fn, fl->iif, strict | reachable);
 	BACKTRACK(&fl->fl6_src);
 	if (rt == &ip6_null_entry ||
 	    rt->rt6i_flags & RTF_CACHE)
@@ -743,7 +764,7 @@ restart_2:
 	fn = fib6_lookup(&table->tb6_root, &fl->fl6_dst, &fl->fl6_src);
 
 restart:
-	rt = rt6_select(&fn->leaf, fl->oif, strict | reachable);
+	rt = rt6_select(fn, fl->oif, strict | reachable);
 	BACKTRACK(&fl->fl6_src);
 	if (rt == &ip6_null_entry ||
 	    rt->rt6i_flags & RTF_CACHE)

-- 

  parent reply	other threads:[~2007-03-30 21:23 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20070330205938.984247529@mini.kroah.org>
2007-03-30 21:03 ` [patch 00/37] 2.6.20-stable review Greg KH
2007-03-30 21:03   ` [patch 01/37] ide: clear bmdma status in ide_intr() for ICHx controllers (revised #4) Greg KH
2007-03-30 21:03   ` [patch 02/37] ide: remove clearing bmdma status from cdrom_decode_status() (rev #4) Greg KH
2007-03-30 21:03   ` [patch 03/37] sata_nv: delay on switching between NCQ and non-NCQ commands Greg KH
2007-03-30 21:04   ` [patch 04/37] UML - fix epoll Greg KH
2007-03-30 21:04   ` [patch 05/37] UML - host VDSO fix Greg KH
2007-03-30 21:04   ` [patch 06/37] UML - Fix static linking Greg KH
2007-03-30 21:04   ` Greg KH
2007-03-31  1:21     ` [uml-devel] " Blaisorblade
2007-03-30 21:04   ` [patch 07/37] UML - use correct register file size everywhere Greg KH
2007-03-30 21:04   ` [patch 08/37] uml: fix unreasonably long udelay Greg KH
2007-03-30 21:04   ` [patch 09/37] ieee1394: dv1394: fix CardBus card ejection Greg KH
2007-03-30 21:04   ` [patch 10/37] NET: Fix packet classidier NULL pointer OOPS Greg KH
2007-03-30 21:04   ` [patch 11/37] NET_SCHED: Fix ingress qdisc locking Greg KH
2007-03-30 21:04   ` Greg KH [this message]
2007-03-30 21:04   ` [patch 13/37] PPP: Fix PPP skb leak Greg KH
2007-03-30 21:04   ` [patch 14/37] DCCP: Fix exploitable hole in DCCP socket options Greg KH
2007-03-30 21:04   ` [patch 15/37] VIDEO: Fix FFB DAC revision probing Greg KH
2007-03-30 21:04   ` [patch 16/37] NET: Fix sock_attach_fd() failure in sys_accept() Greg KH
2007-03-30 21:04   ` Greg KH
2007-03-30 21:04   ` [patch 17/37] SPARC: Fix sparc builds with gcc-4.2.x Greg KH
2007-03-30 21:05   ` [patch 18/37] Fix decnet endianness Greg KH
2007-03-30 21:05   ` [patch 19/37] NET: Fix FIB rules compatability Greg KH
2007-03-30 21:05   ` [patch 20/37] DVB: fix nxt200x rf input switching Greg KH
2007-03-30 21:05   ` [patch 21/37] V4L: radio: Fix error in Kbuild file Greg KH
2007-03-30 21:05   ` [patch 22/37] V4L: Fix SECAM handling on saa7115 Greg KH
2007-03-30 21:06   ` [patch 23/37] V4L: msp_attach must return 0 if no msp3400 was found Greg KH
2007-03-30 21:06   ` [patch 24/37] DVB: isl6421: dont reference freed memory Greg KH
2007-03-30 21:06   ` [patch 25/37] dvb-core: fix several locking related problems Greg KH
2007-03-30 21:06   ` [patch 26/37] V4L: saa7146: Fix allocation of clipping memory Greg KH
2007-03-30 21:06   ` [patch 27/37] jmicron: make ide jmicron driver play nice with libata ones Greg KH
2007-03-30 21:06   ` [patch 28/37] i2o: block IO errors on i2o disk Greg KH
2007-03-30 21:06   ` [patch 29/37] ide: revert "ide: fix drive side 80c cable check, take 2" for now Greg KH
2007-03-30 21:06   ` [patch 30/37] CIFS: Allow reset of file to ATTR_NORMAL when archive bit not set Greg KH
2007-03-30 21:06   ` [patch 31/37] CIFS: reset mode when client notices that ATTR_READONLY is no longer set Greg KH
2007-03-30 21:06   ` [patch 32/37] CRYPTO: api: scatterwalk_copychunks() fails to advance through scatterlist Greg KH
2007-03-31  1:41     ` Patrick McHardy
2007-03-31  2:14       ` Herbert Xu
2007-03-31  2:31         ` Patrick McHardy
2007-03-31  3:11         ` Greg KH
2007-03-31  3:45           ` Herbert Xu
2007-03-31 21:35         ` J. Bruce Fields
2007-03-30 21:06   ` [patch 33/37] libata: clear TF before IDENTIFYing Greg KH
2007-03-30 21:06   ` [patch 34/37] libata bugfix: HDIO_DRIVE_TASK Greg KH
2007-03-30 21:42     ` Mark Lord
2007-03-30 21:59       ` Greg KH
2007-03-30 21:45     ` libata bugfix: preserve LBA bit for HDIO_DRIVE_TASK Mark Lord
2007-03-31  3:36       ` Tejun Heo
2007-03-31 16:55         ` Mark Lord
2007-03-31 17:05           ` Tejun Heo
2007-04-04  6:08       ` Jeff Garzik
2007-03-30 21:07   ` [patch 35/37] libata: sata_mv: dont touch reserved bits in EDMA config register Greg KH
2007-03-30 21:07   ` [patch 36/37] libata: sata_mv: Fix 50xx irq mask Greg KH
2007-03-30 21:07   ` [patch 37/37] generic_serial: fix decoding of baud rate Greg KH
2007-03-30 21:10   ` [patch 00/37] 2.6.20-stable review Greg KH
2007-04-04 14:28   ` Chuck Ebbert
2007-04-04 21:23     ` [stable] " Greg KH

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070330210435.GN29450@kroah.com \
    --to=gregkh@suse.de \
    --cc=akpm@linux-foundation.org \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=bunk@stusta.de \
    --cc=cebbert@redhat.com \
    --cc=chuckw@quantumlinux.com \
    --cc=davej@redhat.com \
    --cc=davem@davemloft.net \
    --cc=jmforbes@linuxtx.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mkrufky@linuxtv.org \
    --cc=rdunlap@xenotime.net \
    --cc=reviews@ml.cw.f00f.org \
    --cc=stable@kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=tytso@mit.edu \
    --cc=zwane@arm.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox