[RFC] Multiple IPV6 Routing Tables & Policy Routing

netdev.vger.kernel.org archive mirror

* [RFC] Multiple IPV6 Routing Tables & Policy Routing
@ 2006-07-26 22:11 Thomas Graf
  2006-07-26 22:00 ` [PATCH 1/5] [IPV6]: Remove ndiscs rt6_lock dependency Thomas Graf
                   ` (6 more replies)
  0 siblings, 7 replies; 54+ messages in thread
From: Thomas Graf @ 2006-07-26 22:11 UTC (permalink / raw)
  To: netdev; +Cc: vnuorval, usagi-core, yoshfuji, davem, anttit

Hello,

I thought it might be time to go through a round of comments
on this work. Even though I've rewritten almost all of the code,
the patches are based on the work found on www.mobile-ipv6.org.
I have no idea which code was written by whom, so just email me
to get the credits right.

The main differences from the version found on mobile-ipv6.org
are that I removed the table refcnt and decided that tables
cannot disappear once created, which simplifies things and
avoids too many atomic operations when looking up routes. I've
replaced the table array with a hash table to prepare it for
more than 255 tables, and made the code aware of the new default
router selection code and the experimental route info support
added recently.
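
As a rough illustration of the hash table idea described above, here
is a small userspace sketch. It is not the patch code: the names
(table_get, table_new, TABLE_HASHSZ) are made up for this example, and
it is single-threaded, so it leaves out the RCU and locking that
kernel code would need. It only shows why tables that are never
unlinked can be handed out without a reference count.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TABLE_HASHSZ 256	/* hypothetical size, must be a power of two */

struct table {
	struct table *next;	/* hash chain */
	uint32_t id;
};

static struct table *table_hash[TABLE_HASHSZ];

/* Find an existing table by id, or return NULL. */
static struct table *table_get(uint32_t id)
{
	struct table *tb;

	for (tb = table_hash[id & (TABLE_HASHSZ - 1)]; tb; tb = tb->next)
		if (tb->id == id)
			return tb;
	return NULL;
}

/*
 * Return the table for id, creating and linking it on first use.
 * Tables are never unlinked, so callers may keep the pointer
 * without taking a reference.
 */
static struct table *table_new(uint32_t id)
{
	struct table *tb = table_get(id);
	unsigned int h = id & (TABLE_HASHSZ - 1);

	if (tb)
		return tb;

	tb = calloc(1, sizeof(*tb));
	if (tb) {
		tb->id = id;
		tb->next = table_hash[h];
		table_hash[h] = tb;
	}
	return tb;
}
```

Ids above 255 simply share a bucket with a smaller id (300 and 44
land in the same chain here), which is what lifts the limit imposed
by a fixed array indexed by table id.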

It's not final but it is somewhat working, and I'm eager to see
comments or patches. I apologize if I've stepped on anybody's
toes by taking this up and submitting it; this isn't meant as an
attempt to steal credit but rather to pick up good code and
finally get it upstream after a very long while.


* [PATCH 1/5] [IPV6]: Remove ndiscs rt6_lock dependency
  2006-07-26 22:11 [RFC] Multiple IPV6 Routing Tables & Policy Routing Thomas Graf
@ 2006-07-26 22:00 ` Thomas Graf
  2006-07-26 22:28   ` David Miller
  2006-07-26 22:00 ` [PATCH 2/5] [IPV6]: Multiple Routing Tables Thomas Graf
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 54+ messages in thread
From: Thomas Graf @ 2006-07-26 22:00 UTC (permalink / raw)
  To: netdev; +Cc: vnuorval, usagi-core, yoshfuji, davem, anttit

[-- Attachment #1: ip6_mrt_ndisc_lock --]
[-- Type: text/plain, Size: 1544 bytes --]

(Ab)using rt6_lock wouldn't work anymore if rt6_lock is
converted into a per table lock.
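
To make the idea concrete, here is a minimal userspace sketch of a
list guarded by its own dedicated lock instead of borrowing a lock
that belongs to something else. It is an illustration only, not the
kernel code: the spinlock is built from a C11 atomic_flag, the names
(gc_list_add, gc_list_run) are invented, and the way "more" is
counted is an assumption made for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct entry {
	struct entry *next;
	int refcnt;
};

/* The list and the lock that protects it live side by side. */
static struct entry *gc_list;
static atomic_flag gc_lock = ATOMIC_FLAG_INIT;

static void gc_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&gc_lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void gc_lock_release(void)
{
	atomic_flag_clear_explicit(&gc_lock, memory_order_release);
}

/* Push an entry onto the head of the list. */
static void gc_list_add(struct entry *e)
{
	gc_lock_acquire();
	e->next = gc_list;
	gc_list = e;
	gc_lock_release();
}

/*
 * Unlink every entry whose refcnt is zero. Returns the number of
 * unlinked entries and adds the number still in use to *more.
 */
static int gc_list_run(int *more)
{
	struct entry **pprev, *e;
	int freed = 0;

	gc_lock_acquire();
	pprev = &gc_list;
	while ((e = *pprev) != NULL) {
		if (e->refcnt == 0) {
			*pprev = e->next;
			freed++;
		} else {
			pprev = &e->next;
			(*more)++;
		}
	}
	gc_lock_release();

	return freed;
}
```

Because nothing outside these helpers touches gc_list, the lock that
covers the routing tables can later be split per table without
affecting this list.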

Signed-off-by: Thomas Graf <tgraf@suug.ch>

Index: net-2.6.git/net/ipv6/route.c
===================================================================
--- net-2.6.git.orig/net/ipv6/route.c
+++ net-2.6.git/net/ipv6/route.c
@@ -745,8 +745,6 @@ static void ip6_rt_update_pmtu(struct ds
 	}
 }
 
-/* Protected by rt6_lock.  */
-static struct dst_entry *ndisc_dst_gc_list;
 static int ipv6_get_mtu(struct net_device *dev);
 
 static inline unsigned int ipv6_advmss(unsigned int mtu)
@@ -767,6 +765,9 @@ static inline unsigned int ipv6_advmss(u
 	return mtu;
 }
 
+static struct dst_entry *ndisc_dst_gc_list;
+DEFINE_SPINLOCK(ndisc_lock);
+
 struct dst_entry *ndisc_dst_alloc(struct net_device *dev, 
 				  struct neighbour *neigh,
 				  struct in6_addr *addr,
@@ -807,10 +808,10 @@ struct dst_entry *ndisc_dst_alloc(struct
 	rt->rt6i_dst.plen = 128;
 #endif
 
-	write_lock_bh(&rt6_lock);
+	spin_lock_bh(&ndisc_lock);
 	rt->u.dst.next = ndisc_dst_gc_list;
 	ndisc_dst_gc_list = &rt->u.dst;
-	write_unlock_bh(&rt6_lock);
+	spin_unlock_bh(&ndisc_lock);
 
 	fib6_force_start_gc();
 
@@ -824,8 +825,11 @@ int ndisc_dst_gc(int *more)
 	int freed;
 
 	next = NULL;
+ 	freed = 0;
+
+	spin_lock_bh(&ndisc_lock);
 	pprev = &ndisc_dst_gc_list;
-	freed = 0;
+
 	while ((dst = *pprev) != NULL) {
 		if (!atomic_read(&dst->__refcnt)) {
 			*pprev = dst->next;
@@ -837,6 +841,8 @@ int ndisc_dst_gc(int *more)
 		}
 	}
 
+	spin_unlock_bh(&ndisc_lock);
+
 	return freed;
 }
 



* Re: [PATCH 1/5] [IPV6]: Remove ndiscs rt6_lock dependency
  2006-07-26 22:00 ` [PATCH 1/5] [IPV6]: Remove ndiscs rt6_lock dependency Thomas Graf
@ 2006-07-26 22:28   ` David Miller
  2006-07-26 23:34     ` Tushar Gohad
  2006-07-31 11:01     ` Ville Nuorvala
  0 siblings, 2 replies; 54+ messages in thread
From: David Miller @ 2006-07-26 22:28 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, vnuorval, usagi-core, yoshfuji, anttit

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 27 Jul 2006 00:00:01 +0200

> (Ab)using rt6_lock wouldn't work anymore if rt6_lock is
> converted into a per table lock.
> 
> Signed-off-by: Thomas Graf <tgraf@suug.ch>

This one looks great.

Signed-off-by: David S. Miller <davem@davemloft.net>


* Re: [PATCH 1/5] [IPV6]: Remove ndiscs rt6_lock dependency
  2006-07-26 22:28   ` David Miller
@ 2006-07-26 23:34     ` Tushar Gohad
  2006-07-26 23:34       ` David Miller
  2006-07-31 11:01     ` Ville Nuorvala
  1 sibling, 1 reply; 54+ messages in thread
From: Tushar Gohad @ 2006-07-26 23:34 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, netdev, vnuorval, usagi-core, yoshfuji, anttit

Are these changes scheduled to go into 2.6.18-rcX or 2.6.19?

Thanks.
- Tushar

David Miller wrote:
> From: Thomas Graf <tgraf@suug.ch>
> Date: Thu, 27 Jul 2006 00:00:01 +0200
> 
> 
>>(Ab)using rt6_lock wouldn't work anymore if rt6_lock is
>>converted into a per table lock.
>>
>>Signed-off-by: Thomas Graf <tgraf@suug.ch>
> 
> 
> This one looks great.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: [PATCH 1/5] [IPV6]: Remove ndiscs rt6_lock dependency
  2006-07-26 23:34     ` Tushar Gohad
@ 2006-07-26 23:34       ` David Miller
  0 siblings, 0 replies; 54+ messages in thread
From: David Miller @ 2006-07-26 23:34 UTC (permalink / raw)
  To: tgohad; +Cc: tgraf, netdev, vnuorval, usagi-core, yoshfuji, anttit

From: Tushar Gohad <tgohad@mvista.com>
Date: Wed, 26 Jul 2006 16:34:03 -0700

> Are these changes scheduled to go into 2.6.18-rcX or 2.6.19?

Since we are in feature freeze, net-2.6.19 is where it
will likely go.


* Re: [PATCH 1/5] [IPV6]: Remove ndiscs rt6_lock dependency
  2006-07-26 22:28   ` David Miller
  2006-07-26 23:34     ` Tushar Gohad
@ 2006-07-31 11:01     ` Ville Nuorvala
  1 sibling, 0 replies; 54+ messages in thread
From: Ville Nuorvala @ 2006-07-31 11:01 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, netdev, usagi-core, yoshfuji, anttit

David Miller wrote:
> From: Thomas Graf <tgraf@suug.ch>
> Date: Thu, 27 Jul 2006 00:00:01 +0200
> 
>> (Ab)using rt6_lock wouldn't work anymore if rt6_lock is
>> converted into a per table lock.
>>
>> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> 
> This one looks great.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

Ditto.
Signed-off-by: Ville Nuorvala <vnuorval@tcs.hut.fi>




* [PATCH 2/5] [IPV6]: Multiple Routing Tables
  2006-07-26 22:11 [RFC] Multiple IPV6 Routing Tables & Policy Routing Thomas Graf
  2006-07-26 22:00 ` [PATCH 1/5] [IPV6]: Remove ndiscs rt6_lock dependency Thomas Graf
@ 2006-07-26 22:00 ` Thomas Graf
  2006-07-26 22:39   ` David Miller
                     ` (2 more replies)
  2006-07-26 22:00 ` [PATCH 3/5] [NET]: Protocol Independant Policy Routing Rules Framework Thomas Graf
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 54+ messages in thread
From: Thomas Graf @ 2006-07-26 22:00 UTC (permalink / raw)
  To: netdev; +Cc: vnuorval, usagi-core, yoshfuji, davem, anttit

[-- Attachment #1: ip6_mrt --]
[-- Type: text/plain, Size: 28704 bytes --]

Adds the framework to support multiple IPv6 routing tables.
Currently all automatically generated routes are put into the
same table. This could be changed at a later point after
considering the resulting locking overhead.

When locating routes for redirects, only the main table is
searched for now. Since policy rules will not be reversible,
it is unclear whether it makes sense to change this.
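
The framework boils down to routing every lookup through one entry
point that picks a table and then calls a per-caller lookup
function. The following userspace sketch shows that shape only. It
is not the patch code: the types and names (struct route,
lookup_by_oif, rule_lookup, F_STRICT) are hypothetical stand-ins,
and the "table" is a plain array rather than a FIB tree.

```c
#include <assert.h>
#include <stddef.h>

#define F_STRICT 1	/* hypothetical flag: output interface must match */

struct route { int oif; int error; };
struct table { struct route *routes; size_t n; };
struct flow  { int oif; };

/* Every lookup variant has this signature. */
typedef struct route *(*lookup_fn)(struct table *, struct flow *, int);

/* Returned when nothing matches, so callers never see NULL. */
static struct route null_route = { 0, -1 };

static struct table main_table;

/* One possible lookup: first route, or first route on fl->oif if strict. */
static struct route *lookup_by_oif(struct table *tb, struct flow *fl,
				   int flags)
{
	size_t i;

	for (i = 0; i < tb->n; i++) {
		if (!(flags & F_STRICT) || tb->routes[i].oif == fl->oif)
			return &tb->routes[i];
	}
	return &null_route;
}

/*
 * Single entry point. With no policy rules yet it always consults
 * the main table; rules would later choose the table here.
 */
static struct route *rule_lookup(struct flow *fl, int flags,
				 lookup_fn lookup)
{
	return lookup(&main_table, fl, flags);
}
```

Keeping table selection in one place means policy rules can be added
later without touching the individual lookup functions.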

Signed-off-by: Thomas Graf <tgraf@suug.ch>

Index: net-2.6.git/include/net/ip6_fib.h
===================================================================
--- net-2.6.git.orig/include/net/ip6_fib.h
+++ net-2.6.git/include/net/ip6_fib.h
@@ -51,6 +51,8 @@ struct rt6key
 	int		plen;
 };
 
+struct fib6_table;
+
 struct rt6_info
 {
 	union {
@@ -71,6 +73,7 @@ struct rt6_info
 	u32				rt6i_flags;
 	u32				rt6i_metric;
 	atomic_t			rt6i_ref;
+	struct fib6_table		*rt6i_table;
 
 	struct rt6key			rt6i_dst;
 	struct rt6key			rt6i_src;
@@ -143,12 +146,41 @@ struct rt6_statistics {
 
 typedef void			(*f_pnode)(struct fib6_node *fn, void *);
 
-extern struct fib6_node		ip6_routing_table;
+struct fib6_table {
+	struct hlist_node	tb6_hlist;
+	u32			tb6_id;
+	rwlock_t		tb6_lock;
+	struct fib6_node	tb6_root;
+};
+
+#define RT6_TABLE_UNSPEC	RT_TABLE_UNSPEC
+#define RT6_TABLE_MAIN		RT_TABLE_MAIN
+#define RT6_TABLE_LOCAL		RT6_TABLE_MAIN
+#define RT6_TABLE_DFLT		RT6_TABLE_MAIN
+#define RT6_TABLE_INFO		RT6_TABLE_MAIN
+
+#ifdef CONFIG_IPV6_MULTIPLE_TABLES
+#define FIB6_TABLE_MIN		1
+#define FIB6_TABLE_MAX		RT_TABLE_MAX
+#else
+#define FIB6_TABLE_MIN		RT_TABLE_MAIN
+#define FIB6_TABLE_MAX		FIB6_TABLE_MIN
+#endif
+
+#define RT6_F_STRICT		1
+
+typedef struct rt6_info *(*pol_lookup_t)(struct fib6_table *,
+					 struct flowi *, int);
 
 /*
  *	exported functions
  */
 
+extern struct fib6_table *	fib6_get_table(u32 id);
+extern struct fib6_table *	fib6_new_table(u32 id);
+extern struct dst_entry *	fib6_rule_lookup(struct flowi *fl, int flags,
+						 pol_lookup_t lookup);
+
 extern struct fib6_node		*fib6_lookup(struct fib6_node *root,
 					     struct in6_addr *daddr,
 					     struct in6_addr *saddr);
@@ -161,6 +193,9 @@ extern void			fib6_clean_tree(struct fib
 						int (*func)(struct rt6_info *, void *arg),
 						int prune, void *arg);
 
+extern void			fib6_clean_all(int (*func)(struct rt6_info *, void *arg),
+					       int prune, void *arg);
+
 extern int			fib6_walk(struct fib6_walker_t *w);
 extern int			fib6_walk_continue(struct fib6_walker_t *w);
 
Index: net-2.6.git/net/ipv6/ip6_fib.c
===================================================================
--- net-2.6.git.orig/net/ipv6/ip6_fib.c
+++ net-2.6.git/net/ipv6/ip6_fib.c
@@ -26,6 +26,7 @@
 #include <linux/netdevice.h>
 #include <linux/in6.h>
 #include <linux/init.h>
+#include <linux/list.h>
 
 #ifdef 	CONFIG_PROC_FS
 #include <linux/proc_fs.h>
@@ -147,6 +148,126 @@ static __inline__ void rt6_release(struc
 		dst_free(&rt->u.dst);
 }
 
+static struct fib6_table fib6_main_tbl = {
+	.tb6_id		= RT6_TABLE_MAIN,
+	.tb6_lock	= RW_LOCK_UNLOCKED,
+	.tb6_root	= {
+		.leaf		= &ip6_null_entry,
+		.fn_flags	= RTN_ROOT | RTN_TL_ROOT | RTN_RTINFO,
+	},
+};
+
+#ifdef CONFIG_IPV6_MULTIPLE_TABLES
+
+#define FIB_TABLE_HASHSZ 256
+static struct hlist_head fib_table_hash[FIB_TABLE_HASHSZ];
+
+static struct fib6_table *fib6_alloc_table(u32 id)
+{
+	struct fib6_table *table;
+
+	table = kzalloc(sizeof(*table), GFP_ATOMIC);
+	if (table != NULL) {
+		table->tb6_id = id;
+		table->tb6_lock = RW_LOCK_UNLOCKED;
+		table->tb6_root.leaf = &ip6_null_entry;
+		table->tb6_root.fn_flags = RTN_ROOT | RTN_TL_ROOT | RTN_RTINFO;
+	}
+
+	return table;
+}
+
+static void fib6_link_table(struct fib6_table *tb)
+{
+	unsigned int h;
+
+	h = tb->tb6_id & (FIB_TABLE_HASHSZ - 1);
+
+	/*
+	 * No protection necessary, this is the only list mutation
+	 * operation, tables never disappear once they exist.
+	 */
+	hlist_add_head_rcu(&tb->tb6_hlist, &fib_table_hash[h]);
+}
+
+struct fib6_table *fib6_new_table(u32 id)
+{
+	struct fib6_table *tb;
+
+	if (id == 0)
+		id = RT6_TABLE_MAIN;
+	tb = fib6_get_table(id);
+	if (tb)
+		return tb;
+
+	tb = fib6_alloc_table(id);
+	if (tb != NULL)
+		fib6_link_table(tb);
+
+	return tb;
+}
+
+struct fib6_table *fib6_get_table(u32 id)
+{
+	struct fib6_table *tb;
+	struct hlist_node *node;
+	unsigned int h;
+
+	if (id == 0)
+		id = RT6_TABLE_MAIN;
+	h = id & (FIB_TABLE_HASHSZ - 1);
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(tb, node, &fib_table_hash[h], tb6_hlist) {
+		if (tb->tb6_id == id) {
+			rcu_read_unlock();
+			return tb;
+		}
+	}
+	rcu_read_unlock();
+
+	return NULL;
+}
+
+struct dst_entry *fib6_rule_lookup(struct flowi *fl, int flags,
+				   pol_lookup_t lookup)
+{
+	/*
+	 * TODO: Add rule lookup
+	 */
+	struct fib6_table *table = fib6_get_table(RT6_TABLE_MAIN);
+
+	return (struct dst_entry *) lookup(table, fl, flags);
+}
+
+static void __init fib6_tables_init(void)
+{
+	fib6_link_table(&fib6_main_tbl);
+}
+
+#else
+
+struct fib6_table *fib6_new_table(u32 id)
+{
+	return fib6_get_table(id);
+}
+
+struct fib6_table *fib6_get_table(u32 id)
+{
+	return &fib6_main_tbl;
+}
+
+struct dst_entry *fib6_rule_lookup(struct flowi *fl, int flags,
+				   pol_lookup_t lookup)
+{
+	return (struct dst_entry *) lookup(&fib6_main_tbl, fl, flags);
+}
+
+static void __init fib6_tables_init(void)
+{
+}
+
+#endif
+
 
 /*
  *	Routing Table
@@ -1064,6 +1185,22 @@ void fib6_clean_tree(struct fib6_node *r
 	fib6_walk(&c.w);
 }
 
+void fib6_clean_all(int (*func)(struct rt6_info *, void *arg),
+		    int prune, void *arg)
+{
+	int i;
+	struct fib6_table *table;
+
+	for (i = FIB6_TABLE_MIN; i <= FIB6_TABLE_MAX; i++) {
+		table = fib6_get_table(i);
+		if (table != NULL) {
+			write_lock_bh(&table->tb6_lock);
+			fib6_clean_tree(&table->tb6_root, func, prune, arg);
+			write_unlock_bh(&table->tb6_lock);
+		}
+	}
+}
+
 static int fib6_prune_clone(struct rt6_info *rt, void *arg)
 {
 	if (rt->rt6i_flags & RTF_CACHE) {
@@ -1142,11 +1279,8 @@ void fib6_run_gc(unsigned long dummy)
 	}
 	gc_args.more = 0;
 
-
-	write_lock_bh(&rt6_lock);
 	ndisc_dst_gc(&gc_args.more);
-	fib6_clean_tree(&ip6_routing_table, fib6_age, 0, NULL);
-	write_unlock_bh(&rt6_lock);
+	fib6_clean_all(fib6_age, 0, NULL);
 
 	if (gc_args.more)
 		mod_timer(&ip6_fib_timer, jiffies + ip6_rt_gc_interval);
@@ -1165,6 +1299,8 @@ void __init fib6_init(void)
 					   NULL, NULL);
 	if (!fib6_node_kmem)
 		panic("cannot create fib6_nodes cache");
+
+	fib6_tables_init();
 }
 
 void fib6_gc_cleanup(void)
Index: net-2.6.git/net/ipv6/route.c
===================================================================
--- net-2.6.git.orig/net/ipv6/route.c
+++ net-2.6.git/net/ipv6/route.c
@@ -139,16 +139,6 @@ struct rt6_info ip6_null_entry = {
 	.rt6i_ref	= ATOMIC_INIT(1),
 };
 
-struct fib6_node ip6_routing_table = {
-	.leaf		= &ip6_null_entry,
-	.fn_flags	= RTN_ROOT | RTN_TL_ROOT | RTN_RTINFO,
-};
-
-/* Protects all the ip6 fib */
-
-DEFINE_RWLOCK(rt6_lock);
-
-
 /* allocate dst with ip6_dst_ops */
 static __inline__ struct rt6_info *ip6_dst_alloc(void)
 {
@@ -187,8 +177,14 @@ static __inline__ int rt6_check_expired(
 		time_after(jiffies, rt->rt6i_expires));
 }
 
+static inline int rt6_need_strict(struct in6_addr *daddr)
+{
+	return (ipv6_addr_type(daddr) &
+		(IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL));
+}
+
 /*
- *	Route lookup. Any rt6_lock is implied.
+ *	Route lookup. Any table->tb6_lock is implied.
  */
 
 static __inline__ struct rt6_info *rt6_device_match(struct rt6_info *rt,
@@ -440,27 +436,66 @@ int rt6_route_rcv(struct net_device *dev
 }
 #endif
 
-struct rt6_info *rt6_lookup(struct in6_addr *daddr, struct in6_addr *saddr,
-			    int oif, int strict)
+#define BACKTRACK() \
+if (rt == &ip6_null_entry && flags & RT6_F_STRICT) { \
+	while ((fn = fn->parent) != NULL) { \
+		if (fn->fn_flags & RTN_TL_ROOT) { \
+			dst_hold(&rt->u.dst); \
+			goto out; \
+		} \
+		if (fn->fn_flags & RTN_RTINFO) \
+			goto restart; \
+	} \
+}
+
+static struct rt6_info *ip6_pol_route_lookup(struct fib6_table *table,
+					     struct flowi *fl, int flags)
 {
 	struct fib6_node *fn;
 	struct rt6_info *rt;
 
-	read_lock_bh(&rt6_lock);
-	fn = fib6_lookup(&ip6_routing_table, daddr, saddr);
-	rt = rt6_device_match(fn->leaf, oif, strict);
+	read_lock_bh(&table->tb6_lock);
+	fn = fib6_lookup(&table->tb6_root, &fl->fl6_dst, &fl->fl6_src);
+restart:
+	rt = fn->leaf;
+	rt = rt6_device_match(rt, fl->oif, flags & RT6_F_STRICT);
+	BACKTRACK();
 	dst_hold(&rt->u.dst);
-	rt->u.dst.__use++;
-	read_unlock_bh(&rt6_lock);
+out:
+	read_unlock_bh(&table->tb6_lock);
 
 	rt->u.dst.lastuse = jiffies;
-	if (rt->u.dst.error == 0)
-		return rt;
-	dst_release(&rt->u.dst);
+	rt->u.dst.__use++;
+
+	return rt;
+
+}
+
+struct rt6_info *rt6_lookup(struct in6_addr *daddr, struct in6_addr *saddr,
+			    int oif, int strict)
+{
+	struct flowi fl = {
+		.oif = oif,
+		.nl_u = {
+			.ip6_u = {
+				.daddr = *daddr,
+				/* TODO: saddr */
+			},
+		},
+	};
+	struct dst_entry *dst;
+	int flags = strict ? RT6_F_STRICT : 0;
+
+	dst = fib6_rule_lookup(&fl, flags, ip6_pol_route_lookup);
+	if (dst->error == 0)
+		return (struct rt6_info *) dst;
+
+	dst_release(dst);
+
 	return NULL;
 }
 
-/* ip6_ins_rt is called with FREE rt6_lock.
+/* ip6_ins_rt is called with FREE table->tb6_lock.
    It takes new route entry, the addition fails by any reason the
    route is freed. In any case, if caller does not hold it, it may
    be destroyed.
@@ -470,10 +505,12 @@ int ip6_ins_rt(struct rt6_info *rt, stru
 		void *_rtattr, struct netlink_skb_parms *req)
 {
 	int err;
+	struct fib6_table *table;
 
-	write_lock_bh(&rt6_lock);
-	err = fib6_add(&ip6_routing_table, rt, nlh, _rtattr, req);
-	write_unlock_bh(&rt6_lock);
+	table = rt->rt6i_table;
+	write_lock_bh(&table->tb6_lock);
+	err = fib6_add(&table->tb6_root, rt, nlh, _rtattr, req);
+	write_unlock_bh(&table->tb6_lock);
 
 	return err;
 }
@@ -531,51 +568,40 @@ static struct rt6_info *rt6_alloc_clone(
 	return rt;
 }
 
-#define BACKTRACK() \
-if (rt == &ip6_null_entry) { \
-       while ((fn = fn->parent) != NULL) { \
-		if (fn->fn_flags & RTN_ROOT) { \
-			goto out; \
-		} \
-		if (fn->fn_flags & RTN_RTINFO) \
-			goto restart; \
-	} \
-}
-
-
-void ip6_route_input(struct sk_buff *skb)
+struct rt6_info *ip6_pol_route_input(struct fib6_table *table, struct flowi *fl,
+				     int flags)
 {
 	struct fib6_node *fn;
 	struct rt6_info *rt, *nrt;
-	int strict;
+	int strict = 0;
 	int attempts = 3;
 	int err;
 	int reachable = RT6_SELECT_F_REACHABLE;
 
-	strict = ipv6_addr_type(&skb->nh.ipv6h->daddr) & (IPV6_ADDR_MULTICAST|IPV6_ADDR_LINKLOCAL) ? RT6_SELECT_F_IFACE : 0;
+	if (flags & RT6_F_STRICT)
+		strict = RT6_SELECT_F_IFACE;
 
 relookup:
-	read_lock_bh(&rt6_lock);
+	read_lock_bh(&table->tb6_lock);
 
 restart_2:
-	fn = fib6_lookup(&ip6_routing_table, &skb->nh.ipv6h->daddr,
-			 &skb->nh.ipv6h->saddr);
+	fn = fib6_lookup(&table->tb6_root, &fl->fl6_dst, &fl->fl6_src);
 
 restart:
-	rt = rt6_select(&fn->leaf, skb->dev->ifindex, strict | reachable);
+	rt = rt6_select(&fn->leaf, fl->iif, strict | reachable);
 	BACKTRACK();
 	if (rt == &ip6_null_entry ||
 	    rt->rt6i_flags & RTF_CACHE)
 		goto out;
 
 	dst_hold(&rt->u.dst);
-	read_unlock_bh(&rt6_lock);
+	read_unlock_bh(&table->tb6_lock);
 
 	if (!rt->rt6i_nexthop && !(rt->rt6i_flags & RTF_NONEXTHOP))
-		nrt = rt6_alloc_cow(rt, &skb->nh.ipv6h->daddr, &skb->nh.ipv6h->saddr);
+		nrt = rt6_alloc_cow(rt, &fl->fl6_dst, &fl->fl6_src);
 	else {
 #if CLONE_OFFLINK_ROUTE
-		nrt = rt6_alloc_clone(rt, &skb->nh.ipv6h->daddr);
+		nrt = rt6_alloc_clone(rt, &fl->fl6_dst);
 #else
 		goto out2;
 #endif
@@ -586,7 +612,7 @@ restart:
 
 	dst_hold(&rt->u.dst);
 	if (nrt) {
-		err = ip6_ins_rt(nrt, NULL, NULL, &NETLINK_CB(skb));
+		err = ip6_ins_rt(nrt, NULL, NULL, NULL);
 		if (!err)
 			goto out2;
 	}
@@ -595,7 +621,7 @@ restart:
 		goto out2;
 
 	/*
-	 * Race condition! In the gap, when rt6_lock was
+	 * Race condition! In the gap, when table->tb6_lock was
 	 * released someone could insert this route.  Relookup.
 	 */
 	dst_release(&rt->u.dst);
@@ -607,30 +633,54 @@ out:
 		goto restart_2;
 	}
 	dst_hold(&rt->u.dst);
-	read_unlock_bh(&rt6_lock);
+	read_unlock_bh(&table->tb6_lock);
 out2:
 	rt->u.dst.lastuse = jiffies;
 	rt->u.dst.__use++;
-	skb->dst = (struct dst_entry *) rt;
-	return;
+
+	return rt;
 }
 
-struct dst_entry * ip6_route_output(struct sock *sk, struct flowi *fl)
+void ip6_route_input(struct sk_buff *skb)
+{
+	struct ipv6hdr *iph = skb->nh.ipv6h;
+	struct flowi fl = {
+		.iif = skb->dev->ifindex,
+		.nl_u = {
+			.ip6_u = {
+				.daddr = iph->daddr,
+				.saddr = iph->saddr,
+				.flowlabel = (* (u32 *) iph)&IPV6_FLOWINFO_MASK,
+			},
+		},
+		.proto = iph->nexthdr,
+	};
+	int flags = 0;
+
+	if (rt6_need_strict(&iph->daddr))
+		flags |= RT6_F_STRICT;
+
+	skb->dst = fib6_rule_lookup(&fl, flags, ip6_pol_route_input);
+}
+
+static struct rt6_info *ip6_pol_route_output(struct fib6_table *table,
+					     struct flowi *fl, int flags)
 {
 	struct fib6_node *fn;
 	struct rt6_info *rt, *nrt;
-	int strict;
+	int strict = 0;
 	int attempts = 3;
 	int err;
 	int reachable = RT6_SELECT_F_REACHABLE;
 
-	strict = ipv6_addr_type(&fl->fl6_dst) & (IPV6_ADDR_MULTICAST|IPV6_ADDR_LINKLOCAL) ? RT6_SELECT_F_IFACE : 0;
+	if (flags & RT6_F_STRICT)
+		strict = RT6_SELECT_F_IFACE;
 
 relookup:
-	read_lock_bh(&rt6_lock);
+	read_lock_bh(&table->tb6_lock);
 
 restart_2:
-	fn = fib6_lookup(&ip6_routing_table, &fl->fl6_dst, &fl->fl6_src);
+	fn = fib6_lookup(&table->tb6_root, &fl->fl6_dst, &fl->fl6_src);
 
 restart:
 	rt = rt6_select(&fn->leaf, fl->oif, strict | reachable);
@@ -640,7 +690,7 @@ restart:
 		goto out;
 
 	dst_hold(&rt->u.dst);
-	read_unlock_bh(&rt6_lock);
+	read_unlock_bh(&table->tb6_lock);
 
 	if (!rt->rt6i_nexthop && !(rt->rt6i_flags & RTF_NONEXTHOP))
 		nrt = rt6_alloc_cow(rt, &fl->fl6_dst, &fl->fl6_src);
@@ -666,7 +716,7 @@ restart:
 		goto out2;
 
 	/*
-	 * Race condition! In the gap, when rt6_lock was
+	 * Race condition! In the gap, when table->tb6_lock was
 	 * released someone could insert this route.  Relookup.
 	 */
 	dst_release(&rt->u.dst);
@@ -678,11 +728,21 @@ out:
 		goto restart_2;
 	}
 	dst_hold(&rt->u.dst);
-	read_unlock_bh(&rt6_lock);
+	read_unlock_bh(&table->tb6_lock);
 out2:
 	rt->u.dst.lastuse = jiffies;
 	rt->u.dst.__use++;
-	return &rt->u.dst;
+	return rt;
+}
+
+struct dst_entry * ip6_route_output(struct sock *sk, struct flowi *fl)
+{
+	int flags = 0;
+
+	if (rt6_need_strict(&fl->fl6_dst))
+		flags |= RT6_F_STRICT;
+
+	return fib6_rule_lookup(fl, flags, ip6_pol_route_output);
 }
 
 
@@ -904,7 +964,8 @@ int ipv6_get_hoplimit(struct net_device 
  */
 
 int ip6_route_add(struct in6_rtmsg *rtmsg, struct nlmsghdr *nlh, 
-		void *_rtattr, struct netlink_skb_parms *req)
+		  void *_rtattr, struct netlink_skb_parms *req,
+		  u32 table_id)
 {
 	int err;
 	struct rtmsg *r;
@@ -912,6 +973,7 @@ int ip6_route_add(struct in6_rtmsg *rtms
 	struct rt6_info *rt = NULL;
 	struct net_device *dev = NULL;
 	struct inet6_dev *idev = NULL;
+	struct fib6_table *table;
 	int addr_type;
 
 	rta = (struct rtattr **) _rtattr;
@@ -935,6 +997,12 @@ int ip6_route_add(struct in6_rtmsg *rtms
 	if (rtmsg->rtmsg_metric == 0)
 		rtmsg->rtmsg_metric = IP6_RT_PRIO_USER;
 
+	table = fib6_new_table(table_id);
+	if (table == NULL) {
+		err = -ENOBUFS;
+		goto out;
+	}
+
 	rt = ip6_dst_alloc();
 
 	if (rt == NULL) {
@@ -1091,6 +1159,7 @@ install_route:
 		rt->u.dst.metrics[RTAX_ADVMSS-1] = ipv6_advmss(dst_mtu(&rt->u.dst));
 	rt->u.dst.dev = dev;
 	rt->rt6i_idev = idev;
+	rt->rt6i_table = table;
 	return ip6_ins_rt(rt, nlh, _rtattr, req);
 
 out:
@@ -1106,26 +1175,35 @@ out:
 int ip6_del_rt(struct rt6_info *rt, struct nlmsghdr *nlh, void *_rtattr, struct netlink_skb_parms *req)
 {
 	int err;
+	struct fib6_table *table;
 
-	write_lock_bh(&rt6_lock);
+	table = rt->rt6i_table;
+	write_lock_bh(&table->tb6_lock);
 
 	err = fib6_del(rt, nlh, _rtattr, req);
 	dst_release(&rt->u.dst);
 
-	write_unlock_bh(&rt6_lock);
+	write_unlock_bh(&table->tb6_lock);
 
 	return err;
 }
 
-static int ip6_route_del(struct in6_rtmsg *rtmsg, struct nlmsghdr *nlh, void *_rtattr, struct netlink_skb_parms *req)
+static int ip6_route_del(struct in6_rtmsg *rtmsg, struct nlmsghdr *nlh,
+			 void *_rtattr, struct netlink_skb_parms *req,
+			 u32 table_id)
 {
+	struct fib6_table *table;
 	struct fib6_node *fn;
 	struct rt6_info *rt;
 	int err = -ESRCH;
 
-	read_lock_bh(&rt6_lock);
+	table = fib6_get_table(table_id);
+	if (table == NULL)
+		return err;
 
-	fn = fib6_locate(&ip6_routing_table,
+	read_lock_bh(&table->tb6_lock);
+
+	fn = fib6_locate(&table->tb6_root,
 			 &rtmsg->rtmsg_dst, rtmsg->rtmsg_dst_len,
 			 &rtmsg->rtmsg_src, rtmsg->rtmsg_src_len);
 	
@@ -1142,12 +1220,12 @@ static int ip6_route_del(struct in6_rtms
 			    rtmsg->rtmsg_metric != rt->rt6i_metric)
 				continue;
 			dst_hold(&rt->u.dst);
-			read_unlock_bh(&rt6_lock);
+			read_unlock_bh(&table->tb6_lock);
 
 			return ip6_del_rt(rt, nlh, _rtattr, req);
 		}
 	}
-	read_unlock_bh(&rt6_lock);
+	read_unlock_bh(&table->tb6_lock);
 
 	return err;
 }
@@ -1159,8 +1237,13 @@ void rt6_redirect(struct in6_addr *dest,
 		  struct neighbour *neigh, u8 *lladdr, int on_link)
 {
 	struct rt6_info *rt, *nrt = NULL;
-	int strict;
 	struct fib6_node *fn;
+	struct fib6_table *table;
+
+	/* TODO: Very lazy, might need to check all tables */
+	table = fib6_get_table(RT6_TABLE_MAIN);
+	if (table == NULL)
+		return;
 
 	/*
 	 * Get the "current" route for this destination and
@@ -1172,10 +1255,9 @@ void rt6_redirect(struct in6_addr *dest,
 	 * is a bit fuzzy and one might need to check all possible
 	 * routes.
 	 */
-	strict = ipv6_addr_type(dest) & (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL);
 
-	read_lock_bh(&rt6_lock);
-	fn = fib6_lookup(&ip6_routing_table, dest, NULL);
+	read_lock_bh(&table->tb6_lock);
+	fn = fib6_lookup(&table->tb6_root, dest, NULL);
 restart:
 	for (rt = fn->leaf; rt; rt = rt->u.next) {
 		/*
@@ -1198,7 +1280,7 @@ restart:
 	}
 	if (rt)
 		dst_hold(&rt->u.dst);
-	else if (strict) {
+	else if (rt6_need_strict(dest)) {
 		while ((fn = fn->parent) != NULL) {
 			if (fn->fn_flags & RTN_ROOT)
 				break;
@@ -1206,7 +1288,7 @@ restart:
 				goto restart;
 		}
 	}
-	read_unlock_bh(&rt6_lock);
+	read_unlock_bh(&table->tb6_lock);
 
 	if (!rt) {
 		if (net_ratelimit())
@@ -1377,6 +1459,7 @@ static struct rt6_info * ip6_rt_copy(str
 #ifdef CONFIG_IPV6_SUBTREES
 		memcpy(&rt->rt6i_src, &ort->rt6i_src, sizeof(struct rt6key));
 #endif
+		rt->rt6i_table = ort->rt6i_table;
 	}
 	return rt;
 }
@@ -1387,9 +1470,14 @@ static struct rt6_info *rt6_get_route_in
 {
 	struct fib6_node *fn;
 	struct rt6_info *rt = NULL;
+	struct fib6_table *table;
 
-	write_lock_bh(&rt6_lock);
-	fn = fib6_locate(&ip6_routing_table, prefix ,prefixlen, NULL, 0);
+	table = fib6_get_table(RT6_TABLE_INFO);
+	if (table == NULL)
+		return NULL;
+
+	write_lock_bh(&table->tb6_lock);
+	fn = fib6_locate(&table->tb6_root, prefix ,prefixlen, NULL, 0);
 	if (!fn)
 		goto out;
 
@@ -1404,7 +1492,7 @@ static struct rt6_info *rt6_get_route_in
 		break;
 	}
 out:
-	write_unlock_bh(&rt6_lock);
+	write_unlock_bh(&table->tb6_lock);
 	return rt;
 }
 
@@ -1426,7 +1514,7 @@ static struct rt6_info *rt6_add_route_in
 		rtmsg.rtmsg_flags |= RTF_DEFAULT;
 	rtmsg.rtmsg_ifindex = ifindex;
 
-	ip6_route_add(&rtmsg, NULL, NULL, NULL);
+	ip6_route_add(&rtmsg, NULL, NULL, NULL, RT6_TABLE_INFO);
 
 	return rt6_get_route_info(prefix, prefixlen, gwaddr, ifindex);
 }
@@ -1435,12 +1523,15 @@ static struct rt6_info *rt6_add_route_in
 struct rt6_info *rt6_get_dflt_router(struct in6_addr *addr, struct net_device *dev)
 {	
 	struct rt6_info *rt;
-	struct fib6_node *fn;
+	struct fib6_table *table;
 
-	fn = &ip6_routing_table;
+	/* TODO: It might be better to search all tables */
+	table = fib6_get_table(RT6_TABLE_DFLT);
+	if (table == NULL)
+		return NULL;
 
-	write_lock_bh(&rt6_lock);
-	for (rt = fn->leaf; rt; rt=rt->u.next) {
+	write_lock_bh(&table->tb6_lock);
+	for (rt = table->tb6_root.leaf; rt; rt=rt->u.next) {
 		if (dev == rt->rt6i_dev &&
 		    ((rt->rt6i_flags & (RTF_ADDRCONF | RTF_DEFAULT)) == (RTF_ADDRCONF | RTF_DEFAULT)) &&
 		    ipv6_addr_equal(&rt->rt6i_gateway, addr))
@@ -1448,7 +1539,7 @@ struct rt6_info *rt6_get_dflt_router(str
 	}
 	if (rt)
 		dst_hold(&rt->u.dst);
-	write_unlock_bh(&rt6_lock);
+	write_unlock_bh(&table->tb6_lock);
 	return rt;
 }
 
@@ -1467,28 +1558,31 @@ struct rt6_info *rt6_add_dflt_router(str
 
 	rtmsg.rtmsg_ifindex = dev->ifindex;
 
-	ip6_route_add(&rtmsg, NULL, NULL, NULL);
+	ip6_route_add(&rtmsg, NULL, NULL, NULL, RT6_TABLE_DFLT);
 	return rt6_get_dflt_router(gwaddr, dev);
 }
 
 void rt6_purge_dflt_routers(void)
 {
 	struct rt6_info *rt;
+	struct fib6_table *table;
+
+	/* NOTE: Keep consistent with rt6_get_dflt_router */
+	table = fib6_get_table(RT6_TABLE_DFLT);
+	if (table == NULL)
+		return;
 
 restart:
-	read_lock_bh(&rt6_lock);
-	for (rt = ip6_routing_table.leaf; rt; rt = rt->u.next) {
+	read_lock_bh(&table->tb6_lock);
+	for (rt = table->tb6_root.leaf; rt; rt = rt->u.next) {
 		if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ADDRCONF)) {
 			dst_hold(&rt->u.dst);
-
-			read_unlock_bh(&rt6_lock);
-
+			read_unlock_bh(&table->tb6_lock);
 			ip6_del_rt(rt, NULL, NULL, NULL);
-
 			goto restart;
 		}
 	}
-	read_unlock_bh(&rt6_lock);
+	read_unlock_bh(&table->tb6_lock);
 }
 
 int ipv6_route_ioctl(unsigned int cmd, void __user *arg)
@@ -1509,10 +1603,12 @@ int ipv6_route_ioctl(unsigned int cmd, v
 		rtnl_lock();
 		switch (cmd) {
 		case SIOCADDRT:
-			err = ip6_route_add(&rtmsg, NULL, NULL, NULL);
+			err = ip6_route_add(&rtmsg, NULL, NULL, NULL,
+					    RT6_TABLE_MAIN);
 			break;
 		case SIOCDELRT:
-			err = ip6_route_del(&rtmsg, NULL, NULL, NULL);
+			err = ip6_route_del(&rtmsg, NULL, NULL, NULL,
+					    RT6_TABLE_MAIN);
 			break;
 		default:
 			err = -EINVAL;
@@ -1582,6 +1678,7 @@ struct rt6_info *addrconf_dst_alloc(stru
 
 	ipv6_addr_copy(&rt->rt6i_dst.addr, addr);
 	rt->rt6i_dst.plen = 128;
+	rt->rt6i_table = fib6_get_table(RT6_TABLE_LOCAL);
 
 	atomic_set(&rt->u.dst.__refcnt, 1);
 
@@ -1600,9 +1697,7 @@ static int fib6_ifdown(struct rt6_info *
 
 void rt6_ifdown(struct net_device *dev)
 {
-	write_lock_bh(&rt6_lock);
-	fib6_clean_tree(&ip6_routing_table, fib6_ifdown, 0, dev);
-	write_unlock_bh(&rt6_lock);
+	fib6_clean_all(fib6_ifdown, 0, dev);
 }
 
 struct rt6_mtu_change_arg
@@ -1652,13 +1747,12 @@ static int rt6_mtu_change_route(struct r
 
 void rt6_mtu_change(struct net_device *dev, unsigned mtu)
 {
-	struct rt6_mtu_change_arg arg;
+	struct rt6_mtu_change_arg arg = {
+		.dev = dev,
+		.mtu = mtu,
+	};
 
-	arg.dev = dev;
-	arg.mtu = mtu;
-	read_lock_bh(&rt6_lock);
-	fib6_clean_tree(&ip6_routing_table, rt6_mtu_change_route, 0, &arg);
-	read_unlock_bh(&rt6_lock);
+	fib6_clean_all(rt6_mtu_change_route, 0, &arg);
 }
 
 static int inet6_rtm_to_rtmsg(struct rtmsg *r, struct rtattr **rta,
@@ -1708,7 +1802,7 @@ int inet6_rtm_delroute(struct sk_buff *s
 
 	if (inet6_rtm_to_rtmsg(r, arg, &rtmsg))
 		return -EINVAL;
-	return ip6_route_del(&rtmsg, nlh, arg, &NETLINK_CB(skb));
+	return ip6_route_del(&rtmsg, nlh, arg, &NETLINK_CB(skb), r->rtm_table);
 }
 
 int inet6_rtm_newroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
@@ -1718,7 +1812,7 @@ int inet6_rtm_newroute(struct sk_buff *s
 
 	if (inet6_rtm_to_rtmsg(r, arg, &rtmsg))
 		return -EINVAL;
-	return ip6_route_add(&rtmsg, nlh, arg, &NETLINK_CB(skb));
+	return ip6_route_add(&rtmsg, nlh, arg, &NETLINK_CB(skb), r->rtm_table);
 }
 
 struct rt6_rtnl_dump_arg
@@ -1750,6 +1844,9 @@ static int rt6_fill_node(struct sk_buff 
 	rtm->rtm_dst_len = rt->rt6i_dst.plen;
 	rtm->rtm_src_len = rt->rt6i_src.plen;
 	rtm->rtm_tos = 0;
-	rtm->rtm_table = RT_TABLE_MAIN;
+	if (rt->rt6i_table)
+		rtm->rtm_table = rt->rt6i_table->tb6_id;
+	else
+		rtm->rtm_table = RT6_TABLE_UNSPEC;
 	if (rt->rt6i_flags&RTF_REJECT)
 		rtm->rtm_type = RTN_UNREACHABLE;
@@ -1857,7 +1955,6 @@ static void fib6_dump_end(struct netlink
 
 	if (w) {
 		cb->args[0] = 0;
-		fib6_walker_unlink(w);
 		kfree(w);
 	}
 	cb->done = (void*)cb->args[1];
@@ -1872,13 +1969,20 @@ static int fib6_dump_done(struct netlink
 
 int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 {
+	struct fib6_table *table;
 	struct rt6_rtnl_dump_arg arg;
 	struct fib6_walker_t *w;
-	int res;
+	int i, res = 0;
 
 	arg.skb = skb;
 	arg.cb = cb;
 
+	/*
+	 * cb->args[0] = pointer to walker structure
+	 * cb->args[1] = saved cb->done() pointer
+	 * cb->args[2] = current table being dumped
+	 */
+
 	w = (void*)cb->args[0];
 	if (w == NULL) {
 		/* New dump:
@@ -1894,24 +1998,48 @@ int inet6_dump_fib(struct sk_buff *skb, 
 		w = kzalloc(sizeof(*w), GFP_ATOMIC);
 		if (w == NULL)
 			return -ENOMEM;
-		RT6_TRACE("dump<%p", w);
-		w->root = &ip6_routing_table;
 		w->func = fib6_dump_node;
 		w->args = &arg;
 		cb->args[0] = (long)w;
-		read_lock_bh(&rt6_lock);
-		res = fib6_walk(w);
-		read_unlock_bh(&rt6_lock);
+		cb->args[2] = FIB6_TABLE_MIN;
 	} else {
 		w->args = &arg;
-		read_lock_bh(&rt6_lock);
-		res = fib6_walk_continue(w);
-		read_unlock_bh(&rt6_lock);
-	}
-#if RT6_DEBUG >= 3
-	if (res <= 0 && skb->len == 0)
-		RT6_TRACE("%p>dump end\n", w);
-#endif
+		i = cb->args[2];
+		if (i > FIB6_TABLE_MAX)
+			goto end;
+
+		table = fib6_get_table(i);
+		if (table != NULL) {
+			read_lock_bh(&table->tb6_lock);
+			w->root = &table->tb6_root;
+			res = fib6_walk_continue(w);
+			read_unlock_bh(&table->tb6_lock);
+			if (res != 0) {
+				if (res < 0)
+					fib6_walker_unlink(w);
+				goto end;
+			}
+		}
+
+		fib6_walker_unlink(w);
+		cb->args[2] = ++i;
+	}
+
+	for (i = cb->args[2]; i <= FIB6_TABLE_MAX; i++) {
+		table = fib6_get_table(i);
+		if (table == NULL)
+			continue;
+
+		read_lock_bh(&table->tb6_lock);
+		w->root = &table->tb6_root;
+		res = fib6_walk(w);
+		read_unlock_bh(&table->tb6_lock);
+		if (res)
+			break;
+	}
+end:
+	cb->args[2] = i;
+
 	res = res < 0 ? res : skb->len;
 	/* res < 0 is an error. (really, impossible)
 	   res == 0 means that dump is complete, but skb still can contain data.
@@ -2091,16 +2219,13 @@ static int rt6_info_route(struct rt6_inf
 
 static int rt6_proc_info(char *buffer, char **start, off_t offset, int length)
 {
-	struct rt6_proc_arg arg;
-	arg.buffer = buffer;
-	arg.offset = offset;
-	arg.length = length;
-	arg.skip = 0;
-	arg.len = 0;
-
-	read_lock_bh(&rt6_lock);
-	fib6_clean_tree(&ip6_routing_table, rt6_info_route, 0, &arg);
-	read_unlock_bh(&rt6_lock);
+	struct rt6_proc_arg arg = {
+		.buffer = buffer,
+		.offset = offset,
+		.length = length,
+	};
+
+	fib6_clean_all(rt6_info_route, 0, &arg);
 
 	*start = buffer;
 	if (offset)
Index: net-2.6.git/net/ipv6/Kconfig
===================================================================
--- net-2.6.git.orig/net/ipv6/Kconfig
+++ net-2.6.git/net/ipv6/Kconfig
@@ -135,3 +135,9 @@ config IPV6_TUNNEL
 
 	  If unsure, say N.
 
+config IPV6_MULTIPLE_TABLES
+	bool "IPv6: Multiple Routing Tables"
+	depends on IPV6 && EXPERIMENTAL
+	---help---
+	  Support multiple routing tables.
+
Index: net-2.6.git/include/net/ip6_route.h
===================================================================
--- net-2.6.git.orig/include/net/ip6_route.h
+++ net-2.6.git/include/net/ip6_route.h
@@ -58,7 +58,8 @@ extern int			ipv6_route_ioctl(unsigned i
 extern int			ip6_route_add(struct in6_rtmsg *rtmsg,
 					      struct nlmsghdr *,
 					      void *rtattr,
-					      struct netlink_skb_parms *req);
+					      struct netlink_skb_parms *req,
+					      u32 table_id);
 extern int			ip6_ins_rt(struct rt6_info *,
 					   struct nlmsghdr *,
 					   void *rtattr,
Index: net-2.6.git/net/ipv6/addrconf.c
===================================================================
--- net-2.6.git.orig/net/ipv6/addrconf.c
+++ net-2.6.git/net/ipv6/addrconf.c
@@ -1525,7 +1525,7 @@ addrconf_prefix_route(struct in6_addr *p
 	if (dev->type == ARPHRD_SIT && (dev->flags&IFF_POINTOPOINT))
 		rtmsg.rtmsg_flags |= RTF_NONEXTHOP;
 
-	ip6_route_add(&rtmsg, NULL, NULL, NULL);
+	ip6_route_add(&rtmsg, NULL, NULL, NULL, RT6_TABLE_MAIN);
 }
 
 /* Create "default" multicast route to the interface */
@@ -1542,7 +1542,7 @@ static void addrconf_add_mroute(struct n
 	rtmsg.rtmsg_ifindex = dev->ifindex;
 	rtmsg.rtmsg_flags = RTF_UP;
 	rtmsg.rtmsg_type = RTMSG_NEWROUTE;
-	ip6_route_add(&rtmsg, NULL, NULL, NULL);
+	ip6_route_add(&rtmsg, NULL, NULL, NULL, RT6_TABLE_MAIN);
 }
 
 static void sit_route_add(struct net_device *dev)
@@ -1559,7 +1559,7 @@ static void sit_route_add(struct net_dev
 	rtmsg.rtmsg_flags	= RTF_UP|RTF_NONEXTHOP;
 	rtmsg.rtmsg_ifindex	= dev->ifindex;
 
-	ip6_route_add(&rtmsg, NULL, NULL, NULL);
+	ip6_route_add(&rtmsg, NULL, NULL, NULL, RT6_TABLE_MAIN);
 }
 
 static void addrconf_add_lroute(struct net_device *dev)


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
  2006-07-26 22:00 ` [PATCH 2/5] [IPV6]: Multiple Routing Tables Thomas Graf
@ 2006-07-26 22:39   ` David Miller
  2006-07-26 22:48     ` Thomas Graf
  2006-07-29  4:13   ` YOSHIFUJI Hideaki / 吉藤英明
  2006-07-31 13:55   ` Ville Nuorvala
  2 siblings, 1 reply; 54+ messages in thread
From: David Miller @ 2006-07-26 22:39 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, vnuorval, usagi-core, yoshfuji, anttit

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 27 Jul 2006 00:00:02 +0200

> Adds the framework to support multiple IPv6 routing tables.
> Currently all automatically generated routes are put into the
> same table. This could be changed at a later point after
> considering the produced locking overhead.
> 
> When locating routes for redirects only the main table is
> searched for now. Since policy rules will not be reversible
> it is unclear whether it makes sense to change this.
> 
> Signed-off-by: Thomas Graf <tgraf@suug.ch>

This looks good, and it seems we even fixed a bug:

> @@ -586,7 +612,7 @@ restart:
>  
>  	dst_hold(&rt->u.dst);
>  	if (nrt) {
> -		err = ip6_ins_rt(nrt, NULL, NULL, &NETLINK_CB(skb));
> +		err = ip6_ins_rt(nrt, NULL, NULL, NULL);
>  		if (!err)
>  			goto out2;
>  	}

Wow, were we corrupting the IP6CB() of input packets on every
route lookup that hit this path?

I'm probably the one who put that erroneous &NETLINK_CB(skb) there.
Sorry :)

Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
  2006-07-26 22:39   ` David Miller
@ 2006-07-26 22:48     ` Thomas Graf
  2006-07-26 22:55       ` David Miller
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Graf @ 2006-07-26 22:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, vnuorval, usagi-core, yoshfuji, anttit

* David Miller <davem@davemloft.net> 2006-07-26 15:39
> Wow, were we corrupting the IP6CB() of input packets on every
> route lookup that hit this path?

I found that via compile error as passing on skb to this context
didn't make sense when I rewrote the code around it :-)

Fortunately it's only used read-only when notifying to figure
out the destination pid but that code seems bogus as well and
we can probably remove the netlink cb passing hell throughout
all the code there.
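
To make the concern concrete, here is a small userspace sketch. It is not kernel code: toy_skb, toy_inet6_cb and toy_netlink_cb are made-up stand-ins, and the only thing carried over is the assumption that skb->cb is a single scratch area which each layer reinterprets with its own struct.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for sk_buff: one scratch area shared by all layers. */
struct toy_skb {
	unsigned char cb[48];
};

/* What an IPv6 input path might keep in the scratch area (made up). */
struct toy_inet6_cb {
	int32_t iif;
	uint16_t flags;
};

/* What netlink might keep in the very same bytes (made up). */
struct toy_netlink_cb {
	uint32_t pid;
	uint32_t dst_group;
};

/* IPv6 side: store the input interface index in the scratch area. */
static void toy_set_iif(struct toy_skb *skb, int32_t iif)
{
	struct toy_inet6_cb cb = { .iif = iif, .flags = 0 };

	assert(sizeof(cb) <= sizeof(skb->cb));
	memcpy(skb->cb, &cb, sizeof(cb));
}

/* Netlink side: blindly read a "pid" from whatever is in the scratch area. */
static uint32_t toy_netlink_pid(const struct toy_skb *skb)
{
	struct toy_netlink_cb cb;

	memcpy(&cb, skb->cb, sizeof(cb));
	return cb.pid;
}
```

Calling toy_set_iif() on a zeroed toy_skb and then toy_netlink_pid() on it hands back the interface index as if it were a pid, which is the kind of confusion the removed &NETLINK_CB(skb) invited on the input path.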


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
  2006-07-26 22:48     ` Thomas Graf
@ 2006-07-26 22:55       ` David Miller
  0 siblings, 0 replies; 54+ messages in thread
From: David Miller @ 2006-07-26 22:55 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, vnuorval, usagi-core, yoshfuji, anttit

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 27 Jul 2006 00:48:48 +0200

> Fortunately it's only used read-only when notifying to figure
> out the destination pid but that code seems bogus as well and
> we can probably remove the netlink cb passing hell throughout
> all the code there.

Yes, that pid handling looks very strange.  But I would
search the history of this code that passes the netlink
skb parms around before making any changes to things like
inet6_rt_notify :)

It appears to have been added in 2.4.32, the original patch passes the
raw pid around as the argument but that got changed to the current
form which passes the netlink_skb_parms.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
  2006-07-26 22:00 ` [PATCH 2/5] [IPV6]: Multiple Routing Tables Thomas Graf
  2006-07-26 22:39   ` David Miller
@ 2006-07-29  4:13   ` YOSHIFUJI Hideaki / 吉藤英明
  2006-07-29  4:14     ` David Miller
  2006-07-29 10:29     ` Thomas Graf
  2006-07-31 13:55   ` Ville Nuorvala
  2 siblings, 2 replies; 54+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2006-07-29  4:13 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, vnuorval, usagi-core, davem, anttit, yoshfuji

Hello.

In article <20060726221849.495778268@postel.suug.ch> (at Thu, 27 Jul 2006 00:00:02 +0200), Thomas Graf <tgraf@suug.ch> says:

> Adds the framework to support multiple IPv6 routing tables.

Well, one design consideration that I have had for several months
is performance impact.

Previously, we directly use address, ifindex etc., not flowi,
in IPv6 routing code except for ip6_route_output().
This patch changes them to use flowi.
I know this should work and it is a good way for abstraction.
However, initializing flowi for IPv6 is more expensive than
one for IPv4, and it would result in poor performance (especially
without CONFIG_IPV6_MULTIPLE_TABLES).

Am I too cautious?
Should we eat this?
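
A rough userspace sketch of where the extra work would come from. The structs below are hypothetical, simplified flow keys (the real struct flowi carries more fields), and the sketch only shows the difference in size and copying, not a measurement.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified flow keys; not the kernel's struct flowi. */
struct toy_flow4 {
	int oif;
	int iif;
	uint32_t daddr;		/* 32-bit IPv4 address */
	uint32_t saddr;
};

struct toy_flow6 {
	int oif;
	int iif;
	uint8_t daddr[16];	/* 128-bit IPv6 address */
	uint8_t saddr[16];
	uint32_t flowlabel;
};

/* Build an IPv6 key the way a lookup wrapper might: clear it, then copy
 * both 16-byte addresses in before the lookup can even start. */
static void toy_flow6_init(struct toy_flow6 *fl, int iif,
			   const uint8_t daddr[16], const uint8_t saddr[16])
{
	memset(fl, 0, sizeof(*fl));
	fl->iif = iif;
	memcpy(fl->daddr, daddr, 16);
	memcpy(fl->saddr, saddr, 16);
}
```

Whether clearing and filling the larger key shows up in practice is exactly what would need to be measured.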

--yoshfuji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
  2006-07-29  4:13   ` YOSHIFUJI Hideaki / 吉藤英明
@ 2006-07-29  4:14     ` David Miller
  2006-07-29  4:28       ` YOSHIFUJI Hideaki / 吉藤英明
  2006-07-29 10:29     ` Thomas Graf
  1 sibling, 1 reply; 54+ messages in thread
From: David Miller @ 2006-07-29  4:14 UTC (permalink / raw)
  To: yoshfuji; +Cc: tgraf, netdev, vnuorval, usagi-core, anttit

From: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Date: Sat, 29 Jul 2006 13:13:06 +0900 (JST)

> Previously, we directly use address, ifindex etc., not flowi,
> in IPv6 routing code except for ip6_route_output().
> This patch changes them to use flowi.
> I know this should work and it is a good way for abstraction.
> However, initializing flowi for IPv6 is more expensive than
> one for IPv4, and it would result in poor performance (especially
> without CONFIG_IPV6_MULTIPLE_TABLES).
> 
> Am I too cautious?
> Should we eat this?

I think it is a legitimate consideration.

For now I would suggest we use Thomas's approach, and we
can make performance measurements to determine if it makes
sense to optimize this or consider other methods.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
  2006-07-29  4:14     ` David Miller
@ 2006-07-29  4:28       ` YOSHIFUJI Hideaki / 吉藤英明
  0 siblings, 0 replies; 54+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2006-07-29  4:28 UTC (permalink / raw)
  To: davem; +Cc: tgraf, netdev, vnuorval, usagi-core, anttit, yoshfuji

In article <20060728.211428.26968121.davem@davemloft.net> (at Fri, 28 Jul 2006 21:14:28 -0700 (PDT)), David Miller <davem@davemloft.net> says:

> For now I would suggest we use Thomas's approach, and we
> can make performance measurements to determine if it makes
> sense to optimize this or consider other methods.

Okay.

--yoshfuji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
  2006-07-29  4:13   ` YOSHIFUJI Hideaki / 吉藤英明
  2006-07-29  4:14     ` David Miller
@ 2006-07-29 10:29     ` Thomas Graf
  1 sibling, 0 replies; 54+ messages in thread
From: Thomas Graf @ 2006-07-29 10:29 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki / 吉藤英明
  Cc: netdev, vnuorval, usagi-core, davem, anttit

* YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> 2006-07-29 13:13
> Well, one design consideration that I have had for several months
> is performance impact.
> 
> Previously, we directly use address, ifindex etc., not flowi,
> in IPv6 routing code except for ip6_route_output().
> This patch changes them to use flowi.
> I know this should work and it is a good way for abstraction.
> However, initializing flowi for IPv6 is more expensive than
> one for IPv4, and it would result in poor performance (especially
> without CONFIG_IPV6_MULTIPLE_TABLES).

Do you have numbers for this? Personally I'm more worried
about the number of locks needed for a lookup and the
selection.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
  2006-07-26 22:00 ` [PATCH 2/5] [IPV6]: Multiple Routing Tables Thomas Graf
  2006-07-26 22:39   ` David Miller
  2006-07-29  4:13   ` YOSHIFUJI Hideaki / 吉藤英明
@ 2006-07-31 13:55   ` Ville Nuorvala
  2006-07-31 14:01     ` Herbert Xu
  2006-07-31 15:34     ` Thomas Graf
  2 siblings, 2 replies; 54+ messages in thread
From: Ville Nuorvala @ 2006-07-31 13:55 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, usagi-core, yoshfuji, davem, anttit

Thomas Graf wrote:

> Adds the framework to support multiple IPv6 routing tables.
> Currently all automatically generated routes are put into the
> same table. This could be changed at a later point after
> considering the produced locking overhead.

Hi Thomas, some minor comments below.

> When locating routes for redirects only the main table is
> searched for now. Since policy rules will not be reversible
> it is unclear whether it makes sense to change this.

This is a good point. You are absolutely correct about the policy rules.

IIRC, I initially looked through all the tables, but skipped this
behavior when I rewrote the code for 2.6.11. Currently I'm once again
in favor of looping through them all. This is IMO at least closer to the
spirit of RFC 2461 section 8.3. where a host SHOULD update its
destination cache upon receiving a redirect. If we don't look through
all tables, we can't ensure this happens.

> Index: net-2.6.git/include/net/ip6_fib.h
> ===================================================================
> --- net-2.6.git.orig/include/net/ip6_fib.h
> +++ net-2.6.git/include/net/ip6_fib.h

<snip>

> @@ -143,12 +146,41 @@ struct rt6_statistics {
>  
>  typedef void			(*f_pnode)(struct fib6_node *fn, void *);
>  
> -extern struct fib6_node		ip6_routing_table;
> +struct fib6_table {
> +	struct hlist_node	tb6_hlist;
> +	u32			tb6_id;
> +	rwlock_t		tb6_lock;
> +	struct fib6_node	tb6_root;
> +};
> +
> +#define RT6_TABLE_UNSPEC	RT_TABLE_UNSPEC
> +#define RT6_TABLE_MAIN		RT_TABLE_MAIN
> +#define RT6_TABLE_LOCAL		RT6_TABLE_MAIN
> +#define RT6_TABLE_DFLT		RT6_TABLE_MAIN
> +#define RT6_TABLE_INFO		RT6_TABLE_MAIN

IMO it's a bit inconsistent to define a separate table entry for Route
Information generated routes, but not Prefix Information based ones.
What do you say about adding a RT6_TABLE_PRFX?

> Index: net-2.6.git/net/ipv6/route.c
> ===================================================================
> --- net-2.6.git.orig/net/ipv6/route.c
> +++ net-2.6.git/net/ipv6/route.c

<snip>

> @@ -1435,12 +1523,15 @@ static struct rt6_info *rt6_add_route_in
>  struct rt6_info *rt6_get_dflt_router(struct in6_addr *addr, struct net_device *dev)
>  {	
>  	struct rt6_info *rt;
> -	struct fib6_node *fn;
> +	struct fib6_table *table;
>  
> -	fn = &ip6_routing_table;
> +	/* TODO: It might be better to search all tables */
> +	table = fib6_get_table(RT6_TABLE_DFLT);

As long as the table for default routes is RT6_TABLE_DFLT and can't be
configured by the user, I think the correct behavior is just to search
RT6_TABLE_DFLT.

Otherwise it looks very good!

Regards,
Ville

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
  2006-07-31 13:55   ` Ville Nuorvala
@ 2006-07-31 14:01     ` Herbert Xu
  2006-07-31 14:02       ` Herbert Xu
  2006-07-31 15:41       ` Thomas Graf
  2006-07-31 15:34     ` Thomas Graf
  1 sibling, 2 replies; 54+ messages in thread
From: Herbert Xu @ 2006-07-31 14:01 UTC (permalink / raw)
  To: Ville Nuorvala; +Cc: tgraf, netdev, usagi-core, yoshfuji, davem, anttit

Ville Nuorvala <vnuorval@tcs.hut.fi> wrote:
> 
>> When locating routes for redirects only the main table is
>> searched for now. Since policy rules will not be reversible
>> it is unclear whether it makes sense to change this.
> 
> This is a good point. You are absolutely correct about the policy rules.
> 
> IIRC, I initially looked through all the tables, but skipped this
> behavior when I rewrote the code for 2.6.11. Currently I'm once again
> in favor of looping through them all. This is IMO at least closer to the
> spirit of RFC 2461 section 8.3. where a host SHOULD update its
> destination cache upon receiving a redirect. If we don't look through
> all tables, we can't ensure this happens.

Without a route cache, I think our only choice is to search through
all tables.  The same thing applies to PMTU updates as well.
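
As a sketch of what "search through all tables" would mean for a redirect, here is a userspace model. Everything in it is hypothetical (toy_table is a plain list rather than a trie, and toy_redirect_all() is not a function from the patch); it only shows the shape of the walk.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cached route: destination and current gateway. */
struct toy_route {
	uint8_t dst[16];
	uint8_t gw[16];
	struct toy_route *next;
};

/* Hypothetical table: just a list of routes, no trie. */
struct toy_table {
	struct toy_route *routes;
};

/* Apply a redirect for dst in every table, not only the main one.
 * Returns how many entries were updated. */
static int toy_redirect_all(struct toy_table *tables, size_t ntables,
			    const uint8_t dst[16], const uint8_t old_gw[16],
			    const uint8_t new_gw[16])
{
	int updated = 0;
	size_t i;

	for (i = 0; i < ntables; i++) {
		struct toy_route *rt;

		for (rt = tables[i].routes; rt != NULL; rt = rt->next) {
			if (memcmp(rt->dst, dst, 16) != 0)
				continue;
			/* Only accept a redirect sent by the current next hop. */
			if (memcmp(rt->gw, old_gw, 16) != 0)
				continue;
			memcpy(rt->gw, new_gw, 16);
			updated++;
		}
	}
	return updated;
}
```

The cost grows with the number of tables, which is why the same walk for every PMTU update is the more worrying case.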

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
  2006-07-31 14:01     ` Herbert Xu
@ 2006-07-31 14:02       ` Herbert Xu
  2006-07-31 15:41       ` Thomas Graf
  1 sibling, 0 replies; 54+ messages in thread
From: Herbert Xu @ 2006-07-31 14:02 UTC (permalink / raw)
  To: Ville Nuorvala; +Cc: tgraf, netdev, usagi-core, yoshfuji, davem, anttit

On Tue, Aug 01, 2006 at 12:01:03AM +1000, Herbert Xu wrote:
> Ville Nuorvala <vnuorval@tcs.hut.fi> wrote:
> > 
> >> When locating routes for redirects only the main table is
> >> searched for now. Since policy rules will not be reversible
> >> it is unclear whether it makes sense to change this.
> > 
> > This is a good point. You are absolutely correct about the policy rules.
> > 
> > IIRC, I initially looked through all the tables, but skipped this
> > behavior when I rewrote the code for 2.6.11. Currently I'm once again
> > in favor of looping through them all. This is IMO at least closer to the
> > spirit of RFC 2461 section 8.3. where a host SHOULD update its
> > destination cache upon receiving a redirect. If we don't look through
> > all tables, we can't ensure this happens.
> 
> Without a route cache, I think our only choice is to search through
> all tables.  The same thing applies to PMTU updates as well.

Actually, if we're adding policy routing, we should seriously consider
whether living without a routing cache is still viable or not because
the cost of a route lookup has just gone up.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
  2006-07-31 14:01     ` Herbert Xu
  2006-07-31 14:02       ` Herbert Xu
@ 2006-07-31 15:41       ` Thomas Graf
  2006-07-31 20:09         ` David Miller
  1 sibling, 1 reply; 54+ messages in thread
From: Thomas Graf @ 2006-07-31 15:41 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Ville Nuorvala, netdev, usagi-core, yoshfuji, davem, anttit

* Herbert Xu <herbert@gondor.apana.org.au> 2006-08-01 00:01
> Without a route cache, I think our only choice is to search through
> all tables.  The same thing applies to PMTU updates as well.

I think PMTU etc. should be moved out of the route into
some form of flow cache. It's currently using rt6_lookup()
which even goes through the rules.

Doing a few thousand trie lookups after Patrick's changes
in the worst case for every redirect might be acceptable
but doing so for every PMTU update could become an issue.

> Actually, if we're adding policy routing, we should seriously consider
> whether living without a routing cache is still viable or not because
> the cost of a route lookup has just gone up.

Absolutely.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
  2006-07-31 15:41       ` Thomas Graf
@ 2006-07-31 20:09         ` David Miller
  0 siblings, 0 replies; 54+ messages in thread
From: David Miller @ 2006-07-31 20:09 UTC (permalink / raw)
  To: tgraf; +Cc: herbert, vnuorval, netdev, usagi-core, yoshfuji, anttit

From: Thomas Graf <tgraf@suug.ch>
Date: Mon, 31 Jul 2006 17:41:42 +0200

> * Herbert Xu <herbert@gondor.apana.org.au> 2006-08-01 00:01
> > Actually, if we're adding policy routing, we should seriously consider
> > whether living without a routing cache is still viable or not because
> > the cost of a route lookup has just gone up.
> 
> Absolutely.

This is something I wanted to bring up too.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
  2006-07-31 13:55   ` Ville Nuorvala
  2006-07-31 14:01     ` Herbert Xu
@ 2006-07-31 15:34     ` Thomas Graf
  1 sibling, 0 replies; 54+ messages in thread
From: Thomas Graf @ 2006-07-31 15:34 UTC (permalink / raw)
  To: Ville Nuorvala; +Cc: netdev, usagi-core, yoshfuji, davem, anttit

* Ville Nuorvala <vnuorval@tcs.hut.fi> 2006-07-31 16:55
> > When locating routes for redirects only the main table is
> > searched for now. Since policy rules will not be reversible
> > it is unclear whether it makes sense to change this.
> 
> This is a good point. You are absolutely correct about the policy rules.
> 
> IIRC, I initially looked through all the tables, but skipped this
> behavior when I rewrote the code for 2.6.11. Currently I'm once again
> in favor of looping through them all. This is IMO at least closer to the
> spirit of RFC 2461 section 8.3. where a host SHOULD update its
> destination cache upon receiving a redirect. If we don't look through
> all tables, we can't ensure this happens.

I agree, it will depend on what way is being followed regarding a
flow cache or route cache.

> > +#define RT6_TABLE_UNSPEC	RT_TABLE_UNSPEC
> > +#define RT6_TABLE_MAIN		RT_TABLE_MAIN
> > +#define RT6_TABLE_LOCAL		RT6_TABLE_MAIN
> > +#define RT6_TABLE_DFLT		RT6_TABLE_MAIN
> > +#define RT6_TABLE_INFO		RT6_TABLE_MAIN
> 
> IMO it's a bit inconsistent to define a separate table entry for Route
> Information generated routes, but not Prefix Information based ones.
> What do you say about adding a RT6_TABLE_PRFX?

Sounds good.
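
A sketch of what the addition might look like, assuming RT6_TABLE_PRFX simply aliases the main table like the other entries. The RT_TABLE_* values are repeated from <linux/rtnetlink.h> to keep the fragment self-contained; toy_origin and toy_table_for() are hypothetical and only show how callers could pick a table by route origin.

```c
/* Values as in <linux/rtnetlink.h>; repeated here so the sketch
 * compiles on its own. */
#define RT_TABLE_UNSPEC		0
#define RT_TABLE_MAIN		254

#define RT6_TABLE_UNSPEC	RT_TABLE_UNSPEC
#define RT6_TABLE_MAIN		RT_TABLE_MAIN
#define RT6_TABLE_LOCAL		RT6_TABLE_MAIN
#define RT6_TABLE_DFLT		RT6_TABLE_MAIN
#define RT6_TABLE_INFO		RT6_TABLE_MAIN
#define RT6_TABLE_PRFX		RT6_TABLE_MAIN	/* suggested addition */

/* Hypothetical: where an automatically generated route came from. */
enum toy_origin {
	TOY_FROM_RA_DEFAULT,	/* default router from an RA */
	TOY_FROM_ROUTE_INFO,	/* Route Information option */
	TOY_FROM_PREFIX_INFO,	/* Prefix Information option */
};

/* Hypothetical helper: pick the table for an autoconfigured route. */
static unsigned int toy_table_for(enum toy_origin origin)
{
	switch (origin) {
	case TOY_FROM_RA_DEFAULT:
		return RT6_TABLE_DFLT;
	case TOY_FROM_ROUTE_INFO:
		return RT6_TABLE_INFO;
	case TOY_FROM_PREFIX_INFO:
		return RT6_TABLE_PRFX;
	}
	return RT6_TABLE_MAIN;
}
```

With all aliases pointing at the main table nothing changes in behavior; the separate name only leaves room to move prefix routes later.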

> > @@ -1435,12 +1523,15 @@ static struct rt6_info *rt6_add_route_in
> >  struct rt6_info *rt6_get_dflt_router(struct in6_addr *addr, struct net_device *dev)
> >  {	
> >  	struct rt6_info *rt;
> > -	struct fib6_node *fn;
> > +	struct fib6_table *table;
> >  
> > -	fn = &ip6_routing_table;
> > +	/* TODO: It might be better to search all tables */
> > +	table = fib6_get_table(RT6_TABLE_DFLT);
> 
> As long as the table for default routes is RT6_TABLE_DFLT and can't be
> configured by the user, I think the correct behavior is just to search
> RT6_TABLE_DFLT.

I agree, I intended to remove that comment but missed it.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-26 22:11 [RFC] Multiple IPV6 Routing Tables & Policy Routing Thomas Graf
  2006-07-26 22:00 ` [PATCH 1/5] [IPV6]: Remove ndiscs rt6_lock dependency Thomas Graf
  2006-07-26 22:00 ` [PATCH 2/5] [IPV6]: Multiple Routing Tables Thomas Graf
@ 2006-07-26 22:00 ` Thomas Graf
  2006-07-26 22:41   ` David Miller
  2006-07-27  5:58   ` James Morris
  2006-07-26 22:00 ` [PATCH 4/5] [IPV6]: Policy Routing Rules Thomas Graf
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 54+ messages in thread
From: Thomas Graf @ 2006-07-26 22:00 UTC (permalink / raw)
  To: netdev; +Cc: vnuorval, usagi-core, yoshfuji, davem, anttit

[-- Attachment #1: generic_fib_rules --]
[-- Type: text/plain, Size: 15525 bytes --]

Derived from net/ipv6/fib_rules.c

Signed-off-by: Thomas Graf <tgraf@suug.ch>
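
For readers new to the rules walk, a userspace sketch of the contract fib_rules_lookup() relies on: rules are tried in order of preference, an action returning -EAGAIN means "try the next rule", and running out of rules yields -ENETUNREACH. The toy_* names are hypothetical, the match is reduced to the input interface, and toy_action() just pretends that only table 254 holds a route.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down rule: match on input interface only. */
struct toy_rule {
	uint32_t pref;		/* lower value is tried first */
	int ifindex;		/* 0 matches any interface */
	uint32_t table;		/* table handed to the action */
};

/* Hypothetical action: pretend only table 254 holds a matching route. */
static int toy_action(const struct toy_rule *rule)
{
	return rule->table == 254 ? 0 : -EAGAIN;
}

/* Walk rules (already sorted by pref). -EAGAIN from the action means
 * "nothing in this table, try the next rule"; any other result ends
 * the walk. No rule left means the destination is unreachable. */
static int toy_rules_lookup(const struct toy_rule *rules, size_t n,
			    int iif, const struct toy_rule **hit)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int err;

		if (rules[i].ifindex && rules[i].ifindex != iif)
			continue;

		err = toy_action(&rules[i]);
		if (err != -EAGAIN) {
			*hit = &rules[i];
			return err;
		}
	}
	return -ENETUNREACH;
}
```

The sketch leaves out locking and reference counting entirely; those are what the real lookup has to get right around the same loop.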

Index: net-2.6.git/include/linux/fib_rules.h
===================================================================
--- /dev/null
+++ net-2.6.git/include/linux/fib_rules.h
@@ -0,0 +1,60 @@
+#ifndef __LINUX_FIB_RULES_H
+#define __LINUX_FIB_RULES_H
+
+#include <linux/types.h>
+#include <linux/rtnetlink.h>
+
+/* rule is permanent, and cannot be deleted */
+#define FIB_RULE_PERMANENT	1
+
+struct fib_rule_hdr
+{
+	__u8		family;
+	__u8		dst_len;
+	__u8		src_len;
+	__u8		tos;
+
+	__u8		table;
+	__u8		res1;	/* reserved */
+	__u8		res2;	/* reserved */
+	__u8		action;
+
+	__u32		flags;
+};
+
+enum
+{
+	FRA_UNSPEC,
+	FRA_DST,	/* destination address */
+	FRA_SRC,	/* source address */
+	FRA_IFNAME,	/* interface name */
+	FRA_UNUSED1,
+	FRA_UNUSED2,
+	FRA_PRIORITY,	/* priority/preference */
+	FRA_UNUSED3,
+	FRA_UNUSED4,
+	FRA_UNUSED5,
+	FRA_FWMARK,	/* netfilter mark (IPv4) */
+	FRA_FLOW,	/* flow/class id */
+	__FRA_MAX
+};
+
+#define FRA_MAX (__FRA_MAX - 1)
+
+enum
+{
+	FR_ACT_UNSPEC,
+	FR_ACT_TO_TBL,		/* Pass to fixed table */
+	FR_ACT_RES1,
+	FR_ACT_RES2,
+	FR_ACT_RES3,
+	FR_ACT_RES4,
+	FR_ACT_BLACKHOLE,	/* Drop without notification */
+	FR_ACT_UNREACHABLE,	/* Drop with ENETUNREACH */
+	FR_ACT_PROHIBIT,	/* Drop with EACCES */
+	__FR_ACT_MAX,
+};
+
+#define FR_ACT_MAX (__FR_ACT_MAX - 1)
+
+#endif
Index: net-2.6.git/include/net/fib_rules.h
===================================================================
--- /dev/null
+++ net-2.6.git/include/net/fib_rules.h
@@ -0,0 +1,89 @@
+#ifndef __NET_FIB_RULES_H
+#define __NET_FIB_RULES_H
+
+#include <linux/types.h>
+#include <linux/netdevice.h>
+#include <linux/fib_rules.h>
+#include <net/flow.h>
+#include <net/netlink.h>
+
+struct fib_rule
+{
+	struct list_head	list;
+	atomic_t		refcnt;
+	int			ifindex;
+	char			ifname[IFNAMSIZ];
+	u32			pref;
+	u32			flags;
+	u32			table;
+	u8			action;
+	struct rcu_head		rcu;
+};
+
+struct fib_lookup_arg
+{
+	void			*lookup_ptr;
+	void			*result;
+	struct fib_rule		*rule;
+};
+
+struct fib_rules_ops
+{
+	int			family;
+	struct list_head	list;
+	int			rule_size;
+
+	int			(*action)(struct fib_rule *,
+					  struct flowi *, int,
+					  struct fib_lookup_arg *);
+	int			(*match)(struct fib_rule *,
+					 struct flowi *, int);
+	int			(*configure)(struct fib_rule *,
+					     struct sk_buff *,
+					     struct nlmsghdr *,
+					     struct fib_rule_hdr *,
+					     struct nlattr **);
+	int			(*compare)(struct fib_rule *,
+					   struct fib_rule_hdr *,
+					   struct nlattr **);
+	int			(*fill)(struct fib_rule *, struct sk_buff *,
+					struct nlmsghdr *,
+					struct fib_rule_hdr *);
+	u32			(*default_pref)(void);
+
+	struct nla_policy	*policy;
+	struct list_head	*rules_list;
+	struct module		*owner;
+};
+
+static inline void fib_rule_get(struct fib_rule *rule)
+{
+	atomic_inc(&rule->refcnt);
+}
+
+static inline void fib_rule_put_rcu(struct rcu_head *head)
+{
+	struct fib_rule *rule = container_of(head, struct fib_rule, rcu);
+	kfree(rule);
+}
+
+static inline void fib_rule_put(struct fib_rule *rule)
+{
+	if (atomic_dec_and_test(&rule->refcnt))
+		call_rcu(&rule->rcu, fib_rule_put_rcu);
+}
+
+extern int			fib_rules_register(struct fib_rules_ops *);
+extern int			fib_rules_unregister(struct fib_rules_ops *);
+
+extern int			fib_rules_lookup(struct fib_rules_ops *,
+						 struct flowi *, int flags,
+						 struct fib_lookup_arg *);
+
+extern int			fib_nl_newrule(struct sk_buff *,
+					       struct nlmsghdr *, void *);
+extern int			fib_nl_delrule(struct sk_buff *,
+					       struct nlmsghdr *, void *);
+extern int			fib_rules_dump(struct sk_buff *,
+					       struct netlink_callback *, int);
+#endif
Index: net-2.6.git/net/Kconfig
===================================================================
--- net-2.6.git.orig/net/Kconfig
+++ net-2.6.git/net/Kconfig
@@ -249,6 +249,9 @@ source "net/ieee80211/Kconfig"
 config WIRELESS_EXT
 	bool
 
+config FIB_RULES
+	bool
+
 endif   # if NET
 endmenu # Networking
 
Index: net-2.6.git/net/core/Makefile
===================================================================
--- net-2.6.git.orig/net/core/Makefile
+++ net-2.6.git/net/core/Makefile
@@ -17,3 +17,4 @@ obj-$(CONFIG_NET_PKTGEN) += pktgen.o
 obj-$(CONFIG_WIRELESS_EXT) += wireless.o
 obj-$(CONFIG_NETPOLL) += netpoll.o
 obj-$(CONFIG_NET_DMA) += user_dma.o
+obj-$(CONFIG_FIB_RULES) += fib_rules.o
Index: net-2.6.git/net/core/fib_rules.c
===================================================================
--- /dev/null
+++ net-2.6.git/net/core/fib_rules.c
@@ -0,0 +1,410 @@
+/*
+ * net/core/fib_rules.c		Generic Routing Rules
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License as
+ *	published by the Free Software Foundation, version 2.
+ *
+ * Authors:	Thomas Graf <tgraf@suug.ch>
+ */
+
+#include <linux/config.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <net/fib_rules.h>
+
+static LIST_HEAD(rules_ops);
+static DEFINE_SPINLOCK(rules_mod_lock);
+
+static void notify_rule_change(int event, struct fib_rule *rule,
+			       struct fib_rules_ops *ops);
+
+static struct fib_rules_ops *lookup_rules_ops(int family)
+{
+	struct fib_rules_ops *ops;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(ops, &rules_ops, list) {
+		if (ops->family == family) {
+			if (!try_module_get(ops->owner))
+				ops = NULL;
+			rcu_read_unlock();
+			return ops;
+		}
+	}
+	rcu_read_unlock();
+
+	return NULL;
+}
+
+static void rules_ops_put(struct fib_rules_ops *ops)
+{
+	if (ops)
+		module_put(ops->owner);
+}
+
+int fib_rules_register(struct fib_rules_ops *ops)
+{
+	int err = -EEXIST;
+	struct fib_rules_ops *o;
+
+	if (ops->rule_size < sizeof(struct fib_rule))
+		return -EINVAL;
+
+	if (ops->match == NULL || ops->configure == NULL ||
+	    ops->compare == NULL || ops->fill == NULL ||
+	    ops->action == NULL)
+		return -EINVAL;
+
+	spin_lock_bh(&rules_mod_lock);
+	list_for_each_entry(o, &rules_ops, list)
+		if (ops->family == o->family)
+			goto errout;
+
+	list_add_tail_rcu(&ops->list, &rules_ops);
+	err = 0;
+errout:
+	spin_unlock_bh(&rules_mod_lock);
+
+	return err;
+}
+
+static void cleanup_ops(struct fib_rules_ops *ops)
+{
+	struct fib_rule *rule, *tmp;
+
+	list_for_each_entry_safe(rule, tmp, ops->rules_list, list) {
+		list_del_rcu(&rule->list);
+		fib_rule_put(rule);
+	}
+}
+
+int fib_rules_unregister(struct fib_rules_ops *ops)
+{
+	int err = 0;
+	struct fib_rules_ops *o;
+
+	spin_lock_bh(&rules_mod_lock);
+	list_for_each_entry(o, &rules_ops, list) {
+		if (o == ops) {
+			list_del_rcu(&o->list);
+			cleanup_ops(ops);
+			goto out;
+		}
+	}
+
+	err = -ENOENT;
+out:
+	spin_unlock_bh(&rules_mod_lock);
+
+	synchronize_net();
+
+	return err;
+}
+
+int fib_rules_lookup(struct fib_rules_ops *ops, struct flowi *fl,
+		     int flags, struct fib_lookup_arg *arg)
+{
+	struct fib_rule *rule;
+	int err;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(rule, ops->rules_list, list) {
+		if (rule->ifname[0] && (rule->ifindex != fl->iif))
+			continue;
+
+		if (!ops->match(rule, fl, flags))
+			continue;
+
+		/* Keep the RCU read lock held across ->action();
+		 * the single unlock happens at "out". */
+		err = ops->action(rule, fl, flags, arg);
+		if (err != -EAGAIN) {
+			fib_rule_get(rule);
+			arg->rule = rule;
+			goto out;
+		}
+	}
+
+	err = -ENETUNREACH;
+out:
+	rcu_read_unlock();
+
+	return err;
+}
+
+int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
+{
+	struct fib_rule_hdr *frh = nlmsg_data(nlh);
+	struct fib_rules_ops *ops = NULL;
+	struct fib_rule *rule, *r, *last = NULL;
+	struct nlattr *tb[FRA_MAX+1];
+	int err = -EINVAL;
+
+	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
+		goto errout;
+
+	ops = lookup_rules_ops(frh->family);
+	if (ops == NULL) {
+		err = -EAFNOSUPPORT;
+		goto errout;
+	}
+
+	err = nlmsg_parse(nlh, sizeof(*frh), tb, FRA_MAX, ops->policy);
+	if (err < 0)
+		goto errout;
+
+	if (tb[FRA_IFNAME] && nla_len(tb[FRA_IFNAME]) > IFNAMSIZ)
+		goto errout;
+
+	rule = kmalloc(ops->rule_size, GFP_KERNEL);
+	if (rule == NULL) {
+		err = -ENOMEM;
+		goto errout;
+	}
+	memset(rule, 0, ops->rule_size);
+
+	if (tb[FRA_PRIORITY])
+		rule->pref = nla_get_u32(tb[FRA_PRIORITY]);
+
+	if (tb[FRA_IFNAME]) {
+		struct net_device *dev;
+
+		rule->ifindex = -1;
+		if (nla_strlcpy(rule->ifname, tb[FRA_IFNAME],
+				IFNAMSIZ) >= IFNAMSIZ)
+			goto errout_free;
+
+		dev = __dev_get_by_name(rule->ifname);
+		if (dev)
+			rule->ifindex = dev->ifindex;
+	}
+
+	rule->action = frh->action;
+	rule->flags = frh->flags;
+	rule->table = frh->table;
+
+	if (!rule->pref && ops->default_pref)
+		rule->pref = ops->default_pref();
+
+	err = ops->configure(rule, skb, nlh, frh, tb);
+	if (err < 0)
+		goto errout_free;
+
+	list_for_each_entry(r, ops->rules_list, list) {
+		if (r->pref > rule->pref)
+			break;
+		last = r;
+	}
+
+	fib_rule_get(rule);
+
+	if (last)
+		list_add_rcu(&rule->list, &last->list);
+	else
+		list_add_rcu(&rule->list, ops->rules_list);
+
+	notify_rule_change(RTM_NEWRULE, rule, ops);
+	rules_ops_put(ops);
+	return 0;
+
+errout_free:
+	kfree(rule);
+errout:
+	rules_ops_put(ops);
+	return err;
+}
+
+int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
+{
+	struct fib_rule_hdr *frh = nlmsg_data(nlh);
+	struct fib_rules_ops *ops = NULL;
+	struct fib_rule *rule;
+	struct nlattr *tb[FRA_MAX+1];
+	int err = -EINVAL;
+
+	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
+		goto errout;
+
+	ops = lookup_rules_ops(frh->family);
+	if (ops == NULL) {
+		err = -EAFNOSUPPORT;
+		goto errout;
+	}
+
+	err = nlmsg_parse(nlh, sizeof(*frh), tb, FRA_MAX, ops->policy);
+	if (err < 0)
+		goto errout;
+
+	list_for_each_entry(rule, ops->rules_list, list) {
+		if (frh->action && (frh->action != rule->action))
+			continue;
+
+		if (frh->table && (frh->table != rule->table))
+			continue;
+
+		if (tb[FRA_PRIORITY] &&
+		    (rule->pref != nla_get_u32(tb[FRA_PRIORITY])))
+			continue;
+
+		if (tb[FRA_IFNAME] &&
+		    nla_strcmp(tb[FRA_IFNAME], rule->ifname))
+			continue;
+
+		if (!ops->compare(rule, frh, tb))
+			continue;
+
+		if (rule->flags & FIB_RULE_PERMANENT) {
+			err = -EPERM;
+			goto errout;
+		}
+
+		list_del_rcu(&rule->list);
+		notify_rule_change(RTM_DELRULE, rule, ops);
+		fib_rule_put(rule);
+		rules_ops_put(ops);
+		return 0;
+	}
+
+	err = -ENOENT;
+errout:
+	rules_ops_put(ops);
+	return err;
+}
+
+static int fib_nl_fill_rule(struct sk_buff *skb, struct fib_rule *rule,
+			    u32 pid, u32 seq, int type, int flags,
+			    struct fib_rules_ops *ops)
+{
+	struct nlmsghdr *nlh;
+	struct fib_rule_hdr *frh;
+
+	nlh = nlmsg_put(skb, pid, seq, type, sizeof(*frh), flags);
+	if (nlh == NULL)
+		return -1;
+
+	frh = nlmsg_data(nlh);
+	frh->table = rule->table;
+	frh->res1 = 0;
+	frh->res2 = 0;
+	frh->action = rule->action;
+	frh->flags = rule->flags;
+
+	if (rule->ifname[0])
+		NLA_PUT_STRING(skb, FRA_IFNAME, rule->ifname);
+
+	if (rule->pref)
+		NLA_PUT_U32(skb, FRA_PRIORITY, rule->pref);
+
+	if (ops->fill(rule, skb, nlh, frh) < 0)
+		goto nla_put_failure;
+
+	return nlmsg_end(skb, nlh);
+
+nla_put_failure:
+	return nlmsg_cancel(skb, nlh);
+}
+
+int fib_rules_dump(struct sk_buff *skb, struct netlink_callback *cb, int family)
+{
+	int idx = 0;
+	struct fib_rule *rule;
+	struct fib_rules_ops *ops;
+
+	ops = lookup_rules_ops(family);
+	if (ops == NULL)
+		return -EAFNOSUPPORT;
+
+	rcu_read_lock();
+	list_for_each_entry(rule, ops->rules_list, list) {
+		if (idx < cb->args[0])
+			goto skip;
+
+		if (fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).pid,
+				     cb->nlh->nlmsg_seq, RTM_NEWRULE,
+				     NLM_F_MULTI, ops) < 0)
+			break;
+skip:
+		idx++;
+	}
+	rcu_read_unlock();
+	cb->args[0] = idx;
+	rules_ops_put(ops);
+
+	return skb->len;
+}
+
+static void notify_rule_change(int event, struct fib_rule *rule,
+			       struct fib_rules_ops *ops)
+{
+	int size = nlmsg_total_size(sizeof(struct fib_rule_hdr) + 128);
+	struct sk_buff *skb = alloc_skb(size, GFP_KERNEL);
+
+	if (skb == NULL)
+		netlink_set_err(rtnl, 0, RTNLGRP_IPV4_RULE, ENOBUFS);
+	else if (fib_nl_fill_rule(skb, rule, 0, 0, event, 0, ops) < 0) {
+		kfree_skb(skb);
+		netlink_set_err(rtnl, 0, RTNLGRP_IPV4_RULE, EINVAL);
+	} else
+		netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV4_RULE, GFP_KERNEL);
+}
+
+static void attach_rules(struct list_head *rules, struct net_device *dev)
+{
+	struct fib_rule *rule;
+
+	list_for_each_entry(rule, rules, list) {
+		if (rule->ifindex == -1 &&
+		    strcmp(dev->name, rule->ifname) == 0)
+			rule->ifindex = dev->ifindex;
+	}
+}
+
+static void detach_rules(struct list_head *rules, struct net_device *dev)
+{
+	struct fib_rule *rule;
+
+	list_for_each_entry(rule, rules, list)
+		if (rule->ifindex == dev->ifindex)
+			rule->ifindex = -1;
+}
+
+
+static int fib_rules_event(struct notifier_block *this, unsigned long event,
+			    void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct fib_rules_ops *ops;
+
+	ASSERT_RTNL();
+	rcu_read_lock();
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		list_for_each_entry(ops, &rules_ops, list)
+			attach_rules(ops->rules_list, dev);
+		break;
+
+	case NETDEV_UNREGISTER:
+		list_for_each_entry(ops, &rules_ops, list)
+			detach_rules(ops->rules_list, dev);
+		break;
+	}
+
+	rcu_read_unlock();
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block fib_rules_notifier = {
+	.notifier_call = fib_rules_event,
+};
+
+static int __init fib_rules_init(void)
+{
+	return register_netdevice_notifier(&fib_rules_notifier);
+}
+
+subsys_initcall(fib_rules_init);
Index: net-2.6.git/net/core/rtnetlink.c
===================================================================
--- net-2.6.git.orig/net/core/rtnetlink.c
+++ net-2.6.git/net/core/rtnetlink.c
@@ -49,6 +49,7 @@
 #include <net/udp.h>
 #include <net/sock.h>
 #include <net/pkt_sched.h>
+#include <net/fib_rules.h>
 #include <net/netlink.h>
 #ifdef CONFIG_NET_WIRELESS_RTNETLINK
 #include <linux/wireless.h>
@@ -103,7 +104,7 @@ static const int rtm_min[RTM_NR_FAMILIES
 	[RTM_FAM(RTM_NEWADDR)]      = NLMSG_LENGTH(sizeof(struct ifaddrmsg)),
 	[RTM_FAM(RTM_NEWROUTE)]     = NLMSG_LENGTH(sizeof(struct rtmsg)),
 	[RTM_FAM(RTM_NEWNEIGH)]     = NLMSG_LENGTH(sizeof(struct ndmsg)),
-	[RTM_FAM(RTM_NEWRULE)]      = NLMSG_LENGTH(sizeof(struct rtmsg)),
+	[RTM_FAM(RTM_NEWRULE)]      = NLMSG_LENGTH(sizeof(struct fib_rule_hdr)),
 	[RTM_FAM(RTM_NEWQDISC)]     = NLMSG_LENGTH(sizeof(struct tcmsg)),
 	[RTM_FAM(RTM_NEWTCLASS)]    = NLMSG_LENGTH(sizeof(struct tcmsg)),
 	[RTM_FAM(RTM_NEWTFILTER)]   = NLMSG_LENGTH(sizeof(struct tcmsg)),
@@ -120,7 +121,7 @@ static const int rta_max[RTM_NR_FAMILIES
 	[RTM_FAM(RTM_NEWADDR)]      = IFA_MAX,
 	[RTM_FAM(RTM_NEWROUTE)]     = RTA_MAX,
 	[RTM_FAM(RTM_NEWNEIGH)]     = NDA_MAX,
-	[RTM_FAM(RTM_NEWRULE)]      = RTA_MAX,
+	[RTM_FAM(RTM_NEWRULE)]      = FRA_MAX,
 	[RTM_FAM(RTM_NEWQDISC)]     = TCA_MAX,
 	[RTM_FAM(RTM_NEWTCLASS)]    = TCA_MAX,
 	[RTM_FAM(RTM_NEWTFILTER)]   = TCA_MAX,
@@ -744,6 +745,10 @@ static struct rtnetlink_link link_rtnetl
 	[RTM_NEWNEIGH    - RTM_BASE] = { .doit   = neigh_add		 },
 	[RTM_DELNEIGH    - RTM_BASE] = { .doit   = neigh_delete		 },
 	[RTM_GETNEIGH    - RTM_BASE] = { .dumpit = neigh_dump_info	 },
+#ifdef CONFIG_FIB_RULES
+	[RTM_NEWRULE     - RTM_BASE] = { .doit   = fib_nl_newrule	 },
+	[RTM_DELRULE     - RTM_BASE] = { .doit   = fib_nl_delrule	 },
+#endif
 	[RTM_GETRULE     - RTM_BASE] = { .dumpit = rtnetlink_dump_all	 },
 	[RTM_GETNEIGHTBL - RTM_BASE] = { .dumpit = neightbl_dump_info	 },
 	[RTM_SETNEIGHTBL - RTM_BASE] = { .doit   = neightbl_set		 },



* Re: [PATCH 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-26 22:00 ` [PATCH 3/5] [NET]: Protocol Independant Policy Routing Rules Framework Thomas Graf
@ 2006-07-26 22:41   ` David Miller
  2006-07-27  5:58   ` James Morris
  1 sibling, 0 replies; 54+ messages in thread
From: David Miller @ 2006-07-26 22:41 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, vnuorval, usagi-core, yoshfuji, anttit

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 27 Jul 2006 00:00:03 +0200

> Derived from net/ipv6/fib_rules.c
> 
> Signed-off-by: Thomas Graf <tgraf@suug.ch>

A very nice abstraction, looks great.

Signed-off-by: David S. Miller <davem@davemloft.net>


* Re: [PATCH 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-26 22:00 ` [PATCH 3/5] [NET]: Protocol Independant Policy Routing Rules Framework Thomas Graf
  2006-07-26 22:41   ` David Miller
@ 2006-07-27  5:58   ` James Morris
  2006-07-27  6:02     ` David Miller
  1 sibling, 1 reply; 54+ messages in thread
From: James Morris @ 2006-07-27  5:58 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, vnuorval, usagi-core, yoshfuji, davem, anttit

On Thu, 27 Jul 2006, Thomas Graf wrote:

> +	rule = kmalloc(ops->rule_size, GFP_KERNEL);
> +	if (rule == NULL) {
> +		err = -ENOMEM;
> +		goto errout;
> +	}
> +	memset(rule, 0, ops->rule_size);
> +

kzalloc() ? :-)



- James
-- 
James Morris
<jmorris@namei.org>


* Re: [PATCH 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-27  5:58   ` James Morris
@ 2006-07-27  6:02     ` David Miller
  2006-07-27 22:39       ` [RESEND " Thomas Graf
  0 siblings, 1 reply; 54+ messages in thread
From: David Miller @ 2006-07-27  6:02 UTC (permalink / raw)
  To: jmorris; +Cc: tgraf, netdev, vnuorval, usagi-core, yoshfuji, anttit

From: James Morris <jmorris@namei.org>
Date: Thu, 27 Jul 2006 01:58:58 -0400 (EDT)

> On Thu, 27 Jul 2006, Thomas Graf wrote:
> 
> > +	rule = kmalloc(ops->rule_size, GFP_KERNEL);
> > +	if (rule == NULL) {
> > +		err = -ENOMEM;
> > +		goto errout;
> > +	}
> > +	memset(rule, 0, ops->rule_size);
> > +
> 
> kzalloc() ? :-)

Good catch :)  He did use kzalloc() in a few other places, to his
credit.
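
For readers unfamiliar with the suggestion: kzalloc() allocates and
zeroes in one call, so the separate memset() goes away. A minimal
userspace sketch of the same idea follows. It assumes malloc() and
calloc() as stand-ins for kmalloc() and kzalloc(); the helper names
alloc_then_clear(), alloc_zeroed() and is_zeroed() are made up for
illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Allocate, then clear: the shape of the original hunk.
 * malloc() is only a userspace stand-in for kmalloc(). */
static void *alloc_then_clear(size_t size)
{
	void *p = malloc(size);

	if (p == NULL)
		return NULL;
	memset(p, 0, size);
	return p;
}

/* One zeroing allocation: calloc() stands in for kzalloc() here. */
static void *alloc_zeroed(size_t size)
{
	return calloc(1, size);
}

/* Returns 1 if the first size bytes at p are all zero, else 0. */
static int is_zeroed(const void *p, size_t size)
{
	const unsigned char *b = p;
	size_t i;

	for (i = 0; i < size; i++)
		if (b[i] != 0)
			return 0;
	return 1;
}
```

Both helpers hand back zero-filled memory; the second simply has one
less step to get wrong.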


* [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-27  6:02     ` David Miller
@ 2006-07-27 22:39       ` Thomas Graf
  2006-07-27 22:58         ` Patrick McHardy
                           ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Thomas Graf @ 2006-07-27 22:39 UTC (permalink / raw)
  To: David Miller; +Cc: jmorris, netdev, vnuorval, usagi-core, yoshfuji, anttit

Derived from net/ipv6/fib_rules.c

Signed-off-by: Thomas Graf <tgraf@suug.ch>

Index: net-2.6.git/include/linux/fib_rules.h
===================================================================
--- /dev/null
+++ net-2.6.git/include/linux/fib_rules.h
@@ -0,0 +1,60 @@
+#ifndef __LINUX_FIB_RULES_H
+#define __LINUX_FIB_RULES_H
+
+#include <linux/types.h>
+#include <linux/rtnetlink.h>
+
+/* rule is permanent, and cannot be deleted */
+#define FIB_RULE_PERMANENT	1
+
+struct fib_rule_hdr
+{
+	__u8		family;
+	__u8		dst_len;
+	__u8		src_len;
+	__u8		tos;
+
+	__u8		table;
+	__u8		res1;	/* reserved */
+	__u8		res2;	/* reserved */
+	__u8		action;
+
+	__u32		flags;
+};
+
+enum
+{
+	FRA_UNSPEC,
+	FRA_DST,	/* destination address */
+	FRA_SRC,	/* source address */
+	FRA_IFNAME,	/* interface name */
+	FRA_UNUSED1,
+	FRA_UNUSED2,
+	FRA_PRIORITY,	/* priority/preference */
+	FRA_UNUSED3,
+	FRA_UNUSED4,
+	FRA_UNUSED5,
+	FRA_FWMARK,	/* netfilter mark (IPv4) */
+	FRA_FLOW,	/* flow/class id */
+	__FRA_MAX
+};
+
+#define FRA_MAX (__FRA_MAX - 1)
+
+enum
+{
+	FR_ACT_UNSPEC,
+	FR_ACT_TO_TBL,		/* Pass to fixed table */
+	FR_ACT_RES1,
+	FR_ACT_RES2,
+	FR_ACT_RES3,
+	FR_ACT_RES4,
+	FR_ACT_BLACKHOLE,	/* Drop without notification */
+	FR_ACT_UNREACHABLE,	/* Drop with ENETUNREACH */
+	FR_ACT_PROHIBIT,	/* Drop with EACCES */
+	__FR_ACT_MAX,
+};
+
+#define FR_ACT_MAX (__FR_ACT_MAX - 1)
+
+#endif
Index: net-2.6.git/include/net/fib_rules.h
===================================================================
--- /dev/null
+++ net-2.6.git/include/net/fib_rules.h
@@ -0,0 +1,89 @@
+#ifndef __NET_FIB_RULES_H
+#define __NET_FIB_RULES_H
+
+#include <linux/types.h>
+#include <linux/netdevice.h>
+#include <linux/fib_rules.h>
+#include <net/flow.h>
+#include <net/netlink.h>
+
+struct fib_rule
+{
+	struct list_head	list;
+	atomic_t		refcnt;
+	int			ifindex;
+	char			ifname[IFNAMSIZ];
+	u32			pref;
+	u32			flags;
+	u32			table;
+	u8			action;
+	struct rcu_head		rcu;
+};
+
+struct fib_lookup_arg
+{
+	void			*lookup_ptr;
+	void			*result;
+	struct fib_rule		*rule;
+};
+
+struct fib_rules_ops
+{
+	int			family;
+	struct list_head	list;
+	int			rule_size;
+
+	int			(*action)(struct fib_rule *,
+					  struct flowi *, int,
+					  struct fib_lookup_arg *);
+	int			(*match)(struct fib_rule *,
+					 struct flowi *, int);
+	int			(*configure)(struct fib_rule *,
+					     struct sk_buff *,
+					     struct nlmsghdr *,
+					     struct fib_rule_hdr *,
+					     struct nlattr **);
+	int			(*compare)(struct fib_rule *,
+					   struct fib_rule_hdr *,
+					   struct nlattr **);
+	int			(*fill)(struct fib_rule *, struct sk_buff *,
+					struct nlmsghdr *,
+					struct fib_rule_hdr *);
+	u32			(*default_pref)(void);
+
+	struct nla_policy	*policy;
+	struct list_head	*rules_list;
+	struct module		*owner;
+};
+
+static inline void fib_rule_get(struct fib_rule *rule)
+{
+	atomic_inc(&rule->refcnt);
+}
+
+static inline void fib_rule_put_rcu(struct rcu_head *head)
+{
+	struct fib_rule *rule = container_of(head, struct fib_rule, rcu);
+	kfree(rule);
+}
+
+static inline void fib_rule_put(struct fib_rule *rule)
+{
+	if (atomic_dec_and_test(&rule->refcnt))
+		call_rcu(&rule->rcu, fib_rule_put_rcu);
+}
+
+extern int			fib_rules_register(struct fib_rules_ops *);
+extern int			fib_rules_unregister(struct fib_rules_ops *);
+
+extern int			fib_rules_lookup(struct fib_rules_ops *,
+						 struct flowi *, int flags,
+						 struct fib_lookup_arg *);
+
+extern int			fib_nl_newrule(struct sk_buff *,
+					       struct nlmsghdr *, void *);
+extern int			fib_nl_delrule(struct sk_buff *,
+					       struct nlmsghdr *, void *);
+extern int			fib_rules_dump(struct sk_buff *,
+					       struct netlink_callback *, int);
+#endif
Index: net-2.6.git/net/Kconfig
===================================================================
--- net-2.6.git.orig/net/Kconfig
+++ net-2.6.git/net/Kconfig
@@ -249,6 +249,9 @@ source "net/ieee80211/Kconfig"
 config WIRELESS_EXT
 	bool
 
+config FIB_RULES
+	bool
+
 endif   # if NET
 endmenu # Networking
 
Index: net-2.6.git/net/core/Makefile
===================================================================
--- net-2.6.git.orig/net/core/Makefile
+++ net-2.6.git/net/core/Makefile
@@ -17,3 +17,4 @@ obj-$(CONFIG_NET_PKTGEN) += pktgen.o
 obj-$(CONFIG_WIRELESS_EXT) += wireless.o
 obj-$(CONFIG_NETPOLL) += netpoll.o
 obj-$(CONFIG_NET_DMA) += user_dma.o
+obj-$(CONFIG_FIB_RULES) += fib_rules.o
Index: net-2.6.git/net/core/fib_rules.c
===================================================================
--- /dev/null
+++ net-2.6.git/net/core/fib_rules.c
@@ -0,0 +1,417 @@
+/*
+ * net/core/fib_rules.c		Generic Routing Rules
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License as
+ *	published by the Free Software Foundation, version 2.
+ *
+ * Authors:	Thomas Graf <tgraf@suug.ch>
+ */
+
+#include <linux/config.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <net/fib_rules.h>
+
+static LIST_HEAD(rules_ops);
+static DEFINE_SPINLOCK(rules_mod_lock);
+
+static void notify_rule_change(int event, struct fib_rule *rule,
+			       struct fib_rules_ops *ops);
+
+static struct fib_rules_ops *lookup_rules_ops(int family)
+{
+	struct fib_rules_ops *ops;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(ops, &rules_ops, list) {
+		if (ops->family == family) {
+			if (!try_module_get(ops->owner))
+				ops = NULL;
+			rcu_read_unlock();
+			return ops;
+		}
+	}
+	rcu_read_unlock();
+
+	return NULL;
+}
+
+static void rules_ops_put(struct fib_rules_ops *ops)
+{
+	if (ops)
+		module_put(ops->owner);
+}
+
+int fib_rules_register(struct fib_rules_ops *ops)
+{
+	int err = -EEXIST;
+	struct fib_rules_ops *o;
+
+	if (ops->rule_size < sizeof(struct fib_rule))
+		return -EINVAL;
+
+	if (ops->match == NULL || ops->configure == NULL ||
+	    ops->compare == NULL || ops->fill == NULL ||
+	    ops->action == NULL)
+		return -EINVAL;
+
+	spin_lock_bh(&rules_mod_lock);
+	list_for_each_entry(o, &rules_ops, list)
+		if (ops->family == o->family)
+			goto errout;
+
+	list_add_tail_rcu(&ops->list, &rules_ops);
+	err = 0;
+errout:
+	spin_unlock_bh(&rules_mod_lock);
+
+	return err;
+}
+
+EXPORT_SYMBOL_GPL(fib_rules_register);
+
+static void cleanup_ops(struct fib_rules_ops *ops)
+{
+	struct fib_rule *rule, *tmp;
+
+	list_for_each_entry_safe(rule, tmp, ops->rules_list, list) {
+		list_del_rcu(&rule->list);
+		fib_rule_put(rule);
+	}
+}
+
+int fib_rules_unregister(struct fib_rules_ops *ops)
+{
+	int err = 0;
+	struct fib_rules_ops *o;
+
+	spin_lock_bh(&rules_mod_lock);
+	list_for_each_entry(o, &rules_ops, list) {
+		if (o == ops) {
+			list_del_rcu(&o->list);
+			cleanup_ops(ops);
+			goto out;
+		}
+	}
+
+	err = -ENOENT;
+out:
+	spin_unlock_bh(&rules_mod_lock);
+
+	synchronize_net();
+
+	return err;
+}
+
+EXPORT_SYMBOL_GPL(fib_rules_unregister);
+
+int fib_rules_lookup(struct fib_rules_ops *ops, struct flowi *fl,
+		     int flags, struct fib_lookup_arg *arg)
+{
+	struct fib_rule *rule;
+	int err;
+
+	rcu_read_lock();
+
+	list_for_each_entry(rule, ops->rules_list, list) {
+		if (rule->ifname[0] && (rule->ifindex != fl->iif))
+			continue;
+
+		if (!ops->match(rule, fl, flags))
+			continue;
+
+		rcu_read_unlock();
+
+		err = ops->action(rule, fl, flags, arg);
+		if (err != -EAGAIN) {
+			fib_rule_get(rule);
+			arg->rule = rule;
+			goto out;
+		}
+	}
+
+	err = -ENETUNREACH;
+out:
+	rcu_read_unlock();
+
+	return err;
+}
+
+EXPORT_SYMBOL_GPL(fib_rules_lookup);
+
+int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
+{
+	struct fib_rule_hdr *frh = nlmsg_data(nlh);
+	struct fib_rules_ops *ops = NULL;
+	struct fib_rule *rule, *r, *last = NULL;
+	struct nlattr *tb[FRA_MAX+1];
+	int err = -EINVAL;
+
+	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
+		goto errout;
+
+	ops = lookup_rules_ops(frh->family);
+	if (ops == NULL) {
+		err = -EAFNOSUPPORT;
+		goto errout;
+	}
+
+	err = nlmsg_parse(nlh, sizeof(*frh), tb, FRA_MAX, ops->policy);
+	if (err < 0)
+		goto errout;
+
+	if (tb[FRA_IFNAME] && nla_len(tb[FRA_IFNAME]) > IFNAMSIZ)
+		goto errout;
+
+	rule = kzalloc(ops->rule_size, GFP_KERNEL);
+	if (rule == NULL) {
+		err = -ENOMEM;
+		goto errout;
+	}
+
+	if (tb[FRA_PRIORITY])
+		rule->pref = nla_get_u32(tb[FRA_PRIORITY]);
+
+	if (tb[FRA_IFNAME]) {
+		struct net_device *dev;
+
+		rule->ifindex = -1;
+		if (nla_strlcpy(rule->ifname, tb[FRA_IFNAME],
+				IFNAMSIZ) >= IFNAMSIZ)
+			goto errout_free;
+
+		dev = __dev_get_by_name(rule->ifname);
+		if (dev)
+			rule->ifindex = dev->ifindex;
+	}
+
+	rule->action = frh->action;
+	rule->flags = frh->flags;
+	rule->table = frh->table;
+
+	if (!rule->pref && ops->default_pref)
+		rule->pref = ops->default_pref();
+
+	err = ops->configure(rule, skb, nlh, frh, tb);
+	if (err < 0)
+		goto errout_free;
+
+	list_for_each_entry(r, ops->rules_list, list) {
+		if (r->pref > rule->pref)
+			break;
+		last = r;
+	}
+
+	fib_rule_get(rule);
+
+	if (last)
+		list_add_rcu(&rule->list, &last->list);
+	else
+		list_add_rcu(&rule->list, ops->rules_list);
+
+	notify_rule_change(RTM_NEWRULE, rule, ops);
+	rules_ops_put(ops);
+	return 0;
+
+errout_free:
+	kfree(rule);
+errout:
+	rules_ops_put(ops);
+	return err;
+}
+
+int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
+{
+	struct fib_rule_hdr *frh = nlmsg_data(nlh);
+	struct fib_rules_ops *ops = NULL;
+	struct fib_rule *rule;
+	struct nlattr *tb[FRA_MAX+1];
+	int err = -EINVAL;
+
+	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
+		goto errout;
+
+	ops = lookup_rules_ops(frh->family);
+	if (ops == NULL) {
+		err = -EAFNOSUPPORT;
+		goto errout;
+	}
+
+	err = nlmsg_parse(nlh, sizeof(*frh), tb, FRA_MAX, ops->policy);
+	if (err < 0)
+		goto errout;
+
+	list_for_each_entry(rule, ops->rules_list, list) {
+		if (frh->action && (frh->action != rule->action))
+			continue;
+
+		if (frh->table && (frh->table != rule->table))
+			continue;
+
+		if (tb[FRA_PRIORITY] &&
+		    (rule->pref != nla_get_u32(tb[FRA_PRIORITY])))
+			continue;
+
+		if (tb[FRA_IFNAME] &&
+		    nla_strcmp(tb[FRA_IFNAME], rule->ifname))
+			continue;
+
+		if (!ops->compare(rule, frh, tb))
+			continue;
+
+		if (rule->flags & FIB_RULE_PERMANENT) {
+			err = -EPERM;
+			goto errout;
+		}
+
+		list_del_rcu(&rule->list);
+		notify_rule_change(RTM_DELRULE, rule, ops);
+		fib_rule_put(rule);
+		rules_ops_put(ops);
+		return 0;
+	}
+
+	err = -ENOENT;
+errout:
+	rules_ops_put(ops);
+	return err;
+}
+
+static int fib_nl_fill_rule(struct sk_buff *skb, struct fib_rule *rule,
+			    u32 pid, u32 seq, int type, int flags,
+			    struct fib_rules_ops *ops)
+{
+	struct nlmsghdr *nlh;
+	struct fib_rule_hdr *frh;
+
+	nlh = nlmsg_put(skb, pid, seq, type, sizeof(*frh), flags);
+	if (nlh == NULL)
+		return -1;
+
+	frh = nlmsg_data(nlh);
+	frh->table = rule->table;
+	frh->res1 = 0;
+	frh->res2 = 0;
+	frh->action = rule->action;
+	frh->flags = rule->flags;
+
+	if (rule->ifname[0])
+		NLA_PUT_STRING(skb, FRA_IFNAME, rule->ifname);
+
+	if (rule->pref)
+		NLA_PUT_U32(skb, FRA_PRIORITY, rule->pref);
+
+	if (ops->fill(rule, skb, nlh, frh) < 0)
+		goto nla_put_failure;
+
+	return nlmsg_end(skb, nlh);
+
+nla_put_failure:
+	return nlmsg_cancel(skb, nlh);
+}
+
+int fib_rules_dump(struct sk_buff *skb, struct netlink_callback *cb, int family)
+{
+	int idx = 0;
+	struct fib_rule *rule;
+	struct fib_rules_ops *ops;
+
+	ops = lookup_rules_ops(family);
+	if (ops == NULL)
+		return -EAFNOSUPPORT;
+
+	rcu_read_lock();
+	list_for_each_entry(rule, ops->rules_list, list) {
+		if (idx < cb->args[0])
+			goto skip;
+
+		if (fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).pid,
+				     cb->nlh->nlmsg_seq, RTM_NEWRULE,
+				     NLM_F_MULTI, ops) < 0)
+			break;
+skip:
+		idx++;
+	}
+	rcu_read_unlock();
+	cb->args[0] = idx;
+	rules_ops_put(ops);
+
+	return skb->len;
+}
+
+EXPORT_SYMBOL_GPL(fib_rules_dump);
+
+static void notify_rule_change(int event, struct fib_rule *rule,
+			       struct fib_rules_ops *ops)
+{
+	int size = nlmsg_total_size(sizeof(struct fib_rule_hdr) + 128);
+	struct sk_buff *skb = alloc_skb(size, GFP_KERNEL);
+
+	if (skb == NULL)
+		netlink_set_err(rtnl, 0, RTNLGRP_IPV4_RULE, ENOBUFS);
+	else if (fib_nl_fill_rule(skb, rule, 0, 0, event, 0, ops) < 0) {
+		kfree_skb(skb);
+		netlink_set_err(rtnl, 0, RTNLGRP_IPV4_RULE, EINVAL);
+	} else
+		netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV4_RULE, GFP_KERNEL);
+}
+
+static void attach_rules(struct list_head *rules, struct net_device *dev)
+{
+	struct fib_rule *rule;
+
+	list_for_each_entry(rule, rules, list) {
+		if (rule->ifindex == -1 &&
+		    strcmp(dev->name, rule->ifname) == 0)
+			rule->ifindex = dev->ifindex;
+	}
+}
+
+static void detach_rules(struct list_head *rules, struct net_device *dev)
+{
+	struct fib_rule *rule;
+
+	list_for_each_entry(rule, rules, list)
+		if (rule->ifindex == dev->ifindex)
+			rule->ifindex = -1;
+}
+
+
+static int fib_rules_event(struct notifier_block *this, unsigned long event,
+			    void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct fib_rules_ops *ops;
+
+	ASSERT_RTNL();
+	rcu_read_lock();
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		list_for_each_entry(ops, &rules_ops, list)
+			attach_rules(ops->rules_list, dev);
+		break;
+
+	case NETDEV_UNREGISTER:
+		list_for_each_entry(ops, &rules_ops, list)
+			detach_rules(ops->rules_list, dev);
+		break;
+	}
+
+	rcu_read_unlock();
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block fib_rules_notifier = {
+	.notifier_call = fib_rules_event,
+};
+
+static int __init fib_rules_init(void)
+{
+	return register_netdevice_notifier(&fib_rules_notifier);
+}
+
+subsys_initcall(fib_rules_init);
Index: net-2.6.git/net/core/rtnetlink.c
===================================================================
--- net-2.6.git.orig/net/core/rtnetlink.c
+++ net-2.6.git/net/core/rtnetlink.c
@@ -49,6 +49,7 @@
 #include <net/udp.h>
 #include <net/sock.h>
 #include <net/pkt_sched.h>
+#include <net/fib_rules.h>
 #include <net/netlink.h>
 #ifdef CONFIG_NET_WIRELESS_RTNETLINK
 #include <linux/wireless.h>
@@ -103,7 +104,7 @@ static const int rtm_min[RTM_NR_FAMILIES
 	[RTM_FAM(RTM_NEWADDR)]      = NLMSG_LENGTH(sizeof(struct ifaddrmsg)),
 	[RTM_FAM(RTM_NEWROUTE)]     = NLMSG_LENGTH(sizeof(struct rtmsg)),
 	[RTM_FAM(RTM_NEWNEIGH)]     = NLMSG_LENGTH(sizeof(struct ndmsg)),
-	[RTM_FAM(RTM_NEWRULE)]      = NLMSG_LENGTH(sizeof(struct rtmsg)),
+	[RTM_FAM(RTM_NEWRULE)]      = NLMSG_LENGTH(sizeof(struct fib_rule_hdr)),
 	[RTM_FAM(RTM_NEWQDISC)]     = NLMSG_LENGTH(sizeof(struct tcmsg)),
 	[RTM_FAM(RTM_NEWTCLASS)]    = NLMSG_LENGTH(sizeof(struct tcmsg)),
 	[RTM_FAM(RTM_NEWTFILTER)]   = NLMSG_LENGTH(sizeof(struct tcmsg)),
@@ -120,7 +121,7 @@ static const int rta_max[RTM_NR_FAMILIES
 	[RTM_FAM(RTM_NEWADDR)]      = IFA_MAX,
 	[RTM_FAM(RTM_NEWROUTE)]     = RTA_MAX,
 	[RTM_FAM(RTM_NEWNEIGH)]     = NDA_MAX,
-	[RTM_FAM(RTM_NEWRULE)]      = RTA_MAX,
+	[RTM_FAM(RTM_NEWRULE)]      = FRA_MAX,
 	[RTM_FAM(RTM_NEWQDISC)]     = TCA_MAX,
 	[RTM_FAM(RTM_NEWTCLASS)]    = TCA_MAX,
 	[RTM_FAM(RTM_NEWTFILTER)]   = TCA_MAX,
@@ -744,6 +745,10 @@ static struct rtnetlink_link link_rtnetl
 	[RTM_NEWNEIGH    - RTM_BASE] = { .doit   = neigh_add		 },
 	[RTM_DELNEIGH    - RTM_BASE] = { .doit   = neigh_delete		 },
 	[RTM_GETNEIGH    - RTM_BASE] = { .dumpit = neigh_dump_info	 },
+#ifdef CONFIG_FIB_RULES
+	[RTM_NEWRULE     - RTM_BASE] = { .doit   = fib_nl_newrule	 },
+	[RTM_DELRULE     - RTM_BASE] = { .doit   = fib_nl_delrule	 },
+#endif
 	[RTM_GETRULE     - RTM_BASE] = { .dumpit = rtnetlink_dump_all	 },
 	[RTM_GETNEIGHTBL - RTM_BASE] = { .dumpit = neightbl_dump_info	 },
 	[RTM_SETNEIGHTBL - RTM_BASE] = { .doit   = neightbl_set		 },
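
The priority ordering done by fib_nl_newrule() in the patch above can
be sketched in userspace roughly as follows. This is only an
illustration under simplifying assumptions: struct rule is a
hypothetical singly linked stand-in for struct fib_rule and its
list_head, insert_by_pref() is a made-up name, and no RCU or locking
is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct fib_rule. */
struct rule {
	struct rule *next;
	unsigned int pref;	/* priority/preference, lower comes first */
};

/* Insert new_rule after the last rule whose pref is <= new_rule->pref.
 * The list stays sorted by pref, and rules with equal pref keep the
 * order in which they were added. */
static void insert_by_pref(struct rule **head, struct rule *new_rule)
{
	struct rule **link = head;

	/* Skip every rule that must stay in front of the new one. */
	while (*link != NULL && (*link)->pref <= new_rule->pref)
		link = &(*link)->next;

	new_rule->next = *link;
	*link = new_rule;
}
```

Walking with a pointer to the link avoids the separate "last" variable
and the special case for inserting at the head of the list.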


* Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-27 22:39       ` [RESEND " Thomas Graf
@ 2006-07-27 22:58         ` Patrick McHardy
  2006-07-27 23:17           ` David Miller
  2006-07-28  9:25           ` Martin Josefsson
  2006-07-27 23:30         ` Patrick McHardy
  2006-07-31 14:46         ` Ville Nuorvala
  2 siblings, 2 replies; 54+ messages in thread
From: Patrick McHardy @ 2006-07-27 22:58 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David Miller, jmorris, netdev, vnuorval, usagi-core, yoshfuji,
	anttit

Thomas Graf wrote:
> Derived from net/ipv6/fib_rules.c

This clashes with my routing table patch, guess we have to figure
out who should go first :)

> +int fib_rules_lookup(struct fib_rules_ops *ops, struct flowi *fl,
> +		     int flags, struct fib_lookup_arg *arg)
> +{
> +	struct fib_rule *rule;
> +	int err;
> +
> +	rcu_read_lock();
> +
> +	list_for_each_entry(rule, ops->rules_list, list) {

Shouldn't that be list_for_each_entry_rcu?

> +		if (rule->ifname[0] && (rule->ifindex != fl->iif))
> +			continue;
> +
> +		if (!ops->match(rule, fl, flags))
> +			continue;
> +
> +		rcu_read_unlock();
> +
> +		err = ops->action(rule, fl, flags, arg);
> +		if (err != -EAGAIN) {
> +			fib_rule_get(rule);
> +			arg->rule = rule;
> +			goto out;
> +		}
> +	}
> +
> +	err = -ENETUNREACH;
> +out:
> +	rcu_read_unlock();

rcu_read_unlock might get called multiple times in the list iteration
and once again here.
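
To illustrate the second point, here is a small userspace model of a
lookup loop that unlocks exactly once on every exit path. It is a
sketch under stated assumptions, not the patch's code: the
rcu_read_lock()/rcu_read_unlock() functions below are hypothetical
stubs that only count nesting depth, struct rule and lookup() are
simplified stand-ins, and the rule's result field stands in for the
return value of ops->action(). In the kernel the walk itself would use
list_for_each_entry_rcu().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stubs, not the kernel primitives: they only track
 * nesting depth so an unbalanced unlock trips the assert(). */
static int rcu_depth;

static void rcu_read_lock(void)
{
	rcu_depth++;
}

static void rcu_read_unlock(void)
{
	assert(rcu_depth > 0);	/* fires on an extra unlock */
	rcu_depth--;
}

/* Simplified rule on a singly linked list. */
struct rule {
	struct rule *next;
	int match;	/* non-zero if the rule matches the flow */
	int result;	/* what the action returns, e.g. 0 or -EAGAIN */
	int refcnt;
};

/* Walk the rules inside one read-side section; unlock exactly once. */
static int lookup(struct rule *head, struct rule **found)
{
	struct rule *r;
	int err = -ENETUNREACH;

	rcu_read_lock();
	for (r = head; r != NULL; r = r->next) {
		int ret;

		if (!r->match)
			continue;

		ret = r->result;	/* stands in for ops->action() */
		if (ret != -EAGAIN) {
			r->refcnt++;	/* take the reference before unlocking */
			*found = r;
			err = ret;
			break;
		}
	}
	rcu_read_unlock();

	return err;
}
```

Taking the reference while still inside the read-side section is what
keeps the rule usable after the unlock.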


* Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-27 22:58         ` Patrick McHardy
@ 2006-07-27 23:17           ` David Miller
  2006-07-27 23:31             ` Patrick McHardy
  2006-07-28  9:25           ` Martin Josefsson
  1 sibling, 1 reply; 54+ messages in thread
From: David Miller @ 2006-07-27 23:17 UTC (permalink / raw)
  To: kaber; +Cc: tgraf, jmorris, netdev, vnuorval, usagi-core, yoshfuji, anttit

From: Patrick McHardy <kaber@trash.net>
Date: Fri, 28 Jul 2006 00:58:49 +0200

> Thomas Graf wrote:
> > Derived from net/ipv6/fib_rules.c
> 
> This clashes with my routing table patch, guess we have to figure
> out who should go first :)

I think since USAGI has some work that depends on this, we
should get Thomas's stuff in first.

It shouldn't be a big deal to rework your >256 tables patch
against Thomas's should it?



* Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-27 23:17           ` David Miller
@ 2006-07-27 23:31             ` Patrick McHardy
  0 siblings, 0 replies; 54+ messages in thread
From: Patrick McHardy @ 2006-07-27 23:31 UTC (permalink / raw)
  To: David Miller
  Cc: tgraf, jmorris, netdev, vnuorval, usagi-core, yoshfuji, anttit

David Miller wrote:
> From: Patrick McHardy <kaber@trash.net>
> Date: Fri, 28 Jul 2006 00:58:49 +0200
> 
>>This clashes with my routing table patch, guess we have to figure
>>out who should go first :)
> 
> 
> I think since USAGI has some work that depends on this, we
> should get Thomas's stuff in first.

OK.

> It shouldn't be a big deal to rework your >256 tables patch
> against Thomas's should it?

No, that shouldn't be much work.


* Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-27 22:58         ` Patrick McHardy
  2006-07-27 23:17           ` David Miller
@ 2006-07-28  9:25           ` Martin Josefsson
  2006-07-29  1:40             ` Patrick McHardy
  1 sibling, 1 reply; 54+ messages in thread
From: Martin Josefsson @ 2006-07-28  9:25 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Thomas Graf, David Miller, jmorris, netdev, vnuorval, usagi-core,
	yoshfuji, anttit


On Fri, 2006-07-28 at 00:58 +0200, Patrick McHardy wrote:

> > +int fib_rules_lookup(struct fib_rules_ops *ops, struct flowi *fl,
> > +		     int flags, struct fib_lookup_arg *arg)
> > +{
> > +	struct fib_rule *rule;
> > +	int err;
> > +
> > +	rcu_read_lock();
> > +
> > +	list_for_each_entry(rule, ops->rules_list, list) {
> 
> Shouldn't that be list_for_each_entry_rcu?

Yes that's correct, it should.

> > +		if (rule->ifname[0] && (rule->ifindex != fl->iif))
> > +			continue;
> > +
> > +		if (!ops->match(rule, fl, flags))
> > +			continue;
> > +
> > +		rcu_read_unlock();
> > +
> > +		err = ops->action(rule, fl, flags, arg);
> > +		if (err != -EAGAIN) {
> > +			fib_rule_get(rule);
> > +			arg->rule = rule;
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	err = -ENETUNREACH;
> > +out:
> > +	rcu_read_unlock();
> 
> rcu_read_unlock might get called multiple times in the list iteration
> and once again here.

Yes, the rcu_read_unlock() in the list iteration is misplaced; it
shouldn't be there. Besides the unbalanced lock/unlocks, it suffers from
the general issue described below.

As a somewhat related note, I've just dug a bit through RCU land,
talked to dipankar and mckenney, and discovered that rcu_read_lock() /
rcu_read_unlock() aren't strictly needed in softirqs since preempt is
already disabled in softirqs. This means that you can use the result of
the RCU read-side critical section outside of the rcu_read_lock() /
rcu_read_unlock() section. BUT this changes with the -rt kernel, where
softirqs are preemptible and where rcu_read_lock() / rcu_read_unlock()
don't disable/enable preempt anymore, which means the RCU read-side
critical section is also preemptible. This means that we can get
preempted in the read-side critical section, but the resulting grace
period won't occur until rcu_read_unlock() is called, which means that
using results of a read-side critical section outside of the critical
section is just not going to work in softirqs in -rt kernels.
I'm sure Ingo has reviewed the RCU usage in softirqs, but I don't know
if there have been any changes in this area after his review.

-- 
/Martin


* Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-28  9:25           ` Martin Josefsson
@ 2006-07-29  1:40             ` Patrick McHardy
  2006-07-29  7:25               ` Martin Josefsson
  0 siblings, 1 reply; 54+ messages in thread
From: Patrick McHardy @ 2006-07-29  1:40 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Thomas Graf, David Miller, jmorris, netdev, vnuorval, usagi-core,
	yoshfuji, anttit

Martin Josefsson wrote:
> As a somewhat related note, I've just dug a bit through RCU land,
> talked to dipankar and mckenney, and discovered that rcu_read_lock() /
> rcu_read_unlock() aren't strictly needed in softirqs since preempt is
> already disabled in softirqs. This means that you can use the result of
> the rcu read-side critical section outside of the rcu_read_lock() /

That's true, but in this case the code is executed in both softirq
and user context. Using rcu_read_lock and still relying on softirq
properties outside the locked section is also very confusing in my
opinion.



* Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-29  1:40             ` Patrick McHardy
@ 2006-07-29  7:25               ` Martin Josefsson
  0 siblings, 0 replies; 54+ messages in thread
From: Martin Josefsson @ 2006-07-29  7:25 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Thomas Graf, David Miller, jmorris, netdev, vnuorval, usagi-core,
	yoshfuji, anttit

On Sat, 2006-07-29 at 03:40 +0200, Patrick McHardy wrote:
> Martin Josefsson wrote:
> > As a somewhat related note, I've just dug a bit through RCU land,
> > talked to dipankar and mckenney, and discovered that rcu_read_lock() /
> > rcu_read_unlock() aren't strictly needed in softirqs since preempt is
> > already disabled in softirqs. This means that you can use the result of
> > the rcu read-side critical section outside of the rcu_read_lock() /
> 
> That's true, but in this case the code is executed in both softirq
> and user context. Using rcu_read_lock and still relying on softirq
> properties outside the locked section is also very confusing in my
> opinion.

Yes, relying on the softirq properties is very fishy, especially since
they don't apply to -rt kernels and there might be other changes in
this area in the future. It's not recommended.

-- 
/Martin


* Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-27 22:39       ` [RESEND " Thomas Graf
  2006-07-27 22:58         ` Patrick McHardy
@ 2006-07-27 23:30         ` Patrick McHardy
  2006-07-28 10:23           ` Thomas Graf
  2006-07-31 14:46         ` Ville Nuorvala
  2 siblings, 1 reply; 54+ messages in thread
From: Patrick McHardy @ 2006-07-27 23:30 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David Miller, jmorris, netdev, vnuorval, usagi-core, yoshfuji,
	anttit

Thomas Graf wrote:
> --- /dev/null
> +++ net-2.6.git/net/core/fib_rules.c
> +int fib_rules_register(struct fib_rules_ops *ops)
> +{
> +	int err = -EEXIST;
> +	struct fib_rules_ops *o;
> +
> +	if (ops->rule_size < sizeof(struct fib_rule))
> +		return -EINVAL;
> +
> +	if (ops->match == NULL || ops->configure == NULL ||
> +	    ops->compare == NULL || ops->fill == NULL ||
> +	    ops->action == NULL)
> +		return -EINVAL;
> +
> +	spin_lock_bh(&rules_mod_lock);

This doesn't look like it needs bh protection.

> +	list_for_each_entry(o, &rules_ops, list)
> +		if (ops->family == o->family)
> +			goto errout;
> +
> +	list_add_tail_rcu(&ops->list, &rules_ops);
> +	err = 0;
> +errout:
> +	spin_unlock_bh(&rules_mod_lock);
> +
> +	return err;
> +}
> +
> +EXPORT_SYMBOL_GPL(fib_rules_register);
> +
> +static void cleanup_ops(struct fib_rules_ops *ops)
> +{
> +	struct fib_rule *rule, *tmp;
> +
> +	list_for_each_entry_safe(rule, tmp, ops->rules_list, list) {
> +		list_del_rcu(&rule->list);
> +		fib_rule_put(rule);
> +	}
> +}
> +
> +int fib_rules_unregister(struct fib_rules_ops *ops)
> +{
> +	int err = 0;
> +	struct fib_rules_ops *o;
> +
> +	spin_lock_bh(&rules_mod_lock);
> +	list_for_each_entry(o, &rules_ops, list) {
> +		if (o == ops) {
> +			list_del_rcu(&o->list);
> +			cleanup_ops(ops);
> +			goto out;
> +		}
> +	}
> +
> +	err = -ENOENT;
> +out:
> +	spin_unlock_bh(&rules_mod_lock);
> +
> +	synchronize_net();
> +
> +	return err;
> +}
> +
> +EXPORT_SYMBOL_GPL(fib_rules_unregister);
> +
> +int fib_rules_lookup(struct fib_rules_ops *ops, struct flowi *fl,
> +		     int flags, struct fib_lookup_arg *arg)
> +{
> +	struct fib_rule *rule;
> +	int err;
> +
> +	rcu_read_lock();
> +
> +	list_for_each_entry(rule, ops->rules_list, list) {
> +		if (rule->ifname[0] && (rule->ifindex != fl->iif))
> +			continue;

ifindex may be unset even if ifname is set (in case the interface
does not exist yet). In that case it will match falsely on
locally generated packets.

> +
> +		if (!ops->match(rule, fl, flags))
> +			continue;
> +
> +		rcu_read_unlock();
> +
> +		err = ops->action(rule, fl, flags, arg);
> +		if (err != -EAGAIN) {
> +			fib_rule_get(rule);
> +			arg->rule = rule;
> +			goto out;
> +		}

This seems to race with fib_nl_delrule:

CPU1			CPU2
			list_for_each_entry -> find matching entry
list_del_rcu
fib_rule_put
call_rcu(fib_rule_put_rcu)
			fib_rule_get

Moving fib_rule_get inside the rcu protected section and
calling synchronize_rcu before fib_rule_put in fib_nl_delrule
looks like the easiest fix.

> +	}
> +
> +	err = -ENETUNREACH;
> +out:
> +	rcu_read_unlock();
> +
> +	return err;
> +}
> +
> +EXPORT_SYMBOL_GPL(fib_rules_lookup);
> +
> +int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
> +{
> +	struct fib_rule_hdr *frh = nlmsg_data(nlh);
> +	struct fib_rules_ops *ops = NULL;
> +	struct fib_rule *rule, *r, *last = NULL;
> +	struct nlattr *tb[FRA_MAX+1];
> +	int err = -EINVAL;
> +
> +	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
> +		goto errout;
> +
> +	ops = lookup_rules_ops(frh->family);
> +	if (ops == NULL) {
> +		err = EAFNOSUPPORT;
> +		goto errout;
> +	}
> +
> +	err = nlmsg_parse(nlh, sizeof(*frh), tb, FRA_MAX, ops->policy);
> +	if (err < 0)
> +		goto errout;
> +
> +	if (tb[FRA_IFNAME] && nla_len(tb[FRA_IFNAME]) > IFNAMSIZ)
> +		goto errout;
> +
> +	rule = kzalloc(ops->rule_size, GFP_KERNEL);
> +	if (rule == NULL) {
> +		err = -ENOMEM;
> +		goto errout;
> +	}
> +
> +	if (tb[FRA_PRIORITY])
> +		rule->pref = nla_get_u32(tb[FRA_PRIORITY]);
> +
> +	if (tb[FRA_IFNAME]) {
> +		struct net_device *dev;
> +
> +		rule->ifindex = -1;
> +		if (nla_strlcpy(rule->ifname, tb[FRA_IFNAME],
> +				IFNAMSIZ) >= IFNAMSIZ)
> +			goto errout_free;
> +
> +		dev = __dev_get_by_name(rule->ifname);
> +		if (dev)
> +			rule->ifindex = dev->ifindex;
> +	}
> +
> +	rule->action = frh->action;
> +	rule->flags = frh->flags;
> +	rule->table = frh->table;
> +
> +	if (!rule->pref && ops->default_pref)
> +		rule->pref = ops->default_pref();
> +
> +	err = ops->configure(rule, skb, nlh, frh, tb);
> +	if (err < 0)
> +		goto errout_free;
> +
> +	list_for_each_entry(r, ops->rules_list, list) {
> +		if (r->pref > rule->pref)
> +			break;
> +		last = r;
> +	}
> +
> +	fib_rule_get(rule);
> +
> +	if (last)
> +		list_add_rcu(&rule->list, &last->list);
> +	else
> +		list_add_rcu(&rule->list, ops->rules_list);
> +
> +	notify_rule_change(RTM_NEWRULE, rule, ops);
> +	rules_ops_put(ops);
> +	return 0;
> +
> +errout_free:
> +	kfree(rule);
> +errout:
> +	rules_ops_put(ops);
> +	return err;
> +}
> +
> +int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
> +{
> +	struct fib_rule_hdr *frh = nlmsg_data(nlh);
> +	struct fib_rules_ops *ops = NULL;
> +	struct fib_rule *rule;
> +	struct nlattr *tb[FRA_MAX+1];
> +	int err = -EINVAL;
> +
> +	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
> +		goto errout;
> +
> +	ops = lookup_rules_ops(frh->family);
> +	if (ops == NULL) {
> +		err = EAFNOSUPPORT;
> +		goto errout;
> +	}
> +
> +	err = nlmsg_parse(nlh, sizeof(*frh), tb, FRA_MAX, ops->policy);
> +	if (err < 0)
> +		goto errout;
> +
> +	list_for_each_entry(rule, ops->rules_list, list) {
> +		if (frh->action && (frh->action != rule->action))
> +			continue;
> +
> +		if (frh->table && (frh->table != rule->table))
> +			continue;
> +
> +		if (tb[FRA_PRIORITY] &&
> +		    (rule->pref != nla_get_u32(tb[FRA_PRIORITY])))
> +			continue;
> +
> +		if (tb[FRA_IFNAME] &&
> +		    nla_strcmp(tb[FRA_IFNAME], rule->ifname))
> +			continue;
> +
> +		if (!ops->compare(rule, frh, tb))
> +			continue;
> +
> +		if (rule->flags & FIB_RULE_PERMANENT) {
> +			err = -EPERM;
> +			goto errout;
> +		}
> +
> +		list_del_rcu(&rule->list);
> +		notify_rule_change(RTM_DELRULE, rule, ops);
> +		fib_rule_put(rule);
> +		rules_ops_put(ops);
> +		return 0;
> +	}
> +
> +	err = -ENOENT;
> +errout:
> +	rules_ops_put(ops);
> +	return err;
> +}
> +
> +static int fib_nl_fill_rule(struct sk_buff *skb, struct fib_rule *rule,
> +			    u32 pid, u32 seq, int type, int flags,
> +			    struct fib_rules_ops *ops)
> +{
> +	struct nlmsghdr *nlh;
> +	struct fib_rule_hdr *frh;
> +
> +	nlh = nlmsg_put(skb, pid, seq, type, sizeof(*frh), flags);
> +	if (nlh == NULL)
> +		return -1;
> +
> +	frh = nlmsg_data(nlh);
> +	frh->table = rule->table;
> +	frh->res1 = 0;
> +	frh->res2 = 0;
> +	frh->action = rule->action;
> +	frh->flags = rule->flags;
> +
> +	if (rule->ifname[0])
> +		NLA_PUT_STRING(skb, FRA_IFNAME, rule->ifname);
> +
> +	if (rule->pref)
> +		NLA_PUT_U32(skb, FRA_PRIORITY, rule->pref);
> +
> +	if (ops->fill(rule, skb, nlh, frh) < 0)
> +		goto nla_put_failure;
> +
> +	return nlmsg_end(skb, nlh);
> +
> +nla_put_failure:
> +	return nlmsg_cancel(skb, nlh);
> +}
> +
> +int fib_rules_dump(struct sk_buff *skb, struct netlink_callback *cb, int family)
> +{
> +	int idx = 0;
> +	struct fib_rule *rule;
> +	struct fib_rules_ops *ops;
> +
> +	ops = lookup_rules_ops(family);
> +	if (ops == NULL)
> +		return -EAFNOSUPPORT;
> +
> +	rcu_read_lock();
> +	list_for_each_entry(rule, ops->rules_list, list) {
> +		if (idx < cb->args[0])
> +			goto skip;
> +
> +		if (fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).pid,
> +				     cb->nlh->nlmsg_seq, RTM_NEWRULE,
> +				     NLM_F_MULTI, ops) < 0)
> +			break;
> +skip:
> +		idx++;
> +	}
> +	rcu_read_unlock();
> +	cb->args[0] = idx;
> +	rules_ops_put(ops);
> +
> +	return skb->len;
> +}
> +
> +EXPORT_SYMBOL_GPL(fib_rules_dump);
> +
> +static void notify_rule_change(int event, struct fib_rule *rule,
> +			       struct fib_rules_ops *ops)
> +{
> +	int size = nlmsg_total_size(sizeof(struct fib_rule_hdr) + 128);
> +	struct sk_buff *skb = alloc_skb(size, GFP_KERNEL);
> +
> +	if (skb == NULL)
> +		netlink_set_err(rtnl, 0, RTNLGRP_IPV4_RULE, ENOBUFS);
> +	else if (fib_nl_fill_rule(skb, rule, 0, 0, event, 0, ops) < 0) {
> +		kfree_skb(skb);
> +		netlink_set_err(rtnl, 0, RTNLGRP_IPV4_RULE, EINVAL);
> +	} else
> +		netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV4_RULE, GFP_KERNEL);
> +}

Shouldn't different families use different groups? Userspace
might (rightfully, I think) expect not to see anything but
IPv4 rules on RTNLGRP_IPV4_RULE.

> +static void attach_rules(struct list_head *rules, struct net_device *dev)
> +{
> +	struct fib_rule *rule;
> +
> +	list_for_each_entry(rule, rules, list) {
> +		if (rule->ifindex == -1 &&
> +		    strcmp(dev->name, rule->ifname) == 0)
> +			rule->ifindex = dev->ifindex;
> +	}
> +}
> +
> +static void detach_rules(struct list_head *rules, struct net_device *dev)
> +{
> +	struct fib_rule *rule;
> +
> +	list_for_each_entry(rule, rules, list)
> +		if (rule->ifindex == dev->ifindex)
> +			rule->ifindex = -1;
> +}
> +
> +
> +static int fib_rules_event(struct notifier_block *this, unsigned long event,
> +			    void *ptr)
> +{
> +	struct net_device *dev = ptr;
> +	struct fib_rules_ops *ops;
> +
> +	ASSERT_RTNL();
> +	rcu_read_lock();
> +
> +	switch (event) {
> +	case NETDEV_REGISTER:
> +		list_for_each_entry(ops, &rules_ops, list)
> +			attach_rules(ops->rules_list, dev);
> +		break;
> +
> +	case NETDEV_UNREGISTER:
> +		list_for_each_entry(ops, &rules_ops, list)
> +			detach_rules(ops->rules_list, dev);
> +		break;
> +	}
> +
> +	rcu_read_unlock();
> +
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block fib_rules_notifier = {
> +	.notifier_call = fib_rules_event,
> +};
> +
> +static int __init fib_rules_init(void)
> +{
> +	return register_netdevice_notifier(&fib_rules_notifier);
> +}
> +
> +subsys_initcall(fib_rules_init);


* Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-27 23:30         ` Patrick McHardy
@ 2006-07-28 10:23           ` Thomas Graf
  0 siblings, 0 replies; 54+ messages in thread
From: Thomas Graf @ 2006-07-28 10:23 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: David Miller, jmorris, netdev, vnuorval, usagi-core, yoshfuji,
	anttit

* Patrick McHardy <kaber@trash.net> 2006-07-28 01:30
> > +int fib_rules_lookup(struct fib_rules_ops *ops, struct flowi *fl,
> > +		     int flags, struct fib_lookup_arg *arg)
> > +{
> > +	struct fib_rule *rule;
> > +	int err;
> > +
> > +	rcu_read_lock();
> > +
> > +	list_for_each_entry(rule, ops->rules_list, list) {
> > +		if (rule->ifname[0] && (rule->ifindex != fl->iif))
> > +			continue;
> 
> ifindex may be unset even if ifname is set (in case the interface
> does not exist yet). In that case it will match falsely on
> locally generated packets.

In that case rule->ifindex would be -1 and it shouldn't match, but I
changed it anyway; it makes more sense.
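
To make the case under discussion concrete, here is a small userspace
sketch of the interface check as it appears in the quoted hunk. It is an
assumption-laden model: rule_sketch, IFNAME_MAX_SKETCH and
rule_iif_matches() are hypothetical names, and it takes an input
interface index of 0 to mean a locally generated packet. It is not the
revised check mentioned above, which is not shown in this mail.

```c
#include <assert.h>
#include <stddef.h>

#define IFNAME_MAX_SKETCH 16	/* hypothetical stand-in for IFNAMSIZ */

/* Only the two fields the interface check looks at. */
struct rule_sketch {
	char ifname[IFNAME_MAX_SKETCH];
	int  ifindex;	/* -1 when a name is set but no device is bound */
};

/*
 * Mirrors the condition in the quoted hunk: a rule that names an
 * interface only applies when its ifindex equals the packet's input
 * interface; a rule without a name applies to any interface.
 */
static int rule_iif_matches(const struct rule_sketch *rule, int iif)
{
	assert(rule != NULL);

	if (rule->ifname[0] == '\0')
		return 1;

	return rule->ifindex == iif;
}
```

With ifindex initialised to -1, a rule naming a missing interface does
not match iif 0. If ifindex were left at 0 instead, the same rule would
match locally generated packets, which is the false match raised in the
review.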

> > +static void notify_rule_change(int event, struct fib_rule *rule,
> > +			       struct fib_rules_ops *ops)
> > +{
> > +	int size = nlmsg_total_size(sizeof(struct fib_rule_hdr) + 128);
> > +	struct sk_buff *skb = alloc_skb(size, GFP_KERNEL);
> > +
> > +	if (skb == NULL)
> > +		netlink_set_err(rtnl, 0, RTNLGRP_IPV4_RULE, ENOBUFS);
> > +	else if (fib_nl_fill_rule(skb, rule, 0, 0, event, 0, ops) < 0) {
> > +		kfree_skb(skb);
> > +		netlink_set_err(rtnl, 0, RTNLGRP_IPV4_RULE, EINVAL);
> > +	} else
> > +		netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV4_RULE, GFP_KERNEL);
> > +}
> 
> Shouldn't different families use different groups? Userspace
> might (rightfully, I think) expect not to see anything but
> IPv4 rules on RTNLGRP_IPV4_RULE.

Right, I've added ops->nlgroup to fix this. Naturally I also
fixed all the other issues you brought up. I have the feeling
that there are more bugs, so I will look at the code again with
some distance in a few days.


* Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-27 22:39       ` [RESEND " Thomas Graf
  2006-07-27 22:58         ` Patrick McHardy
  2006-07-27 23:30         ` Patrick McHardy
@ 2006-07-31 14:46         ` Ville Nuorvala
  2006-07-31 15:24           ` Thomas Graf
  2 siblings, 1 reply; 54+ messages in thread
From: Ville Nuorvala @ 2006-07-31 14:46 UTC (permalink / raw)
  To: Thomas Graf; +Cc: David Miller, jmorris, netdev, usagi-core, yoshfuji, anttit

Thomas Graf wrote:

Hi Thomas,


> Derived from net/ipv6/fib_rules.c

do you mean net/ipv4/fib_rules.c or net/ipv6/fib6_rules.c? :-)

A couple of comments below.

> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> 
> Index: net-2.6.git/include/linux/fib_rules.h
> ===================================================================
> --- /dev/null
> +++ net-2.6.git/include/linux/fib_rules.h
> @@ -0,0 +1,60 @@
> +#ifndef __LINUX_FIB_RULES_H
> +#define __LINUX_FIB_RULES_H
> +
> +#include <linux/types.h>
> +#include <linux/rtnetlink.h>
> +
> +/* rule is permanent, and cannot be deleted */
> +#define FIB_RULE_PERMANENT	1
> +
> +struct fib_rule_hdr
> +{
> +	__u8		family;
> +	__u8		dst_len;
> +	__u8		src_len;
> +	__u8		tos;
> +
> +	__u8		table;
> +	__u8		res1;	/* reserved */
> +	__u8		res2;	/* reserved */
> +	__u8		action;
> +
> +	__u32		flags;
> +};

I'm wondering if this is guaranteed to be equivalent to struct rtmsg?

struct rtmsg
{
	unsigned char		rtm_family;
	unsigned char		rtm_dst_len;
	unsigned char		rtm_src_len;
	unsigned char		rtm_tos;

	unsigned char		rtm_table;	/* Routing table id */
	unsigned char		rtm_protocol;	/* Routing protocol; see below	*/
	unsigned char		rtm_scope;	/* See below */	
	unsigned char		rtm_type;	/* See below	*/

	unsigned		rtm_flags;
};

Won't we otherwise be breaking the existing userland interface?

> +enum
> +{
> +	FRA_UNSPEC,
> +	FRA_DST,	/* destination address */
> +	FRA_SRC,	/* source address */
> +	FRA_IFNAME,	/* interface name */
> +	FRA_UNUSED1,
> +	FRA_UNUSED2,
> +	FRA_PRIORITY,	/* priority/preference */
> +	FRA_UNUSED3,
> +	FRA_UNUSED4,
> +	FRA_UNUSED5,
> +	FRA_FWMARK,	/* netfilter mark (IPv4) */
> +	FRA_FLOW,	/* flow/class id */
> +	__FRA_MAX
> +};
> +
> +#define FRA_MAX (__FRA_MAX - 1)
> +
> +enum
> +{
> +	FR_ACT_UNSPEC,
> +	FR_ACT_TO_TBL,		/* Pass to fixed table */
> +	FR_ACT_RES1,
> +	FR_ACT_RES2,
> +	FR_ACT_RES3,
> +	FR_ACT_RES4,
> +	FR_ACT_BLACKHOLE,	/* Drop without notification */
> +	FR_ACT_UNREACHABLE,	/* Drop with ENETUNREACH */
> +	FR_ACT_PROHIBIT,	/* Drop with EACCES */
> +	__FR_ACT_MAX,
> +};
> +
> +#define FR_ACT_MAX (__FR_ACT_MAX - 1)
> +
> +#endif

Shouldn't all these (struct fib_rule_hdr included) actually be defined
in include/linux/rtnetlink.h?

Otherwise, looks good.

Regards,
Ville


* Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-31 14:46         ` Ville Nuorvala
@ 2006-07-31 15:24           ` Thomas Graf
  2006-07-31 18:01             ` Patrick McHardy
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Graf @ 2006-07-31 15:24 UTC (permalink / raw)
  To: Ville Nuorvala
  Cc: David Miller, jmorris, netdev, usagi-core, yoshfuji, anttit

* Ville Nuorvala <vnuorval@tcs.hut.fi> 2006-07-31 17:46
> > Derived from net/ipv6/fib_rules.c
> 
> do you mean net/ipv4/fib_rules.c or net/ipv6/fib6_rules.c? :-)

Hehe, I meant net/ipv4/fib_rules.c :-)

> > +struct fib_rule_hdr
> > +{
> > +	__u8		family;
> > +	__u8		dst_len;
> > +	__u8		src_len;
> > +	__u8		tos;
> > +
> > +	__u8		table;
> > +	__u8		res1;	/* reserved */
> > +	__u8		res2;	/* reserved */
> > +	__u8		action;
> > +
> > +	__u32		flags;
> > +};
> 
> I'm wondering if this is guaranteed to be equivalent to struct rtmsg?
> 
> struct rtmsg
> {
> 	unsigned char		rtm_family;
> 	unsigned char		rtm_dst_len;
> 	unsigned char		rtm_src_len;
> 	unsigned char		rtm_tos;
> 
> 	unsigned char		rtm_table;	/* Routing table id */
> 	unsigned char		rtm_protocol;	/* Routing protocol; see below	*/
> 	unsigned char		rtm_scope;	/* See below */	
> 	unsigned char		rtm_type;	/* See below	*/
> 
> 	unsigned		rtm_flags;
> };
> 
> Won't we otherwise be breaking the existing userland interface?

It is equivalent, but you're right: it would break userland
interfaces otherwise. I've defined this new header to add
implicit names and stop the confusion with unused fields.
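
The equivalence can be checked at compile time. The following is a
minimal userspace sketch, assuming a C11 compiler and a 32-bit unsigned
int; the two structs are local copies of the definitions quoted above,
and the _sketch suffixes and the SAME_OFFSET macro are names made up for
this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the proposed rule header, as quoted in this thread. */
struct fib_rule_hdr_sketch {
	uint8_t  family;
	uint8_t  dst_len;
	uint8_t  src_len;
	uint8_t  tos;

	uint8_t  table;
	uint8_t  res1;		/* reserved */
	uint8_t  res2;		/* reserved */
	uint8_t  action;

	uint32_t flags;
};

/* Local copy of struct rtmsg, as quoted in this thread. */
struct rtmsg_sketch {
	unsigned char rtm_family;
	unsigned char rtm_dst_len;
	unsigned char rtm_src_len;
	unsigned char rtm_tos;

	unsigned char rtm_table;
	unsigned char rtm_protocol;
	unsigned char rtm_scope;
	unsigned char rtm_type;

	unsigned      rtm_flags;
};

/* Fails the build if field f and field r sit at different offsets. */
#define SAME_OFFSET(f, r) \
	static_assert(offsetof(struct fib_rule_hdr_sketch, f) == \
		      offsetof(struct rtmsg_sketch, r), \
		      #f " and " #r " are at different offsets")

SAME_OFFSET(family, rtm_family);
SAME_OFFSET(dst_len, rtm_dst_len);
SAME_OFFSET(src_len, rtm_src_len);
SAME_OFFSET(tos, rtm_tos);
SAME_OFFSET(table, rtm_table);
SAME_OFFSET(res1, rtm_protocol);
SAME_OFFSET(res2, rtm_scope);
SAME_OFFSET(action, rtm_type);
SAME_OFFSET(flags, rtm_flags);

static_assert(sizeof(struct fib_rule_hdr_sketch) ==
	      sizeof(struct rtmsg_sketch),
	      "the two headers differ in size");
```

If either struct changes shape, the build breaks instead of userland
silently seeing a different layout.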

> > +enum
> > +{
> > +	FRA_UNSPEC,
> > +	FRA_DST,	/* destination address */
> > +	FRA_SRC,	/* source address */
> > +	FRA_IFNAME,	/* interface name */
> > +	FRA_UNUSED1,
> > +	FRA_UNUSED2,
> > +	FRA_PRIORITY,	/* priority/preference */
> > +	FRA_UNUSED3,
> > +	FRA_UNUSED4,
> > +	FRA_UNUSED5,
> > +	FRA_FWMARK,	/* netfilter mark (IPv4) */
> > +	FRA_FLOW,	/* flow/class id */
> > +	__FRA_MAX
> > +};
> > +
> > +#define FRA_MAX (__FRA_MAX - 1)
> > +
> > +enum
> > +{
> > +	FR_ACT_UNSPEC,
> > +	FR_ACT_TO_TBL,		/* Pass to fixed table */
> > +	FR_ACT_RES1,
> > +	FR_ACT_RES2,
> > +	FR_ACT_RES3,
> > +	FR_ACT_RES4,
> > +	FR_ACT_BLACKHOLE,	/* Drop without notification */
> > +	FR_ACT_UNREACHABLE,	/* Drop with ENETUNREACH */
> > +	FR_ACT_PROHIBIT,	/* Drop with EACCES */
> > +	__FR_ACT_MAX,
> > +};
> > +
> > +#define FR_ACT_MAX (__FR_ACT_MAX - 1)
> > +
> > +#endif
> 
> Shouldn't all these (struct fib_rule_hdr included) actually be defined
> in include/linux/rtnetlink.h?

We used to stuff everything into rtnetlink.h for no good reason. Having
independent include/linux/<subsystem>.h to export the interface to
userspace and include/net/<subsystem>.h to export the kernel interface
instead of contributing to the ifdef hell seems a lot cleaner to me.


* Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-31 15:24           ` Thomas Graf
@ 2006-07-31 18:01             ` Patrick McHardy
  2006-07-31 20:01               ` Thomas Graf
  0 siblings, 1 reply; 54+ messages in thread
From: Patrick McHardy @ 2006-07-31 18:01 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Ville Nuorvala, David Miller, jmorris, netdev, usagi-core,
	yoshfuji, anttit

Thomas Graf wrote:
> * Ville Nuorvala <vnuorval@tcs.hut.fi> 2006-07-31 17:46
>
>>Shouldn't all these (struct fib_rule_hdr included) actually be defined
>>in include/linux/rtnetlink.h?
> 
> 
> We used to stuff everything into rtnetlink.h for no good reason. Having
> independent include/linux/<subsystem>.h to export the interface to
> userspace and include/net/<subsystem>.h to export the kernel interface
> instead of contributing to the ifdef hell seems a lot cleaner to me.


I agree, but then we should also split up rtnetlink.h. Having one
special case will just make it harder to find.



* Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
  2006-07-31 18:01             ` Patrick McHardy
@ 2006-07-31 20:01               ` Thomas Graf
  0 siblings, 0 replies; 54+ messages in thread
From: Thomas Graf @ 2006-07-31 20:01 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Ville Nuorvala, David Miller, jmorris, netdev, usagi-core,
	yoshfuji, anttit

* Patrick McHardy <kaber@trash.net> 2006-07-31 20:01
> Thomas Graf wrote:
> > * Ville Nuorvala <vnuorval@tcs.hut.fi> 2006-07-31 17:46
> >
> >>Shouldn't all these (struct fib_rule_hdr included) actually be defined
> >>in include/linux/rtnetlink.h?
> > 
> > 
> > We used to stuff everything into rtnetlink.h for no good reason. Having
> > independent include/linux/<subsystem>.h to export the interface to
> > userspace and include/net/<subsystem>.h to export the kernel interface
> > instead of contributing to the ifdef hell seems a lot cleaner to me.
> 
> 
> I agree, but then we should also split up rtnetlink.h. Having one
> special case will just make it harder to find.

Already done in the patchset converting things to the new netlink
interface that I'll start submitting in the next few days.


* [PATCH 4/5] [IPV6]: Policy Routing Rules
  2006-07-26 22:11 [RFC] Multiple IPV6 Routing Tables & Policy Routing Thomas Graf
                   ` (2 preceding siblings ...)
  2006-07-26 22:00 ` [PATCH 3/5] [NET]: Protocol Independant Policy Routing Rules Framework Thomas Graf
@ 2006-07-26 22:00 ` Thomas Graf
  2006-07-26 22:42   ` David Miller
                     ` (2 more replies)
  2006-07-26 22:00 ` [PATCH 5/5] [IPV4]: Use Protocol Independant Policy Routing Rules Framework Thomas Graf
                   ` (2 subsequent siblings)
  6 siblings, 3 replies; 54+ messages in thread
From: Thomas Graf @ 2006-07-26 22:00 UTC (permalink / raw)
  To: netdev; +Cc: vnuorval, usagi-core, yoshfuji, davem, anttit

[-- Attachment #1: ip6_fib_rules --]
[-- Type: text/plain, Size: 11813 bytes --]

Adds support for policy routing rules including a new
local table for routes with a local destination.

Signed-off-by: Thomas Graf <tgraf@suug.ch>

Index: net-2.6.git/net/ipv6/fib6_rules.c
===================================================================
--- /dev/null
+++ net-2.6.git/net/ipv6/fib6_rules.c
@@ -0,0 +1,256 @@
+/*
+ * net/ipv6/fib6_rules.c	IPv6 Routing Policy Rules
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License as
+ *	published by the Free Software Foundation, version 2.
+ *
+ * Authors
+ *	Ville Nuorvala		<vnuorval@tcs.hut.fi>
+ *	Thomas Graf		<tgraf@suug.ch>
+ */
+
+#include <linux/config.h>
+#include <linux/netdevice.h>
+
+#include <net/fib_rules.h>
+#include <net/ipv6.h>
+#include <net/ip6_route.h>
+#include <net/netlink.h>
+
+struct fib6_rule
+{
+	struct fib_rule		common;
+	struct rt6key		src;
+	struct rt6key		dst;
+	u8			tclass;
+};
+
+static struct fib_rules_ops fib6_rules_ops;
+
+static struct fib6_rule main_rule = {
+	.common = {
+		.refcnt =	ATOMIC_INIT(2),
+		.pref =		0x7FFE,
+		.action =	FR_ACT_TO_TBL,
+		.table =	RT6_TABLE_MAIN,
+	},
+};
+
+static struct fib6_rule local_rule = {
+	.common = {
+		.refcnt =	ATOMIC_INIT(2),
+		.pref =		0,
+		.action =	FR_ACT_TO_TBL,
+		.table =	RT6_TABLE_LOCAL,
+		.flags =	FIB_RULE_PERMANENT,
+	},
+};
+
+static LIST_HEAD(fib6_rules);
+
+struct dst_entry *fib6_rule_lookup(struct flowi *fl, int flags,
+				   pol_lookup_t lookup)
+{
+	struct fib_lookup_arg arg = {
+		.lookup_ptr = lookup,
+	};
+
+	fib_rules_lookup(&fib6_rules_ops, fl, flags, &arg);
+	fib_rule_put(arg.rule);
+
+	return (struct dst_entry *) arg.result;
+}
+
+int fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
+		     int flags, struct fib_lookup_arg *arg)
+{
+	struct rt6_info *rt = NULL;
+	struct fib6_table *table;
+	pol_lookup_t lookup = arg->lookup_ptr;
+
+	switch (rule->action) {
+	case FR_ACT_TO_TBL:
+		break;
+	case FR_ACT_UNREACHABLE:
+		goto no_rt;
+	default:
+	case FR_ACT_BLACKHOLE:
+		rt = &ip6_blk_hole_entry;
+		goto discard_pkt;
+	case FR_ACT_PROHIBIT:
+		rt = &ip6_prohibit_entry;
+		goto discard_pkt;
+	}
+
+	table = fib6_get_table(rule->table);
+	if (table)
+		rt = lookup(table, flp, flags);
+
+	if (rt == &ip6_null_entry)
+		dst_release(&rt->u.dst);
+	else
+		goto out;
+no_rt:
+	rt = &ip6_null_entry;
+discard_pkt:
+	dst_hold(&rt->u.dst);
+out:
+
+	arg->result = rt;
+	return 0;
+}
+
+
+static int fib6_rule_match(struct fib_rule *rule, struct flowi *fl, int flags)
+{
+	struct fib6_rule *r = (struct fib6_rule *) rule;
+
+	if (!ipv6_prefix_equal(&fl->fl6_dst, &r->dst.addr, r->dst.plen))
+		return 0;
+
+#ifdef CONFIG_IPV6_SUBTREES
+	if ((flags & RT6_F_HAS_SADDR) &&
+	    !ipv6_prefix_equal(&fl->fl6_src, &r->r_src.addr, r->r_src.plen))
+		return 0;
+#endif
+
+	return 1;
+}
+
+static struct nla_policy fib6_rule_policy[RTA_MAX+1] __read_mostly = {
+	[FRA_IFNAME]	= { .type = NLA_STRING },
+	[FRA_PRIORITY]	= { .type = NLA_U32 },
+	[FRA_SRC]	= { .minlen = sizeof(struct in6_addr) },
+	[FRA_DST]	= { .minlen = sizeof(struct in6_addr) },
+};
+
+static int fib6_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
+			       struct nlmsghdr *nlh, struct fib_rule_hdr *frh,
+			       struct nlattr **tb)
+{
+	int err = -EINVAL;
+	struct fib6_rule *rule6 = (struct fib6_rule *) rule;
+
+	if (frh->src_len > 128 || frh->dst_len > 128 ||
+	    (frh->tos & ~IPV6_FLOWINFO_MASK))
+		goto errout;
+
+#ifndef CONFIG_IPV6_SUBTREES
+	if (frh->src_len > 0)
+		goto errout;
+#endif
+
+	if (rule->action == FR_ACT_TO_TBL) {
+		if (rule->table == RT6_TABLE_UNSPEC)
+			goto errout;
+
+		if (fib6_new_table(rule->table) == NULL) {
+			err = -ENOBUFS;
+			goto errout;
+		}
+	}
+
+	if (tb[FRA_SRC])
+		nla_memcpy(&rule6->src.addr, tb[FRA_SRC],
+			   sizeof(struct in6_addr));
+
+	if (tb[FRA_DST])
+		nla_memcpy(&rule6->dst.addr, tb[FRA_DST],
+			   sizeof(struct in6_addr));
+
+	rule6->src.plen = frh->src_len;
+	rule6->dst.plen = frh->dst_len;
+	rule6->tclass = frh->tos;
+
+	err = 0;
+errout:
+	return err;
+}
+
+static int fib6_rule_compare(struct fib_rule *rule, struct fib_rule_hdr *frh,
+			     struct nlattr **tb)
+{
+	struct fib6_rule *rule6 = (struct fib6_rule *) rule;
+
+	if (frh->src_len && (rule6->src.plen != frh->src_len))
+		return 0;
+
+	if (frh->dst_len && (rule6->dst.plen != frh->dst_len))
+		return 0;
+
+	if (frh->tos && (rule6->tclass != frh->tos))
+		return 0;
+
+	if (tb[FRA_SRC] &&
+	    nla_memcmp(tb[FRA_SRC], &rule6->src.addr, sizeof(struct in6_addr)))
+		return 0;
+
+	if (tb[FRA_DST] &&
+	    nla_memcmp(tb[FRA_DST], &rule6->dst.addr, sizeof(struct in6_addr)))
+		return 0;
+
+	return 1;
+}
+
+static int fib6_rule_fill(struct fib_rule *rule, struct sk_buff *skb,
+			  struct nlmsghdr *nlh, struct fib_rule_hdr *frh)
+{
+	struct fib6_rule *rule6 = (struct fib6_rule *) rule;
+
+	frh->family = AF_INET6;
+	frh->dst_len = rule6->dst.plen;
+	frh->src_len = rule6->src.plen;
+	frh->tos = rule6->tclass;
+
+	if (rule6->dst.plen)
+		NLA_PUT(skb, FRA_DST, sizeof(struct in6_addr),
+			&rule6->dst.addr);
+
+	if (rule6->src.plen)
+		NLA_PUT(skb, FRA_SRC, sizeof(struct in6_addr),
+			&rule6->src.addr);
+
+	return 0;
+
+nla_put_failure:
+	return -ENOBUFS;
+}
+
+int fib6_rules_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	return fib_rules_dump(skb, cb, AF_INET6);
+}
+
+static u32 fib6_rule_default_pref(void)
+{
+	return 0x3FFF;
+}
+
+static struct fib_rules_ops fib6_rules_ops = {
+	.family			= AF_INET6,
+	.rule_size		= sizeof(struct fib6_rule),
+	.action			= fib6_rule_action,
+	.match			= fib6_rule_match,
+	.configure		= fib6_rule_configure,
+	.compare		= fib6_rule_compare,
+	.fill			= fib6_rule_fill,
+	.default_pref		= fib6_rule_default_pref,
+	.policy			= fib6_rule_policy,
+	.rules_list		= &fib6_rules,
+	.owner			= THIS_MODULE,
+};
+
+void __init fib6_rules_init(void)
+{
+	list_add_tail(&local_rule.common.list, &fib6_rules);
+	list_add_tail(&main_rule.common.list, &fib6_rules);
+
+	fib_rules_register(&fib6_rules_ops);
+}
+
+void __exit fib6_rules_cleanup(void)
+{
+	fib_rules_unregister(&fib6_rules_ops);
+}
+
Index: net-2.6.git/net/ipv6/Kconfig
===================================================================
--- net-2.6.git.orig/net/ipv6/Kconfig
+++ net-2.6.git/net/ipv6/Kconfig
@@ -138,6 +138,7 @@ config IPV6_TUNNEL
 config IPV6_MULTIPLE_TABLES
 	bool "IPv6: Multiple Routing Tables"
 	depends on IPV6 && EXPERIMENTAL
+	select FIB_RULES
 	---help---
 	  Support multiple routing tables.
 
Index: net-2.6.git/net/ipv6/Makefile
===================================================================
--- net-2.6.git.orig/net/ipv6/Makefile
+++ net-2.6.git/net/ipv6/Makefile
@@ -29,3 +29,5 @@ obj-$(CONFIG_IPV6_TUNNEL) += ip6_tunnel.
 obj-y += exthdrs_core.o
 
 obj-$(subst m,y,$(CONFIG_IPV6)) += inet6_hashtables.o
+
+obj-$(CONFIG_IPV6_MULTIPLE_TABLES) += fib6_rules.o
Index: net-2.6.git/include/net/ip6_fib.h
===================================================================
--- net-2.6.git.orig/include/net/ip6_fib.h
+++ net-2.6.git/include/net/ip6_fib.h
@@ -155,16 +155,17 @@ struct fib6_table {
 
 #define RT6_TABLE_UNSPEC	RT_TABLE_UNSPEC
 #define RT6_TABLE_MAIN		RT_TABLE_MAIN
-#define RT6_TABLE_LOCAL		RT6_TABLE_MAIN
 #define RT6_TABLE_DFLT		RT6_TABLE_MAIN
 #define RT6_TABLE_INFO		RT6_TABLE_MAIN
 
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 #define FIB6_TABLE_MIN		1
 #define FIB6_TABLE_MAX		RT_TABLE_MAX
+#define RT6_TABLE_LOCAL		RT_TABLE_LOCAL
 #else
 #define FIB6_TABLE_MIN		RT_TABLE_MAIN
 #define FIB6_TABLE_MAX		FIB6_TABLE_MIN
+#define RT6_TABLE_LOCAL		RT6_TABLE_MAIN
 #endif
 
 #define RT6_F_STRICT		1
@@ -219,5 +220,11 @@ extern void			fib6_run_gc(unsigned long 
 extern void			fib6_gc_cleanup(void);
 
 extern void			fib6_init(void);
+
+extern void			fib6_rules_init(void);
+extern void			fib6_rules_cleanup(void);
+extern int			fib6_rules_dump(struct sk_buff *,
+						struct netlink_callback *);
+
 #endif
 #endif
Index: net-2.6.git/net/ipv6/route.c
===================================================================
--- net-2.6.git.orig/net/ipv6/route.c
+++ net-2.6.git/net/ipv6/route.c
@@ -139,6 +139,50 @@ struct rt6_info ip6_null_entry = {
 	.rt6i_ref	= ATOMIC_INIT(1),
 };
 
+#ifdef CONFIG_IPV6_MULTIPLE_TABLES
+
+struct rt6_info ip6_prohibit_entry = {
+	.u = {
+		.dst = {
+			.__refcnt	= ATOMIC_INIT(1),
+			.__use		= 1,
+			.dev		= &loopback_dev,
+			.obsolete	= -1,
+			.error		= -EACCES,
+			.metrics	= { [RTAX_HOPLIMIT - 1] = 255, },
+			.input		= ip6_pkt_discard,
+			.output		= ip6_pkt_discard_out,
+			.ops		= &ip6_dst_ops,
+			.path		= (struct dst_entry*)&ip6_prohibit_entry,
+		}
+	},
+	.rt6i_flags	= (RTF_REJECT | RTF_NONEXTHOP),
+	.rt6i_metric	= ~(u32) 0,
+	.rt6i_ref	= ATOMIC_INIT(1),
+};
+
+struct rt6_info ip6_blk_hole_entry = {
+	.u = {
+		.dst = {
+			.__refcnt	= ATOMIC_INIT(1),
+			.__use		= 1,
+			.dev		= &loopback_dev,
+			.obsolete	= -1,
+			.error		= -EINVAL,
+			.metrics	= { [RTAX_HOPLIMIT - 1] = 255, },
+			.input		= ip6_pkt_discard,
+			.output		= ip6_pkt_discard_out,
+			.ops		= &ip6_dst_ops,
+			.path		= (struct dst_entry*)&ip6_blk_hole_entry,
+		}
+	},
+	.rt6i_flags	= (RTF_REJECT | RTF_NONEXTHOP),
+	.rt6i_metric	= ~(u32) 0,
+	.rt6i_ref	= ATOMIC_INIT(1),
+};
+
+#endif
+
 /* allocate dst with ip6_dst_ops */
 static __inline__ struct rt6_info *ip6_dst_alloc(void)
 {
@@ -2398,10 +2442,16 @@ void __init ip6_route_init(void)
 #ifdef CONFIG_XFRM
 	xfrm6_init();
 #endif
+#ifdef CONFIG_IPV6_MULTIPLE_TABLES
+	fib6_rules_init();
+#endif
 }
 
 void ip6_route_cleanup(void)
 {
+#ifdef CONFIG_IPV6_MULTIPLE_TABLES
+	fib6_rules_cleanup();
+#endif
 #ifdef CONFIG_PROC_FS
 	proc_net_remove("ipv6_route");
 	proc_net_remove("rt6_stats");
Index: net-2.6.git/net/ipv6/ip6_fib.c
===================================================================
--- net-2.6.git.orig/net/ipv6/ip6_fib.c
+++ net-2.6.git/net/ipv6/ip6_fib.c
@@ -159,6 +159,15 @@ static struct fib6_table fib6_main_tbl =
 
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 
+static struct fib6_table fib6_local_tbl = {
+	.tb6_id		= RT6_TABLE_LOCAL,
+	.tb6_lock	= RW_LOCK_UNLOCKED,
+	.tb6_root 	= {
+		.leaf		= &ip6_null_entry,
+		.fn_flags	= RTN_ROOT | RTN_TL_ROOT | RTN_RTINFO,
+	},
+};
+
 #define FIB_TABLE_HASHSZ 256
 static struct hlist_head fib_table_hash[FIB_TABLE_HASHSZ];
 
@@ -228,20 +237,10 @@ struct fib6_table *fib6_get_table(u32 id
 	return NULL;
 }
 
-struct dst_entry *fib6_rule_lookup(struct flowi *fl, int flags,
-				   pol_lookup_t lookup)
-{
-	/*
-	 * TODO: Add rule lookup
-	 */
-	struct fib6_table *table = fib6_get_table(RT6_TABLE_MAIN);
-
-	return (struct dst_entry *) lookup(table, fl, flags);
-}
-
 static void __init fib6_tables_init(void)
 {
 	fib6_link_table(&fib6_main_tbl);
+	fib6_link_table(&fib6_local_tbl);
 }
 
 #else
Index: net-2.6.git/include/net/ip6_route.h
===================================================================
--- net-2.6.git.orig/include/net/ip6_route.h
+++ net-2.6.git/include/net/ip6_route.h
@@ -41,6 +41,11 @@ struct pol_chain {
 
 extern struct rt6_info	ip6_null_entry;
 
+#ifdef CONFIG_IPV6_MULTIPLE_TABLES
+extern struct rt6_info	ip6_prohibit_entry;
+extern struct rt6_info	ip6_blk_hole_entry;
+#endif
+
 extern int ip6_rt_gc_interval;
 
 extern void			ip6_route_input(struct sk_buff *skb);
Index: net-2.6.git/net/ipv6/addrconf.c
===================================================================
--- net-2.6.git.orig/net/ipv6/addrconf.c
+++ net-2.6.git/net/ipv6/addrconf.c
@@ -3370,6 +3370,7 @@ static struct rtnetlink_link inet6_rtnet
 	[RTM_DELROUTE - RTM_BASE] = { .doit	= inet6_rtm_delroute, },
 	[RTM_GETROUTE - RTM_BASE] = { .doit	= inet6_rtm_getroute,
 				      .dumpit	= inet6_dump_fib, },
+	[RTM_GETRULE  - RTM_BASE] = { .dumpit   = fib6_rules_dump,   },
 };
 
 static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 4/5] [IPV6]: Policy Routing Rules
  2006-07-26 22:00 ` [PATCH 4/5] [IPV6]: Policy Routing Rules Thomas Graf
@ 2006-07-26 22:42   ` David Miller
  2006-07-26 23:26   ` David Miller
  2006-07-26 23:33   ` David Miller
  2 siblings, 0 replies; 54+ messages in thread
From: David Miller @ 2006-07-26 22:42 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, vnuorval, usagi-core, yoshfuji, anttit

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 27 Jul 2006 00:00:04 +0200

> Adds support for policy routing rules including a new
> local table for routes with a local destination.
> 
> Signed-off-by: Thomas Graf <tgraf@suug.ch>

With the generic fib_rules this is straightforward,
looks good.

Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 4/5] [IPV6]: Policy Routing Rules
  2006-07-26 22:00 ` [PATCH 4/5] [IPV6]: Policy Routing Rules Thomas Graf
  2006-07-26 22:42   ` David Miller
@ 2006-07-26 23:26   ` David Miller
  2006-07-26 23:33   ` David Miller
  2 siblings, 0 replies; 54+ messages in thread
From: David Miller @ 2006-07-26 23:26 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, vnuorval, usagi-core, yoshfuji, anttit

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 27 Jul 2006 00:00:04 +0200

> Index: net-2.6.git/net/ipv6/Makefile
> ===================================================================
> --- net-2.6.git.orig/net/ipv6/Makefile
> +++ net-2.6.git/net/ipv6/Makefile
> @@ -29,3 +29,5 @@ obj-$(CONFIG_IPV6_TUNNEL) += ip6_tunnel.
>  obj-y += exthdrs_core.o
>  
>  obj-$(subst m,y,$(CONFIG_IPV6)) += inet6_hashtables.o
> +
> +obj-$(CONFIG_IPV6_MULTIPLE_TABLES) += fib6_rules.o

This will cause a build failure if IPV6 is built modular.

The fix is simple, make CONFIG_IPV6_MULTIPLE_TABLES
a tristate instead of a bool.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 4/5] [IPV6]: Policy Routing Rules
  2006-07-26 22:00 ` [PATCH 4/5] [IPV6]: Policy Routing Rules Thomas Graf
  2006-07-26 22:42   ` David Miller
  2006-07-26 23:26   ` David Miller
@ 2006-07-26 23:33   ` David Miller
  2006-07-26 23:40     ` David Miller
  2 siblings, 1 reply; 54+ messages in thread
From: David Miller @ 2006-07-26 23:33 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, vnuorval, usagi-core, yoshfuji, anttit

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 27 Jul 2006 00:00:04 +0200

> Index: net-2.6.git/include/net/ip6_route.h
> ===================================================================
> --- net-2.6.git.orig/include/net/ip6_route.h
> +++ net-2.6.git/include/net/ip6_route.h
> @@ -41,6 +41,11 @@ struct pol_chain {
>  
>  extern struct rt6_info	ip6_null_entry;
>  
> +#ifdef CONFIG_IPV6_MULTIPLE_TABLES
> +extern struct rt6_info	ip6_prohibit_entry;
> +extern struct rt6_info	ip6_blk_hole_entry;
> +#endif
> +
>  extern int ip6_rt_gc_interval;
>  
>  extern void			ip6_route_input(struct sk_buff *skb);

Because of my other suggestion, changing IPV6_MULTIPLE_TABLES
to tristate, the build will fail unless we add a second condition
to this ifdef, namely "CONFIG_IPV6_MULTIPLE_TABLES_MODULE".

In fact, all this doesn't look so nice.

All we want to do is make fib6_rules.o get built into ipv6-objs
if it is enabled.

Therefore, the better fix seems to be to change the Makefile
rule to be:

ipv6-$(CONFIG_IPV6_MULTIPLE_TABLES) += fib6_rules.o

Ok?


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 4/5] [IPV6]: Policy Routing Rules
  2006-07-26 23:33   ` David Miller
@ 2006-07-26 23:40     ` David Miller
  2006-07-27 22:40       ` [RESEND " Thomas Graf
  0 siblings, 1 reply; 54+ messages in thread
From: David Miller @ 2006-07-26 23:40 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, vnuorval, usagi-core, yoshfuji, anttit

From: David Miller <davem@davemloft.net>
Date: Wed, 26 Jul 2006 16:33:43 -0700 (PDT)

> Therefore, the better fix seems to be to change the Makefile
> rule to be:
> 
> ipv6-$(CONFIG_IPV6_MULTIPLE_TABLES) += fib6_rules.o
> 
> Ok?

And once this is fixed, new problems crop up :-)

WARNING: net/ipv6/ipv6.o - Section mismatch: reference to .exit.text:fib6_rules_cleanup from .text between 'ip6_route_cleanup' (at offset 0xbfd4) and 'rt6_mtu_change_route'
WARNING: "fib_rules_dump" [net/ipv6/ipv6.ko] undefined!
WARNING: "fib_rules_register" [net/ipv6/ipv6.ko] undefined!
WARNING: "fib_rules_lookup" [net/ipv6/ipv6.ko] undefined!
WARNING: "fib_rules_unregister" [net/ipv6/ipv6.ko] undefined!

The fib_rules_* cases are easy to fix, simply export
those symbols using EXPORT_SYMBOL_GPL() in net/core/fib_rules.c

The other problem is a bit more onerous.  ip6_route_cleanup is not
marked __exit, since it is called from initialization as well as
exit contexts.  Therefore fib6_rules_cleanup() has to have its
__exit marker deleted too.

IPv6 seems to build modular now with all of these issues taken
care of.  Thomas, please integrate these fixes into your patch.

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RESEND 4/5] [IPV6]: Policy Routing Rules
  2006-07-26 23:40     ` David Miller
@ 2006-07-27 22:40       ` Thomas Graf
  2006-07-31 14:55         ` Ville Nuorvala
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Graf @ 2006-07-27 22:40 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, vnuorval, usagi-core, yoshfuji, anttit

Adds support for policy routing rules including a new
local table for routes with a local destination.

Signed-off-by: Thomas Graf <tgraf@suug.ch>

Index: net-2.6.git/net/ipv6/fib6_rules.c
===================================================================
--- /dev/null
+++ net-2.6.git/net/ipv6/fib6_rules.c
@@ -0,0 +1,255 @@
+/*
+ * net/ipv6/fib6_rules.c	IPv6 Routing Policy Rules
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License as
+ *	published by the Free Software Foundation, version 2.
+ *
+ * Authors
+ *	Ville Nuorvala		<vnuorval@tcs.hut.fi>
+ *	Thomas Graf		<tgraf@suug.ch>
+ */
+
+#include <linux/config.h>
+#include <linux/netdevice.h>
+
+#include <net/fib_rules.h>
+#include <net/ipv6.h>
+#include <net/ip6_route.h>
+#include <net/netlink.h>
+
+struct fib6_rule
+{
+	struct fib_rule		common;
+	struct rt6key		src;
+	struct rt6key		dst;
+	u8			tclass;
+};
+
+static struct fib_rules_ops fib6_rules_ops;
+
+static struct fib6_rule main_rule = {
+	.common = {
+		.refcnt =	ATOMIC_INIT(2),
+		.pref =		0x7FFE,
+		.action =	FR_ACT_TO_TBL,
+		.table =	RT6_TABLE_MAIN,
+	},
+};
+
+static struct fib6_rule local_rule = {
+	.common = {
+		.refcnt =	ATOMIC_INIT(2),
+		.pref =		0,
+		.action =	FR_ACT_TO_TBL,
+		.table =	RT6_TABLE_LOCAL,
+		.flags =	FIB_RULE_PERMANENT,
+	},
+};
+
+static LIST_HEAD(fib6_rules);
+
+struct dst_entry *fib6_rule_lookup(struct flowi *fl, int flags,
+				   pol_lookup_t lookup)
+{
+	struct fib_lookup_arg arg = {
+		.lookup_ptr = lookup,
+	};
+
+	fib_rules_lookup(&fib6_rules_ops, fl, flags, &arg);
+	if (arg.rule)
+		fib_rule_put(arg.rule);
+
+	return (struct dst_entry *) arg.result;
+}
+
+int fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
+		     int flags, struct fib_lookup_arg *arg)
+{
+	struct rt6_info *rt = NULL;
+	struct fib6_table *table;
+	pol_lookup_t lookup = arg->lookup_ptr;
+
+	switch (rule->action) {
+	case FR_ACT_TO_TBL:
+		break;
+	case FR_ACT_UNREACHABLE:
+		rt = &ip6_null_entry;
+		goto discard_pkt;
+	default:
+	case FR_ACT_BLACKHOLE:
+		rt = &ip6_blk_hole_entry;
+		goto discard_pkt;
+	case FR_ACT_PROHIBIT:
+		rt = &ip6_prohibit_entry;
+		goto discard_pkt;
+	}
+
+	table = fib6_get_table(rule->table);
+	if (table)
+		rt = lookup(table, flp, flags);
+
+	if (rt != &ip6_null_entry)
+		goto out;
+
+	dst_release(&rt->u.dst);
+discard_pkt:
+	dst_hold(&rt->u.dst);
+out:
+	arg->result = rt;
+	return rt == NULL ? -EAGAIN : 0;
+}
+
+
+static int fib6_rule_match(struct fib_rule *rule, struct flowi *fl, int flags)
+{
+	struct fib6_rule *r = (struct fib6_rule *) rule;
+
+	if (!ipv6_prefix_equal(&fl->fl6_dst, &r->dst.addr, r->dst.plen))
+		return 0;
+
+#ifdef CONFIG_IPV6_SUBTREES
+	if ((flags & RT6_F_HAS_SADDR) &&
+	    !ipv6_prefix_equal(&fl->fl6_src, &r->src.addr, r->src.plen))
+		return 0;
+#endif
+
+	return 1;
+}
+
+static struct nla_policy fib6_rule_policy[FRA_MAX+1] __read_mostly = {
+	[FRA_IFNAME]	= { .type = NLA_STRING },
+	[FRA_PRIORITY]	= { .type = NLA_U32 },
+	[FRA_SRC]	= { .minlen = sizeof(struct in6_addr) },
+	[FRA_DST]	= { .minlen = sizeof(struct in6_addr) },
+};
+
+static int fib6_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
+			       struct nlmsghdr *nlh, struct fib_rule_hdr *frh,
+			       struct nlattr **tb)
+{
+	int err = -EINVAL;
+	struct fib6_rule *rule6 = (struct fib6_rule *) rule;
+
+	if (frh->src_len > 128 || frh->dst_len > 128 ||
+	    (frh->tos & ~IPV6_FLOWINFO_MASK))
+		goto errout;
+
+#ifndef CONFIG_IPV6_SUBTREES
+	if (frh->src_len > 0)
+		goto errout;
+#endif
+
+	if (rule->action == FR_ACT_TO_TBL) {
+		if (rule->table == RT6_TABLE_UNSPEC)
+			goto errout;
+
+		if (fib6_new_table(rule->table) == NULL) {
+			err = -ENOBUFS;
+			goto errout;
+		}
+	}
+
+	if (tb[FRA_SRC])
+		nla_memcpy(&rule6->src.addr, tb[FRA_SRC],
+			   sizeof(struct in6_addr));
+
+	if (tb[FRA_DST])
+		nla_memcpy(&rule6->dst.addr, tb[FRA_DST],
+			   sizeof(struct in6_addr));
+
+	rule6->src.plen = frh->src_len;
+	rule6->dst.plen = frh->dst_len;
+	rule6->tclass = frh->tos;
+
+	err = 0;
+errout:
+	return err;
+}
+
+static int fib6_rule_compare(struct fib_rule *rule, struct fib_rule_hdr *frh,
+			     struct nlattr **tb)
+{
+	struct fib6_rule *rule6 = (struct fib6_rule *) rule;
+
+	if (frh->src_len && (rule6->src.plen != frh->src_len))
+		return 0;
+
+	if (frh->dst_len && (rule6->dst.plen != frh->dst_len))
+		return 0;
+
+	if (frh->tos && (rule6->tclass != frh->tos))
+		return 0;
+
+	if (tb[FRA_SRC] &&
+	    nla_memcmp(tb[FRA_SRC], &rule6->src.addr, sizeof(struct in6_addr)))
+		return 0;
+
+	if (tb[FRA_DST] &&
+	    nla_memcmp(tb[FRA_DST], &rule6->dst.addr, sizeof(struct in6_addr)))
+		return 0;
+
+	return 1;
+}
+
+static int fib6_rule_fill(struct fib_rule *rule, struct sk_buff *skb,
+			  struct nlmsghdr *nlh, struct fib_rule_hdr *frh)
+{
+	struct fib6_rule *rule6 = (struct fib6_rule *) rule;
+
+	frh->family = AF_INET6;
+	frh->dst_len = rule6->dst.plen;
+	frh->src_len = rule6->src.plen;
+	frh->tos = rule6->tclass;
+
+	if (rule6->dst.plen)
+		NLA_PUT(skb, FRA_DST, sizeof(struct in6_addr),
+			&rule6->dst.addr);
+
+	if (rule6->src.plen)
+		NLA_PUT(skb, FRA_SRC, sizeof(struct in6_addr),
+			&rule6->src.addr);
+
+	return 0;
+
+nla_put_failure:
+	return -ENOBUFS;
+}
+
+int fib6_rules_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	return fib_rules_dump(skb, cb, AF_INET6);
+}
+
+static u32 fib6_rule_default_pref(void)
+{
+	return 0x3FFF;
+}
+
+static struct fib_rules_ops fib6_rules_ops = {
+	.family			= AF_INET6,
+	.rule_size		= sizeof(struct fib6_rule),
+	.action			= fib6_rule_action,
+	.match			= fib6_rule_match,
+	.configure		= fib6_rule_configure,
+	.compare		= fib6_rule_compare,
+	.fill			= fib6_rule_fill,
+	.default_pref		= fib6_rule_default_pref,
+	.policy			= fib6_rule_policy,
+	.rules_list		= &fib6_rules,
+	.owner			= THIS_MODULE,
+};
+
+void __init fib6_rules_init(void)
+{
+	list_add_tail(&local_rule.common.list, &fib6_rules);
+	list_add_tail(&main_rule.common.list, &fib6_rules);
+
+	fib_rules_register(&fib6_rules_ops);
+}
+
+void fib6_rules_cleanup(void)
+{
+	fib_rules_unregister(&fib6_rules_ops);
+}
+
Index: net-2.6.git/net/ipv6/Kconfig
===================================================================
--- net-2.6.git.orig/net/ipv6/Kconfig
+++ net-2.6.git/net/ipv6/Kconfig
@@ -138,6 +138,7 @@ config IPV6_TUNNEL
 config IPV6_MULTIPLE_TABLES
 	bool "IPv6: Multiple Routing Tables"
 	depends on IPV6 && EXPERIMENTAL
+	select FIB_RULES
 	---help---
 	  Support multiple routing tables.
 
Index: net-2.6.git/net/ipv6/Makefile
===================================================================
--- net-2.6.git.orig/net/ipv6/Makefile
+++ net-2.6.git/net/ipv6/Makefile
@@ -13,6 +13,7 @@ ipv6-objs :=	af_inet6.o anycast.o ip6_ou
 ipv6-$(CONFIG_XFRM) += xfrm6_policy.o xfrm6_state.o xfrm6_input.o \
 	xfrm6_output.o
 ipv6-$(CONFIG_NETFILTER) += netfilter.o
+ipv6-$(CONFIG_IPV6_MULTIPLE_TABLES) += fib6_rules.o
 ipv6-objs += $(ipv6-y)
 
 obj-$(CONFIG_INET6_AH) += ah6.o
Index: net-2.6.git/include/net/ip6_fib.h
===================================================================
--- net-2.6.git.orig/include/net/ip6_fib.h
+++ net-2.6.git/include/net/ip6_fib.h
@@ -155,16 +155,17 @@ struct fib6_table {
 
 #define RT6_TABLE_UNSPEC	RT_TABLE_UNSPEC
 #define RT6_TABLE_MAIN		RT_TABLE_MAIN
-#define RT6_TABLE_LOCAL		RT6_TABLE_MAIN
 #define RT6_TABLE_DFLT		RT6_TABLE_MAIN
 #define RT6_TABLE_INFO		RT6_TABLE_MAIN
 
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 #define FIB6_TABLE_MIN		1
 #define FIB6_TABLE_MAX		RT_TABLE_MAX
+#define RT6_TABLE_LOCAL		RT_TABLE_LOCAL
 #else
 #define FIB6_TABLE_MIN		RT_TABLE_MAIN
 #define FIB6_TABLE_MAX		FIB6_TABLE_MIN
+#define RT6_TABLE_LOCAL		RT6_TABLE_MAIN
 #endif
 
 #define RT6_F_STRICT		1
@@ -219,5 +220,11 @@ extern void			fib6_run_gc(unsigned long 
 extern void			fib6_gc_cleanup(void);
 
 extern void			fib6_init(void);
+
+extern void			fib6_rules_init(void);
+extern void			fib6_rules_cleanup(void);
+extern int			fib6_rules_dump(struct sk_buff *,
+						struct netlink_callback *);
+
 #endif
 #endif
Index: net-2.6.git/net/ipv6/route.c
===================================================================
--- net-2.6.git.orig/net/ipv6/route.c
+++ net-2.6.git/net/ipv6/route.c
@@ -139,6 +139,50 @@ struct rt6_info ip6_null_entry = {
 	.rt6i_ref	= ATOMIC_INIT(1),
 };
 
+#ifdef CONFIG_IPV6_MULTIPLE_TABLES
+
+struct rt6_info ip6_prohibit_entry = {
+	.u = {
+		.dst = {
+			.__refcnt	= ATOMIC_INIT(1),
+			.__use		= 1,
+			.dev		= &loopback_dev,
+			.obsolete	= -1,
+			.error		= -EACCES,
+			.metrics	= { [RTAX_HOPLIMIT - 1] = 255, },
+			.input		= ip6_pkt_discard,
+			.output		= ip6_pkt_discard_out,
+			.ops		= &ip6_dst_ops,
+			.path		= (struct dst_entry*)&ip6_prohibit_entry,
+		}
+	},
+	.rt6i_flags	= (RTF_REJECT | RTF_NONEXTHOP),
+	.rt6i_metric	= ~(u32) 0,
+	.rt6i_ref	= ATOMIC_INIT(1),
+};
+
+struct rt6_info ip6_blk_hole_entry = {
+	.u = {
+		.dst = {
+			.__refcnt	= ATOMIC_INIT(1),
+			.__use		= 1,
+			.dev		= &loopback_dev,
+			.obsolete	= -1,
+			.error		= -EINVAL,
+			.metrics	= { [RTAX_HOPLIMIT - 1] = 255, },
+			.input		= ip6_pkt_discard,
+			.output		= ip6_pkt_discard_out,
+			.ops		= &ip6_dst_ops,
+			.path		= (struct dst_entry*)&ip6_blk_hole_entry,
+		}
+	},
+	.rt6i_flags	= (RTF_REJECT | RTF_NONEXTHOP),
+	.rt6i_metric	= ~(u32) 0,
+	.rt6i_ref	= ATOMIC_INIT(1),
+};
+
+#endif
+
 /* allocate dst with ip6_dst_ops */
 static __inline__ struct rt6_info *ip6_dst_alloc(void)
 {
@@ -2398,10 +2442,16 @@ void __init ip6_route_init(void)
 #ifdef CONFIG_XFRM
 	xfrm6_init();
 #endif
+#ifdef CONFIG_IPV6_MULTIPLE_TABLES
+	fib6_rules_init();
+#endif
 }
 
 void ip6_route_cleanup(void)
 {
+#ifdef CONFIG_IPV6_MULTIPLE_TABLES
+	fib6_rules_cleanup();
+#endif
 #ifdef CONFIG_PROC_FS
 	proc_net_remove("ipv6_route");
 	proc_net_remove("rt6_stats");
Index: net-2.6.git/net/ipv6/ip6_fib.c
===================================================================
--- net-2.6.git.orig/net/ipv6/ip6_fib.c
+++ net-2.6.git/net/ipv6/ip6_fib.c
@@ -159,6 +159,15 @@ static struct fib6_table fib6_main_tbl =
 
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 
+static struct fib6_table fib6_local_tbl = {
+	.tb6_id		= RT6_TABLE_LOCAL,
+	.tb6_lock	= RW_LOCK_UNLOCKED,
+	.tb6_root 	= {
+		.leaf		= &ip6_null_entry,
+		.fn_flags	= RTN_ROOT | RTN_TL_ROOT | RTN_RTINFO,
+	},
+};
+
 #define FIB_TABLE_HASHSZ 256
 static struct hlist_head fib_table_hash[FIB_TABLE_HASHSZ];
 
@@ -228,20 +237,10 @@ struct fib6_table *fib6_get_table(u32 id
 	return NULL;
 }
 
-struct dst_entry *fib6_rule_lookup(struct flowi *fl, int flags,
-				   pol_lookup_t lookup)
-{
-	/*
-	 * TODO: Add rule lookup
-	 */
-	struct fib6_table *table = fib6_get_table(RT6_TABLE_MAIN);
-
-	return (struct dst_entry *) lookup(table, fl, flags);
-}
-
 static void __init fib6_tables_init(void)
 {
 	fib6_link_table(&fib6_main_tbl);
+	fib6_link_table(&fib6_local_tbl);
 }
 
 #else
Index: net-2.6.git/include/net/ip6_route.h
===================================================================
--- net-2.6.git.orig/include/net/ip6_route.h
+++ net-2.6.git/include/net/ip6_route.h
@@ -41,6 +41,11 @@ struct pol_chain {
 
 extern struct rt6_info	ip6_null_entry;
 
+#ifdef CONFIG_IPV6_MULTIPLE_TABLES
+extern struct rt6_info	ip6_prohibit_entry;
+extern struct rt6_info	ip6_blk_hole_entry;
+#endif
+
 extern int ip6_rt_gc_interval;
 
 extern void			ip6_route_input(struct sk_buff *skb);
Index: net-2.6.git/net/ipv6/addrconf.c
===================================================================
--- net-2.6.git.orig/net/ipv6/addrconf.c
+++ net-2.6.git/net/ipv6/addrconf.c
@@ -3370,6 +3370,7 @@ static struct rtnetlink_link inet6_rtnet
 	[RTM_DELROUTE - RTM_BASE] = { .doit	= inet6_rtm_delroute, },
 	[RTM_GETROUTE - RTM_BASE] = { .doit	= inet6_rtm_getroute,
 				      .dumpit	= inet6_dump_fib, },
+	[RTM_GETRULE  - RTM_BASE] = { .dumpit   = fib6_rules_dump,   },
 };
 
 static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
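
The rule walk that fib6_rule_lookup() hands off to fib_rules_lookup() can be
sketched in user space. This is a minimal model, not the kernel code. It
assumes that rules are kept sorted by preference, that a rule whose action
returns -EAGAIN lets the walk continue with the next rule (as
fib4_rule_action in patch 5/5 does on a table miss), and that the walk ends
with -ENETUNREACH when no rule yields a result. The types struct prefix,
struct table and struct rule, and the helpers prefix_equal(), rule_action()
and rules_lookup(), are hypothetical names used only for illustration; each
table is reduced to a single prefix with an integer next hop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical user-space stand-ins; none of these are kernel types. */
enum act { ACT_TO_TBL, ACT_UNREACHABLE, ACT_BLACKHOLE, ACT_PROHIBIT };

struct prefix { uint8_t addr[16]; int plen; };

/* A "table" holding exactly one route: prefix -> next hop id. */
struct table { uint32_t id; struct prefix route; int nexthop; };

struct rule { uint32_t pref; enum act action; uint32_t table; struct prefix dst; };

/* Return 1 if the first plen bits of two 128-bit addresses are equal. */
static int prefix_equal(const uint8_t *a, const uint8_t *b, int plen)
{
	int bytes = plen / 8, bits = plen % 8;

	assert(plen >= 0 && plen <= 128);
	if (memcmp(a, b, (size_t)bytes))
		return 0;
	if (bits) {
		uint8_t mask = (uint8_t)(0xff << (8 - bits));

		if ((a[bytes] ^ b[bytes]) & mask)
			return 0;
	}
	return 1;
}

/*
 * Apply one rule. 0 means a route was found, -EAGAIN means "no result
 * here, try the next rule" (assumed semantics), anything else ends the
 * walk with that error.
 */
static int rule_action(const struct rule *r, const struct table *tbls,
		       size_t ntbls, const uint8_t *dst, int *nexthop)
{
	size_t i;

	switch (r->action) {
	case ACT_TO_TBL:
		break;
	case ACT_UNREACHABLE:
		return -ENETUNREACH;
	case ACT_PROHIBIT:
		return -EACCES;
	case ACT_BLACKHOLE:
	default:
		return -EINVAL;
	}

	for (i = 0; i < ntbls; i++) {
		if (tbls[i].id != r->table)
			continue;
		if (prefix_equal(dst, tbls[i].route.addr, tbls[i].route.plen)) {
			*nexthop = tbls[i].nexthop;
			return 0;
		}
	}
	return -EAGAIN;	/* table missing or no matching route */
}

/* Walk the rules in order; the caller keeps them sorted by pref. */
static int rules_lookup(const struct rule *rules, size_t nrules,
			const struct table *tbls, size_t ntbls,
			const uint8_t *dst, int *nexthop)
{
	size_t i;

	for (i = 0; i < nrules; i++) {
		int err;

		/* Selector check: only the destination prefix is modelled. */
		if (!prefix_equal(dst, rules[i].dst.addr, rules[i].dst.plen))
			continue;

		err = rule_action(&rules[i], tbls, ntbls, dst, nexthop);
		if (err != -EAGAIN)
			return err;
	}
	return -ENETUNREACH;
}
```

With a local rule (pref 0, table 255) ahead of a main rule (pref 0x7FFE,
table 254), a destination that only the main table knows misses in the local
table first and is then resolved by the second rule. A prohibit rule placed
ahead of both ends the walk immediately with -EACCES.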

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RESEND 4/5] [IPV6]: Policy Routing Rules
  2006-07-27 22:40       ` [RESEND " Thomas Graf
@ 2006-07-31 14:55         ` Ville Nuorvala
  0 siblings, 0 replies; 54+ messages in thread
From: Ville Nuorvala @ 2006-07-31 14:55 UTC (permalink / raw)
  To: Thomas Graf; +Cc: David Miller, netdev, usagi-core, yoshfuji, anttit

Thomas Graf wrote:

> Adds support for policy routing rules including a new
> local table for routes with a local destination.

Looks good!

> Signed-off-by: Thomas Graf <tgraf@suug.ch>

Signed-off-by: Ville Nuorvala <vnuorval@tcs.hut.fi>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 5/5] [IPV4]: Use Protocol Independant Policy Routing Rules Framework
  2006-07-26 22:11 [RFC] Multiple IPV6 Routing Tables & Policy Routing Thomas Graf
                   ` (3 preceding siblings ...)
  2006-07-26 22:00 ` [PATCH 4/5] [IPV6]: Policy Routing Rules Thomas Graf
@ 2006-07-26 22:00 ` Thomas Graf
  2006-07-26 22:43   ` David Miller
  2006-07-26 23:47   ` David Miller
  2006-07-28  6:10 ` [RFC] Multiple IPV6 Routing Tables & Policy Routing YOSHIFUJI Hideaki / 吉藤英明
  2006-07-31 11:01 ` Ville Nuorvala
  6 siblings, 2 replies; 54+ messages in thread
From: Thomas Graf @ 2006-07-26 22:00 UTC (permalink / raw)
  To: netdev; +Cc: vnuorval, usagi-core, yoshfuji, davem, anttit

[-- Attachment #1: ip4_fib_rules --]
[-- Type: text/plain, Size: 22068 bytes --]

Signed-off-by: Thomas Graf <tgraf@suug.ch>

Index: net-2.6.git/include/net/ip_fib.h
===================================================================
--- net-2.6.git.orig/include/net/ip_fib.h
+++ net-2.6.git/include/net/ip_fib.h
@@ -18,6 +18,7 @@
 
 #include <net/flow.h>
 #include <linux/seq_file.h>
+#include <net/fib_rules.h>
 
 /* WARNING: The ordering of these elements must match ordering
  *          of RTA_* rtnetlink attribute numbers.
@@ -203,9 +204,8 @@ static inline void fib_select_default(co
 #define ip_fib_main_table (fib_tables[RT_TABLE_MAIN])
 
 extern struct fib_table * fib_tables[RT_TABLE_MAX+1];
-extern int fib_lookup(const struct flowi *flp, struct fib_result *res);
+extern int fib_lookup(struct flowi *flp, struct fib_result *res);
 extern struct fib_table *__fib_new_table(int id);
-extern void fib_rule_put(struct fib_rule *r);
 
 static inline struct fib_table *fib_get_table(int id)
 {
@@ -251,15 +251,15 @@ extern u32  __fib_res_prefsrc(struct fib
 extern struct fib_table *fib_hash_init(int id);
 
 #ifdef CONFIG_IP_MULTIPLE_TABLES
-/* Exported by fib_rules.c */
+extern int fib4_rules_dump(struct sk_buff *skb, struct netlink_callback *cb);
+
+extern void __init fib4_rules_init(void);
+extern void __exit fib4_rules_cleanup(void);
 
-extern int inet_rtm_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg);
-extern int inet_rtm_newrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg);
-extern int inet_dump_rules(struct sk_buff *skb, struct netlink_callback *cb);
 #ifdef CONFIG_NET_CLS_ROUTE
 extern u32 fib_rules_tclass(struct fib_result *res);
 #endif
-extern void fib_rules_init(void);
+
 #endif
 
 static inline void fib_combine_itag(u32 *itag, struct fib_result *res)
Index: net-2.6.git/net/ipv4/Kconfig
===================================================================
--- net-2.6.git.orig/net/ipv4/Kconfig
+++ net-2.6.git/net/ipv4/Kconfig
@@ -88,6 +88,7 @@ config IP_FIB_HASH
 config IP_MULTIPLE_TABLES
 	bool "IP: policy routing"
 	depends on IP_ADVANCED_ROUTER
+	select FIB_RULES
 	---help---
 	  Normally, a router decides what to do with a received packet based
 	  solely on the packet's final destination address. If you say Y here,
Index: net-2.6.git/net/ipv4/devinet.c
===================================================================
--- net-2.6.git.orig/net/ipv4/devinet.c
+++ net-2.6.git/net/ipv4/devinet.c
@@ -1153,9 +1153,7 @@ static struct rtnetlink_link inet_rtnetl
 	[RTM_GETROUTE - RTM_BASE] = { .doit	= inet_rtm_getroute,
 				      .dumpit	= inet_dump_fib,	},
 #ifdef CONFIG_IP_MULTIPLE_TABLES
-	[RTM_NEWRULE  - RTM_BASE] = { .doit	= inet_rtm_newrule,	},
-	[RTM_DELRULE  - RTM_BASE] = { .doit	= inet_rtm_delrule,	},
-	[RTM_GETRULE  - RTM_BASE] = { .dumpit	= inet_dump_rules,	},
+	[RTM_GETRULE  - RTM_BASE] = { .dumpit	= fib4_rules_dump,	},
 #endif
 };
 
Index: net-2.6.git/net/ipv4/fib_rules.c
===================================================================
--- net-2.6.git.orig/net/ipv4/fib_rules.c
+++ net-2.6.git/net/ipv4/fib_rules.c
@@ -5,9 +5,8 @@
  *
  *		IPv4 Forwarding Information Base: policy rules.
  *
- * Version:	$Id: fib_rules.c,v 1.17 2001/10/31 21:55:54 davem Exp $
- *
  * Authors:	Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
+ * 		Thomas Graf <tgraf@suug.ch>
  *
  *		This program is free software; you can redistribute it and/or
  *		modify it under the terms of the GNU General Public License
@@ -19,129 +18,154 @@
  *		Marc Boucher	:	routing by fwmark
  */
 
-#include <asm/uaccess.h>
-#include <asm/system.h>
-#include <linux/bitops.h>
 #include <linux/types.h>
 #include <linux/kernel.h>
-#include <linux/sched.h>
-#include <linux/mm.h>
-#include <linux/string.h>
-#include <linux/socket.h>
-#include <linux/sockios.h>
-#include <linux/errno.h>
-#include <linux/in.h>
-#include <linux/inet.h>
-#include <linux/inetdevice.h>
 #include <linux/netdevice.h>
-#include <linux/if_arp.h>
-#include <linux/proc_fs.h>
-#include <linux/skbuff.h>
 #include <linux/netlink.h>
+#include <linux/inetdevice.h>
 #include <linux/init.h>
 #include <linux/list.h>
 #include <linux/rcupdate.h>
-
 #include <net/ip.h>
-#include <net/protocol.h>
 #include <net/route.h>
 #include <net/tcp.h>
-#include <net/sock.h>
 #include <net/ip_fib.h>
+#include <net/fib_rules.h>
 
-#define FRprintk(a...)
+static struct fib_rules_ops fib4_rules_ops;
 
-struct fib_rule
+struct fib4_rule
 {
-	struct hlist_node hlist;
-	atomic_t	r_clntref;
-	u32		r_preference;
-	unsigned char	r_table;
-	unsigned char	r_action;
-	unsigned char	r_dst_len;
-	unsigned char	r_src_len;
-	u32		r_src;
-	u32		r_srcmask;
-	u32		r_dst;
-	u32		r_dstmask;
-	u32		r_srcmap;
-	u8		r_flags;
-	u8		r_tos;
+	struct fib_rule		common;
+	u8			dst_len;
+	u8			src_len;
+	u8			tos;
+	u32			src;
+	u32			srcmask;
+	u32			dst;
+	u32			dstmask;
 #ifdef CONFIG_IP_ROUTE_FWMARK
-	u32		r_fwmark;
+	u32			fwmark;
 #endif
-	int		r_ifindex;
 #ifdef CONFIG_NET_CLS_ROUTE
-	__u32		r_tclassid;
+	u32			tclassid;
 #endif
-	char		r_ifname[IFNAMSIZ];
-	int		r_dead;
-	struct		rcu_head rcu;
 };
 
-static struct fib_rule default_rule = {
-	.r_clntref =	ATOMIC_INIT(2),
-	.r_preference =	0x7FFF,
-	.r_table =	RT_TABLE_DEFAULT,
-	.r_action =	RTN_UNICAST,
+static struct fib4_rule default_rule = {
+	.common = {
+		.refcnt =	ATOMIC_INIT(2),
+		.pref =		0x7FFF,
+		.table =	RT_TABLE_DEFAULT,
+		.action =	FR_ACT_TO_TBL,
+	},
 };
 
-static struct fib_rule main_rule = {
-	.r_clntref =	ATOMIC_INIT(2),
-	.r_preference =	0x7FFE,
-	.r_table =	RT_TABLE_MAIN,
-	.r_action =	RTN_UNICAST,
+static struct fib4_rule main_rule = {
+	.common = {
+		.refcnt =	ATOMIC_INIT(2),
+		.pref =		0x7FFE,
+		.table =	RT_TABLE_MAIN,
+		.action =	FR_ACT_TO_TBL,
+	},
 };
 
-static struct fib_rule local_rule = {
-	.r_clntref =	ATOMIC_INIT(2),
-	.r_table =	RT_TABLE_LOCAL,
-	.r_action =	RTN_UNICAST,
+static struct fib4_rule local_rule = {
+	.common = {
+		.refcnt =	ATOMIC_INIT(2),
+		.table =	RT_TABLE_LOCAL,
+		.action =	FR_ACT_TO_TBL,
+		.flags =	FIB_RULE_PERMANENT,
+	},
 };
 
-static struct hlist_head fib_rules;
+static LIST_HEAD(fib4_rules);
+
+#ifdef CONFIG_NET_CLS_ROUTE
+u32 fib_rules_tclass(struct fib_result *res)
+{
+	return res->r ? ((struct fib4_rule *) res->r)->tclassid : 0;
+}
+#endif
 
-/* writer func called from netlink -- rtnl_sem hold*/
+int fib_lookup(struct flowi *flp, struct fib_result *res)
+{
+	struct fib_lookup_arg arg = {
+		.result = res,
+	};
+	int err;
 
-static void rtmsg_rule(int, struct fib_rule *);
+	err = fib_rules_lookup(&fib4_rules_ops, flp, 0, &arg);
+	res->r = arg.rule;
 
-int inet_rtm_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
+	return err;
+}
+
+int fib4_rule_action(struct fib_rule *rule, struct flowi *flp, int flags,
+		     struct fib_lookup_arg *arg)
 {
-	struct rtattr **rta = arg;
-	struct rtmsg *rtm = NLMSG_DATA(nlh);
-	struct fib_rule *r;
-	struct hlist_node *node;
-	int err = -ESRCH;
-
-	hlist_for_each_entry(r, node, &fib_rules, hlist) {
-		if ((!rta[RTA_SRC-1] || memcmp(RTA_DATA(rta[RTA_SRC-1]), &r->r_src, 4) == 0) &&
-		    rtm->rtm_src_len == r->r_src_len &&
-		    rtm->rtm_dst_len == r->r_dst_len &&
-		    (!rta[RTA_DST-1] || memcmp(RTA_DATA(rta[RTA_DST-1]), &r->r_dst, 4) == 0) &&
-		    rtm->rtm_tos == r->r_tos &&
-#ifdef CONFIG_IP_ROUTE_FWMARK
-		    (!rta[RTA_PROTOINFO-1] || memcmp(RTA_DATA(rta[RTA_PROTOINFO-1]), &r->r_fwmark, 4) == 0) &&
-#endif
-		    (!rtm->rtm_type || rtm->rtm_type == r->r_action) &&
-		    (!rta[RTA_PRIORITY-1] || memcmp(RTA_DATA(rta[RTA_PRIORITY-1]), &r->r_preference, 4) == 0) &&
-		    (!rta[RTA_IIF-1] || rtattr_strcmp(rta[RTA_IIF-1], r->r_ifname) == 0) &&
-		    (!rtm->rtm_table || (r && rtm->rtm_table == r->r_table))) {
-			err = -EPERM;
-			if (r == &local_rule)
-				break;
-
-			hlist_del_rcu(&r->hlist);
-			r->r_dead = 1;
-			rtmsg_rule(RTM_DELRULE, r);
-			fib_rule_put(r);
-			err = 0;
-			break;
-		}
+	int err = -EAGAIN;
+	struct fib_table *tbl;
+
+	switch (rule->action) {
+	case FR_ACT_TO_TBL:
+		break;
+
+	case FR_ACT_UNREACHABLE:
+		err = -ENETUNREACH;
+		goto errout;
+
+	case FR_ACT_PROHIBIT:
+		err = -EACCES;
+		goto errout;
+
+	case FR_ACT_BLACKHOLE:
+	default:
+		err = -EINVAL;
+		goto errout;
 	}
+
+	if ((tbl = fib_get_table(rule->table)) == NULL)
+		goto errout;
+
+	err = tbl->tb_lookup(tbl, flp, (struct fib_result *) arg->result);
+	if (err > 0)
+		err = -EAGAIN;
+errout:
 	return err;
 }
 
-/* Allocate new unique table id */
+
+void fib_select_default(const struct flowi *flp, struct fib_result *res)
+{
+	if (res->r && res->r->action == FR_ACT_TO_TBL &&
+	    FIB_RES_GW(*res) && FIB_RES_NH(*res).nh_scope == RT_SCOPE_LINK) {
+		struct fib_table *tb;
+		if ((tb = fib_get_table(res->r->table)) != NULL)
+			tb->tb_select_default(tb, flp, res);
+	}
+}
+
+static int fib4_rule_match(struct fib_rule *rule, struct flowi *fl, int flags)
+{
+	struct fib4_rule *r = (struct fib4_rule *) rule;
+	u32 daddr = fl->fl4_dst;
+	u32 saddr = fl->fl4_src;
+
+	if (((saddr ^ r->src) & r->srcmask) ||
+	    ((daddr ^ r->dst) & r->dstmask))
+		return 0;
+
+	if (r->tos && (r->tos != fl->fl4_tos))
+		return 0;
+
+#ifdef CONFIG_IP_ROUTE_FWMARK
+	if (r->fwmark && (r->fwmark != fl->fl4_fwmark))
+		return 0;
+#endif
+
+	return 1;
+}
 
 static struct fib_table *fib_empty_table(void)
 {
@@ -153,330 +177,177 @@ static struct fib_table *fib_empty_table
 	return NULL;
 }
 
-static inline void fib_rule_put_rcu(struct rcu_head *head)
-{
-	struct fib_rule *r = container_of(head, struct fib_rule, rcu);
-	kfree(r);
-}
-
-void fib_rule_put(struct fib_rule *r)
-{
-	if (atomic_dec_and_test(&r->r_clntref)) {
-		if (r->r_dead)
-			call_rcu(&r->rcu, fib_rule_put_rcu);
-		else
-			printk("Freeing alive rule %p\n", r);
-	}
-}
+static struct nla_policy fib4_rule_policy[FRA_MAX+1] __read_mostly = {
+	[FRA_IFNAME]	= { .type = NLA_STRING },
+	[FRA_PRIORITY]	= { .type = NLA_U32 },
+	[FRA_SRC]	= { .type = NLA_U32 },
+	[FRA_DST]	= { .type = NLA_U32 },
+	[FRA_FWMARK]	= { .type = NLA_U32 },
+	[FRA_FLOW]	= { .type = NLA_U32 },
+};
 
-/* writer func called from netlink -- rtnl_sem hold*/
+static int fib4_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
+			       struct nlmsghdr *nlh, struct fib_rule_hdr *frh,
+			       struct nlattr **tb)
+{
+	int err = -EINVAL;
+	struct fib4_rule *rule4 = (struct fib4_rule *) rule;
+
+	if (frh->src_len > 32 || frh->dst_len > 32 ||
+	    (frh->tos & ~IPTOS_TOS_MASK))
+		goto errout;
+
+	if (rule->table == RT_TABLE_UNSPEC) {
+		if (rule->action == FR_ACT_TO_TBL) {
+			struct fib_table *table;
+
+			table = fib_empty_table();
+			if (table == NULL) {
+				err = -ENOBUFS;
+				goto errout;
+			}
 
-int inet_rtm_newrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
-{
-	struct rtattr **rta = arg;
-	struct rtmsg *rtm = NLMSG_DATA(nlh);
-	struct fib_rule *r, *new_r, *last = NULL;
-	struct hlist_node *node = NULL;
-	unsigned char table_id;
-
-	if (rtm->rtm_src_len > 32 || rtm->rtm_dst_len > 32 ||
-	    (rtm->rtm_tos & ~IPTOS_TOS_MASK))
-		return -EINVAL;
-
-	if (rta[RTA_IIF-1] && RTA_PAYLOAD(rta[RTA_IIF-1]) > IFNAMSIZ)
-		return -EINVAL;
-
-	table_id = rtm->rtm_table;
-	if (table_id == RT_TABLE_UNSPEC) {
-		struct fib_table *table;
-		if (rtm->rtm_type == RTN_UNICAST) {
-			if ((table = fib_empty_table()) == NULL)
-				return -ENOBUFS;
-			table_id = table->tb_id;
+			rule->table = table->tb_id;
 		}
 	}
 
-	new_r = kmalloc(sizeof(*new_r), GFP_KERNEL);
-	if (!new_r)
-		return -ENOMEM;
-	memset(new_r, 0, sizeof(*new_r));
-
-	if (rta[RTA_SRC-1])
-		memcpy(&new_r->r_src, RTA_DATA(rta[RTA_SRC-1]), 4);
-	if (rta[RTA_DST-1])
-		memcpy(&new_r->r_dst, RTA_DATA(rta[RTA_DST-1]), 4);
-	if (rta[RTA_GATEWAY-1])
-		memcpy(&new_r->r_srcmap, RTA_DATA(rta[RTA_GATEWAY-1]), 4);
-	new_r->r_src_len = rtm->rtm_src_len;
-	new_r->r_dst_len = rtm->rtm_dst_len;
-	new_r->r_srcmask = inet_make_mask(rtm->rtm_src_len);
-	new_r->r_dstmask = inet_make_mask(rtm->rtm_dst_len);
-	new_r->r_tos = rtm->rtm_tos;
+	if (tb[FRA_SRC])
+		rule4->src = nla_get_u32(tb[FRA_SRC]);
+
+	if (tb[FRA_DST])
+		rule4->dst = nla_get_u32(tb[FRA_DST]);
+
 #ifdef CONFIG_IP_ROUTE_FWMARK
-	if (rta[RTA_PROTOINFO-1])
-		memcpy(&new_r->r_fwmark, RTA_DATA(rta[RTA_PROTOINFO-1]), 4);
+	if (tb[FRA_FWMARK])
+		rule4->fwmark = nla_get_u32(tb[FRA_FWMARK]);
 #endif
-	new_r->r_action = rtm->rtm_type;
-	new_r->r_flags = rtm->rtm_flags;
-	if (rta[RTA_PRIORITY-1])
-		memcpy(&new_r->r_preference, RTA_DATA(rta[RTA_PRIORITY-1]), 4);
-	new_r->r_table = table_id;
-	if (rta[RTA_IIF-1]) {
-		struct net_device *dev;
-		rtattr_strlcpy(new_r->r_ifname, rta[RTA_IIF-1], IFNAMSIZ);
-		new_r->r_ifindex = -1;
-		dev = __dev_get_by_name(new_r->r_ifname);
-		if (dev)
-			new_r->r_ifindex = dev->ifindex;
-	}
+
 #ifdef CONFIG_NET_CLS_ROUTE
-	if (rta[RTA_FLOW-1])
-		memcpy(&new_r->r_tclassid, RTA_DATA(rta[RTA_FLOW-1]), 4);
+	if (tb[FRA_FLOW])
+		rule4->tclassid = nla_get_u32(tb[FRA_FLOW]);
 #endif
-	r = container_of(fib_rules.first, struct fib_rule, hlist);
-
-	if (!new_r->r_preference) {
-		if (r && r->hlist.next != NULL) {
-			r = container_of(r->hlist.next, struct fib_rule, hlist);
-			if (r->r_preference)
-				new_r->r_preference = r->r_preference - 1;
-		}
-	}
-
-	hlist_for_each_entry(r, node, &fib_rules, hlist) {
-		if (r->r_preference > new_r->r_preference)
-			break;
-		last = r;
-	}
-	atomic_inc(&new_r->r_clntref);
 
-	if (last)
-		hlist_add_after_rcu(&last->hlist, &new_r->hlist);
-	else
-		hlist_add_before_rcu(&new_r->hlist, &r->hlist);
+	rule4->src_len = frh->src_len;
+	rule4->srcmask = inet_make_mask(rule4->src_len);
+	rule4->dst_len = frh->dst_len;
+	rule4->dstmask = inet_make_mask(rule4->dst_len);
+	rule4->tos = frh->tos;
 
-	rtmsg_rule(RTM_NEWRULE, new_r);
-	return 0;
+	err = 0;
+errout:
+	return err;
 }
 
-#ifdef CONFIG_NET_CLS_ROUTE
-u32 fib_rules_tclass(struct fib_result *res)
+static int fib4_rule_compare(struct fib_rule *rule, struct fib_rule_hdr *frh,
+			     struct nlattr **tb)
 {
-	if (res->r)
-		return res->r->r_tclassid;
-	return 0;
-}
-#endif
+	struct fib4_rule *rule4 = (struct fib4_rule *) rule;
 
-/* callers should hold rtnl semaphore */
+	if (frh->src_len && (rule4->src_len != frh->src_len))
+		return 0;
 
-static void fib_rules_detach(struct net_device *dev)
-{
-	struct hlist_node *node;
-	struct fib_rule *r;
+	if (frh->dst_len && (rule4->dst_len != frh->dst_len))
+		return 0;
 
-	hlist_for_each_entry(r, node, &fib_rules, hlist) {
-		if (r->r_ifindex == dev->ifindex)
-			r->r_ifindex = -1;
+	if (frh->tos && (rule4->tos != frh->tos))
+		return 0;
 
-	}
-}
-
-/* callers should hold rtnl semaphore */
-
-static void fib_rules_attach(struct net_device *dev)
-{
-	struct hlist_node *node;
-	struct fib_rule *r;
-
-	hlist_for_each_entry(r, node, &fib_rules, hlist) {
-		if (r->r_ifindex == -1 && strcmp(dev->name, r->r_ifname) == 0)
-			r->r_ifindex = dev->ifindex;
-	}
-}
-
-int fib_lookup(const struct flowi *flp, struct fib_result *res)
-{
-	int err;
-	struct fib_rule *r, *policy;
-	struct fib_table *tb;
-	struct hlist_node *node;
-
-	u32 daddr = flp->fl4_dst;
-	u32 saddr = flp->fl4_src;
-
-FRprintk("Lookup: %u.%u.%u.%u <- %u.%u.%u.%u ",
-	NIPQUAD(flp->fl4_dst), NIPQUAD(flp->fl4_src));
-
-	rcu_read_lock();
-
-	hlist_for_each_entry_rcu(r, node, &fib_rules, hlist) {
-		if (((saddr^r->r_src) & r->r_srcmask) ||
-		    ((daddr^r->r_dst) & r->r_dstmask) ||
-		    (r->r_tos && r->r_tos != flp->fl4_tos) ||
 #ifdef CONFIG_IP_ROUTE_FWMARK
-		    (r->r_fwmark && r->r_fwmark != flp->fl4_fwmark) ||
+	if (tb[FRA_FWMARK] && (rule4->fwmark != nla_get_u32(tb[FRA_FWMARK])))
+		return 0;
 #endif
-		    (r->r_ifindex && r->r_ifindex != flp->iif))
-			continue;
-
-FRprintk("tb %d r %d ", r->r_table, r->r_action);
-		switch (r->r_action) {
-		case RTN_UNICAST:
-			policy = r;
-			break;
-		case RTN_UNREACHABLE:
-			rcu_read_unlock();
-			return -ENETUNREACH;
-		default:
-		case RTN_BLACKHOLE:
-			rcu_read_unlock();
-			return -EINVAL;
-		case RTN_PROHIBIT:
-			rcu_read_unlock();
-			return -EACCES;
-		}
 
-		if ((tb = fib_get_table(r->r_table)) == NULL)
-			continue;
-		err = tb->tb_lookup(tb, flp, res);
-		if (err == 0) {
-			res->r = policy;
-			if (policy)
-				atomic_inc(&policy->r_clntref);
-			rcu_read_unlock();
-			return 0;
-		}
-		if (err < 0 && err != -EAGAIN) {
-			rcu_read_unlock();
-			return err;
-		}
-	}
-FRprintk("FAILURE\n");
-	rcu_read_unlock();
-	return -ENETUNREACH;
-}
+#ifdef CONFIG_NET_CLS_ROUTE
+	if (tb[FRA_FLOW] && (rule4->tclassid != nla_get_u32(tb[FRA_FLOW])))
+		return 0;
+#endif
 
-void fib_select_default(const struct flowi *flp, struct fib_result *res)
-{
-	if (res->r && res->r->r_action == RTN_UNICAST &&
-	    FIB_RES_GW(*res) && FIB_RES_NH(*res).nh_scope == RT_SCOPE_LINK) {
-		struct fib_table *tb;
-		if ((tb = fib_get_table(res->r->r_table)) != NULL)
-			tb->tb_select_default(tb, flp, res);
-	}
-}
+	if (tb[FRA_SRC] && (rule4->src != nla_get_u32(tb[FRA_SRC])))
+		return 0;
 
-static int fib_rules_event(struct notifier_block *this, unsigned long event, void *ptr)
-{
-	struct net_device *dev = ptr;
+	if (tb[FRA_DST] && (rule4->dst != nla_get_u32(tb[FRA_DST])))
+		return 0;
 
-	if (event == NETDEV_UNREGISTER)
-		fib_rules_detach(dev);
-	else if (event == NETDEV_REGISTER)
-		fib_rules_attach(dev);
-	return NOTIFY_DONE;
+	return 1;
 }
 
+static int fib4_rule_fill(struct fib_rule *rule, struct sk_buff *skb,
+			  struct nlmsghdr *nlh, struct fib_rule_hdr *frh)
+{
+	struct fib4_rule *rule4 = (struct fib4_rule *) rule;
 
-static struct notifier_block fib_rules_notifier = {
-	.notifier_call =fib_rules_event,
-};
+	frh->family = AF_INET;
+	frh->dst_len = rule4->dst_len;
+	frh->src_len = rule4->src_len;
+	frh->tos = rule4->tos;
 
-static __inline__ int inet_fill_rule(struct sk_buff *skb,
-				     struct fib_rule *r,
-				     u32 pid, u32 seq, int event,
-				     unsigned int flags)
-{
-	struct rtmsg *rtm;
-	struct nlmsghdr  *nlh;
-	unsigned char	 *b = skb->tail;
-
-	nlh = NLMSG_NEW(skb, pid, seq, event, sizeof(*rtm), flags);
-	rtm = NLMSG_DATA(nlh);
-	rtm->rtm_family = AF_INET;
-	rtm->rtm_dst_len = r->r_dst_len;
-	rtm->rtm_src_len = r->r_src_len;
-	rtm->rtm_tos = r->r_tos;
 #ifdef CONFIG_IP_ROUTE_FWMARK
-	if (r->r_fwmark)
-		RTA_PUT(skb, RTA_PROTOINFO, 4, &r->r_fwmark);
+	if (rule4->fwmark)
+		NLA_PUT_U32(skb, FRA_FWMARK, rule4->fwmark);
 #endif
-	rtm->rtm_table = r->r_table;
-	rtm->rtm_protocol = 0;
-	rtm->rtm_scope = 0;
-	rtm->rtm_type = r->r_action;
-	rtm->rtm_flags = r->r_flags;
-
-	if (r->r_dst_len)
-		RTA_PUT(skb, RTA_DST, 4, &r->r_dst);
-	if (r->r_src_len)
-		RTA_PUT(skb, RTA_SRC, 4, &r->r_src);
-	if (r->r_ifname[0])
-		RTA_PUT(skb, RTA_IIF, IFNAMSIZ, &r->r_ifname);
-	if (r->r_preference)
-		RTA_PUT(skb, RTA_PRIORITY, 4, &r->r_preference);
-	if (r->r_srcmap)
-		RTA_PUT(skb, RTA_GATEWAY, 4, &r->r_srcmap);
+
+	if (rule4->dst_len)
+		NLA_PUT_U32(skb, FRA_DST, rule4->dst);
+
+	if (rule4->src_len)
+		NLA_PUT_U32(skb, FRA_SRC, rule4->src);
+
 #ifdef CONFIG_NET_CLS_ROUTE
-	if (r->r_tclassid)
-		RTA_PUT(skb, RTA_FLOW, 4, &r->r_tclassid);
+	if (rule4->tclassid)
+		NLA_PUT_U32(skb, FRA_FLOW, rule4->tclassid);
 #endif
-	nlh->nlmsg_len = skb->tail - b;
-	return skb->len;
+	return 0;
 
-nlmsg_failure:
-rtattr_failure:
-	skb_trim(skb, b - skb->data);
-	return -1;
+nla_put_failure:
+	return -ENOBUFS;
 }
 
-/* callers should hold rtnl semaphore */
+int fib4_rules_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	return fib_rules_dump(skb, cb, AF_INET);
+}
 
-static void rtmsg_rule(int event, struct fib_rule *r)
+static u32 fib4_rule_default_pref(void)
 {
-	int size = NLMSG_SPACE(sizeof(struct rtmsg) + 128);
-	struct sk_buff *skb = alloc_skb(size, GFP_KERNEL);
+	struct list_head *pos;
+	struct fib_rule *rule;
 
-	if (!skb)
-		netlink_set_err(rtnl, 0, RTNLGRP_IPV4_RULE, ENOBUFS);
-	else if (inet_fill_rule(skb, r, 0, 0, event, 0) < 0) {
-		kfree_skb(skb);
-		netlink_set_err(rtnl, 0, RTNLGRP_IPV4_RULE, EINVAL);
-	} else {
-		netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV4_RULE, GFP_KERNEL);
+	if (!list_empty(&fib4_rules)) {
+		pos = fib4_rules.next;
+		if (pos->next != &fib4_rules) {
+			rule = list_entry(pos->next, struct fib_rule, list);
+			if (rule->pref)
+				return rule->pref - 1;
+		}
 	}
+
+	return 0;
 }
 
-int inet_dump_rules(struct sk_buff *skb, struct netlink_callback *cb)
+static struct fib_rules_ops fib4_rules_ops = {
+	.family		= AF_INET,
+	.rule_size	= sizeof(struct fib4_rule),
+	.action		= fib4_rule_action,
+	.match		= fib4_rule_match,
+	.configure	= fib4_rule_configure,
+	.compare	= fib4_rule_compare,
+	.fill		= fib4_rule_fill,
+	.default_pref	= fib4_rule_default_pref,
+	.policy		= fib4_rule_policy,
+	.rules_list	= &fib4_rules,
+	.owner		= THIS_MODULE,
+};
+
+void __init fib4_rules_init(void)
 {
-	int idx = 0;
-	int s_idx = cb->args[0];
-	struct fib_rule *r;
-	struct hlist_node *node;
-
-	rcu_read_lock();
-	hlist_for_each_entry(r, node, &fib_rules, hlist) {
-		if (idx < s_idx)
-			goto next;
-		if (inet_fill_rule(skb, r, NETLINK_CB(cb->skb).pid,
-				   cb->nlh->nlmsg_seq,
-				   RTM_NEWRULE, NLM_F_MULTI) < 0)
-			break;
-next:
-		idx++;
-	}
-	rcu_read_unlock();
-	cb->args[0] = idx;
+	list_add_tail(&local_rule.common.list, &fib4_rules);
+	list_add_tail(&main_rule.common.list, &fib4_rules);
+	list_add_tail(&default_rule.common.list, &fib4_rules);
 
-	return skb->len;
+	fib_rules_register(&fib4_rules_ops);
 }
 
-void __init fib_rules_init(void)
+void __exit fib4_rules_cleanup(void)
 {
-	INIT_HLIST_HEAD(&fib_rules);
-	hlist_add_head(&local_rule.hlist, &fib_rules);
-	hlist_add_after(&local_rule.hlist, &main_rule.hlist);
-	hlist_add_after(&main_rule.hlist, &default_rule.hlist);
-	register_netdevice_notifier(&fib_rules_notifier);
+	fib_rules_unregister(&fib4_rules_ops);
 }
Index: net-2.6.git/net/ipv4/fib_frontend.c
===================================================================
--- net-2.6.git.orig/net/ipv4/fib_frontend.c
+++ net-2.6.git/net/ipv4/fib_frontend.c
@@ -656,7 +656,7 @@ void __init ip_fib_init(void)
 	ip_fib_local_table = fib_hash_init(RT_TABLE_LOCAL);
 	ip_fib_main_table  = fib_hash_init(RT_TABLE_MAIN);
 #else
-	fib_rules_init();
+	fib4_rules_init();
 #endif
 
 	register_netdevice_notifier(&fib_netdev_notifier);
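
A note for readers following the patch: the source and destination test in fib4_rule_match() relies on the fact that two addresses share a prefix exactly when their XOR has no bits set under the prefix mask. The userspace sketch below illustrates only that check and is not part of the patch. The names make_mask() and prefix_mismatch() are hypothetical, and addresses are assumed to be in host byte order for simplicity, whereas the patch builds its masks with inet_make_mask().

```c
#include <stdint.h>

/* Hypothetical helper: build a host-byte-order netmask for a prefix
 * length of 0..32. Shifting a 32-bit value by 32 is undefined in C,
 * so a prefix length of 0 is handled separately. */
static uint32_t make_mask(unsigned int prefix_len)
{
	if (prefix_len == 0)
		return 0;
	return (uint32_t)(~(uint32_t)0 << (32 - prefix_len));
}

/* Returns non-zero when addr does NOT fall inside rule_addr/prefix_len,
 * mirroring the (addr ^ rule) & mask test used in the patch. */
static int prefix_mismatch(uint32_t addr, uint32_t rule_addr,
			   unsigned int prefix_len)
{
	return ((addr ^ rule_addr) & make_mask(prefix_len)) != 0;
}
```

With a prefix length of 0 the mask is 0, so every address matches; this is why a rule without a source or destination selector applies to all flows.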



* Re: [PATCH 5/5] [IPV4]: Use Protocol Independant Policy Routing Rules Framework
  2006-07-26 22:00 ` [PATCH 5/5] [IPV4]: Use Protocol Independant Policy Routing Rules Framework Thomas Graf
@ 2006-07-26 22:43   ` David Miller
  2006-07-26 23:47   ` David Miller
  1 sibling, 0 replies; 54+ messages in thread
From: David Miller @ 2006-07-26 22:43 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, vnuorval, usagi-core, yoshfuji, anttit

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 27 Jul 2006 00:00:05 +0200

> Signed-off-by: Thomas Graf <tgraf@suug.ch>

Like the IPv6 counterpart, this looks fine too.

Signed-off-by: David S. Miller <davem@davemloft.net>


* Re: [PATCH 5/5] [IPV4]: Use Protocol Independant Policy Routing Rules Framework
  2006-07-26 22:00 ` [PATCH 5/5] [IPV4]: Use Protocol Independant Policy Routing Rules Framework Thomas Graf
  2006-07-26 22:43   ` David Miller
@ 2006-07-26 23:47   ` David Miller
  2006-07-27 22:40     ` [RESEND " Thomas Graf
  1 sibling, 1 reply; 54+ messages in thread
From: David Miller @ 2006-07-26 23:47 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, vnuorval, usagi-core, yoshfuji, anttit

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 27 Jul 2006 00:00:05 +0200

> Index: net-2.6.git/net/ipv4/fib_rules.c
> ===================================================================
> --- net-2.6.git.orig/net/ipv4/fib_rules.c
> +++ net-2.6.git/net/ipv4/fib_rules.c
 ...
> -	new_r = kmalloc(sizeof(*new_r), GFP_KERNEL);
> -	if (!new_r)
> -		return -ENOMEM;
> -	memset(new_r, 0, sizeof(*new_r));

When you respin these patches, please regenerate them against
the current tree which now uses kzalloc() here.

Thanks.
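
For context on the request above: kzalloc(size, flags) returns zero-filled memory, so it replaces the kmalloc() plus memset() pair quoted from the old code. A userspace analogue is sketched below. The names struct demo_rule, alloc_rule_two_step() and alloc_rule_zeroed() are hypothetical, and malloc()/calloc() stand in for the kernel allocators.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a rule structure; not the kernel's type. */
struct demo_rule {
	unsigned int pref;
	unsigned char table;
	unsigned char action;
};

/* Old pattern: allocate, then clear (kmalloc() + memset() in the kernel). */
static struct demo_rule *alloc_rule_two_step(void)
{
	struct demo_rule *r = malloc(sizeof(*r));

	if (!r)
		return NULL;
	memset(r, 0, sizeof(*r));
	return r;
}

/* New pattern: a single call that returns zeroed memory
 * (kzalloc() in the kernel, calloc() in this sketch). */
static struct demo_rule *alloc_rule_zeroed(void)
{
	return calloc(1, sizeof(struct demo_rule));
}
```

Both helpers return an object whose fields are all zero; the single-call form simply removes the chance of forgetting the memset().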


* [RESEND 5/5] [IPV4]: Use Protocol Independant Policy Routing Rules Framework
  2006-07-26 23:47   ` David Miller
@ 2006-07-27 22:40     ` Thomas Graf
  0 siblings, 0 replies; 54+ messages in thread
From: Thomas Graf @ 2006-07-27 22:40 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, vnuorval, usagi-core, yoshfuji, anttit


Signed-off-by: Thomas Graf <tgraf@suug.ch>

Index: net-2.6.git/include/net/ip_fib.h
===================================================================
--- net-2.6.git.orig/include/net/ip_fib.h
+++ net-2.6.git/include/net/ip_fib.h
@@ -18,6 +18,7 @@
 
 #include <net/flow.h>
 #include <linux/seq_file.h>
+#include <net/fib_rules.h>
 
 /* WARNING: The ordering of these elements must match ordering
  *          of RTA_* rtnetlink attribute numbers.
@@ -203,9 +204,8 @@ static inline void fib_select_default(co
 #define ip_fib_main_table (fib_tables[RT_TABLE_MAIN])
 
 extern struct fib_table * fib_tables[RT_TABLE_MAX+1];
-extern int fib_lookup(const struct flowi *flp, struct fib_result *res);
+extern int fib_lookup(struct flowi *flp, struct fib_result *res);
 extern struct fib_table *__fib_new_table(int id);
-extern void fib_rule_put(struct fib_rule *r);
 
 static inline struct fib_table *fib_get_table(int id)
 {
@@ -251,15 +251,15 @@ extern u32  __fib_res_prefsrc(struct fib
 extern struct fib_table *fib_hash_init(int id);
 
 #ifdef CONFIG_IP_MULTIPLE_TABLES
-/* Exported by fib_rules.c */
+extern int fib4_rules_dump(struct sk_buff *skb, struct netlink_callback *cb);
+
+extern void __init fib4_rules_init(void);
+extern void __exit fib4_rules_cleanup(void);
 
-extern int inet_rtm_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg);
-extern int inet_rtm_newrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg);
-extern int inet_dump_rules(struct sk_buff *skb, struct netlink_callback *cb);
 #ifdef CONFIG_NET_CLS_ROUTE
 extern u32 fib_rules_tclass(struct fib_result *res);
 #endif
-extern void fib_rules_init(void);
+
 #endif
 
 static inline void fib_combine_itag(u32 *itag, struct fib_result *res)
Index: net-2.6.git/net/ipv4/Kconfig
===================================================================
--- net-2.6.git.orig/net/ipv4/Kconfig
+++ net-2.6.git/net/ipv4/Kconfig
@@ -88,6 +88,7 @@ config IP_FIB_HASH
 config IP_MULTIPLE_TABLES
 	bool "IP: policy routing"
 	depends on IP_ADVANCED_ROUTER
+	select FIB_RULES
 	---help---
 	  Normally, a router decides what to do with a received packet based
 	  solely on the packet's final destination address. If you say Y here,
Index: net-2.6.git/net/ipv4/devinet.c
===================================================================
--- net-2.6.git.orig/net/ipv4/devinet.c
+++ net-2.6.git/net/ipv4/devinet.c
@@ -1151,9 +1151,7 @@ static struct rtnetlink_link inet_rtnetl
 	[RTM_GETROUTE - RTM_BASE] = { .doit	= inet_rtm_getroute,
 				      .dumpit	= inet_dump_fib,	},
 #ifdef CONFIG_IP_MULTIPLE_TABLES
-	[RTM_NEWRULE  - RTM_BASE] = { .doit	= inet_rtm_newrule,	},
-	[RTM_DELRULE  - RTM_BASE] = { .doit	= inet_rtm_delrule,	},
-	[RTM_GETRULE  - RTM_BASE] = { .dumpit	= inet_dump_rules,	},
+	[RTM_GETRULE  - RTM_BASE] = { .dumpit	= fib4_rules_dump,	},
 #endif
 };
 
Index: net-2.6.git/net/ipv4/fib_rules.c
===================================================================
--- net-2.6.git.orig/net/ipv4/fib_rules.c
+++ net-2.6.git/net/ipv4/fib_rules.c
@@ -5,9 +5,8 @@
  *
  *		IPv4 Forwarding Information Base: policy rules.
  *
- * Version:	$Id: fib_rules.c,v 1.17 2001/10/31 21:55:54 davem Exp $
- *
  * Authors:	Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
+ * 		Thomas Graf <tgraf@suug.ch>
  *
  *		This program is free software; you can redistribute it and/or
  *		modify it under the terms of the GNU General Public License
@@ -19,129 +18,154 @@
  *		Marc Boucher	:	routing by fwmark
  */
 
-#include <asm/uaccess.h>
-#include <asm/system.h>
-#include <linux/bitops.h>
 #include <linux/types.h>
 #include <linux/kernel.h>
-#include <linux/sched.h>
-#include <linux/mm.h>
-#include <linux/string.h>
-#include <linux/socket.h>
-#include <linux/sockios.h>
-#include <linux/errno.h>
-#include <linux/in.h>
-#include <linux/inet.h>
-#include <linux/inetdevice.h>
 #include <linux/netdevice.h>
-#include <linux/if_arp.h>
-#include <linux/proc_fs.h>
-#include <linux/skbuff.h>
 #include <linux/netlink.h>
+#include <linux/inetdevice.h>
 #include <linux/init.h>
 #include <linux/list.h>
 #include <linux/rcupdate.h>
-
 #include <net/ip.h>
-#include <net/protocol.h>
 #include <net/route.h>
 #include <net/tcp.h>
-#include <net/sock.h>
 #include <net/ip_fib.h>
+#include <net/fib_rules.h>
 
-#define FRprintk(a...)
+static struct fib_rules_ops fib4_rules_ops;
 
-struct fib_rule
+struct fib4_rule
 {
-	struct hlist_node hlist;
-	atomic_t	r_clntref;
-	u32		r_preference;
-	unsigned char	r_table;
-	unsigned char	r_action;
-	unsigned char	r_dst_len;
-	unsigned char	r_src_len;
-	u32		r_src;
-	u32		r_srcmask;
-	u32		r_dst;
-	u32		r_dstmask;
-	u32		r_srcmap;
-	u8		r_flags;
-	u8		r_tos;
+	struct fib_rule		common;
+	u8			dst_len;
+	u8			src_len;
+	u8			tos;
+	u32			src;
+	u32			srcmask;
+	u32			dst;
+	u32			dstmask;
 #ifdef CONFIG_IP_ROUTE_FWMARK
-	u32		r_fwmark;
+	u32			fwmark;
 #endif
-	int		r_ifindex;
 #ifdef CONFIG_NET_CLS_ROUTE
-	__u32		r_tclassid;
+	u32			tclassid;
 #endif
-	char		r_ifname[IFNAMSIZ];
-	int		r_dead;
-	struct		rcu_head rcu;
 };
 
-static struct fib_rule default_rule = {
-	.r_clntref =	ATOMIC_INIT(2),
-	.r_preference =	0x7FFF,
-	.r_table =	RT_TABLE_DEFAULT,
-	.r_action =	RTN_UNICAST,
+static struct fib4_rule default_rule = {
+	.common = {
+		.refcnt =	ATOMIC_INIT(2),
+		.pref =		0x7FFF,
+		.table =	RT_TABLE_DEFAULT,
+		.action =	FR_ACT_TO_TBL,
+	},
 };
 
-static struct fib_rule main_rule = {
-	.r_clntref =	ATOMIC_INIT(2),
-	.r_preference =	0x7FFE,
-	.r_table =	RT_TABLE_MAIN,
-	.r_action =	RTN_UNICAST,
+static struct fib4_rule main_rule = {
+	.common = {
+		.refcnt =	ATOMIC_INIT(2),
+		.pref =		0x7FFE,
+		.table =	RT_TABLE_MAIN,
+		.action =	FR_ACT_TO_TBL,
+	},
 };
 
-static struct fib_rule local_rule = {
-	.r_clntref =	ATOMIC_INIT(2),
-	.r_table =	RT_TABLE_LOCAL,
-	.r_action =	RTN_UNICAST,
+static struct fib4_rule local_rule = {
+	.common = {
+		.refcnt =	ATOMIC_INIT(2),
+		.table =	RT_TABLE_LOCAL,
+		.action =	FR_ACT_TO_TBL,
+		.flags =	FIB_RULE_PERMANENT,
+	},
 };
 
-static struct hlist_head fib_rules;
+static LIST_HEAD(fib4_rules);
+
+#ifdef CONFIG_NET_CLS_ROUTE
+u32 fib_rules_tclass(struct fib_result *res)
+{
+	return res->r ? ((struct fib4_rule *) res->r)->tclassid : 0;
+}
+#endif
 
-/* writer func called from netlink -- rtnl_sem hold*/
+int fib_lookup(struct flowi *flp, struct fib_result *res)
+{
+	struct fib_lookup_arg arg = {
+		.result = res,
+	};
+	int err;
 
-static void rtmsg_rule(int, struct fib_rule *);
+	err = fib_rules_lookup(&fib4_rules_ops, flp, 0, &arg);
+	res->r = arg.rule;
 
-int inet_rtm_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
+	return err;
+}
+
+int fib4_rule_action(struct fib_rule *rule, struct flowi *flp, int flags,
+		     struct fib_lookup_arg *arg)
 {
-	struct rtattr **rta = arg;
-	struct rtmsg *rtm = NLMSG_DATA(nlh);
-	struct fib_rule *r;
-	struct hlist_node *node;
-	int err = -ESRCH;
-
-	hlist_for_each_entry(r, node, &fib_rules, hlist) {
-		if ((!rta[RTA_SRC-1] || memcmp(RTA_DATA(rta[RTA_SRC-1]), &r->r_src, 4) == 0) &&
-		    rtm->rtm_src_len == r->r_src_len &&
-		    rtm->rtm_dst_len == r->r_dst_len &&
-		    (!rta[RTA_DST-1] || memcmp(RTA_DATA(rta[RTA_DST-1]), &r->r_dst, 4) == 0) &&
-		    rtm->rtm_tos == r->r_tos &&
-#ifdef CONFIG_IP_ROUTE_FWMARK
-		    (!rta[RTA_PROTOINFO-1] || memcmp(RTA_DATA(rta[RTA_PROTOINFO-1]), &r->r_fwmark, 4) == 0) &&
-#endif
-		    (!rtm->rtm_type || rtm->rtm_type == r->r_action) &&
-		    (!rta[RTA_PRIORITY-1] || memcmp(RTA_DATA(rta[RTA_PRIORITY-1]), &r->r_preference, 4) == 0) &&
-		    (!rta[RTA_IIF-1] || rtattr_strcmp(rta[RTA_IIF-1], r->r_ifname) == 0) &&
-		    (!rtm->rtm_table || (r && rtm->rtm_table == r->r_table))) {
-			err = -EPERM;
-			if (r == &local_rule)
-				break;
-
-			hlist_del_rcu(&r->hlist);
-			r->r_dead = 1;
-			rtmsg_rule(RTM_DELRULE, r);
-			fib_rule_put(r);
-			err = 0;
-			break;
-		}
+	int err = -EAGAIN;
+	struct fib_table *tbl;
+
+	switch (rule->action) {
+	case FR_ACT_TO_TBL:
+		break;
+
+	case FR_ACT_UNREACHABLE:
+		err = -ENETUNREACH;
+		goto errout;
+
+	case FR_ACT_PROHIBIT:
+		err = -EACCES;
+		goto errout;
+
+	case FR_ACT_BLACKHOLE:
+	default:
+		err = -EINVAL;
+		goto errout;
 	}
+
+	if ((tbl = fib_get_table(rule->table)) == NULL)
+		goto errout;
+
+	err = tbl->tb_lookup(tbl, flp, (struct fib_result *) arg->result);
+	if (err > 0)
+		err = -EAGAIN;
+errout:
 	return err;
 }
 
-/* Allocate new unique table id */
+
+void fib_select_default(const struct flowi *flp, struct fib_result *res)
+{
+	if (res->r && res->r->action == FR_ACT_TO_TBL &&
+	    FIB_RES_GW(*res) && FIB_RES_NH(*res).nh_scope == RT_SCOPE_LINK) {
+		struct fib_table *tb;
+		if ((tb = fib_get_table(res->r->table)) != NULL)
+			tb->tb_select_default(tb, flp, res);
+	}
+}
+
+static int fib4_rule_match(struct fib_rule *rule, struct flowi *fl, int flags)
+{
+	struct fib4_rule *r = (struct fib4_rule *) rule;
+	u32 daddr = fl->fl4_dst;
+	u32 saddr = fl->fl4_src;
+
+	if (((saddr ^ r->src) & r->srcmask) ||
+	    ((daddr ^ r->dst) & r->dstmask))
+		return 0;
+
+	if (r->tos && (r->tos != fl->fl4_tos))
+		return 0;
+
+#ifdef CONFIG_IP_ROUTE_FWMARK
+	if (r->fwmark && (r->fwmark != fl->fl4_fwmark))
+		return 0;
+#endif
+
+	return 1;
+}
 
 static struct fib_table *fib_empty_table(void)
 {
@@ -153,329 +177,177 @@ static struct fib_table *fib_empty_table
 	return NULL;
 }
 
-static inline void fib_rule_put_rcu(struct rcu_head *head)
-{
-	struct fib_rule *r = container_of(head, struct fib_rule, rcu);
-	kfree(r);
-}
-
-void fib_rule_put(struct fib_rule *r)
-{
-	if (atomic_dec_and_test(&r->r_clntref)) {
-		if (r->r_dead)
-			call_rcu(&r->rcu, fib_rule_put_rcu);
-		else
-			printk("Freeing alive rule %p\n", r);
-	}
-}
+static struct nla_policy fib4_rule_policy[FRA_MAX+1] __read_mostly = {
+	[FRA_IFNAME]	= { .type = NLA_STRING },
+	[FRA_PRIORITY]	= { .type = NLA_U32 },
+	[FRA_SRC]	= { .type = NLA_U32 },
+	[FRA_DST]	= { .type = NLA_U32 },
+	[FRA_FWMARK]	= { .type = NLA_U32 },
+	[FRA_FLOW]	= { .type = NLA_U32 },
+};
 
-/* writer func called from netlink -- rtnl_sem hold*/
+static int fib4_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
+			       struct nlmsghdr *nlh, struct fib_rule_hdr *frh,
+			       struct nlattr **tb)
+{
+	int err = -EINVAL;
+	struct fib4_rule *rule4 = (struct fib4_rule *) rule;
+
+	if (frh->src_len > 32 || frh->dst_len > 32 ||
+	    (frh->tos & ~IPTOS_TOS_MASK))
+		goto errout;
+
+	if (rule->table == RT_TABLE_UNSPEC) {
+		if (rule->action == FR_ACT_TO_TBL) {
+			struct fib_table *table;
+
+			table = fib_empty_table();
+			if (table == NULL) {
+				err = -ENOBUFS;
+				goto errout;
+			}
 
-int inet_rtm_newrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
-{
-	struct rtattr **rta = arg;
-	struct rtmsg *rtm = NLMSG_DATA(nlh);
-	struct fib_rule *r, *new_r, *last = NULL;
-	struct hlist_node *node = NULL;
-	unsigned char table_id;
-
-	if (rtm->rtm_src_len > 32 || rtm->rtm_dst_len > 32 ||
-	    (rtm->rtm_tos & ~IPTOS_TOS_MASK))
-		return -EINVAL;
-
-	if (rta[RTA_IIF-1] && RTA_PAYLOAD(rta[RTA_IIF-1]) > IFNAMSIZ)
-		return -EINVAL;
-
-	table_id = rtm->rtm_table;
-	if (table_id == RT_TABLE_UNSPEC) {
-		struct fib_table *table;
-		if (rtm->rtm_type == RTN_UNICAST) {
-			if ((table = fib_empty_table()) == NULL)
-				return -ENOBUFS;
-			table_id = table->tb_id;
+			rule->table = table->tb_id;
 		}
 	}
 
-	new_r = kzalloc(sizeof(*new_r), GFP_KERNEL);
-	if (!new_r)
-		return -ENOMEM;
-
-	if (rta[RTA_SRC-1])
-		memcpy(&new_r->r_src, RTA_DATA(rta[RTA_SRC-1]), 4);
-	if (rta[RTA_DST-1])
-		memcpy(&new_r->r_dst, RTA_DATA(rta[RTA_DST-1]), 4);
-	if (rta[RTA_GATEWAY-1])
-		memcpy(&new_r->r_srcmap, RTA_DATA(rta[RTA_GATEWAY-1]), 4);
-	new_r->r_src_len = rtm->rtm_src_len;
-	new_r->r_dst_len = rtm->rtm_dst_len;
-	new_r->r_srcmask = inet_make_mask(rtm->rtm_src_len);
-	new_r->r_dstmask = inet_make_mask(rtm->rtm_dst_len);
-	new_r->r_tos = rtm->rtm_tos;
+	if (tb[FRA_SRC])
+		rule4->src = nla_get_u32(tb[FRA_SRC]);
+
+	if (tb[FRA_DST])
+		rule4->dst = nla_get_u32(tb[FRA_DST]);
+
 #ifdef CONFIG_IP_ROUTE_FWMARK
-	if (rta[RTA_PROTOINFO-1])
-		memcpy(&new_r->r_fwmark, RTA_DATA(rta[RTA_PROTOINFO-1]), 4);
+	if (tb[FRA_FWMARK])
+		rule4->fwmark = nla_get_u32(tb[FRA_FWMARK]);
 #endif
-	new_r->r_action = rtm->rtm_type;
-	new_r->r_flags = rtm->rtm_flags;
-	if (rta[RTA_PRIORITY-1])
-		memcpy(&new_r->r_preference, RTA_DATA(rta[RTA_PRIORITY-1]), 4);
-	new_r->r_table = table_id;
-	if (rta[RTA_IIF-1]) {
-		struct net_device *dev;
-		rtattr_strlcpy(new_r->r_ifname, rta[RTA_IIF-1], IFNAMSIZ);
-		new_r->r_ifindex = -1;
-		dev = __dev_get_by_name(new_r->r_ifname);
-		if (dev)
-			new_r->r_ifindex = dev->ifindex;
-	}
+
 #ifdef CONFIG_NET_CLS_ROUTE
-	if (rta[RTA_FLOW-1])
-		memcpy(&new_r->r_tclassid, RTA_DATA(rta[RTA_FLOW-1]), 4);
+	if (tb[FRA_FLOW])
+		rule4->tclassid = nla_get_u32(tb[FRA_FLOW]);
 #endif
-	r = container_of(fib_rules.first, struct fib_rule, hlist);
-
-	if (!new_r->r_preference) {
-		if (r && r->hlist.next != NULL) {
-			r = container_of(r->hlist.next, struct fib_rule, hlist);
-			if (r->r_preference)
-				new_r->r_preference = r->r_preference - 1;
-		}
-	}
-
-	hlist_for_each_entry(r, node, &fib_rules, hlist) {
-		if (r->r_preference > new_r->r_preference)
-			break;
-		last = r;
-	}
-	atomic_inc(&new_r->r_clntref);
 
-	if (last)
-		hlist_add_after_rcu(&last->hlist, &new_r->hlist);
-	else
-		hlist_add_before_rcu(&new_r->hlist, &r->hlist);
+	rule4->src_len = frh->src_len;
+	rule4->srcmask = inet_make_mask(rule4->src_len);
+	rule4->dst_len = frh->dst_len;
+	rule4->dstmask = inet_make_mask(rule4->dst_len);
+	rule4->tos = frh->tos;
 
-	rtmsg_rule(RTM_NEWRULE, new_r);
-	return 0;
+	err = 0;
+errout:
+	return err;
 }
 
-#ifdef CONFIG_NET_CLS_ROUTE
-u32 fib_rules_tclass(struct fib_result *res)
+static int fib4_rule_compare(struct fib_rule *rule, struct fib_rule_hdr *frh,
+			     struct nlattr **tb)
 {
-	if (res->r)
-		return res->r->r_tclassid;
-	return 0;
-}
-#endif
+	struct fib4_rule *rule4 = (struct fib4_rule *) rule;
 
-/* callers should hold rtnl semaphore */
+	if (frh->src_len && (rule4->src_len != frh->src_len))
+		return 0;
 
-static void fib_rules_detach(struct net_device *dev)
-{
-	struct hlist_node *node;
-	struct fib_rule *r;
+	if (frh->dst_len && (rule4->dst_len != frh->dst_len))
+		return 0;
 
-	hlist_for_each_entry(r, node, &fib_rules, hlist) {
-		if (r->r_ifindex == dev->ifindex)
-			r->r_ifindex = -1;
+	if (frh->tos && (rule4->tos != frh->tos))
+		return 0;
 
-	}
-}
-
-/* callers should hold rtnl semaphore */
-
-static void fib_rules_attach(struct net_device *dev)
-{
-	struct hlist_node *node;
-	struct fib_rule *r;
-
-	hlist_for_each_entry(r, node, &fib_rules, hlist) {
-		if (r->r_ifindex == -1 && strcmp(dev->name, r->r_ifname) == 0)
-			r->r_ifindex = dev->ifindex;
-	}
-}
-
-int fib_lookup(const struct flowi *flp, struct fib_result *res)
-{
-	int err;
-	struct fib_rule *r, *policy;
-	struct fib_table *tb;
-	struct hlist_node *node;
-
-	u32 daddr = flp->fl4_dst;
-	u32 saddr = flp->fl4_src;
-
-FRprintk("Lookup: %u.%u.%u.%u <- %u.%u.%u.%u ",
-	NIPQUAD(flp->fl4_dst), NIPQUAD(flp->fl4_src));
-
-	rcu_read_lock();
-
-	hlist_for_each_entry_rcu(r, node, &fib_rules, hlist) {
-		if (((saddr^r->r_src) & r->r_srcmask) ||
-		    ((daddr^r->r_dst) & r->r_dstmask) ||
-		    (r->r_tos && r->r_tos != flp->fl4_tos) ||
 #ifdef CONFIG_IP_ROUTE_FWMARK
-		    (r->r_fwmark && r->r_fwmark != flp->fl4_fwmark) ||
+	if (tb[FRA_FWMARK] && (rule4->fwmark != nla_get_u32(tb[FRA_FWMARK])))
+		return 0;
 #endif
-		    (r->r_ifindex && r->r_ifindex != flp->iif))
-			continue;
-
-FRprintk("tb %d r %d ", r->r_table, r->r_action);
-		switch (r->r_action) {
-		case RTN_UNICAST:
-			policy = r;
-			break;
-		case RTN_UNREACHABLE:
-			rcu_read_unlock();
-			return -ENETUNREACH;
-		default:
-		case RTN_BLACKHOLE:
-			rcu_read_unlock();
-			return -EINVAL;
-		case RTN_PROHIBIT:
-			rcu_read_unlock();
-			return -EACCES;
-		}
 
-		if ((tb = fib_get_table(r->r_table)) == NULL)
-			continue;
-		err = tb->tb_lookup(tb, flp, res);
-		if (err == 0) {
-			res->r = policy;
-			if (policy)
-				atomic_inc(&policy->r_clntref);
-			rcu_read_unlock();
-			return 0;
-		}
-		if (err < 0 && err != -EAGAIN) {
-			rcu_read_unlock();
-			return err;
-		}
-	}
-FRprintk("FAILURE\n");
-	rcu_read_unlock();
-	return -ENETUNREACH;
-}
+#ifdef CONFIG_NET_CLS_ROUTE
+	if (tb[FRA_FLOW] && (rule4->tclassid != nla_get_u32(tb[FRA_FLOW])))
+		return 0;
+#endif
 
-void fib_select_default(const struct flowi *flp, struct fib_result *res)
-{
-	if (res->r && res->r->r_action == RTN_UNICAST &&
-	    FIB_RES_GW(*res) && FIB_RES_NH(*res).nh_scope == RT_SCOPE_LINK) {
-		struct fib_table *tb;
-		if ((tb = fib_get_table(res->r->r_table)) != NULL)
-			tb->tb_select_default(tb, flp, res);
-	}
-}
+	if (tb[FRA_SRC] && (rule4->src != nla_get_u32(tb[FRA_SRC])))
+		return 0;
 
-static int fib_rules_event(struct notifier_block *this, unsigned long event, void *ptr)
-{
-	struct net_device *dev = ptr;
+	if (tb[FRA_DST] && (rule4->dst != nla_get_u32(tb[FRA_DST])))
+		return 0;
 
-	if (event == NETDEV_UNREGISTER)
-		fib_rules_detach(dev);
-	else if (event == NETDEV_REGISTER)
-		fib_rules_attach(dev);
-	return NOTIFY_DONE;
+	return 1;
 }
 
+static int fib4_rule_fill(struct fib_rule *rule, struct sk_buff *skb,
+			  struct nlmsghdr *nlh, struct fib_rule_hdr *frh)
+{
+	struct fib4_rule *rule4 = (struct fib4_rule *) rule;
 
-static struct notifier_block fib_rules_notifier = {
-	.notifier_call =fib_rules_event,
-};
+	frh->family = AF_INET;
+	frh->dst_len = rule4->dst_len;
+	frh->src_len = rule4->src_len;
+	frh->tos = rule4->tos;
 
-static __inline__ int inet_fill_rule(struct sk_buff *skb,
-				     struct fib_rule *r,
-				     u32 pid, u32 seq, int event,
-				     unsigned int flags)
-{
-	struct rtmsg *rtm;
-	struct nlmsghdr  *nlh;
-	unsigned char	 *b = skb->tail;
-
-	nlh = NLMSG_NEW(skb, pid, seq, event, sizeof(*rtm), flags);
-	rtm = NLMSG_DATA(nlh);
-	rtm->rtm_family = AF_INET;
-	rtm->rtm_dst_len = r->r_dst_len;
-	rtm->rtm_src_len = r->r_src_len;
-	rtm->rtm_tos = r->r_tos;
 #ifdef CONFIG_IP_ROUTE_FWMARK
-	if (r->r_fwmark)
-		RTA_PUT(skb, RTA_PROTOINFO, 4, &r->r_fwmark);
+	if (rule4->fwmark)
+		NLA_PUT_U32(skb, FRA_FWMARK, rule4->fwmark);
 #endif
-	rtm->rtm_table = r->r_table;
-	rtm->rtm_protocol = 0;
-	rtm->rtm_scope = 0;
-	rtm->rtm_type = r->r_action;
-	rtm->rtm_flags = r->r_flags;
-
-	if (r->r_dst_len)
-		RTA_PUT(skb, RTA_DST, 4, &r->r_dst);
-	if (r->r_src_len)
-		RTA_PUT(skb, RTA_SRC, 4, &r->r_src);
-	if (r->r_ifname[0])
-		RTA_PUT(skb, RTA_IIF, IFNAMSIZ, &r->r_ifname);
-	if (r->r_preference)
-		RTA_PUT(skb, RTA_PRIORITY, 4, &r->r_preference);
-	if (r->r_srcmap)
-		RTA_PUT(skb, RTA_GATEWAY, 4, &r->r_srcmap);
+
+	if (rule4->dst_len)
+		NLA_PUT_U32(skb, FRA_DST, rule4->dst);
+
+	if (rule4->src_len)
+		NLA_PUT_U32(skb, FRA_SRC, rule4->src);
+
 #ifdef CONFIG_NET_CLS_ROUTE
-	if (r->r_tclassid)
-		RTA_PUT(skb, RTA_FLOW, 4, &r->r_tclassid);
+	if (rule4->tclassid)
+		NLA_PUT_U32(skb, FRA_FLOW, rule4->tclassid);
 #endif
-	nlh->nlmsg_len = skb->tail - b;
-	return skb->len;
+	return 0;
 
-nlmsg_failure:
-rtattr_failure:
-	skb_trim(skb, b - skb->data);
-	return -1;
+nla_put_failure:
+	return -ENOBUFS;
 }
 
-/* callers should hold rtnl semaphore */
+int fib4_rules_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	return fib_rules_dump(skb, cb, AF_INET);
+}
 
-static void rtmsg_rule(int event, struct fib_rule *r)
+static u32 fib4_rule_default_pref(void)
 {
-	int size = NLMSG_SPACE(sizeof(struct rtmsg) + 128);
-	struct sk_buff *skb = alloc_skb(size, GFP_KERNEL);
+	struct list_head *pos;
+	struct fib_rule *rule;
 
-	if (!skb)
-		netlink_set_err(rtnl, 0, RTNLGRP_IPV4_RULE, ENOBUFS);
-	else if (inet_fill_rule(skb, r, 0, 0, event, 0) < 0) {
-		kfree_skb(skb);
-		netlink_set_err(rtnl, 0, RTNLGRP_IPV4_RULE, EINVAL);
-	} else {
-		netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV4_RULE, GFP_KERNEL);
+	if (!list_empty(&fib4_rules)) {
+		pos = fib4_rules.next;
+		if (pos->next != &fib4_rules) {
+			rule = list_entry(pos->next, struct fib_rule, list);
+			if (rule->pref)
+				return rule->pref - 1;
+		}
 	}
+
+	return 0;
 }
 
-int inet_dump_rules(struct sk_buff *skb, struct netlink_callback *cb)
+static struct fib_rules_ops fib4_rules_ops = {
+	.family		= AF_INET,
+	.rule_size	= sizeof(struct fib4_rule),
+	.action		= fib4_rule_action,
+	.match		= fib4_rule_match,
+	.configure	= fib4_rule_configure,
+	.compare	= fib4_rule_compare,
+	.fill		= fib4_rule_fill,
+	.default_pref	= fib4_rule_default_pref,
+	.policy		= fib4_rule_policy,
+	.rules_list	= &fib4_rules,
+	.owner		= THIS_MODULE,
+};
+
+void __init fib4_rules_init(void)
 {
-	int idx = 0;
-	int s_idx = cb->args[0];
-	struct fib_rule *r;
-	struct hlist_node *node;
-
-	rcu_read_lock();
-	hlist_for_each_entry(r, node, &fib_rules, hlist) {
-		if (idx < s_idx)
-			goto next;
-		if (inet_fill_rule(skb, r, NETLINK_CB(cb->skb).pid,
-				   cb->nlh->nlmsg_seq,
-				   RTM_NEWRULE, NLM_F_MULTI) < 0)
-			break;
-next:
-		idx++;
-	}
-	rcu_read_unlock();
-	cb->args[0] = idx;
+	list_add_tail(&local_rule.common.list, &fib4_rules);
+	list_add_tail(&main_rule.common.list, &fib4_rules);
+	list_add_tail(&default_rule.common.list, &fib4_rules);
 
-	return skb->len;
+	fib_rules_register(&fib4_rules_ops);
 }
 
-void __init fib_rules_init(void)
+void __exit fib4_rules_cleanup(void)
 {
-	INIT_HLIST_HEAD(&fib_rules);
-	hlist_add_head(&local_rule.hlist, &fib_rules);
-	hlist_add_after(&local_rule.hlist, &main_rule.hlist);
-	hlist_add_after(&main_rule.hlist, &default_rule.hlist);
-	register_netdevice_notifier(&fib_rules_notifier);
+	fib_rules_unregister(&fib4_rules_ops);
 }
Index: net-2.6.git/net/ipv4/fib_frontend.c
===================================================================
--- net-2.6.git.orig/net/ipv4/fib_frontend.c
+++ net-2.6.git/net/ipv4/fib_frontend.c
@@ -656,7 +656,7 @@ void __init ip_fib_init(void)
 	ip_fib_local_table = fib_hash_init(RT_TABLE_LOCAL);
 	ip_fib_main_table  = fib_hash_init(RT_TABLE_MAIN);
 #else
-	fib_rules_init();
+	fib4_rules_init();
 #endif
 
 	register_netdevice_notifier(&fib_netdev_notifier);

* Re: [RFC] Multiple IPV6 Routing Tables & Policy Routing
  2006-07-26 22:11 [RFC] Multiple IPV6 Routing Tables & Policy Routing Thomas Graf
                   ` (4 preceding siblings ...)
  2006-07-26 22:00 ` [PATCH 5/5] [IPV4]: Use Protocol Independant Policy Routing Rules Framework Thomas Graf
@ 2006-07-28  6:10 ` YOSHIFUJI Hideaki / 吉藤英明
  2006-07-28  8:23   ` David Miller
  2006-07-28 10:32   ` Thomas Graf
  2006-07-31 11:01 ` Ville Nuorvala
  6 siblings, 2 replies; 54+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2006-07-28  6:10 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, vnuorval, usagi-core, davem, anttit, yoshfuji

Hello.

In article <20060726221100.325687073@postel.suug.ch> (at Thu, 27 Jul 2006 00:11:00 +0200), Thomas Graf <tgraf@suug.ch> says:

> It's not final but somewhat working, I'm eager to see comments
> or patches. I apologize if I've tramped onto anybody's foot
> by taking this up and submitting it, this isn't meant as an
> attempt to steal credits but rather to pick up good code and
> finally get it upstream after a very long while.

First of all, thank you for giving us a good starting point.
I am interested in this area because I have spent several
months on release engineering for it.

Well, as you stated, the code is based on the work of MIPL,
which is being jointly developed by Helsinki University of Technology (HUT)
and USAGI/WIDE Project.  Please describe this on your commit logs.
Please retain copyright information as well.

Thank you very much.

-- 
YOSHIFUJI Hideaki @ USAGI Project  <yoshfuji@linux-ipv6.org>
GPG-FP  : 9022 65EB 1ECF 3AD1 0BDF  80D8 4807 F894 E062 0EEA

* Re: [RFC] Multiple IPV6 Routing Tables & Policy Routing
  2006-07-28  6:10 ` [RFC] Multiple IPV6 Routing Tables & Policy Routing YOSHIFUJI Hideaki / 吉藤英明
@ 2006-07-28  8:23   ` David Miller
  2006-07-28 10:32   ` Thomas Graf
  1 sibling, 0 replies; 54+ messages in thread
From: David Miller @ 2006-07-28  8:23 UTC (permalink / raw)
  To: yoshfuji; +Cc: tgraf, netdev, vnuorval, usagi-core, anttit

From: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Date: Fri, 28 Jul 2006 15:10:09 +0900 (JST)

> Well, as you stated, the code is based on the work of MIPL,
> which is being jointly developed by Helsinki University of Technology (HUT)
> and USAGI/WIDE Project.  Please describe this on your commit logs.
> Please retain copyright information as well.

Ok.

Thomas could you please respin the patches with:

1) Updated credits, based upon Yoshifuji's feedback

2) Fixes for the small issues Patrick pointed out in
   his review

Thanks a lot.



* Re: [RFC] Multiple IPV6 Routing Tables & Policy Routing
  2006-07-28  6:10 ` [RFC] Multiple IPV6 Routing Tables & Policy Routing YOSHIFUJI Hideaki / 吉藤英明
  2006-07-28  8:23   ` David Miller
@ 2006-07-28 10:32   ` Thomas Graf
  2006-07-29  4:27     ` YOSHIFUJI Hideaki / 吉藤英明
  1 sibling, 1 reply; 54+ messages in thread
From: Thomas Graf @ 2006-07-28 10:32 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki / 吉藤英明
  Cc: netdev, vnuorval, usagi-core, davem, anttit

* YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> 2006-07-28 15:10
> Well, as you stated, the code is based on the work of MIPL,
> which is being jointly developed by Helsinki University of Technology (HUT)
> and USAGI/WIDE Project.  Please describe this on your commit logs.
> Please retain copyright information as well.

Please tell me which copyright I missed? 

* Re: [RFC] Multiple IPV6 Routing Tables & Policy Routing
  2006-07-28 10:32   ` Thomas Graf
@ 2006-07-29  4:27     ` YOSHIFUJI Hideaki / 吉藤英明
  0 siblings, 0 replies; 54+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2006-07-29  4:27 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, vnuorval, usagi-core, davem, anttit, yoshfuji

In article <20060728103229.GD14627@postel.suug.ch> (at Fri, 28 Jul 2006 12:32:29 +0200), Thomas Graf <tgraf@suug.ch> says:

> * YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> 2006-07-28 15:10
> > Well, as you stated, the code is based on the work of MIPL,
> > which is being jointly developed by Helsinki University of Technology (HUT)
> > and USAGI/WIDE Project.  Please describe this on your commit logs.
> > Please retain copyright information as well.
> 
> Please tell me which copyright I missed? 

Please put these lines in net/ipv6/fib6_rules.c
(and maybe, its derived files):

  Copyright (C)2003-2006 Helsinki University of Technology
  Copyright (C)2003-2006 USAGI/WIDE Project

You can add your own copyright notice if you want, of course.

Thank you.

--yoshfuji

* Re: [RFC] Multiple IPV6 Routing Tables & Policy Routing
  2006-07-26 22:11 [RFC] Multiple IPV6 Routing Tables & Policy Routing Thomas Graf
                   ` (5 preceding siblings ...)
  2006-07-28  6:10 ` [RFC] Multiple IPV6 Routing Tables & Policy Routing YOSHIFUJI Hideaki / 吉藤英明
@ 2006-07-31 11:01 ` Ville Nuorvala
  6 siblings, 0 replies; 54+ messages in thread
From: Ville Nuorvala @ 2006-07-31 11:01 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, usagi-core, yoshfuji, davem, anttit

Thomas Graf wrote:
> Hello,

Hi Thomas!

> Thought it might be time to go through a round of comments
> on this work. Even though I've almost rewritten all the code
> the patches are based on the work found on www.mobile-ipv6.org.
> I have no idea which code was written by whom so just email me
> to get the credits right.

The policy routing stuff (multiple tables and source address based
routing) was almost entirely written by me. Therefore you can apply my
name as you see fit ;-)

Tushar Gohad at MontaVista, Benjamin Thery at Bull and of course USAGI
have also worked on the code.

> Main differences to the version found on mobile-ipv6.org is
> that I removed table refcnt and defined that tables cannot
> disappear once created to simplify things and avoid too many
> atomic operations when looking up routes.

Yes, that sounds good. As the ipv6 module doesn't really seem to become
unloadable anytime soon, there isn't really any good reason to refcount
the tables.

> I've replaced the
> table array with a hash table to prepare it for > 255 tables
> and made things aware of the new default router selection
> code and experimental route info stuff added recently.

Good! I never had the time to merge our changes with 2.6.17.

> It's not final but somewhat working, I'm eager to see comments
> or patches.

I'll try to comment on them the best I can.

> I apologize if I've tramped onto anybody's foot
> by taking this up and submitting it, this isn't meant as an
> attempt to steal credits but rather to pick up good code and
> finally get it upstream after a very long while.

No offense taken! It's great that someone wants to push these things
upstream as I personally have neither had the time nor the energy to do
so lately.

Regards,
Ville

* [PATCHSET] Multiple IPV6 Routing Tables & Policy Routing
@ 2006-08-04 10:23 Thomas Graf
  2006-08-03 22:00 ` [PATCH 1/5] [IPV6]: Remove ndiscs rt6_lock dependency Thomas Graf
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Graf @ 2006-08-04 10:23 UTC (permalink / raw)
  To: netdev; +Cc: vnuorval, usagi-core, yoshfuji, davem, anttit

Hello,

This patchset implements multiple IPv6 routing tables and
policy routing. Although the code has been almost entirely
rewritten, the work is based on the MIPL patch found on
mobile-ipv6.org, which is being jointly developed by Helsinki
University of Technology (HUT) and the USAGI/WIDE Project.

Reference counting on routing tables was removed to decrease
the number of required atomic operations during route lookup.

The tables are now organized in a hash table to prepare for
supporting more than 255 tables.
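
As an illustration only (this is not code from the patchset), the
two points above can be sketched in user-space C: a chained hash
keyed by table id, where lookup takes no reference because tables
are never freed once created. The names rt_table, TABLE_HASHSZ,
table_bucket, table_get and table_new are hypothetical, and the
sketch ignores locking entirely.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bucket count for this sketch; must be a power of two. */
#define TABLE_HASHSZ 256

struct rt_table {
	uint32_t id;            /* table id, may exceed 255 */
	struct rt_table *next;  /* next table in the same bucket */
};

static struct rt_table *table_hash[TABLE_HASHSZ];

/* Map a table id to a bucket index. */
unsigned int table_bucket(uint32_t id)
{
	return id & (TABLE_HASHSZ - 1);
}

/*
 * Look up a table by id. Only the bucket chain is walked; no
 * reference count is taken, which is safe in this sketch because
 * tables are never removed after creation.
 */
struct rt_table *table_get(uint32_t id)
{
	struct rt_table *tb;

	for (tb = table_hash[table_bucket(id)]; tb != NULL; tb = tb->next)
		if (tb->id == id)
			return tb;
	return NULL;
}

/* Return the existing table for id, or create one that lives forever. */
struct rt_table *table_new(uint32_t id)
{
	struct rt_table *tb = table_get(id);

	if (tb != NULL)
		return tb;

	tb = calloc(1, sizeof(*tb));
	if (tb == NULL)
		return NULL;

	tb->id = id;
	tb->next = table_hash[table_bucket(id)];
	table_hash[table_bucket(id)] = tb;
	return tb;
}
```

With 256 buckets, ids 254 and 510 land in the same bucket, and
table_get() tells them apart by comparing the stored id. That is
why a hash lifts the limit that a directly indexed array imposes.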

Matching source addresses in policy rules no longer depends
on CONFIG_SUBTREES.

Git tree with changes:
master.kernel.org:/pub/scm/linux/kernel/git/tgraf/net-2.6.19-mrt.git

 include/linux/fib_rules.h |   60 ++++
 include/linux/rtnetlink.h |    2 
 include/net/fib_rules.h   |   90 +++++++
 include/net/ip6_fib.h     |   46 +++
 include/net/ip6_route.h   |    8 +
 include/net/ip_fib.h      |   14 +
 net/Kconfig               |    3 
 net/core/Makefile         |    1 
 net/core/fib_rules.c      |  416 +++++++++++++++++++++++++++++++
 net/core/rtnetlink.c      |    9 +
 net/ipv4/Kconfig          |    1 
 net/ipv4/devinet.c        |    4 
 net/ipv4/fib_frontend.c   |    2 
 net/ipv4/fib_rules.c      |  613 ++++++++++++++++++---------------------------
 net/ipv6/Kconfig          |    7 +
 net/ipv6/Makefile         |    1 
 net/ipv6/addrconf.c       |    7 -
 net/ipv6/fib6_rules.c     |  251 ++++++++++++++++++
 net/ipv6/ip6_fib.c        |  143 ++++++++++
 net/ipv6/route.c          |  442 +++++++++++++++++++++++---------
 20 files changed, 1597 insertions(+), 523 deletions(-)


* [PATCH 1/5] [IPV6]: Remove ndiscs rt6_lock dependency
  2006-08-04 10:23 [PATCHSET] " Thomas Graf
@ 2006-08-03 22:00 ` Thomas Graf
  0 siblings, 0 replies; 54+ messages in thread
From: Thomas Graf @ 2006-08-03 22:00 UTC (permalink / raw)
  To: netdev; +Cc: vnuorval, usagi-core, yoshfuji, davem, anttit

[-- Attachment #1: ip6_mrt_ndisc_lock --]
[-- Type: text/plain, Size: 1541 bytes --]

(Ab)using rt6_lock wouldn't work anymore if rt6_lock is
converted into a per table lock.

Signed-off-by: Thomas Graf <tgraf@suug.ch>

Index: net-2.6.19/net/ipv6/route.c
===================================================================
--- net-2.6.19.orig/net/ipv6/route.c
+++ net-2.6.19/net/ipv6/route.c
@@ -745,8 +745,6 @@ static void ip6_rt_update_pmtu(struct ds
 	}
 }
 
-/* Protected by rt6_lock.  */
-static struct dst_entry *ndisc_dst_gc_list;
 static int ipv6_get_mtu(struct net_device *dev);
 
 static inline unsigned int ipv6_advmss(unsigned int mtu)
@@ -767,6 +765,9 @@ static inline unsigned int ipv6_advmss(u
 	return mtu;
 }
 
+static struct dst_entry *ndisc_dst_gc_list;
+DEFINE_SPINLOCK(ndisc_lock);
+
 struct dst_entry *ndisc_dst_alloc(struct net_device *dev, 
 				  struct neighbour *neigh,
 				  struct in6_addr *addr,
@@ -807,10 +808,10 @@ struct dst_entry *ndisc_dst_alloc(struct
 	rt->rt6i_dst.plen = 128;
 #endif
 
-	write_lock_bh(&rt6_lock);
+	spin_lock_bh(&ndisc_lock);
 	rt->u.dst.next = ndisc_dst_gc_list;
 	ndisc_dst_gc_list = &rt->u.dst;
-	write_unlock_bh(&rt6_lock);
+	spin_unlock_bh(&ndisc_lock);
 
 	fib6_force_start_gc();
 
@@ -824,8 +825,11 @@ int ndisc_dst_gc(int *more)
 	int freed;
 
 	next = NULL;
+ 	freed = 0;
+
+	spin_lock_bh(&ndisc_lock);
 	pprev = &ndisc_dst_gc_list;
-	freed = 0;
+
 	while ((dst = *pprev) != NULL) {
 		if (!atomic_read(&dst->__refcnt)) {
 			*pprev = dst->next;
@@ -837,6 +841,8 @@ int ndisc_dst_gc(int *more)
 		}
 	}
 
+	spin_unlock_bh(&ndisc_lock);
+
 	return freed;
 }
 


end of thread, other threads:[~2006-08-04 10:27 UTC | newest]

Thread overview: 54+ messages
2006-07-26 22:11 [RFC] Multiple IPV6 Routing Tables & Policy Routing Thomas Graf
2006-07-26 22:00 ` [PATCH 1/5] [IPV6]: Remove ndiscs rt6_lock dependency Thomas Graf
2006-07-26 22:28   ` David Miller
2006-07-26 23:34     ` Tushar Gohad
2006-07-26 23:34       ` David Miller
2006-07-31 11:01     ` Ville Nuorvala
2006-07-26 22:00 ` [PATCH 2/5] [IPV6]: Multiple Routing Tables Thomas Graf
2006-07-26 22:39   ` David Miller
2006-07-26 22:48     ` Thomas Graf
2006-07-26 22:55       ` David Miller
2006-07-29  4:13   ` YOSHIFUJI Hideaki / 吉藤英明
2006-07-29  4:14     ` David Miller
2006-07-29  4:28       ` YOSHIFUJI Hideaki / 吉藤英明
2006-07-29 10:29     ` Thomas Graf
2006-07-31 13:55   ` Ville Nuorvala
2006-07-31 14:01     ` Herbert Xu
2006-07-31 14:02       ` Herbert Xu
2006-07-31 15:41       ` Thomas Graf
2006-07-31 20:09         ` David Miller
2006-07-31 15:34     ` Thomas Graf
2006-07-26 22:00 ` [PATCH 3/5] [NET]: Protocol Independant Policy Routing Rules Framework Thomas Graf
2006-07-26 22:41   ` David Miller
2006-07-27  5:58   ` James Morris
2006-07-27  6:02     ` David Miller
2006-07-27 22:39       ` [RESEND " Thomas Graf
2006-07-27 22:58         ` Patrick McHardy
2006-07-27 23:17           ` David Miller
2006-07-27 23:31             ` Patrick McHardy
2006-07-28  9:25           ` Martin Josefsson
2006-07-29  1:40             ` Patrick McHardy
2006-07-29  7:25               ` Martin Josefsson
2006-07-27 23:30         ` Patrick McHardy
2006-07-28 10:23           ` Thomas Graf
2006-07-31 14:46         ` Ville Nuorvala
2006-07-31 15:24           ` Thomas Graf
2006-07-31 18:01             ` Patrick McHardy
2006-07-31 20:01               ` Thomas Graf
2006-07-26 22:00 ` [PATCH 4/5] [IPV6]: Policy Routing Rules Thomas Graf
2006-07-26 22:42   ` David Miller
2006-07-26 23:26   ` David Miller
2006-07-26 23:33   ` David Miller
2006-07-26 23:40     ` David Miller
2006-07-27 22:40       ` [RESEND " Thomas Graf
2006-07-31 14:55         ` Ville Nuorvala
2006-07-26 22:00 ` [PATCH 5/5] [IPV4]: Use Protocol Independant Policy Routing Rules Framework Thomas Graf
2006-07-26 22:43   ` David Miller
2006-07-26 23:47   ` David Miller
2006-07-27 22:40     ` [RESEND " Thomas Graf
2006-07-28  6:10 ` [RFC] Multiple IPV6 Routing Tables & Policy Routing YOSHIFUJI Hideaki / 吉藤英明
2006-07-28  8:23   ` David Miller
2006-07-28 10:32   ` Thomas Graf
2006-07-29  4:27     ` YOSHIFUJI Hideaki / 吉藤英明
2006-07-31 11:01 ` Ville Nuorvala
  -- strict thread matches above, loose matches on Subject: below --
2006-08-04 10:23 [PATCHSET] " Thomas Graf
2006-08-03 22:00 ` [PATCH 1/5] [IPV6]: Remove ndiscs rt6_lock dependency Thomas Graf
