[PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism


* [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism
@ 2006-06-27 20:50 Steve Wise
  2006-06-27 20:51 ` [PATCH Round 3 1/2] " Steve Wise
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Steve Wise @ 2006-06-27 20:50 UTC (permalink / raw)
  To: davem; +Cc: netdev

Round 3 Changes:

- changed netlink msg for neighbour change to (RTM_NEIGHUPD)
- added netlink msg for PMTU change events (RTM_ROUTEUPD)
- added netlink messages for redirect (RTM_DELROUTE + RTM_NEWROUTE)
- tested neighbour change events via netlink for ipv4 and ipv6.
- tested redirect change events via netlink for ipv4.

Round 2 Changes:

- cleaned up event structures per review feedback.
- began integration with netlink (see neighbour changes in patch 2).
- added IPv6 support.

TODO: 

- review feedback changes, if any
- more testing
- retest with RDMA NIC

------

This patch implements a mechanism that allows interested clients to
register for notification of certain network events. The intended use
is to allow RDMA devices (linux/drivers/infiniband) to be notified of
neighbour updates, ICMP redirects, path MTU changes, and route changes.

The reason these devices need update events is because they typically
cache this information in hardware and need to be notified when this
information has been updated.  For information on RDMA protocols, see:
http://www.ietf.org/html.charters/rddp-charter.html.

The key events of interest are:

- neighbour MAC address change
- routing redirect (the next hop neighbour changes for a dst_entry)
- path MTU change (the path MTU for a dst_entry changes)
- route adds/deletes

NOTE: These new netevents are also passed up to user space via netlink.
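
As a rough illustration only (it is not part of the patchset), a user-space
listener could pick these messages up by binding an rtnetlink socket to the
neighbour and route multicast groups that patch 2 broadcasts on. The function
names below are hypothetical, and the sketch uses the legacy RTMGRP_* bitmask
form of the group numbers:

```c
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Bitmask of the multicast groups that carry the neighbour, IPv4 route
 * and IPv6 route notifications (hypothetical helper name). */
unsigned int netevent_group_mask(void)
{
	return RTMGRP_NEIGH | RTMGRP_IPV4_ROUTE | RTMGRP_IPV6_ROUTE;
}

/* Open an rtnetlink socket subscribed to those groups.
 * Returns the file descriptor, or -1 on failure. */
int open_netevent_listener(void)
{
	struct sockaddr_nl addr;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;
	addr.nl_groups = netevent_group_mask();

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A caller would then recv() on the returned descriptor and walk the buffer
with NLMSG_OK()/NLMSG_NEXT(), checking nlmsg_type on each message.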

We would like to get this or similar functionality included in 2.6.19
and request comments.

This patchset consists of 2 patches:

1) New files implementing the Network Event Notifier
2) Core network changes to generate network event notifications

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>


* [PATCH Round 3 1/2] Network Event Notifier Mechanism.
  2006-06-27 20:50 [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism Steve Wise
@ 2006-06-27 20:51 ` Steve Wise
  2006-06-27 20:51 ` [PATCH Round 3 2/2] Core network changes to support network event notification Steve Wise
  2006-06-28  2:54 ` [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism Herbert Xu
  2 siblings, 0 replies; 19+ messages in thread
From: Steve Wise @ 2006-06-27 20:51 UTC (permalink / raw)
  To: davem; +Cc: netdev


This patch uses notifier blocks to implement a network event
notifier mechanism.

Clients register their callback function by calling
register_netevent_notifier() like this:

static struct notifier_block nb = {
        .notifier_call = my_callback_func
};

...

register_netevent_notifier(&nb);
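
What a callback might look like is sketched below as a small user-space
model, so that it can be compiled and run outside the kernel. The event codes
are copied from the proposed netevent.h; the notifier_block layout and the
chain helpers are simplified stand-ins (assumptions), not the kernel's atomic
notifier implementation, and the PMTU counter is purely hypothetical:

```c
#include <stddef.h>

/* Event codes as proposed in include/net/netevent.h. */
enum netevent_notif_type {
	NETEVENT_NEIGH_UPDATE = 1,
	NETEVENT_ROUTE_ADD,
	NETEVENT_ROUTE_DEL,
	NETEVENT_PMTU_UPDATE,
	NETEVENT_REDIRECT,
};

/* Simplified stand-ins for the kernel notifier definitions. */
#define NOTIFY_DONE 0
#define NOTIFY_OK   1

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long event, void *ptr);
	struct notifier_block *next;
};

static struct notifier_block *netevent_chain;

/* Model only: push onto a singly linked list, no locking or priorities. */
int register_netevent_notifier(struct notifier_block *nb)
{
	nb->next = netevent_chain;
	netevent_chain = nb;
	return 0;
}

/* Model only: call every registered block, return the last result. */
int call_netevent_notifiers(unsigned long val, void *v)
{
	struct notifier_block *nb;
	int ret = NOTIFY_DONE;

	for (nb = netevent_chain; nb; nb = nb->next)
		ret = nb->notifier_call(nb, val, v);
	return ret;
}

/* Hypothetical client state: how many PMTU updates were seen. */
static int pmtu_updates_seen;

/* The callback dispatches on the event code; ptr would carry the
 * per-event argument documented in netevent.h. */
int my_callback_func(struct notifier_block *nb, unsigned long event,
		     void *ptr)
{
	(void)nb;
	(void)ptr;

	switch (event) {
	case NETEVENT_PMTU_UPDATE:
		pmtu_updates_seen++;	/* e.g. refresh a cached MTU */
		return NOTIFY_OK;
	default:
		return NOTIFY_DONE;	/* not interested in this event */
	}
}
```

In the kernel the callback runs from an atomic notifier chain, so a real
handler must not sleep.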
---

 include/net/netevent.h |   49 +++++++++++++++++++++++++++++++++++
 net/core/netevent.c    |   68 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+), 0 deletions(-)

diff --git a/include/net/netevent.h b/include/net/netevent.h
new file mode 100644
index 0000000..22214c8
--- /dev/null
+++ b/include/net/netevent.h
@@ -0,0 +1,49 @@
+#ifndef _NET_EVENT_H
+#define _NET_EVENT_H
+
+/*
+ *	Generic netevent notifiers
+ *
+ *	Authors:
+ *      Tom Tucker              <tom@opengridcomputing.com>
+ *
+ * 	Changes:
+ */
+
+#ifdef __KERNEL__
+
+#include <net/dst.h>
+
+/* 
+ * Generic route info structure.
+ *
+ * Family	  Data ptr type
+ * --------------------------------
+ * AF_INET 	- struct fib_info *
+ * AF_INET6	- struct rt6_info *
+ * AF_DECnet	- struct dn_route *
+ */
+struct netevent_route_info {
+	u16 family;
+	void *data;	
+};
+
+struct netevent_redirect {
+	struct dst_entry *old;
+	struct dst_entry *new;
+};
+
+enum netevent_notif_type {
+	NETEVENT_NEIGH_UPDATE = 1, /* arg is struct neighbour ptr */
+	NETEVENT_ROUTE_ADD,   	   /* arg is struct netevent_route_info ptr */
+	NETEVENT_ROUTE_DEL,   	   /* arg is struct netevent_route_info ptr */
+	NETEVENT_PMTU_UPDATE,	   /* arg is struct dst_entry ptr */
+	NETEVENT_REDIRECT,	   /* arg is struct netevent_redirect ptr */
+};
+
+extern int register_netevent_notifier(struct notifier_block *nb);
+extern int unregister_netevent_notifier(struct notifier_block *nb);
+extern int call_netevent_notifiers(unsigned long val, void *v);
+
+#endif
+#endif
diff --git a/net/core/netevent.c b/net/core/netevent.c
new file mode 100644
index 0000000..e995751
--- /dev/null
+++ b/net/core/netevent.c
@@ -0,0 +1,68 @@
+/*
+ *	Network event notifiers
+ *
+ *	Authors:
+ *      Tom Tucker             <tom@opengridcomputing.com>
+ *
+ *	This program is free software; you can redistribute it and/or
+ *      modify it under the terms of the GNU General Public License
+ *      as published by the Free Software Foundation; either version
+ *      2 of the License, or (at your option) any later version.
+ *
+ *	Fixes:
+ */
+
+#include <linux/rtnetlink.h>
+#include <linux/notifier.h>
+
+static ATOMIC_NOTIFIER_HEAD(netevent_notif_chain);
+
+/**
+ *	register_netevent_notifier - register a netevent notifier block
+ *	@nb: notifier
+ *
+ *	Register a notifier to be called when a netevent occurs.
+ *	The notifier passed is linked into the kernel structures and must
+ *	not be reused until it has been unregistered. A negative errno code
+ *	is returned on a failure.
+ */
+int register_netevent_notifier(struct notifier_block *nb)
+{
+	int err;
+
+	err = atomic_notifier_chain_register(&netevent_notif_chain, nb);
+	return err;
+}
+
+/**
+ *	unregister_netevent_notifier - unregister a netevent notifier block
+ *	@nb: notifier
+ *
+ *	Unregister a notifier previously registered by
+ *	register_netevent_notifier(). The notifier is unlinked from the
+ *	kernel structures and may then be reused. A negative errno code
+ *	is returned on a failure.
+ */
+
+int unregister_netevent_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_unregister(&netevent_notif_chain, nb);
+}
+
+/**
+ *	call_netevent_notifiers - call all netevent notifier blocks
+ *      @val: value passed unmodified to notifier function
+ *      @v:   pointer passed unmodified to notifier function
+ *
+ *	Call all netevent notifier blocks.  Parameters and return value
+ *	are as for atomic_notifier_call_chain().
+ */
+
+int call_netevent_notifiers(unsigned long val, void *v)
+{
+	return atomic_notifier_call_chain(&netevent_notif_chain, val, v);
+}
+
+EXPORT_SYMBOL_GPL(register_netevent_notifier);
+EXPORT_SYMBOL_GPL(unregister_netevent_notifier);
+EXPORT_SYMBOL_GPL(call_netevent_notifiers);


* [PATCH Round 3 2/2] Core network changes to support network event notification.
  2006-06-27 20:50 [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism Steve Wise
  2006-06-27 20:51 ` [PATCH Round 3 1/2] " Steve Wise
@ 2006-06-27 20:51 ` Steve Wise
  2006-06-28  2:54 ` [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism Herbert Xu
  2 siblings, 0 replies; 19+ messages in thread
From: Steve Wise @ 2006-06-27 20:51 UTC (permalink / raw)
  To: davem; +Cc: netdev


This patch adds netevent and netlink calls for neighbour change, route
add/del, pmtu change, and routing redirect events.

Netlink Details:

Neighbour change events are broadcast as a new ndmsg type RTM_NEIGHUPD.

Path mtu change events are broadcast as a new rtmsg type RTM_ROUTEUPD.

Routing redirect events are broadcast as a pair of rtmsgs, RTM_DELROUTE
and RTM_NEWROUTE.
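
For readers decoding these messages, the numeric values follow from where
the rtnetlink.h hunk below places the new enumerators: each one directly
follows the corresponding RTM_GET* entry. The sketch below restates those
values locally instead of including the kernel header; netevent_describe()
is a hypothetical helper, not something the patch provides:

```c
/* Message types, numbered as the rtnetlink.h hunk implies:
 * RTM_ROUTEUPD follows RTM_GETROUTE, RTM_NEIGHUPD follows RTM_GETNEIGH. */
enum {
	RTM_NEWROUTE = 24,
	RTM_DELROUTE,
	RTM_GETROUTE,
	RTM_ROUTEUPD,		/* proposed: path MTU change */

	RTM_NEWNEIGH = 28,
	RTM_DELNEIGH,
	RTM_GETNEIGH,
	RTM_NEIGHUPD,		/* proposed: neighbour change */
};

/* Map a received nlmsg_type to the event it reports (hypothetical). */
const char *netevent_describe(int nlmsg_type)
{
	switch (nlmsg_type) {
	case RTM_NEIGHUPD:
		return "neighbour change";
	case RTM_ROUTEUPD:
		return "path MTU change";
	case RTM_DELROUTE:
		return "route removed (or first half of a redirect)";
	case RTM_NEWROUTE:
		return "route added (or second half of a redirect)";
	default:
		return "other";
	}
}
```

Since a redirect arrives as an ordinary RTM_DELROUTE/RTM_NEWROUTE pair, a
listener cannot tell it apart from an unrelated delete and add by message
type alone; on IPv4 the RTM_NEWROUTE half carries rtm_protocol set to
RTPROT_REDIRECT, which may help.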
---

 include/linux/rtnetlink.h |    4 ++
 net/core/Makefile         |    2 +
 net/core/neighbour.c      |   37 ++++++++++++++++---
 net/ipv4/fib_semantics.c  |    9 +++++
 net/ipv4/route.c          |   86 ++++++++++++++++++++++++++++++++++++++++++--
 net/ipv6/route.c          |   87 +++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 213 insertions(+), 12 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index facd9ee..340ca4f 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -35,6 +35,8 @@ #define RTM_NEWROUTE	RTM_NEWROUTE
 #define RTM_DELROUTE	RTM_DELROUTE
 	RTM_GETROUTE,
 #define RTM_GETROUTE	RTM_GETROUTE
+	RTM_ROUTEUPD,
+#define RTM_ROUTEUPD	RTM_ROUTEUPD
 
 	RTM_NEWNEIGH	= 28,
 #define RTM_NEWNEIGH	RTM_NEWNEIGH
@@ -42,6 +44,8 @@ #define RTM_NEWNEIGH	RTM_NEWNEIGH
 #define RTM_DELNEIGH	RTM_DELNEIGH
 	RTM_GETNEIGH,
 #define RTM_GETNEIGH	RTM_GETNEIGH
+	RTM_NEIGHUPD,
+#define RTM_NEIGHUPD	RTM_NEIGHUPD
 
 	RTM_NEWRULE	= 32,
 #define RTM_NEWRULE	RTM_NEWRULE
diff --git a/net/core/Makefile b/net/core/Makefile
index e9bd246..2645ba4 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -7,7 +7,7 @@ obj-y := sock.o request_sock.o skbuff.o 
 
 obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 
-obj-y		     += dev.o ethtool.o dev_mcast.o dst.o \
+obj-y		     += dev.o ethtool.o dev_mcast.o dst.o netevent.o \
 			neighbour.o rtnetlink.o utils.o link_watch.o filter.o
 
 obj-$(CONFIG_XFRM) += flow.o
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 50a8c73..bf70981 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -30,9 +30,11 @@ #include <linux/times.h>
 #include <net/neighbour.h>
 #include <net/dst.h>
 #include <net/sock.h>
+#include <net/netevent.h>
 #include <linux/rtnetlink.h>
 #include <linux/random.h>
 #include <linux/string.h>
+#include <linux/notifier.h>
 
 #define NEIGH_DEBUG 1
 
@@ -59,6 +61,7 @@ static void neigh_app_notify(struct neig
 #endif
 static int pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev);
 void neigh_changeaddr(struct neigh_table *tbl, struct net_device *dev);
+static void rtm_neigh_change(struct neighbour *n);
 
 static struct neigh_table *neigh_tables;
 #ifdef CONFIG_PROC_FS
@@ -755,6 +758,7 @@ #endif
 			neigh->nud_state = NUD_STALE;
 			neigh->updated = jiffies;
 			neigh_suspect(neigh);
+			notify = 1;
 		}
 	} else if (state & NUD_DELAY) {
 		if (time_before_eq(now, 
@@ -763,6 +767,7 @@ #endif
 			neigh->nud_state = NUD_REACHABLE;
 			neigh->updated = jiffies;
 			neigh_connect(neigh);
+			notify = 1;
 			next = neigh->confirmed + neigh->parms->reachable_time;
 		} else {
 			NEIGH_PRINTK2("neigh %p is probed.\n", neigh);
@@ -820,6 +825,8 @@ #endif
 out:
 		write_unlock(&neigh->lock);
 	}
+	if (notify)
+		rtm_neigh_change(neigh);
 
 #ifdef CONFIG_ARPD
 	if (notify && neigh->parms->app_probes)
@@ -927,9 +934,7 @@ int neigh_update(struct neighbour *neigh
 {
 	u8 old;
 	int err;
-#ifdef CONFIG_ARPD
 	int notify = 0;
-#endif
 	struct net_device *dev;
 	int update_isrouter = 0;
 
@@ -949,9 +954,7 @@ #endif
 			neigh_suspect(neigh);
 		neigh->nud_state = new;
 		err = 0;
-#ifdef CONFIG_ARPD
 		notify = old & NUD_VALID;
-#endif
 		goto out;
 	}
 
@@ -1023,9 +1026,7 @@ #endif
 		if (!(new & NUD_CONNECTED))
 			neigh->confirmed = jiffies -
 				      (neigh->parms->base_reachable_time << 1);
-#ifdef CONFIG_ARPD
 		notify = 1;
-#endif
 	}
 	if (new == old)
 		goto out;
@@ -1056,7 +1057,11 @@ out:
 			(neigh->flags | NTF_ROUTER) :
 			(neigh->flags & ~NTF_ROUTER);
 	}
+
 	write_unlock_bh(&neigh->lock);
+
+	if (notify)
+		rtm_neigh_change(neigh);
 #ifdef CONFIG_ARPD
 	if (notify && neigh->parms->app_probes)
 		neigh_app_notify(neigh);
@@ -2370,9 +2375,27 @@ static void neigh_app_notify(struct neig
 	NETLINK_CB(skb).dst_group  = RTNLGRP_NEIGH;
 	netlink_broadcast(rtnl, skb, 0, RTNLGRP_NEIGH, GFP_ATOMIC);
 }
-
 #endif /* CONFIG_ARPD */
 
+static void rtm_neigh_change(struct neighbour *n)
+{
+	struct nlmsghdr *nlh;
+	int size = NLMSG_SPACE(sizeof(struct ndmsg) + 256);
+	struct sk_buff *skb = alloc_skb(size, GFP_ATOMIC);
+
+	call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, n);
+	if (!skb)
+		return;
+
+	if (neigh_fill_info(skb, n, 0, 0, RTM_NEIGHUPD, 0) < 0) {
+		kfree_skb(skb);
+		return;
+	}
+	nlh			   = (struct nlmsghdr *)skb->data;
+	NETLINK_CB(skb).dst_group  = RTNLGRP_NEIGH;
+	netlink_broadcast(rtnl, skb, 0, RTNLGRP_NEIGH, GFP_ATOMIC);
+}
+
 #ifdef CONFIG_SYSCTL
 
 static struct neigh_sysctl_table {
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 0f4145b..197c365 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -45,6 +45,7 @@ #include <net/tcp.h>
 #include <net/sock.h>
 #include <net/ip_fib.h>
 #include <net/ip_mp_alg.h>
+#include <net/netevent.h>
 
 #include "fib_lookup.h"
 
@@ -280,6 +281,14 @@ void rtmsg_fib(int event, u32 key, struc
 	struct sk_buff *skb;
 	u32 pid = req ? req->pid : n->nlmsg_pid;
 	int size = NLMSG_SPACE(sizeof(struct rtmsg)+256);
+	struct netevent_route_info nri;
+	int netevent;
+
+	nri.family = AF_INET;
+	nri.data = fa->fa_info;
+	netevent = event == RTM_NEWROUTE ? NETEVENT_ROUTE_ADD 
+					 : NETEVENT_ROUTE_DEL;
+	call_netevent_notifiers(netevent, &nri);
 
 	skb = alloc_skb(size, GFP_KERNEL);
 	if (!skb)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 60b11ae..cef7c6d 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -105,6 +105,7 @@ #include <net/tcp.h>
 #include <net/icmp.h>
 #include <net/xfrm.h>
 #include <net/ip_mp_alg.h>
+#include <net/netevent.h>
 #ifdef CONFIG_SYSCTL
 #include <linux/sysctl.h>
 #endif
@@ -152,6 +153,8 @@ static struct dst_entry *ipv4_negative_a
 static void		 ipv4_link_failure(struct sk_buff *skb);
 static void		 ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
 static int rt_garbage_collect(void);
+static int rt_fill_info(struct sk_buff *skb, u32 pid, u32 seq, int event,
+			int nowait, unsigned int flags, unsigned int prot);
 
 
 static struct dst_ops ipv4_dst_ops = {
@@ -1112,6 +1115,52 @@ static void rt_del(unsigned hash, struct
 	spin_unlock_bh(rt_hash_lock_addr(hash));
 }
 
+static void rtm_redirect(struct rtable *old, struct rtable *new)
+{
+	struct netevent_redirect netevent;
+	struct sk_buff *skb;
+	int err;
+
+	netevent.old = &old->u.dst;
+	netevent.new = &new->u.dst;
+
+	/* notify netevent subscribers */
+	call_netevent_notifiers(NETEVENT_REDIRECT, &netevent);
+
+	/* Post NETLINK messages:  RTM_DELROUTE for old route, 
+				   RTM_NEWROUTE for new route */
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
+	if (!skb)
+		return;
+	skb->mac.raw = skb->nh.raw = skb->data;
+	skb->dst = &old->u.dst;
+	NETLINK_CB(skb).dst_pid = 0;
+
+	err = rt_fill_info(skb, 0, 0, RTM_DELROUTE, 1, 0, RTPROT_UNSPEC);
+	if (err <= 0)
+		goto out_free;
+
+	netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV4_ROUTE, GFP_ATOMIC);
+
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
+	if (!skb)
+		return;
+	skb->mac.raw = skb->nh.raw = skb->data;
+	skb->dst = &new->u.dst;
+	NETLINK_CB(skb).dst_pid = 0;
+
+	err = rt_fill_info(skb, 0, 0, RTM_NEWROUTE, 1, 0, RTPROT_REDIRECT);
+	if (err <= 0)
+		goto out_free;
+
+	netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV4_ROUTE, GFP_ATOMIC);
+	return;
+
+out_free:
+	kfree_skb(skb);
+	return;
+}
+
 void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw,
 		    u32 saddr, struct net_device *dev)
 {
@@ -1211,6 +1260,8 @@ void ip_rt_redirect(u32 old_gw, u32 dadd
 					rt_drop(rt);
 					goto do_next;
 				}
+				
+				rtm_redirect(rth, rt);
 
 				rt_del(hash, rth);
 				if (!rt_intern_hash(hash, rt, &rt))
@@ -1437,6 +1488,32 @@ unsigned short ip_rt_frag_needed(struct 
 	return est_mtu ? : new_mtu;
 }
 
+static void rtm_pmtu_update(struct rtable *rt)
+{
+	struct sk_buff *skb;
+	int err;
+
+	call_netevent_notifiers(NETEVENT_PMTU_UPDATE, &rt->u.dst);
+
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
+	if (!skb)
+		return;
+	skb->mac.raw = skb->nh.raw = skb->data;
+	skb->dst = &rt->u.dst;
+	NETLINK_CB(skb).dst_pid = 0;
+
+	err = rt_fill_info(skb, 0, 0, RTM_ROUTEUPD, 1, 0, RTPROT_UNSPEC);
+	if (err <= 0)
+		goto out_free;
+
+	netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV4_ROUTE, GFP_ATOMIC);
+	return;
+
+out_free:
+	kfree_skb(skb);
+	return;
+}
+
 static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu)
 {
 	if (dst->metrics[RTAX_MTU-1] > mtu && mtu >= 68 &&
@@ -1447,6 +1524,7 @@ static void ip_rt_update_pmtu(struct dst
 		}
 		dst->metrics[RTAX_MTU-1] = mtu;
 		dst_set_expires(dst, ip_rt_mtu_expires);
+		rtm_pmtu_update((struct rtable *)dst);
 	}
 }
 
@@ -2622,7 +2700,7 @@ int ip_route_output_key(struct rtable **
 }
 
 static int rt_fill_info(struct sk_buff *skb, u32 pid, u32 seq, int event,
-			int nowait, unsigned int flags)
+			int nowait, unsigned int flags, unsigned int prot)
 {
 	struct rtable *rt = (struct rtable*)skb->dst;
 	struct rtmsg *r;
@@ -2641,7 +2719,7 @@ #endif
 	r->rtm_table	= RT_TABLE_MAIN;
 	r->rtm_type	= rt->rt_type;
 	r->rtm_scope	= RT_SCOPE_UNIVERSE;
-	r->rtm_protocol = RTPROT_UNSPEC;
+	r->rtm_protocol = prot;
 	r->rtm_flags	= (rt->rt_flags & ~0xFFFF) | RTM_F_CLONED;
 	if (rt->rt_flags & RTCF_NOTIFY)
 		r->rtm_flags |= RTM_F_NOTIFY;
@@ -2787,7 +2865,7 @@ int inet_rtm_getroute(struct sk_buff *in
 	NETLINK_CB(skb).dst_pid = NETLINK_CB(in_skb).pid;
 
 	err = rt_fill_info(skb, NETLINK_CB(in_skb).pid, nlh->nlmsg_seq,
-				RTM_NEWROUTE, 0, 0);
+				RTM_NEWROUTE, 0, 0, RTPROT_UNSPEC);
 	if (!err)
 		goto out_free;
 	if (err < 0) {
@@ -2825,7 +2903,7 @@ int ip_rt_dump(struct sk_buff *skb,  str
 			skb->dst = dst_clone(&rt->u.dst);
 			if (rt_fill_info(skb, NETLINK_CB(cb->skb).pid,
 					 cb->nlh->nlmsg_seq, RTM_NEWROUTE, 
-					 1, NLM_F_MULTI) <= 0) {
+					 1, NLM_F_MULTI, RTPROT_UNSPEC) <= 0) {
 				dst_release(xchg(&skb->dst, NULL));
 				rcu_read_unlock_bh();
 				goto done;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 8a77793..95f68bc 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -54,6 +54,7 @@ #include <net/tcp.h>
 #include <linux/rtnetlink.h>
 #include <net/dst.h>
 #include <net/xfrm.h>
+#include <net/netevent.h>
 
 #include <asm/uaccess.h>
 
@@ -97,6 +98,10 @@ static int		ip6_pkt_discard(struct sk_bu
 static int		ip6_pkt_discard_out(struct sk_buff *skb);
 static void		ip6_link_failure(struct sk_buff *skb);
 static void		ip6_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
+static int rt6_fill_node(struct sk_buff *skb, struct rt6_info *rt,
+			 struct in6_addr *dst, struct in6_addr *src,
+			 int iif, int type, u32 pid, u32 seq,
+			 int prefix, unsigned int flags);
 
 #ifdef CONFIG_IPV6_ROUTE_INFO
 static struct rt6_info *rt6_add_route_info(struct in6_addr *prefix, int prefixlen,
@@ -732,6 +737,32 @@ static void ip6_link_failure(struct sk_b
 	}
 }
 
+static void rtm_pmtu_update(struct rt6_info *rt)
+{
+	struct sk_buff *skb;
+	int err;
+
+	call_netevent_notifiers(NETEVENT_PMTU_UPDATE, &rt->u.dst);
+
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
+	if (!skb)
+		return;
+	skb->mac.raw = skb->nh.raw = skb->data;
+	skb->dst = &rt->u.dst;
+	NETLINK_CB(skb).dst_pid = 0;
+
+	err = rt6_fill_node(skb, rt, NULL, NULL, 0, RTM_ROUTEUPD, 0, 0, 0, 0);
+	if (err <= 0)
+		goto out_free;
+
+	netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV6_ROUTE, GFP_ATOMIC);
+	return;
+
+out_free:
+	kfree_skb(skb);
+	return;
+}
+
 static void ip6_rt_update_pmtu(struct dst_entry *dst, u32 mtu)
 {
 	struct rt6_info *rt6 = (struct rt6_info*)dst;
@@ -743,6 +774,7 @@ static void ip6_rt_update_pmtu(struct ds
 			dst->metrics[RTAX_FEATURES-1] |= RTAX_FEATURE_ALLFRAG;
 		}
 		dst->metrics[RTAX_MTU-1] = mtu;
+		rtm_pmtu_update(rt6);
 	}
 }
 
@@ -908,6 +940,7 @@ int ip6_route_add(struct in6_rtmsg *rtms
 	struct net_device *dev = NULL;
 	struct inet6_dev *idev = NULL;
 	int addr_type;
+	struct netevent_route_info nri;
 
 	rta = (struct rtattr **) _rtattr;
 
@@ -1086,6 +1119,9 @@ install_route:
 		rt->u.dst.metrics[RTAX_ADVMSS-1] = ipv6_advmss(dst_mtu(&rt->u.dst));
 	rt->u.dst.dev = dev;
 	rt->rt6i_idev = idev;
+	nri.family = AF_INET6;
+	nri.data = rt;
+	call_netevent_notifiers(NETEVENT_ROUTE_ADD, &nri);
 	return ip6_ins_rt(rt, nlh, _rtattr, req);
 
 out:
@@ -1117,6 +1153,7 @@ static int ip6_route_del(struct in6_rtms
 	struct fib6_node *fn;
 	struct rt6_info *rt;
 	int err = -ESRCH;
+	struct netevent_route_info nri;
 
 	read_lock_bh(&rt6_lock);
 
@@ -1138,6 +1175,10 @@ static int ip6_route_del(struct in6_rtms
 				continue;
 			dst_hold(&rt->u.dst);
 			read_unlock_bh(&rt6_lock);
+			
+			nri.family = AF_INET6;
+			nri.data = rt;
+			call_netevent_notifiers(NETEVENT_ROUTE_DEL, &nri);
 
 			return ip6_del_rt(rt, nlh, _rtattr, req);
 		}
@@ -1147,6 +1188,50 @@ static int ip6_route_del(struct in6_rtms
 	return err;
 }
 
+static void rtm_redirect(struct rt6_info *old, struct rt6_info *new)
+{
+	struct netevent_redirect netevent;
+	struct sk_buff *skb;
+	int err;
+
+	netevent.old = &old->u.dst;
+	netevent.new = &new->u.dst;
+	call_netevent_notifiers(NETEVENT_REDIRECT, &netevent);
+
+	/* Post NETLINK messages:  RTM_DELROUTE for old route, 
+				   RTM_NEWROUTE for new route */
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
+	if (!skb)
+		return;
+	skb->mac.raw = skb->nh.raw = skb->data;
+	NETLINK_CB(skb).dst_pid = 0;
+	NETLINK_CB(skb).dst_group = RTNLGRP_IPV6_ROUTE;
+
+	err = rt6_fill_node(skb, old, NULL, NULL, 0, RTM_DELROUTE, 0, 0, 0, 0);
+	if (err <= 0)
+		goto out_free;
+
+	netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV6_ROUTE, GFP_ATOMIC);
+
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
+	if (!skb)
+		return;
+	skb->mac.raw = skb->nh.raw = skb->data;
+	NETLINK_CB(skb).dst_pid = 0;
+	NETLINK_CB(skb).dst_group = RTNLGRP_IPV6_ROUTE;
+
+	err = rt6_fill_node(skb, new, NULL, NULL, 0, RTM_NEWROUTE, 0, 0, 0, 0);
+	if (err <= 0)
+		goto out_free;
+
+	netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV6_ROUTE, GFP_ATOMIC);
+	return;
+
+out_free:
+	kfree_skb(skb);
+	return;
+}
+
 /*
  *	Handle redirects
  */
@@ -1253,6 +1338,8 @@ restart:
 	if (ip6_ins_rt(nrt, NULL, NULL, NULL))
 		goto out;
 
+	rtm_redirect(rt, nrt);
+
 	if (rt->rt6i_flags&RTF_CACHE) {
 		ip6_del_rt(rt, NULL, NULL, NULL);
 		return;


* Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism
  2006-06-27 20:50 [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism Steve Wise
  2006-06-27 20:51 ` [PATCH Round 3 1/2] " Steve Wise
  2006-06-27 20:51 ` [PATCH Round 3 2/2] Core network changes to support network event notification Steve Wise
@ 2006-06-28  2:54 ` Herbert Xu
  2006-06-28  3:04   ` Herbert Xu
  2 siblings, 1 reply; 19+ messages in thread
From: Herbert Xu @ 2006-06-28  2:54 UTC (permalink / raw)
  To: Steve Wise; +Cc: davem, netdev

Steve Wise <swise@opengridcomputing.com> wrote:
> 
> The reason these devices need update events is because they typically
> cache this information in hardware and need to be notified when this
> information has been updated.  For information on RDMA protocols, see:
> http://www.ietf.org/html.charters/rddp-charter.html.

Please give more specific reasons for needing these events because it
is certainly far from obvious from reading those documents.

Without reasons these invasive changes may turn out to be completely
inappropriate.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism
  2006-06-28  2:54 ` [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism Herbert Xu
@ 2006-06-28  3:04   ` Herbert Xu
  2006-06-28  3:24     ` Jeff Garzik
  0 siblings, 1 reply; 19+ messages in thread
From: Herbert Xu @ 2006-06-28  3:04 UTC (permalink / raw)
  To: Steve Wise; +Cc: davem, netdev, Jeff Garzik

On Wed, Jun 28, 2006 at 12:54:10PM +1000, Herbert Xu wrote:
> 
> Please give more specific reasons for needing these events because it
> is certainly far from obvious from reading those documents.

Never mind, I've found your earlier messages on the list which explain
your reasons more clearly.  It would be nice if you could include those
explanations in your patch description.

BTW, does this mean that we're now comfortable with full TOE?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism
  2006-06-28  3:04   ` Herbert Xu
@ 2006-06-28  3:24     ` Jeff Garzik
  2006-06-28  3:37       ` Herbert Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff Garzik @ 2006-06-28  3:24 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Steve Wise, davem, netdev

Herbert Xu wrote:
> On Wed, Jun 28, 2006 at 12:54:10PM +1000, Herbert Xu wrote:
>> Please give more specific reasons for needing these events because it
>> is certainly far from obvious from reading those documents.
> 
> Never mind, I've found your earlier messages on the list which explain
> your reasons more clearly.  It would be nice if you could include those
> explanations in your patch description.
> 
> BTW, does this mean that we're now comfortable with full TOE?

I don't see how that position has changed?

http://linux-net.osdl.org/index.php/TOE

	Jeff




* Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism
  2006-06-28  3:24     ` Jeff Garzik
@ 2006-06-28  3:37       ` Herbert Xu
  2006-06-28  4:18         ` TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism) Jeff Garzik
  0 siblings, 1 reply; 19+ messages in thread
From: Herbert Xu @ 2006-06-28  3:37 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Steve Wise, davem, netdev

On Tue, Jun 27, 2006 at 11:24:25PM -0400, Jeff Garzik wrote:
>
> I don't see how that position has changed?
> 
> http://linux-net.osdl.org/index.php/TOE

Well I must say that RDMA over TCP smells very much like TOE.  They've
got an ARP table, a routing table, and presumably a TCP stack.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism)
  2006-06-28  3:37       ` Herbert Xu
@ 2006-06-28  4:18         ` Jeff Garzik
  2006-06-28  4:29           ` Herbert Xu
  2006-06-28 14:18           ` Steve Wise
  0 siblings, 2 replies; 19+ messages in thread
From: Jeff Garzik @ 2006-06-28  4:18 UTC (permalink / raw)
  To: Herbert Xu, davem; +Cc: Steve Wise, netdev

Herbert Xu wrote:
> On Tue, Jun 27, 2006 at 11:24:25PM -0400, Jeff Garzik wrote:
>> I don't see how that position has changed?
>>
>> http://linux-net.osdl.org/index.php/TOE
> 
> Well I must say that RDMA over TCP smells very much like TOE.  They've
> got an ARP table, a routing table, and presumably a TCP stack.

A PCI device that presents itself as a SCSI controller, but under the 
hood is really iSCSI-over-TCP smells like TOE.  Running a virtualized 
Linux guest on top of a proprietary stack [which provides networking 
services to guests] also smells like TOE.  :)

If a TOE vendor wants to do TOE in a way that is transparent to the 
kernel, more power to them.  Such non-Linux TCP stack solutions still 
suffer many of the problems listed at the web page above, but at least 
they impose no burden on kernel maintenance.

i.e. we really _do not_ want to get into the habit of co-managing arp 
tables, routing tables, filtering rules, and dozens of other such 
resources with multiple remote, independent TCP stacks.  We have enough 
complexity as it is today, coordinating between the random variations of 
SMP, uniprocessor, and NUMA machines out there.  Not to mention 
competing with under-the-hood firmware actions (ASF) on NICs.

As an aside, RDMA over TCP just seems silly.  TCP was _not_ meant to do 
the things that RDMA users want.  The infiniband/RDMA programming model 
is an ultra-low-latency polling model where one or two apps are allowed 
to completely consume the machine, either busy-waiting or processing 
messages.

Unfortunately I don't have more details, so you just get a generalized 
rant :)

	Jeff


* Re: TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism)
  2006-06-28  4:18         ` TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism) Jeff Garzik
@ 2006-06-28  4:29           ` Herbert Xu
  2006-06-28  4:40             ` Jeff Garzik
                               ` (2 more replies)
  2006-06-28 14:18           ` Steve Wise
  1 sibling, 3 replies; 19+ messages in thread
From: Herbert Xu @ 2006-06-28  4:29 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: davem, Steve Wise, netdev

On Wed, Jun 28, 2006 at 12:18:25AM -0400, Jeff Garzik wrote:
> 
> A PCI device that presents itself as a SCSI controller, but under the 
> hood is really iSCSI-over-TCP smells like TOE.  Running a virtualized 
> Linux guest on top of a proprietary stack [which provides networking 
> services to guests] also smells like TOE.  :)

Agreed.  However, when they start adding hooks to the ARP table, the
routing table, and PMTU management, it begs the question what more is
there to add for TOE (well, user-space driven TOE at least)?

> Unfortunately I don't have more details, so you just get a generalized 
> rant :)

OK, the patch under discussion here adds hooks to all the stuff in the
previous paragraph for the purpose of RDMA over TCP (well I must say
that the exact RDMA application/hardware has never been clearly given
but this is what I can gather from the previous posts).

Put it another way, I think the dividing line between TOE and iSCSI or
virtualisation is exactly the interface between them and the Linux kernel.
If the interface is an existing one such as SCSI or standard IP then it's
OK.  However, when it starts poking in the guts of the Linux stack I'd say
that it has crossed the line.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism)
  2006-06-28  4:29           ` Herbert Xu
@ 2006-06-28  4:40             ` Jeff Garzik
  2006-06-28  4:43             ` TOE, etc David Miller
  2006-06-28 14:31             ` TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism) Steve Wise
  2 siblings, 0 replies; 19+ messages in thread
From: Jeff Garzik @ 2006-06-28  4:40 UTC (permalink / raw)
  To: Herbert Xu; +Cc: davem, Steve Wise, netdev

Herbert Xu wrote:
> On Wed, Jun 28, 2006 at 12:18:25AM -0400, Jeff Garzik wrote:
>> A PCI device that presents itself as a SCSI controller, but under the 
>> hood is really iSCSI-over-TCP smells like TOE.  Running a virtualized 
>> Linux guest on top of a proprietary stack [which provides networking 
>> services to guests] also smells like TOE.  :)
> 
> Agreed.  However, when they start adding hooks to the ARP table, the
> routing table, and PMTU management, it begs the question what more is
> there to add for TOE (well, user-space driven TOE at least)?

Well, you've always been able to implement a userspace (or otherwise 
completely virtualized) network stack.  tuntap and the packet socket 
enable that, if nothing else.  But, like you characterize below, those 
are existing, well-defined, easily contained interfaces.


> Put it another way, I think the dividing line between TOE and iSCSI or
> virtualisation is exactly the interface between them and the Linux kernel.
> If the interface is an existing one such as SCSI or standard IP then it's
> OK.  However, when it starts poking in the guts of the Linux stack I'd say
> that it has crossed the line.

Strongly agreed.

	Jeff




* Re: TOE, etc.
  2006-06-28  4:29           ` Herbert Xu
  2006-06-28  4:40             ` Jeff Garzik
@ 2006-06-28  4:43             ` David Miller
  2006-06-28  5:35               ` Herbert Xu
  2006-06-28 14:31             ` TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism) Steve Wise
  2 siblings, 1 reply; 19+ messages in thread
From: David Miller @ 2006-06-28  4:43 UTC (permalink / raw)
  To: herbert; +Cc: jgarzik, swise, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 28 Jun 2006 14:29:59 +1000

> On Wed, Jun 28, 2006 at 12:18:25AM -0400, Jeff Garzik wrote:
> > 
> > A PCI device that presents itself as a SCSI controller, but under the 
> > hood is really iSCSI-over-TCP smells like TOE.  Running a virtualized 
> > Linux guest on top of a proprietary stack [which provides networking 
> > services to guests] also smells like TOE.  :)
> 
> Agreed.  However, when they start adding hooks to the ARP table, the
> routing table, and PMTU management, it begs the question what more is
> there to add for TOE (well, user-space driven TOE at least)?

Socket state, and that is one thing I don't see them doing yet.

> Put it another way, I think the dividing line between TOE and iSCSI or
> virtualisation is exactly the interface between them and the Linux kernel.
> If the interface is an existing one such as SCSI or standard IP then it's
> OK.  However, when it starts poking in the guts of the Linux stack I'd say
> that it has crossed the line.

Yeah, it's starting to smell really bad.

But we have to realize they've already been given 95% of the
interfaces they need to speak IP using our routes and our neighbour
entries.

Right?


* Re: TOE, etc.
  2006-06-28  4:43             ` TOE, etc David Miller
@ 2006-06-28  5:35               ` Herbert Xu
  2006-06-28  6:31                 ` David Miller
                                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Herbert Xu @ 2006-06-28  5:35 UTC (permalink / raw)
  To: David Miller; +Cc: jgarzik, swise, netdev

On Tue, Jun 27, 2006 at 09:43:23PM -0700, David Miller wrote:
> 
> Socket state, and that is one thing I don't see them doing yet.

I wonder what happens when the Linux TCP stack attempts to open a
connection to a remote host when that connection is already open
in the RDMA NIC?  For that matter what happens if a Linux application
decides to listen on a TCP port already listened on by the RDMA
NIC?

The only saving grace is that they're only doing RDMA rather than
arbitrary TCP.  However, exactly the same infrastructure can be used
to do arbitrary TCP should they wish to.

> But we have to realize they've already been given 95% of the
> interfaces they need to speak IP using our routes and our neighbour
> entries.
> 
> Right?

Yes, however I think the same argument could be applied to TOE.

With their RDMA NIC, we'll have TCP/SCTP connections that bypass
netfilter, tc, IPsec, AF_PACKET/tcpdump and the rest of our stack
while at the same time it is using the same IP address as us and
deciding what packets we will or won't see.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: TOE, etc.
  2006-06-28  5:35               ` Herbert Xu
@ 2006-06-28  6:31                 ` David Miller
  2006-06-28 14:41                 ` Steve Wise
  2006-06-28 14:54                 ` Steve Wise
  2 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2006-06-28  6:31 UTC (permalink / raw)
  To: herbert; +Cc: jgarzik, swise, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 28 Jun 2006 15:35:54 +1000

> With their RDMA NIC, we'll have TCP/SCTP connections that bypass
> netfilter, tc, IPsec, AF_PACKET/tcpdump and the rest of our stack
> while at the same time it is using the same IP address as us and
> deciding what packets we will or won't see.

That's true.  I don't think we should really add any more
help for these kinds of things then.


* Re: TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism)
  2006-06-28  4:18         ` TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism) Jeff Garzik
  2006-06-28  4:29           ` Herbert Xu
@ 2006-06-28 14:18           ` Steve Wise
  1 sibling, 0 replies; 19+ messages in thread
From: Steve Wise @ 2006-06-28 14:18 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Herbert Xu, davem, netdev

On Wed, 2006-06-28 at 00:18 -0400, Jeff Garzik wrote:
> Herbert Xu wrote:
> > On Tue, Jun 27, 2006 at 11:24:25PM -0400, Jeff Garzik wrote:
> >> I don't see how that position has changed?
> >>
> >> http://linux-net.osdl.org/index.php/TOE
> > 
> > Well I must say that RDMA over TCP smells very much like TOE.  They've
> > got an ARP table, a routing table, and presumably a TCP stack.
> 
> A PCI device that presents itself as a SCSI controller, but under the 
> hood is really iSCSI-over-TCP smells like TOE.  Running a virtualized 
> Linux guest on top of a proprietary stack [which provides networking 
> services to guests] also smells like TOE.  :)
> 

I wonder if the existing iSCSI solutions handle next-hop MAC address
changes, PMTU changes, etc?  They _can_, if they implement the entire
ARP and ICMP suites in HW/FW.  Just curious if these vendors also see
merit in the netevent changes I'm proposing...
  
> If a TOE vendor wants to do TOE in a way that is transparent to the 
> kernel, more power to them.  Such non-Linux TCP stack solutions still 
> suffer many of the problems listed at the web page above, but at least 
> they impose no burden on kernel maintenance.
> 
> i.e. we really _do not_ want to get into the habit of co-managing arp 
> tables, routing tables, filtering rules, and dozens of other such 
> resources with multiple remote, independent TCP stacks.  We have enough 
> complexity as it is today, coordinating between the random variations of 
> SMP, uniprocessor, and NUMA machines out there.  Not to mention 
> competing with under-the-hood firmware actions (ASF) on NICs.
> 
> As an aside, RDMA over TCP just seems silly.  TCP was _not_ meant to do 
> the things that RDMA users want.  The infiniband/RDMA programming model 
> is an ultra-low-latency polling model where one or two apps are allowed 
> to completely consume the machine, either busy-waiting or processing 
> messages.
> 

With RDMA over TCP, you can get the same ultra-low-latency, interrupt
coalescing or avoidance, and copy avoidance as with Infiniband (with
newly arriving 10Gb RDMA NICs).  The benefit over IB, IMO, is the fact
that your infrastructure is all IP.

If you're interested in a kernel-mode app that benefits from RDMA, then
check out NFS-RDMA, which runs today transparently over Infiniband and
RDMA/TCP devices using the Infiniband RDMA-CM and verbs with minor
changes to support RDMA/TCP devices. 

http://sourceforge.net/projects/nfs-rdma/


Steve.




* Re: TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism)
  2006-06-28  4:29           ` Herbert Xu
  2006-06-28  4:40             ` Jeff Garzik
  2006-06-28  4:43             ` TOE, etc David Miller
@ 2006-06-28 14:31             ` Steve Wise
  2 siblings, 0 replies; 19+ messages in thread
From: Steve Wise @ 2006-06-28 14:31 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Jeff Garzik, davem, netdev

On Wed, 2006-06-28 at 14:29 +1000, Herbert Xu wrote:
> On Wed, Jun 28, 2006 at 12:18:25AM -0400, Jeff Garzik wrote:
> > 
> > A PCI device that presents itself as a SCSI controller, but under the 
> > hood is really iSCSI-over-TCP smells like TOE.  Running a virtualized 
> > Linux guest on top of a proprietary stack [which provides networking 
> > services to guests] also smells like TOE.  :)
> 
> Agreed.  However, when they start adding hooks to the ARP table, the
> routing table, and PMTU management, it begs the question what more is
> there to add for TOE (well, user-space driven TOE at least)?
>  
> > Unfortunately I don't have more details, so you just get a generalized 
> > rant :)
> 
> OK, the patch under discussion here adds hooks to all the stuff in the
> previous paragraph for the purpose of RDMA over TCP (well I must say
> that the exact RDMA application/hardware has never been clearly given
> but this is what I can gather from the previous posts).

There are Ammasso and Chelsio RDMA/Ethernet drivers in the openib.org
svn iwarp branch today.  The goal is to submit them for review and
inclusion into linux.  The Ammasso driver has been through 3 review
cycles on lkml and netdev.  There are other vendors with drivers, but
they're currently not disclosing any information to me about their
status.

Applications: 

kernel: NFS-RDMA, iSER, RDP.  
user:  MPI, uDAPL (both middleware).

The Ammasso driver is a different model.  It actually has a full
TCP/ARP/ICMP stack and doesn't require these hooks.  But the RDMA/TCP
model defined and implemented, I think, by most vendors is a model where
the HW is doing a limited TCP offload, relying on the native stack for
L2 and L3 integration (as described in the netevent patch).
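
To make that model concrete, here is a rough userspace sketch of the
idea: the native stack publishes neighbour and PMTU events, and the
adapter driver refreshes the state it has cached in hardware.  All the
demo_* names are hypothetical and are not the identifiers used in the
netevent patch; this only illustrates the notifier pattern.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical event codes; NOT the identifiers used by the patch. */
enum demo_event { DEMO_NEIGH_UPDATE, DEMO_PMTU_UPDATE };

/* Payload for a neighbour update: IPv4 address and its new MAC. */
struct demo_neigh {
	uint32_t ip;
	uint8_t mac[6];
};

typedef void (*demo_handler)(enum demo_event ev, const void *data, void *ctx);

#define DEMO_MAX_HANDLERS 4

/* A very small, fixed-size notifier chain. */
static struct {
	demo_handler fn;
	void *ctx;
} demo_chain[DEMO_MAX_HANDLERS];
static int demo_count;

/* Register a handler; returns 0 on success, -1 if the chain is full. */
int demo_register(demo_handler fn, void *ctx)
{
	if (demo_count == DEMO_MAX_HANDLERS)
		return -1;
	demo_chain[demo_count].fn = fn;
	demo_chain[demo_count].ctx = ctx;
	demo_count++;
	return 0;
}

/* Called by the "stack" side whenever cached information changes. */
void demo_notify(enum demo_event ev, const void *data)
{
	for (int i = 0; i < demo_count; i++)
		demo_chain[i].fn(ev, data, demo_chain[i].ctx);
}

/* Stand-in for the state an adapter would cache in hardware. */
struct demo_nic {
	uint32_t next_hop_ip;
	uint8_t next_hop_mac[6];
	unsigned int mtu;
};

/* Driver-side handler: refresh the cached entry that the event touches. */
void demo_nic_handler(enum demo_event ev, const void *data, void *ctx)
{
	struct demo_nic *nic = ctx;

	if (ev == DEMO_NEIGH_UPDATE) {
		const struct demo_neigh *n = data;

		if (n->ip == nic->next_hop_ip)
			memcpy(nic->next_hop_mac, n->mac, sizeof(nic->next_hop_mac));
	} else if (ev == DEMO_PMTU_UPDATE) {
		nic->mtu = *(const unsigned int *)data;
	}
}
```

A driver would call demo_register() once at load time; the stack side
would call demo_notify() from wherever a neighbour entry or path MTU
changes.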

> Put it another way, I think the dividing line between TOE and iSCSI or
> virtualisation is exactly the interface between them and the Linux kernel.
> If the interface is an existing one such as SCSI or standard IP then it's
> OK.  However, when it starts poking in the guts of the Linux stack I'd say
> that it has crossed the line.
> 

Don't these netevent hooks have utility for other purposes?  I.e., should
we really shoot down changes to linux _just because_ they might possibly
enable TOE? 

Steve.


* Re: TOE, etc.
  2006-06-28  5:35               ` Herbert Xu
  2006-06-28  6:31                 ` David Miller
@ 2006-06-28 14:41                 ` Steve Wise
  2006-06-28 14:54                 ` Steve Wise
  2 siblings, 0 replies; 19+ messages in thread
From: Steve Wise @ 2006-06-28 14:41 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, jgarzik, netdev

On Wed, 2006-06-28 at 15:35 +1000, Herbert Xu wrote:
> On Tue, Jun 27, 2006 at 09:43:23PM -0700, David Miller wrote:
> > 
> > Socket state, and that is one thing I don't see them doing yet.
> 
> I wonder what happens when the Linux TCP stack attempts to open a
> connection to a remote host when that connection is already open
> in the RDMA NIC?  For that matter what happens if a Linux application
> decides to listen on a TCP port already listened on by the RDMA
> NIC?
> 
> The only saving grace is that they're only doing RDMA rather than
> arbitrary TCP.  However, exactly the same infrastructure can be used
> to do arbitrary TCP should they wish to.
>  
> > But we have to realize they've already been given 95% of the
> > interfaces they need to speak IP using our routes and our neighbour
> > entries.
> > 
> > Right?
> 
> Yes, however I think the same argument could be applied to TOE.
> 
> With their RDMA NIC, we'll have TCP/SCTP connections that bypass
> netfilter, tc, IPsec, AF_PACKET/tcpdump and the rest of our stack
> while at the same time it is using the same IP address as us and
> deciding what packets we will or won't see.
> 

Doesn't iSCSI have the same issue?  No netfilter, IPsec, tcpdump, etc...






* Re: TOE, etc.
  2006-06-28  5:35               ` Herbert Xu
  2006-06-28  6:31                 ` David Miller
  2006-06-28 14:41                 ` Steve Wise
@ 2006-06-28 14:54                 ` Steve Wise
  2006-06-28 18:36                   ` David Miller
  2 siblings, 1 reply; 19+ messages in thread
From: Steve Wise @ 2006-06-28 14:54 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, jgarzik, netdev

On Wed, 2006-06-28 at 15:35 +1000, Herbert Xu wrote:
> On Tue, Jun 27, 2006 at 09:43:23PM -0700, David Miller wrote:
> > 
> > Socket state, and that is one thing I don't see them doing yet.
> 
> I wonder what happens when the Linux TCP stack attempts to open a
> connection to a remote host when that connection is already open
> in the RDMA NIC?  For that matter what happens if a Linux application
> decides to listen on a TCP port already listened on by the RDMA
> NIC?
> 

This issue would have to be handled by using separate IP addresses for
RDMA connections vs native stack TCP.

Consider an NFS-RDMA server.  Through administration, it would be
configured to listen on the specific RDMA IP addresses, and the native
stack TCP IP addresses, and thus support both TCP and RDMA NFS
connections.
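
On the native-stack side, that amounts to binding the listener to one
specific local address rather than the wildcard, so the same port stays
free on the address reserved for RDMA.  A minimal sketch using the
ordinary sockets API; the helper names are made up for illustration,
and the RDMA-side listen (via the RDMA-CM) is not shown:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill in an IPv4 socket address for one specific local address.
 * Returns 0 on success, -1 if 'ip' is not a valid dotted-quad string. */
int make_listen_addr(const char *ip, unsigned short port,
		     struct sockaddr_in *sa)
{
	memset(sa, 0, sizeof(*sa));
	sa->sin_family = AF_INET;
	sa->sin_port = htons(port);
	if (inet_pton(AF_INET, ip, &sa->sin_addr) != 1)
		return -1;
	return 0;
}

/* Open a TCP listener bound only to 'ip' (not INADDR_ANY).
 * Returns the listening fd, or -1 on error. */
int listen_on_addr(const char *ip, unsigned short port)
{
	struct sockaddr_in sa;
	int fd;

	if (make_listen_addr(ip, port, &sa) < 0)
		return -1;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
	    listen(fd, 16) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

For example, listen_on_addr("192.0.2.10", 2049) would serve NFS over
native TCP on that address only (192.0.2.10 is a documentation address,
used here as a placeholder).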

There are definitely issues with this that could be resolved via tighter
integration, but that seems to not be a goal of the linux community at
this time...


> The only saving grace is that they're only doing RDMA rather than
> arbitrary TCP.  However, exactly the same infrastructure can be used
> to do arbitrary TCP should they wish to.
>  
> > But we have to realize they've already been given 95% of the
> > interfaces they need to speak IP using our routes and our neighbour
> > entries.
> > 
> > Right?
> 
> Yes, however I think the same argument could be applied to TOE.
> 
> With their RDMA NIC, we'll have TCP/SCTP connections that bypass
> netfilter, tc, IPsec, AF_PACKET/tcpdump and the rest of our stack
> while at the same time it is using the same IP address as us and
> deciding what packets we will or won't see.
> 


Doesn't iSCSI have this same issue?

Steve.






* Re: TOE, etc.
  2006-06-28 14:54                 ` Steve Wise
@ 2006-06-28 18:36                   ` David Miller
  2006-06-28 18:56                     ` Steve Wise
  0 siblings, 1 reply; 19+ messages in thread
From: David Miller @ 2006-06-28 18:36 UTC (permalink / raw)
  To: swise; +Cc: herbert, jgarzik, netdev

From: Steve Wise <swise@opengridcomputing.com>
Date: Wed, 28 Jun 2006 09:54:57 -0500

> Doesn't iSCSI have this same issue?

Software iSCSI implementations don't have the issue because
they go through the stack using normal sockets and normal
device send and receive.


* Re: TOE, etc.
  2006-06-28 18:36                   ` David Miller
@ 2006-06-28 18:56                     ` Steve Wise
  0 siblings, 0 replies; 19+ messages in thread
From: Steve Wise @ 2006-06-28 18:56 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, jgarzik, netdev

On Wed, 2006-06-28 at 11:36 -0700, David Miller wrote:
> From: Steve Wise <swise@opengridcomputing.com>
> Date: Wed, 28 Jun 2006 09:54:57 -0500
> 
> > Doesn't iSCSI have this same issue?
> 
> Software iSCSI implementations don't have the issue because
> they go through the stack using normal sockets and normal
> device send and receive.

Right.  I was assuming that in this thread we were talking about iSCSI
devices where the TCP stack is in HW/FW on the adapter...

Steve.



