* [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism
@ 2006-06-27 20:50 Steve Wise
2006-06-27 20:51 ` [PATCH Round 3 1/2] " Steve Wise
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Steve Wise @ 2006-06-27 20:50 UTC
To: davem; +Cc: netdev
Round 3 Changes:
- changed netlink msg for neighbour change to (RTM_NEIGHUPD)
- added netlink msg for PMTU change events (RTM_ROUTEUPD)
- added netlink messages for redirect (RTM_DELROUTE + RTM_NEWROUTE)
- tested neighbour change events via netlink for ipv4 and ipv6.
- tested redirect change events via netlink for ipv4.
Round 2 Changes:
- cleaned up event structures per review feedback.
- began integration with netlink (see neighbour changes in patch 2).
- added IPv6 support.
TODO:
- review feedback changes, if any
- more testing
- retest with RDMA NIC
------
This patch implements a mechanism that allows interested clients to
register for notification of certain network events. The intended use
is to allow RDMA devices (linux/drivers/infiniband) to be notified of
neighbour updates, ICMP redirects, path MTU changes, and route changes.
The reason these devices need update events is because they typically
cache this information in hardware and need to be notified when this
information has been updated. For information on RDMA protocols, see:
http://www.ietf.org/html.charters/rddp-charter.html.
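To make the caching problem concrete, here is a minimal userspace sketch
(not code from this patchset). It assumes a hypothetical adapter that keeps
its own copy of next-hop MAC addresses. struct neigh_model, struct
hw_l2_entry and the hw_cache_*() helpers are invented for illustration and
only stand in for whatever a real RDMA driver would do.

```c
#include <assert.h>
#include <string.h>

#define ETH_ALEN	6
#define HW_CACHE_SLOTS	4

/* Hypothetical stand-in for the parts of struct neighbour a device copies. */
struct neigh_model {
	unsigned int ip;		/* IPv4 address of the neighbour */
	unsigned char ha[ETH_ALEN];	/* its current link-layer address */
};

/* Hypothetical on-adapter table of next-hop MAC addresses. */
struct hw_l2_entry {
	int in_use;
	unsigned int ip;
	unsigned char mac[ETH_ALEN];
};

static struct hw_l2_entry hw_cache[HW_CACHE_SLOTS];

/* Copy a neighbour into the first free slot; return the slot or -1 if full. */
static int hw_cache_install(const struct neigh_model *n)
{
	int i;

	assert(n != NULL);
	for (i = 0; i < HW_CACHE_SLOTS; i++) {
		if (!hw_cache[i].in_use) {
			hw_cache[i].in_use = 1;
			hw_cache[i].ip = n->ip;
			memcpy(hw_cache[i].mac, n->ha, ETH_ALEN);
			return i;
		}
	}
	return -1;
}

/*
 * What a neighbour-update handler would do: rewrite every cached copy of
 * this neighbour. Returns the number of entries refreshed.
 */
static int hw_cache_on_neigh_update(const struct neigh_model *n)
{
	int i, refreshed = 0;

	assert(n != NULL);
	for (i = 0; i < HW_CACHE_SLOTS; i++) {
		if (hw_cache[i].in_use && hw_cache[i].ip == n->ip) {
			memcpy(hw_cache[i].mac, n->ha, ETH_ALEN);
			refreshed++;
		}
	}
	return refreshed;
}
```

Until hw_cache_on_neigh_update() runs, the cached copy keeps the old MAC
address and the adapter would keep sending to it. Closing that window is
what the notifications are for.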
The key events of interest are:
- neighbour MAC address change
- routing redirect (the next-hop neighbour changes for a dst_entry)
- path MTU change (the path MTU for a dst_entry changes)
- route adds/deletes
NOTE: These new netevents are also passed up to user space via netlink.
We would like to get this or similar functionality included in 2.6.19
and request comments.
This patchset consists of 2 patches:
1) New files implementing the Network Event Notifier
2) Core network changes to generate network event notifications
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
* [PATCH Round 3 1/2] Network Event Notifier Mechanism.
2006-06-27 20:50 [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism Steve Wise
@ 2006-06-27 20:51 ` Steve Wise
2006-06-27 20:51 ` [PATCH Round 3 2/2] Core network changes to support network event notification Steve Wise
2006-06-28 2:54 ` [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism Herbert Xu
2 siblings, 0 replies; 19+ messages in thread
From: Steve Wise @ 2006-06-27 20:51 UTC
To: davem; +Cc: netdev

This patch uses notifier blocks to implement a network event
notifier mechanism.

Clients register their callback function by calling
register_netevent_notifier() like this:

static struct notifier_block nb = {
	.notifier_call = my_callback_func
};

...

register_netevent_notifier(&nb);

---

 include/net/netevent.h |   49 +++++++++++++++++++++++++++++++++++
 net/core/netevent.c    |   68 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+), 0 deletions(-)

diff --git a/include/net/netevent.h b/include/net/netevent.h
new file mode 100644
index 0000000..22214c8
--- /dev/null
+++ b/include/net/netevent.h
@@ -0,0 +1,49 @@
+#ifndef _NET_EVENT_H
+#define _NET_EVENT_H
+
+/*
+ *	Generic netevent notifiers
+ *
+ *	Authors:
+ *      Tom Tucker              <tom@opengridcomputing.com>
+ *
+ * 	Changes:
+ */
+
+#ifdef __KERNEL__
+
+#include <net/dst.h>
+
+/*
+ * Generic route info structure.
+ *
+ *	Family		Data ptr type
+ *	--------------------------------
+ *	AF_INET		- struct fib_info *
+ *	AF_INET6	- struct rt6_info *
+ *	AF_DECnet	- struct dn_route *
+ */
+struct netevent_route_info {
+	u16 family;
+	void *data;
+};
+
+struct netevent_redirect {
+	struct dst_entry *old;
+	struct dst_entry *new;
+};
+
+enum netevent_notif_type {
+	NETEVENT_NEIGH_UPDATE = 1, /* arg is struct neighbour ptr */
+	NETEVENT_ROUTE_ADD,	   /* arg is struct netevent_route_info ptr */
+	NETEVENT_ROUTE_DEL,	   /* arg is struct netevent_route_info ptr */
+	NETEVENT_PMTU_UPDATE,	   /* arg is struct dst_entry ptr */
+	NETEVENT_REDIRECT,	   /* arg is struct netevent_redirect ptr */
+};
+
+extern int register_netevent_notifier(struct notifier_block *nb);
+extern int unregister_netevent_notifier(struct notifier_block *nb);
+extern int call_netevent_notifiers(unsigned long val, void *v);
+
+#endif
+#endif
diff --git a/net/core/netevent.c b/net/core/netevent.c
new file mode 100644
index 0000000..e995751
--- /dev/null
+++ b/net/core/netevent.c
@@ -0,0 +1,68 @@
+/*
+ *	Network event notifiers
+ *
+ *	Authors:
+ *      Tom Tucker             <tom@opengridcomputing.com>
+ *
+ *	This program is free software; you can redistribute it and/or
+ *      modify it under the terms of the GNU General Public License
+ *      as published by the Free Software Foundation; either version
+ *      2 of the License, or (at your option) any later version.
+ *
+ *	Fixes:
+ */
+
+#include <linux/rtnetlink.h>
+#include <linux/notifier.h>
+
+static ATOMIC_NOTIFIER_HEAD(netevent_notif_chain);
+
+/**
+ *	register_netevent_notifier - register a netevent notifier block
+ *	@nb: notifier
+ *
+ *	Register a notifier to be called when a netevent occurs.
+ *	The notifier passed is linked into the kernel structures and must
+ *	not be reused until it has been unregistered. A negative errno code
+ *	is returned on a failure.
+ */
+int register_netevent_notifier(struct notifier_block *nb)
+{
+	int err;
+
+	err = atomic_notifier_chain_register(&netevent_notif_chain, nb);
+	return err;
+}
+
+/**
+ *	netevent_unregister_notifier - unregister a netevent notifier block
+ *	@nb: notifier
+ *
+ *	Unregister a notifier previously registered by
+ *	register_neigh_notifier(). The notifier is unlinked into the
+ *	kernel structures and may then be reused. A negative errno code
+ *	is returned on a failure.
+ */
+
+int unregister_netevent_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_unregister(&netevent_notif_chain, nb);
+}
+
+/**
+ *	call_netevent_notifiers - call all netevent notifier blocks
+ *      @val: value passed unmodified to notifier function
+ *      @v:   pointer passed unmodified to notifier function
+ *
+ *	Call all neighbour notifier blocks.  Parameters and return value
+ *	are as for notifier_call_chain().
+ */
+
+int call_netevent_notifiers(unsigned long val, void *v)
+{
+	return atomic_notifier_call_chain(&netevent_notif_chain, val, v);
+}
+
+EXPORT_SYMBOL_GPL(register_netevent_notifier);
+EXPORT_SYMBOL_GPL(unregister_netevent_notifier);
+EXPORT_SYMBOL_GPL(call_netevent_notifiers);
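The registration pattern in this patch can be modelled in userspace roughly
as follows. This is a simplified sketch, not the kernel implementation: it
has no locking or RCU, and the model_*() names and neigh_updates_seen
counter are invented here. Only the shape of struct notifier_block, the
NOTIFY_* values and the priority ordering are meant to mirror
<linux/notifier.h>.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes modelled on the kernel's <linux/notifier.h>. */
#define NOTIFY_DONE		0x0000
#define NOTIFY_OK		0x0001
#define NOTIFY_STOP_MASK	0x8000

/* First value of enum netevent_notif_type in the patch. */
#define NETEVENT_NEIGH_UPDATE	1

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long event, void *data);
	struct notifier_block *next;
	int priority;
};

/* Head of the model chain (userspace only, no locking). */
static struct notifier_block *netevent_chain;

/* Link nb into the chain, highest priority first. */
static int model_register(struct notifier_block *nb)
{
	struct notifier_block **p = &netevent_chain;

	assert(nb != NULL && nb->notifier_call != NULL);
	while (*p && (*p)->priority >= nb->priority)
		p = &(*p)->next;
	nb->next = *p;
	*p = nb;
	return 0;
}

/* Unlink nb; return -1 if it was not on the chain. */
static int model_unregister(struct notifier_block *nb)
{
	struct notifier_block **p;

	for (p = &netevent_chain; *p; p = &(*p)->next) {
		if (*p == nb) {
			*p = nb->next;
			return 0;
		}
	}
	return -1;
}

/* Call every block in order until one sets NOTIFY_STOP_MASK. */
static int model_call(unsigned long event, void *data)
{
	struct notifier_block *nb = netevent_chain;
	int ret = NOTIFY_DONE;

	while (nb) {
		struct notifier_block *next = nb->next;

		ret = nb->notifier_call(nb, event, data);
		if (ret & NOTIFY_STOP_MASK)
			break;
		nb = next;
	}
	return ret;
}

/* Example client callback: count neighbour updates, ignore the rest. */
static int neigh_updates_seen;

static int my_callback_func(struct notifier_block *nb,
			    unsigned long event, void *data)
{
	(void)nb;
	(void)data;
	if (event == NETEVENT_NEIGH_UPDATE)
		neigh_updates_seen++;
	return NOTIFY_DONE;
}
```

A client registers one block, receives every event type on it, and filters
on the event value inside the callback. There is no per-event
subscription.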
* [PATCH Round 3 2/2] Core network changes to support network event notification.
2006-06-27 20:50 [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism Steve Wise
2006-06-27 20:51 ` [PATCH Round 3 1/2] " Steve Wise
@ 2006-06-27 20:51 ` Steve Wise
2006-06-28 2:54 ` [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism Herbert Xu
2 siblings, 0 replies; 19+ messages in thread
From: Steve Wise @ 2006-06-27 20:51 UTC
To: davem; +Cc: netdev

This patch adds netevent and netlink calls for neighbour change, route
add/del, pmtu change, and routing redirect events.

Netlink Details:

Neighbour change events are broadcast as a new ndmsg type RTM_NEIGHUPD.

Path mtu change events are broadcast as a new rtmsg type RTM_ROUTEUPD.

Routing redirect events are broadcast as a pair of rtmsgs, RTM_DELROUTE and
RTM_NEWROUTE.

---

 include/linux/rtnetlink.h |    4 ++
 net/core/Makefile         |    2 +
 net/core/neighbour.c      |   37 ++++++++++++++++---
 net/ipv4/fib_semantics.c  |    9 +++++
 net/ipv4/route.c          |   86 ++++++++++++++++++++++++++++++++++++++++++--
 net/ipv6/route.c          |   87 +++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 213 insertions(+), 12 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index facd9ee..340ca4f 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -35,6 +35,8 @@ #define RTM_NEWROUTE	RTM_NEWROUTE
 #define RTM_DELROUTE	RTM_DELROUTE
 	RTM_GETROUTE,
 #define RTM_GETROUTE	RTM_GETROUTE
+	RTM_ROUTEUPD,
+#define RTM_ROUTEUPD	RTM_ROUTEUPD

 	RTM_NEWNEIGH	= 28,
 #define RTM_NEWNEIGH	RTM_NEWNEIGH
@@ -42,6 +44,8 @@ #define RTM_NEWNEIGH	RTM_NEWNEIGH
 #define RTM_DELNEIGH	RTM_DELNEIGH
 	RTM_GETNEIGH,
 #define RTM_GETNEIGH	RTM_GETNEIGH
+	RTM_NEIGHUPD,
+#define RTM_NEIGHUPD	RTM_NEIGHUPD

 	RTM_NEWRULE	= 32,
 #define RTM_NEWRULE	RTM_NEWRULE
diff --git a/net/core/Makefile b/net/core/Makefile
index e9bd246..2645ba4 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -7,7 +7,7 @@ obj-y := sock.o request_sock.o skbuff.o

 obj-$(CONFIG_SYSCTL) += sysctl_net_core.o

-obj-y		     += dev.o ethtool.o dev_mcast.o dst.o \
+obj-y		     += dev.o ethtool.o dev_mcast.o dst.o netevent.o \
 			neighbour.o rtnetlink.o utils.o link_watch.o filter.o

 obj-$(CONFIG_XFRM) += flow.o
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 50a8c73..bf70981 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -30,9 +30,11 @@ #include <linux/times.h>
 #include <net/neighbour.h>
 #include <net/dst.h>
 #include <net/sock.h>
+#include <net/netevent.h>
 #include <linux/rtnetlink.h>
 #include <linux/random.h>
 #include <linux/string.h>
+#include <linux/notifier.h>

 #define NEIGH_DEBUG 1

@@ -59,6 +61,7 @@ static void neigh_app_notify(struct neig
 #endif
 static int pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev);
 void neigh_changeaddr(struct neigh_table *tbl, struct net_device *dev);
+static void rtm_neigh_change(struct neighbour *n);

 static struct neigh_table *neigh_tables;
 #ifdef CONFIG_PROC_FS
@@ -755,6 +758,7 @@ #endif
 				neigh->nud_state = NUD_STALE;
 				neigh->updated = jiffies;
 				neigh_suspect(neigh);
+				notify = 1;
 			}
 		} else if (state & NUD_DELAY) {
 			if (time_before_eq(now,
@@ -763,6 +767,7 @@ #endif
 			neigh->nud_state = NUD_REACHABLE;
 			neigh->updated = jiffies;
 			neigh_connect(neigh);
+			notify = 1;
 			next = neigh->confirmed + neigh->parms->reachable_time;
 		} else {
 			NEIGH_PRINTK2("neigh %p is probed.\n", neigh);
@@ -820,6 +825,8 @@ #endif
 out:
 		write_unlock(&neigh->lock);
 	}
+	if (notify)
+		rtm_neigh_change(neigh);

 #ifdef CONFIG_ARPD
 	if (notify && neigh->parms->app_probes)
@@ -927,9 +934,7 @@ int neigh_update(struct neighbour *neigh
 {
 	u8 old;
 	int err;
-#ifdef CONFIG_ARPD
 	int notify = 0;
-#endif
 	struct net_device *dev;
 	int update_isrouter = 0;

@@ -949,9 +954,7 @@ #endif
 			neigh_suspect(neigh);
 		neigh->nud_state = new;
 		err = 0;
-#ifdef CONFIG_ARPD
 		notify = old & NUD_VALID;
-#endif
 		goto out;
 	}

@@ -1023,9 +1026,7 @@ #endif
 		if (!(new & NUD_CONNECTED))
 			neigh->confirmed = jiffies -
 				      (neigh->parms->base_reachable_time << 1);
-#ifdef CONFIG_ARPD
 		notify = 1;
-#endif
 	}
 	if (new == old)
 		goto out;
@@ -1056,7 +1057,11 @@ out:
 			(neigh->flags | NTF_ROUTER) :
 			(neigh->flags & ~NTF_ROUTER);
 	}
+
 	write_unlock_bh(&neigh->lock);
+
+	if (notify)
+		rtm_neigh_change(neigh);
 #ifdef CONFIG_ARPD
 	if (notify && neigh->parms->app_probes)
 		neigh_app_notify(neigh);
@@ -2370,9 +2375,27 @@ static void neigh_app_notify(struct neig
 	NETLINK_CB(skb).dst_group  = RTNLGRP_NEIGH;
 	netlink_broadcast(rtnl, skb, 0, RTNLGRP_NEIGH, GFP_ATOMIC);
 }
-
 #endif /* CONFIG_ARPD */

+static void rtm_neigh_change(struct neighbour *n)
+{
+	struct nlmsghdr *nlh;
+	int size = NLMSG_SPACE(sizeof(struct ndmsg) + 256);
+	struct sk_buff *skb = alloc_skb(size, GFP_ATOMIC);
+
+	call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, n);
+	if (!skb)
+		return;
+
+	if (neigh_fill_info(skb, n, 0, 0, RTM_NEIGHUPD, 0) < 0) {
+		kfree_skb(skb);
+		return;
+	}
+	nlh = (struct nlmsghdr *)skb->data;
+	NETLINK_CB(skb).dst_group = RTNLGRP_NEIGH;
+	netlink_broadcast(rtnl, skb, 0, RTNLGRP_NEIGH, GFP_ATOMIC);
+}
+
 #ifdef CONFIG_SYSCTL

 static struct neigh_sysctl_table {
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 0f4145b..197c365 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -45,6 +45,7 @@ #include <net/tcp.h>
 #include <net/sock.h>
 #include <net/ip_fib.h>
 #include <net/ip_mp_alg.h>
+#include <net/netevent.h>

 #include "fib_lookup.h"

@@ -280,6 +281,14 @@ void rtmsg_fib(int event, u32 key, struc
 	struct sk_buff *skb;
 	u32 pid = req ? req->pid : n->nlmsg_pid;
 	int size = NLMSG_SPACE(sizeof(struct rtmsg)+256);
+	struct netevent_route_info nri;
+	int netevent;
+
+	nri.family = AF_INET;
+	nri.data = &fa->fa_info;
+	netevent = event == RTM_NEWROUTE ? NETEVENT_ROUTE_ADD
+					 : NETEVENT_ROUTE_DEL;
+	call_netevent_notifiers(netevent, &nri);

 	skb = alloc_skb(size, GFP_KERNEL);
 	if (!skb)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 60b11ae..cef7c6d 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -105,6 +105,7 @@ #include <net/tcp.h>
 #include <net/icmp.h>
 #include <net/xfrm.h>
 #include <net/ip_mp_alg.h>
+#include <net/netevent.h>
 #ifdef CONFIG_SYSCTL
 #include <linux/sysctl.h>
 #endif
@@ -152,6 +153,8 @@ static struct dst_entry *ipv4_negative_a
 static void		 ipv4_link_failure(struct sk_buff *skb);
 static void		 ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
 static int rt_garbage_collect(void);
+static int rt_fill_info(struct sk_buff *skb, u32 pid, u32 seq, int event,
+			int nowait, unsigned int flags, unsigned int prot);


 static struct dst_ops ipv4_dst_ops = {
@@ -1112,6 +1115,52 @@ static void rt_del(unsigned hash, struct
 	spin_unlock_bh(rt_hash_lock_addr(hash));
 }

+static void rtm_redirect(struct rtable *old, struct rtable *new)
+{
+	struct netevent_redirect netevent;
+	struct sk_buff *skb;
+	int err;
+
+	netevent.old = &old->u.dst;
+	netevent.new = &new->u.dst;
+
+	/* notify netevent subscribers */
+	call_netevent_notifiers(NETEVENT_REDIRECT, &netevent);
+
+	/* Post NETLINK messages:  RTM_DELROUTE for old route,
+				   RTM_NEWROUTE for new route */
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
+	if (!skb)
+		return;
+	skb->mac.raw = skb->nh.raw = skb->data;
+	skb->dst = &old->u.dst;
+	NETLINK_CB(skb).dst_pid = 0;
+
+	err = rt_fill_info(skb, 0, 0, RTM_DELROUTE, 1, 0, RTPROT_UNSPEC);
+	if (err <= 0)
+		goto out_free;
+
+	netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV4_ROUTE, GFP_ATOMIC);
+
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
+	if (!skb)
+		return;
+	skb->mac.raw = skb->nh.raw = skb->data;
+	skb->dst = &new->u.dst;
+	NETLINK_CB(skb).dst_pid = 0;
+
+	err = rt_fill_info(skb, 0, 0, RTM_NEWROUTE, 1, 0, RTPROT_REDIRECT);
+	if (err <= 0)
+		goto out_free;
+
+	netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV4_ROUTE, GFP_ATOMIC);
+	return;
+
+out_free:
+	kfree_skb(skb);
+	return;
+}
+
 void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw,
 		    u32 saddr, struct net_device *dev)
 {
@@ -1211,6 +1260,8 @@ void ip_rt_redirect(u32 old_gw, u32 dadd
 					rt_drop(rt);
 					goto do_next;
 				}
+
+				rtm_redirect(rth, rt);

 				rt_del(hash, rth);
 				if (!rt_intern_hash(hash, rt, &rt))
@@ -1437,6 +1488,32 @@ unsigned short ip_rt_frag_needed(struct
 	return est_mtu ? : new_mtu;
 }

+static void rtm_pmtu_update(struct rtable *rt)
+{
+	struct sk_buff *skb;
+	int err;
+
+	call_netevent_notifiers(NETEVENT_PMTU_UPDATE, &rt->u.dst);
+
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
+	if (!skb)
+		return;
+	skb->mac.raw = skb->nh.raw = skb->data;
+	skb->dst = &rt->u.dst;
+	NETLINK_CB(skb).dst_pid = 0;
+
+	err = rt_fill_info(skb, 0, 0, RTM_ROUTEUPD, 1, 0, RTPROT_UNSPEC);
+	if (err <= 0)
+		goto out_free;
+
+	netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV4_ROUTE, GFP_ATOMIC);
+	return;
+
+out_free:
+	kfree_skb(skb);
+	return;
+}
+
 static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu)
 {
 	if (dst->metrics[RTAX_MTU-1] > mtu && mtu >= 68 &&
@@ -1447,6 +1524,7 @@ static void ip_rt_update_pmtu(struct dst
 		}
 		dst->metrics[RTAX_MTU-1] = mtu;
 		dst_set_expires(dst, ip_rt_mtu_expires);
+		rtm_pmtu_update((struct rtable *)dst);
 	}
 }

@@ -2622,7 +2700,7 @@ int ip_route_output_key(struct rtable **
 }

 static int rt_fill_info(struct sk_buff *skb, u32 pid, u32 seq, int event,
-			int nowait, unsigned int flags)
+			int nowait, unsigned int flags, unsigned int prot)
 {
 	struct rtable *rt = (struct rtable*)skb->dst;
 	struct rtmsg *r;
@@ -2641,7 +2719,7 @@ #endif
 	r->rtm_table	= RT_TABLE_MAIN;
 	r->rtm_type	= rt->rt_type;
 	r->rtm_scope	= RT_SCOPE_UNIVERSE;
-	r->rtm_protocol = RTPROT_UNSPEC;
+	r->rtm_protocol = prot;
 	r->rtm_flags	= (rt->rt_flags & ~0xFFFF) | RTM_F_CLONED;
 	if (rt->rt_flags & RTCF_NOTIFY)
 		r->rtm_flags |= RTM_F_NOTIFY;
@@ -2787,7 +2865,7 @@ int inet_rtm_getroute(struct sk_buff *in
 	NETLINK_CB(skb).dst_pid = NETLINK_CB(in_skb).pid;

 	err = rt_fill_info(skb, NETLINK_CB(in_skb).pid, nlh->nlmsg_seq,
-				RTM_NEWROUTE, 0, 0);
+				RTM_NEWROUTE, 0, 0, RTPROT_UNSPEC);
 	if (!err)
 		goto out_free;
 	if (err < 0) {
@@ -2825,7 +2903,7 @@ int ip_rt_dump(struct sk_buff *skb,  str
 			skb->dst = dst_clone(&rt->u.dst);
 			if (rt_fill_info(skb, NETLINK_CB(cb->skb).pid,
 					 cb->nlh->nlmsg_seq, RTM_NEWROUTE,
-					 1, NLM_F_MULTI) <= 0) {
+					 1, NLM_F_MULTI, RTPROT_UNSPEC) <= 0) {
 				dst_release(xchg(&skb->dst, NULL));
 				rcu_read_unlock_bh();
 				goto done;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 8a77793..95f68bc 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -54,6 +54,7 @@ #include <net/tcp.h>
 #include <linux/rtnetlink.h>
 #include <net/dst.h>
 #include <net/xfrm.h>
+#include <net/netevent.h>

 #include <asm/uaccess.h>

@@ -97,6 +98,10 @@ static int		ip6_pkt_discard(struct sk_bu
 static int		ip6_pkt_discard_out(struct sk_buff *skb);
 static void		ip6_link_failure(struct sk_buff *skb);
 static void		ip6_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
+static int rt6_fill_node(struct sk_buff *skb, struct rt6_info *rt,
+			 struct in6_addr *dst, struct in6_addr *src,
+			 int iif, int type, u32 pid, u32 seq,
+			 int prefix, unsigned int flags);

 #ifdef CONFIG_IPV6_ROUTE_INFO
 static struct rt6_info *rt6_add_route_info(struct in6_addr *prefix, int prefixlen,
@@ -732,6 +737,32 @@ static void ip6_link_failure(struct sk_b
 	}
 }

+static void rtm_pmtu_update(struct rt6_info *rt)
+{
+	struct sk_buff *skb;
+	int err;
+
+	call_netevent_notifiers(NETEVENT_PMTU_UPDATE, &rt->u.dst);
+
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
+	if (!skb)
+		return;
+	skb->mac.raw = skb->nh.raw = skb->data;
+	skb->dst = &rt->u.dst;
+	NETLINK_CB(skb).dst_pid = 0;
+
+	err = rt6_fill_node(skb, rt, NULL, NULL, 0, RTM_ROUTEUPD, 0, 0, 0, 0);
+	if (err <= 0)
+		goto out_free;
+
+	netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV6_ROUTE, GFP_ATOMIC);
+	return;
+
+out_free:
+	kfree_skb(skb);
+	return;
+}
+
 static void ip6_rt_update_pmtu(struct dst_entry *dst, u32 mtu)
 {
 	struct rt6_info *rt6 = (struct rt6_info*)dst;
@@ -743,6 +774,7 @@ static void ip6_rt_update_pmtu(struct ds
 			dst->metrics[RTAX_FEATURES-1] |= RTAX_FEATURE_ALLFRAG;
 		}
 		dst->metrics[RTAX_MTU-1] = mtu;
+		rtm_pmtu_update(rt6);
 	}
 }

@@ -908,6 +940,7 @@ int ip6_route_add(struct in6_rtmsg *rtms
 	struct net_device *dev = NULL;
 	struct inet6_dev *idev = NULL;
 	int addr_type;
+	struct netevent_route_info nri;

 	rta = (struct rtattr **) _rtattr;

@@ -1086,6 +1119,9 @@ install_route:
 		rt->u.dst.metrics[RTAX_ADVMSS-1] = ipv6_advmss(dst_mtu(&rt->u.dst));
 	rt->u.dst.dev = dev;
 	rt->rt6i_idev = idev;
+	nri.family = AF_INET6;
+	nri.data = rt;
+	call_netevent_notifiers(NETEVENT_ROUTE_ADD, &nri);
 	return ip6_ins_rt(rt, nlh, _rtattr, req);

 out:
@@ -1117,6 +1153,7 @@ static int ip6_route_del(struct in6_rtms
 	struct fib6_node *fn;
 	struct rt6_info *rt;
 	int err = -ESRCH;
+	struct netevent_route_info nri;

 	read_lock_bh(&rt6_lock);

@@ -1138,6 +1175,10 @@ static int ip6_route_del(struct in6_rtms
 				continue;
 			dst_hold(&rt->u.dst);
 			read_unlock_bh(&rt6_lock);
+
+			nri.family = AF_INET6;
+			nri.data = rt;
+			call_netevent_notifiers(NETEVENT_ROUTE_DEL, &nri);

 			return ip6_del_rt(rt, nlh, _rtattr, req);
 		}
@@ -1147,6 +1188,50 @@ static int ip6_route_del(struct in6_rtms
 	return err;
 }

+static void rtm_redirect(struct rt6_info *old, struct rt6_info *new)
+{
+	struct netevent_redirect netevent;
+	struct sk_buff *skb;
+	int err;
+
+	netevent.old = &old->u.dst;
+	netevent.new = &new->u.dst;
+	call_netevent_notifiers(NETEVENT_REDIRECT, &netevent);
+
+	/* Post NETLINK messages:  RTM_DELROUTE for old route,
+				   RTM_NEWROUTE for new route */
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
+	if (!skb)
+		return;
+	skb->mac.raw = skb->nh.raw = skb->data;
+	NETLINK_CB(skb).dst_pid = 0;
+	NETLINK_CB(skb).dst_group = RTNLGRP_IPV6_ROUTE;
+
+	err = rt6_fill_node(skb, old, NULL, NULL, 0, RTM_DELROUTE, 0, 0, 0, 0);
+	if (err <= 0)
+		goto out_free;
+
+	netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV6_ROUTE, GFP_ATOMIC);
+
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
+	if (!skb)
+		return;
+	skb->mac.raw = skb->nh.raw = skb->data;
+	NETLINK_CB(skb).dst_pid = 0;
+	NETLINK_CB(skb).dst_group = RTNLGRP_IPV6_ROUTE;
+
+	err = rt6_fill_node(skb, new, NULL, NULL, 0, RTM_NEWROUTE, 0, 0, 0, 0);
+	if (err <= 0)
+		goto out_free;
+
+	netlink_broadcast(rtnl, skb, 0, RTNLGRP_IPV6_ROUTE, GFP_ATOMIC);
+	return;
+
+out_free:
+	kfree_skb(skb);
+	return;
+}
+
 /*
  *	Handle redirects
  */
@@ -1253,6 +1338,8 @@ restart:
 	if (ip6_ins_rt(nrt, NULL, NULL, NULL))
 		goto out;

+	rtm_redirect(rt, nrt);
+
 	if (rt->rt6i_flags&RTF_CACHE) {
 		ip6_del_rt(rt, NULL, NULL, NULL);
 		return;
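For readers wondering where the two new message types land numerically,
here is a small sketch. The values 27 and 31 are not spelled out in the
patch. They are inferred from where the enumerators are inserted (right
after RTM_GETROUTE = 26 and RTM_GETNEIGH = 30). RTM_BASE = 16 and the
"blocks of four" layout follow <linux/rtnetlink.h>. rtm_family() and
rtm_slot() are helper names made up for this example.

```c
#include <assert.h>

#define RTM_BASE	16	/* first rtnetlink message type */

#define RTM_NEWROUTE	24
#define RTM_DELROUTE	25
#define RTM_GETROUTE	26
#define RTM_ROUTEUPD	27	/* inferred: enumerator after RTM_GETROUTE */

#define RTM_NEWNEIGH	28
#define RTM_DELNEIGH	29
#define RTM_GETNEIGH	30
#define RTM_NEIGHUPD	31	/* inferred: enumerator after RTM_GETNEIGH */

/* Index of the block of four a type belongs to, as RTM_FAM() computes it. */
static int rtm_family(int type)
{
	assert(type >= RTM_BASE);
	return (type - RTM_BASE) >> 2;
}

/* Position inside the block: 0 = NEW, 1 = DEL, 2 = GET, 3 = fourth slot. */
static int rtm_slot(int type)
{
	return type & 3;
}
```

Under these assumptions both new types take the previously unused fourth
slot of their block, so existing message numbers do not shift.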
* Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism
2006-06-27 20:50 [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism Steve Wise
2006-06-27 20:51 ` [PATCH Round 3 1/2] " Steve Wise
2006-06-27 20:51 ` [PATCH Round 3 2/2] Core network changes to support network event notification Steve Wise
@ 2006-06-28 2:54 ` Herbert Xu
2006-06-28 3:04 ` Herbert Xu
2 siblings, 1 reply; 19+ messages in thread
From: Herbert Xu @ 2006-06-28 2:54 UTC
To: Steve Wise; +Cc: davem, netdev

Steve Wise <swise@opengridcomputing.com> wrote:
>
> The reason these devices need update events is because they typically
> cache this information in hardware and need to be notified when this
> information has been updated. For information on RDMA protocols, see:
> http://www.ietf.org/html.charters/rddp-charter.html.

Please give more specific reasons for needing these events because it
is certainly far from obvious from reading those documents.

Without reasons these invasive changes may turn out to be completely
inappropriate.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism
2006-06-28 2:54 ` [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism Herbert Xu
@ 2006-06-28 3:04 ` Herbert Xu
2006-06-28 3:24 ` Jeff Garzik
0 siblings, 1 reply; 19+ messages in thread
From: Herbert Xu @ 2006-06-28 3:04 UTC
To: Steve Wise; +Cc: davem, netdev, Jeff Garzik

On Wed, Jun 28, 2006 at 12:54:10PM +1000, Herbert Xu wrote:
>
> Please give more specific reasons for needing these events because it
> is certainly far from obvious from reading those documents.

Never mind, I've found your earlier messages on the list which explain
your reasons more clearly.  It would be nice if you could include those
explanations in your patch description.

BTW, does this mean that we're now comfortable with full TOE?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism
2006-06-28 3:04 ` Herbert Xu
@ 2006-06-28 3:24 ` Jeff Garzik
2006-06-28 3:37 ` Herbert Xu
0 siblings, 1 reply; 19+ messages in thread
From: Jeff Garzik @ 2006-06-28 3:24 UTC
To: Herbert Xu; +Cc: Steve Wise, davem, netdev

Herbert Xu wrote:
> On Wed, Jun 28, 2006 at 12:54:10PM +1000, Herbert Xu wrote:
>> Please give more specific reasons for needing these events because it
>> is certainly far from obvious from reading those documents.
>
> Never mind, I've found your earlier messages on the list which explain
> your reasons more clearly.  It would be nice if you could include those
> explanations in your patch description.
>
> BTW, does this mean that we're now comfortable with full TOE?

I don't see how that position has changed?

http://linux-net.osdl.org/index.php/TOE

	Jeff
* Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism
2006-06-28 3:24 ` Jeff Garzik
@ 2006-06-28 3:37 ` Herbert Xu
2006-06-28 4:18 ` TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism) Jeff Garzik
0 siblings, 1 reply; 19+ messages in thread
From: Herbert Xu @ 2006-06-28 3:37 UTC
To: Jeff Garzik; +Cc: Steve Wise, davem, netdev

On Tue, Jun 27, 2006 at 11:24:25PM -0400, Jeff Garzik wrote:
>
> I don't see how that position has changed?
>
> http://linux-net.osdl.org/index.php/TOE

Well I must say that RDMA over TCP smells very much like TOE.  They've
got an ARP table, a routing table, and presumably a TCP stack.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism)
2006-06-28 3:37 ` Herbert Xu
@ 2006-06-28 4:18 ` Jeff Garzik
2006-06-28 4:29 ` Herbert Xu
2006-06-28 14:18 ` Steve Wise
0 siblings, 2 replies; 19+ messages in thread
From: Jeff Garzik @ 2006-06-28 4:18 UTC
To: Herbert Xu, davem; +Cc: Steve Wise, netdev

Herbert Xu wrote:
> On Tue, Jun 27, 2006 at 11:24:25PM -0400, Jeff Garzik wrote:
>> I don't see how that position has changed?
>>
>> http://linux-net.osdl.org/index.php/TOE
>
> Well I must say that RDMA over TCP smells very much like TOE.  They've
> got an ARP table, a routing table, and presumably a TCP stack.

A PCI device that presents itself as a SCSI controller, but under the
hood is really iSCSI-over-TCP smells like TOE.  Running a virtualized
Linux guest on top of a proprietary stack [which provides networking
services to guests] also smells like TOE.  :)

If a TOE vendor wants to do TOE in a way that is transparent to the
kernel, more power to them.  Such non-Linux TCP stack solutions still
suffer many of the problems listed at the web page above, but at least
they impose no burden on kernel maintenance.

i.e. we really _do not_ want to get into the habit of co-managing arp
tables, routing tables, filtering rules, and dozens of other such
resources with multiple remote, independent TCP stacks.  We have enough
complexity as it is today, coordinating between the random variations of
SMP, uniprocessor, and NUMA machines out there.  Not to mention
competing with under-the-hood firmware actions (ASF) on NICs.

As an aside, RDMA over TCP just seems silly.  TCP was _not_ meant to do
the things that RDMA users want.  The infiniband/RDMA programming model
is an ultra-low-latency polling model where one or two apps are allowed
to completely consume the machine, either busy-waiting or processing
messages.

Unfortunately I don't have more details, so you just get a generalized
rant :)

	Jeff
* Re: TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism)
2006-06-28 4:18 ` TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism) Jeff Garzik
@ 2006-06-28 4:29 ` Herbert Xu
2006-06-28 4:40 ` Jeff Garzik
` (2 more replies)
2006-06-28 14:18 ` Steve Wise
1 sibling, 3 replies; 19+ messages in thread
From: Herbert Xu @ 2006-06-28 4:29 UTC
To: Jeff Garzik; +Cc: davem, Steve Wise, netdev

On Wed, Jun 28, 2006 at 12:18:25AM -0400, Jeff Garzik wrote:
>
> A PCI device that presents itself as a SCSI controller, but under the
> hood is really iSCSI-over-TCP smells like TOE.  Running a virtualized
> Linux guest on top of a proprietary stack [which provides networking
> services to guests] also smells like TOE.  :)

Agreed.  However, when they start adding hooks to the ARP table, the
routing table, and PMTU management, it begs the question what more is
there to add for TOE (well, user-space driven TOE at least)?

> Unfortunately I don't have more details, so you just get a generalized
> rant :)

OK, the patch under discussion here adds hooks to all the stuff in the
previous paragraph for the purpose of RDMA over TCP (well I must say
that the exact RDMA application/hardware has never been clearly given
but this is what I can gather from the previous posts).

Put it another way, I think the dividing line between TOE and iSCSI or
virtualisation is exactly the interface between them and the Linux kernel.
If the interface is an existing one such as SCSI or standard IP then it's
OK.  However, when it starts poking in the guts of the Linux stack I'd say
that it has crossed the line.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism)
2006-06-28 4:29 ` Herbert Xu
@ 2006-06-28 4:40 ` Jeff Garzik
2006-06-28 4:43 ` TOE, etc David Miller
2006-06-28 14:31 ` TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism) Steve Wise
2 siblings, 0 replies; 19+ messages in thread
From: Jeff Garzik @ 2006-06-28 4:40 UTC
To: Herbert Xu; +Cc: davem, Steve Wise, netdev

Herbert Xu wrote:
> On Wed, Jun 28, 2006 at 12:18:25AM -0400, Jeff Garzik wrote:
>> A PCI device that presents itself as a SCSI controller, but under the
>> hood is really iSCSI-over-TCP smells like TOE.  Running a virtualized
>> Linux guest on top of a proprietary stack [which provides networking
>> services to guests] also smells like TOE.  :)
>
> Agreed.  However, when they start adding hooks to the ARP table, the
> routing table, and PMTU management, it begs the question what more is
> there to add for TOE (well, user-space driven TOE at least)?

Well, you've always been able to implement a userspace (or otherwise
completely-virtualized) network stack.  tuntap and the packet socket
enable that, if nothing else.  But, like you characterize below, those
are existing, well-defined, easily contained interfaces.

> Put it another way, I think the dividing line between TOE and iSCSI or
> virtualisation is exactly the interface between them and the Linux kernel.
> If the interface is an existing one such as SCSI or standard IP then it's
> OK.  However, when it starts poking in the guts of the Linux stack I'd say
> that it has crossed the line.

Strongly agreed.

	Jeff
* Re: TOE, etc.
2006-06-28 4:29 ` Herbert Xu
2006-06-28 4:40 ` Jeff Garzik
@ 2006-06-28 4:43 ` TOE, etc David Miller
2006-06-28 5:35 ` Herbert Xu
2006-06-28 14:31 ` TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism) Steve Wise
2 siblings, 1 reply; 19+ messages in thread
From: David Miller @ 2006-06-28 4:43 UTC
To: herbert; +Cc: jgarzik, swise, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 28 Jun 2006 14:29:59 +1000

> On Wed, Jun 28, 2006 at 12:18:25AM -0400, Jeff Garzik wrote:
> >
> > A PCI device that presents itself as a SCSI controller, but under the
> > hood is really iSCSI-over-TCP smells like TOE.  Running a virtualized
> > Linux guest on top of a proprietary stack [which provides networking
> > services to guests] also smells like TOE.  :)
>
> Agreed.  However, when they start adding hooks to the ARP table, the
> routing table, and PMTU management, it begs the question what more is
> there to add for TOE (well, user-space driven TOE at least)?

Socket state, and that is one thing I don't see them doing yet.

> Put it another way, I think the dividing line between TOE and iSCSI or
> virtualisation is exactly the interface between them and the Linux kernel.
> If the interface is an existing one such as SCSI or standard IP then it's
> OK.  However, when it starts poking in the guts of the Linux stack I'd say
> that it has crossed the line.

Yeah, it's starting to smell really bad.

But we have to realize they've already been given 95% of the
interfaces they need to speak IP using our routes and our neighbour
entries.

Right?
* Re: TOE, etc.
2006-06-28 4:43 ` TOE, etc David Miller
@ 2006-06-28 5:35 ` Herbert Xu
2006-06-28 6:31 ` David Miller
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Herbert Xu @ 2006-06-28 5:35 UTC (permalink / raw)
To: David Miller; +Cc: jgarzik, swise, netdev

On Tue, Jun 27, 2006 at 09:43:23PM -0700, David Miller wrote:
>
> Socket state, and that is one thing I don't see them doing yet.

I wonder what happens when the Linux TCP stack attempts to open a
connection to a remote host when that connection is already open
in the RDMA NIC? For that matter what happens if a Linux application
decides to listen on a TCP port already listened on by the RDMA
NIC?

The only saving grace is that they're only doing RDMA rather than
arbitrary TCP. However, exactly the same infrastructure can be used
to do arbitrary TCP should they wish to.

> But we have to realize they've already been given %95 of the
> interfaces they need to speak IP using our routes and our neighbour
> entries.
>
> Right?

Yes, however I think the same argument could be applied to TOE.

With their RDMA NIC, we'll have TCP/SCTP connections that bypass
netfilter, tc, IPsec, AF_PACKET/tcpdump and the rest of our stack
while at the same time it is using the same IP address as us and
deciding what packets we will or won't see.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: TOE, etc.
2006-06-28 5:35 ` Herbert Xu
@ 2006-06-28 6:31 ` David Miller
2006-06-28 14:41 ` Steve Wise
2006-06-28 14:54 ` Steve Wise
2 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2006-06-28 6:31 UTC (permalink / raw)
To: herbert; +Cc: jgarzik, swise, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 28 Jun 2006 15:35:54 +1000

> With their RDMA NIC, we'll have TCP/SCTP connections that bypass
> netfilter, tc, IPsec, AF_PACKET/tcpdump and the rest of our stack
> while at the same time it is using the same IP address as us and
> deciding what packets we will or won't see.

That's true. I don't think we should really add any more help for
these kinds of things then.
* Re: TOE, etc.
2006-06-28 5:35 ` Herbert Xu
2006-06-28 6:31 ` David Miller
@ 2006-06-28 14:41 ` Steve Wise
2006-06-28 14:54 ` Steve Wise
2 siblings, 0 replies; 19+ messages in thread
From: Steve Wise @ 2006-06-28 14:41 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, jgarzik, netdev

On Wed, 2006-06-28 at 15:35 +1000, Herbert Xu wrote:
> On Tue, Jun 27, 2006 at 09:43:23PM -0700, David Miller wrote:
> >
> > Socket state, and that is one thing I don't see them doing yet.
>
> I wonder what happens when the Linux TCP stack attempts to open a
> connection to a remote host when that connection is already open
> in the RDMA NIC? For that matter what happens if a Linux application
> decides to listen on a TCP port already listened on by the RDMA
> NIC?
>
> The only saving grace is that they're only doing RDMA rather than
> arbitrary TCP. However, exactly the same infrastructure can be used
> to do arbitrary TCP should they wish to.
>
> > But we have to realize they've already been given %95 of the
> > interfaces they need to speak IP using our routes and our neighbour
> > entries.
> >
> > Right?
>
> Yes, however I think the same argument could be applied to TOE.
>
> With their RDMA NIC, we'll have TCP/SCTP connections that bypass
> netfilter, tc, IPsec, AF_PACKET/tcpdump and the rest of our stack
> while at the same time it is using the same IP address as us and
> deciding what packets we will or won't see.
>

Doesn't iSCSI have the same issue? No netfilter, IPsec, tcpdump, etc...
* Re: TOE, etc.
2006-06-28 5:35 ` Herbert Xu
2006-06-28 6:31 ` David Miller
2006-06-28 14:41 ` Steve Wise
@ 2006-06-28 14:54 ` Steve Wise
2006-06-28 18:36 ` David Miller
2 siblings, 1 reply; 19+ messages in thread
From: Steve Wise @ 2006-06-28 14:54 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, jgarzik, netdev

On Wed, 2006-06-28 at 15:35 +1000, Herbert Xu wrote:
> On Tue, Jun 27, 2006 at 09:43:23PM -0700, David Miller wrote:
> >
> > Socket state, and that is one thing I don't see them doing yet.
>
> I wonder what happens when the Linux TCP stack attempts to open a
> connection to a remote host when that connection is already open
> in the RDMA NIC? For that matter what happens if a Linux application
> decides to listen on a TCP port already listened on by the RDMA
> NIC?
>

This issue would have to be handled by using separate IP addresses for
RDMA connections vs native stack TCP. Consider an NFS-RDMA server.
Through administration, it would be configured to listen on the
specific RDMA IP addresses and on the native stack TCP IP addresses,
and thus support both TCP and RDMA NFS connections.

There are definitely issues with this that could be resolved via
tighter integration, but that seems to not be a goal of the Linux
community at this time...

> The only saving grace is that they're only doing RDMA rather than
> arbitrary TCP. However, exactly the same infrastructure can be used
> to do arbitrary TCP should they wish to.
>
> > But we have to realize they've already been given %95 of the
> > interfaces they need to speak IP using our routes and our neighbour
> > entries.
> >
> > Right?
>
> Yes, however I think the same argument could be applied to TOE.
>
> With their RDMA NIC, we'll have TCP/SCTP connections that bypass
> netfilter, tc, IPsec, AF_PACKET/tcpdump and the rest of our stack
> while at the same time it is using the same IP address as us and
> deciding what packets we will or won't see.
>

Doesn't iSCSI have this same issue?

Steve.
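Editorial illustration of the "separate IP addresses" arrangement described in the message above, native-stack side only. This is a minimal sketch, assuming an ordinary userspace server that binds its TCP listener to one specific local IPv4 address rather than the wildcard address, so that it does not claim ports on an address reserved for the RDMA device. The helper name `listen_on_addr` is hypothetical, and how the RDMA-side listener is configured is not shown in this thread.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Open a TCP listening socket bound to one specific local IPv4 address
 * instead of INADDR_ANY. Returns the socket fd, or -1 on error.
 * listen_on_addr() is a hypothetical helper, not an existing API.
 */
int listen_on_addr(const char *ip, uint16_t port)
{
    struct sockaddr_in sa;
    int fd;
    int one = 1;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);          /* port 0 lets the kernel pick one */
    if (inet_pton(AF_INET, ip, &sa.sin_addr) != 1)
        return -1;                      /* not a valid dotted-quad address */

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Allow quick restarts of the server on the same address and port. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    /* Binding to a specific address leaves other local addresses alone. */
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A server following this pattern would call the helper once per native-stack address it serves. The collision Herbert Xu describes remains possible whenever both stacks share one address, which is the case this arrangement is meant to avoid.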
* Re: TOE, etc.
2006-06-28 14:54 ` Steve Wise
@ 2006-06-28 18:36 ` David Miller
2006-06-28 18:56 ` Steve Wise
0 siblings, 1 reply; 19+ messages in thread
From: David Miller @ 2006-06-28 18:36 UTC (permalink / raw)
To: swise; +Cc: herbert, jgarzik, netdev

From: Steve Wise <swise@opengridcomputing.com>
Date: Wed, 28 Jun 2006 09:54:57 -0500

> Doesn't iSCSI have this same issue?

Software iSCSI implementations don't have the issue because
they go through the stack using normal sockets and normal
device send and receive.
* Re: TOE, etc.
2006-06-28 18:36 ` David Miller
@ 2006-06-28 18:56 ` Steve Wise
0 siblings, 0 replies; 19+ messages in thread
From: Steve Wise @ 2006-06-28 18:56 UTC (permalink / raw)
To: David Miller; +Cc: herbert, jgarzik, netdev

On Wed, 2006-06-28 at 11:36 -0700, David Miller wrote:
> From: Steve Wise <swise@opengridcomputing.com>
> Date: Wed, 28 Jun 2006 09:54:57 -0500
>
> > Doesn't iSCSI have this same issue?
>
> Software iSCSI implementations don't have the issue because
> they go through the stack using normal sockets and normal
> device send and receive.
>

Right. I was assuming that in this thread we were talking about iSCSI
devices where the TCP stack is in HW/FW on the adapter...

Steve.
* Re: TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism)
2006-06-28 4:29 ` Herbert Xu
2006-06-28 4:40 ` Jeff Garzik
2006-06-28 4:43 ` TOE, etc David Miller
@ 2006-06-28 14:31 ` Steve Wise
2 siblings, 0 replies; 19+ messages in thread
From: Steve Wise @ 2006-06-28 14:31 UTC (permalink / raw)
To: Herbert Xu; +Cc: Jeff Garzik, davem, netdev

On Wed, 2006-06-28 at 14:29 +1000, Herbert Xu wrote:
> On Wed, Jun 28, 2006 at 12:18:25AM -0400, Jeff Garzik wrote:
> >
> > A PCI device that presents itself as a SCSI controller, but under the
> > hood is really iSCSI-over-TCP smells like TOE. Running a virtualized
> > Linux guest on top of a proprietary stack [which provides networking
> > services to guests] also smells like TOE. :)
>
> Agreed. However, when they start adding hooks to the ARP table, the
> routing table, and PMTU management, it begs the question what more is
> there to add for TOE (well, user-space driven TOE at least)?
>
> > Unfortunately I don't have more details, so you just get a generalized
> > rant :)
>
> OK, the patch under discussion here adds hooks to all the stuff in the
> previous paragraph for the purpose of RDMA over TCP (well I must say
> that the exact RDMA application/hardware has never been clearly given
> but this is what I can gather from the previous posts).

There are Ammasso and Chelsio RDMA/Ethernet drivers in the openib.org
svn iwarp branch today. The goal is to submit them for review and
inclusion into Linux. The Ammasso driver has been through 3 review
cycles on lkml and netdev. There are other vendors with drivers, but
they're currently not disclosing any information to me about their
status.

Applications:

kernel: NFS-RDMA, iSER, RDP.
user: MPI, uDAPL (both middleware).

The Ammasso driver is a different model. It actually has a full
TCP/ARP/ICMP stack and doesn't require these hooks. But the RDMA/TCP
model defined and implemented, I think, by most vendors is a model
where the HW is doing a limited TCP offload, relying on the native
stack for L2 and L3 integration (as described in the netevent patch).

> Put it another way, I think the dividing line between TOE and iSCSI or
> virtualisation is exactly the interface between them and the Linux kernel.
> If the interface is an existing one such as SCSI or standard IP then it's
> OK. However, when it starts poking in the guts of the Linux stack I'd say
> that it has crossed the line.
>

Don't these netevent hooks have utility for other purposes? I.e.,
should we really shoot down changes to Linux _just because_ they might
possibly enable TOE?

Steve.
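Editorial illustration of the model described in the message above, in which a device caches L2/L3 state and relies on the native stack to report changes. This is a userspace sketch of a generic callback chain, assuming hypothetical names throughout (`chain_register`, `chain_call`, `EV_*`, `struct dev_cache`). It is not the code from the netevent patch and does not use the kernel's notifier API; it only shows the register-then-notify pattern the cover letter describes.

```c
#include <stddef.h>

/* Hypothetical event codes, loosely following the events listed in the
 * cover letter; these are not the identifiers used by the patch. */
enum net_event {
    EV_NEIGH_UPDATE,    /* neighbour MAC address changed */
    EV_REDIRECT,        /* next hop changed for a destination */
    EV_PMTU_UPDATE      /* path MTU changed for a destination */
};

/* One subscriber on the chain. */
struct notifier {
    int (*call)(struct notifier *self, enum net_event ev, void *data);
    struct notifier *next;
};

static struct notifier *chain_head;

/* Add a subscriber at the head of the chain. */
void chain_register(struct notifier *n)
{
    n->next = chain_head;
    chain_head = n;
}

/* Remove a subscriber; returns 0 on success, -1 if it was not registered. */
int chain_unregister(struct notifier *n)
{
    struct notifier **pp;

    for (pp = &chain_head; *pp; pp = &(*pp)->next) {
        if (*pp == n) {
            *pp = n->next;
            n->next = NULL;
            return 0;
        }
    }
    return -1;
}

/* Deliver an event to every subscriber; returns how many were called. */
int chain_call(enum net_event ev, void *data)
{
    struct notifier *p;
    int count = 0;

    for (p = chain_head; p; p = p->next) {
        p->call(p, ev, data);
        count++;
    }
    return count;
}

/* Example subscriber: stands in for state a device caches in hardware. */
struct dev_cache {
    struct notifier nb;         /* first member, so the cast below is valid */
    unsigned int cached_pmtu;
    int neigh_stale;            /* set when next-hop info must be reloaded */
};

int dev_cache_event(struct notifier *self, enum net_event ev, void *data)
{
    struct dev_cache *dc = (struct dev_cache *)self;

    switch (ev) {
    case EV_PMTU_UPDATE:
        dc->cached_pmtu = *(unsigned int *)data;
        break;
    case EV_NEIGH_UPDATE:
    case EV_REDIRECT:
        dc->neigh_stale = 1;
        break;
    }
    return 0;
}
```

In this sketch the stack side would call `chain_call()` wherever it updates a neighbour entry, handles a redirect, or lowers a path MTU, and each registered device refreshes its own copy. The objection raised elsewhere in the thread concerns exactly this coupling: the more such events are exported, the closer the device comes to running its own stack on the host's routes and neighbour entries.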
* Re: TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism)
2006-06-28 4:18 ` TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism) Jeff Garzik
2006-06-28 4:29 ` Herbert Xu
@ 2006-06-28 14:18 ` Steve Wise
1 sibling, 0 replies; 19+ messages in thread
From: Steve Wise @ 2006-06-28 14:18 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Herbert Xu, davem, netdev

On Wed, 2006-06-28 at 00:18 -0400, Jeff Garzik wrote:
> Herbert Xu wrote:
> > On Tue, Jun 27, 2006 at 11:24:25PM -0400, Jeff Garzik wrote:
> >> I don't see how that position has changed?
> >>
> >> http://linux-net.osdl.org/index.php/TOE
> >
> > Well I must say that RDMA over TCP smells very much like TOE. They've
> > got an ARP table, a routing table, and presumably a TCP stack.
>
> A PCI device that presents itself as a SCSI controller, but under the
> hood is really iSCSI-over-TCP smells like TOE. Running a virtualized
> Linux guest on top of a proprietary stack [which provides networking
> services to guests] also smells like TOE. :)
>

I wonder if the existing iSCSI solutions handle next hop MAC address
changes, PMTU changes, etc? They _can_, if they implement the entire
ARP and ICMP suites in HW/FW. Just curious if these vendors also see
merit in the netevent changes I'm proposing...

> If a TOE vendors wants to do TOE in a way that is transparent to the
> kernel, more power to them. Such non-Linux TCP stack solutions still
> suffer many of the problems listed at the web page above, but at least
> they impose no burden on kernel maintenance.
>
> i.e. we really _do not_ want to get into the habit of co-managing arp
> tables, routing tables, filtering rules, and dozens of other such
> resources with multiple remote, independent TCP stack. We have enough
> complexity as it is today, coordinating between the random variations of
> SMP, uniprocessor, and NUMA machines out there. Not to mention
> competing with under-the-hood firmware actions (ASF) on NICs.
>
> As an aside, RDMA over TCP just seems silly. TCP was _not_ meant to do
> the things that RDMA users want. The infiniband/RDMA programming model
> is an ultra-low-latency polling model where one or two apps are allowed
> to completely consume the machine, either busy-waiting or processing
> messages.
>

With RDMA over TCP, you can get the same ultra-low latency, interrupt
coalescing or avoidance, and copy avoidance as with Infiniband (with
newly arriving 10Gb RDMA NICs). The benefit over IB, IMO, is the fact
that your infrastructure is all IP.

If you're interested in a kernel-mode app that benefits from RDMA, then
check out NFS-RDMA, which runs today transparently over Infiniband and
RDMA/TCP devices using the Infiniband RDMA-CM and verbs with minor
changes to support RDMA/TCP devices.

http://sourceforge.net/projects/nfs-rdma/

Steve.