Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v2] CDC NCM: release interfaces fix in unbind()
From: Alan Stern @ 2011-05-18 14:54 UTC (permalink / raw)
  To: Alexey Orishko; +Cc: netdev, davem, oliver, linux-usb, gregkh, Alexey Orishko
In-Reply-To: <1305662344-28355-1-git-send-email-alexey.orishko@stericsson.com>

On Tue, 17 May 2011, Alexey Orishko wrote:

> Changes:
> - claim slave/data interface during bind() and release in
>  unbind() unconditionally
> - in case of error during bind(), release claimed data
>  interface in the same function
> - remove obsolited "*_claimed" entries from driver context

...

> @@ -572,42 +559,32 @@ advance:
>  		goto error;
>  
>  	/* claim interfaces, if any */
> -	if (ctx->data != intf) {
> -		temp = usb_driver_claim_interface(driver, ctx->data, dev);
> -		if (temp)
> -			goto error;
> -		ctx->data_claimed = 1;
> -	}
> -
> -	if (ctx->control != intf) {
> -		temp = usb_driver_claim_interface(driver, ctx->control, dev);
> -		if (temp)
> -			goto error;
> -		ctx->control_claimed = 1;
> -	}
> +	temp = usb_driver_claim_interface(driver, ctx->data, dev);
> +	if (temp)
> +		goto error;

Here and later on, the patch seems to have forgotten about the control 
interface.  Is this deliberate or an oversight?

Alan Stern


^ permalink raw reply

* [PATCH net-next] bridge: add notification over netlink when STP changes state
From: Stephen Hemminger @ 2011-05-18 15:17 UTC (permalink / raw)
  To: David Miller; +Cc: bridge, netdev

The first netlink code in the bridge module was to notify
user space implementations of Spanning Tree Protocol about
new ports. It did not handle the case of kernel mode STP 
changing states which could be useful for monitoring.

This patch causes RTM_NEWLINK message to occur on kernel
transitions. It does not send message if request was from user
space STP, since that would cause reflection and break existing
API.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
---
 net/bridge/br_stp.c       |    4 +++-
 net/bridge/br_stp_if.c    |    3 +++
 net/bridge/br_stp_timer.c |    1 +
 3 files changed, 7 insertions(+), 1 deletion(-)

--- a/net/bridge/br_stp.c	2011-04-15 14:09:10.700851052 -0700
+++ b/net/bridge/br_stp.c	2011-05-17 10:06:09.006014826 -0700
@@ -363,6 +363,8 @@ static void br_make_blocking(struct net_
 
 		p->state = BR_STATE_BLOCKING;
 		br_log_state(p);
+		br_ifinfo_notify(RTM_NEWLINK, p);
+
 		del_timer(&p->forward_delay_timer);
 	}
 }
@@ -386,8 +388,8 @@ static void br_make_forwarding(struct ne
 		p->state = BR_STATE_LEARNING;
 
 	br_multicast_enable_port(p);
-
 	br_log_state(p);
+	br_ifinfo_notify(RTM_NEWLINK, p);
 
 	if (br->forward_delay != 0)
 		mod_timer(&p->forward_delay_timer, jiffies + br->forward_delay);
--- a/net/bridge/br_stp_if.c	2011-04-15 14:09:10.700851052 -0700
+++ b/net/bridge/br_stp_if.c	2011-05-17 10:06:09.006014826 -0700
@@ -88,6 +88,7 @@ void br_stp_enable_port(struct net_bridg
 	br_init_port(p);
 	br_port_state_selection(p->br);
 	br_log_state(p);
+	br_ifinfo_notify(RTM_NEWLINK, p);
 }
 
 /* called under bridge lock */
@@ -104,6 +105,8 @@ void br_stp_disable_port(struct net_brid
 	p->topology_change_ack = 0;
 	p->config_pending = 0;
 
+	br_ifinfo_notify(RTM_NEWLINK, p);
+
 	del_timer(&p->message_age_timer);
 	del_timer(&p->forward_delay_timer);
 	del_timer(&p->hold_timer);
--- a/net/bridge/br_stp_timer.c	2011-04-04 09:21:25.963009153 -0700
+++ b/net/bridge/br_stp_timer.c	2011-05-17 10:06:09.018014957 -0700
@@ -97,6 +97,7 @@ static void br_forward_delay_timer_expir
 		netif_carrier_on(br->dev);
 	}
 	br_log_state(p);
+	br_ifinfo_notify(RTM_NEWLINK, p);
 	spin_unlock(&br->lock);
 }
 

^ permalink raw reply

* Re: [PATCH net-next] bridge: add notification over netlink when STP changes state
From: Ben Greear @ 2011-05-18 15:19 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, bridge, netdev
In-Reply-To: <20110518081718.0b1fc390@nehalam>

On 05/18/2011 08:17 AM, Stephen Hemminger wrote:
> The first netlink code in the bridge module was to notify
> user space implementations of Spanning Tree Protocol about
> new ports. It did not handle the case of kernel mode STP
> changing states which could be useful for monitoring.
>
> This patch causes RTM_NEWLINK message to occur on kernel
> transitions. It does not send message if request was from user
> space STP, since that would cause reflection and break existing
> API.

So one app monitoring the system won't get updates if another
app makes changes?

Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [PATCH net-next] bridge: add notification over netlink when STP changes state
From: Stephen Hemminger @ 2011-05-18 15:39 UTC (permalink / raw)
  To: Ben Greear; +Cc: David Miller, bridge, netdev
In-Reply-To: <4DD3E38C.30609@candelatech.com>

On Wed, 18 May 2011 08:19:40 -0700
Ben Greear <greearb@candelatech.com> wrote:

> On 05/18/2011 08:17 AM, Stephen Hemminger wrote:
> > The first netlink code in the bridge module was to notify
> > user space implementations of Spanning Tree Protocol about
> > new ports. It did not handle the case of kernel mode STP
> > changing states which could be useful for monitoring.
> >
> > This patch causes RTM_NEWLINK message to occur on kernel
> > transitions. It does not send message if request was from user
> > space STP, since that would cause reflection and break existing
> > API.
> 
> So one app monitoring the system won't get updates if another
> app makes changes?
> 
> Ben

For now, yes. Just don't want to break any existing users
which is more important.

-- 

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michael S. Tsirkin @ 2011-05-18 15:47 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michał Mirosław, Ben Hutchings, David Miller,
	Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel
In-Reply-To: <1305729507.32080.6.camel@localhost.localdomain>

On Wed, May 18, 2011 at 07:38:27AM -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 13:40 +0200, Michał Mirosław wrote:
> > >> >> Not more other restrictions, skb clone is OK. pskb_expand_head()
> > looks
> > >> >> OK to me from code review.
> > >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> > >> > references to pages. How is that ok? What do I miss?
> > >> It's making copy of the skb_shinfo earlier, so the pages refcount
> > >> stays the same.
> > > Exactly. But the callback is invoked so the guest thinks it's ok to
> > > change this memory. If it does a corrupted packet will be sent out.
> > 
> > Hmm. I tool a quick look at skb_clone(), and it looks like this
> > sequence will break this scheme:
> > 
> > skb2 = skb_clone(skb...);
> > kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> > [use skb2, pages still referenced]
> > kfree_skb(skb); /* callback called again */
> > 
> > This sequence is common in bridge, might be in other places.
> > 
> > Maybe this ubuf thing should just track clones? This will make it work
> > on all devices then.
> 
> The callback was only invoked when last reference of skb was gone.
> skb_clone does increase skb refcnt. I tested tcpdump on lower device, it
> worked.

Right, it will normally work, but two issues I think you miss:
1. malicious guest can change the memory between when it is sent out by
   device and consumed by tcpdump, so you will see different things
   (not sure how important this is).
2. if tcpdump stops consuming stuff from the packet socket (it's
   userspace, can't be trusted) then we won't get a callback for
   page potentially forever, guest networking will get blocked etc.

> For the sequence of:
> 
> skb_clone  -> last refcnt + 1
> kfree_skb() or pskb_expand_head -> callback not called
> kfree_skb() -> callback called
> 
> I will check page refcount to see whether it's balanced.
> 
> Thanks
> shirley


pskb_expand_head is a problem anyway I think as it
can hang on to pages after it calls release_data.
Then guest will modify these pages and you get trash there.

-- 
MST

^ permalink raw reply

* Re: Bug, kernel panic, NULL dereference , cleanup_once / icmp_route_lookup.clone.19.clone / nat , 2.6.39-rc7-git11
From: Eric Dumazet @ 2011-05-18 15:52 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev
In-Reply-To: <4ae96506c000a4b3c8f78ccef836deaf@visp.net.lb>

Le mercredi 18 mai 2011 à 15:46 +0300, Denys Fedoryshchenko a écrit :

>  It is NAS, has ipv6 enabled recently (i am preparing for ipv6), but 
>  ipv6 is not routed anywhere, maybe just automatically addresses 
>  appearing on interfaces, including the one looking to customer subnet. 
>  ppp is ipv4 only.
> 

Hmm, it seems we have some inetpeer refcount leak somewhere.

Maybe one (struct rtable)->peer is not released on dst/rtable removal,
or we also leak dst/rtable (and their ->peer inetpeer)

Watch :

grep peer /proc/slabinfo
grep dst /proc/slabinfo



^ permalink raw reply

* [RFC V4 PATCH] rtnetlink: Compute and store minimum ifinfo dump size
From: Greg Rose @ 2011-05-18 15:54 UTC (permalink / raw)
  To: netdev; +Cc: bhutchings, davem, eric.dumazet

The message size allocated for rtnl info dumps was limited to a single
page.  This is not enough for additional interface info available with
devices that support SR-IOV.  Calculate the amount of data required so
the dump can allocate enough data to satisfy the request.

V2 added the calcit function to the rtnl_register calls so that
dump functions could get the minimum dump allocation size if they
needed to.

V3 leverages the fact that the netdev register function ends up
calling if_nlmsg_size.  We collect the minimum dump allocation size
there and keep it in a module static variable so that the calcit
function doesn't have to search for the device on every info dump.

V4 fixes the title.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
---

 include/linux/netlink.h              |    6 ++-
 include/net/rtnetlink.h              |    5 ++-
 net/bridge/br_netlink.c              |   15 ++++++---
 net/core/fib_rules.c                 |    6 ++-
 net/core/neighbour.c                 |   10 +++---
 net/core/rtnetlink.c                 |   59 ++++++++++++++++++++++++++++------
 net/dcb/dcbnl.c                      |    4 +-
 net/decnet/dn_dev.c                  |    6 ++-
 net/decnet/dn_fib.c                  |    4 +-
 net/decnet/dn_route.c                |    5 ++-
 net/ipv4/devinet.c                   |    6 ++-
 net/ipv4/fib_frontend.c              |    6 ++-
 net/ipv4/inet_diag.c                 |    2 +
 net/ipv4/ipmr.c                      |    3 +-
 net/ipv4/route.c                     |    2 +
 net/ipv6/addrconf.c                  |   12 +++----
 net/ipv6/addrlabel.c                 |    6 ++-
 net/ipv6/ip6_fib.c                   |    2 +
 net/ipv6/ip6mr.c                     |    2 +
 net/ipv6/route.c                     |    6 ++-
 net/netfilter/ipset/ip_set_core.c    |    2 +
 net/netfilter/nf_conntrack_netlink.c |    4 +-
 net/netlink/af_netlink.c             |   17 ++++++----
 net/netlink/genetlink.c              |    2 +
 net/phonet/pn_netlink.c              |   12 +++----
 net/sched/act_api.c                  |    6 ++-
 net/sched/cls_api.c                  |    6 ++-
 net/sched/sch_api.c                  |   12 +++----
 net/xfrm/xfrm_user.c                 |    2 +
 29 files changed, 141 insertions(+), 89 deletions(-)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 4c4ac3f..8b8dfb8 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -220,7 +220,8 @@ struct netlink_callback {
 	int			(*dump)(struct sk_buff * skb,
 					struct netlink_callback *cb);
 	int			(*done)(struct netlink_callback *cb);
-	int			family;
+	u16			family;
+	u16			min_dump_alloc;
 	long			args[6];
 };
 
@@ -258,7 +259,8 @@ __nlmsg_put(struct sk_buff *skb, u32 pid, u32 seq, int type, int len, int flags)
 extern int netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 			      const struct nlmsghdr *nlh,
 			      int (*dump)(struct sk_buff *skb, struct netlink_callback*),
-			      int (*done)(struct netlink_callback*));
+			      int (*done)(struct netlink_callback*),
+			      u16 min_dump_alloc);
 
 
 #define NL_NONROOT_RECV 0x1
diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index 4093ca7..d1ac642 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -6,11 +6,12 @@
 
 typedef int (*rtnl_doit_func)(struct sk_buff *, struct nlmsghdr *, void *);
 typedef int (*rtnl_dumpit_func)(struct sk_buff *, struct netlink_callback *);
+typedef u16 (*rtnl_calcit_func)(struct sk_buff *);
 
 extern int	__rtnl_register(int protocol, int msgtype,
-				rtnl_doit_func, rtnl_dumpit_func);
+				rtnl_doit_func, rtnl_dumpit_func, rtnl_calcit_func);
 extern void	rtnl_register(int protocol, int msgtype,
-			      rtnl_doit_func, rtnl_dumpit_func);
+			      rtnl_doit_func, rtnl_dumpit_func, rtnl_calcit_func);
 extern int	rtnl_unregister(int protocol, int msgtype);
 extern void	rtnl_unregister_all(int protocol);
 
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index ffb0dc4..6814083 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -218,19 +218,24 @@ int __init br_netlink_init(void)
 	if (err < 0)
 		goto err1;
 
-	err = __rtnl_register(PF_BRIDGE, RTM_GETLINK, NULL, br_dump_ifinfo);
+	err = __rtnl_register(PF_BRIDGE, RTM_GETLINK, NULL,
+			      br_dump_ifinfo, NULL);
 	if (err)
 		goto err2;
-	err = __rtnl_register(PF_BRIDGE, RTM_SETLINK, br_rtm_setlink, NULL);
+	err = __rtnl_register(PF_BRIDGE, RTM_SETLINK,
+			      br_rtm_setlink, NULL, NULL);
 	if (err)
 		goto err3;
-	err = __rtnl_register(PF_BRIDGE, RTM_NEWNEIGH, br_fdb_add, NULL);
+	err = __rtnl_register(PF_BRIDGE, RTM_NEWNEIGH,
+			      br_fdb_add, NULL, NULL);
 	if (err)
 		goto err3;
-	err = __rtnl_register(PF_BRIDGE, RTM_DELNEIGH, br_fdb_delete, NULL);
+	err = __rtnl_register(PF_BRIDGE, RTM_DELNEIGH,
+			      br_fdb_delete, NULL, NULL);
 	if (err)
 		goto err3;
-	err = __rtnl_register(PF_BRIDGE, RTM_GETNEIGH, NULL, br_fdb_dump);
+	err = __rtnl_register(PF_BRIDGE, RTM_GETNEIGH,
+			      NULL, br_fdb_dump, NULL);
 	if (err)
 		goto err3;
 
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 3911586..56e6fc8 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -739,9 +739,9 @@ static struct pernet_operations fib_rules_net_ops = {
 static int __init fib_rules_init(void)
 {
 	int err;
-	rtnl_register(PF_UNSPEC, RTM_NEWRULE, fib_nl_newrule, NULL);
-	rtnl_register(PF_UNSPEC, RTM_DELRULE, fib_nl_delrule, NULL);
-	rtnl_register(PF_UNSPEC, RTM_GETRULE, NULL, fib_nl_dumprule);
+	rtnl_register(PF_UNSPEC, RTM_NEWRULE, fib_nl_newrule, NULL, NULL);
+	rtnl_register(PF_UNSPEC, RTM_DELRULE, fib_nl_delrule, NULL, NULL);
+	rtnl_register(PF_UNSPEC, RTM_GETRULE, NULL, fib_nl_dumprule, NULL);
 
 	err = register_pernet_subsys(&fib_rules_net_ops);
 	if (err < 0)
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 799f06e..a880b83 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2909,12 +2909,12 @@ EXPORT_SYMBOL(neigh_sysctl_unregister);
 
 static int __init neigh_init(void)
 {
-	rtnl_register(PF_UNSPEC, RTM_NEWNEIGH, neigh_add, NULL);
-	rtnl_register(PF_UNSPEC, RTM_DELNEIGH, neigh_delete, NULL);
-	rtnl_register(PF_UNSPEC, RTM_GETNEIGH, NULL, neigh_dump_info);
+	rtnl_register(PF_UNSPEC, RTM_NEWNEIGH, neigh_add, NULL, NULL);
+	rtnl_register(PF_UNSPEC, RTM_DELNEIGH, neigh_delete, NULL, NULL);
+	rtnl_register(PF_UNSPEC, RTM_GETNEIGH, NULL, neigh_dump_info, NULL);
 
-	rtnl_register(PF_UNSPEC, RTM_GETNEIGHTBL, NULL, neightbl_dump_info);
-	rtnl_register(PF_UNSPEC, RTM_SETNEIGHTBL, neightbl_set, NULL);
+	rtnl_register(PF_UNSPEC, RTM_GETNEIGHTBL, NULL, neightbl_dump_info, NULL);
+	rtnl_register(PF_UNSPEC, RTM_SETNEIGHTBL, neightbl_set, NULL, NULL);
 
 	return 0;
 }
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index d2ba259..a59e595 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -56,9 +56,11 @@
 struct rtnl_link {
 	rtnl_doit_func		doit;
 	rtnl_dumpit_func	dumpit;
+	rtnl_calcit_func calcit;
 };
 
 static DEFINE_MUTEX(rtnl_mutex);
+static u16 min_ifinfo_dump_size;
 
 void rtnl_lock(void)
 {
@@ -144,12 +146,28 @@ static rtnl_dumpit_func rtnl_get_dumpit(int protocol, int msgindex)
 	return tab ? tab[msgindex].dumpit : NULL;
 }
 
+static rtnl_calcit_func rtnl_get_calcit(int protocol, int msgindex)
+{
+	struct rtnl_link *tab;
+
+	if (protocol <= RTNL_FAMILY_MAX)
+		tab = rtnl_msg_handlers[protocol];
+	else
+		tab = NULL;
+
+	if (tab == NULL || tab[msgindex].calcit == NULL)
+		tab = rtnl_msg_handlers[PF_UNSPEC];
+
+	return tab ? tab[msgindex].calcit : NULL;
+}
+
 /**
  * __rtnl_register - Register a rtnetlink message type
  * @protocol: Protocol family or PF_UNSPEC
  * @msgtype: rtnetlink message type
  * @doit: Function pointer called for each request message
  * @dumpit: Function pointer called for each dump request (NLM_F_DUMP) message
+ * @calcit: Function pointer to calc size of dump message
  *
  * Registers the specified function pointers (at least one of them has
  * to be non-NULL) to be called whenever a request message for the
@@ -162,7 +180,8 @@ static rtnl_dumpit_func rtnl_get_dumpit(int protocol, int msgindex)
  * Returns 0 on success or a negative error code.
  */
 int __rtnl_register(int protocol, int msgtype,
-		    rtnl_doit_func doit, rtnl_dumpit_func dumpit)
+		    rtnl_doit_func doit, rtnl_dumpit_func dumpit,
+		    rtnl_calcit_func calcit)
 {
 	struct rtnl_link *tab;
 	int msgindex;
@@ -185,6 +204,9 @@ int __rtnl_register(int protocol, int msgtype,
 	if (dumpit)
 		tab[msgindex].dumpit = dumpit;
 
+	if (calcit)
+		tab[msgindex].calcit = calcit;
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(__rtnl_register);
@@ -199,9 +221,10 @@ EXPORT_SYMBOL_GPL(__rtnl_register);
  * of memory implies no sense in continuing.
  */
 void rtnl_register(int protocol, int msgtype,
-		   rtnl_doit_func doit, rtnl_dumpit_func dumpit)
+		   rtnl_doit_func doit, rtnl_dumpit_func dumpit,
+		   rtnl_calcit_func calcit)
 {
-	if (__rtnl_register(protocol, msgtype, doit, dumpit) < 0)
+	if (__rtnl_register(protocol, msgtype, doit, dumpit, calcit) < 0)
 		panic("Unable to register rtnetlink message handler, "
 		      "protocol = %d, message type = %d\n",
 		      protocol, msgtype);
@@ -1814,6 +1837,11 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
 	return err;
 }
 
+static u16 rtnl_calcit(struct sk_buff *skb)
+{
+	return min_ifinfo_dump_size;
+}
+
 static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	int idx;
@@ -1843,11 +1871,14 @@ void rtmsg_ifinfo(int type, struct net_device *dev, unsigned change)
 	struct net *net = dev_net(dev);
 	struct sk_buff *skb;
 	int err = -ENOBUFS;
+	size_t if_info_size;
 
-	skb = nlmsg_new(if_nlmsg_size(dev), GFP_KERNEL);
+	skb = nlmsg_new((if_info_size = if_nlmsg_size(dev)), GFP_KERNEL);
 	if (skb == NULL)
 		goto errout;
 
+	min_ifinfo_dump_size = max_t(u16, if_info_size, min_ifinfo_dump_size);
+
 	err = rtnl_fill_ifinfo(skb, dev, type, 0, 0, change, 0);
 	if (err < 0) {
 		/* -EMSGSIZE implies BUG in if_nlmsg_size() */
@@ -1897,13 +1928,19 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	if (kind == 2 && nlh->nlmsg_flags&NLM_F_DUMP) {
 		struct sock *rtnl;
 		rtnl_dumpit_func dumpit;
+		rtnl_calcit_func calcit;
+		u16 min_dump_alloc = 0;
 
 		dumpit = rtnl_get_dumpit(family, type);
 		if (dumpit == NULL)
 			return -EOPNOTSUPP;
+		calcit = rtnl_get_calcit(family, type);
+		if (calcit)
+			min_dump_alloc = calcit(skb);
 
 		rtnl = net->rtnl;
-		return netlink_dump_start(rtnl, skb, nlh, dumpit, NULL);
+		return netlink_dump_start(rtnl, skb, nlh, dumpit,
+					  NULL, min_dump_alloc);
 	}
 
 	memset(rta_buf, 0, (rtattr_max * sizeof(struct rtattr *)));
@@ -2009,12 +2046,12 @@ void __init rtnetlink_init(void)
 	netlink_set_nonroot(NETLINK_ROUTE, NL_NONROOT_RECV);
 	register_netdevice_notifier(&rtnetlink_dev_notifier);
 
-	rtnl_register(PF_UNSPEC, RTM_GETLINK, rtnl_getlink, rtnl_dump_ifinfo);
-	rtnl_register(PF_UNSPEC, RTM_SETLINK, rtnl_setlink, NULL);
-	rtnl_register(PF_UNSPEC, RTM_NEWLINK, rtnl_newlink, NULL);
-	rtnl_register(PF_UNSPEC, RTM_DELLINK, rtnl_dellink, NULL);
+	rtnl_register(PF_UNSPEC, RTM_GETLINK, rtnl_getlink, rtnl_dump_ifinfo, rtnl_calcit);
+	rtnl_register(PF_UNSPEC, RTM_SETLINK, rtnl_setlink, NULL, NULL);
+	rtnl_register(PF_UNSPEC, RTM_NEWLINK, rtnl_newlink, NULL, NULL);
+	rtnl_register(PF_UNSPEC, RTM_DELLINK, rtnl_dellink, NULL, NULL);
 
-	rtnl_register(PF_UNSPEC, RTM_GETADDR, NULL, rtnl_dump_all);
-	rtnl_register(PF_UNSPEC, RTM_GETROUTE, NULL, rtnl_dump_all);
+	rtnl_register(PF_UNSPEC, RTM_GETADDR, NULL, rtnl_dump_all, NULL);
+	rtnl_register(PF_UNSPEC, RTM_GETROUTE, NULL, rtnl_dump_all, NULL);
 }
 
diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
index 3609eac..ed1bb8c 100644
--- a/net/dcb/dcbnl.c
+++ b/net/dcb/dcbnl.c
@@ -1819,8 +1819,8 @@ static int __init dcbnl_init(void)
 {
 	INIT_LIST_HEAD(&dcb_app_list);
 
-	rtnl_register(PF_UNSPEC, RTM_GETDCB, dcb_doit, NULL);
-	rtnl_register(PF_UNSPEC, RTM_SETDCB, dcb_doit, NULL);
+	rtnl_register(PF_UNSPEC, RTM_GETDCB, dcb_doit, NULL, NULL);
+	rtnl_register(PF_UNSPEC, RTM_SETDCB, dcb_doit, NULL, NULL);
 
 	return 0;
 }
diff --git a/net/decnet/dn_dev.c b/net/decnet/dn_dev.c
index 404fa15..0011eba 100644
--- a/net/decnet/dn_dev.c
+++ b/net/decnet/dn_dev.c
@@ -1419,9 +1419,9 @@ void __init dn_dev_init(void)
 
 	dn_dev_devices_on();
 
-	rtnl_register(PF_DECnet, RTM_NEWADDR, dn_nl_newaddr, NULL);
-	rtnl_register(PF_DECnet, RTM_DELADDR, dn_nl_deladdr, NULL);
-	rtnl_register(PF_DECnet, RTM_GETADDR, NULL, dn_nl_dump_ifaddr);
+	rtnl_register(PF_DECnet, RTM_NEWADDR, dn_nl_newaddr, NULL, NULL);
+	rtnl_register(PF_DECnet, RTM_DELADDR, dn_nl_deladdr, NULL, NULL);
+	rtnl_register(PF_DECnet, RTM_GETADDR, NULL, dn_nl_dump_ifaddr, NULL);
 
 	proc_net_fops_create(&init_net, "decnet_dev", S_IRUGO, &dn_dev_seq_fops);
 
diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index 1c74ed3..104324d 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -763,8 +763,8 @@ void __init dn_fib_init(void)
 
 	register_dnaddr_notifier(&dn_fib_dnaddr_notifier);
 
-	rtnl_register(PF_DECnet, RTM_NEWROUTE, dn_fib_rtm_newroute, NULL);
-	rtnl_register(PF_DECnet, RTM_DELROUTE, dn_fib_rtm_delroute, NULL);
+	rtnl_register(PF_DECnet, RTM_NEWROUTE, dn_fib_rtm_newroute, NULL, NULL);
+	rtnl_register(PF_DECnet, RTM_DELROUTE, dn_fib_rtm_delroute, NULL, NULL);
 }
 
 
diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index 74544bc..2949ca4 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -1841,10 +1841,11 @@ void __init dn_route_init(void)
 	proc_net_fops_create(&init_net, "decnet_cache", S_IRUGO, &dn_rt_cache_seq_fops);
 
 #ifdef CONFIG_DECNET_ROUTER
-	rtnl_register(PF_DECnet, RTM_GETROUTE, dn_cache_getroute, dn_fib_dump);
+	rtnl_register(PF_DECnet, RTM_GETROUTE, dn_cache_getroute,
+		      dn_fib_dump, NULL);
 #else
 	rtnl_register(PF_DECnet, RTM_GETROUTE, dn_cache_getroute,
-		      dn_cache_dump);
+		      dn_cache_dump, NULL);
 #endif
 }
 
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 0d4a184..37b3c18 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1833,8 +1833,8 @@ void __init devinet_init(void)
 
 	rtnl_af_register(&inet_af_ops);
 
-	rtnl_register(PF_INET, RTM_NEWADDR, inet_rtm_newaddr, NULL);
-	rtnl_register(PF_INET, RTM_DELADDR, inet_rtm_deladdr, NULL);
-	rtnl_register(PF_INET, RTM_GETADDR, NULL, inet_dump_ifaddr);
+	rtnl_register(PF_INET, RTM_NEWADDR, inet_rtm_newaddr, NULL, NULL);
+	rtnl_register(PF_INET, RTM_DELADDR, inet_rtm_deladdr, NULL, NULL);
+	rtnl_register(PF_INET, RTM_GETADDR, NULL, inet_dump_ifaddr, NULL);
 }
 
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 2252471..92fc5f6 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1124,9 +1124,9 @@ static struct pernet_operations fib_net_ops = {
 
 void __init ip_fib_init(void)
 {
-	rtnl_register(PF_INET, RTM_NEWROUTE, inet_rtm_newroute, NULL);
-	rtnl_register(PF_INET, RTM_DELROUTE, inet_rtm_delroute, NULL);
-	rtnl_register(PF_INET, RTM_GETROUTE, NULL, inet_dump_fib);
+	rtnl_register(PF_INET, RTM_NEWROUTE, inet_rtm_newroute, NULL, NULL);
+	rtnl_register(PF_INET, RTM_DELROUTE, inet_rtm_delroute, NULL, NULL);
+	rtnl_register(PF_INET, RTM_GETROUTE, NULL, inet_dump_fib, NULL);
 
 	register_pernet_subsys(&fib_net_ops);
 	register_netdevice_notifier(&fib_netdev_notifier);
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 6ffe94c..5ff4765 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -871,7 +871,7 @@ static int inet_diag_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 		}
 
 		return netlink_dump_start(idiagnl, skb, nlh,
-					  inet_diag_dump, NULL);
+					  inet_diag_dump, NULL, 0);
 	}
 
 	return inet_diag_get_exact(skb, nlh);
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 30a7763..aae2bd8 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2544,7 +2544,8 @@ int __init ip_mr_init(void)
 		goto add_proto_fail;
 	}
 #endif
-	rtnl_register(RTNL_FAMILY_IPMR, RTM_GETROUTE, NULL, ipmr_rtm_dumproute);
+	rtnl_register(RTNL_FAMILY_IPMR, RTM_GETROUTE,
+		      NULL, ipmr_rtm_dumproute, NULL);
 	return 0;
 
 #ifdef CONFIG_IP_PIMSM_V2
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6a83840..eec0caa 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -3312,7 +3312,7 @@ int __init ip_rt_init(void)
 	xfrm_init();
 	xfrm4_init(ip_rt_max_size);
 #endif
-	rtnl_register(PF_INET, RTM_GETROUTE, inet_rtm_getroute, NULL);
+	rtnl_register(PF_INET, RTM_GETROUTE, inet_rtm_getroute, NULL, NULL);
 
 #ifdef CONFIG_SYSCTL
 	register_pernet_subsys(&sysctl_route_ops);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index f2f9b2e..f013979 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4704,16 +4704,16 @@ int __init addrconf_init(void)
 	if (err < 0)
 		goto errout_af;
 
-	err = __rtnl_register(PF_INET6, RTM_GETLINK, NULL, inet6_dump_ifinfo);
+	err = __rtnl_register(PF_INET6, RTM_GETLINK, NULL, inet6_dump_ifinfo, NULL);
 	if (err < 0)
 		goto errout;
 
 	/* Only the first call to __rtnl_register can fail */
-	__rtnl_register(PF_INET6, RTM_NEWADDR, inet6_rtm_newaddr, NULL);
-	__rtnl_register(PF_INET6, RTM_DELADDR, inet6_rtm_deladdr, NULL);
-	__rtnl_register(PF_INET6, RTM_GETADDR, inet6_rtm_getaddr, inet6_dump_ifaddr);
-	__rtnl_register(PF_INET6, RTM_GETMULTICAST, NULL, inet6_dump_ifmcaddr);
-	__rtnl_register(PF_INET6, RTM_GETANYCAST, NULL, inet6_dump_ifacaddr);
+	__rtnl_register(PF_INET6, RTM_NEWADDR, inet6_rtm_newaddr, NULL, NULL);
+	__rtnl_register(PF_INET6, RTM_DELADDR, inet6_rtm_deladdr, NULL, NULL);
+	__rtnl_register(PF_INET6, RTM_GETADDR, inet6_rtm_getaddr, inet6_dump_ifaddr, NULL);
+	__rtnl_register(PF_INET6, RTM_GETMULTICAST, NULL, inet6_dump_ifmcaddr, NULL);
+	__rtnl_register(PF_INET6, RTM_GETANYCAST, NULL, inet6_dump_ifacaddr, NULL);
 
 	ipv6_addr_label_rtnl_register();
 
diff --git a/net/ipv6/addrlabel.c b/net/ipv6/addrlabel.c
index c8993e5..f3aa749 100644
--- a/net/ipv6/addrlabel.c
+++ b/net/ipv6/addrlabel.c
@@ -592,8 +592,8 @@ out:
 
 void __init ipv6_addr_label_rtnl_register(void)
 {
-	__rtnl_register(PF_INET6, RTM_NEWADDRLABEL, ip6addrlbl_newdel, NULL);
-	__rtnl_register(PF_INET6, RTM_DELADDRLABEL, ip6addrlbl_newdel, NULL);
-	__rtnl_register(PF_INET6, RTM_GETADDRLABEL, ip6addrlbl_get, ip6addrlbl_dump);
+	__rtnl_register(PF_INET6, RTM_NEWADDRLABEL, ip6addrlbl_newdel, NULL, NULL);
+	__rtnl_register(PF_INET6, RTM_DELADDRLABEL, ip6addrlbl_newdel, NULL, NULL);
+	__rtnl_register(PF_INET6, RTM_GETADDRLABEL, ip6addrlbl_get, ip6addrlbl_dump, NULL);
 }
 
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 4076a0b..9b257da 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1586,7 +1586,7 @@ int __init fib6_init(void)
 	if (ret)
 		goto out_kmem_cache_create;
 
-	ret = __rtnl_register(PF_INET6, RTM_GETROUTE, NULL, inet6_dump_fib);
+	ret = __rtnl_register(PF_INET6, RTM_GETROUTE, NULL, inet6_dump_fib, NULL);
 	if (ret)
 		goto out_unregister_subsys;
 out:
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 82a8099..1edfcc9 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -1354,7 +1354,7 @@ int __init ip6_mr_init(void)
 		goto add_proto_fail;
 	}
 #endif
-	rtnl_register(RTNL_FAMILY_IP6MR, RTM_GETROUTE, NULL, ip6mr_rtm_dumproute);
+	rtnl_register(RTNL_FAMILY_IP6MR, RTM_GETROUTE, NULL, ip6mr_rtm_dumproute, NULL);
 	return 0;
 #ifdef CONFIG_IPV6_PIMSM_V2
 add_proto_fail:
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f1be5c5..1c49165 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2924,9 +2924,9 @@ int __init ip6_route_init(void)
 		goto xfrm6_init;
 
 	ret = -ENOBUFS;
-	if (__rtnl_register(PF_INET6, RTM_NEWROUTE, inet6_rtm_newroute, NULL) ||
-	    __rtnl_register(PF_INET6, RTM_DELROUTE, inet6_rtm_delroute, NULL) ||
-	    __rtnl_register(PF_INET6, RTM_GETROUTE, inet6_rtm_getroute, NULL))
+	if (__rtnl_register(PF_INET6, RTM_NEWROUTE, inet6_rtm_newroute, NULL, NULL) ||
+	    __rtnl_register(PF_INET6, RTM_DELROUTE, inet6_rtm_delroute, NULL, NULL) ||
+	    __rtnl_register(PF_INET6, RTM_GETROUTE, inet6_rtm_getroute, NULL, NULL))
 		goto fib6_rules_init;
 
 	ret = register_netdevice_notifier(&ip6_route_dev_notifier);
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 72d1ac6..dc1528c 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -1120,7 +1120,7 @@ ip_set_dump(struct sock *ctnl, struct sk_buff *skb,
 
 	return netlink_dump_start(ctnl, skb, nlh,
 				  ip_set_dump_start,
-				  ip_set_dump_done);
+				  ip_set_dump_done, 0);
 }
 
 /* Add, del and test */
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 482e90c..7dec88a 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -970,7 +970,7 @@ ctnetlink_get_conntrack(struct sock *ctnl, struct sk_buff *skb,
 
 	if (nlh->nlmsg_flags & NLM_F_DUMP)
 		return netlink_dump_start(ctnl, skb, nlh, ctnetlink_dump_table,
-					  ctnetlink_done);
+					  ctnetlink_done, 0);
 
 	err = ctnetlink_parse_zone(cda[CTA_ZONE], &zone);
 	if (err < 0)
@@ -1840,7 +1840,7 @@ ctnetlink_get_expect(struct sock *ctnl, struct sk_buff *skb,
 	if (nlh->nlmsg_flags & NLM_F_DUMP) {
 		return netlink_dump_start(ctnl, skb, nlh,
 					  ctnetlink_exp_dump_table,
-					  ctnetlink_exp_done);
+					  ctnetlink_exp_done, 0);
 	}
 
 	err = ctnetlink_parse_zone(cda[CTA_EXPECT_ZONE], &zone);
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index c8f35b5..063bee9 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1665,13 +1665,10 @@ static int netlink_dump(struct sock *sk)
 {
 	struct netlink_sock *nlk = nlk_sk(sk);
 	struct netlink_callback *cb;
-	struct sk_buff *skb;
+	struct sk_buff *skb = NULL;
 	struct nlmsghdr *nlh;
 	int len, err = -ENOBUFS;
-
-	skb = sock_rmalloc(sk, NLMSG_GOODSIZE, 0, GFP_KERNEL);
-	if (!skb)
-		goto errout;
+	int alloc_size;
 
 	mutex_lock(nlk->cb_mutex);
 
@@ -1681,6 +1678,12 @@ static int netlink_dump(struct sock *sk)
 		goto errout_skb;
 	}
 
+	alloc_size = max_t(int, cb->min_dump_alloc, NLMSG_GOODSIZE);
+
+	skb = sock_rmalloc(sk, alloc_size, 0, GFP_KERNEL);
+	if (!skb)
+		goto errout;
+
 	len = cb->dump(skb, cb);
 
 	if (len > 0) {
@@ -1727,7 +1730,8 @@ int netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 		       const struct nlmsghdr *nlh,
 		       int (*dump)(struct sk_buff *skb,
 				   struct netlink_callback *),
-		       int (*done)(struct netlink_callback *))
+		       int (*done)(struct netlink_callback *),
+		       u16 min_dump_alloc)
 {
 	struct netlink_callback *cb;
 	struct sock *sk;
@@ -1741,6 +1745,7 @@ int netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 	cb->dump = dump;
 	cb->done = done;
 	cb->nlh = nlh;
+	cb->min_dump_alloc = min_dump_alloc;
 	atomic_inc(&skb->users);
 	cb->skb = skb;
 
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 1781d99..482fa57 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -525,7 +525,7 @@ static int genl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 
 		genl_unlock();
 		err = netlink_dump_start(net->genl_sock, skb, nlh,
-					 ops->dumpit, ops->done);
+					 ops->dumpit, ops->done, 0);
 		genl_lock();
 		return err;
 	}
diff --git a/net/phonet/pn_netlink.c b/net/phonet/pn_netlink.c
index 438accb..4ad4bb9 100644
--- a/net/phonet/pn_netlink.c
+++ b/net/phonet/pn_netlink.c
@@ -289,15 +289,15 @@ out:
 
 int __init phonet_netlink_register(void)
 {
-	int err = __rtnl_register(PF_PHONET, RTM_NEWADDR, addr_doit, NULL);
+	int err = __rtnl_register(PF_PHONET, RTM_NEWADDR, addr_doit, NULL, NULL);
 	if (err)
 		return err;
 
 	/* Further __rtnl_register() cannot fail */
-	__rtnl_register(PF_PHONET, RTM_DELADDR, addr_doit, NULL);
-	__rtnl_register(PF_PHONET, RTM_GETADDR, NULL, getaddr_dumpit);
-	__rtnl_register(PF_PHONET, RTM_NEWROUTE, route_doit, NULL);
-	__rtnl_register(PF_PHONET, RTM_DELROUTE, route_doit, NULL);
-	__rtnl_register(PF_PHONET, RTM_GETROUTE, NULL, route_dumpit);
+	__rtnl_register(PF_PHONET, RTM_DELADDR, addr_doit, NULL, NULL);
+	__rtnl_register(PF_PHONET, RTM_GETADDR, NULL, getaddr_dumpit, NULL);
+	__rtnl_register(PF_PHONET, RTM_NEWROUTE, route_doit, NULL, NULL);
+	__rtnl_register(PF_PHONET, RTM_DELROUTE, route_doit, NULL, NULL);
+	__rtnl_register(PF_PHONET, RTM_GETROUTE, NULL, route_dumpit, NULL);
 	return 0;
 }
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 14b42f4..c857763 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1120,9 +1120,9 @@ nlmsg_failure:
 
 static int __init tc_action_init(void)
 {
-	rtnl_register(PF_UNSPEC, RTM_NEWACTION, tc_ctl_action, NULL);
-	rtnl_register(PF_UNSPEC, RTM_DELACTION, tc_ctl_action, NULL);
-	rtnl_register(PF_UNSPEC, RTM_GETACTION, tc_ctl_action, tc_dump_action);
+	rtnl_register(PF_UNSPEC, RTM_NEWACTION, tc_ctl_action, NULL, NULL);
+	rtnl_register(PF_UNSPEC, RTM_DELACTION, tc_ctl_action, NULL, NULL);
+	rtnl_register(PF_UNSPEC, RTM_GETACTION, tc_ctl_action, tc_dump_action, NULL);
 
 	return 0;
 }
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index bb2c523..9563887 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -610,10 +610,10 @@ EXPORT_SYMBOL(tcf_exts_dump_stats);
 
 static int __init tc_filter_init(void)
 {
-	rtnl_register(PF_UNSPEC, RTM_NEWTFILTER, tc_ctl_tfilter, NULL);
-	rtnl_register(PF_UNSPEC, RTM_DELTFILTER, tc_ctl_tfilter, NULL);
+	rtnl_register(PF_UNSPEC, RTM_NEWTFILTER, tc_ctl_tfilter, NULL, NULL);
+	rtnl_register(PF_UNSPEC, RTM_DELTFILTER, tc_ctl_tfilter, NULL, NULL);
 	rtnl_register(PF_UNSPEC, RTM_GETTFILTER, tc_ctl_tfilter,
-						 tc_dump_tfilter);
+		      tc_dump_tfilter, NULL);
 
 	return 0;
 }
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 7490f3f..7870a92 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1794,12 +1794,12 @@ static int __init pktsched_init(void)
 	register_qdisc(&pfifo_head_drop_qdisc_ops);
 	register_qdisc(&mq_qdisc_ops);
 
-	rtnl_register(PF_UNSPEC, RTM_NEWQDISC, tc_modify_qdisc, NULL);
-	rtnl_register(PF_UNSPEC, RTM_DELQDISC, tc_get_qdisc, NULL);
-	rtnl_register(PF_UNSPEC, RTM_GETQDISC, tc_get_qdisc, tc_dump_qdisc);
-	rtnl_register(PF_UNSPEC, RTM_NEWTCLASS, tc_ctl_tclass, NULL);
-	rtnl_register(PF_UNSPEC, RTM_DELTCLASS, tc_ctl_tclass, NULL);
-	rtnl_register(PF_UNSPEC, RTM_GETTCLASS, tc_ctl_tclass, tc_dump_tclass);
+	rtnl_register(PF_UNSPEC, RTM_NEWQDISC, tc_modify_qdisc, NULL, NULL);
+	rtnl_register(PF_UNSPEC, RTM_DELQDISC, tc_get_qdisc, NULL, NULL);
+	rtnl_register(PF_UNSPEC, RTM_GETQDISC, tc_get_qdisc, tc_dump_qdisc, NULL);
+	rtnl_register(PF_UNSPEC, RTM_NEWTCLASS, tc_ctl_tclass, NULL, NULL);
+	rtnl_register(PF_UNSPEC, RTM_DELTCLASS, tc_ctl_tclass, NULL, NULL);
+	rtnl_register(PF_UNSPEC, RTM_GETTCLASS, tc_ctl_tclass, tc_dump_tclass, NULL);
 
 	return 0;
 }
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index c658cb3..8bd79c8 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -2299,7 +2299,7 @@ static int xfrm_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 		if (link->dump == NULL)
 			return -EINVAL;
 
-		return netlink_dump_start(net->xfrm.nlsk, skb, nlh, link->dump, link->done);
+		return netlink_dump_start(net->xfrm.nlsk, skb, nlh, link->dump, link->done, 0);
 	}
 
 	err = nlmsg_parse(nlh, xfrm_msg_min[type], attrs, XFRMA_MAX,


^ permalink raw reply related

* Re: How to get backtrace? modprobe -r iwlagn; modprobe iwlagn kills kernel
From: Milton Miller @ 2011-05-18 15:55 UTC (permalink / raw)
  To: Stevie Trujillo; +Cc: linux-kernel, linux-wireless, netdev
In-Reply-To: <201105181637.46164.stevie.trujillo@gmail.com>

On Wed May 18 2011 about 10:37:54 EST, Stevie Trujillo wrote:
> I'm having some problem getting a backtrace. When I do "modprobe -r
> iwlagn; modprobe iwlagn" (2.6.38.6 with Intel-1030N) the kernel crashes
> (sometimes I need to try 2 or 3 times, and sometimes the modprobe -r
> is the one that crashes).
> 
> This spams my monitor with several oops/panics before it finally
> dies. I can only see bottom of the last one, which probably isn't
> very meaningful.  I tried loading netconsole, but I only get one or
> two lines before it stops sending/printing.
> 
> I then tried kexec+crashkernel, but I only managed to get a backtrace
> for the last panic (which I think is just a result of memory corruption
> or something).  Finally I tried compiling ramoops into my kernel,
> but it didn't want to load because of "No such device".

For the kexec+crashkernel, try to retrieve the kernel log buffer:

There are some macros available, but basically log_buf is a pointer
to a buffer of length log_buf_len initialized to __log_buf (but can
be expanded via the command line or sysctl, in). log_end is the
end of the buffer.  If you haven't wrapped then just print log_end
characters.

Also consider setting /proc/sys/kernel/panic_on_oops to 1 to
concentrate on the first one.  It won't help getting the oops to your
syslog but maybe it will keep it on the screen.

Hope this helps,
milton

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Shirley Ma @ 2011-05-18 16:02 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: Michael S. Tsirkin, Ben Hutchings, David Miller, Eric Dumazet,
	Avi Kivity, Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <1305729507.32080.6.camel@localhost.localdomain>

On Wed, 2011-05-18 at 07:38 -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 13:40 +0200, Michał Mirosław wrote:
> > >> >> Not more other restrictions, skb clone is OK.
> pskb_expand_head()
> > looks
> > >> >> OK to me from code review.
> > >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> > >> > references to pages. How is that ok? What do I miss?
> > >> It's making copy of the skb_shinfo earlier, so the pages refcount
> > >> stays the same.
> > > Exactly. But the callback is invoked so the guest thinks it's ok
> to
> > > change this memory. If it does a corrupted packet will be sent
> out.
> > 
> > Hmm. I tool a quick look at skb_clone(), and it looks like this
> > sequence will break this scheme:
> > 
> > skb2 = skb_clone(skb...);
> > kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> > [use skb2, pages still referenced]
> > kfree_skb(skb); /* callback called again */
> > 
> > This sequence is common in bridge, might be in other places.
> > 
> > Maybe this ubuf thing should just track clones? This will make it
> work
> > on all devices then.
> 
> The callback was only invoked when last reference of skb was gone.
> skb_clone does increase skb refcnt. I tested tcpdump on lower device,
> it
> worked.
> 
> For the sequence of:
> 
> skb_clone  -> last refcnt + 1
> kfree_skb() or pskb_expand_head -> callback not called
> kfree_skb() -> callback called
> 
> I will check page refcount to see whether it's balanced. 

The page refcounts are balanced too. 

In macvtap/vhost Real NIC zerocopy case, it always goes to fastpath in
pskb_expand_head, so I didn't hit any issue.

But rethinking about pskb_expand_head(), it calls skb_release_data() to
free old skb head when it's not in the fastpath (pskb_expand_head is not
the last reference of this skb); And it's impossible to track which skb
head (old one or new one) will be the last one to free. So better to
return error for zero-copy skbs when not using fastpath. Does it make
sense? 

Besides this, any other issue?


Thanks
Shirley

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: make sure rtnl is held in rtnl_fill_ifinfo()
From: Eric Dumazet @ 2011-05-18 16:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <1305711303.2983.24.camel@edumazet-laptop>

Le mercredi 18 mai 2011 à 11:35 +0200, Eric Dumazet a écrit :
> Commit e67f88dd12f610 (net: dont hold rtnl mutex during netlink dump
> callbacks) added a problem in rtnl_fill_ifinfo() not being always called
> with RTNL held, which is racy.
> 
> 1) This patch extends rtnl_mutex helper functions so that :
> 
> rtnl_lock() is able to BUG_ON() if recursively called.
> [This was only provided if LOCKDEP was on]
> 
> rtnl_is_locked() is able to check if current thread owns rtnl_mutex
> [ASSERT_RTNL() gets this added feature too]
> 
> 2) Add one ASSERT_RTNL() in rtnl_fill_ifinfo()
> 
> 3) Make sure rtnl is held in rtnl_dump_ifinfo()
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> --

Hmm, please disregard, there is a lock depency problem here

netlink_dump() does a mutex_lock(nlk->cb_mutex)
So if we try to lock rtnl_mutex later in rtnl_dump_ifinfo(), we can
deadlock with another task doing the reverse.

(first lock rtnl_mutex, then nlk->cb_mutex)





^ permalink raw reply

* Re: How to get backtrace? modprobe -r iwlagn; modprobe iwlagn kills kernel
From: Américo Wang @ 2011-05-18 16:02 UTC (permalink / raw)
  To: Milton Miller; +Cc: Stevie Trujillo, linux-kernel, linux-wireless, netdev
In-Reply-To: <1305734101_6569@mail4.comsite.net>

On Wed, May 18, 2011 at 11:55 PM, Milton Miller <miltonm@bga.com> wrote:
> On Wed May 18 2011 about 10:37:54 EST, Stevie Trujillo wrote:
>> I'm having some problem getting a backtrace. When I do "modprobe -r
>> iwlagn; modprobe iwlagn" (2.6.38.6 with Intel-1030N) the kernel crashes
>> (sometimes I need to try 2 or 3 times, and sometimes the modprobe -r
>> is the one that crashes).
>>
>> This spams my monitor with several oops/panics before it finally
>> dies. I can only see bottom of the last one, which probably isn't
>> very meaningful.  I tried loading netconsole, but I only get one or
>> two lines before it stops sending/printing.
>>
>> I then tried kexec+crashkernel, but I only managed to get a backtrace
>> for the last panic (which I think is just a result of memory corruption
>> or something).  Finally I tried compiling ramoops into my kernel,
>> but it didn't want to load because of "No such device".
>
> For the kexec+crashkernel, try to retrieve the kernel log buffer:
>

There is a utility in kexec-tools, named vmcore-dmesg, which
is supposed to do this kind of thing.

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Shirley Ma @ 2011-05-18 16:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Michał Mirosław, Ben Hutchings, David Miller,
	Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel
In-Reply-To: <20110518154746.GA21378@redhat.com>

On Wed, 2011-05-18 at 18:47 +0300, Michael S. Tsirkin wrote:
> On Wed, May 18, 2011 at 07:38:27AM -0700, Shirley Ma wrote:
> > On Wed, 2011-05-18 at 13:40 +0200, Michał Mirosław wrote:
> > > >> >> Not more other restrictions, skb clone is OK.
> pskb_expand_head()
> > > looks
> > > >> >> OK to me from code review.
> > > >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> > > >> > references to pages. How is that ok? What do I miss?
> > > >> It's making copy of the skb_shinfo earlier, so the pages
> refcount
> > > >> stays the same.
> > > > Exactly. But the callback is invoked so the guest thinks it's ok
> to
> > > > change this memory. If it does a corrupted packet will be sent
> out.
> > > 
> > > Hmm. I tool a quick look at skb_clone(), and it looks like this
> > > sequence will break this scheme:
> > > 
> > > skb2 = skb_clone(skb...);
> > > kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> > > [use skb2, pages still referenced]
> > > kfree_skb(skb); /* callback called again */
> > > 
> > > This sequence is common in bridge, might be in other places.
> > > 
> > > Maybe this ubuf thing should just track clones? This will make it
> work
> > > on all devices then.
> > 
> > The callback was only invoked when last reference of skb was gone.
> > skb_clone does increase skb refcnt. I tested tcpdump on lower
> device, it
> > worked.
> 
> Right, it will normally work, but two issues I think you miss:
> 1. malicious guest can change the memory between when it is sent out
> by
>    device and consumed by tcpdump, so you will see different things
>    (not sure how important this is).
> 2. if tcpdump stops consuming stuff from the packet socket (it's
>    userspace, can't be trusted) then we won't get a callback for
>    page potentially forever, guest networking will get blocked etc.
> > For the sequence of:
> > 
> > skb_clone  -> last refcnt + 1
> > kfree_skb() or pskb_expand_head -> callback not called
> > kfree_skb() -> callback called
> > 
> > I will check page refcount to see whether it's balanced.
> > 
> > Thanks
> > shirley
> 
> 
> pskb_expand_head is a problem anyway I think as it
> can hang on to pages after it calls release_data.
> Then guest will modify these pages and you get trash there. 

This can be avoid by allowing pskb_expand_head in fastpath only, I
think. But not sure whether tcpdump can still work with this.

Thanks
Shirley

^ permalink raw reply

* Re: [PATCH net-next] bridge: add notification over netlink when STP changes state
From: Ben Greear @ 2011-05-18 16:18 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, bridge, netdev
In-Reply-To: <20110518083948.261ded68@nehalam>

On 05/18/2011 08:39 AM, Stephen Hemminger wrote:
> On Wed, 18 May 2011 08:19:40 -0700
> Ben Greear<greearb@candelatech.com>  wrote:
>
>> On 05/18/2011 08:17 AM, Stephen Hemminger wrote:
>>> The first netlink code in the bridge module was to notify
>>> user space implementations of Spanning Tree Protocol about
>>> new ports. It did not handle the case of kernel mode STP
>>> changing states which could be useful for monitoring.
>>>
>>> This patch causes RTM_NEWLINK message to occur on kernel
>>> transitions. It does not send message if request was from user
>>> space STP, since that would cause reflection and break existing
>>> API.
>>
>> So one app monitoring the system won't get updates if another
>> app makes changes?
>>
>> Ben
>
> For now, yes. Just don't want to break any existing users
> which is more important.

You can get RTM_NEWLINK messages for various reasons..it would
seem a non-buggy program should have no problem dealing with one
more source.

It seems like this code is going to be inconvenient
for anyone wanting to monitor bridges using netlink, and
even if you fix it in the future, there is going to be
backwards compatible hacks that are likely to be quite
ugly!

Thanks,
Ben

>


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michael S. Tsirkin @ 2011-05-18 16:23 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michał Mirosław, Ben Hutchings, David Miller,
	Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel
In-Reply-To: <1305734543.32080.50.camel@localhost.localdomain>

On Wed, May 18, 2011 at 09:02:23AM -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 07:38 -0700, Shirley Ma wrote:
> > On Wed, 2011-05-18 at 13:40 +0200, Michał Mirosław wrote:
> > > >> >> Not more other restrictions, skb clone is OK.
> > pskb_expand_head()
> > > looks
> > > >> >> OK to me from code review.
> > > >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> > > >> > references to pages. How is that ok? What do I miss?
> > > >> It's making copy of the skb_shinfo earlier, so the pages refcount
> > > >> stays the same.
> > > > Exactly. But the callback is invoked so the guest thinks it's ok
> > to
> > > > change this memory. If it does a corrupted packet will be sent
> > out.
> > > 
> > > Hmm. I tool a quick look at skb_clone(), and it looks like this
> > > sequence will break this scheme:
> > > 
> > > skb2 = skb_clone(skb...);
> > > kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> > > [use skb2, pages still referenced]
> > > kfree_skb(skb); /* callback called again */
> > > 
> > > This sequence is common in bridge, might be in other places.
> > > 
> > > Maybe this ubuf thing should just track clones? This will make it
> > work
> > > on all devices then.
> > 
> > The callback was only invoked when last reference of skb was gone.
> > skb_clone does increase skb refcnt. I tested tcpdump on lower device,
> > it
> > worked.
> > 
> > For the sequence of:
> > 
> > skb_clone  -> last refcnt + 1
> > kfree_skb() or pskb_expand_head -> callback not called
> > kfree_skb() -> callback called
> > 
> > I will check page refcount to see whether it's balanced. 
> 
> The page refcounts are balanced too. 
> 
> In macvtap/vhost Real NIC zerocopy case, it always goes to fastpath in
> pskb_expand_head, so I didn't hit any issue.
> 
> But rethinking about pskb_expand_head(), it calls skb_release_data() to
> free old skb head when it's not in the fastpath (pskb_expand_head is not
> the last reference of this skb); And it's impossible to track which skb
> head (old one or new one) will be the last one to free. So better to
> return error for zero-copy skbs when not using fastpath. Does it make
> sense? 

I'm not sure it does. Look e.g. at tg3 - if expand_head fails
packet gets dropped. No crash but unlikely to perform well :).

> Besides this, any other issue?
> 
> 
> Thanks
> Shirley

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michael S. Tsirkin @ 2011-05-18 16:36 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michał Mirosław, Ben Hutchings, David Miller,
	Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel
In-Reply-To: <1305734857.32080.53.camel@localhost.localdomain>

On Wed, May 18, 2011 at 09:07:37AM -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 18:47 +0300, Michael S. Tsirkin wrote:
> > On Wed, May 18, 2011 at 07:38:27AM -0700, Shirley Ma wrote:
> > > On Wed, 2011-05-18 at 13:40 +0200, Michał Mirosław wrote:
> > > > >> >> Not more other restrictions, skb clone is OK.
> > pskb_expand_head()
> > > > looks
> > > > >> >> OK to me from code review.
> > > > >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> > > > >> > references to pages. How is that ok? What do I miss?
> > > > >> It's making copy of the skb_shinfo earlier, so the pages
> > refcount
> > > > >> stays the same.
> > > > > Exactly. But the callback is invoked so the guest thinks it's ok
> > to
> > > > > change this memory. If it does a corrupted packet will be sent
> > out.
> > > > 
> > > > Hmm. I tool a quick look at skb_clone(), and it looks like this
> > > > sequence will break this scheme:
> > > > 
> > > > skb2 = skb_clone(skb...);
> > > > kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> > > > [use skb2, pages still referenced]
> > > > kfree_skb(skb); /* callback called again */
> > > > 
> > > > This sequence is common in bridge, might be in other places.
> > > > 
> > > > Maybe this ubuf thing should just track clones? This will make it
> > work
> > > > on all devices then.
> > > 
> > > The callback was only invoked when last reference of skb was gone.
> > > skb_clone does increase skb refcnt. I tested tcpdump on lower
> > device, it
> > > worked.
> > 
> > Right, it will normally work, but two issues I think you miss:
> > 1. malicious guest can change the memory between when it is sent out
> > by
> >    device and consumed by tcpdump, so you will see different things
> >    (not sure how important this is).
> > 2. if tcpdump stops consuming stuff from the packet socket (it's
> >    userspace, can't be trusted) then we won't get a callback for
> >    page potentially forever, guest networking will get blocked etc.
> > > For the sequence of:
> > > 
> > > skb_clone  -> last refcnt + 1
> > > kfree_skb() or pskb_expand_head -> callback not called
> > > kfree_skb() -> callback called
> > > 
> > > I will check page refcount to see whether it's balanced.
> > > 
> > > Thanks
> > > shirley
> > 
> > 
> > pskb_expand_head is a problem anyway I think as it
> > can hang on to pages after it calls release_data.
> > Then guest will modify these pages and you get trash there. 
> 
> This can be avoid by allowing pskb_expand_head in fastpath only, I
> think. But not sure whether tcpdump can still work with this.
> 
> Thanks
> Shirley

Yes, I agree.  I think for tcpdump, we really need to copy the data
anyway, to avoid guest changing it in between.  So we do that and then
use the copy everywhere, release the old one. Hmm?

-- 
MST

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Shirley Ma @ 2011-05-18 16:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Michał Mirosław, Ben Hutchings, David Miller,
	Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel
In-Reply-To: <20110518163633.GB22001@redhat.com>

On Wed, 2011-05-18 at 19:36 +0300, Michael S. Tsirkin wrote:
> On Wed, May 18, 2011 at 09:07:37AM -0700, Shirley Ma wrote:
> > On Wed, 2011-05-18 at 18:47 +0300, Michael S. Tsirkin wrote:
> > > On Wed, May 18, 2011 at 07:38:27AM -0700, Shirley Ma wrote:
> > > > On Wed, 2011-05-18 at 13:40 +0200, Michał Mirosław wrote:
> > > > > >> >> Not more other restrictions, skb clone is OK.
> > > pskb_expand_head()
> > > > > looks
> > > > > >> >> OK to me from code review.
> > > > > >> > Hmm. pskb_expand_head calls skb_release_data while
> keeping
> > > > > >> > references to pages. How is that ok? What do I miss?
> > > > > >> It's making copy of the skb_shinfo earlier, so the pages
> > > refcount
> > > > > >> stays the same.
> > > > > > Exactly. But the callback is invoked so the guest thinks
> it's ok
> > > to
> > > > > > change this memory. If it does a corrupted packet will be
> sent
> > > out.
> > > > > 
> > > > > Hmm. I tool a quick look at skb_clone(), and it looks like
> this
> > > > > sequence will break this scheme:
> > > > > 
> > > > > skb2 = skb_clone(skb...);
> > > > > kfree_skb(skb) or pskb_expand_head(skb);  /* callback called
> */
> > > > > [use skb2, pages still referenced]
> > > > > kfree_skb(skb); /* callback called again */
> > > > > 
> > > > > This sequence is common in bridge, might be in other places.
> > > > > 
> > > > > Maybe this ubuf thing should just track clones? This will make
> it
> > > work
> > > > > on all devices then.
> > > > 
> > > > The callback was only invoked when last reference of skb was
> gone.
> > > > skb_clone does increase skb refcnt. I tested tcpdump on lower
> > > device, it
> > > > worked.
> > > 
> > > Right, it will normally work, but two issues I think you miss:
> > > 1. malicious guest can change the memory between when it is sent
> out
> > > by
> > >    device and consumed by tcpdump, so you will see different
> things
> > >    (not sure how important this is).
> > > 2. if tcpdump stops consuming stuff from the packet socket (it's
> > >    userspace, can't be trusted) then we won't get a callback for
> > >    page potentially forever, guest networking will get blocked
> etc.
> > > > For the sequence of:
> > > > 
> > > > skb_clone  -> last refcnt + 1
> > > > kfree_skb() or pskb_expand_head -> callback not called
> > > > kfree_skb() -> callback called
> > > > 
> > > > I will check page refcount to see whether it's balanced.
> > > > 
> > > > Thanks
> > > > shirley
> > > 
> > > 
> > > pskb_expand_head is a problem anyway I think as it
> > > can hang on to pages after it calls release_data.
> > > Then guest will modify these pages and you get trash there. 
> > 
> > This can be avoid by allowing pskb_expand_head in fastpath only, I
> > think. But not sure whether tcpdump can still work with this.
> > 
> > Thanks
> > Shirley
> 
> Yes, I agree.  I think for tcpdump, we really need to copy the data
> anyway, to avoid guest changing it in between.  So we do that and then
> use the copy everywhere, release the old one. Hmm? 

Yes. Old one use zerocopy, new one use copy data.

Thanks
Shirley

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michael S. Tsirkin @ 2011-05-18 16:50 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: Shirley Ma, Ben Hutchings, David Miller, Eric Dumazet, Avi Kivity,
	Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <BANLkTi=g_PFfS5653K8ZoQ2Jhp8DhCV1hg@mail.gmail.com>

On Wed, May 18, 2011 at 01:40:29PM +0200, Michał Mirosław wrote:
> W dniu 18 maja 2011 13:17 użytkownik Michael S. Tsirkin
> <mst@redhat.com> napisał:
> > On Wed, May 18, 2011 at 01:10:50PM +0200, Michał Mirosław wrote:
> >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
> >> >> On Tue, 2011-05-17 at 23:48 +0200, Michał Mirosław wrote:
> >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> >> >> > > Hello Michael,
> >> >> > >
> >> >> > > Looks like to use a new flag requires more time/work. I am thinking
> >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
> >> >> > to
> >> >> > > avoid the new flag for now since mavctap uses real NICs as lower
> >> >> > device?
> >> >> >
> >> >> > Is there any other restriction besides requiring driver to not recycle
> >> >> > the skb? Are there any drivers that recycle TX skbs?
> >> >
> >> > Not just recycling skbs, keeping reference to any of the pages in the
> >> > skb. Another requirement is to invoke the callback
> >> > in a timely fashion.  For example virtio-net doesn't limit the time until
> >> > that happens (skbs are only freed when some other packet is
> >> > transmitted), so we need to avoid zcopy for such (nested-virt)
> >> > scenarious, right?
> >>
> >> Hmm. But every hardware driver supporting SG will keep reference to
> >> the pages until the packet is sent (or DMA'd to the device). This can
> >> take a long time if hardware queue happens to stall for some reason.
> >
> > That's a fundamental property of zero copy transmit.
> > You can't let the application/guest reuse the memory until
> > no one looks at it anymore.
> >
> >> Is it that you mean keeping a reference after all skbs pointing to the
> >> pages are released?
> > No one should reference the pages after the callback is invoked, yes.
> 
> >> >> Not more other restrictions, skb clone is OK. pskb_expand_head() looks
> >> >> OK to me from code review.
> >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> >> > references to pages. How is that ok? What do I miss?
> >> It's making copy of the skb_shinfo earlier, so the pages refcount
> >> stays the same.
> > Exactly. But the callback is invoked so the guest thinks it's ok to
> > change this memory. If it does a corrupted packet will be sent out.
> 
> Hmm. I tool a quick look at skb_clone(), and it looks like this
> sequence will break this scheme:
> 
> skb2 = skb_clone(skb...);
> kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> [use skb2, pages still referenced]
> kfree_skb(skb); /* callback called again */
> This sequence is common in bridge, might be in other places.
> 
> Maybe this ubuf thing should just track clones? This will make it work
> on all devices then.
> 
> Best Regards,
> Michał Mirosław

Well bridge has the problem that packet might get anywhere and it's
really hard to track. Same for tun - it can get queued forever.
veth, loopback are all a problem I think.

IOW we really want to limit this to real physical NICs
which mostly all DTRT. Whitelisting them with a new flag
is likely the most concervative approach, no?

-- 
MST

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michael S. Tsirkin @ 2011-05-18 16:51 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michał Mirosław, Ben Hutchings, David Miller,
	Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel
In-Reply-To: <1305737140.32080.59.camel@localhost.localdomain>

On Wed, May 18, 2011 at 09:45:40AM -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 19:36 +0300, Michael S. Tsirkin wrote:
> > On Wed, May 18, 2011 at 09:07:37AM -0700, Shirley Ma wrote:
> > > On Wed, 2011-05-18 at 18:47 +0300, Michael S. Tsirkin wrote:
> > > > On Wed, May 18, 2011 at 07:38:27AM -0700, Shirley Ma wrote:
> > > > > On Wed, 2011-05-18 at 13:40 +0200, Michał Mirosław wrote:
> > > > > > >> >> Not more other restrictions, skb clone is OK.
> > > > pskb_expand_head()
> > > > > > looks
> > > > > > >> >> OK to me from code review.
> > > > > > >> > Hmm. pskb_expand_head calls skb_release_data while
> > keeping
> > > > > > >> > references to pages. How is that ok? What do I miss?
> > > > > > >> It's making copy of the skb_shinfo earlier, so the pages
> > > > refcount
> > > > > > >> stays the same.
> > > > > > > Exactly. But the callback is invoked so the guest thinks
> > it's ok
> > > > to
> > > > > > > change this memory. If it does a corrupted packet will be
> > sent
> > > > out.
> > > > > > 
> > > > > > Hmm. I tool a quick look at skb_clone(), and it looks like
> > this
> > > > > > sequence will break this scheme:
> > > > > > 
> > > > > > skb2 = skb_clone(skb...);
> > > > > > kfree_skb(skb) or pskb_expand_head(skb);  /* callback called
> > */
> > > > > > [use skb2, pages still referenced]
> > > > > > kfree_skb(skb); /* callback called again */
> > > > > > 
> > > > > > This sequence is common in bridge, might be in other places.
> > > > > > 
> > > > > > Maybe this ubuf thing should just track clones? This will make
> > it
> > > > work
> > > > > > on all devices then.
> > > > > 
> > > > > The callback was only invoked when last reference of skb was
> > gone.
> > > > > skb_clone does increase skb refcnt. I tested tcpdump on lower
> > > > device, it
> > > > > worked.
> > > > 
> > > > Right, it will normally work, but two issues I think you miss:
> > > > 1. malicious guest can change the memory between when it is sent
> > out
> > > > by
> > > >    device and consumed by tcpdump, so you will see different
> > things
> > > >    (not sure how important this is).
> > > > 2. if tcpdump stops consuming stuff from the packet socket (it's
> > > >    userspace, can't be trusted) then we won't get a callback for
> > > >    page potentially forever, guest networking will get blocked
> > etc.
> > > > > For the sequence of:
> > > > > 
> > > > > skb_clone  -> last refcnt + 1
> > > > > kfree_skb() or pskb_expand_head -> callback not called
> > > > > kfree_skb() -> callback called
> > > > > 
> > > > > I will check page refcount to see whether it's balanced.
> > > > > 
> > > > > Thanks
> > > > > shirley
> > > > 
> > > > 
> > > > pskb_expand_head is a problem anyway I think as it
> > > > can hang on to pages after it calls release_data.
> > > > Then guest will modify these pages and you get trash there. 
> > > 
> > > This can be avoid by allowing pskb_expand_head in fastpath only, I
> > > think. But not sure whether tcpdump can still work with this.
> > > 
> > > Thanks
> > > Shirley
> > 
> > Yes, I agree.  I think for tcpdump, we really need to copy the data
> > anyway, to avoid guest changing it in between.  So we do that and then
> > use the copy everywhere, release the old one. Hmm? 
> 
> Yes. Old one use zerocopy, new one use copy data.
> 
> Thanks
> Shirley

No, that's wrong, as they might become different with a
malicious guest. As long as we copied already, lets realease
the data and have everyone use the copy.

-- 
MST

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Shirley Ma @ 2011-05-18 17:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Michał Mirosław, Ben Hutchings, David Miller,
	Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel
In-Reply-To: <20110518165138.GD22001@redhat.com>

On Wed, 2011-05-18 at 19:51 +0300, Michael S. Tsirkin wrote:
> > > Yes, I agree.  I think for tcpdump, we really need to copy the
> data
> > > anyway, to avoid guest changing it in between.  So we do that and
> then
> > > use the copy everywhere, release the old one. Hmm? 
> > 
> > Yes. Old one use zerocopy, new one use copy data.
> > 
> > Thanks
> > Shirley
> 
> No, that's wrong, as they might become different with a
> malicious guest. As long as we copied already, lets realease
> the data and have everyone use the copy. 

Ok, I will patch pskb_expand_head to test it out.

Shirley

^ permalink raw reply

* RE: [PATCH v2] CDC NCM: release interfaces fix in unbind()
From: Alexey ORISHKO @ 2011-05-18 17:11 UTC (permalink / raw)
  To: Alan Stern, gregkh@suse.de, oliver@neukum.org
  Cc: netdev@vger.kernel.org, davem@davemloft.net,
	linux-usb@vger.kernel.org
In-Reply-To: <Pine.LNX.4.44L0.1105181053110.2131-100000@iolanthe.rowland.org>

> From: Alan Stern [mailto:stern@rowland.harvard.edu]
> Sent: Wednesday, May 18, 2011 4:55 PM
> 
> Here and later on, the patch seems to have forgotten about the control
> interface.  Is this deliberate or an oversight?
> 
> Alan Stern

Kernel docs says, that usb_driver_claim_interface() is used by usb
device drivers that need to claim more than one interface. I assume,
it's needed for second interface only. Am I wrong?

NCM driver was partly based on ECM code. I did check existing drivers
for CDC ACM and CDC ECM, which also uses 2 interfaces: master (control)
and data (bulk), but I was unable to find any code, which claims master
interface.

If we need explicitly claim/release master interface (which is *intf
parameter in bind() and unbind()), then cdc-acm and usb_ether drivers
should be updated as well.

I wonder, if Greg or Oliver could provide any comments why master interface
is not claimed in modem/ether drivers, since they are working with the
code for quite a while.

Regards,
Alexey

^ permalink raw reply

* Re: [PATCH] ethtool: ETHTOOL_SFEATURES: remove NETIF_F_COMPAT return
From: Ben Hutchings @ 2011-05-18 19:02 UTC (permalink / raw)
  To: Michał Mirosław; +Cc: netdev, David Miller
In-Reply-To: <1305583775.2885.65.camel@bwh-desktop>

On Mon, 2011-05-16 at 23:09 +0100, Ben Hutchings wrote:
> On Mon, 2011-05-16 at 23:50 +0200, Michał Mirosław wrote:
> > On Mon, May 16, 2011 at 10:08:59PM +0100, Ben Hutchings wrote:
[...]
> > > I've explained before that I do not want to add new options to do
> > > (mostly) the same thing.  Users should have not have to use a different
> > > command depending on the kernel version.
> > 
> > We can avoid new option by checking feature-strings for unrecognised
> > arguments to -K. This way, we will have the old options which work
> > regardless of kernel version ('tx', 'rx', 'sg', etc.) and new options
> > which need recent kernel anyway (separated 'tx-checksum-*', 'loopback',
> > others coming in for 2.6.40).
> 
> This is just too subtle a distinction.  It will mostly confuse users.
[...]

Sorry, I think I misunderstood you here.  I agree that new feature names
that do not correspond exactly to existing keywords should be supported
as keywords after the -K option.  I think those that do (e.g.
"tx-udp-fragmentation" vs "ufo") should not be, as adding a
kernel-version-dependent *alias* would be confusing.

I also want users to benefit from your improvements (as I explained
above) even when they use the old names, if they are using a new kernel
version.  That is why I want ethtool to try using ETHTOOL_SFEATURES
first, and why the fallback in the kernel is problematic.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* RE: [PATCH v2] CDC NCM: release interfaces fix in unbind()
From: Alan Stern @ 2011-05-18 19:10 UTC (permalink / raw)
  To: Alexey ORISHKO
  Cc: gregkh@suse.de, oliver@neukum.org, netdev@vger.kernel.org,
	davem@davemloft.net, linux-usb@vger.kernel.org
In-Reply-To: <2AC7D4AD8BA1C640B4C60C61C8E520153E3ACE0774@EXDCVYMBSTM006.EQ1STM.local>

On Wed, 18 May 2011, Alexey ORISHKO wrote:

> > From: Alan Stern [mailto:stern@rowland.harvard.edu]
> > Sent: Wednesday, May 18, 2011 4:55 PM
> > 
> > Here and later on, the patch seems to have forgotten about the control
> > interface.  Is this deliberate or an oversight?
> > 
> > Alan Stern
> 
> Kernel docs says, that usb_driver_claim_interface() is used by usb
> device drivers that need to claim more than one interface. I assume,
> it's needed for second interface only. Am I wrong?

Well, if a driver wants to claim three interfaces (which is more than 
one), it would call usb_driver_claim_interface() for both the second 
and third interfaces, not only the second.

In general, when the driver gets probed for any one of the interfaces,
it should identify all the interfaces it's interested in and claim
them.  However, it should skip the interface currently being probed --
that usb_driver_claim_interface() call would fail anyway since an
interface can't be claimed while it is being probed.

Similarly, when any of the interfaces is unbound, the driver should 
release them all.

> NCM driver was partly based on ECM code. I did check existing drivers
> for CDC ACM and CDC ECM, which also uses 2 interfaces: master (control)
> and data (bulk), but I was unable to find any code, which claims master
> interface.
> 
> If we need explicitly claim/release master interface (which is *intf
> parameter in bind() and unbind()), then cdc-acm and usb_ether drivers
> should be updated as well.

As far as I can tell, those drivers check that they are being probed
for the control interface, so all they need to claim is the data
interface.

You could do something similar -- have the bind routine return 
-ENODEV if it's not being called for the control interface.  But the 
unbind routine would still have to release both interfaces, since it 
can't rely on being called for the control interface first.

Alan Stern

^ permalink raw reply

* Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.
From: David Miller @ 2011-05-18 19:26 UTC (permalink / raw)
  To: tsunanet; +Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, netdev, linux-kernel
In-Reply-To: <1305715384-81716-1-git-send-email-tsunanet@gmail.com>

From: Benoit Sigoure <tsunanet@gmail.com>
Date: Wed, 18 May 2011 03:43:04 -0700

> Instead of hardcoding the initial RTO to 3s and requiring
> the kernel to be recompiled to change it, expose it as a
> sysctl that can be tuned at runtime.  Leave the default
> value unchanged.
> 
> Signed-off-by: Benoit Sigoure <tsunanet@gmail.com>

If you read the ietf draft that reduces the initial RTO down to 1
second, it states that if we take a timeout during the initial
connection handshake then we have to revert the RTO back up to 3
seconds.

This fallback logic conflicts with being able to only change the
initial RTO via sysctl, I think.  Because there are actually two
values at stake and they depend upon eachother, the initial RTO and
the value we fallback to on initial handshake retransmissions.

So I'd rather get a patch that implements the 1 second initial
RTO with the 3 second fallback on SYN retransmit, than this patch.

We already have too many knobs.

^ permalink raw reply

* Re: Bug, kernel panic, NULL dereference , cleanup_once / icmp_route_lookup.clone.19.clone / nat , 2.6.39-rc7-git11
From: Eric Dumazet @ 2011-05-18 19:29 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev
In-Reply-To: <1305733944.2991.45.camel@edumazet-laptop>

Le mercredi 18 mai 2011 à 17:52 +0200, Eric Dumazet a écrit :

> Hmm, it seems we have some inetpeer refcount leak somewhere.
> 
> Maybe one (struct rtable)->peer is not released on dst/rtable removal,
> or we also leak dst/rtable (and their ->peer inetpeer)
> 
> Watch :
> 
> grep peer /proc/slabinfo
> grep dst /proc/slabinfo
> 

FYI, I started a bisection to find the faulty commit.



^ permalink raw reply

* Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.
From: tsuna @ 2011-05-18 19:40 UTC (permalink / raw)
  To: David Miller
  Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, netdev, linux-kernel
In-Reply-To: <20110518.152653.1486764697527722925.davem@davemloft.net>

On Wed, May 18, 2011 at 12:26 PM, David Miller <davem@davemloft.net> wrote:
> If you read the ietf draft that reduces the initial RTO down to 1
> second, it states that if we take a timeout during the initial
> connection handshake then we have to revert the RTO back up to 3
> seconds.
>
> This fallback logic conflicts with being able to only change the
> initial RTO via sysctl, I think.  Because there are actually two
> values at stake and they depend upon eachother, the initial RTO and
> the value we fallback to on initial handshake retransmissions.
>
> So I'd rather get a patch that implements the 1 second initial
> RTO with the 3 second fallback on SYN retransmit, than this patch.
>
> We already have too many knobs.

I was hoping this knob would be accepted because this is such an
important issue that it even warrants an IETF draft to attempt to
change the standard.  I'm not sure how long it will take for this
draft to be accepted and then implemented, so I thought adding this
simple knob today would really help in the future.

Plus, should the draft be accepted, this knob will still be just as
useful (e.g. to revert back to today's behavior), and people might
want to consider adding another knob for the fallback initRTO (this is
debatable).  I don't believe this knob conflicts with the proposed
change to the standard, it actually goes along with it pretty well and
helps us prepare better for this upcoming change.

I agree that there are too many knobs, and I hate feature creep too,
but I've found many of these knobs to be really useful, and the degree
to which Linux's TCP stack can be tuned is part of what makes it so
versatile.

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox