Netdev List
 help / color / mirror / Atom feed
* Re: [net-next PATCH V6 2/2] qdisc: dequeue bulking also pickup GSO/TSO packets
From: Eric Dumazet @ 2014-10-02 14:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, David S. Miller, Tom Herbert, Hannes Frederic Sowa,
	Florian Westphal, Daniel Borkmann, Jamal Hadi Salim,
	Alexander Duyck, John Fastabend, Dave Taht, toke
In-Reply-To: <20141001203604.3321.91746.stgit@dragon>

On Wed, 2014-10-01 at 22:36 +0200, Jesper Dangaard Brouer wrote:
> The TSO and GSO segmented packets already benefit from bulking
> on their own.
> 

> 
> Signed-off-by: Florian Westphal <fw@strlen.de>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> ---
> 
>  net/sched/sch_generic.c |   12 +++---------
>  1 files changed, 3 insertions(+), 9 deletions(-)
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index c2e87e6..797ebef 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -63,10 +63,6 @@ static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q,
>  	struct sk_buff *skb, *tail_skb = head_skb;
>  
>  	while (bytelimit > 0) {
> -		/* For now, don't bulk dequeue GSO (or GSO segmented) pkts */
> -		if (tail_skb->next || skb_is_gso(tail_skb))
> -			break;
> -
>  		skb = q->dequeue(q);
>  		if (!skb)
>  			break;
> @@ -76,11 +72,9 @@ static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q,
>  		if (!skb)
>  			break;
>  
> -		/* "skb" can be a skb list after validate call above
> -		 * (GSO segmented), but it is okay to append it to
> -		 * current tail_skb->next, because next round will exit
> -		 * in-case "tail_skb->next" is a skb list.
> -		 */
> +		while (tail_skb->next) /* GSO list goto tail */
> +			tail_skb = tail_skb->next;
> +
>  		tail_skb->next = skb;
>  		tail_skb = skb;

Thanks a lot guys. I am testing this patch set today.

^ permalink raw reply

* [PATCH] drivers/net/irda/Kconfig: Let SH_IRDA depend on HAS_IOMEM
From: Chen Gang @ 2014-10-02 14:32 UTC (permalink / raw)
  To: samuel
  Cc: Richard Weinberger, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org

SH_IRDA needs HAS_IOMEM, so depend on it. The related error(with
allmodconfig under um):

    CC [M]  drivers/net/irda/sh_irda.o
  drivers/net/irda/sh_irda.c: In function ‘sh_irda_probe’:
  drivers/net/irda/sh_irda.c:776:2: error: implicit declaration of function ‘ioremap_nocache’ [-Werror=implicit-function-declaration]
    self->membase = ioremap_nocache(res->start, resource_size(res));
    ^
  drivers/net/irda/sh_irda.c:776:16: warning: assignment makes pointer from integer without a cast [enabled by default]
    self->membase = ioremap_nocache(res->start, resource_size(res));
                  ^
  drivers/net/irda/sh_irda.c:821:2: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
    iounmap(self->membase);
    ^

Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
---
 drivers/net/irda/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/irda/Kconfig b/drivers/net/irda/Kconfig
index 8d101d6..a2c227b 100644
--- a/drivers/net/irda/Kconfig
+++ b/drivers/net/irda/Kconfig
@@ -397,7 +397,7 @@ config MCS_FIR
 config SH_IRDA
 	tristate "SuperH IrDA driver"
 	depends on IRDA
-	depends on ARCH_SHMOBILE || COMPILE_TEST
+	depends on (ARCH_SHMOBILE || COMPILE_TEST) && HAS_IOMEM
 	help
 	  Say Y here if your want to enable SuperH IrDA devices.
 
-- 
1.9.3

^ permalink raw reply related

* [PATCH] drivers/net/ethernet/marvell/Kconfig: Let PXA168_ETH depend on HAS_IOMEM
From: Chen Gang @ 2014-10-02 14:23 UTC (permalink / raw)
  To: davem, antoine.tenart, arnd, jason, richard, mw, thomas.petazzoni
  Cc: netdev, linux-kernel@vger.kernel.org

PXA168_ETH need HAS_IOMEM, so depend on it, the related error (with
allmodconfig under um):

    CC [M]  drivers/net/ethernet/marvell/pxa168_eth.o
  drivers/net/ethernet/marvell/pxa168_eth.c: In function ‘pxa168_eth_probe’:
  drivers/net/ethernet/marvell/pxa168_eth.c:1605:2: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
    iounmap(pep->base);
    ^

Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
---
 drivers/net/ethernet/marvell/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig
index bed8fbb..b3b72ad 100644
--- a/drivers/net/ethernet/marvell/Kconfig
+++ b/drivers/net/ethernet/marvell/Kconfig
@@ -64,7 +64,7 @@ config MVPP2
 
 config PXA168_ETH
 	tristate "Marvell pxa168 ethernet support"
-	depends on CPU_PXA168 || ARCH_BERLIN || COMPILE_TEST
+	depends on (CPU_PXA168 || ARCH_BERLIN || COMPILE_TEST) && HAS_IOMEM
 	select PHYLIB
 	---help---
 	  This driver supports the pxa168 Ethernet ports.
-- 
1.9.3

^ permalink raw reply related

* [PATCH] drivers/net/dsa/Kconfig: Let NET_DSA_BCM_SF2 depend on HAS_IOMEM
From: Chen Gang @ 2014-10-02 14:14 UTC (permalink / raw)
  To: davem, f.fainelli, leitec, andrew
  Cc: netdev@vger.kernel.org, Richard Weinberger,
	linux-kernel@vger.kernel.org

NET_DSA_BCM_SF2 need HAS_IOMEM, so depend on it, the related error (with
allmodconfig under um):

    CC [M]  drivers/net/dsa/bcm_sf2.o
  drivers/net/dsa/bcm_sf2.c: In function ‘bcm_sf2_sw_setup’:
  drivers/net/dsa/bcm_sf2.c:487:3: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
     iounmap(*base);
     ^

Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
---
 drivers/net/dsa/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
index ea0697e..9234d80 100644
--- a/drivers/net/dsa/Kconfig
+++ b/drivers/net/dsa/Kconfig
@@ -47,6 +47,7 @@ config NET_DSA_MV88E6171
 
 config NET_DSA_BCM_SF2
 	tristate "Broadcom Starfighter 2 Ethernet switch support"
+	depends on HAS_IOMEM
 	select NET_DSA
 	select NET_DSA_TAG_BRCM
 	select FIXED_PHY if NET_DSA_BCM_SF2=y
-- 
1.9.3

^ permalink raw reply related

* [PATCH] drivers/net/can/Kconfig: Let CAN_AT91 depend on HAS_IOMEM
From: Chen Gang @ 2014-10-02 14:01 UTC (permalink / raw)
  To: wg, mkl, Richard Weinberger
  Cc: linux-can, netdev, linux-kernel@vger.kernel.org

CAN_AT91 needs HAS_IOMEM, so depends on it. The related error (with
allmodconfig under um):

    CC [M]  drivers/net/can/at91_can.o
  drivers/net/can/at91_can.c: In function ‘at91_can_probe’:
  drivers/net/can/at91_can.c:1329:2: error: implicit declaration of function ‘ioremap_nocache’ [-Werror=implicit-function-declaration]
  addr = ioremap_nocache(res->start, resource_size(res));
    ^
  drivers/net/can/at91_can.c:1329:7: warning: assignment makes pointer from integer without a cast [enabled by default]
    addr = ioremap_nocache(res->start, resource_size(res));
         ^
  drivers/net/can/at91_can.c:1384:2: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
    iounmap(addr);
    ^

Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
---
 drivers/net/can/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/can/Kconfig b/drivers/net/can/Kconfig
index e78d6b3..98d73aa 100644
--- a/drivers/net/can/Kconfig
+++ b/drivers/net/can/Kconfig
@@ -65,7 +65,7 @@ config CAN_LEDS
 
 config CAN_AT91
 	tristate "Atmel AT91 onchip CAN controller"
-	depends on ARCH_AT91 || COMPILE_TEST
+	depends on (ARCH_AT91 || COMPILE_TEST) && HAS_IOMEM
 	---help---
 	  This is a driver for the SoC CAN controller in Atmel's AT91SAM9263
 	  and AT91SAM9X5 processors.
-- 
1.9.3

^ permalink raw reply related

* Re: [Xen-devel] [PATCHv1] xen-netfront: always keep the Rx ring full of requests
From: David Vrabel @ 2014-10-02 13:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Boris Ostrovsky, netdev
In-Reply-To: <542D7361020000780003C014@mail.emea.novell.com>

On 02/10/14 14:46, Jan Beulich wrote:
>>>> On 02.10.14 at 15:33, <david.vrabel@citrix.com> wrote:
>> A full Rx ring only requires 1 MiB of memory.  This is not enough
>> memory that it is useful to dynamically scale the number of Rx
>> requests in the ring based on traffic rates.
> 
> The performance benefits are nice, but does the above statement
> scale to hundreds of guests with perhaps multiple NICs and/or
> queues?

Yes, because:

a) Even the full 1 MiB is a tiny fraction of a typically modern Linux VM
(for example, the AWS micro instance still has 1 GiB of memory).

b) Netfront would have used up to 1 MiB already even with moderate data
rates (there was no adjustment of target based on memory pressure).

c) Small VMs are going to typically have one VCPU and hence only one queue.

David

^ permalink raw reply

* [PATCH] staging: xlr_net: Replace obsolete nlm_cop2_{enable, restore} macros
From: Markos Chandras @ 2014-10-02 13:48 UTC (permalink / raw)
  To: linux-mips
  Cc: devel, Jayachandran C, Greg Kroah-Hartman, linux-kernel,
	Markos Chandras, netdev, David S. Miller

Commit 64f6ebe63914 ("MIPS: Netlogic: rename nlm_cop2_save/restore")
replaced nlm_cop2_enable with nlm_cop2_enable_irqsave and
nlm_cop2_restore with nlm_cop2_disable_irqrestore but it did not
update the xlr_net driver to use the new macros resulting into build
problems like this:

drivers/staging/netlogic/xlr_net.c: In function 'send_to_rfr_fifo':
drivers/staging/netlogic/xlr_net.c:128:3: error: implicit declaration of
function 'nlm_cop2_enable' [-Werror=implicit-function-declaration]
mflags = nlm_cop2_enable();
^
drivers/staging/netlogic/xlr_net.c:130:3: error: implicit declaration of
function 'nlm_cop2_restore' [-Werror=implicit-function-declaration]
nlm_cop2_restore(mflags);

Therefore rename these cases as well

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: devel@driverdev.osuosl.org
Cc: linux-kernel@vger.kernel.org
Cc: Jayachandran C <jchandra@broadcom.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
---
 drivers/staging/netlogic/xlr_net.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/netlogic/xlr_net.c b/drivers/staging/netlogic/xlr_net.c
index 9bf407d6241a..469f75f0f818 100644
--- a/drivers/staging/netlogic/xlr_net.c
+++ b/drivers/staging/netlogic/xlr_net.c
@@ -125,9 +125,9 @@ static int send_to_rfr_fifo(struct xlr_net_priv *priv, void *addr)
 	msg.msg3 = 0;
 	stnid = priv->nd->rfr_station;
 	do {
-		mflags = nlm_cop2_enable();
+		mflags = nlm_cop2_enable_irqsave();
 		ret = nlm_fmn_send(1, 0, stnid, &msg);
-		nlm_cop2_restore(mflags);
+		nlm_cop2_disable_irqrestore(mflags);
 		if (ret == 0)
 			return 0;
 	} while (++num_try < 10000);
@@ -298,9 +298,9 @@ static netdev_tx_t xlr_net_start_xmit(struct sk_buff *skb,
 	u32 flags;
 
 	xlr_make_tx_desc(&msg, virt_to_phys(skb->data), skb);
-	flags = nlm_cop2_enable();
+	flags = nlm_cop2_enable_irqsave();
 	ret = nlm_fmn_send(2, 0, priv->nd->tx_stnid, &msg);
-	nlm_cop2_restore(flags);
+	nlm_cop2_disable_irqrestore(flags);
 	if (ret)
 		dev_kfree_skb_any(skb);
 	return NETDEV_TX_OK;
-- 
2.1.1

^ permalink raw reply related

* [RFC PATCH net-next v3 4/4] rtnl: allow to create device with IFLA_LINK_NETNSID set
From: Nicolas Dichtel @ 2014-10-02 13:48 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: cwang-xCSkyg8dI+0RB7SZvlqPiA, Nicolas Dichtel,
	luto-kltTT9wpgjJwATOyAt5JVQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1412257690-31253-1-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

This patch adds the ability to create a netdevice in a specified netns and
then move it into the final netns. In fact, it allows to have a symetry between
get and set rtnl messages.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>
---
 net/core/rtnetlink.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 1b9329512496..57959a85ed2c 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1211,6 +1211,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_NUM_RX_QUEUES]	= { .type = NLA_U32 },
 	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_PORT_ID_LEN },
 	[IFLA_CARRIER_CHANGES]	= { .type = NLA_U32 },  /* ignored */
+	[IFLA_LINK_NETNSID]	= { .type = NLA_S32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -1983,7 +1984,7 @@ replay:
 		struct nlattr *slave_attr[m_ops ? m_ops->slave_maxtype + 1 : 0];
 		struct nlattr **data = NULL;
 		struct nlattr **slave_data = NULL;
-		struct net *dest_net;
+		struct net *dest_net, *link_net = NULL;
 
 		if (ops) {
 			if (ops->maxtype && linkinfo[IFLA_INFO_DATA]) {
@@ -2089,7 +2090,18 @@ replay:
 		if (IS_ERR(dest_net))
 			return PTR_ERR(dest_net);
 
-		dev = rtnl_create_link(dest_net, ifname, name_assign_type, ops, tb);
+		if (tb[IFLA_LINK_NETNSID]) {
+			int id = nla_get_s32(tb[IFLA_LINK_NETNSID]);
+
+			link_net = get_net_ns_by_id(dest_net, id);
+			if (link_net == NULL) {
+				err =  -EINVAL;
+				goto out;
+			}
+		}
+
+		dev = rtnl_create_link(link_net ? : dest_net, ifname,
+				       name_assign_type, ops, tb);
 		if (IS_ERR(dev)) {
 			err = PTR_ERR(dev);
 			goto out;
@@ -2117,9 +2129,16 @@ replay:
 			}
 		}
 		err = rtnl_configure_link(dev, ifm);
-		if (err < 0)
+		if (err < 0) {
 			unregister_netdevice(dev);
+			goto out;
+		}
+
+		if (link_net)
+			err = dev_change_net_namespace(dev, dest_net, ifname);
 out:
+		if (link_net)
+			put_net(link_net);
 		put_net(dest_net);
 		return err;
 	}
-- 
2.1.0

^ permalink raw reply related

* [RFC PATCH net-next v3 3/4] iptunnels: advertise link netns via netlink
From: Nicolas Dichtel @ 2014-10-02 13:48 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: cwang-xCSkyg8dI+0RB7SZvlqPiA, Nicolas Dichtel,
	luto-kltTT9wpgjJwATOyAt5JVQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1412257690-31253-1-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

Implement rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
added to rtnetlink messages.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>
---
 include/net/ip6_tunnel.h | 1 +
 include/net/ip_tunnels.h | 1 +
 net/ipv4/ip_gre.c        | 2 ++
 net/ipv4/ip_tunnel.c     | 8 ++++++++
 net/ipv4/ip_vti.c        | 1 +
 net/ipv4/ipip.c          | 1 +
 net/ipv6/ip6_gre.c       | 1 +
 net/ipv6/ip6_tunnel.c    | 9 +++++++++
 net/ipv6/ip6_vti.c       | 1 +
 net/ipv6/sit.c           | 1 +
 10 files changed, 26 insertions(+)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index a5593dab6af7..8648519f4555 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -69,6 +69,7 @@ int ip6_tnl_xmit_ctl(struct ip6_tnl *t);
 __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw);
 __u32 ip6_tnl_get_cap(struct ip6_tnl *t, const struct in6_addr *laddr,
 			     const struct in6_addr *raddr);
+struct net *ip6_tnl_get_link_net(const struct net_device *dev);
 
 static inline void ip6tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
 {
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 7f538ba6e267..c92a99b5b77e 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -119,6 +119,7 @@ struct ip_tunnel_net {
 int ip_tunnel_init(struct net_device *dev);
 void ip_tunnel_uninit(struct net_device *dev);
 void  ip_tunnel_dellink(struct net_device *dev, struct list_head *head);
+struct net *ip_tunnel_get_link_net(const struct net_device *dev);
 int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
 		       struct rtnl_link_ops *ops, char *devname);
 
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 0485ef18d254..c75974986053 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -827,6 +827,7 @@ static struct rtnl_link_ops ipgre_link_ops __read_mostly = {
 	.dellink	= ip_tunnel_dellink,
 	.get_size	= ipgre_get_size,
 	.fill_info	= ipgre_fill_info,
+	.get_link_net	= ip_tunnel_get_link_net,
 };
 
 static struct rtnl_link_ops ipgre_tap_ops __read_mostly = {
@@ -841,6 +842,7 @@ static struct rtnl_link_ops ipgre_tap_ops __read_mostly = {
 	.dellink	= ip_tunnel_dellink,
 	.get_size	= ipgre_get_size,
 	.fill_info	= ipgre_fill_info,
+	.get_link_net	= ip_tunnel_get_link_net,
 };
 
 static int __net_init ipgre_tap_init_net(struct net *net)
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index b75b47b0a223..a8ab238d0df4 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -954,6 +954,14 @@ void ip_tunnel_dellink(struct net_device *dev, struct list_head *head)
 }
 EXPORT_SYMBOL_GPL(ip_tunnel_dellink);
 
+struct net *ip_tunnel_get_link_net(const struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+
+	return tunnel->net;
+}
+EXPORT_SYMBOL(ip_tunnel_get_link_net);
+
 int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
 				  struct rtnl_link_ops *ops, char *devname)
 {
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index e453cb724a95..93862411669c 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -530,6 +530,7 @@ static struct rtnl_link_ops vti_link_ops __read_mostly = {
 	.changelink	= vti_changelink,
 	.get_size	= vti_get_size,
 	.fill_info	= vti_fill_info,
+	.get_link_net	= ip_tunnel_get_link_net,
 };
 
 static int __init vti_init(void)
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index ea88ab3102a8..406910d04b1b 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -498,6 +498,7 @@ static struct rtnl_link_ops ipip_link_ops __read_mostly = {
 	.dellink	= ip_tunnel_dellink,
 	.get_size	= ipip_get_size,
 	.fill_info	= ipip_fill_info,
+	.get_link_net	= ip_tunnel_get_link_net,
 };
 
 static struct xfrm_tunnel ipip_handler __read_mostly = {
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 9a0a1aafe727..10981f568250 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1659,6 +1659,7 @@ static struct rtnl_link_ops ip6gre_link_ops __read_mostly = {
 	.dellink	= ip6gre_dellink,
 	.get_size	= ip6gre_get_size,
 	.fill_info	= ip6gre_fill_info,
+	.get_link_net	= ip6_tnl_get_link_net,
 };
 
 static struct rtnl_link_ops ip6gre_tap_ops __read_mostly = {
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index e01bd0399297..b86d9f4ea5ec 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1699,6 +1699,14 @@ nla_put_failure:
 	return -EMSGSIZE;
 }
 
+struct net *ip6_tnl_get_link_net(const struct net_device *dev)
+{
+	struct ip6_tnl *tunnel = netdev_priv(dev);
+
+	return tunnel->net;
+}
+EXPORT_SYMBOL(ip6_tnl_get_link_net);
+
 static const struct nla_policy ip6_tnl_policy[IFLA_IPTUN_MAX + 1] = {
 	[IFLA_IPTUN_LINK]		= { .type = NLA_U32 },
 	[IFLA_IPTUN_LOCAL]		= { .len = sizeof(struct in6_addr) },
@@ -1722,6 +1730,7 @@ static struct rtnl_link_ops ip6_link_ops __read_mostly = {
 	.dellink	= ip6_tnl_dellink,
 	.get_size	= ip6_tnl_get_size,
 	.fill_info	= ip6_tnl_fill_info,
+	.get_link_net	= ip6_tnl_get_link_net,
 };
 
 static struct xfrm6_tunnel ip4ip6_handler __read_mostly = {
diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index 7f52fd9fa7b0..88e8aadcfac1 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -988,6 +988,7 @@ static struct rtnl_link_ops vti6_link_ops __read_mostly = {
 	.changelink	= vti6_changelink,
 	.get_size	= vti6_get_size,
 	.fill_info	= vti6_fill_info,
+	.get_link_net	= ip6_tnl_get_link_net,
 };
 
 static void __net_exit vti6_destroy_tunnels(struct vti6_net *ip6n)
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 0d4e27466f82..02ef387811be 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1765,6 +1765,7 @@ static struct rtnl_link_ops sit_link_ops __read_mostly = {
 	.get_size	= ipip6_get_size,
 	.fill_info	= ipip6_fill_info,
 	.dellink	= ipip6_dellink,
+	.get_link_net	= ip_tunnel_get_link_net,
 };
 
 static struct xfrm_tunnel sit_handler __read_mostly = {
-- 
2.1.0

^ permalink raw reply related

* [RFC PATCH net-next v3 2/4] rtnl: add link netns id to interface messages
From: Nicolas Dichtel @ 2014-10-02 13:48 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: cwang-xCSkyg8dI+0RB7SZvlqPiA, Nicolas Dichtel,
	luto-kltTT9wpgjJwATOyAt5JVQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1412257690-31253-1-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

This patch adds a new attribute (IFLA_LINK_NETNSID) which contains the 'link'
netns id when this netns is different from the netns where the interface
stands (for example for x-net interfaces like ip tunnels). When there is no id,
we put NETNSA_NSINDEX_UNKNOWN into this attribute to indicate to userland that
the link netns is different from the interface netns. Hence, userland knows that
some information like IFLA_LINK are not interpretable.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>
---
 include/net/rtnetlink.h      |  2 ++
 include/uapi/linux/if_link.h |  1 +
 net/core/rtnetlink.c         | 13 +++++++++++++
 3 files changed, 16 insertions(+)

diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index e21b9f9653c0..6c6d5393fc34 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -46,6 +46,7 @@ static inline int rtnl_msg_family(const struct nlmsghdr *nlh)
  *			    to create when creating a new device.
  *	@get_num_rx_queues: Function to determine number of receive queues
  *			    to create when creating a new device.
+ *	@get_link_net: Function to get the i/o netns of the device
  */
 struct rtnl_link_ops {
 	struct list_head	list;
@@ -93,6 +94,7 @@ struct rtnl_link_ops {
 	int			(*fill_slave_info)(struct sk_buff *skb,
 						   const struct net_device *dev,
 						   const struct net_device *slave_dev);
+	struct net		*(*get_link_net)(const struct net_device *dev);
 };
 
 int __rtnl_link_register(struct rtnl_link_ops *ops);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 0bdb77e16875..938c0c02ed2e 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -145,6 +145,7 @@ enum {
 	IFLA_CARRIER,
 	IFLA_PHYS_PORT_ID,
 	IFLA_CARRIER_CHANGES,
+	IFLA_LINK_NETNSID,
 	__IFLA_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a6882686ca3a..1b9329512496 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -862,6 +862,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(1) /* IFLA_OPERSTATE */
 	       + nla_total_size(1) /* IFLA_LINKMODE */
 	       + nla_total_size(4) /* IFLA_CARRIER_CHANGES */
+	       + nla_total_size(4) /* IFLA_LINK_NETNSID */
 	       + nla_total_size(ext_filter_mask
 			        & RTEXT_FILTER_VF ? 4 : 0) /* IFLA_NUM_VF */
 	       + rtnl_vfinfo_size(dev, ext_filter_mask) /* IFLA_VFINFO_LIST */
@@ -1134,6 +1135,18 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 			goto nla_put_failure;
 	}
 
+	if (dev->rtnl_link_ops &&
+	    dev->rtnl_link_ops->get_link_net) {
+		struct net *link_net = dev->rtnl_link_ops->get_link_net(dev);
+
+		if (!net_eq(dev_net(dev), link_net)) {
+			int id = peernet2id(dev_net(dev), link_net);
+
+			if (nla_put_s32(skb, IFLA_LINK_NETNSID, id))
+				goto nla_put_failure;
+		}
+	}
+
 	if (!(af_spec = nla_nest_start(skb, IFLA_AF_SPEC)))
 		goto nla_put_failure;
 
-- 
2.1.0

^ permalink raw reply related

* [RFC PATCH net-next v3 1/4] netns: add genl cmd to add and get peer netns ids
From: Nicolas Dichtel @ 2014-10-02 13:48 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	luto-kltTT9wpgjJwATOyAt5JVQ, cwang-xCSkyg8dI+0RB7SZvlqPiA,
	Nicolas Dichtel
In-Reply-To: <1412257690-31253-1-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

With this patch, a user can define an id for a peer netns by providing a FD or a
PID. These ids are local to netns (ie valid only into one netns).

This will be useful for netlink messages when a x-netns interface is dumped.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>
---
 MAINTAINERS                 |   1 +
 include/net/net_namespace.h |   5 ++
 include/uapi/linux/Kbuild   |   1 +
 include/uapi/linux/netns.h  |  31 +++++++
 net/core/net_namespace.c    | 195 ++++++++++++++++++++++++++++++++++++++++++++
 net/netlink/genetlink.c     |   4 +
 6 files changed, 237 insertions(+)
 create mode 100644 include/uapi/linux/netns.h

diff --git a/MAINTAINERS b/MAINTAINERS
index f8db3c3acc67..8e7f5d668e6a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6278,6 +6278,7 @@ F:	include/linux/netdevice.h
 F:	include/uapi/linux/in.h
 F:	include/uapi/linux/net.h
 F:	include/uapi/linux/netdevice.h
+F:	include/uapi/linux/netns.h
 F:	tools/net/
 F:	tools/testing/selftests/net/
 F:	lib/random32.c
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 361d26077196..d8847d978b59 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -59,6 +59,7 @@ struct net {
 	struct list_head	exit_list;	/* Use only net_mutex */
 
 	struct user_namespace   *user_ns;	/* Owning user namespace */
+	struct idr		netns_ids;
 
 	unsigned int		proc_inum;
 
@@ -289,6 +290,10 @@ static inline struct net *read_pnet(struct net * const *pnet)
 #define __net_initconst	__initconst
 #endif
 
+int peernet2id(struct net *net, struct net *peer);
+struct net *get_net_ns_by_id(struct net *net, int id);
+int netns_genl_register(void);
+
 struct pernet_operations {
 	struct list_head list;
 	int (*init)(struct net *net);
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 70e150ebc6c9..33a0bbfe4736 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -276,6 +276,7 @@ header-y += netfilter_decnet.h
 header-y += netfilter_ipv4.h
 header-y += netfilter_ipv6.h
 header-y += netlink.h
+header-y += netns.h
 header-y += netrom.h
 header-y += nfc.h
 header-y += nfs.h
diff --git a/include/uapi/linux/netns.h b/include/uapi/linux/netns.h
new file mode 100644
index 000000000000..8ebb08885795
--- /dev/null
+++ b/include/uapi/linux/netns.h
@@ -0,0 +1,31 @@
+#ifndef _UAPI_LINUX_NETNS_H_
+#define _UAPI_LINUX_NETNS_H_
+
+/* Generic netlink messages */
+
+#define NETNS_GENL_NAME			"netns"
+#define NETNS_GENL_VERSION		0x1
+
+/* Commands */
+enum {
+	NETNS_CMD_UNSPEC,
+	NETNS_CMD_NEWID,
+	NETNS_CMD_GETID,
+	__NETNS_CMD_MAX,
+};
+
+#define NETNS_CMD_MAX		(__NETNS_CMD_MAX - 1)
+
+/* Attributes */
+enum {
+	NETNSA_NONE,
+#define NETNSA_NSINDEX_UNKNOWN	-1
+	NETNSA_NSID,
+	NETNSA_PID,
+	NETNSA_FD,
+	__NETNSA_MAX,
+};
+
+#define NETNSA_MAX		(__NETNSA_MAX - 1)
+
+#endif /* _UAPI_LINUX_NETNS_H_ */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 7f155175bba8..4a5680ed42fb 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -15,6 +15,8 @@
 #include <linux/file.h>
 #include <linux/export.h>
 #include <linux/user_namespace.h>
+#include <linux/netns.h>
+#include <net/genetlink.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 
@@ -144,6 +146,50 @@ static void ops_free_list(const struct pernet_operations *ops,
 	}
 }
 
+/* This function is used by idr_for_each(). If net is equal to peer, the
+ * function returns the id so that idr_for_each() stops. Because we cannot
+ * returns the id 0 (idr_for_each() will not stop), we return the magic value
+ * -1 for it.
+ */
+static int net_eq_idr(int id, void *net, void *peer)
+{
+	if (net_eq(net, peer))
+		return id ? : -1;
+	return 0;
+}
+
+/* returns NETNSA_NSINDEX_UNKNOWN if not found */
+int peernet2id(struct net *net, struct net *peer)
+{
+	int id = idr_for_each(&net->netns_ids, net_eq_idr, peer);
+
+	ASSERT_RTNL();
+
+	/* Magic value for id 0. */
+	if (id == -1)
+		return 0;
+	if (id == 0)
+		return NETNSA_NSINDEX_UNKNOWN;
+
+	return id;
+}
+
+struct net *get_net_ns_by_id(struct net *net, int id)
+{
+	struct net *peer;
+
+	if (id < 0)
+		return NULL;
+
+	rcu_read_lock();
+	peer = idr_find(&net->netns_ids, id);
+	if (peer)
+		get_net(peer);
+	rcu_read_unlock();
+
+	return peer;
+}
+
 /*
  * setup_net runs the initializers for the network namespace object.
  */
@@ -158,6 +204,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 	atomic_set(&net->passive, 1);
 	net->dev_base_seq = 1;
 	net->user_ns = user_ns;
+	idr_init(&net->netns_ids);
 
 #ifdef NETNS_REFCNT_DEBUG
 	atomic_set(&net->use_count, 0);
@@ -288,6 +335,14 @@ static void cleanup_net(struct work_struct *work)
 	list_for_each_entry(net, &net_kill_list, cleanup_list) {
 		list_del_rcu(&net->list);
 		list_add_tail(&net->exit_list, &net_exit_list);
+		for_each_net(tmp) {
+			int id = peernet2id(tmp, net);
+
+			if (id >= 0)
+				idr_remove(&tmp->netns_ids, id);
+		}
+		idr_destroy(&net->netns_ids);
+
 	}
 	rtnl_unlock();
 
@@ -399,6 +454,146 @@ static struct pernet_operations __net_initdata net_ns_ops = {
 	.exit = net_ns_net_exit,
 };
 
+static struct genl_family netns_genl_family = {
+	.id		= GENL_ID_GENERATE,
+	.name		= NETNS_GENL_NAME,
+	.version	= NETNS_GENL_VERSION,
+	.hdrsize	= 0,
+	.maxattr	= NETNSA_MAX,
+	.netnsok	= true,
+};
+
+static struct nla_policy netns_nl_policy[NETNSA_MAX + 1] = {
+	[NETNSA_NONE]		= { .type = NLA_UNSPEC },
+	[NETNSA_NSID]		= { .type = NLA_S32 },
+	[NETNSA_PID]		= { .type = NLA_U32 },
+	[NETNSA_FD]		= { .type = NLA_U32 },
+};
+
+static int netns_nl_cmd_newid(struct sk_buff *skb, struct genl_info *info)
+{
+	struct net *net = genl_info_net(info);
+	struct net *peer;
+	int nsid, err;
+
+	if (!info->attrs[NETNSA_NSID])
+		return -EINVAL;
+	nsid = nla_get_s32(info->attrs[NETNSA_NSID]);
+	if (nsid < 0)
+		return -EINVAL;
+
+	if (info->attrs[NETNSA_PID])
+		peer = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
+	else if (info->attrs[NETNSA_FD])
+		peer = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
+	else
+		return -EINVAL;
+	if (IS_ERR(peer))
+		return PTR_ERR(peer);
+
+	rtnl_lock();
+	if (peernet2id(net, peer) >= 0) {
+		err = -EEXIST;
+		goto out;
+	}
+
+	err = idr_alloc(&net->netns_ids, peer, nsid, nsid + 1, GFP_KERNEL);
+	if (err >= 0)
+		err = 0;
+out:
+	rtnl_unlock();
+	put_net(peer);
+	return err;
+}
+
+static int netns_nl_get_size(void)
+{
+	return nla_total_size(sizeof(s32)) /* NETNSA_NSID */
+	       ;
+}
+
+static int netns_nl_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags,
+			 int cmd, struct net *net, struct net *peer)
+{
+	void *hdr;
+	int id;
+
+	hdr = genlmsg_put(skb, portid, seq, &netns_genl_family, flags, cmd);
+	if (!hdr)
+		return -EMSGSIZE;
+
+	rtnl_lock();
+	id = peernet2id(net, peer);
+	rtnl_unlock();
+	if (nla_put_s32(skb, NETNSA_NSID, id))
+		goto nla_put_failure;
+
+	return genlmsg_end(skb, hdr);
+
+nla_put_failure:
+	genlmsg_cancel(skb, hdr);
+	return -EMSGSIZE;
+}
+
+static int netns_nl_cmd_getid(struct sk_buff *skb, struct genl_info *info)
+{
+	struct net *net = genl_info_net(info);
+	struct sk_buff *msg;
+	int err = -ENOBUFS;
+	struct net *peer;
+
+	if (info->attrs[NETNSA_PID])
+		peer = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
+	else if (info->attrs[NETNSA_FD])
+		peer = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
+	else
+		return -EINVAL;
+
+	if (IS_ERR(peer))
+		return PTR_ERR(peer);
+
+	msg = genlmsg_new(netns_nl_get_size(), GFP_KERNEL);
+	if (!msg) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	err = netns_nl_fill(msg, info->snd_portid, info->snd_seq,
+			    NLM_F_ACK, NETNS_CMD_GETID, net, peer);
+	if (err < 0)
+		goto err_out;
+
+	err = genlmsg_unicast(net, msg, info->snd_portid);
+	goto out;
+
+err_out:
+	nlmsg_free(msg);
+out:
+	put_net(peer);
+	return err;
+}
+
+static struct genl_ops netns_genl_ops[] = {
+	{
+		.cmd = NETNS_CMD_NEWID,
+		.policy = netns_nl_policy,
+		.doit = netns_nl_cmd_newid,
+		.flags = GENL_ADMIN_PERM,
+	},
+	{
+		.cmd = NETNS_CMD_GETID,
+		.policy = netns_nl_policy,
+		.doit = netns_nl_cmd_getid,
+		.flags = GENL_ADMIN_PERM,
+	},
+};
+
+int netns_genl_register(void)
+{
+	return genl_register_family_with_ops(&netns_genl_family,
+					     netns_genl_ops);
+}
+
 static int __init net_ns_init(void)
 {
 	struct net_generic *ng;
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 76393f2f4b22..c6f39e40c9f3 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -1029,6 +1029,10 @@ static int __init genl_init(void)
 	if (err)
 		goto problem;
 
+	err = netns_genl_register();
+	if (err < 0)
+		goto problem;
+
 	return 0;
 
 problem:
-- 
2.1.0

^ permalink raw reply related

* [RFC PATCH net-next v3 0/4] netns: allow to identify peer netns
From: Nicolas Dichtel @ 2014-10-02 13:48 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: cwang-xCSkyg8dI+0RB7SZvlqPiA, luto-kltTT9wpgjJwATOyAt5JVQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <542D5726.8070308-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

The goal of this serie is to be able to multicast netlink messages with an
attribute that identify a peer netns.
This is needed by the userland to interpret some informations contained in
netlink messages (like IFLA_LINK value, but also some other attributes in case
of x-netns netdevice (see also
http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).

Ids of peer netns are set by userland via a new genl messages. These ids are
stored per netns and are local (ie only valid in the netns where they are set).
To avoid allocating an int for each peer netns, I use idr_for_each() to retrieve
the id of a peer netns.

Patch 1/4 introduces the netlink API mechanism to set and get these ids.
Patch 2/4 and 3/4 shows an example of how to use these ids in rtnetlink
messages. And patch 4/4 shows that the netlink messages can be symetric between
a GET and a SET.

iproute2 patches are available, I can send them on demand.

Here is a small screenshot to show how it can be used by userland:
$ ip netns add foo
$ ip netns del foo
$ ip netns
$ touch /var/run/netns/init_net
$ mount --bind /proc/1/ns/net /var/run/netns/init_net
$ ip netns add foo
$ ip netns exec foo ip netns set init_net 0
$ ip netns
foo
init_net
$ ip netns exec foo ip netns
foo
init_net (id: 0)
$ ip netns exec foo ip link add ipip1 link-netnsid 0 type ipip remote 10.16.0.121 local 10.16.0.249
$ ip netns exec foo ip l ls ipip1
6: ipip1@NONE: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default 
    link/ipip 10.16.0.249 peer 10.16.0.121 link-netnsid 0

The parameter link-netnsid shows us where the interface sends and receives
packets (and thus we know where encapsulated addresses are set).

RFCv2 -> RFCv3:
  ids are now defined by userland (via netlink). Ids are stored in each netns
  (and they are local to this netns).
  add get_link_net support for ip6 tunnels
  netnsid is now a s32 instead of a u32

RFCv1 -> RFCv2:
  remove useless ()
  ids are now stored in the user ns. It's possible to get an id for a peer netns
  only if the current netns and the peer netns have the same user ns parent.

 MAINTAINERS                  |   1 +
 include/net/ip6_tunnel.h     |   1 +
 include/net/ip_tunnels.h     |   1 +
 include/net/net_namespace.h  |   5 ++
 include/net/rtnetlink.h      |   2 +
 include/uapi/linux/Kbuild    |   1 +
 include/uapi/linux/if_link.h |   1 +
 include/uapi/linux/netns.h   |  31 +++++++
 net/core/net_namespace.c     | 195 +++++++++++++++++++++++++++++++++++++++++++
 net/core/rtnetlink.c         |  38 ++++++++-
 net/ipv4/ip_gre.c            |   2 +
 net/ipv4/ip_tunnel.c         |   8 ++
 net/ipv4/ip_vti.c            |   1 +
 net/ipv4/ipip.c              |   1 +
 net/ipv6/ip6_gre.c           |   1 +
 net/ipv6/ip6_tunnel.c        |   9 ++
 net/ipv6/ip6_vti.c           |   1 +
 net/ipv6/sit.c               |   1 +
 net/netlink/genetlink.c      |   4 +
 19 files changed, 301 insertions(+), 3 deletions(-)

Comments are welcome.

Regards,
Nicolas

^ permalink raw reply

* Re: [Xen-devel] [PATCHv1] xen-netfront: always keep the Rx ring full of requests
From: Jan Beulich @ 2014-10-02 13:46 UTC (permalink / raw)
  To: David Vrabel; +Cc: xen-devel, Boris Ostrovsky, netdev
In-Reply-To: <1412256826-18874-1-git-send-email-david.vrabel@citrix.com>

>>> On 02.10.14 at 15:33, <david.vrabel@citrix.com> wrote:
> A full Rx ring only requires 1 MiB of memory.  This is not enough
> memory that it is useful to dynamically scale the number of Rx
> requests in the ring based on traffic rates.

The performance benefits are nice, but does the above statement
scale to hundreds of guests with perhaps multiple NICs and/or
queues?

Jan

> Keeping the ring full of Rx requests handles bursty traffic better
> than trying to converges on an optimal number of requests to keep
> filled.
> 
> On a 4 core host, an iperf -P 64 -t 60 run from dom0 to a 4 VCPU guest
> improved from 5.1 Gbit/s to 5.6 Gbit/s.  Gains with more bursty
> traffic are expected to be higher.
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>

^ permalink raw reply

* Re: [RFC PATCH net-next v2 0/5] netns: allow to identify peer netns
From: Nicolas Dichtel @ 2014-10-02 13:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Network Development, Linux Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andy Lutomirski, Stephen Hemminger, Cong Wang, Linux API,
	Andrew Morton, David S. Miller
In-Reply-To: <87y4t2gtd0.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

Le 29/09/2014 20:43, Eric W. Biederman a écrit :
> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>
>> Le 26/09/2014 20:57, Eric W. Biederman a écrit :
>>> Andy Lutomirski <luto@amacapital.net> writes:
>>>
>>>> On Fri, Sep 26, 2014 at 11:10 AM, Eric W. Biederman
>>>> <ebiederm@xmission.com> wrote:
>>>>> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>>>>>
>>>>>> The goal of this serie is to be able to multicast netlink messages with an
>>>>>> attribute that identify a peer netns.
>>>>>> This is needed by the userland to interpret some informations contained in
>>>>>> netlink messages (like IFLA_LINK value, but also some other attributes in case
>>>>>> of x-netns netdevice (see also
>>>>>> http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
>>>>>> http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).
>>>>>
>>>>> I want say that the problem addressed by patch 3/5 of this series is a
>>>>> fundamentally valid problem.  We have network objects spanning network
>>>>> namespaces and it would be very nice to be able to talk about them in
>>>>> netlink, and file descriptors are too local and argubably too heavy
>>>>> weight for netlink quires and especially for netlink broadcast messages.
>>>>>
>>>>> Furthermore the concept of ineternal concept of peernet2id seems valid.
>>>>>
>>>>> However what you do not address is a way for CRIU (aka process
>>>>> migration) to be able to restore these ids after process migration.
>>>>> Going farther it looks like you are actively breaking process migration
>>>>> at this time, making this set of patches a no-go.
>> Ok, I will look more deeply into CRIU.
>>
>>>>>
>>>>> When adding a new form of namespace id CRIU patches are just about
>>>>> as necessary as iproute patches.
>> Noted.
>
>
>
>>>>> That does not describe what you have actually implemented in the
>>>>> patches.
>>>>>
>>>>> I see two ways to go with this.
>>>>>
>>>>> - A per network namespace table to that you can store ids for ``peer''
>>>>>     network namespaces.  The table would need to be populated manually by
>>>>>     the likes of ip netns add.
>>>>>
>>>>>     That flips the order of assignment and makes this idea solid.
>> I have a preference for this solution, because it allows to have a full
>> broadcast messages. When you have a lot of network interfaces (> 10k),
>> it saves a lot of time to avoid another request to get all informations.
>
> My practical question is how often does it happen that we care?
In fact, I don't think that scenarii with a lot of netns have a full mesh of
x-netns interfaces. It will be more one "link" netns with the physical
interface and all other with one interface with the link part in this "link"
netns. Hence, only one nsid is needing in each netns.

>
>>>>>     Unfortunately in the case of a fully referencing mesh of N network
>>>>>     namespaces such a mesh winds up taking O(N^2) space, which seems
>>>>>     undesirable.
>> Memory consumption vs performances ;-)
>> In fact, when you have a lot of netns, you already should have some memory
>> available (at least N lo interfaces + N interfaces (veth or a x-netns
>> interface)). I'm not convinced that this is really an obstacle.
>
> I would have to see how it all fits together. O(N^2) grows a lot faster
> that N.  So after a point it isn't in the same ballpark of memory
> consumption.
>
>>> broadcast message business, and only care about the remote namespace for
>>> unicast messages.  Putting the work in an infrequently used slow path
>>> instead of a comparitively common path gives us much more freedom in
>>> the implementation.
>> I think it's better to have a full netlink messages, instead a partial one.
>> There is already a lot of attributes added for each rtnl interface messages to
>> be sure to describe all parameters of these interfaces.
>> And if the user don't care about ids (user has not set any id with iproute2),
>> we can just add the same attribute with id 0 (let's say it's a reserved id) to
>> indicate that the link part of this interface is in another netns.
>
> I imagine an id like that is something we would want ip netns add to
> set, and probably set in all existing network namespaces as well.
>
>> The great benefit of your first proposal is that the ids are set by the
>> userspace and thus it allows a high flexibility.
>>
>> Would you accept a patch that implements this first solution?
>
> I would not fundamentally reject it.  I would really like to make
> certain we think through how it will be used and what the practical
> benefits are.  Depending on how it is used the data structure could
> be a killer or it could be a case where we see how to manage it and
> simply don't care.
I will send a v3, so we can talk about it.


Thank you,
Nicolas
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply

* [PATCHv1] xen-netfront: always keep the Rx ring full of requests
From: David Vrabel @ 2014-10-02 13:33 UTC (permalink / raw)
  To: netdev; +Cc: David Vrabel, xen-devel, Konrad Rzeszutek Wilk, Boris Ostrovsky

A full Rx ring only requires 1 MiB of memory.  This is not enough
memory that it is useful to dynamically scale the number of Rx
requests in the ring based on traffic rates.

Keeping the ring full of Rx requests handles bursty traffic better
than trying to converges on an optimal number of requests to keep
filled.

On a 4 core host, an iperf -P 64 -t 60 run from dom0 to a 4 VCPU guest
improved from 5.1 Gbit/s to 5.6 Gbit/s.  Gains with more bursty
traffic are expected to be higher.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 drivers/net/xen-netfront.c |  310 ++++++++------------------------------------
 1 file changed, 54 insertions(+), 256 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index ca82f54..d241aca 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -77,7 +77,9 @@ struct netfront_cb {
 
 #define NET_TX_RING_SIZE __CONST_RING_SIZE(xen_netif_tx, PAGE_SIZE)
 #define NET_RX_RING_SIZE __CONST_RING_SIZE(xen_netif_rx, PAGE_SIZE)
-#define TX_MAX_TARGET min_t(int, NET_TX_RING_SIZE, 256)
+
+/* Minimum number of Rx slots (includes slot for GSO metadata). */
+#define NET_RX_SLOTS_MIN (XEN_NETIF_NR_SLOTS_MIN + 1)
 
 /* Queue name is interface name with "-qNNN" appended */
 #define QUEUE_NAME_SIZE (IFNAMSIZ + 6)
@@ -138,10 +140,6 @@ struct netfront_queue {
 	int rx_ring_ref;
 
 	/* Receive-ring batched refills. */
-#define RX_MIN_TARGET 8
-#define RX_DFL_MIN_TARGET 64
-#define RX_MAX_TARGET min_t(int, NET_RX_RING_SIZE, 256)
-	unsigned rx_min_target, rx_max_target, rx_target;
 	struct sk_buff_head rx_batch;
 
 	struct timer_list rx_refill_timer;
@@ -228,14 +226,6 @@ static grant_ref_t xennet_get_rx_ref(struct netfront_queue *queue,
 	return ref;
 }
 
-#ifdef CONFIG_SYSFS
-static int xennet_sysfs_addif(struct net_device *netdev);
-static void xennet_sysfs_delif(struct net_device *netdev);
-#else /* !CONFIG_SYSFS */
-#define xennet_sysfs_addif(dev) (0)
-#define xennet_sysfs_delif(dev) do { } while (0)
-#endif
-
 static bool xennet_can_sg(struct net_device *dev)
 {
 	return dev->features & NETIF_F_SG;
@@ -251,7 +241,7 @@ static void rx_refill_timeout(unsigned long data)
 static int netfront_tx_slot_available(struct netfront_queue *queue)
 {
 	return (queue->tx.req_prod_pvt - queue->tx.rsp_cons) <
-		(TX_MAX_TARGET - MAX_SKB_FRAGS - 2);
+		(NET_TX_RING_SIZE - MAX_SKB_FRAGS - 2);
 }
 
 static void xennet_maybe_wake_tx(struct netfront_queue *queue)
@@ -265,77 +255,55 @@ static void xennet_maybe_wake_tx(struct netfront_queue *queue)
 		netif_tx_wake_queue(netdev_get_tx_queue(dev, queue->id));
 }
 
-static void xennet_alloc_rx_buffers(struct netfront_queue *queue)
+
+struct sk_buff *xennet_alloc_one_rx_buffer(struct netfront_queue *queue)
 {
-	unsigned short id;
 	struct sk_buff *skb;
 	struct page *page;
-	int i, batch_target, notify;
+
+	skb = __netdev_alloc_skb(queue->info->netdev,
+				 RX_COPY_THRESHOLD + NET_IP_ALIGN,
+				 GFP_ATOMIC | __GFP_NOWARN);
+	if (unlikely(!skb))
+		return NULL;
+
+	page = alloc_page(GFP_ATOMIC | __GFP_NOWARN);
+	if (!page) {
+		kfree_skb(skb);
+		return NULL;
+	}
+	skb_add_rx_frag(skb, 0, page, 0, 0, PAGE_SIZE);
+
+	/* Align ip header to a 16 bytes boundary */
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb->dev = queue->info->netdev;
+
+	return skb;
+}
+	
+
+static void xennet_alloc_rx_buffers(struct netfront_queue *queue)
+{
 	RING_IDX req_prod = queue->rx.req_prod_pvt;
-	grant_ref_t ref;
-	unsigned long pfn;
-	void *vaddr;
-	struct xen_netif_rx_request *req;
+	int notify;
 
 	if (unlikely(!netif_carrier_ok(queue->info->netdev)))
 		return;
 
-	/*
-	 * Allocate skbuffs greedily, even though we batch updates to the
-	 * receive ring. This creates a less bursty demand on the memory
-	 * allocator, so should reduce the chance of failed allocation requests
-	 * both for ourself and for other kernel subsystems.
-	 */
-	batch_target = queue->rx_target - (req_prod - queue->rx.rsp_cons);
-	for (i = skb_queue_len(&queue->rx_batch); i < batch_target; i++) {
-		skb = __netdev_alloc_skb(queue->info->netdev,
-					 RX_COPY_THRESHOLD + NET_IP_ALIGN,
-					 GFP_ATOMIC | __GFP_NOWARN);
-		if (unlikely(!skb))
-			goto no_skb;
-
-		/* Align ip header to a 16 bytes boundary */
-		skb_reserve(skb, NET_IP_ALIGN);
-
-		page = alloc_page(GFP_ATOMIC | __GFP_NOWARN);
-		if (!page) {
-			kfree_skb(skb);
-no_skb:
-			/* Could not allocate any skbuffs. Try again later. */
-			mod_timer(&queue->rx_refill_timer,
-				  jiffies + (HZ/10));
-
-			/* Any skbuffs queued for refill? Force them out. */
-			if (i != 0)
-				goto refill;
+	for (req_prod = queue->rx.req_prod_pvt;
+	     req_prod - queue->rx.rsp_cons < NET_RX_RING_SIZE;
+	     req_prod++) {
+		struct sk_buff *skb;
+		unsigned short id;
+		grant_ref_t ref;
+		unsigned long pfn;
+		struct xen_netif_rx_request *req;
+
+		skb = xennet_alloc_one_rx_buffer(queue);
+		if (!skb)
 			break;
-		}
-
-		skb_add_rx_frag(skb, 0, page, 0, 0, PAGE_SIZE);
-		__skb_queue_tail(&queue->rx_batch, skb);
-	}
-
-	/* Is the batch large enough to be worthwhile? */
-	if (i < (queue->rx_target/2)) {
-		if (req_prod > queue->rx.sring->req_prod)
-			goto push;
-		return;
-	}
-
-	/* Adjust our fill target if we risked running out of buffers. */
-	if (((req_prod - queue->rx.sring->rsp_prod) < (queue->rx_target / 4)) &&
-	    ((queue->rx_target *= 2) > queue->rx_max_target))
-		queue->rx_target = queue->rx_max_target;
-
- refill:
-	for (i = 0; ; i++) {
-		skb = __skb_dequeue(&queue->rx_batch);
-		if (skb == NULL)
-			break;
-
-		skb->dev = queue->info->netdev;
 
-		id = xennet_rxidx(req_prod + i);
+		id = xennet_rxidx(req_prod);
 
 		BUG_ON(queue->rx_skbs[id]);
 		queue->rx_skbs[id] = skb;
@@ -345,9 +313,8 @@ no_skb:
 		queue->grant_rx_ref[id] = ref;
 
 		pfn = page_to_pfn(skb_frag_page(&skb_shinfo(skb)->frags[0]));
-		vaddr = page_address(skb_frag_page(&skb_shinfo(skb)->frags[0]));
 
-		req = RING_GET_REQUEST(&queue->rx, req_prod + i);
+		req = RING_GET_REQUEST(&queue->rx, req_prod);
 		gnttab_grant_foreign_access_ref(ref,
 						queue->info->xbdev->otherend_id,
 						pfn_to_mfn(pfn),
@@ -357,11 +324,16 @@ no_skb:
 		req->gref = ref;
 	}
 
+	queue->rx.req_prod_pvt = req_prod;
+
+	/* Not enough requests? Try again later. */
+	if (req_prod - queue->rx.rsp_cons < NET_RX_SLOTS_MIN) {
+		mod_timer(&queue->rx_refill_timer, jiffies + (HZ/10));
+		return;
+	}
+
 	wmb();		/* barrier so backend seens requests */
 
-	/* Above is a suitable barrier to ensure backend will see requests. */
-	queue->rx.req_prod_pvt = req_prod + i;
- push:
 	RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&queue->rx, notify);
 	if (notify)
 		notify_remote_via_irq(queue->rx_irq);
@@ -1070,13 +1042,6 @@ err:
 
 	work_done -= handle_incoming_queue(queue, &rxq);
 
-	/* If we get a callback with very few responses, reduce fill target. */
-	/* NB. Note exponential increase, linear decrease. */
-	if (((queue->rx.req_prod_pvt - queue->rx.sring->rsp_prod) >
-	     ((3*queue->rx_target) / 4)) &&
-	    (--queue->rx_target < queue->rx_min_target))
-		queue->rx_target = queue->rx_min_target;
-
 	xennet_alloc_rx_buffers(queue);
 
 	if (work_done < budget) {
@@ -1396,13 +1361,6 @@ static int netfront_probe(struct xenbus_device *dev,
 		goto fail;
 	}
 
-	err = xennet_sysfs_addif(info->netdev);
-	if (err) {
-		unregister_netdev(info->netdev);
-		pr_warn("%s: add sysfs failed err=%d\n", __func__, err);
-		goto fail;
-	}
-
 	return 0;
 
  fail:
@@ -1644,9 +1602,6 @@ static int xennet_init_queue(struct netfront_queue *queue)
 	spin_lock_init(&queue->rx_lock);
 
 	skb_queue_head_init(&queue->rx_batch);
-	queue->rx_target     = RX_DFL_MIN_TARGET;
-	queue->rx_min_target = RX_DFL_MIN_TARGET;
-	queue->rx_max_target = RX_MAX_TARGET;
 
 	init_timer(&queue->rx_refill_timer);
 	queue->rx_refill_timer.data = (unsigned long)queue;
@@ -1670,7 +1625,7 @@ static int xennet_init_queue(struct netfront_queue *queue)
 	}
 
 	/* A grant for every tx ring slot */
-	if (gnttab_alloc_grant_references(TX_MAX_TARGET,
+	if (gnttab_alloc_grant_references(NET_TX_RING_SIZE,
 					  &queue->gref_tx_head) < 0) {
 		pr_alert("can't alloc tx grant refs\n");
 		err = -ENOMEM;
@@ -1678,7 +1633,7 @@ static int xennet_init_queue(struct netfront_queue *queue)
 	}
 
 	/* A grant for every rx ring slot */
-	if (gnttab_alloc_grant_references(RX_MAX_TARGET,
+	if (gnttab_alloc_grant_references(NET_RX_RING_SIZE,
 					  &queue->gref_rx_head) < 0) {
 		pr_alert("can't alloc rx grant refs\n");
 		err = -ENOMEM;
@@ -2145,161 +2100,6 @@ static const struct ethtool_ops xennet_ethtool_ops =
 	.get_strings = xennet_get_strings,
 };
 
-#ifdef CONFIG_SYSFS
-static ssize_t show_rxbuf_min(struct device *dev,
-			      struct device_attribute *attr, char *buf)
-{
-	struct net_device *netdev = to_net_dev(dev);
-	struct netfront_info *info = netdev_priv(netdev);
-	unsigned int num_queues = netdev->real_num_tx_queues;
-
-	if (num_queues)
-		return sprintf(buf, "%u\n", info->queues[0].rx_min_target);
-	else
-		return sprintf(buf, "%u\n", RX_MIN_TARGET);
-}
-
-static ssize_t store_rxbuf_min(struct device *dev,
-			       struct device_attribute *attr,
-			       const char *buf, size_t len)
-{
-	struct net_device *netdev = to_net_dev(dev);
-	struct netfront_info *np = netdev_priv(netdev);
-	unsigned int num_queues = netdev->real_num_tx_queues;
-	char *endp;
-	unsigned long target;
-	unsigned int i;
-	struct netfront_queue *queue;
-
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
-	target = simple_strtoul(buf, &endp, 0);
-	if (endp == buf)
-		return -EBADMSG;
-
-	if (target < RX_MIN_TARGET)
-		target = RX_MIN_TARGET;
-	if (target > RX_MAX_TARGET)
-		target = RX_MAX_TARGET;
-
-	for (i = 0; i < num_queues; ++i) {
-		queue = &np->queues[i];
-		spin_lock_bh(&queue->rx_lock);
-		if (target > queue->rx_max_target)
-			queue->rx_max_target = target;
-		queue->rx_min_target = target;
-		if (target > queue->rx_target)
-			queue->rx_target = target;
-
-		xennet_alloc_rx_buffers(queue);
-
-		spin_unlock_bh(&queue->rx_lock);
-	}
-	return len;
-}
-
-static ssize_t show_rxbuf_max(struct device *dev,
-			      struct device_attribute *attr, char *buf)
-{
-	struct net_device *netdev = to_net_dev(dev);
-	struct netfront_info *info = netdev_priv(netdev);
-	unsigned int num_queues = netdev->real_num_tx_queues;
-
-	if (num_queues)
-		return sprintf(buf, "%u\n", info->queues[0].rx_max_target);
-	else
-		return sprintf(buf, "%u\n", RX_MAX_TARGET);
-}
-
-static ssize_t store_rxbuf_max(struct device *dev,
-			       struct device_attribute *attr,
-			       const char *buf, size_t len)
-{
-	struct net_device *netdev = to_net_dev(dev);
-	struct netfront_info *np = netdev_priv(netdev);
-	unsigned int num_queues = netdev->real_num_tx_queues;
-	char *endp;
-	unsigned long target;
-	unsigned int i = 0;
-	struct netfront_queue *queue = NULL;
-
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
-	target = simple_strtoul(buf, &endp, 0);
-	if (endp == buf)
-		return -EBADMSG;
-
-	if (target < RX_MIN_TARGET)
-		target = RX_MIN_TARGET;
-	if (target > RX_MAX_TARGET)
-		target = RX_MAX_TARGET;
-
-	for (i = 0; i < num_queues; ++i) {
-		queue = &np->queues[i];
-		spin_lock_bh(&queue->rx_lock);
-		if (target < queue->rx_min_target)
-			queue->rx_min_target = target;
-		queue->rx_max_target = target;
-		if (target < queue->rx_target)
-			queue->rx_target = target;
-
-		xennet_alloc_rx_buffers(queue);
-
-		spin_unlock_bh(&queue->rx_lock);
-	}
-	return len;
-}
-
-static ssize_t show_rxbuf_cur(struct device *dev,
-			      struct device_attribute *attr, char *buf)
-{
-	struct net_device *netdev = to_net_dev(dev);
-	struct netfront_info *info = netdev_priv(netdev);
-	unsigned int num_queues = netdev->real_num_tx_queues;
-
-	if (num_queues)
-		return sprintf(buf, "%u\n", info->queues[0].rx_target);
-	else
-		return sprintf(buf, "0\n");
-}
-
-static struct device_attribute xennet_attrs[] = {
-	__ATTR(rxbuf_min, S_IRUGO|S_IWUSR, show_rxbuf_min, store_rxbuf_min),
-	__ATTR(rxbuf_max, S_IRUGO|S_IWUSR, show_rxbuf_max, store_rxbuf_max),
-	__ATTR(rxbuf_cur, S_IRUGO, show_rxbuf_cur, NULL),
-};
-
-static int xennet_sysfs_addif(struct net_device *netdev)
-{
-	int i;
-	int err;
-
-	for (i = 0; i < ARRAY_SIZE(xennet_attrs); i++) {
-		err = device_create_file(&netdev->dev,
-					   &xennet_attrs[i]);
-		if (err)
-			goto fail;
-	}
-	return 0;
-
- fail:
-	while (--i >= 0)
-		device_remove_file(&netdev->dev, &xennet_attrs[i]);
-	return err;
-}
-
-static void xennet_sysfs_delif(struct net_device *netdev)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(xennet_attrs); i++)
-		device_remove_file(&netdev->dev, &xennet_attrs[i]);
-}
-
-#endif /* CONFIG_SYSFS */
-
 static const struct xenbus_device_id netfront_ids[] = {
 	{ "vif" },
 	{ "" }
@@ -2317,8 +2117,6 @@ static int xennet_remove(struct xenbus_device *dev)
 
 	xennet_disconnect_backend(info);
 
-	xennet_sysfs_delif(info->netdev);
-
 	unregister_netdev(info->netdev);
 
 	for (i = 0; i < num_queues; ++i) {
-- 
1.7.10.4

^ permalink raw reply related

* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
From: Jamal Hadi Salim @ 2014-10-02 13:02 UTC (permalink / raw)
  To: Dave Taht, Jesper Dangaard Brouer
  Cc: Tom Herbert, David Miller, Linux Netdev List, Eric Dumazet,
	Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
	Alexander Duyck, John Fastabend, Toke Høiland-Jørgensen
In-Reply-To: <CAA93jw5orK2FeD20zcFpXGkqGa=kF8E-vGtkdUs7vr54R5AE6g@mail.gmail.com>

On 10/02/14 01:18, Dave Taht wrote:

> I am not huge on averages. Network theorists tend to think about things in
> terms of fluid models. Van Jacobson's analogy of a water fountain's
> operation is very profound...
>
> While it is nearly impossible for a conventional Van Neuman time sliced
> CPU + network to actually act that way, things like BQL and dedicated
> pipelining systems like those in DPDK are getting closer to that ideal.
>
> An example of where averages let you down is on the classic 5 minute
> data reduction things like mrtg do, where you might see `60% of the
> bandwidth (capacity/5 minutes) in use, yet still see drops because
> over shorter intervals (capacity/10ms) you have bursts arriving.
>

I think in this case, averages makes sense because we are measuring
resource utilization - CPU (ab)use. Assuming it costs more when
we dont bulk and we make up for it when we bulk, then over the
measurement period we can find if it is an overall win.
Same thing with bandwidth utilization - instantenous bursts dont tell
you the overall utilization (so a double leaky bucket gives you
a better picture).

cheers,
jamal

^ permalink raw reply

* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
From: Jamal Hadi Salim @ 2014-10-02 12:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tom Herbert, David Miller, Linux Netdev List, Eric Dumazet,
	Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
	Alexander Duyck, John Fastabend, Dave Taht,
	Toke Høiland-Jørgensen
In-Reply-To: <20141001223229.6cbaac07@redhat.com>

On 10/01/14 16:32, Jesper Dangaard Brouer wrote:
> On Wed, 01 Oct 2014 16:05:31 -0400

> I'll try to make it more explicit.
> Will resubmit patchset shortly...
>

Thanks for providing the clarity.

> Notice it is not difficult cause a queue to form, but it is tricky (not
> difficult) to correctly test this patchset.  Perhaps you misread my
> statement earlier as "it was difficult to test and cause a queue to form"?
>

The conflict maybe what "difficult" or "common" means.
I know from experience that it is difficult under *normal*
circumstances to create the overload. Example you had to turn
off GSO/TSO to see it for 10G ;-> iow you had to go out of your way to
turn off a useful feature. So you are no longer dealing with "common".
Which is fine by me - I just wanted you to specify that was the
case. I also wanted you to run the testcase which said "this is
how things were before my patch" for 10G (with GSO/TSO). But i
dont think you are in the mood for that and if your patch goes
in, then the reaction to any regression to that wouldnt take long.
If something breaks people would start whining in a short period.
So I am ok with it.


> I think you could read this blog in 30 sec:
>   http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html
>

Ok, yes that message was clear in 30 seconds or less ;-> Immediate
gratification.
It is what i would do (havent tried tweaking c states) - but off the
bat, I would do what you did when i want to run serious networking.

> My cover letter and testing section... will take you longer that 30
> sec, it have grown quite large (and Eric will not even read it :-P ;-))
>

But it is referenced forever - so i can go back and read it. A url
may return a 404 in 5 years.

> Believe or not, I've actually restricted and reduced the testing
> section.

I agree there is an upper bound to how much testing you can do.
Looking forward to seeing a paper with all the nice details.

cheers,
jamal

^ permalink raw reply

* Re: [PATCH v2 net-next] mlx4: optimize xmit path
From: Amir Vadai @ 2014-10-02 12:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Or Gerlitz, Alexei Starovoitov, David S. Miller,
	Jesper Dangaard Brouer, Eric Dumazet, John Fastabend,
	Linux Netdev List, Or Gerlitz, amira, idos, Yevgeny Petrilin,
	eyalpe
In-Reply-To: <1412251620.16704.88.camel@edumazet-glaptop2.roam.corp.google.com>

On 10/2/2014 3:07 PM, Eric Dumazet wrote:
> On Thu, 2014-10-02 at 14:56 +0300, Amir Vadai wrote:
>> After making sure the sender thread and the TX completions are not on
>> the same CPU, I see the expected improvement. +0.5Mpps with tx
>> optimizations.
> 
> I got 40% here.
> 
> Hmm... both cpus are on the same socket, right ?
Yes. And on a NUMA node close to the NIC

> 
> TX coalescing is properly setup ?
I will try to play with it, if it is too aggressive, the queue is stopped.
I will play again with it and the ring size to find a better spot.

> 
> What interrupt rate do you get ?
~50K per second.

> 
> You can take a look at where cycles are spent
> 
> perf record -a -g sleep 5
> perf report
> 
> 
I will.

In case your curious, I have this on the CPU doing completions:
+  44.86%      swapper  [kernel.kallsyms]  [k] consume_skb
+  21.38%      swapper  [kernel.kallsyms]  [k] mlx4_en_poll_tx_cq
+   6.81%      swapper  [kernel.kallsyms]  [k] mlx4_en_free_tx_desc.isra.24
+   6.07%      swapper  [kernel.kallsyms]  [k] intel_idle
+   2.98%      swapper  [kernel.kallsyms]  [k] __dev_kfree_skb_any
+   2.66%         rngd  rngd               [.] 0x0000000000002749
+   1.79%         rngd  [kernel.kallsyms]  [k] consume_skb
+   1.28%      swapper  [kernel.kallsyms]  [k] irq_entries_start
+   1.16%         rngd  [kernel.kallsyms]  [k] mlx4_en_poll_tx_cq
+   0.86%      swapper  [kernel.kallsyms]  [k] swiotlb_unmap_page
+   0.79%      swapper  [kernel.kallsyms]  [k] unmap_single
+   0.63%      swapper  [kernel.kallsyms]  [k] irqtime_account_irq
+   0.55%      swapper  [kernel.kallsyms]  [k] mlx4_eq_int
+   0.51%      swapper  [kernel.kallsyms]  [k] eq_set_ci.isra.14

and this on the xmit thread:
+  72.00%  kpktgend_6  [kernel.kallsyms]  [k] mlx4_en_xmit
+  13.11%  kpktgend_6  [kernel.kallsyms]  [k] pktgen_thread_worker
+   5.34%  kpktgend_6  [kernel.kallsyms]  [k] _raw_spin_lock
+   3.80%  kpktgend_6  [kernel.kallsyms]  [k] swiotlb_map_page
+   2.08%  kpktgend_6  [kernel.kallsyms]  [k] __iowrite64_copy
+   1.21%  kpktgend_6  [kernel.kallsyms]  [k] skb_clone_tx_timestamp
+   0.87%  kpktgend_6  [kernel.kallsyms]  [k] kthread_should_stop
+   0.64%  kpktgend_6  [kernel.kallsyms]  [k] swiotlb_dma_mapping_error
+   0.51%  kpktgend_6  [kernel.kallsyms]  [k] __local_bh_enable_ip

Thanks,
Amir

^ permalink raw reply

* Re: netlink NETLINK_ROUTE  failure & Can the kernel really handle IPv6 properly
From: Ulf samuelsson @ 2014-10-02 12:38 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Linux Netdev List
In-Reply-To: <1412199032.2532289.174101717.2D6510D0@webmail.messagingengine.com>

This is the significant code, and it is wrong.

static struct notifier_block my_ipv6_address_notifier =
 {
   my_ipv6_address_notifier_cb,
   NULL,
   0
 };

register_inet6addr_notifier (&my_ipv6_address_notifier );

int
my_ipv6_address_notifier_cb (struct notifier_block *self,
                                unsigned long event, void *val)
{
 struct inet6_ifaddr *ifaddr = (struct inet6_ifaddr *)val;


 /* We are only interested in address add/delete events */
 /* IPv6 address add comes as NETDEV_UP and delete comes as
  * NETDEV_DOWN
  */
 if ((event != NETDEV_UP) && (event != NETDEV_DOWN))
   return ret;

 if (ifaddr == NULL)
   return ret;
 /* Now that we are sure that it is a IPv6 address being added deleted,
  * verify that it is a link local address.
  */
 if (!IPV6_IS_ADDR_LINKLOCAL (&ifaddr->addr))
   {
     return ret;
   }
 ...
 send_message_to_app(LINK_LOCAL_UP, ip);
 ...
 return ret;
}


Application tries to send message to "ip" and fails, because the link-local adress is still
in "tentative state"

Best Regards
Ulf Samuelsson
ulf@emagii.com
+46  (722) 427 437


> 1 okt 2014 kl. 23:30 skrev Hannes Frederic Sowa <hannes@stressinduktion.org>:
> 
> Hello,
> 
>> On Wed, Oct 1, 2014, at 22:28, Ulf Samuelsson wrote:
>> BTW, the problem I am trying to solve is how to connect to an I/F with 
>> an IPv6 link-local address.
>> 
>> An existing kernel module waits for a NETDEV_UP event, and then tries to 
>> communicate
>> with the link-local address.
>> 
>> This will fail, because (according to a colleague) the I/F enters a 
>> "tentative" state,
>> where it is trying to decide if it is unique or not.
>> It will remain in that state for 1-2 seconds, and only afterwards is the 
>> link-local address
>> available for normal use.
>> 
>> The guys writing the module, claim that the kernel is using NETDEV_UP.
>> There is very little code in the kernel using NETLINK_ROUTE, even in 
>> latest stable.
>> It is using NETDEV_UP.
>> 
>> If my colleague is right, the kernel really cannot handle IPv6 
>> link-local addresses properly.
> 
> Sorry, I cannot really follow you, can you send example code or be a bit
> more precise?
> 
> Thanks,
> Hannes
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] wil6210: remove obsolete msm platform code
From: Arnd Bergmann @ 2014-10-02 12:11 UTC (permalink / raw)
  To: Vladimir Kondratiev
  Cc: John W. Linville, wil6210, netdev, linux-wireless,
	linux-arm-kernel
In-Reply-To: <1572828.LhiMXjHRdV@lx-wigig-72>

On Thursday 02 October 2014 14:29:06 Vladimir Kondratiev wrote:
> On Tuesday, September 30, 2014 05:48:20 PM Arnd Bergmann wrote:
> > The wil6210 driver has a glue layer for the Qualcomm MSM platform
> > code, which apparently could never build and is unlikely to
> > ever be able to in the future, as the MSM platform is being phased
> > out in favor of the new QCOM platform, and the driver relies
> > on out-of-tree infrastructure. Trying to build it currently
> > results in this error:
> > 
> > drivers/net/wireless/ath/wil6210/wil_platform_msm.c:19:27: fatal error: linux/msm-bus.h: No such file or directory
> > 
> > Removing the driver fixes the build and makes it possible to
> > build an allmodconfig kernel for MSM.
> > 
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > 
> 
> We are investigating this, looking for the appropriate solution.
> Thanks for pointing to this.

FWIW, I took a closer look at the actual driver code now. It's
actually worse than I thought and I don't see any possible way
out other than removing this driver completely and starting a
new one, if you want to use it for mach-qcom. I hadn't realized
before that this is actually new code rather than ancient bitrot
that suddenly started causing problems.

Some of the worst problems are:

- the driver scans for DT properties that are completely
  undocumented and have not been reviewed

- if they had been reviewed, the first comment you would have
  received is that the binding makes no sense: you scan for
  another node rather than taking the node you already have for
  the PCI device, then pass data into a random soc-specific
  driver API that is not part of the upstream kernel.

- as mentioned in the my patch above, the code has never been
  compiled on a mainline kernel and now breaks builds.

I think the only way forward is to have the platform support for
whatever chip this is meant for reviewed and merged first and
then you can write a new glue driver for the interfaces we end
up putting into the kernel.
My guess is that with proper interfaces between the platform
bus driver and the wil driver glue, there would be very little
left in terms of interface and you can just call a generic
interface from the PCI driver, passing the 'struct device'
for the PCI dev in to an exported interface when provided.

	Arnd

^ permalink raw reply

* Re: [PATCH v2 net-next] mlx4: optimize xmit path
From: Eric Dumazet @ 2014-10-02 12:07 UTC (permalink / raw)
  To: Amir Vadai
  Cc: Or Gerlitz, Alexei Starovoitov, David S. Miller,
	Jesper Dangaard Brouer, Eric Dumazet, John Fastabend,
	Linux Netdev List, Or Gerlitz, amira, idos, Yevgeny Petrilin,
	eyalpe
In-Reply-To: <542D3D5F.5090900@mellanox.com>

On Thu, 2014-10-02 at 14:56 +0300, Amir Vadai wrote:
> After making sure the sender thread and the TX completions are not on
> the same CPU, I see the expected improvement. +0.5Mpps with tx
> optimizations.

I got 40% here.

Hmm... both cpus are on the same socket, right ?

TX coalescing is properly setup ?

What interrupt rate do you get ?

You can take a look at where cycles are spent

perf record -a -g sleep 5
perf report

^ permalink raw reply

* Re: [PATCH v2 net-next] mlx4: optimize xmit path
From: Amir Vadai @ 2014-10-02 11:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Or Gerlitz, Alexei Starovoitov, David S. Miller,
	Jesper Dangaard Brouer, Eric Dumazet, John Fastabend,
	Linux Netdev List, Or Gerlitz, amira, idos, Yevgeny Petrilin,
	eyalpe
In-Reply-To: <1412250327.16704.84.camel@edumazet-glaptop2.roam.corp.google.com>

On 10/2/2014 2:45 PM, Eric Dumazet wrote:
> On Thu, 2014-10-02 at 11:03 +0300, Amir Vadai wrote:
> 
>> Hi,
>>
>> Will take it into the split patchset - we just hit this bug when tried
>> to run benchmarks with blueflame disabled (easy to test by using ethtool
>> priv flag blueflame).
> 
> Hmm, I do not know this ethtool command, please share ;)

$ ethtool --set-priv-flags eth0 blueflame off

> 
>>
>> I'm still working on it, but I can't reproduce the numbers that you
>> show. On my development machine, I get ~5.5Mpps with burst=8 and ~2Mpps
>> with burst=1.
> 
> You have to be careful with the 'clone X' : If you choose a too big
> value, TX completion competes with the sender thread.
> 
>>
>> In addition, I see no improvements when adding the optimization to the
>> xmit path.
After making sure the sender thread and the TX completions are not on
the same CPU, I see the expected improvement. +0.5Mpps with tx
optimizations.

>> I use the net-next kernel + pktgen burst support patch, with and without
>> this xmit path optimization patch.
>>
>> Do you use other patches not upstream in your environment?
> 
> Nope, this is with David net-next tree.
> 
>> Can you share the .config/pktgen configuration?
> 
> Sure.
> 
>>
>> One other note: we're checking now that blueflame could be used with
>> xmit_more. It might result with packets reordering/drops. Still under
>> investigation.
> 
> I noticed no reorders. I tweaked the stack to force a gso segmentation
> (in software) instead of using NIC TSO for small packets (2 or 3 MSS)
> 
> 200 concurrent netperf -t TCP_RR -- -r 2000,2000    performance was
> increased by ~100%.
> 
> 
> #!/bin/bash
> #
> # on the destination, drop packets with
> #   iptables -A PREROUTING -t raw -p udp --dport 9 -j DROP
> #   Or run a recent enough kernel with global ICMP rate limiting to 1000 packets/sec
> #   ( http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=4cdf507d54525842dfd9f6313fdafba039084046 )
> #
> #### Configure
> 
> # Yeah, if you use PKTSIZE <= 104, performance is lower because of inline (copy whole frame content into tx desc)
> PKTSIZE=105
You can also set the module parameter to turn it off:
$ modprobe mlx4_en inline_thold=17

> 
[...]

Thanks,
Amir

^ permalink raw reply

* Re: [PATCH v2 net-next] mlx4: optimize xmit path
From: Eric Dumazet @ 2014-10-02 11:45 UTC (permalink / raw)
  To: Amir Vadai
  Cc: Or Gerlitz, Alexei Starovoitov, David S. Miller,
	Jesper Dangaard Brouer, Eric Dumazet, John Fastabend,
	Linux Netdev List, Or Gerlitz, amira, idos, Yevgeny Petrilin,
	eyalpe
In-Reply-To: <542D06C1.6090802@mellanox.com>

On Thu, 2014-10-02 at 11:03 +0300, Amir Vadai wrote:

> Hi,
> 
> Will take it into the split patchset - we just hit this bug when tried
> to run benchmarks with blueflame disabled (easy to test by using ethtool
> priv flag blueflame).

Hmm, I do not know this ethtool command, please share ;)

> 
> I'm still working on it, but I can't reproduce the numbers that you
> show. On my development machine, I get ~5.5Mpps with burst=8 and ~2Mpps
> with burst=1.

You have to be careful with the 'clone X' : If you choose a too big
value, TX completion competes with the sender thread.

> 
> In addition, I see no improvements when adding the optimization to the
> xmit path.
> I use the net-next kernel + pktgen burst support patch, with and without
> this xmit path optimization patch.
> 
> Do you use other patches not upstream in your environment?

Nope, this is with David net-next tree.

> Can you share the .config/pktgen configuration?

Sure.

> 
> One other note: we're checking now that blueflame could be used with
> xmit_more. It might result with packets reordering/drops. Still under
> investigation.

I noticed no reorders. I tweaked the stack to force a gso segmentation
(in software) instead of using NIC TSO for small packets (2 or 3 MSS)

200 concurrent netperf -t TCP_RR -- -r 2000,2000    performance was
increased by ~100%.


#!/bin/bash
#
# on the destination, drop packets with
#   iptables -A PREROUTING -t raw -p udp --dport 9 -j DROP
#   Or run a recent enough kernel with global ICMP rate limiting to 1000 packets/sec
#   ( http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=4cdf507d54525842dfd9f6313fdafba039084046 )
#
#### Configure

# Yeah, if you use PKTSIZE <= 104, performance is lower because of inline (copy whole frame content into tx desc)
PKTSIZE=105

echo "pktrate: $PKTRATE"

COUNT=20000000000
RUN_SECS=60
SRC_DEV=eth0
SRC_IP_MIN=7.0.0.1
SRC_IP_MAX=7.255.255.255
SRC_MAC=00:1a:11:c3:0d:7f 
DST_IP=10.246.7.152 
DST_MAC=00:1a:11:c3:0d:45 
DST_UDP=9

## END OF CONFIGURATION OPTIONS

#### Helper

## Configuration procfs inodes

DEV_INODE=/proc/net/pktgen/$SRC_DEV
MAIN_INODE=/proc/net/pktgen/pgctrl
THREAD_INODE=/proc/net/pktgen/kpktgend_2

# write to a procfs file
function pgset_ex()
{
  local result

	echo $2
  echo $2 > $1

  result=`cat $1 | fgrep "Result: OK:"`
  if [ "$result" = "" ]; then
       cat $1 | fgrep Result:
  fi
}


#### Pre: configure

# attach device exclusively
pgset_ex $THREAD_INODE "rem_device_all"
pgset_ex $THREAD_INODE "add_device $SRC_DEV"

# configure basics
pgset_ex $DEV_INODE "clone_skb 8"
pgset_ex $DEV_INODE "src_min $SRC_IP_MIN"
pgset_ex $DEV_INODE "src_max $SRC_IP_MAX"
pgset_ex $DEV_INODE "dst $DST_IP"
pgset_ex $DEV_INODE "dst_mac $DST_MAC"
pgset_ex $DEV_INODE "udp_dst_min $DST_UDP"
pgset_ex $DEV_INODE "udp_dst_max $DST_UDP"
pgset_ex $DEV_INODE "queue_map_min 0"
pgset_ex $DEV_INODE "queue_map_max 0"
pgset_ex $DEV_INODE "burst 8"

pgset_ex $DEV_INODE "pkt_size $PKTSIZE"

 pgset_ex $DEV_INODE "delay 0"

# reset to continuous transmission
pgset_ex $DEV_INODE "count $COUNT"


#### Run: transmit

echo -e "UDP packet generator (based on linux pktgen)\n"
echo -e "  src:  mac=$SRC_MAC ip=$SRC_IP dev=$SRC_DEV"
echo -e "  dest: mac=$DST_MAC ip=$DST_IP port=$DST_UDP\n"

modprobe pktgen

#ethtool -C eth0 tx-usecs 16 tx-frames 16
#ethtool -C eth1 tx-usecs 16 tx-frames 16

# start thread(s)
# the write will block until Ctrl^C is pressed or a timeout kills the write
echo "Running for $RUN_SECS seconds"
#pgset_ex $MAIN_INODE "start"
echo "start" > $MAIN_INODE 2>/dev/null &


  sleep $RUN_SECS
 echo $DEV_INODE 
 cat $DEV_INODE

# stop
kill $!
pgset_ex $MAIN_INODE "stop"

echo "OK. All done"

^ permalink raw reply

* Re: [PATCH net-next] net: Cleanup skb cloning  by adding SKB_FCLONE_FREE
From: Eric Dumazet @ 2014-10-02 11:35 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: netdev, davem, edumazet
In-Reply-To: <1412231620-3767-1-git-send-email-subramanian.vijay@gmail.com>

On Wed, 2014-10-01 at 23:33 -0700, Vijay Subramanian wrote:
> SKB_FCLONE_UNAVAILABLE has overloaded meaning depending on type of skb.
> 1: If skb is allocated from head_cache, it indicates fclone is not available.
> 2: If skb is a companion fclone skb (allocated from fclone_cache), it indicates
> it is available to be used.
> 
> To avoid confusion for case 2 above, this patch  replaces
> SKB_FCLONE_UNAVAILABLE with SKB_FCLONE_FREE where appropriate. For fclone
> companion skbs, this indicates it is free for use.
> 
> SKB_FCLONE_UNAVAILABLE will now simply indicate skb is from head_cache and
> cannot / will not have a companion fclone.
> 
> Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
> ---
>  include/linux/skbuff.h | 3 ++-
>  net/core/skbuff.c      | 8 ++++----
>  2 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 7c5036d..6c3fb9a 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -339,9 +339,10 @@ struct skb_shared_info {
>  
> 
>  enum {
> -	SKB_FCLONE_UNAVAILABLE,
> +	SKB_FCLONE_UNAVAILABLE,	/* skb has no fclone */
>  	SKB_FCLONE_ORIG,
>  	SKB_FCLONE_CLONE,
> +	SKB_FCLONE_FREE,	/* this fclone skb is available */
>  };
>  
Please comment all the states, now there is no ambiguity ?

^ permalink raw reply

* RE: [patch 1/2 -next] cxgb4: clean up a type issue
From: David Laight @ 2014-10-02 11:31 UTC (permalink / raw)
  To: 'Dan Carpenter', Hariprasad S
  Cc: netdev@vger.kernel.org, kernel-janitors@vger.kernel.org
In-Reply-To: <20141002112219.GA25606@mwanda>

From: Dan Carpenter
> The tx_desc struct hold 8 __be64 values.  The original code took a
> tx_desc pointer then casted it to an int pointer and then casted it to a
> u64 pointer.  It was confusing and triggered some static checker
> warnings.
> 
> I have changed the cxgb_pio_copy() to only take tx_desc pointers.  This
> isn't really a loss of flexibility because anything else was buggy to
> begin with.
> 
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> 
> diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
> index bb7851e..599cdfd 100644
> --- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
> +++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
> @@ -854,9 +854,10 @@ static void write_sgl(const struct sk_buff *skb, struct sge_txq *q,
>   * memory mapped BAR2 space(user space writes).
>   * For coalesced WR SGE, fetches data from the FIFO instead of from Host.
>   */
> -static void cxgb_pio_copy(u64 __iomem *dst, u64 *src)
> +static void cxgb_pio_copy(u64 __iomem *dst, struct tx_desc *desc)
>  {
> -	int count = 8;
> +	int count = sizeof(*desc) / sizeof(u64);
> +	u64 *src = (u64 *)desc;
> 
>  	while (count) {
>  		writeq(*src, dst);
> @@ -914,12 +915,11 @@ static inline void ring_tx_db(struct adapter *adap, struct sge_txq *q, int n)
>  			int index = (q->pidx
>  				     ? (q->pidx - 1)
>  				     : (q->size - 1));
> -			unsigned int *wr = (unsigned int *)&q->desc[index];
> +			struct tx_desc *desc = &q->desc[index];
> 
>  			cxgb_pio_copy((u64 __iomem *)
>  				      (adap->bar2 + q->udb +
> -				       SGE_UDB_WCDOORBELL),
> -				      (u64 *)wr);
> +				       SGE_UDB_WCDOORBELL), desc);

Why not put &q->desc[index] in the call itself.
And I can't help think there are better places to break the long line.
Getting it to the code below would be nice:
			cxgp_pio_copy(adap->bar2 + q->udb + SGE_UDB_WCDOORBELL,
					  q->desc + index);

	David

>  		} else {
>  			writel(val,  adap->bar2 + q->udb + SGE_UDB_KDOORBELL);
>  		}
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox