Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2 3/6] net: stmmac: stmmac_platform: fix parsing of DT binding
From: Niklas Cassel @ 2016-12-05 16:12 UTC (permalink / raw)
  To: Giuseppe Cavallaro, Alexandre Torgue; +Cc: netdev, linux-kernel
In-Reply-To: <1480929039-20257-4-git-send-email-niklass@axis.com>

On 12/05/2016 10:10 AM, Niklas Cassel wrote:
> From: Niklas Cassel <niklas.cassel@axis.com>
>
> commit 64c3b252e9fc ("net: stmmac: fixed the pbl setting with DT")
> changed the parsing of the DT binding.
>
> Before 64c3b252e9fc, snps,fixed-burst and snps,mixed-burst were parsed
> regardless if the property snps,pbl existed or not.
> After the commit, fixed burst and mixed burst are only parsed if
> snps,pbl exists. Now when snps,aal has been added, it too is only
> parsed if snps,pbl exists.
>
> Since the DT binding does not specify that fixed burst, mixed burst
> or aal depend on snps,pbl being specified, undo changes introduced
> by 64c3b252e9fc.
>
> The issue commit 64c3b252e9fc ("net: stmmac: fixed the pbl setting with
> DT") tries to address is solved in another way:
> The databook specifies that all values other than
> 1, 2, 4, 8, 16, or 32 results in undefined behavior,
> so snps,pbl = <0> is invalid.
>
> If pbl is 0 after parsing, set pbl to DEFAULT_DMA_PBL.
> This handles the case where the property is omitted, and also handles
> the case where the property is specified without any data.
>
> Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
> ---
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  |  3 +++
>  .../net/ethernet/stmicro/stmmac/stmmac_platform.c  | 27 +++++++++++-----------
>  2 files changed, 16 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index b1e42ddf0370..2298c4d109fb 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -1586,6 +1586,9 @@ static int stmmac_init_dma_engine(struct stmmac_priv *priv)
>  		return -EINVAL;
>  	}
>  
> +	if (!priv->plat->dma_cfg->pbl)
> +		priv->plat->dma_cfg->pbl = DEFAULT_DMA_PBL;
> +

Perhaps it is cleaner to have this "fallback to a default value" in
stmmac_probe_config_dt, right after
of_property_read_u32(np, "snps,pbl", &dma_cfg->pbl),
since the only other place where the pbl is set, is from PCI glue.

The PCI glue should set a correct pbl value, and shouldn't need to
rely on a default value.
The code in stmmac_init_dma_engine could then, instead of
"fallback to a default value", return an error if pbl == 0.

>  	if (priv->extend_desc && (priv->mode == STMMAC_RING_MODE))
>  		atds = 1;
>  
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
> index 98bf86d64d96..59e1740479fc 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
> @@ -301,21 +301,20 @@ stmmac_probe_config_dt(struct platform_device *pdev, const char **mac)
>  		plat->force_sf_dma_mode = 1;
>  	}
>  
> -	if (of_find_property(np, "snps,pbl", NULL)) {
> -		dma_cfg = devm_kzalloc(&pdev->dev, sizeof(*dma_cfg),
> -				       GFP_KERNEL);
> -		if (!dma_cfg) {
> -			stmmac_remove_config_dt(pdev, plat);
> -			return ERR_PTR(-ENOMEM);
> -		}
> -		plat->dma_cfg = dma_cfg;
> -		of_property_read_u32(np, "snps,pbl", &dma_cfg->pbl);
> -		dma_cfg->aal = of_property_read_bool(np, "snps,aal");
> -		dma_cfg->fixed_burst =
> -			of_property_read_bool(np, "snps,fixed-burst");
> -		dma_cfg->mixed_burst =
> -			of_property_read_bool(np, "snps,mixed-burst");
> +	dma_cfg = devm_kzalloc(&pdev->dev, sizeof(*dma_cfg),
> +			       GFP_KERNEL);
> +	if (!dma_cfg) {
> +		stmmac_remove_config_dt(pdev, plat);
> +		return ERR_PTR(-ENOMEM);
>  	}
> +	plat->dma_cfg = dma_cfg;
> +
> +	of_property_read_u32(np, "snps,pbl", &dma_cfg->pbl);
> +
> +	dma_cfg->aal = of_property_read_bool(np, "snps,aal");
> +	dma_cfg->fixed_burst = of_property_read_bool(np, "snps,fixed-burst");
> +	dma_cfg->mixed_burst = of_property_read_bool(np, "snps,mixed-burst");
> +
>  	plat->force_thresh_dma_mode = of_property_read_bool(np, "snps,force_thresh_dma_mode");
>  	if (plat->force_thresh_dma_mode) {
>  		plat->force_sf_dma_mode = 0;

^ permalink raw reply

* Re: [PATCH] Fixed to checkpatch.pl errors to vlan_dev.c
From: Stefan Schmidt @ 2016-12-05 15:48 UTC (permalink / raw)
  To: Ozgur Karatas, kaber, David Miller; +Cc: netdev, linux-kernel
In-Reply-To: <52561480952041@web13j.yandex.ru>

Hello.

On 05/12/16 16:34, Ozgur Karatas wrote:
> Hello all,
>
> I will solve a checkpatch errors.
>
> Signed-off-by: Ozgur Karatas <ozgur@member.fsf.org>

The patch itself looks good, but please have a read about having a good 
commit message. I would suggest reading Documentation/SubmittingPatches 
section 14: The canonical patch format

Especially the parts about the subject line and summary.

>
> ---
>  net/8021q/vlan_dev.c                               |   2 +-
> diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
> index fbfacd5..2edb495 100644
> --- a/net/8021q/vlan_dev.c
> +++ b/net/8021q/vlan_dev.c
> @@ -738,7 +738,7 @@ static int vlan_dev_netpoll_setup(struct net_device *dev, struct netpoll_info *n
>
>  static void vlan_dev_netpoll_cleanup(struct net_device *dev)
>  {
> -       struct vlan_dev_priv *vlan= vlan_dev_priv(dev);
> +       struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
>         struct netpoll *netpoll = vlan->netpoll;
>
>         if (!netpoll)
>

regards
Stefan Schmidt

^ permalink raw reply

* Re: [PATCH net-next] liquidio: 'imply' ptp instead of 'select'
From: Nicolas Pitre @ 2016-12-05 15:44 UTC (permalink / raw)
  To: David Miller
  Cc: arnd, felix.manlunas, tglx, david.daney, satananda.burla,
	rvatsavayi, sgoutham, netdev, linux-kernel
In-Reply-To: <20161203.233945.1757084444220812603.davem@davemloft.net>

On Sat, 3 Dec 2016, David Miller wrote:

> From: Arnd Bergmann <arnd@arndb.de>
> Date: Sat,  3 Dec 2016 00:04:32 +0100
> 
> > ptp now depends on the optional POSIX_TIMERS setting and fails to build
> > if we select it without that:
> > 
> > warning: (LIQUIDIO_VF && TI_CPTS) selects PTP_1588_CLOCK which has unmet direct dependencies (NET && POSIX_TIMERS)
> > warning: (LIQUIDIO_VF && TI_CPTS) selects PTP_1588_CLOCK which has unmet direct dependencies (NET && POSIX_TIMERS)
> > ERROR: "posix_clock_unregister" [drivers/ptp/ptp.ko] undefined!
> > ERROR: "posix_clock_register" [drivers/ptp/ptp.ko] undefined!
> > ERROR: "pps_unregister_source" [drivers/ptp/ptp.ko] undefined!
> > ERROR: "pps_event" [drivers/ptp/ptp.ko] undefined!
> > ERROR: "pps_register_source" [drivers/ptp/ptp.ko] undefined!
> > 
> > It seems that two patches have collided here, the build failure
> > is a result of the combination. Changing the new option to 'imply'
> > as well fixes it.
> > 
> > Fixes: 111fc64a237f ("liquidio CN23XX: VF registration")
> > Fixes: d1cbfd771ce8 ("ptp_clock: Allow for it to be optional")
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> 
> Like the kbuild robot, when I apply this it complains about 'imply' being
> an unknown option.
> 
> I guess it worked for you because support for 'imply' exists in the -next
> tree and gets pulled in from somewhere else.

Exact.

> In any event, as-is I cannot apply this.

It should be carried in linux-next for the time being, and suggested as 
a probable "merge resolution" to Linus when submitting your tree for 
merging.  I think that's the best that can be done.


Nicolas

^ permalink raw reply

* RE: [PATCH V2 net-next] net: hns: Fix to conditionally convey RX checksum flag to stack
From: Salil Mehta @ 2016-12-05 15:43 UTC (permalink / raw)
  To: David Miller
  Cc: Zhuangyuzeng (Yisen), mehta.salil.lnk@gmail.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Linuxarm
In-Reply-To: <20161203.150905.739546865468666597.davem@davemloft.net>

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Saturday, December 03, 2016 8:09 PM
> To: Salil Mehta
> Cc: Zhuangyuzeng (Yisen); mehta.salil.lnk@gmail.com;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Linuxarm
> Subject: Re: [PATCH V2 net-next] net: hns: Fix to conditionally convey
> RX checksum flag to stack
> 
> From: Salil Mehta <salil.mehta@huawei.com>
> Date: Thu, 1 Dec 2016 12:09:22 +0000
> 
> > But maybe now since we don't have any method to de-multiplex the kind
> of
> > checksum error (cannot depend upon register) we can have below code
> > re-arrangement:
> >
> > hns_nic_rx_checksum() {
> >       /* check supported L3 protocol */
> > 	if (l3 != IPV4 && l3 != IPV6)
> > 		return;
> >       /* check if L3 protocols error */
> >       if (l3e)
> > 	 	return;
> >
> >       /* check if the packets are fragmented */
> > 	If (l3frags)
> > 		Return;
> >
> >       /* check supported L4 protocol */
> >  	if (l4 != UDP && l4 != TCP && l4 != SCTP)
> >  		return;
> >       /* check if any L4 protocol error */
> >       if (l3e)
> > 	 	return;
> >
> >       /* packet with valid checksum - covey to stack */
> >       skb->ip_summed = CHECKSUM_UNNECESSARY
> > }
> 
> This looks a lot cleaner and easier to understand.
Sure, floated Patch V3. Please have a look. Thanks!

Best regards
Salil

^ permalink raw reply

* RE: [PATCH V2 net-next] net: hns: Fix to conditionally convey RX checksum flag to stack
From: Salil Mehta @ 2016-12-05 15:42 UTC (permalink / raw)
  To: David Miller
  Cc: Zhuangyuzeng (Yisen), mehta.salil.lnk@gmail.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Linuxarm
In-Reply-To: <20161203.152515.1155017126921048869.davem@davemloft.net>

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Saturday, December 03, 2016 8:25 PM
> To: Salil Mehta
> Cc: Zhuangyuzeng (Yisen); mehta.salil.lnk@gmail.com;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Linuxarm
> Subject: Re: [PATCH V2 net-next] net: hns: Fix to conditionally convey
> RX checksum flag to stack
> 
> From: Salil Mehta <salil.mehta@huawei.com>
> Date: Thu, 1 Dec 2016 16:59:14 +0000
> 
> > It looks to me the cumbersome check in the PATCH V2 should
> > be retained.
> 
> I really want something simpler with small checks that are
> done in logical pieces in a straigtforward progression.
> 
> The code in V2 is completely unreadable.
Hi David,
I understand your concern. I have floated the Patch V3 with
re-structured code for your review. Thanks!

Best regards
Salil

^ permalink raw reply

* [PATCH V3 net-next] net: hns: Fix to conditionally convey RX checksum flag to stack
From: Salil Mehta @ 2016-12-05 15:37 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel,
	linuxarm

This patch introduces the RX checksum function to check the
status of the hardware calculated checksum and its error and
appropriately convey status to the upper stack in skb->ip_summed
field.

In hardware, we only support checksum for the following
protocols:
1) IPv4,
2) TCP(over IPv4 or IPv6),
3) UDP(over IPv4 or IPv6),
4) SCTP(over IPv4 or IPv6)
but we support many L3(IPv4, IPv6, MPLS, PPPoE etc) and
L4(TCP, UDP, GRE, SCTP, IGMP, ICMP etc.) protocols.

Hardware limitation:
Our present hardware RX Descriptor lacks L3/L4 checksum
"Status & Error" bit (which usually can be used to indicate whether
checksum was calculated by the hardware and if there was any error
encountered during checksum calculation).

Software workaround:
We do get info within the RX descriptor about the kind of
L3/L4 protocol coming in the packet and the error status. These
errors might not just be checksum errors but could be related to
version, length of IPv4, UDP, TCP etc.
Because there is no-way of knowing if it is a L3/L4 error due
to bad checksum or any other L3/L4 error, we will not (cannot)
convey hardware checksum status(CHECKSUM_UNNECESSARY) for such
cases to upper stack and will not maintain the RX L3/L4 checksum
counters as well.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
Change Log:
Patch V3: Re-structured the code.
Patch V2: Addressed the comment by David Miller
          Link: https://www.spinics.net/lists/netdev/msg406697.html
Patch V1: This patch is a result of the comments given by
          David Miller <davem@davemloft.net>
          Link: https://lkml.org/lkml/2016/6/15/27
---
 drivers/net/ethernet/hisilicon/hns/hnae.h     |    2 +
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |   73 ++++++++++++++++++++++---
 2 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hnae.h b/drivers/net/ethernet/hisilicon/hns/hnae.h
index 09602f1..8016854 100644
--- a/drivers/net/ethernet/hisilicon/hns/hnae.h
+++ b/drivers/net/ethernet/hisilicon/hns/hnae.h
@@ -99,6 +99,8 @@ enum hnae_led_state {
 #define HNS_RX_FLAG_L3ID_IPV6 0x1
 #define HNS_RX_FLAG_L4ID_UDP 0x0
 #define HNS_RX_FLAG_L4ID_TCP 0x1
+#define HNS_RX_FLAG_L4ID_SCTP 0x3
+
 
 #define HNS_TXD_ASID_S 0
 #define HNS_TXD_ASID_M (0xff << HNS_TXD_ASID_S)
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index 255fede..2296345 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -565,6 +565,68 @@ static void get_rx_desc_bnum(u32 bnum_flag, int *out_bnum)
 				   HNS_RXD_BUFNUM_M, HNS_RXD_BUFNUM_S);
 }
 
+static void hns_nic_rx_checksum(struct hns_nic_ring_data *ring_data,
+				struct sk_buff *skb, u32 flag)
+{
+	struct net_device *netdev = ring_data->napi.dev;
+	u32 l3id;
+	u32 l4id;
+
+	/* check if RX checksum offload is enabled */
+	if (unlikely(!(netdev->features & NETIF_F_RXCSUM)))
+		return;
+
+	/* In hardware, we only support checksum for the following protocols:
+	 * 1) IPv4,
+	 * 2) TCP(over IPv4 or IPv6),
+	 * 3) UDP(over IPv4 or IPv6),
+	 * 4) SCTP(over IPv4 or IPv6)
+	 * but we support many L3(IPv4, IPv6, MPLS, PPPoE etc) and L4(TCP,
+	 * UDP, GRE, SCTP, IGMP, ICMP etc.) protocols.
+	 *
+	 * Hardware limitation:
+	 * Our present hardware RX Descriptor lacks L3/L4 checksum "Status &
+	 * Error" bit (which usually can be used to indicate whether checksum
+	 * was calculated by the hardware and if there was any error encountered
+	 * during checksum calculation).
+	 *
+	 * Software workaround:
+	 * We do get info within the RX descriptor about the kind of L3/L4
+	 * protocol coming in the packet and the error status. These errors
+	 * might not just be checksum errors but could be related to version,
+	 * length of IPv4, UDP, TCP etc.
+	 * Because there is no-way of knowing if it is a L3/L4 error due to bad
+	 * checksum or any other L3/L4 error, we will not (cannot) convey
+	 * checksum status for such cases to upper stack and will not maintain
+	 * the RX L3/L4 checksum counters as well.
+	 */
+
+	/*  check L3 protocol for which checksum is supported */
+	if ((l3id != HNS_RX_FLAG_L3ID_IPV4) && (l3id != HNS_RX_FLAG_L3ID_IPV6))
+		return;
+
+	/* check for any(not just checksum)flagged L3 protocol errors */
+	if (unlikely(hnae_get_bit(flag, HNS_RXD_L3E_B)))
+		return;
+
+	/* we do not support checksum of fragmented packets */
+	if (unlikely(hnae_get_bit(flag, HNS_RXD_FRAG_B)))
+		return;
+
+	/*  check L4 protocol for which checksum is supported */
+	if ((l4id != HNS_RX_FLAG_L4ID_TCP) &&
+	    (l4id != HNS_RX_FLAG_L4ID_UDP) &&
+	    (l4id != HNS_RX_FLAG_L4ID_SCTP))
+		return;
+
+	/* check for any(not just checksum)flagged L4 protocol errors */
+	if (unlikely(hnae_get_bit(flag, HNS_RXD_L4E_B)))
+		return;
+
+	/* now, this has to be a packet with valid RX checksum */
+	skb->ip_summed = CHECKSUM_UNNECESSARY;
+}
+
 static int hns_nic_poll_rx_skb(struct hns_nic_ring_data *ring_data,
 			       struct sk_buff **out_skb, int *out_bnum)
 {
@@ -683,13 +745,10 @@ static int hns_nic_poll_rx_skb(struct hns_nic_ring_data *ring_data,
 	ring->stats.rx_pkts++;
 	ring->stats.rx_bytes += skb->len;
 
-	if (unlikely(hnae_get_bit(bnum_flag, HNS_RXD_L3E_B) ||
-		     hnae_get_bit(bnum_flag, HNS_RXD_L4E_B))) {
-		ring->stats.l3l4_csum_err++;
-		return 0;
-	}
-
-	skb->ip_summed = CHECKSUM_UNNECESSARY;
+	/* indicate to upper stack if our hardware has already calculated
+	 * the RX checksum
+	 */
+	hns_nic_rx_checksum(ring_data, skb, bnum_flag);
 
 	return 0;
 }
-- 
1.7.9.5

^ permalink raw reply related

* Re: [RFC] udp: some improvements on RX path.
From: Jesper Dangaard Brouer @ 2016-12-05 15:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: brouer, Paolo Abeni, netdev
In-Reply-To: <1480948133.18162.527.camel@edumazet-glaptop3.roam.corp.google.com>


On Mon, 05 Dec 2016 06:28:53 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Mon, 2016-12-05 at 14:22 +0100, Paolo Abeni wrote:
> > 
> > On Sun, 2016-12-04 at 18:43 -0800, Eric Dumazet wrote:  
[...]

> > > But I also want to work on the idea I gave few days back, having a
> > > separate queue and use splice to transfer the 'softirq queue' into
> > > a calm queue in a different cache line.
> > > 
> > > I expect a 50 % performance increase under load, maybe 1.5 Mpps.  

I also have high hopes for such a solution. I'm very excited that you
are working on this! :-)

 
> > It should work nicely under contention, but won't that increase the
> > overhead for the uncontended/single flow scenario ? the user space
> > reader needs to acquire 2 lock when splicing the 'softirq queue'.
> > On my system ksoftirqd and the u/s process work at similar speeds,
> > so splicing will happen quite often.   
> 
> Well, the splice would happen only if you have more than one message
> in the softirq queue. So no real overhead for uncontended flow
> scenario.
> 
> 
> This reminds me of the busylock I added in __dev_xmit_skb(), which
> basically is acquired only when we detect a possible contention on
> qdisc lock.

Do you think the splice technique would, have the same performance
benefit as having a MPMC queue with separate enqueue and dequeue locking?
(like we have with skb_array/ptr_ring that avoids cache bouncing)?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH] Fixed to checkpatch.pl errors to vlan_dev.c
From: Ozgur Karatas @ 2016-12-05 15:34 UTC (permalink / raw)
  To: kaber, David Miller; +Cc: netdev, linux-kernel

Hello all,

I will solve a checkpatch errors.

Signed-off-by: Ozgur Karatas <ozgur@member.fsf.org>

---
 net/8021q/vlan_dev.c                               |   2 +-
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index fbfacd5..2edb495 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -738,7 +738,7 @@ static int vlan_dev_netpoll_setup(struct net_device *dev, struct netpoll_info *n

 static void vlan_dev_netpoll_cleanup(struct net_device *dev)
 {
-       struct vlan_dev_priv *vlan= vlan_dev_priv(dev);
+       struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
        struct netpoll *netpoll = vlan->netpoll;

        if (!netpoll)

^ permalink raw reply related

* Re: BROKEN Re: [PATCH] netlink: 2-clause nla_ok()
From: Johannes Berg @ 2016-12-05 15:09 UTC (permalink / raw)
  To: Alexey Dobriyan, David Miller; +Cc: netdev
In-Reply-To: <CACVxJT9Qc=vNexx4ooQVFo=SrGs4-oHeZrnL18vbiS+o0F_N1A@mail.gmail.com>


> Maybe someone could vouch that other checks prevent
> this kind of situation from happening but not me.

No, now that you spell it out (and I see the patch) - this is
absolutely needed because nla_for_each_attr() [1] can be called on
arbitrary data coming from userspace in a message, e.g. by way
of nla_for_each_nested(). Even if it's not malformed, nla_ok() is the
only abort condition for that loop, so it would read at least one
nla_len after the real buffer without that condition.

johannes

[1] which seems to be the only significant user thereof

^ permalink raw reply

* Re: [net PATCH 2/2] ipv4: Drop suffix update from resize code
From: Robert Shearman @ 2016-12-05 15:05 UTC (permalink / raw)
  To: Alexander Duyck, netdev; +Cc: davem
In-Reply-To: <20161201122757.15499.51778.stgit@ahduyck-blue-test.jf.intel.com>

On 01/12/16 12:27, Alexander Duyck wrote:
> It has been reported that update_suffix can be expensive when it is called
> on a large node in which most of the suffix lengths are the same.  The time
> required to add 200K entries had increased from around 3 seconds to almost
> 49 seconds.
>
> In order to address this we need to move the code for updating the suffix
> out of resize and instead just have it handled in the cases where we are
> pushing a node that increases the suffix length, or will decrease the
> suffix length.
>
> Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
> Reported-by: Robert Shearman <rshearma@brocade.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

$ time sudo ip route restore < ~/allroutes
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists

real	0m4.026s
user	0m0.156s
sys	0m2.500s
$ time sudo ip route flush via 110.110.110.2

real	0m5.423s
user	0m0.064s
sys	0m1.096s

This reduces the time taken to both add and delete the 200k routes back 
down to levels comparable before commit 5405afd1a306. The changes look 
good to me too. Nicely done Alex!

Reviewed-by: Robert Shearman <rshearma@brocade.com>
Tested-by: Robert Shearman <rshearma@brocade.com>

Thanks,
Rob

^ permalink raw reply

* Re: [net PATCH 1/2] ipv4: Drop leaf from suffix pull/push functions
From: Robert Shearman @ 2016-12-05 15:05 UTC (permalink / raw)
  To: Alexander Duyck, netdev; +Cc: davem
In-Reply-To: <20161201122752.15499.8184.stgit@ahduyck-blue-test.jf.intel.com>

On 01/12/16 12:27, Alexander Duyck wrote:
> It wasn't necessary to pass a leaf in when doing the suffix updates so just
> drop it.  Instead just pass the suffix and work with that.
>
> Since we dropped the leaf there is no need to include that in the name so
> the names are updated to node_push_suffix and node_pull_suffix.
>
> Finally I noticed that the logic for pulling the suffix length back
> actually had some issues.  Specifically it would stop prematurely if there
> was a longer suffix, but it was not as long as the original suffix.  I
> updated the code to address that in node_pull_suffix.
>
> Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
> Suggested-by: Robert Shearman <rshearma@brocade.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

This addresses an issue in the current code whereby when a fib alias is 
removed from a leaf when there are other aliases remaining then the 
suffix length may not be updated in the conditions described above. This 
is because fib_remove_alias doesn't call trie_rebalance (or resize) in 
this case, as there's no need since no nodes have been added/removed.

This can be reproduced with this structure of the fib trie and by adding 
and then deleting 10.37.96.0/19:

         +-- 10.37.0.0/16 4 1 8 suffix/19
            |-- 10.37.0.0
               /24 universe UNICAST
            |-- 10.37.32.0
               /24 universe UNICAST
            |-- 10.37.64.0
               /20 universe UNICAST
            |-- 10.37.95.0
               /24 universe UNICAST
            +-- 10.37.96.0/20 2 1 2 suffix/21
               +-- 10.37.96.0/22 2 1 2 suffix/24
                  +-- 10.37.96.0/24 2 1 2 suffix/24
                     |-- 10.37.96.0
                        /32 link BROADCAST
                        /24 link UNICAST
                     +-- 10.37.96.192/26 2 0 2 suffix/28
                        |-- 10.37.96.204
                           /32 host LOCAL
                        |-- 10.37.96.255
                           /32 link BROADCAST
                  |-- 10.37.98.0
                     /25 universe UNICAST
               |-- 10.37.104.0
                  /21 universe UNICAST
            +-- 10.37.112.0/23 2 0 2 suffix/24
               |-- 10.37.112.0
                  /24 universe UNICAST
               |-- 10.37.113.0
                  /24 universe UNICAST
            |-- 10.37.223.0
               /24 universe UNICAST
            |-- 10.37.224.0
               /24 universe UNICAST

I've verified that with the fix included here that the suffix length on 
the 10.37.0.0/16 node now gets updated correctly to be /20.

Reviewed-by: Robert Shearman <rshearma@brocade.com>
Tested-by: Robert Shearman <rshearma@brocade.com>

Thanks,
Rob

^ permalink raw reply

* BROKEN Re: [PATCH] netlink: 2-clause nla_ok()
From: Alexey Dobriyan @ 2016-12-05 14:56 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

David, please do

    git revert 4f7df337fe79bba1e4c2d525525d63b5ba186bbd

I'm an idiot.

All rationale in the commit would be correct if reading "nla_len"
didn't require memory access. But it does.

    return rem >= (int)sizeof(*nla) &&
               nla->nla_len >= sizeof(*nla) &&
               nla->nla_len <= remaining;

Those logical ands ensure that memory access is not done
if "rem" is small enough to even fetch ->nla_len.

Maybe someone could vouch that other checks prevent
this kind of situation from happening but not me.
How very embarrassing.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

^ permalink raw reply

* Re: [PATCH 2/6] net: ethernet: ti: cpts: add support for ext rftclk selection
From: Rob Herring @ 2016-12-05 14:51 UTC (permalink / raw)
  To: Grygorii Strashko
  Cc: David S. Miller, netdev, Mugunthan V N, Richard Cochran,
	Sekhar Nori, linux-kernel, linux-omap, devicetree,
	Murali Karicheri, Wingman Kwok
In-Reply-To: <20161128230428.6872-3-grygorii.strashko@ti.com>

On Mon, Nov 28, 2016 at 05:04:24PM -0600, Grygorii Strashko wrote:
> Some CPTS instances, which can be found on KeyStone 2 1/10G Ethernet
> Switch Subsystems, can control an external multiplexer that selects
> one of up to 32 clocks for time sync reference (RFTCLK). This feature
> can be configured through CPTS_RFTCLK_SEL register (offset: x08).
> 
> Hence, introduce optional DT cpts_rftclk_sel poperty wich, if present,
> will specify CPTS reference clock. The cpts_rftclk_sel should be
> omitted in DT if HW doesn't support this feature. The external fixed
> rate clocks can be defined in board files as "fixed-clock".
> 
> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
> ---
>  Documentation/devicetree/bindings/net/keystone-netcp.txt |  2 ++

Please group binding changes into a single patch.

>  drivers/net/ethernet/ti/cpts.c                           | 12 ++++++++++++
>  drivers/net/ethernet/ti/cpts.h                           |  8 +++++++-
>  3 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/keystone-netcp.txt b/Documentation/devicetree/bindings/net/keystone-netcp.txt
> index c37b54e..ec4a241 100644
> --- a/Documentation/devicetree/bindings/net/keystone-netcp.txt
> +++ b/Documentation/devicetree/bindings/net/keystone-netcp.txt
> @@ -114,6 +114,8 @@ Optional properties:
>  				driver to them if needed.
>  
>   Properties related to cpts configurations.
> +	- cpts-rftclk-sel: selects one of up to 32 clocks for time sync
> +		reference.  Default = 0.

Vendor prefix.

>  	- cpts_clock_mult/cpts_clock_shift:
>  		used for converting time counter cycles to ns as in
>  

^ permalink raw reply

* Re: [PATCH 1/6] net: ethernet: ti: netcp: add support of cpts
From: Rob Herring @ 2016-12-05 14:49 UTC (permalink / raw)
  To: Grygorii Strashko
  Cc: David S. Miller, netdev, Mugunthan V N, Richard Cochran,
	Sekhar Nori, linux-kernel, linux-omap, devicetree,
	Murali Karicheri, Wingman Kwok
In-Reply-To: <20161128230428.6872-2-grygorii.strashko@ti.com>

On Mon, Nov 28, 2016 at 05:04:23PM -0600, Grygorii Strashko wrote:
> From: WingMan Kwok <w-kwok2@ti.com>
> 
> This patch adds support of the cpts device found in the
> gbe and 10gbe ethernet switches on the keystone 2 SoCs
> (66AK2E/L/Hx, 66AK2Gx).
> 
> Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
> ---
>  .../devicetree/bindings/net/keystone-netcp.txt     |   9 +
>  drivers/net/ethernet/ti/Kconfig                    |   7 +-
>  drivers/net/ethernet/ti/netcp.h                    |   2 +-
>  drivers/net/ethernet/ti/netcp_core.c               |  18 +-
>  drivers/net/ethernet/ti/netcp_ethss.c              | 437 ++++++++++++++++++++-
>  5 files changed, 459 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/keystone-netcp.txt b/Documentation/devicetree/bindings/net/keystone-netcp.txt
> index 04ba1dc..c37b54e 100644
> --- a/Documentation/devicetree/bindings/net/keystone-netcp.txt
> +++ b/Documentation/devicetree/bindings/net/keystone-netcp.txt
> @@ -113,6 +113,15 @@ Optional properties:
>  				will only initialize these ports and attach PHY
>  				driver to them if needed.
>  
> + Properties related to cpts configurations.
> +	- cpts_clock_mult/cpts_clock_shift:

Needs vendor prefix. Don't use '_'.

> +		used for converting time counter cycles to ns as in
> +
> +		ns = (cycles * clock_mult) >> _shift
> +
> +		Defaults: clock_mult, clock_shift = calculated from
> +		CPTS refclk

What does this mean?

> +
>  NetCP interface properties: Interface specification for NetCP sub-modules.
>  Required properties:
>  - rx-channel:	the navigator packet dma channel name for rx.

^ permalink raw reply

* Re: [PATCH 3/3] netns: fix net_generic() "id - 1" bloat
From: Alexey Dobriyan @ 2016-12-05 14:48 UTC (permalink / raw)
  To: David Laight
  Cc: davem@davemloft.net, netdev@vger.kernel.org, xemul@openvz.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0234CCB@AcuExch.aculab.com>

On Mon, Dec 5, 2016 at 3:49 PM, David Laight <David.Laight@aculab.com> wrote:
> From: Alexey Dobriyan
>> Sent: 02 December 2016 01:22
>> net_generic() function is both a) inline and b) used ~600 times.
>>
>> It has the following code inside
>>
>>               ...
>>       ptr = ng->ptr[id - 1];
>>               ...
>>
>> "id" is never compile time constant so compiler is forced to subtract 1.
>> And those decrements or LEA [r32 - 1] instructions add up.
>>
>> We also start id'ing from 1 to catch bugs where pernet sybsystem id
>> is not initialized and 0. This is quite pointless idea (nothing will
>> work or immediate interference with first registered subsystem) in
>> general but it hints what needs to be done for code size reduction.
>>
>> Namely, overlaying allocation of pointer array and fixed part of
>> structure in the beginning and using usual base-0 addressing.
>>
>> Ids are just cookies, their exact values do not matter, so lets start
>> with 3 on x86_64.
> ...
>>  struct net_generic {
>> -     struct {
>> -             unsigned int len;
>> -             struct rcu_head rcu;
>> -     } s;
>> -
>> -     void *ptr[0];
>> +     union {
>> +             struct {
>> +                     unsigned int len;
>> +                     struct rcu_head rcu;
>> +             } s;
>> +
>> +             void *ptr[0];
>> +     };
>>  };
>
> That union is an accident waiting to happen.

I kind of disagree. Module authors should not be given matches,
but it is hard to screw up if net_generic() is all you're given.

> What might work is to offset the Ids by
> (offsetof(struct net_generic, ptr)/sizeof (void *)) instead of by 1.
> The subtract from the offset will then counter the structure offset
> - which is what you are trying to achieve.

If you suggest this layout

struct net_generic {
        struct {
        } s;
        void *ptr[0];
};

then is it not optimal because offset of "ptr" needs to be somewhere in code
either in some LEA or imm8 of the final MOV which is 1 byte more bloaty.

Here is test program

struct ng1 {
        union {
                struct {
                        unsigned int len;
                } s;
                void *ptr[0];
        };
};
struct ng2 {
        struct {
                unsigned int len;
        } s;
        void *ptr[0];
};
struct net {
        int x;
        struct ng1 *gen1;
        struct ng2 *gen2;
};
void *ng1(const struct net *net, unsigned int id)
{
        return net->gen1->ptr[id];
}
void *ng2(const struct net *net, unsigned int id)
{
        return net->gen2->ptr[id];
}


0000000000000000 <ng1>:
   0:   48 8b 47 08             mov    rax,QWORD PTR [rdi+0x8]
   4:   89 f6                   mov    esi,esi
   6:   48 8b 04 f0             mov    rax,QWORD PTR [rax+rsi*8]
   a:   c3                      ret


0000000000000010 <ng2>:
  10:   48 8b 47 10             mov    rax,QWORD PTR [rdi+0x10]
  14:   89 f6                   mov    esi,esi
  16:   48 8b 44 f0 [[[08]]]          mov    rax,QWORD PTR [rax+rsi*8+0x8]
  1b:   c3                      ret

I can send more type checking if you can tolerate "_" identifier

    struct {
        unsigned int _;
    } pernet_ops_id_t;

^ permalink raw reply

* Re: [RFC] udp: some improvements on RX path.
From: Eric Dumazet @ 2016-12-05 14:28 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev
In-Reply-To: <1480944138.4694.37.camel@redhat.com>

On Mon, 2016-12-05 at 14:22 +0100, Paolo Abeni wrote:
> Hi Eric,
> 
> On Sun, 2016-12-04 at 18:43 -0800, Eric Dumazet wrote:
> > We currently access 3 cache lines from an skb in receive queue while
> > holding receive queue lock :
> > 
> > First cache line (contains ->next / prev pointers )
> > 2nd cache line (skb->peeked)
> > 3rd cache line (skb->truesize)
> > 
> > I believe we could get rid of skb->peeked completely.
> > 
> > I will cook a patch, but basically the idea is that the last owner of a
> > skb (right before skb->users becomes 0) can have the 'ownership' and
> > thus increase stats.
> 
> Agreed.
> 
> > The 3rd cache line miss is easily avoided by the following patch.
> 
> I run some performance tests on top of your patch "net: reorganize
> struct sock for better data locality", and I see an additional ~7%
> improvement on top of that, in the udp flood scenario. 
> 
> In my tests, the topmost perf offenders for the u/s process are now:
> 
>    9.98%  udp_sink  [kernel.kallsyms]  [k] udp_rmem_release
>    8.76%  udp_sink  [kernel.kallsyms]  [k] inet_recvmsg
>    6.71%  udp_sink  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>    5.40%  udp_sink  [kernel.kallsyms]  [k] __skb_try_recv_datagram
>    5.19%  udp_sink  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
> 
> udp_rmem_release() spends most of its time doing:
> 
>     atomic_sub(size, &sk->sk_rmem_alloc);
> 
> I see a cacheline miss while accessing sk_rmem_alloc; most probably due
> to sk_rmem_alloc being updated by the writer outside the rx queue lock.
> 
> Moving such update inside the lock show remove this cache miss but will
> increase the pressure on the rx lock. What do you think ?

I want to accumulate the sk_rmem_alloc  deficit in another variable in
the same cache line than the secondary list, updated from process
context at udp recvmsg() time.

And only transfer the accumulated deficit in one go when the second
queue is emptied, or if the accumulated deficite is > (rcvbuf/2)

If done right, we should interfere between the softirq flooder(s) and
the process thread only once per batch.

> 
> inet_recvmsg() is there because with "net: reorganize struct sock for
> better data locality" we get a cache miss while accessing skc_rxhash in
> sock_rps_record_flow(); touching sk_drops is dirtying that cacheline -
> sorry for not noticing this before. Do you have CONFIG_RPS disabled ?
> 
> > But I also want to work on the idea I gave few days back, having a
> > separate queue and use splice to transfer the 'softirq queue' into
> > a calm queue in a different cache line.
> > 
> > I expect a 50 % performance increase under load, maybe 1.5 Mpps.
> 
> It should work nicely under contention, but won't that increase the
> overhead for the uncontended/single flow scenario ? the user space
> reader needs to acquire 2 lock when splicing the 'softirq queue'. On my
> system ksoftirqd and the u/s process work at similar speeds, so splicing
> will happen quite often. 

Well, the splice would happen only if you have more than one message in
the softirq queue. So no real overhead for uncontended flow scenario.


This reminds me of the busylock I added in __dev_xmit_skb(), which
basically is acquired only when we detect a possible contention on qdisc
lock.

Thanks.

^ permalink raw reply

* Re: [PATCH net-next v4 2/4] dt-bindings: net: add EEE capability constants
From: Rob Herring @ 2016-12-05 14:39 UTC (permalink / raw)
  To: Jerome Brunet
  Cc: netdev, devicetree, Florian Fainelli, Carlo Caione, Kevin Hilman,
	Giuseppe Cavallaro, Alexandre TORGUE, Martin Blumenstingl,
	Andre Roth, Andrew Lunn, Neil Armstrong, linux-amlogic,
	linux-arm-kernel, linux-kernel, Julia Lawall, Yegor Yefremov,
	Andreas Färber
In-Reply-To: <1480348229-25672-3-git-send-email-jbrunet@baylibre.com>

On Mon, Nov 28, 2016 at 04:50:26PM +0100, Jerome Brunet wrote:
> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> Tested-by: Yegor Yefremov <yegorslists@googlemail.com>
> Tested-by: Andreas Färber <afaerber@suse.de>
> Tested-by: Neil Armstrong <narmstrong@baylibre.com>
> ---
>  include/dt-bindings/net/mdio.h | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>  create mode 100644 include/dt-bindings/net/mdio.h

Seems changes are wanted on this, but patches 2 and 3 should be 
combined. The header is part of the binding doc.

Rob

^ permalink raw reply

* Re: [PATCH net-next v6 6/6] samples/bpf: add userspace example for prohibiting sockets
From: David Ahern @ 2016-12-05 14:31 UTC (permalink / raw)
  To: David Laight, Alexei Starovoitov
  Cc: netdev@vger.kernel.org, daniel@zonque.org, ast@fb.com,
	daniel@iogearbox.net, maheshb@google.com, tgraf@suug.ch
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0234BBD@AcuExch.aculab.com>

On 12/5/16 3:51 AM, David Laight wrote:
> You can at least improve the comment.

I fixed the header files for v7.

^ permalink raw reply

* Designing a safe RX-zero-copy Memory Model for Networking
From: Jesper Dangaard Brouer @ 2016-12-05 14:31 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: brouer, linux-mm, John Fastabend, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg

Hi all,

This is my design for how to safely handle RX zero-copy in the network
stack, by using page_pool[1] and modifying NIC drivers.  Safely means
not leaking kernel info in pages mapped to userspace and resilience
so a malicious userspace app cannot crash the kernel.

It is only a design, and thus the purpose is for you to find any holes
in this design ;-)  Below text is also available as html see[2].

[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/design.html
[2] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html

===========================
Memory Model for Networking
===========================

This design describes how the page_pool change the memory model for
networking in the NIC (Network Interface Card) drivers.

.. Note:: The catch for driver developers is that, once an application
          request zero-copy RX, then the driver must use a specific
          SKB allocation mode and might have to reconfigure the
          RX-ring.

Design target
=============

Allow the NIC to function as a normal Linux NIC and be shared in a
safe manor, between the kernel network stack and an accelerated
userspace application using RX zero-copy delivery.

Target is to provide the basis for building RX zero-copy solutions in
a memory safe manor.  An efficient communication channel for userspace
delivery is out of scope for this document, but OOM considerations are
discussed below (`Userspace delivery and OOM`_).

Background
==========

The SKB or ``struct sk_buff`` is the fundamental meta-data structure
for network packets in the Linux Kernel network stack.  It is a fairly
complex object and can be constructed in several ways.

From a memory perspective there are two ways depending on
RX-buffer/page state:

1) Writable packet page
2) Read-only packet page

To take full potential of the page_pool, the drivers must actually
support handling both options depending on the configuration state of
the page_pool.

Writable packet page
--------------------

When the RX packet page is writable, the SKB setup is fairly straight
forward.  The SKB->data (and skb->head) can point directly to the page
data, adjusting the offset according to drivers headroom (for adding
headers) and setting the length according to the DMA descriptor info.

The page/data need to be writable, because the network stack need to
adjust headers (like TimeToLive and checksum) or even add or remove
headers for encapsulation purposes.

A subtle catch, which also requires a writable page, is that the SKB
also have an accompanying "shared info" data-structure ``struct
skb_shared_info``.  This "skb_shared_info" is written into the
skb->data memory area at the end (skb->end) of the (header) data.  The
skb_shared_info contains semi-sensitive information, like kernel
memory pointers to other pages (which might be pointers to more packet
data).  This would be bad from a zero-copy point of view to leak this
kind of information.

Read-only packet page
---------------------

When the RX packet page is read-only, the construction of the SKB is
significantly more complicated and even involves one more memory
allocation.

1) Allocate a new separate writable memory area, and point skb->data
   here.  This is needed due to (above described) skb_shared_info.

2) Memcpy packet headers into this (skb->data) area.

3) Clear part of skb_shared_info struct in writable-area.

4) Setup pointer to packet-data in the page (in skb_shared_info->frags)
   and adjust the page_offset to be past the headers just copied.

It is useful (later) that the network stack have this notion that part
of the packet and a page can be read-only.  This implies that the
kernel will not "pollute" this memory with any sensitive information.
This is good from a zero-copy point of view, but bad from a
performance perspective.


NIC RX Zero-Copy
================

Doing NIC RX zero-copy involves mapping RX pages into userspace.  This
involves costly mapping and unmapping operations in the address space
of the userspace process.  Plus for doing this safely, the page memory
need to be cleared before using it, to avoid leaking kernel
information to userspace, also a costly operation.  The page_pool base
"class" of optimization is moving these kind of operations out of the
fastpath, by recycling and lifetime control.

Once a NIC RX-queue's page_pool have been configured for zero-copy
into userspace, then can packets still be allowed to travel the normal
stack?

Yes, this should be possible, because the driver can use the
SKB-read-only mode, which avoids polluting the page data with
kernel-side sensitive data.  This implies, when a driver RX-queue
switch page_pool to RX-zero-copy mode it MUST also switch to
SKB-read-only mode (for normal stack delivery for this RXq).

XDP can be used for controlling which pages that gets RX zero-copied
to userspace.  The page is still writable for the XDP program, but
read-only for normal stack delivery.


Kernel safety
-------------

For the paranoid, how do we protect the kernel from a malicious
userspace program.  Sure there will be a communication interface
between kernel and userspace, that synchronize ownership of pages.
But a userspace program can violate this interface, given pages are
kept VMA mapped, the program can in principle access all the memory
pages in the given page_pool.  This opens up for a malicious (or
defect) program modifying memory pages concurrently with the kernel
and DMA engine using them.

An easy way to get around userspace modifying page data contents is
simply to map pages read-only into userspace.

.. Note:: The first implementation target is read-only zero-copy RX
          page to userspace and require driver to use SKB-read-only
          mode.

Advanced: Allowing userspace write access?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

What if userspace need write access? Flipping the page permissions per
transfer will likely kill performance (as this likely affects the
TLB-cache).

I will argue that giving userspace write access is still possible,
without risking a kernel crash.  This is related to the SKB-read-only
mode that copies the packet headers (in to another memory area,
inaccessible to userspace).  The attack angle is to modify packet
headers after they passed some kernel network stack validation step
(as once headers are copied they are out of "reach").

Situation classes where memory page can be modified concurrently:

1) When DMA engine owns the page.  Not a problem, as DMA engine will
   simply overwrite data.

2) Just after DMA engine finish writing.  Not a problem, the packet
   will go through netstack validation and be rejected.

3) While XDP reads data. This can lead to XDP/eBPF program goes into a
   wrong code branch, but the eBPF virtual machine should not be able
   to crash the kernel. The worst outcome is a wrong or invalid XDP
   return code.

4) Before SKB with read-only page is constructed. Not a problem, the
   packet will go through netstack validation and be rejected.

5) After SKB with read-only page has been constructed.  Remember the
   packet headers were copied into a separate memory area, and the
   page data is pointed to with an offset passed the copied headers.
   Thus, userspace cannot modify the headers used for netstack
   validation.  It can only modify packet data contents, which is less
   critical as it cannot crash the kernel, and eventually this will be
   caught by packet checksum validation.

6) After netstack delivered packet to another userspace process. Not a
   problem, as it cannot crash the kernel.  It might corrupt
   packet-data being read by another userspace process, which one
   argument for requiring elevated privileges to get write access
   (like NET_CAP_ADMIN).


Userspace delivery and OOM
--------------------------

These RX pages are likely mapped to userspace via mmap(), so-far so
good.  It is key to performance to get an efficient way of signaling
between kernel and userspace, e.g what page are ready for consumption,
and when userspace are done with the page.

It is outside the scope of page_pool to provide such a queuing
structure, but the page_pool can offer some means of protecting the
system resource usage.  It is a classical problem that resources
(e.g. the page) must be returned in a timely manor, else the system,
in this case, will run out of memory.  Any system/design with
unbounded memory allocation can lead to Out-Of-Memory (OOM)
situations.

Communication between kernel and userspace is likely going to be some
kind of queue.  Given transferring packets individually will have too
much scheduling overhead.  A queue can implicitly function as a
bulking interface, and offers a natural way to split the workload
across CPU cores.

This essentially boils down-to a two queue system, with the RX-ring
queue and the userspace delivery queue.

Two bad situations exists for the userspace queue:

1) Userspace is not consuming objects fast-enough. This should simply
   result in packets getting dropped when enqueueing to a full
   userspace queue (as queue *must* implement some limit). Open
   question is; should this be reported or communicated to userspace.

2) Userspace is consuming objects fast, but not returning them in a
   timely manor.  This is a bad situation, because it threatens the
   system stability as it can lead to OOM.

The page_pool should somehow protect the system in case 2.  The
page_pool can detect the situation as it is able to track the number
of outstanding pages, due to the recycle feedback loop.  Thus, the
page_pool can have some configurable limit of allowed outstanding
pages, which can protect the system against OOM.

Note, the `Fbufs paper`_ propose to solve case 2 by allowing these
pages to be "pageable", i.e. swap-able, but that is not an option for
the page_pool as these pages are DMA mapped.

.. _`Fbufs paper`:
   http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.9688

Effect of blocking allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The effect of page_pool, in case 2, that denies more allocations
essentially result-in the RX-ring queue cannot be refilled and HW
starts dropping packets due to "out-of-buffers".  For NICs with
several HW RX-queues, this can be limited to a subset of queues (and
admin can control which RX queue with HW filters).

The question is if the page_pool can do something smarter in this
case, to signal the consumers of these pages, before the maximum limit
is hit (of allowed outstanding packets).  The MM-subsystem already
have a concept of emergency PFMEMALLOC reserves and associate
page-flags (e.g. page_is_pfmemalloc).  And the network stack already
handle and react to this.  Could the same PFMEMALLOC system be used
for marking pages when limit is close?

This requires further analysis. One can imagine; this could be used at
RX by XDP to mitigate the situation by dropping less-important frames.
Given XDP choose which pages are being send to userspace it might have
appropriate knowledge of what it relevant to drop(?).

.. Note:: An alternative idea is using a data-structure that blocks
          userspace from getting new pages before returning some.
          (out of scope for the page_pool)

--
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Above document is taken at GitHub commit 47fa7c844f48fab8b
 https://github.com/netoptimizer/prototype-kernel/commit/47fa7c844f48fab8b

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v2 net-next 7/8] net: reorganize struct sock for better data locality
From: Eric Dumazet @ 2016-12-05 14:30 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Eric Dumazet, David S . Miller, netdev, Yuchung Cheng
In-Reply-To: <1480941384.4694.13.camel@redhat.com>

On Mon, 2016-12-05 at 13:36 +0100, Paolo Abeni wrote:
> Hi Eric,

> I cherry-picked this patch only for some UDP benchmark. Under flood with
> many concurrent flows, I see this 20% improvement and a relevant
> decrease in system load.
> 
> Nice work, thanks Eric!
> 
> Tested-by: Paolo Abeni <pabeni@redhat.com>

Excellent, thanks a lot for testing !

^ permalink raw reply

* Re: [PATCH v2 net-next 1/2] flow dissector: ICMP support
From: Jiri Pirko @ 2016-12-05 14:27 UTC (permalink / raw)
  To: Simon Horman
  Cc: Tom Herbert, David Miller, Linux Kernel Network Developers,
	Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, Jamal Hadi Salim,
	Jiri Pirko
In-Reply-To: <20161205142208.GA16425@penelope.horms.nl>

Mon, Dec 05, 2016 at 03:23:16PM CET, simon.horman@netronome.com wrote:
>On Sat, Dec 03, 2016 at 10:52:32AM -0800, Tom Herbert wrote:
>> On Sat, Dec 3, 2016 at 2:49 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> > Fri, Dec 02, 2016 at 09:31:41PM CET, simon.horman@netronome.com wrote:
>> >>Allow dissection of ICMP(V6) type and code. This re-uses transport layer
>> >>port dissection code as although ICMP is not a transport protocol and their
>> >>type and code are not ports this allows sharing of both code and storage.
>> >>
>> >>Signed-off-by: Simon Horman <simon.horman@netronome.com>
>
>...
>
>> > Digging into this a bit more. I think it would be much nice not to mix
>> > up l4 ports and icmp stuff.
>> >
>> > How about to have FLOW_DISSECTOR_KEY_ICMP
>> > and
>> > struct flow_dissector_key_icmp {
>> >         u8 type;
>> >         u8 code;
>> > };
>> >
>> > The you can make this structure and struct flow_dissector_key_ports into
>> > an union in struct flow_keys.
>> >
>> > Looks much cleaner to me.
>
>Hi Jiri,
>
>I wondered about that too. The main reason that I didn't do this the first
>time around is that I wasn't sure what to do to differentiate between ports
>and icmp in fl_init_dissector() given that using a union would could to
>mask bits being set in both if they are set in either.
>
>I've taken a stab at addressing that below along with implementing your
>suggestion. If the treatment in fl_init_dissector() is acceptable
>perhaps something similar should be done for FLOW_DISSECTOR_KEY_IPV[46]_ADDRS

Looks fine.

>too?
>
>> I agree, this patch adds to many conditionals into the fast path for
>> ICMP handling. Neither is there much point in using type and code as
>> input to the packet hash.
>
>Hi Tom,
>
>I'm not sure that I follow this. A packet may be ICMP or not regardless of
>if FLOW_DISSECTOR_KEY_PORT (and now FLOW_DISSECTOR_KEY_ICMP) port is set or
>not. I'd appreciate some guidance here.
>
>
>First-pass at implementing Jiri's suggestion follows:
>
>diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
>index 5540dfa18872..475cd121496f 100644
>--- a/include/net/flow_dissector.h
>+++ b/include/net/flow_dissector.h
>@@ -91,14 +91,10 @@ struct flow_dissector_key_addrs {
> 
> /**
>  * flow_dissector_key_ports:
>- *	@ports: port numbers of Transport header or
>- *		type and code of ICMP header
>+ *	@ports: port numbers of Transport header
>  *		ports: source (high) and destination (low) port numbers
>  *		src: source port number
>  *		dst: destination port number
>- *		icmp: ICMP type (high) and code (low)
>- *		type: ICMP type
>- *		type: ICMP code
>  */
> struct flow_dissector_key_ports {
> 	union {
>@@ -107,6 +103,18 @@ struct flow_dissector_key_ports {
> 			__be16 src;
> 			__be16 dst;
> 		};
>+	};
>+};
>+
>+/**
>+ * flow_dissector_key_icmp:
>+ *	@ports: type and code of ICMP header
>+ *		icmp: ICMP type (high) and code (low)
>+ *		type: ICMP type
>+ *		code: ICMP code
>+ */
>+struct flow_dissector_key_icmp {
>+	union {
> 		__be16 icmp;
> 		struct {
> 			u8 type;
>@@ -133,6 +141,7 @@ enum flow_dissector_key_id {
> 	FLOW_DISSECTOR_KEY_IPV4_ADDRS, /* struct flow_dissector_key_ipv4_addrs */
> 	FLOW_DISSECTOR_KEY_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */
> 	FLOW_DISSECTOR_KEY_PORTS, /* struct flow_dissector_key_ports */
>+	FLOW_DISSECTOR_KEY_ICMP, /* struct flow_dissector_key_icmp */
> 	FLOW_DISSECTOR_KEY_ETH_ADDRS, /* struct flow_dissector_key_eth_addrs */
> 	FLOW_DISSECTOR_KEY_TIPC_ADDRS, /* struct flow_dissector_key_tipc_addrs */
> 	FLOW_DISSECTOR_KEY_VLAN, /* struct flow_dissector_key_flow_vlan */
>@@ -171,7 +180,10 @@ struct flow_keys {
> 	struct flow_dissector_key_tags tags;
> 	struct flow_dissector_key_vlan vlan;
> 	struct flow_dissector_key_keyid keyid;
>-	struct flow_dissector_key_ports ports;
>+	union {
>+		struct flow_dissector_key_ports ports;
>+		struct flow_dissector_key_icmp icmp;
>+	};
> 	struct flow_dissector_key_addrs addrs;
> };
> 
>diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
>index 0584b4bb4390..33e5fa2b3e87 100644
>--- a/net/core/flow_dissector.c
>+++ b/net/core/flow_dissector.c
>@@ -139,6 +139,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> 	struct flow_dissector_key_basic *key_basic;
> 	struct flow_dissector_key_addrs *key_addrs;
> 	struct flow_dissector_key_ports *key_ports;
>+	struct flow_dissector_key_icmp *key_icmp;
> 	struct flow_dissector_key_tags *key_tags;
> 	struct flow_dissector_key_vlan *key_vlan;
> 	struct flow_dissector_key_keyid *key_keyid;
>@@ -559,18 +560,25 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> 		break;
> 	}
> 
>-	if (dissector_uses_key(flow_dissector,
>-			       FLOW_DISSECTOR_KEY_PORTS)) {
>-		key_ports = skb_flow_dissector_target(flow_dissector,
>-						      FLOW_DISSECTOR_KEY_PORTS,
>-						      target_container);
>-		if (flow_protos_are_icmp_any(proto, ip_proto))
>-			key_ports->icmp = skb_flow_get_be16(skb, nhoff, data,
>-							    hlen);
>-		else
>+	if (flow_protos_are_icmp_any(proto, ip_proto)) {
>+		if (dissector_uses_key(flow_dissector,
>+				       FLOW_DISSECTOR_KEY_ICMP)) {
>+			key_icmp = skb_flow_dissector_target(flow_dissector,
>+							     FLOW_DISSECTOR_KEY_ICMP,
>+							     target_container);
>+			key_icmp->icmp = skb_flow_get_be16(skb, nhoff, data,
>+							   hlen);
>+		}
>+	} else {
>+		if (dissector_uses_key(flow_dissector,
>+				       FLOW_DISSECTOR_KEY_PORTS)) {
>+			key_ports = skb_flow_dissector_target(flow_dissector,
>+							      FLOW_DISSECTOR_KEY_PORTS,
>+							      target_container);
> 			key_ports->ports = __skb_flow_get_ports(skb, nhoff,
> 								ip_proto, data,
> 								hlen);
>+		}
> 	}
> 
> out_good:
>@@ -975,6 +983,10 @@ static const struct flow_dissector_key flow_keys_dissector_keys[] = {
> 		.offset = offsetof(struct flow_keys, ports),
> 	},
> 	{
>+		.key_id = FLOW_DISSECTOR_KEY_ICMP,
>+		.offset = offsetof(struct flow_keys, icmp),
>+	},
>+	{
> 		.key_id = FLOW_DISSECTOR_KEY_VLAN,
> 		.offset = offsetof(struct flow_keys, vlan),
> 	},
>diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>index c96b2a349779..f4ba69bd362f 100644
>--- a/net/sched/cls_flower.c
>+++ b/net/sched/cls_flower.c
>@@ -38,7 +38,10 @@ struct fl_flow_key {
> 		struct flow_dissector_key_ipv4_addrs ipv4;
> 		struct flow_dissector_key_ipv6_addrs ipv6;
> 	};
>-	struct flow_dissector_key_ports tp;
>+	union {
>+		struct flow_dissector_key_ports tp;
>+		struct flow_dissector_key_icmp icmp;
>+	};
> 	struct flow_dissector_key_keyid enc_key_id;
> 	union {
> 		struct flow_dissector_key_ipv4_addrs enc_ipv4;
>@@ -511,19 +514,23 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
> 			       &mask->tp.dst, TCA_FLOWER_KEY_SCTP_DST_MASK,
> 			       sizeof(key->tp.dst));
> 	} else if (flow_basic_key_is_icmpv4(&key->basic)) {
>-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
>-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
>-			       sizeof(key->tp.type));
>-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>-			       sizeof(key->tp.code));
>+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
>+			       &mask->icmp.type,
>+			       TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
>+			       sizeof(key->icmp.type));
>+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>+			       &mask->icmp.code,
>+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>+			       sizeof(key->icmp.code));
> 	} else if (flow_basic_key_is_icmpv6(&key->basic)) {
>-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
>-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
>-			       sizeof(key->tp.type));
>-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>-			       sizeof(key->tp.code));
>+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
>+			       &mask->icmp.type,
>+			       TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
>+			       sizeof(key->icmp.type));
>+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>+			       &mask->icmp.code,
>+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>+			       sizeof(key->icmp.code));
> 	}
> 
> 	if (tb[TCA_FLOWER_KEY_ENC_IPV4_SRC] ||
>@@ -609,15 +616,16 @@ static int fl_init_hashtable(struct cls_fl_head *head,
> 		keys[cnt].key_id = id;						\
> 		keys[cnt].offset = FL_KEY_MEMBER_OFFSET(member);		\
> 		cnt++;								\
>-	} while(0);
>+	} while(0)
> 
> #define FL_KEY_SET_IF_MASKED(mask, keys, cnt, id, member)			\
> 	do {									\
> 		if (FL_KEY_IS_MASKED(mask, member))				\
> 			FL_KEY_SET(keys, cnt, id, member);			\
>-	} while(0);
>+	} while(0)
> 
> static void fl_init_dissector(struct cls_fl_head *head,
>+			      struct cls_fl_filter *f,
> 			      struct fl_flow_mask *mask)
> {
> 	struct flow_dissector_key keys[FLOW_DISSECTOR_KEY_MAX];
>@@ -631,8 +639,12 @@ static void fl_init_dissector(struct cls_fl_head *head,
> 			     FLOW_DISSECTOR_KEY_IPV4_ADDRS, ipv4);
> 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
> 			     FLOW_DISSECTOR_KEY_IPV6_ADDRS, ipv6);
>-	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
>-			     FLOW_DISSECTOR_KEY_PORTS, tp);
>+	if (flow_basic_key_is_icmpv4(&f->key.basic))
>+		FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
>+				     FLOW_DISSECTOR_KEY_ICMP, icmp);
>+	else
>+		FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
>+				     FLOW_DISSECTOR_KEY_PORTS, tp);
> 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
> 			     FLOW_DISSECTOR_KEY_VLAN, vlan);
> 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
>@@ -652,6 +664,7 @@ static void fl_init_dissector(struct cls_fl_head *head,
> }
> 
> static int fl_check_assign_mask(struct cls_fl_head *head,
>+				struct cls_fl_filter *f,
> 				struct fl_flow_mask *mask)
> {
> 	int err;
>@@ -672,7 +685,7 @@ static int fl_check_assign_mask(struct cls_fl_head *head,
> 	memcpy(&head->mask, mask, sizeof(head->mask));
> 	head->mask_assigned = true;
> 
>-	fl_init_dissector(head, mask);
>+	fl_init_dissector(head, f, mask);
> 
> 	return 0;
> }
>@@ -785,7 +798,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
> 	if (err)
> 		goto errout;
> 
>-	err = fl_check_assign_mask(head, &mask);
>+	err = fl_check_assign_mask(head, fnew, &mask);
> 	if (err)
> 		goto errout;
> 
>@@ -1000,24 +1013,24 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
> 				  sizeof(key->tp.dst))))
> 		goto nla_put_failure;
> 	else if (flow_basic_key_is_icmpv4(&key->basic) &&
>-		 (fl_dump_key_val(skb, &key->tp.type,
>-				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->tp.type,
>+		 (fl_dump_key_val(skb, &key->icmp.type,
>+				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->icmp.type,
> 				  TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
>-				  sizeof(key->tp.type)) ||
>-		  fl_dump_key_val(skb, &key->tp.code,
>-				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->tp.code,
>+				  sizeof(key->icmp.type)) ||
>+		  fl_dump_key_val(skb, &key->icmp.code,
>+				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->icmp.code,
> 				  TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>-				  sizeof(key->tp.code))))
>+				  sizeof(key->icmp.code))))
> 		goto nla_put_failure;
> 	else if (flow_basic_key_is_icmpv6(&key->basic) &&
>-		 (fl_dump_key_val(skb, &key->tp.type,
>-				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->tp.type,
>+		 (fl_dump_key_val(skb, &key->icmp.type,
>+				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->icmp.type,
> 				  TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
>-				  sizeof(key->tp.type)) ||
>-		  fl_dump_key_val(skb, &key->tp.code,
>-				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->tp.code,
>+				  sizeof(key->icmp.type)) ||
>+		  fl_dump_key_val(skb, &key->icmp.code,
>+				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->icmp.code,
> 				  TCA_FLOWER_KEY_ICMPV6_CODE_MASK,
>-				  sizeof(key->tp.code))))
>+				  sizeof(key->icmp.code))))
> 		goto nla_put_failure;
> 
> 	if (key->enc_control.addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS &&

^ permalink raw reply

* [PATCH net-next] net: dsa: mv88e6xxx: Use EDSA on mv88e6097
From: Stefan Eichenberger @ 2016-12-05 13:12 UTC (permalink / raw)
  To: andrew, vivien.didelot; +Cc: netdev, Stefan Eichenberger

Use DSA_TAG_PROTO_EDSA as tag_protocol for the mv88e6097. The 
initialisation was missing before.

Signed-off-by: Stefan Eichenberger <stefan.eichenberger@netmodule.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index ca453f3..7a6c587 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -3861,6 +3861,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
 		.global1_addr = 0x1b,
 		.age_time_coeff = 15000,
 		.g1_irqs = 8,
+		.tag_protocol = DSA_TAG_PROTO_EDSA,
 		.flags = MV88E6XXX_FLAGS_FAMILY_6097,
 		.ops = &mv88e6097_ops,
 	},
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH v2 net-next 1/2] flow dissector: ICMP support
From: Simon Horman @ 2016-12-05 14:23 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jiri Pirko, David Miller, Linux Kernel Network Developers,
	Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, Jamal Hadi Salim,
	Jiri Pirko
In-Reply-To: <CALx6S36EorReKwSs=qpS6uxCrL=_CQHf2BfV+Y=Kw-Dzfc241A@mail.gmail.com>

On Sat, Dec 03, 2016 at 10:52:32AM -0800, Tom Herbert wrote:
> On Sat, Dec 3, 2016 at 2:49 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> > Fri, Dec 02, 2016 at 09:31:41PM CET, simon.horman@netronome.com wrote:
> >>Allow dissection of ICMP(V6) type and code. This re-uses transport layer
> >>port dissection code as although ICMP is not a transport protocol and their
> >>type and code are not ports this allows sharing of both code and storage.
> >>
> >>Signed-off-by: Simon Horman <simon.horman@netronome.com>

...

> > Digging into this a bit more. I think it would be much nice not to mix
> > up l4 ports and icmp stuff.
> >
> > How about to have FLOW_DISSECTOR_KEY_ICMP
> > and
> > struct flow_dissector_key_icmp {
> >         u8 type;
> >         u8 code;
> > };
> >
> > The you can make this structure and struct flow_dissector_key_ports into
> > an union in struct flow_keys.
> >
> > Looks much cleaner to me.

Hi Jiri,

I wondered about that too. The main reason that I didn't do this the first
time around is that I wasn't sure what to do to differentiate between ports
and icmp in fl_init_dissector() given that using a union would could to
mask bits being set in both if they are set in either.

I've taken a stab at addressing that below along with implementing your
suggestion. If the treatment in fl_init_dissector() is acceptable
perhaps something similar should be done for FLOW_DISSECTOR_KEY_IPV[46]_ADDRS
too?

> I agree, this patch adds to many conditionals into the fast path for
> ICMP handling. Neither is there much point in using type and code as
> input to the packet hash.

Hi Tom,

I'm not sure that I follow this. A packet may be ICMP or not regardless of
if FLOW_DISSECTOR_KEY_PORT (and now FLOW_DISSECTOR_KEY_ICMP) port is set or
not. I'd appreciate some guidance here.


First-pass at implementing Jiri's suggestion follows:

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 5540dfa18872..475cd121496f 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -91,14 +91,10 @@ struct flow_dissector_key_addrs {
 
 /**
  * flow_dissector_key_ports:
- *	@ports: port numbers of Transport header or
- *		type and code of ICMP header
+ *	@ports: port numbers of Transport header
  *		ports: source (high) and destination (low) port numbers
  *		src: source port number
  *		dst: destination port number
- *		icmp: ICMP type (high) and code (low)
- *		type: ICMP type
- *		type: ICMP code
  */
 struct flow_dissector_key_ports {
 	union {
@@ -107,6 +103,18 @@ struct flow_dissector_key_ports {
 			__be16 src;
 			__be16 dst;
 		};
+	};
+};
+
+/**
+ * flow_dissector_key_icmp:
+ *	@ports: type and code of ICMP header
+ *		icmp: ICMP type (high) and code (low)
+ *		type: ICMP type
+ *		code: ICMP code
+ */
+struct flow_dissector_key_icmp {
+	union {
 		__be16 icmp;
 		struct {
 			u8 type;
@@ -133,6 +141,7 @@ enum flow_dissector_key_id {
 	FLOW_DISSECTOR_KEY_IPV4_ADDRS, /* struct flow_dissector_key_ipv4_addrs */
 	FLOW_DISSECTOR_KEY_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */
 	FLOW_DISSECTOR_KEY_PORTS, /* struct flow_dissector_key_ports */
+	FLOW_DISSECTOR_KEY_ICMP, /* struct flow_dissector_key_icmp */
 	FLOW_DISSECTOR_KEY_ETH_ADDRS, /* struct flow_dissector_key_eth_addrs */
 	FLOW_DISSECTOR_KEY_TIPC_ADDRS, /* struct flow_dissector_key_tipc_addrs */
 	FLOW_DISSECTOR_KEY_VLAN, /* struct flow_dissector_key_flow_vlan */
@@ -171,7 +180,10 @@ struct flow_keys {
 	struct flow_dissector_key_tags tags;
 	struct flow_dissector_key_vlan vlan;
 	struct flow_dissector_key_keyid keyid;
-	struct flow_dissector_key_ports ports;
+	union {
+		struct flow_dissector_key_ports ports;
+		struct flow_dissector_key_icmp icmp;
+	};
 	struct flow_dissector_key_addrs addrs;
 };
 
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 0584b4bb4390..33e5fa2b3e87 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -139,6 +139,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 	struct flow_dissector_key_basic *key_basic;
 	struct flow_dissector_key_addrs *key_addrs;
 	struct flow_dissector_key_ports *key_ports;
+	struct flow_dissector_key_icmp *key_icmp;
 	struct flow_dissector_key_tags *key_tags;
 	struct flow_dissector_key_vlan *key_vlan;
 	struct flow_dissector_key_keyid *key_keyid;
@@ -559,18 +560,25 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 		break;
 	}
 
-	if (dissector_uses_key(flow_dissector,
-			       FLOW_DISSECTOR_KEY_PORTS)) {
-		key_ports = skb_flow_dissector_target(flow_dissector,
-						      FLOW_DISSECTOR_KEY_PORTS,
-						      target_container);
-		if (flow_protos_are_icmp_any(proto, ip_proto))
-			key_ports->icmp = skb_flow_get_be16(skb, nhoff, data,
-							    hlen);
-		else
+	if (flow_protos_are_icmp_any(proto, ip_proto)) {
+		if (dissector_uses_key(flow_dissector,
+				       FLOW_DISSECTOR_KEY_ICMP)) {
+			key_icmp = skb_flow_dissector_target(flow_dissector,
+							     FLOW_DISSECTOR_KEY_ICMP,
+							     target_container);
+			key_icmp->icmp = skb_flow_get_be16(skb, nhoff, data,
+							   hlen);
+		}
+	} else {
+		if (dissector_uses_key(flow_dissector,
+				       FLOW_DISSECTOR_KEY_PORTS)) {
+			key_ports = skb_flow_dissector_target(flow_dissector,
+							      FLOW_DISSECTOR_KEY_PORTS,
+							      target_container);
 			key_ports->ports = __skb_flow_get_ports(skb, nhoff,
 								ip_proto, data,
 								hlen);
+		}
 	}
 
 out_good:
@@ -975,6 +983,10 @@ static const struct flow_dissector_key flow_keys_dissector_keys[] = {
 		.offset = offsetof(struct flow_keys, ports),
 	},
 	{
+		.key_id = FLOW_DISSECTOR_KEY_ICMP,
+		.offset = offsetof(struct flow_keys, icmp),
+	},
+	{
 		.key_id = FLOW_DISSECTOR_KEY_VLAN,
 		.offset = offsetof(struct flow_keys, vlan),
 	},
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index c96b2a349779..f4ba69bd362f 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -38,7 +38,10 @@ struct fl_flow_key {
 		struct flow_dissector_key_ipv4_addrs ipv4;
 		struct flow_dissector_key_ipv6_addrs ipv6;
 	};
-	struct flow_dissector_key_ports tp;
+	union {
+		struct flow_dissector_key_ports tp;
+		struct flow_dissector_key_icmp icmp;
+	};
 	struct flow_dissector_key_keyid enc_key_id;
 	union {
 		struct flow_dissector_key_ipv4_addrs enc_ipv4;
@@ -511,19 +514,23 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 			       &mask->tp.dst, TCA_FLOWER_KEY_SCTP_DST_MASK,
 			       sizeof(key->tp.dst));
 	} else if (flow_basic_key_is_icmpv4(&key->basic)) {
-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
-			       sizeof(key->tp.type));
-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
-			       sizeof(key->tp.code));
+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
+			       &mask->icmp.type,
+			       TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
+			       sizeof(key->icmp.type));
+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
+			       &mask->icmp.code,
+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
+			       sizeof(key->icmp.code));
 	} else if (flow_basic_key_is_icmpv6(&key->basic)) {
-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
-			       sizeof(key->tp.type));
-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
-			       sizeof(key->tp.code));
+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
+			       &mask->icmp.type,
+			       TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
+			       sizeof(key->icmp.type));
+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
+			       &mask->icmp.code,
+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
+			       sizeof(key->icmp.code));
 	}
 
 	if (tb[TCA_FLOWER_KEY_ENC_IPV4_SRC] ||
@@ -609,15 +616,16 @@ static int fl_init_hashtable(struct cls_fl_head *head,
 		keys[cnt].key_id = id;						\
 		keys[cnt].offset = FL_KEY_MEMBER_OFFSET(member);		\
 		cnt++;								\
-	} while(0);
+	} while(0)
 
 #define FL_KEY_SET_IF_MASKED(mask, keys, cnt, id, member)			\
 	do {									\
 		if (FL_KEY_IS_MASKED(mask, member))				\
 			FL_KEY_SET(keys, cnt, id, member);			\
-	} while(0);
+	} while(0)
 
 static void fl_init_dissector(struct cls_fl_head *head,
+			      struct cls_fl_filter *f,
 			      struct fl_flow_mask *mask)
 {
 	struct flow_dissector_key keys[FLOW_DISSECTOR_KEY_MAX];
@@ -631,8 +639,12 @@ static void fl_init_dissector(struct cls_fl_head *head,
 			     FLOW_DISSECTOR_KEY_IPV4_ADDRS, ipv4);
 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 			     FLOW_DISSECTOR_KEY_IPV6_ADDRS, ipv6);
-	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
-			     FLOW_DISSECTOR_KEY_PORTS, tp);
+	if (flow_basic_key_is_icmpv4(&f->key.basic))
+		FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+				     FLOW_DISSECTOR_KEY_ICMP, icmp);
+	else
+		FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+				     FLOW_DISSECTOR_KEY_PORTS, tp);
 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 			     FLOW_DISSECTOR_KEY_VLAN, vlan);
 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
@@ -652,6 +664,7 @@ static void fl_init_dissector(struct cls_fl_head *head,
 }
 
 static int fl_check_assign_mask(struct cls_fl_head *head,
+				struct cls_fl_filter *f,
 				struct fl_flow_mask *mask)
 {
 	int err;
@@ -672,7 +685,7 @@ static int fl_check_assign_mask(struct cls_fl_head *head,
 	memcpy(&head->mask, mask, sizeof(head->mask));
 	head->mask_assigned = true;
 
-	fl_init_dissector(head, mask);
+	fl_init_dissector(head, f, mask);
 
 	return 0;
 }
@@ -785,7 +798,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 	if (err)
 		goto errout;
 
-	err = fl_check_assign_mask(head, &mask);
+	err = fl_check_assign_mask(head, fnew, &mask);
 	if (err)
 		goto errout;
 
@@ -1000,24 +1013,24 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 				  sizeof(key->tp.dst))))
 		goto nla_put_failure;
 	else if (flow_basic_key_is_icmpv4(&key->basic) &&
-		 (fl_dump_key_val(skb, &key->tp.type,
-				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->tp.type,
+		 (fl_dump_key_val(skb, &key->icmp.type,
+				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->icmp.type,
 				  TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
-				  sizeof(key->tp.type)) ||
-		  fl_dump_key_val(skb, &key->tp.code,
-				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->tp.code,
+				  sizeof(key->icmp.type)) ||
+		  fl_dump_key_val(skb, &key->icmp.code,
+				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->icmp.code,
 				  TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
-				  sizeof(key->tp.code))))
+				  sizeof(key->icmp.code))))
 		goto nla_put_failure;
 	else if (flow_basic_key_is_icmpv6(&key->basic) &&
-		 (fl_dump_key_val(skb, &key->tp.type,
-				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->tp.type,
+		 (fl_dump_key_val(skb, &key->icmp.type,
+				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->icmp.type,
 				  TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
-				  sizeof(key->tp.type)) ||
-		  fl_dump_key_val(skb, &key->tp.code,
-				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->tp.code,
+				  sizeof(key->icmp.type)) ||
+		  fl_dump_key_val(skb, &key->icmp.code,
+				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->icmp.code,
 				  TCA_FLOWER_KEY_ICMPV6_CODE_MASK,
-				  sizeof(key->tp.code))))
+				  sizeof(key->icmp.code))))
 		goto nla_put_failure;
 
 	if (key->enc_control.addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS &&

^ permalink raw reply related

* Re: [PATCH net-next] net: dsa: mv88e6xxx: Use EDSA on mv88e6097
From: Andrew Lunn @ 2016-12-05 14:12 UTC (permalink / raw)
  To: Stefan Eichenberger; +Cc: vivien.didelot, netdev, Stefan Eichenberger
In-Reply-To: <20161205131242.19370-1-stefan.eichenberger@netmodule.com>

On Mon, Dec 05, 2016 at 02:12:42PM +0100, Stefan Eichenberger wrote:
> Use DSA_TAG_PROTO_EDSA as tag_protocol for the mv88e6097. The 
> initialisation was missing before.
> 
> Signed-off-by: Stefan Eichenberger <stefan.eichenberger@netmodule.com>

Upps, sorry. My error.

This is net-next only.

Fixes: a1f482aa8c33 ("net: dsa: mv88e6xxx: Move the tagging protocol into info")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [patch net v2] net: fec: fix compile with CONFIG_M5272
From: Nikita Yushchenko @ 2016-12-05 14:03 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, David S. Miller, Fugang Duan, Troy Kisky, Andrew Lunn,
	Eric Nelson, Philippe Reynes, Johannes Berg, netdev, Chris Healy,
	Fabio Estevam, linux-kernel
In-Reply-To: <60acf0f4-2f1f-2b2c-06ed-b1d8cb03525d@cogentembedded.com>

diff --git a/drivers/net/ethernet/freescale/fec_main.c
b/drivers/net/ethernet/freescale/fec_main.c
index 741cf4a57bfc..12aef1b15356 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -3301,7 +3301,7 @@ fec_probe(struct platform_device *pdev)

 	/* Init network device */
 	ndev = alloc_etherdev_mqs(sizeof(struct fec_enet_private) +
-				  FEC_STAT_SIZE, num_tx_qs, num_rx_qs);
+				  FEC_STATS_SIZE, num_tx_qs, num_rx_qs);
 	if (!ndev)
 		return -ENOMEM;


> Aieee   I was typing too fast today, sorry...
> 
> send separate "fix for the fix", or re-send patch without that silly typo?
> 
> Nikita
> 
>> Hi Nikita,
>>
>> [auto build test ERROR on net/master]
>>
>> url:    https://github.com/0day-ci/linux/commits/Nikita-Yushchenko/net-fec-fix-compile-with-CONFIG_M5272/20161205-181735
>> config: arm-multi_v5_defconfig (attached as .config)
>> compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
>> reproduce:
>>         wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
>>         chmod +x ~/bin/make.cross
>>         # save the attached .config to linux build tree
>>         make.cross ARCH=arm 
>>
>> All errors (new ones prefixed by >>):
>>
>>    drivers/net/ethernet/freescale/fec_main.c: In function 'fec_probe':
>>>> drivers/net/ethernet/freescale/fec_main.c:3304:7: error: 'FEC_STAT_SIZE' undeclared (first use in this function)
>>           FEC_STAT_SIZE, num_tx_qs, num_rx_qs);
>>           ^~~~~~~~~~~~~
>>    drivers/net/ethernet/freescale/fec_main.c:3304:7: note: each undeclared identifier is reported only once for each function it appears in
>>
>> vim +/FEC_STAT_SIZE +3304 drivers/net/ethernet/freescale/fec_main.c
>>
>>   3298		int num_rx_qs;
>>   3299	
>>   3300		fec_enet_get_queue_num(pdev, &num_tx_qs, &num_rx_qs);
>>   3301	
>>   3302		/* Init network device */
>>   3303		ndev = alloc_etherdev_mqs(sizeof(struct fec_enet_private) +
>>> 3304					  FEC_STAT_SIZE, num_tx_qs, num_rx_qs);
>>   3305		if (!ndev)
>>   3306			return -ENOMEM;
>>   3307	
>>
>> ---
>> 0-DAY kernel test infrastructure                Open Source Technology Center
>> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
>>

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox