Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next v3 2/2] tcp: allow to turn tcp timestamp randomization off
From: David Miller @ 2016-12-02 17:50 UTC (permalink / raw)
  To: fw; +Cc: netdev
In-Reply-To: <1480588327-2902-2-git-send-email-fw@strlen.de>

From: Florian Westphal <fw@strlen.de>
Date: Thu,  1 Dec 2016 11:32:07 +0100

> Eric says: "By looking at tcpdump, and TS val of xmit packets of multiple
> flows, we can deduct the relative qdisc delays (think of fq pacing).
> This should work even if we have one flow per remote peer."
> 
> Having random per flow (or host) offsets doesn't allow that anymore so add
> a way to turn this off.
> 
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> Acked-by: Yuchung Cheng <ycheng@google.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next v3 1/2] tcp: randomize tcp timestamp offsets for each connection
From: David Miller @ 2016-12-02 17:50 UTC (permalink / raw)
  To: fw; +Cc: netdev
In-Reply-To: <1480588327-2902-1-git-send-email-fw@strlen.de>

From: Florian Westphal <fw@strlen.de>
Date: Thu,  1 Dec 2016 11:32:06 +0100

> jiffies based timestamps allow for easy inference of number of devices
> behind NAT translators and also makes tracking of hosts simpler.
> 
> commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset")
> added the main infrastructure that is needed for per-connection ts
> randomization, in particular writing/reading the on-wire tcp header
> format takes the offset into account so rest of stack can use normal
> tcp_time_stamp (jiffies).
> 
> So only two items are left:
>  - add a tsoffset for request sockets
>  - extend the tcp isn generator to also return another 32bit number
>    in addition to the ISN.
> 
> Re-use of ISN generator also means timestamps are still monotonically
> increasing for same connection quadruple, i.e. PAWS will still work.
> 
> Includes fixes from Eric Dumazet.
> 
> Signed-off-by: Florian Westphal <fw@strlen.de>
> Acked-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Yuchung Cheng <ycheng@google.com>

Applied.

^ permalink raw reply

* Re: [PATCH v2] cpsw: ethtool: add support for getting/setting EEE registers
From: Florian Fainelli @ 2016-12-02 17:48 UTC (permalink / raw)
  To: Giuseppe CAVALLARO, Andrew Lunn, Yegor Yefremov
  Cc: netdev, linux-omap@vger.kernel.org, Grygorii Strashko,
	N, Mugunthan V, Rami Rosen, Fabrice GASNIER
In-Reply-To: <2b4912e6-c380-b2bd-762a-d1da2b0a7d82@st.com>

On 12/02/2016 01:11 AM, Giuseppe CAVALLARO wrote:
> Hi Florian
> sorry for my delay.
> 
> On 11/24/2016 7:23 PM, Florian Fainelli wrote:
>> +Peppe,
>>
>> Le 24/11/2016 à 07:38, Andrew Lunn a écrit :
>>>> As for enabling advertising and correct working of cpsw do you mean it
>>>> would be better to disable EEE in any PHY on cpsw initialization as
>>>> long as cpsw doesn't provide support for EEE?
>>>>
>>>> We observe some strange behavior with our gigabit PHYs and a link
>>>> partner in a EEE-capable unmanaged NetGear switch. Disabling
>>>> advertising seems to help. Though we're still investigating the issue.
>>>
>>> Hi Florian
>>>
>>> Am i right in saying, a PHY should not advertise EEE until the MAC
>>> driver calls phy_init_eee(), indicating the MAC supports EEE?
>>
>> You would think so, but I don't see how this could possibly work if that
>> was not the case already, see below.
>>
>>>
>>> If so, it looks like we need to change a few of the PHY drivers, in
>>> particular, the bcm-*.c.
>>
>> The first part that bcm-phy-lib.c does is make sure that EEE is enabled
>> such that this gets reflected in MDIO_PCS_EEE_ABLE, without this, we
>> won't be able to pass the first test in phy_init_eee(). The second part
>> is to advertise EEE such that this gets reflected in MDIO_AN_EEE_ADV,
>> also to make sure that we can pass the second check in phy_init_eee().
>>
>> Now, looking at phy_init_eee(), and what stmmac does (and bcmgenet,
>> copied after stmmac), we need to somehow, have EEE advertised for
>> phy_init_eee() to succeed, prepare the MAC to support EEE, and finally
>> conclude with a call to phy_ethtool_set_eee(), which writes to the
>> MDIO_AN_EEE_ADV register, and concludes the EEE auto-negotiated process.
>> Since we already have EEE advertised, we are essentially just checking
>> that the EEE advertised settings and the LP advertised settings actually
>> do match, so it sounds like the final call to phy_ethtool_set_eee() is
>> potentially useless if the resolved advertised and link partner
>> advertised settings already match...
>>
>> So it sounds like at least, the first time you try to initialize EEE, we
>> should start with EEE not advertised, and then, if we have EEE enabled
>> at some point, and we re-negotiate the link parameters, somehow
>> phy_init_eee() does a right job for that.
>>
>> Peppe, any thoughts on this?
> 
> I share what you say.
> 
> In sum, the EEE management inside the stmmac is:
> 
> - the driver looks at own HW cap register if EEE is supported
> 
>     (indeed the user could keep disable EEE if bugged on some HW
>      + Alex, Fabrice: we had some patches for this to propose where we
>              called the phy_ethtool_set_eee to disable feature at phy
>              level
> 
> - then the stmmac asks PHY layer to understand if transceiver and
>   partners are EEE capable.
> 
> - If all matches the EEE is actually initialized.
> 
> the logic above should be respected when use ethtool, hmm, I will
> check the stmmac_ethtool_op_set_eee asap.
> 
> Hoping this is useful

This is definitively useful, the only part that I am struggling to
understand in phy_init_eee() is this:

                eee_adv = phy_read_mmd_indirect(phydev, MDIO_AN_EEE_ADV,
                                                MDIO_MMD_AN);
                if (eee_adv <= 0)
                        goto eee_exit_err;

if we are not already advertising EEE in the PHY's MMIO_MMD_AN page, by
the time we call phy_init_eee(), then we cannot complete the EEE
configuration at the PHY level, and presumably we should abort the EEE
configuration at the MAC level.

While this condition makes sense if e.g: you are re-negotiating the link
with your partner for instance and if EEE was already advertised, the
very first time this function is called, it seems to be like we should
skip the check, because phy_init_eee() should actually tell us if, as a
result of a successful check, we should be setting EEE as something we
advertise?

Do you remember what was the logic behind this check when you added it?

Thanks!
-- 
Florian

^ permalink raw reply

* Re: [PATCH v3 0/3] Add QLogic FastLinQ iSCSI (qedi) driver.
From: David Miller @ 2016-12-02 17:44 UTC (permalink / raw)
  To: Manish.Rangankar
  Cc: martin.petersen, linux-scsi, netdev, QLogic-Storage-Upstream,
	Yuval.Mintz, cleech, lduncan
In-Reply-To: <D4671854.35C28%manish.rangankar@cavium.com>

From: "Rangankar, Manish" <Manish.Rangankar@cavium.com>
Date: Fri, 2 Dec 2016 07:00:39 +0000

> Please consider applying the qed patches 1 & 2 to net-next.

Ok, done.

^ permalink raw reply

* Re: [PATCH net v2 3/3] Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"
From: David Miller @ 2016-12-02 17:35 UTC (permalink / raw)
  To: elicooper; +Cc: netdev, eric.dumazet
In-Reply-To: <20161201020512.21661-3-elicooper@gmx.com>

From: Eli Cooper <elicooper@gmx.com>
Date: Thu,  1 Dec 2016 10:05:12 +0800

> This reverts commit ae148b085876fa771d9ef2c05f85d4b4bf09ce0d
> ("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()").
> 
> skb->protocol is now set in __ip_local_out() and __ip6_local_out() before
> dst_output() is called. It is no longer necessary to do it for each tunnel.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Eli Cooper <elicooper@gmx.com>

Applied.

^ permalink raw reply

* Re: [PATCH net v2 2/3] ipv6: Set skb->protocol properly for local output
From: David Miller @ 2016-12-02 17:34 UTC (permalink / raw)
  To: elicooper; +Cc: netdev, eric.dumazet
In-Reply-To: <20161201020512.21661-2-elicooper@gmx.com>

From: Eli Cooper <elicooper@gmx.com>
Date: Thu,  1 Dec 2016 10:05:11 +0800

> When xfrm is applied to TSO/GSO packets, it follows this path:
> 
>     xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
> 
> where skb_gso_segment() relies on skb->protocol to function properly.
> 
> This patch sets skb->protocol to ETH_P_IPV6 before dst_output() is called,
> fixing a bug where GSO packets sent through an ipip6 tunnel are dropped
> when xfrm is involved.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Eli Cooper <elicooper@gmx.com>

Applied.

^ permalink raw reply

* Re: [PATCH net v2 1/3] ipv4: Set skb->protocol properly for local output
From: David Miller @ 2016-12-02 17:34 UTC (permalink / raw)
  To: elicooper; +Cc: netdev, eric.dumazet
In-Reply-To: <20161201020512.21661-1-elicooper@gmx.com>

From: Eli Cooper <elicooper@gmx.com>
Date: Thu,  1 Dec 2016 10:05:10 +0800

> When xfrm is applied to TSO/GSO packets, it follows this path:
> 
>     xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
> 
> where skb_gso_segment() relies on skb->protocol to function properly.
> 
> This patch sets skb->protocol to ETH_P_IP before dst_output() is called,
> fixing a bug where GSO packets sent through a sit tunnel are dropped
> when xfrm is involved.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Eli Cooper <elicooper@gmx.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op
From: Vivien Didelot @ 2016-12-02 17:30 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20161202154342.GL21887@lunn.ch>

Hi Andrew,

Andrew Lunn <andrew@lunn.ch> writes:

> +	/* Switch Software Reset */
> +	int (*g1_reset)(struct mv88e6xxx_chip *chip);
> +
>
> We have a collection of function pointers with port_ prefix, another
> collection with stats_, and a third with ppu_, etc. And then we have
> some which do not fit a specific category. Those i have prefixed with
> g1_ or g2_. I think we should have some prefix, and that is my
> suggestion.

I disagree. There's only one entry point to issue a switch software
reset, so .reset is enough.

I use this opportunity to give a bit of details about mv88e6xxx/ so that
things get written down at least once somewhere:

global1.c implements accessors to "Global 1 Registers" features and are
prefixed with mv88e6xxx_g1_; port.c implements accessors to "Port
Registers" features and are prefixed with mv88e6xxx_port_, and so
on. (where xxx can be a model if there's conflict due to a redefinition
of the same register)

If a feature is not present or if there's more than one way to access
it, these accessors are bundled in the per-chip mv88e6xxx_ops structure
for disambiguation.

chip.c implements support for a single chip by aggregating and nicely
wrapping these operations. It provides a generic API for Marvell
switches, used to implement net/dsa routines.

  Here's a couple of example. Setting a switch MAC can be done in Global
  1, or Global 2 depending on the model. Thus .set_switch_mac can be
  mv88e6xxx_g1_set_switch_mac or mv88e6xxx_g2_set_switch_mac.

  Setting the port's speed is always in the same Port register, but its
  layout varies with the model. Thus .port_set_speed can be
  mv88e6185_port_set_speed or mv88e6352_port_set_speed.

Thanks,

        Vivien

^ permalink raw reply

* Re: [PATCH net-next v3] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
From: David Miller @ 2016-12-02 17:27 UTC (permalink / raw)
  To: nordmark; +Cc: netdev, hannes, gilligan
In-Reply-To: <1480549159-8142-1-git-send-email-nordmark@arista.com>

From: Erik Nordmark <nordmark@arista.com>
Date: Wed, 30 Nov 2016 15:39:19 -0800

> @@ -794,6 +808,17 @@ static void ndisc_recv_ns(struct sk_buff *skb)
>  have_ifp:
>  		if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) {
>  			if (dad) {
> +				if (nonce != 0 && ifp->dad_nonce == nonce) {
> +					u8 *np = (u8 *)&nonce;
> +					/* Matching nonce if looped back */
> +					ND_PRINTK(2, notice,
> +						  "%s: IPv6 DAD loopback for address %pI6c nonce %02x:%02x:%02x:%02x:%02x:%02x ignored\n",
> +						  ifp->idev->dev->name,
> +						  &ifp->addr,
> +						  np[0], np[1], np[2], np[3],
> +						  np[4], np[5]);

I know you said you'd leave this, but I'd actually like to ask that you
use %pM here to save some kernel size.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op
From: David Miller @ 2016-12-02 17:23 UTC (permalink / raw)
  To: vivien.didelot; +Cc: netdev, linux-kernel, kernel, f.fainelli, andrew
In-Reply-To: <20161130225930.25510-4-vivien.didelot@savoirfairelinux.com>

From: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Date: Wed, 30 Nov 2016 17:59:27 -0500

> @@ -765,6 +765,9 @@ struct mv88e6xxx_ops {
>  	int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg,
>  			 u16 val);
>  
> +	/* Switch Software Reset */
> +	int (*reset)(struct mv88e6xxx_chip *chip);
> +

I think Andrew's request to name this method "g1_reset" is reasonable, please
respin with that change.

Thanks.

^ permalink raw reply

* Re: [flamebait] xdp, well meaning but pointless
From: Jesper Dangaard Brouer @ 2016-12-02 17:22 UTC (permalink / raw)
  To: Florian Westphal; +Cc: brouer, netdev
In-Reply-To: <20161201091108.GF26507@breakpoint.cc>


On Thu, 1 Dec 2016 10:11:08 +0100 Florian Westphal <fw@strlen.de> wrote:

> In light of DPDKs existence it make a lot more sense to me to provide
> a). a faster mmap based interface (possibly AF_PACKET based) that allows
> to map nic directly into userspace, detaching tx/rx queue from kernel.
> 
> John Fastabend sent something like this last year as a proof of
> concept, iirc it was rejected because register space got exposed directly
> to userspace.  I think we should re-consider merging netmap
> (or something conceptually close to its design).

I'm actually working in this direction, of zero-copy RX mapping packets
into userspace.  This work is mostly related to page_pool, and I only
plan to use XDP as a filter for selecting packets going to userspace,
as this choice need to be taken very early.

My design is here:
 https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html

This is mostly about changing the memory model in the drivers, to allow
for safely mapping pages to userspace.  (An efficient queue mechanism is
not covered).  People often overlook that netmap's efficiency *also* comes
from introducing pre-mapping memory/pages to userspace.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH 5/7] Documentation: DT: net: cpsw: allow to specify descriptors pool size
From: Grygorii Strashko @ 2016-12-02 17:22 UTC (permalink / raw)
  To: Ivan Khoronzhuk
  Cc: David S. Miller, netdev, Mugunthan V N, Sekhar Nori, linux-kernel,
	linux-omap
In-Reply-To: <20161202112832.GB1213@khorivan>



On 12/02/2016 05:28 AM, Ivan Khoronzhuk wrote:
> On Thu, Dec 01, 2016 at 05:34:30PM -0600, Grygorii Strashko wrote:
>> Add optional property "descs_pool_size" to specify buffer descriptor's
>> pool size. The "descs_pool_size" should define total number of CPDMA
>> CPPI descriptors to be used for both ingress/egress packets
>> processing. If not specified - the default value 256 will be used
>> which will allow to place descriptor's pool into the internal CPPI
>> RAM on most of TI SoC.
>>
>> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
>> ---
>>  Documentation/devicetree/bindings/net/cpsw.txt | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt
>> index 5ad439f..b99d196 100644
>> --- a/Documentation/devicetree/bindings/net/cpsw.txt
>> +++ b/Documentation/devicetree/bindings/net/cpsw.txt
>> @@ -35,6 +35,11 @@ Optional properties:
>>  			  For example in dra72x-evm, pcf gpio has to be
>>  			  driven low so that cpsw slave 0 and phy data
>>  			  lines are connected via mux.
>> +- descs_pool_size	: total number of CPDMA CPPI descriptors to be used for
>> +			  both ingress/egress packets processing. if not
>> +			  specified the default value 256 will be used which
>> +			  will allow to place descriptors pool into the
>> +			  internal CPPI RAM.
> Does it describe h/w? Why now module parameter? or even smth like ethtool num
> ring entries?
> 

It can be module parameter too. in general this is expected to be 
 one-time boot setting only.  

----- OR
So, do you propose to use 
       ethtool -g ethX

       ethtool -G ethX [rx N] [tx N]
?

Now cpdma has one pool for all RX/TX channels, so changing this settings
by ethtool will require: pause interfaces, reallocate cpdma pool, 
re-arrange buffers between channels, resume interface. Correct?

How do you think - we can move forward with one pool or better to have two (Rx and Tx)?

Wouldn't it be reasonable to still have DT (or module) parameter to avoid 
cpdma reconfiguration on system startup (pause/resume interfaces) (faster boot)?

How about cpdma re-allocation policy (with expectation that is shouldn't happen too often)?
- increasing of Rx, Tx will grow total number of physically allocated buffers (total_desc_num)
- decreasing of Rx, Tx will just change number of available buffers (no memory re-allocation)

----- OR ----
Can we move forward with current patch (total number of CPDMA CPPI descriptors defined in DT) 
and add ethtool -G ethX [rx N] [tx N] which will allow to re-split descs between RX and TX?



-- 
regards,
-grygorii

^ permalink raw reply

* Re: [PATCH 5/7] Documentation: DT: net: cpsw: allow to specify descriptors pool size
From: Grygorii Strashko @ 2016-12-02 17:21 UTC (permalink / raw)
  To: Ivan Khoronzhuk
  Cc: David S. Miller, netdev, Mugunthan V N, Sekhar Nori, linux-kernel,
	linux-omap
In-Reply-To: <20161202112832.GB1213@khorivan>



On 12/02/2016 05:28 AM, Ivan Khoronzhuk wrote:
> On Thu, Dec 01, 2016 at 05:34:30PM -0600, Grygorii Strashko wrote:
>> Add optional property "descs_pool_size" to specify buffer descriptor's
>> pool size. The "descs_pool_size" should define total number of CPDMA
>> CPPI descriptors to be used for both ingress/egress packets
>> processing. If not specified - the default value 256 will be used
>> which will allow to place descriptor's pool into the internal CPPI
>> RAM on most of TI SoC.
>>
>> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
>> ---
>>  Documentation/devicetree/bindings/net/cpsw.txt | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt
>> index 5ad439f..b99d196 100644
>> --- a/Documentation/devicetree/bindings/net/cpsw.txt
>> +++ b/Documentation/devicetree/bindings/net/cpsw.txt
>> @@ -35,6 +35,11 @@ Optional properties:
>>  			  For example in dra72x-evm, pcf gpio has to be
>>  			  driven low so that cpsw slave 0 and phy data
>>  			  lines are connected via mux.
>> +- descs_pool_size	: total number of CPDMA CPPI descriptors to be used for
>> +			  both ingress/egress packets processing. if not
>> +			  specified the default value 256 will be used which
>> +			  will allow to place descriptors pool into the
>> +			  internal CPPI RAM.
> Does it describe h/w? Why now module parameter? or even smth like ethtool num
> ring entries?
> 

It can be module parameter too. for the use cases i'm aware of -
this is one-time boot setting only.  

----- OR
So, do you propose to use 
       ethtool -g ethX

       ethtool -G ethX [rx N] [tx N]
?

Now cpdma has one pool for all RX/TX channels, so changing this settings
by ethtool will require: pause interfaces, reallocate cpdma pool, 
re-arrange buffers between channels, resume interface. Correct?

How do you think - we can move forward with one pool or better to have two (Rx and Tx)?

Wouldn't it be reasonable to still have DT (or module) parameter to avoid 
cpdma reconfiguration on system startup (pause/resume interfaces) (faster boot)?

How about cpdma re-allocation policy (with expectation that is shouldn't happen too often)?
- increasing of Rx, Tx will grow total number of physically allocated buffers (total_desc_num)
- decreasing of Rx, Tx will just change number of available buffers (no memory re-allocation)

----- OR ----
Can we move forward with current patch (total number of CPDMA CPPI descriptors defined in DT) 
and add ethtool -G ethX [rx N] [tx N] which will allow to re-split descs between RX and TX?



-- 
regards,
-grygorii

^ permalink raw reply

* Re: [PATCH net] packet: fix race condition in packet_set_ring
From: David Miller @ 2016-12-02 17:17 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, philip.pettersson
In-Reply-To: <1480546536.18162.216.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 30 Nov 2016 14:55:36 -0800

> From: Philip Pettersson <philip.pettersson@gmail.com>
> 
> When packet_set_ring creates a ring buffer it will initialize a
> struct timer_list if the packet version is TPACKET_V3. This value
> can then be raced by a different thread calling setsockopt to
> set the version to TPACKET_V1 before packet_set_ring has finished.
> 
> This leads to a use-after-free on a function pointer in the
> struct timer_list when the socket is closed as the previously
> initialized timer will not be deleted.
> 
> The bug is fixed by taking lock_sock(sk) in packet_setsockopt when
> changing the packet version while also taking the lock at the start
> of packet_set_ring.
> 
> Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
> Signed-off-by: Philip Pettersson <philip.pettersson@gmail.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-02 17:12 UTC (permalink / raw)
  To: Edward Cree
  Cc: Hannes Frederic Sowa, Florian Westphal,
	Linux Kernel Network Developers, Jesper Dangaard Brouer
In-Reply-To: <cb2e6263-d981-eccf-cea7-39392ceb67b5@solarflare.com>

On Fri, Dec 2, 2016 at 6:36 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 01/12/16 23:46, Tom Herbert wrote:
>> The only time we
>> _really_ to allocate an skbuf is when we need to put the packet onto a
>> queue. All the other use cases are really just to pass a structure
>> containing a packet from function to function. For that purpose we
>> should be able to just pass a much smaller structure in a stack
>> argument and only allocate an skbuff when we need to enqueue. In cases
>> where we don't ever queue a packet we might never need to allocate any
>> skbuff
> Now this intrigues me, because one of the objections to bundling (vs GRO)
> was the memory usage of all those SKBs.  IIRC we already do a 'GRO-like'
> coalescing when packets reach a TCP socket anyway (or at least in some
> cases, not sure if all the different ways we can enqueue a TCP packet for
> RX do it), but if we could carry the packets from NIC to socket without
> SKBs, doing so in lists rather than one-at-a-time wouldn't cost any extra
> memory (the packet-pages are all already allocated on the NIC RX ring).
> Possibly combine the two, so that rather than having potentially four
> versions of each function (skb, skbundle, void*, void* bundle) you just
> have the two 'ends'.
>
Yep, seems like a good idea to incorporate bundling into TXDP from the get-go.

Tom

> -Ed

^ permalink raw reply

* Re: [PATCH 2/2] net: ethernet: altera: TSE: do not use tx queue lock in tx completion handler
From: David Miller @ 2016-12-02 17:11 UTC (permalink / raw)
  To: LinoSanfilippo; +Cc: vbridger, nios2-dev, linux-kernel, netdev
In-Reply-To: <1480546112-3099-2-git-send-email-LinoSanfilippo@gmx.de>

From: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Date: Wed, 30 Nov 2016 23:48:32 +0100

> The driver already uses its private lock for synchronization between xmit
> and xmit completion handler making the additional use of the xmit_lock
> unnecessary.
> Furthermore the driver does not set NETIF_F_LLTX resulting in xmit to be
> called with the xmit_lock held and then taking the private lock while xmit
> completion handler does the reverse, first take the private lock, then the
> xmit_lock.
> Fix these issues by not taking the xmit_lock in the tx completion handler.
> 
> Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>

Yeah that could be a nasty deadlock, in fact.

Applied, thanks.

^ permalink raw reply

* Re: [PATCH 1/2] net: ethernet: altera: TSE: Remove unneeded dma sync for tx buffers
From: David Miller @ 2016-12-02 17:11 UTC (permalink / raw)
  To: LinoSanfilippo; +Cc: vbridger, nios2-dev, linux-kernel, netdev
In-Reply-To: <1480546112-3099-1-git-send-email-LinoSanfilippo@gmx.de>

From: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Date: Wed, 30 Nov 2016 23:48:31 +0100

> An explicit dma sync for device directly after mapping as well as an
> explicit dma sync for cpu directly before unmapping is unnecessary and
> costly on the hotpath. So remove these calls.
> 
> Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>

Applied.

^ permalink raw reply

* Re: [PATCH/RFC net-next 0/2] net/sched: cls_flower: Support matching on ICMP
From: Jiri Pirko @ 2016-12-02 17:10 UTC (permalink / raw)
  To: Simon Horman
  Cc: netdev, David S.  Miller, Jay Vosburgh, Veaceslav Falico,
	Andy Gospodarek, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <1480672352-13291-1-git-send-email-simon.horman@netronome.com>

Fri, Dec 02, 2016 at 10:52:30AM CET, simon.horman@netronome.com wrote:
>Hi,
>
>this series add supports for matching on ICMP type and code to cls_flower.
>This is modeled on existing support for matching on L4 ports. The updates
>to the dissector are intended to allow for code and storage re-use.

Looks fine to me. Thanks!


>
>Simon Horman (2):
>  flow dissector: ICMP support
>  net/sched: cls_flower: Support matching on ICMP type and code
>
> drivers/net/bonding/bond_main.c |  6 +++--
> include/linux/skbuff.h          |  5 +++++
> include/net/flow_dissector.h    | 50 ++++++++++++++++++++++++++++++++++++++---
> include/uapi/linux/pkt_cls.h    | 10 +++++++++
> net/core/flow_dissector.c       | 34 +++++++++++++++++++++++++---
> net/sched/cls_flow.c            |  4 ++--
> net/sched/cls_flower.c          | 42 ++++++++++++++++++++++++++++++++++
> 7 files changed, 141 insertions(+), 10 deletions(-)
>
>-- 
>2.7.0.rc3.207.g0ac5344
>

^ permalink raw reply

* Re: [PATCH/RFC iproute2/net-next 0/3] tc: flower: Support matching on ICMP
From: Jiri Pirko @ 2016-12-02 17:10 UTC (permalink / raw)
  To: Simon Horman; +Cc: netdev, Stephen Hemminger, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <1480672785-14570-1-git-send-email-simon.horman@netronome.com>

Fri, Dec 02, 2016 at 10:59:42AM CET, simon.horman@netronome.com wrote:
>Add support for matching on ICMP type and code to flower. This is modeled
>on existing support for matching on L4 ports.
>
>The second patch provided a minor cleanup which is in keeping with
>they style used in the last patch.
>
>This is marked as an RFC to match the same designation given to the
>corresponding kernel patches.

Looks nice, I only have those 2 enum nitpicks.

Thanks.

>
>Based on iproute2/net-next with the following applied:
>* [[PATCH iproute2/net-next v2] 0/4] tc: flower: SCTP and other port fixes
>
>Simon Horman (3):
>  tc: flower: update headers for TCA_FLOWER_KEY_ICMP*
>  tc: flower: introduce enum flower_endpoint
>  tc: flower: support matching on ICMP type and code
>
> include/linux/pkt_cls.h |  10 ++++
> man/man8/tc-flower.8    |  20 ++++++--
> tc/f_flower.c           | 118 ++++++++++++++++++++++++++++++++++++++++++------
> 3 files changed, 129 insertions(+), 19 deletions(-)
>
>-- 
>2.7.0.rc3.207.g0ac5344
>

^ permalink raw reply

* Re: [PATCH/RFC iproute2/net-next 3/3] tc: flower: support matching on ICMP type and code
From: Jiri Pirko @ 2016-12-02 17:09 UTC (permalink / raw)
  To: Simon Horman, g; +Cc: netdev, Stephen Hemminger, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <1480672785-14570-4-git-send-email-simon.horman@netronome.com>

Fri, Dec 02, 2016 at 10:59:45AM CET, simon.horman@netronome.com wrote:
>Support matching on ICMP type and code.
>
>Example usage:
>
>tc qdisc add dev eth0 ingress
>
>tc filter add dev eth0 protocol ip parent ffff: flower \
>	indev eth0 ip_proto icmp type 8 code 0 action drop
>
>tc filter add dev eth0 protocol ipv6 parent ffff: flower \
>	indev eth0 ip_proto icmpv6 type 128 code 0 action drop
>
>Signed-off-by: Simon Horman <simon.horman@netronome.com>
>---
> man/man8/tc-flower.8 | 20 ++++++++---
> tc/f_flower.c        | 96 ++++++++++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 105 insertions(+), 11 deletions(-)
>
>diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
>index a401293fed50..c01ace6249dd 100644
>--- a/man/man8/tc-flower.8
>+++ b/man/man8/tc-flower.8
>@@ -29,7 +29,7 @@ flower \- flow based traffic control filter
> .IR PRIORITY " | "
> .BR vlan_eth_type " { " ipv4 " | " ipv6 " | "
> .IR ETH_TYPE " } | "
>-.BR ip_proto " { " tcp " | " udp " | " sctp " | "
>+.BR ip_proto " { " tcp " | " udp " | " sctp " | " icmp " | " icmpv6 " | "
> .IR IP_PROTO " } | { "
> .BR dst_ip " | " src_ip " } { "
> .IR ipv4_address " | " ipv6_address " } | { "
>@@ -94,7 +94,7 @@ or an unsigned 16bit value in hexadecimal format.
> Match on layer four protocol.
> .I IP_PROTO
> may be
>-.BR tcp ", " udp ", " sctp
>+.BR tcp ", " udp ", " sctp ", " icmp ", " icmpv6
> or an unsigned 8bit value in hexadecimal format.
> .TP
> .BI dst_ip " ADDRESS"
>@@ -112,6 +112,13 @@ option of tc filter.
> Match on layer 4 protocol source or destination port number. Only available for
> .BR ip_proto " values " udp ", " tcp  " and " sctp
> which have to be specified in beforehand.
>+.TP
>+.BI type " NUMBER"
>+.TQ
>+.BI code " NUMBER"
>+Match on ICMP type or code. Only available for
>+.BR ip_proto " values " icmp  " and " icmpv6
>+which have to be specified in beforehand.
> .SH NOTES
> As stated above where applicable, matches of a certain layer implicitly depend
> on the matches of the next lower layer. Precisely, layer one and two matches
>@@ -120,13 +127,16 @@ have no dependency, layer three matches
> (\fBip_proto\fR, \fBdst_ip\fR and \fBsrc_ip\fR)
> depend on the
> .B protocol
>-option of tc filter
>-and finally layer four matches
>+option of tc filter, layer four port matches
> (\fBdst_port\fR and \fBsrc_port\fR)
> depend on
> .B ip_proto
> being set to
>-.BR tcp ", " udp " or " sctp.
>+.BR tcp ", " udp " or " sctp,
>+and finally ICMP matches (\fBcode\fR and \fBtype\fR) depend on
>+.B ip_proto
>+being set to
>+.BR icmp " or " icmpv6.
> .P
> There can be only used one mask per one prio. If user needs to specify different
> mask, he has to use different prio.
>diff --git a/tc/f_flower.c b/tc/f_flower.c
>index 42253067b43d..59f6f1ea26e6 100644
>--- a/tc/f_flower.c
>+++ b/tc/f_flower.c
>@@ -28,6 +28,11 @@ enum flower_endpoint {
> 	flower_dst
> };
> 
>+enum flower_icmp_field {
>+	flower_icmp_type,
>+	flower_icmp_code

	FLOWER_ICMP_FIELD_TYPE,
	FLOWER_ICMP_FIELD_CODE,

>+};
>+
> static void explain(void)
> {
> 	fprintf(stderr,
>@@ -42,11 +47,13 @@ static void explain(void)
> 		"                       vlan_ethtype [ ipv4 | ipv6 | ETH-TYPE ] |\n"
> 		"                       dst_mac MAC-ADDR |\n"
> 		"                       src_mac MAC-ADDR |\n"
>-		"                       ip_proto [tcp | udp | sctp | IP-PROTO ] |\n"
>+		"                       ip_proto [tcp | udp | sctp | icmp | icmpv6 | IP-PROTO ] |\n"
> 		"                       dst_ip [ IPV4-ADDR | IPV6-ADDR ] |\n"
> 		"                       src_ip [ IPV4-ADDR | IPV6-ADDR ] |\n"
> 		"                       dst_port PORT-NUMBER |\n"
>-		"                       src_port PORT-NUMBER }\n"
>+		"                       src_port PORT-NUMBER |\n"
>+		"                       type ICMP-TYPE |\n"
>+		"                       code ICMP-CODE }\n"
> 		"       FILTERID := X:Y:Z\n"
> 		"       ACTION-SPEC := ... look at individual actions\n"
> 		"\n"
>@@ -95,16 +102,23 @@ static int flower_parse_ip_proto(char *str, __be16 eth_type, int type,
> 	int ret;
> 	__u8 ip_proto;
> 
>-	if (eth_type != htons(ETH_P_IP) && eth_type != htons(ETH_P_IPV6)) {
>-		fprintf(stderr, "Illegal \"eth_type\" for ip proto\n");
>-		return -1;
>-	}
>+	if (eth_type != htons(ETH_P_IP) && eth_type != htons(ETH_P_IPV6))
>+		goto err;
>+
> 	if (matches(str, "tcp") == 0) {
> 		ip_proto = IPPROTO_TCP;
> 	} else if (matches(str, "udp") == 0) {
> 		ip_proto = IPPROTO_UDP;
> 	} else if (matches(str, "sctp") == 0) {
> 		ip_proto = IPPROTO_SCTP;
>+	} else if (matches(str, "icmp") == 0) {
>+		if (eth_type != htons(ETH_P_IP))
>+			goto err;
>+		ip_proto = IPPROTO_ICMP;
>+	} else if (matches(str, "icmpv6") == 0) {
>+		if (eth_type != htons(ETH_P_IPV6))
>+			goto err;
>+		ip_proto = IPPROTO_ICMPV6;
> 	} else {
> 		ret = get_u8(&ip_proto, str, 16);
> 		if (ret)
>@@ -113,6 +127,10 @@ static int flower_parse_ip_proto(char *str, __be16 eth_type, int type,
> 	addattr8(n, MAX_MSG, type, ip_proto);
> 	*p_ip_proto = ip_proto;
> 	return 0;
>+
>+err:
>+	fprintf(stderr, "Illegal \"eth_type\" for ip proto\n");
>+	return -1;
> }
> 
> static int flower_parse_ip_addr(char *str, __be16 eth_type,
>@@ -165,6 +183,39 @@ static int flower_parse_ip_addr(char *str, __be16 eth_type,
> 	return 0;
> }
> 
>+static int flower_icmp_attr_type(__be16 eth_type, __u8 ip_proto,
>+				 enum flower_icmp_field field)
>+{
>+	if (eth_type == htons(ETH_P_IP) && ip_proto == IPPROTO_ICMP)
>+		return field == flower_icmp_code ? TCA_FLOWER_KEY_ICMPV4_CODE :
>+			TCA_FLOWER_KEY_ICMPV4_TYPE;
>+	else if (eth_type == htons(ETH_P_IPV6) &&ip_proto == IPPROTO_ICMPV6)
>+		return field == flower_icmp_code ? TCA_FLOWER_KEY_ICMPV6_CODE :
>+			TCA_FLOWER_KEY_ICMPV6_TYPE;
>+
>+	return -1;
>+}
>+
>+static int flower_parse_icmp(char *str, __u16 eth_type, __u8 ip_proto,
>+			     bool is_code, struct nlmsghdr *n)
>+{
>+	int ret;
>+	int type;
>+	uint8_t value;
>+
>+	type = flower_icmp_attr_type(eth_type, ip_proto, is_code);
>+	if (type < 0)
>+		return -1;
>+
>+	ret = get_u8(&value, str, 10);
>+	if (ret)
>+		return -1;
>+
>+	addattr8(n, MAX_MSG, type, value);
>+
>+	return 0;
>+}
>+
> static int flower_port_attr_type(__u8 ip_proto, enum flower_endpoint endpoint)
> {
> 	if (ip_proto == IPPROTO_TCP)
>@@ -358,6 +409,22 @@ static int flower_parse_opt(struct filter_util *qu, char *handle,
> 				fprintf(stderr, "Illegal \"src_port\"\n");
> 				return -1;
> 			}
>+		} else if (matches(*argv, "type") == 0) {
>+			NEXT_ARG();
>+			ret = flower_parse_icmp(*argv, eth_type, ip_proto,
>+						false, n);
>+			if (ret < 0) {
>+				fprintf(stderr, "Illegal \"icmp type\"\n");
>+				return -1;
>+			}
>+		} else if (matches(*argv, "code") == 0) {
>+			NEXT_ARG();
>+			ret = flower_parse_icmp(*argv, eth_type, ip_proto,
>+						true, n);
>+			if (ret < 0) {
>+				fprintf(stderr, "Illegal \"icmp code\"\n");
>+				return -1;
>+			}
> 		} else if (matches(*argv, "action") == 0) {
> 			NEXT_ARG();
> 			ret = parse_action(&argc, &argv, TCA_FLOWER_ACT, n);
>@@ -471,6 +538,10 @@ static void flower_print_ip_proto(FILE *f, __u8 *p_ip_proto,
> 		fprintf(f, "udp");
> 	else if (ip_proto == IPPROTO_SCTP)
> 		fprintf(f, "sctp");
>+	else if (ip_proto == IPPROTO_ICMP)
>+		fprintf(f, "icmp");
>+	else if (ip_proto == IPPROTO_ICMPV6)
>+		fprintf(f, "icmpv6");
> 	else
> 		fprintf(f, "%02x", ip_proto);
> 	*p_ip_proto = ip_proto;
>@@ -519,6 +590,12 @@ static void flower_print_port(FILE *f, char *name, struct rtattr *attr)
> 		fprintf(f, "\n  %s %d", name, ntohs(rta_getattr_u16(attr)));
> }
> 
>+static void flower_print_icmp(FILE *f, char *name, struct rtattr *attr)
>+{
>+	if (attr)
>+		fprintf(f, "\n  %s %d", name, ntohs(rta_getattr_u8(attr)));
>+}
>+
> static int flower_print_opt(struct filter_util *qu, FILE *f,
> 			    struct rtattr *opt, __u32 handle)
> {
>@@ -587,6 +664,13 @@ static int flower_print_opt(struct filter_util *qu, FILE *f,
> 	if (nl_type >= 0)
> 		flower_print_port(f, "src_port", tb[nl_type]);
> 
>+	nl_type = flower_icmp_attr_type(eth_type, ip_proto, false);
>+	if (nl_type >= 0)
>+		flower_print_icmp(f, "icmp_type", tb[nl_type]);
>+	nl_type = flower_icmp_attr_type(eth_type, ip_proto, true);
>+	if (nl_type >= 0)
>+		flower_print_icmp(f, "icmp_code", tb[nl_type]);
>+
> 	if (tb[TCA_FLOWER_FLAGS]) {
> 		__u32 flags = rta_getattr_u32(tb[TCA_FLOWER_FLAGS]);
> 
>-- 
>2.7.0.rc3.207.g0ac5344
>

^ permalink raw reply

* Re: [PATCH/RFC iproute2/net-next 2/3] tc: flower: introduce enum flower_endpoint
From: Jiri Pirko @ 2016-12-02 17:08 UTC (permalink / raw)
  To: Simon Horman; +Cc: netdev, Stephen Hemminger, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <1480672785-14570-3-git-send-email-simon.horman@netronome.com>

Fri, Dec 02, 2016 at 10:59:44AM CET, simon.horman@netronome.com wrote:
>Introduce enum flower_endpoint and use it instead of a bool
>as the type for paramatising source and destination.
>
>This is intended to improve read-ability and provide some type
>checking of endpoint parameters.
>
>Signed-off-by: Simon Horman <simon.horman@netronome.com>
>---
> tc/f_flower.c | 22 ++++++++++++++--------
> 1 file changed, 14 insertions(+), 8 deletions(-)
>
>diff --git a/tc/f_flower.c b/tc/f_flower.c
>index 615e8f27bed2..42253067b43d 100644
>--- a/tc/f_flower.c
>+++ b/tc/f_flower.c
>@@ -23,6 +23,11 @@
> #include "tc_util.h"
> #include "rt_names.h"
> 
>+enum flower_endpoint {
>+	flower_src,
>+	flower_dst

	FLOWER_ENDPOINT_SRC,
	FLOWER_ENDPOINT_DST,

>+};
>+
> static void explain(void)
> {
> 	fprintf(stderr,
>@@ -160,29 +165,30 @@ static int flower_parse_ip_addr(char *str, __be16 eth_type,
> 	return 0;
> }
> 
>-static int flower_port_attr_type(__u8 ip_proto, bool is_src)
>+static int flower_port_attr_type(__u8 ip_proto, enum flower_endpoint endpoint)
> {
> 	if (ip_proto == IPPROTO_TCP)
>-		return is_src ? TCA_FLOWER_KEY_TCP_SRC :
>+		return endpoint == flower_src ? TCA_FLOWER_KEY_TCP_SRC :
> 			TCA_FLOWER_KEY_TCP_DST;
> 	else if (ip_proto == IPPROTO_UDP)
>-		return is_src ? TCA_FLOWER_KEY_UDP_SRC :
>+		return endpoint == flower_src ? TCA_FLOWER_KEY_UDP_SRC :
> 			TCA_FLOWER_KEY_UDP_DST;
> 	else if (ip_proto == IPPROTO_SCTP)
>-		return is_src ? TCA_FLOWER_KEY_SCTP_SRC :
>+		return endpoint == flower_src ? TCA_FLOWER_KEY_SCTP_SRC :
> 			TCA_FLOWER_KEY_SCTP_DST;
> 	else
> 		return -1;
> }
> 
>-static int flower_parse_port(char *str, __u8 ip_proto, bool is_src,
>+static int flower_parse_port(char *str, __u8 ip_proto,
>+			     enum flower_endpoint endpoint,
> 			     struct nlmsghdr *n)
> {
> 	int ret;
> 	int type;
> 	__be16 port;
> 
>-	type = flower_port_attr_type(ip_proto, is_src);
>+	type = flower_port_attr_type(ip_proto, endpoint);
> 	if (type < 0)
> 		return -1;
> 
>@@ -340,14 +346,14 @@ static int flower_parse_opt(struct filter_util *qu, char *handle,
> 			}
> 		} else if (matches(*argv, "dst_port") == 0) {
> 			NEXT_ARG();
>-			ret = flower_parse_port(*argv, ip_proto, false, n);
>+			ret = flower_parse_port(*argv, ip_proto, flower_dst, n);
> 			if (ret < 0) {
> 				fprintf(stderr, "Illegal \"dst_port\"\n");
> 				return -1;
> 			}
> 		} else if (matches(*argv, "src_port") == 0) {
> 			NEXT_ARG();
>-			ret = flower_parse_port(*argv, ip_proto, true, n);
>+			ret = flower_parse_port(*argv, ip_proto, flower_src, n);
> 			if (ret < 0) {
> 				fprintf(stderr, "Illegal \"src_port\"\n");
> 				return -1;
>-- 
>2.7.0.rc3.207.g0ac5344
>

^ permalink raw reply

* Re: net/can: warning in raw_setsockopt/__alloc_pages_slowpath
From: Oliver Hartkopp @ 2016-12-02 17:05 UTC (permalink / raw)
  To: Marc Kleine-Budde, Andrey Konovalov, David S. Miller, linux-can,
	netdev, LKML
  Cc: Dmitry Vyukov, Kostya Serebryany, syzkaller
In-Reply-To: <74e2aed8-ba38-ee91-59a3-49131ea18d60@pengutronix.de>



On 12/02/2016 04:42 PM, Marc Kleine-Budde wrote:
> On 12/02/2016 04:11 PM, Oliver Hartkopp wrote:
>>
>>
>> On 12/02/2016 02:24 PM, Marc Kleine-Budde wrote:
>>> On 12/02/2016 01:43 PM, Andrey Konovalov wrote:
>>
>>
>>>>  [<ffffffff8369e0de>] raw_setsockopt+0x1be/0x9f0 net/can/raw.c:506
>>>
>>> We should add a check for a sensible optlen....
>>>
>>>> static int raw_setsockopt(struct socket *sock, int level, int optname,
>>>> 			  char __user *optval, unsigned int optlen)
>>>> {
>>>> 	struct sock *sk = sock->sk;
>>>> 	struct raw_sock *ro = raw_sk(sk);
>>>> 	struct can_filter *filter = NULL;  /* dyn. alloc'ed filters */
>>>> 	struct can_filter sfilter;         /* single filter */
>>>> 	struct net_device *dev = NULL;
>>>> 	can_err_mask_t err_mask = 0;
>>>> 	int count = 0;
>>>> 	int err = 0;
>>>>
>>>> 	if (level != SOL_CAN_RAW)
>>>> 		return -EINVAL;
>>>>
>>>> 	switch (optname) {
>>>>
>>>> 	case CAN_RAW_FILTER:
>>>> 		if (optlen % sizeof(struct can_filter) != 0)
>>>> 			return -EINVAL;
>>>
>>> here...
>>>
>>> 		if (optlen > 64 * sizeof(struct can_filter))
>>> 			return -EINVAL;
>>>
>>
>> Agreed.
>>
>> But what is sensible here?
>> 64 filters is way to small IMO.
>>
>> When thinking about picking a bunch of single CAN IDs I would tend to
>> something like 512 filters.
>
> Ok - 64 was just an arbitrary chosen value for demonstration purposes :)
>

:-)

Would you like to send a patch?

Regards,
Oliver

^ permalink raw reply

* Re: [PATCH] net: atarilance: use %8ph for printing hex string
From: David Miller @ 2016-12-02 17:03 UTC (permalink / raw)
  To: linux; +Cc: felipe.balbi, netdev, linux-kernel
In-Reply-To: <1480543375-28904-1-git-send-email-linux@rasmusvillemoes.dk>

From: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Date: Wed, 30 Nov 2016 23:02:54 +0100

> This is already using the %pM printf extension; might as well also use
> %ph to make the code smaller.
> 
> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>

Applied, thank you.

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH v2] e1000e: free IRQ regardless of __E1000_DOWN
From: Baicar, Tyler @ 2016-12-02 17:02 UTC (permalink / raw)
  To: Neftin, Sasha, jeffrey.t.kirsher, intel-wired-lan, netdev,
	linux-kernel, okaya, timur
In-Reply-To: <20c191fe-e8fe-b2ca-0628-7d0188d1f28e@codeaurora.org>

Hello Sasha,

Were you able to reproduce this issue?

Do you have a patch fixing the close function inconsistencies that you 
mentioned which I could try out?

Thanks,
Tyler

On 11/21/2016 1:40 PM, Baicar, Tyler wrote:
> On 11/17/2016 6:31 AM, Neftin, Sasha wrote:
>> On 11/13/2016 10:34 AM, Neftin, Sasha wrote:
>>> On 11/11/2016 12:35 AM, Baicar, Tyler wrote:
>>>> Hello Sasha,
>>>>
>>>> On 11/9/2016 11:19 PM, Neftin, Sasha wrote:
>>>>> On 11/9/2016 11:41 PM, Tyler Baicar wrote:
>>>>>> Move IRQ free code so that it will happen regardless of the
>>>>>> __E1000_DOWN bit. Currently the e1000e driver only releases its IRQ
>>>>>> if the __E1000_DOWN bit is cleared. This is not sufficient because
>>>>>> it is possible for __E1000_DOWN to be set without releasing the IRQ.
>>>>>> In such a situation, we will hit a kernel bug later in e1000_remove
>>>>>> because the IRQ still has action since it was never freed. A
>>>>>> secondary bus reset can cause this case to happen.
>>>>>>
>>>>>> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
>>>>>> ---
>>>>>>    drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++-
>>>>>>    1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
>>>>>> b/drivers/net/ethernet/intel/e1000e/netdev.c
>>>>>> index 7017281..36cfcb0 100644
>>>>>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>>>>>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>>>>>> @@ -4679,12 +4679,13 @@ int e1000e_close(struct net_device *netdev)
>>>>>>          if (!test_bit(__E1000_DOWN, &adapter->state)) {
>>>>>>            e1000e_down(adapter, true);
>>>>>> -        e1000_free_irq(adapter);
>>>>>>              /* Link status message must follow this format */
>>>>>>            pr_info("%s NIC Link is Down\n", adapter->netdev->name);
>>>>>>        }
>>>>>>    +    e1000_free_irq(adapter);
>>>>>> +
>>>>>>        napi_disable(&adapter->napi);
>>>>>>          e1000e_free_tx_resources(adapter->tx_ring);
>>>>>>
>>>>> I would like not recommend insert this change. This change related
>>>>> driver state machine, we afraid from lot of synchronization 
>>>>> problem and
>>>>> issues.
>>>>> We need keep e1000_free_irq in loop and check for 'test_bit' ready.
>>>> What do you mean here? There is no loop. If __E1000_DOWN is set 
>>>> then we
>>>> will never free the IRQ.
>>>>
>>>>> Another point, does before execute secondary bus reset your SW 
>>>>> back up
>>>>> pcie configuration space as properly?
>>>> After a secondary bus reset, the link needs to recover and go back 
>>>> to a
>>>> working state after 1 second.
>>>>
>>>>  From the callstack, the issue is happening while removing the 
>>>> endpoint
>>>> from the system, before applying the secondary bus reset.
>>>>
>>>> The order of events is
>>>> 1. remove the drivers
>>>> 2. cause a secondary bus reset
>>>> 3. wait 1 second
>>> Actually, this is too much, usually link up in less than 100ms.You can
>>> check Data Link Layer indication.
>>>> 4. recover the link
>>>>
>>>> callstack:
>>>> free_msi_irqs+0x6c/0x1a8
>>>> pci_disable_msi+0xb0/0x148
>>>> e1000e_reset_interrupt_capability+0x60/0x78
>>>> e1000_remove+0xc8/0x180
>>>> pci_device_remove+0x48/0x118
>>>> __device_release_driver+0x80/0x108
>>>> device_release_driver+0x2c/0x40
>>>> pci_stop_bus_device+0xa0/0xb0
>>>> pci_stop_bus_device+0x3c/0xb0
>>>> pci_stop_root_bus+0x54/0x80
>>>> acpi_pci_root_remove+0x28/0x64
>>>> acpi_bus_trim+0x6c/0xa4
>>>> acpi_device_hotplug+0x19c/0x3f4
>>>> acpi_hotplug_work_fn+0x28/0x3c
>>>> process_one_work+0x150/0x460
>>>> worker_thread+0x50/0x4b8
>>>> kthread+0xd4/0xe8
>>>> ret_from_fork+0x10/0x50
>>>>
>>>> Thanks,
>>>> Tyler
>>>>
>>> Hello Tyler,
>>> Okay, we need consult more about this suggestion.
>>> May I ask what is setup you run? Is there NIC or on board LAN? I would
>>> like try reproduce this issue in our lab's too.
>>> Also, is same issue observed with same scenario and others NIC's too?
>>> Sasha
>>> _______________________________________________
>>> Intel-wired-lan mailing list
>>> Intel-wired-lan@lists.osuosl.org
>>> http://lists.osuosl.org/mailman/listinfo/intel-wired-lan
>>>
>> Hello Tyler,
>> I see some in consistent implementation of __*_close methods in our
>> drivers. Do you have any igb NIC to check if same problem persist there?
>> Thanks,
>> Sasha
> Hello Sasha,
>
> I couldn't find an igb NIC to test with, but I did find another e1000e 
> card that does not cause the same issue. That card is:
>
> 0004:01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit 
> Network Connection
>         Subsystem: Intel Corporation Gigabit CT Desktop Adapter
>         Physical Slot: 5-1
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- 
> ParErr- Stepping- SERR+ FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0, Cache Line Size: 64 bytes
>         Interrupt: pin A routed to IRQ 299
>         Region 0: Memory at c01001c0000 (32-bit, non-prefetchable) 
> [size=128K]
>         Region 1: Memory at c0100100000 (32-bit, non-prefetchable) 
> [size=512K]
>         Region 2: I/O ports at 1000 [size=32]
>         Region 3: Memory at c01001e0000 (32-bit, non-prefetchable) 
> [size=16K]
>         Expansion ROM at c0100180000 [disabled] [size=256K]
>         Capabilities: [c8] Power Management version 2
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>         Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>                 Address: 00000000397f0040  Data: 0000
>         Capabilities: [e0] Express (v1) Endpoint, MSI 00
>                 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 
> <512ns, L1 <64us
>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>                 DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ 
> Unsupported+
>                         RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>                         MaxPayload 256 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- 
> AuxPwr+ TransPend-
>                 LnkCap: Port #8, Speed 2.5GT/s, Width x1, ASPM L0s L1, 
> Exit Latency L0s <128ns, L1 <64us
>                         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
>                 LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- 
> SlotClk+ DLActive- BWMgmt- ABWMgmt-
>         Capabilities: [a0] MSI-X: Enable- Count=5 Masked-
>                 Vector table: BAR=3 offset=00000000
>                 PBA: BAR=3 offset=00002000
>         Capabilities: [100 v1] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- 
> UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- 
> UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- 
> UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
> NonFatalErr-
>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
> NonFatalErr+
>                 AERCap: First Error Pointer: 00, GenCap- CGenEn- 
> ChkCap- ChkEn-
>         Capabilities: [140 v1] Device Serial Number 
> 68-05-ca-ff-ff-29-47-34
>         Kernel driver in use: e1000e
>
> Here are the kernel logs from first running the test on this card and 
> then running the test on the Intel 82571EB card that I originally 
> found the issue with (you can see the issue doesn't happen on this 
> card but does on the other):
>
> [   44.146749] ACPI: \_SB_.PCI0: Device has suffered a power fault
> [   44.155238] pcieport 0000:00:00.0: PCIe Bus Error: 
> severity=Uncorrected (Non-Fatal), type=Transaction Layer, 
> id=0000(Requester ID)
> [   44.166111] pcieport 0000:00:00.0:   device [17cb:0400] error 
> status/mask=00004000/00400000
> [   44.174420] pcieport 0000:00:00.0:    [14] Completion Timeout (First)
> [   44.401943] e1000e 0000:01:00.0 eth0: Timesync Tx Control register 
> not set as expected
> [   82.445586] pcieport 0002:00:00.0: PCIe Bus Error: 
> severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
> [   82.454851] pcieport 0002:00:00.0:   device [17cb:0400] error 
> status/mask=00000001/00006000
> [   82.463209] pcieport 0002:00:00.0:    [ 0] Receiver Error
> [   82.469355] pcieport 0002:00:00.0: PCIe Bus Error: 
> severity=Uncorrected (Non-Fatal), type=Transaction Layer, 
> id=0000(Requester ID)
> [   82.481026] pcieport 0002:00:00.0:   device [17cb:0400] error 
> status/mask=00004000/00400000
> [   82.489343] pcieport 0002:00:00.0:    [14] Completion Timeout (First)
> [   82.504573] ACPI: \_SB_.PCI2: Device has suffered a power fault
> [   84.528036] kernel BUG at drivers/pci/msi.c:369!
>
> I'm not sure why it reproduces on the 82571EB card and not the 82574L 
> card. The only obvious difference is there is no Reciever Error on the 
> 82574L card.
>
> If you have a patch fixing the inconsistencies you mentioned with the 
> __*_close methods I would certainly be willing to test it out!
>
> Thanks,
> Tyler
>

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply

* Re: [flamebait] xdp, well meaning but pointless
From: Tom Herbert @ 2016-12-02 16:59 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, Thomas Graf, Florian Westphal,
	Linux Kernel Network Developers
In-Reply-To: <163220cc-a91b-5627-ea93-cd43dc0079c6@stressinduktion.org>

On Fri, Dec 2, 2016 at 3:54 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:
>> On Thu, 1 Dec 2016 13:51:32 -0800
>> Tom Herbert <tom@herbertland.com> wrote:
>>
>>>>> The technical plenary at last IETF on Seoul a couple of weeks ago was
>>>>> exclusively focussed on DDOS in light of the recent attack against
>>>>> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
>>>>> presentation by Nick Sullivan
>>>>> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
>>>>> alluded to some implementation of DDOS mitigation. In particular, on
>>>>> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>>
>> slide 14
>>
>>>>> numbers he gave we're based in iptables+BPF and that was a whole
>>>>> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
>>>>> and that's also when I introduced XDP to whole IETF :-) ). If that's
>>>>> the best we can do the Internet is in a world hurt. DDOS mitigation
>>>>> alone is probably a sufficient motivation to look at XDP. We need
>>>>> something that drops bad packets as quickly as possible when under
>>>>> attack, we need this to be integrated into the stack, we need it to be
>>>>> programmable to deal with the increasing savvy of attackers, and we
>>>>> don't want to be forced to be dependent on HW solutions. This is why
>>>>> we created XDP!
>>
>> The 1.2Mpps number is a bit low, but we are unfortunately in that
>> ballpark.
>>
>>>> I totally understand that. But in my reply to David in this thread I
>>>> mentioned DNS apex processing as being problematic which is actually
>>>> being referred in your linked slide deck on page 9 ("What do floods look
>>>> like") and the problematic of parsing DNS packets in XDP due to string
>>>> processing and looping inside eBPF.
>>
>> That is a weak argument. You do realize CloudFlare actually use eBPF to
>> do this exact filtering, and (so-far) eBPF for parsing DNS have been
>> sufficient for them.
>
> You are talking about this code on the following slides (I actually
> transcribed it for you here and disassembled):
>
> l0:     ld #0x14
> l1:     ldxb 4*([0]&0xf)
> l2:     add x
> l3:     tax
> l4:     ld [x+0]
> l5:     jeq #0x7657861, l6, l13
> l6:     ld [x+4]
> l7:     jeq #0x6d706c65, l8, l13
> l8:     ld [x+8]
> l9:     jeq #0x3636f6d, l10, l13
> l10:    ldb [x+12]
> l11:    jeq #0, l12, l13
> l12:    ret #0x1
> l13:    ret #0
>
> You can offload this to u32 in hardware if that is what you want.
>
> The reason this works is because of netfilter, which allows them to
> dynamically generate BPF programs and insert and delete them from
> chains, do intersection or unions of them.
>
> If you have a freestanding program like in XDP the complexity space is a
> different one and not comparable to this at all.
>
I don't understand this comment about complexity especially in regards
to the idea of offloading u32 to hardware. Relying on hardware to do
anything always leads to more complexity than an equivalent SW
implementation for the same functionality. The only reason we ever use
a hardware mechanisms is if it gives *significantly* better
performance. If the performance difference isn't there then doing
things in SW is going to be the better path (as we see in XDP).

Tom

> Bye,
> Hannes
>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox