Netdev List


* Re: Question on rhashtable in worst-case scenario.
From: Ben Greear @ 2016-04-01 18:17 UTC (permalink / raw)
  To: Herbert Xu, Johannes Berg
  Cc: David Miller, linux-kernel, linux-wireless, netdev, tgraf
In-Reply-To: <20160401004627.GA9367@gondor.apana.org.au>

On 03/31/2016 05:46 PM, Herbert Xu wrote:
> On Thu, Mar 31, 2016 at 05:29:59PM +0200, Johannes Berg wrote:
>>
>> Does removing this completely disable the "-EEXIST" error? I can't say
>> I fully understand the elasticity stuff in __rhashtable_insert_fast().
>
> What EEXIST error are you talking about? The only one that can be
> returned on insertion is if you're explicitly checking for dups
> which clearly can't be the case for you.
>
> If you're talking about the EEXIST error due to a rehash then it is
> completely hidden from you by rhashtable_insert_rehash.
>
> If you actually meant EBUSY then yes this should prevent it from
> occurring, unless your chain-length exceeds 2^32.

EEXIST was on removal, and was a symptom of the failure to insert, not
really a problem itself.

I reverted my revert (i.e., back to rhashtable), added Johannes' patch
to check insertion (and added my own pr_err there).

I also added this:

diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
index 38ef0be..c25b945 100644
--- a/net/mac80211/sta_info.c
+++ b/net/mac80211/sta_info.c
@@ -66,6 +66,7 @@

  static const struct rhashtable_params sta_rht_params = {
         .nelem_hint = 3, /* start small */
+       .insecure_elasticity = true, /* Disable chain-length checks. */
         .automatic_shrinking = true,
         .head_offset = offsetof(struct sta_info, hash_node),
         .key_offset = offsetof(struct sta_info, addr),

Now, my test case seems to pass, though I did have one strange issue
before I put in the pr_err.  I'm not sure if it was a hashtable issue
or something else..but I have lots of something-else going on in this system,
so I'd say that likely the patch above fixes rhashtable for my use case.

I will of course let you know if I run into more issues that appear
to be hashtable related!

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply related

* [PATCH] Marvell phy: add fiber status check for some components
From: Charles-Antoine Couret @ 2016-04-01 16:01 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 0 bytes --]



[-- Attachment #2: marvell.patch --]
[-- Type: text/x-patch, Size: 2543 bytes --]

>From a5a7a9828511ff6a522cf742109768207ff89929 Mon Sep 17 00:00:00 2001
From: Charles-Antoine Couret <charles-antoine.couret@nexvision.fr>
Date: Fri, 1 Apr 2016 16:16:35 +0200
Subject: [PATCH] Marvell phy: add fiber status check for some components

This patch has not been tested with all Marvell PHYs. The new function was
activated only for the PHYs that were tested.

Signed-off-by: Charles-Antoine Couret <charles-antoine.couret@nexvision.fr>
---
 drivers/net/phy/marvell.c | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index ab1d0fc..5ac186e 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -890,6 +890,39 @@ static int marvell_read_status(struct phy_device *phydev)
 	return 0;
 }
 
+/* marvell_read_fiber_status
+ *
+ * Some Marvell PHYs have two modes: fiber and copper.
+ * Both need their status checked.
+ * Description:
+ *   First, check the fiber link and status.
+ *   If the fiber link is down, check the copper link and status, which
+ *   will be the default value if both links are down.
+ */
+static int marvell_read_fiber_status(struct phy_device *phydev)
+{
+	int err;
+
+	/* Check the fiber mode first */
+	err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M1111_FIBER);
+	if (err < 0)
+		return err;
+
+	err = marvell_read_status(phydev);
+	if (err < 0)
+		return err;
+
+	if (phydev->link)
+		return 0;
+
+	/* If fiber link is down, check and save copper mode state */
+	err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M1111_COPPER);
+	if (err < 0)
+		return err;
+
+	return marvell_read_status(phydev);
+}
+
 static int marvell_aneg_done(struct phy_device *phydev)
 {
 	int retval = phy_read(phydev, MII_M1011_PHY_STATUS);
@@ -1122,7 +1155,7 @@ static struct phy_driver marvell_drivers[] = {
 		.probe = marvell_probe,
 		.config_init = &m88e1111_config_init,
 		.config_aneg = &marvell_config_aneg,
-		.read_status = &marvell_read_status,
+		.read_status = &marvell_read_fiber_status,
 		.ack_interrupt = &marvell_ack_interrupt,
 		.config_intr = &marvell_config_intr,
 		.resume = &genphy_resume,
@@ -1270,7 +1303,7 @@ static struct phy_driver marvell_drivers[] = {
 		.probe = marvell_probe,
 		.config_init = &marvell_config_init,
 		.config_aneg = &m88e1510_config_aneg,
-		.read_status = &marvell_read_status,
+		.read_status = &marvell_read_fiber_status,
 		.ack_interrupt = &marvell_ack_interrupt,
 		.config_intr = &marvell_config_intr,
 		.did_interrupt = &m88e1121_did_interrupt,
-- 
2.5.5
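
For illustration, the fiber-first, copper-fallback order used by
marvell_read_fiber_status() can be modelled in userspace. This is only a
sketch under stated assumptions: struct phy_model, the PAGE_* values and the
model_* helpers are hypothetical stand-ins, not the kernel PHY API, and the
per-page link state is simply stored in an array instead of being read over
MDIO.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page selectors; the real values live in marvell.c. */
enum phy_page_model { PAGE_COPPER = 0, PAGE_FIBER = 1 };

/* Hypothetical PHY: one link flag per page plus the reported link. */
struct phy_model {
	enum phy_page_model page;	/* currently selected page */
	bool link_on_page[2];		/* link state seen on each page */
	bool link;			/* what read_status reported last */
};

/* Stand-in for marvell_read_status(): report the selected page's link. */
static int model_read_status(struct phy_model *phy)
{
	phy->link = phy->link_on_page[phy->page];
	return 0;
}

/* Same control flow as the patch: fiber first, copper only if fiber is down. */
static int model_read_fiber_status(struct phy_model *phy)
{
	int err;

	assert(phy);

	phy->page = PAGE_FIBER;
	err = model_read_status(phy);
	if (err < 0)
		return err;

	if (phy->link)
		return 0;

	phy->page = PAGE_COPPER;
	return model_read_status(phy);
}
```

In this model, as in the patch, the page selection is left on fiber whenever
the fiber link is up, and the copper result is what gets reported when both
links are down.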

^ permalink raw reply related

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: David Miller @ 2016-04-01 18:33 UTC (permalink / raw)
  To: daniel
  Cc: hannes, alexei.starovoitov, eric.dumazet, netdev, sasha.levin,
	mkubecek
In-Reply-To: <56FE2CE3.9080805@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Fri, 01 Apr 2016 10:10:11 +0200

> Dave, do you need me to resubmit this one w/o changes:
> http://patchwork.ozlabs.org/patch/603903/ ?

I'll apply this and queue it up for -stable, thanks.

^ permalink raw reply

* Re: [RFC PATCH 6/6] ppc: ebpf/jit: Implement JIT compiler for extended BPF
From: Daniel Borkmann @ 2016-04-01 18:34 UTC (permalink / raw)
  To: Alexei Starovoitov, Naveen N. Rao, linux-kernel, linuxppc-dev
  Cc: oss, Matt Evans, Michael Ellerman, Paul Mackerras,
	David S. Miller, Ananth N Mavinakayanahalli, netdev
In-Reply-To: <56FEB9AD.2080401@fb.com>

On 04/01/2016 08:10 PM, Alexei Starovoitov wrote:
> On 4/1/16 2:58 AM, Naveen N. Rao wrote:
>> PPC64 eBPF JIT compiler. Works for both ABIv1 and ABIv2.
>>
>> Enable with:
>> echo 1 > /proc/sys/net/core/bpf_jit_enable
>> or
>> echo 2 > /proc/sys/net/core/bpf_jit_enable
>>
>> ... to see the generated JIT code. This can further be processed with
>> tools/net/bpf_jit_disasm.
>>
>> With CONFIG_TEST_BPF=m and 'modprobe test_bpf':
>> test_bpf: Summary: 291 PASSED, 0 FAILED, [234/283 JIT'ed]
>>
>> ... on both ppc64 BE and LE.
>>
>> The details of the approach are documented through various comments in
>> the code, as are the TODOs. Some of the prominent TODOs include
>> implementing BPF tail calls and skb loads.
>>
>> Cc: Matt Evans <matt@ozlabs.org>
>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>> Cc: Paul Mackerras <paulus@samba.org>
>> Cc: Alexei Starovoitov <ast@fb.com>
>> Cc: "David S. Miller" <davem@davemloft.net>
>> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
>> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
>> ---
>>   arch/powerpc/include/asm/ppc-opcode.h |  19 +-
>>   arch/powerpc/net/Makefile             |   4 +
>>   arch/powerpc/net/bpf_jit.h            |  66 ++-
>>   arch/powerpc/net/bpf_jit64.h          |  58 +++
>>   arch/powerpc/net/bpf_jit_comp64.c     | 828 ++++++++++++++++++++++++++++++++++
>>   5 files changed, 973 insertions(+), 2 deletions(-)
>>   create mode 100644 arch/powerpc/net/bpf_jit64.h
>>   create mode 100644 arch/powerpc/net/bpf_jit_comp64.c
> ...
>> -#ifdef CONFIG_PPC64
>> +#if defined(CONFIG_PPC64) && (!defined(_CALL_ELF) || _CALL_ELF != 2)
>
> impressive stuff!

+1, awesome to see another one!

> Everything nicely documented. Could you add a few words for the above
> condition as well ?
> Or maybe a new macro, since it occurs many times?
> What do these _CALL_ELF == 2 and != 2 conditions mean? ppc ABIs ?
> Will there ever be v3 ?

Minor TODO would also be to convert to use bpf_jit_binary_alloc() and
bpf_jit_binary_free() API for the image, which is done by other eBPF
jits, too.

> So far most of the bpf jits were going via net-next tree, but if
> in this case no changes to the core are necessary then I guess it's fine
> to do it via powerpc tree. What's your plan?
>

^ permalink raw reply

* Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Daniel Borkmann @ 2016-04-01 18:36 UTC (permalink / raw)
  To: David Miller
  Cc: hannes, alexei.starovoitov, eric.dumazet, netdev, sasha.levin,
	mkubecek
In-Reply-To: <20160401.143358.1963393066215816255.davem@davemloft.net>

On 04/01/2016 08:33 PM, David Miller wrote:
> From: Daniel Borkmann <daniel@iogearbox.net>
> Date: Fri, 01 Apr 2016 10:10:11 +0200
>
>> Dave, do you need me to resubmit this one w/o changes:
>> http://patchwork.ozlabs.org/patch/603903/ ?
>
> I'll apply this and queue it up for -stable, thanks.

Ok, thanks!

^ permalink raw reply

* Re: [PATCH] net: mvpp2: fix maybe-uninitialized warning
From: David Miller @ 2016-04-01 18:37 UTC (permalink / raw)
  To: jszhang; +Cc: mw, netdev, linux-kernel, linux-arm-kernel
In-Reply-To: <1459414883-6295-1-git-send-email-jszhang@marvell.com>

From: Jisheng Zhang <jszhang@marvell.com>
Date: Thu, 31 Mar 2016 17:01:23 +0800

> This is to fix the following maybe-uninitialized warning:
> 
> drivers/net/ethernet/marvell/mvpp2.c:6007:18: warning: 'err' may be
> used uninitialized in this function [-Wmaybe-uninitialized]
> 
> Signed-off-by: Jisheng Zhang <jszhang@marvell.com>

Applied.

^ permalink raw reply

* Re: [PATCH (net.git) 0/3] stmmac MDIO and normal descr fixes
From: David Miller @ 2016-04-01 18:39 UTC (permalink / raw)
  To: peppe.cavallaro
  Cc: netdev, gabriel.fernandez, afaerber, fschaefer.oss, dinh.linux,
	preid, rhgadsdon, linux-kernel
In-Reply-To: <1459494436-27386-1-git-send-email-peppe.cavallaro@st.com>

From: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Date: Fri, 1 Apr 2016 09:07:13 +0200

> This patch series is to fix the problems below and recently debugged
> in this mailing list:
> 
> o to fix a problem for the HW where the normal descriptor
> o to fix the mdio registration according to the different
>   platform configurations
> 
> I am resending all the patches again: built on top of net.git repo.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH] bridge: remove br_dev_set_multicast_list
From: David Miller @ 2016-04-01 18:43 UTC (permalink / raw)
  To: roy.qing.li; +Cc: netdev
In-Reply-To: <1459498570-30664-1-git-send-email-roy.qing.li@gmail.com>

From: roy.qing.li@gmail.com
Date: Fri,  1 Apr 2016 16:16:10 +0800

> From: Li RongQing <roy.qing.li@gmail.com>
> 
> remove br_dev_set_multicast_list which does nothing
> 
> Signed-off-by: Li RongQing <roy.qing.li@gmail.com>

This will break SIOCADDMULTI et al. on the bridge, see net/core/dev.c
which checks whether this ndo OP is NULL or not.

Please sufficiently grep the source tree on how an ndo operation is
used before making changes like this.

Thanks.

^ permalink raw reply

* Re: [PATCH] net: mvpp2: use cache_line_size() to get cacheline size
From: David Miller @ 2016-04-01 18:43 UTC (permalink / raw)
  To: jszhang; +Cc: mw, netdev, linux-kernel, linux-arm-kernel, thomas.petazzoni
In-Reply-To: <1459501865-7033-1-git-send-email-jszhang@marvell.com>

From: Jisheng Zhang <jszhang@marvell.com>
Date: Fri, 1 Apr 2016 17:11:05 +0800

> L1_CACHE_BYTES may not be the real cacheline size, use cache_line_size
> to determine the cacheline size in runtime.
> 
> Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
> Suggested-by: Marcin Wojtas <mw@semihalf.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net: mvneta: use cache_line_size() to get cacheline size
From: David Miller @ 2016-04-01 18:44 UTC (permalink / raw)
  To: jszhang; +Cc: mw, thomas.petazzoni, netdev, linux-kernel, linux-arm-kernel
In-Reply-To: <1459501969-7083-1-git-send-email-jszhang@marvell.com>

From: Jisheng Zhang <jszhang@marvell.com>
Date: Fri, 1 Apr 2016 17:12:49 +0800

> L1_CACHE_BYTES may not be the real cacheline size, use cache_line_size
> to determine the cacheline size in runtime.
> 
> Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
> Suggested-by: Marcin Wojtas <mw@semihalf.com>

Applied.

^ permalink raw reply

* Re: [PATCH 1/2] ipv6: rework the lock in addrconf_permanent_addr
From: David Miller @ 2016-04-01 18:44 UTC (permalink / raw)
  To: roy.qing.li; +Cc: netdev
In-Reply-To: <1459502818-24939-1-git-send-email-roy.qing.li@gmail.com>

From: roy.qing.li@gmail.com
Date: Fri,  1 Apr 2016 17:26:58 +0800

> From: Li RongQing <roy.qing.li@gmail.com>
> 
> 1. nothing of idev is changed, so read lock is enough
> 2. ifp is changed, so use ifp->lock or cmpxchg to protect it
> 
> Signed-off-by: Li RongQing <roy.qing.li@gmail.com>

You posted this patch twice and didn't post patch 2/2.

I'm tossing this from patchwork, please resubmit this
properly.

^ permalink raw reply

* Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Eric Dumazet @ 2016-04-01 18:49 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem
In-Reply-To: <20160401180531.13882.44793.stgit@localhost.localdomain>

On Fri, 2016-04-01 at 11:05 -0700, Alexander Duyck wrote:
> RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other
> than fragmentation and reassembly.  Currently we are looking at this field
> as a way of identifying what frames can be aggregated and  which cannot for
> GRO.  While this is valid for frames that do not have DF set, it is invalid
> to do so if the bit is set.
> 
> In addition we were generating IPv4 ID collisions when 2 or more flows were
> interleaved over the same tunnel.  To prevent that we store the result of
> all IP ID checks via a "|=" instead of overwriting previous values.

Note that for atomic datagrams (DF = 1), since fragmentation and
reassembly can not occur, maybe some people use ID field for other
purposes.

For example, TCP stack tracks per socket ID generation, even if it sends
DF=1 frames. Damn useful for tcpdump analysis and drop inference.

With your change, the resulting GRO packet would propagate the ID of
first frag. Most GSO/GSO engines will then provide a ID sequence, which
might not match original packets.

I do not particularly care, but it is worth mentioning that GRO+TSO
would not be idempotent anymore.

^ permalink raw reply

* Re: [PATCH net] vlan: pull on __vlan_insert_tag error path and fix csum correction
From: David Miller @ 2016-04-01 19:00 UTC (permalink / raw)
  To: daniel; +Cc: jiri, alexei.starovoitov, jesse, tom, netdev
In-Reply-To: <244f8d5684800cc98545932aa6851bf73f7326e2.1459503053.git.daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Fri,  1 Apr 2016 11:41:03 +0200

> Moreover, I noticed that when in the non-error path the __skb_pull()
> is done and the original offset to mac header was non-zero, we fixup
> from a wrong skb->data offset in the checksum complete processing.
> 
> So the skb_postpush_rcsum() really needs to be done before __skb_pull()
> where skb->data still points to the mac header start.

Ugh, what a mess, are you sure any of this is right even after your
change?  What happens (outside of the csum part) is this:

	__skb_push(offset);
	__vlan_insert_tag(); {
		skb_push(VLAN_HLEN);
	...
		memmove(skb->data, skb->data + VLAN_HLEN, 2 * ETH_ALEN);
	}
	__skb_pull(offset);

If I understand this correctly, the last pull will therefore put
skb->data pointing at vlan_ethhdr->h_vlan_TCI of the new VLAN header
pushed by __vlan_insert_tag().

That is assuming skb->data began right after the original ethernet
header.

To me, that postpull csum currently is absolutely in the correct spot,
because it's acting upon the pull done by __vlan_insert_tag(), not the
one done here by skb_vlan_push().

Right?

Can you tell me how you tested this?  Just curious...
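
The pointer arithmetic in the sequence above can be checked with a small
userspace model. This is only a sketch: it tracks skb->data as a byte offset
into a flat buffer, assumes skb->data starts right after the original
Ethernet header (offset == ETH_HLEN), and model_vlan_push() and
struct vlan_ethhdr_model are made-up names for illustration, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN  6
#define ETH_HLEN  14
#define VLAN_HLEN 4

/* Same field order as the kernel's struct vlan_ethhdr. */
struct vlan_ethhdr_model {
	unsigned char  h_dest[ETH_ALEN];
	unsigned char  h_source[ETH_ALEN];
	unsigned short h_vlan_proto;
	unsigned short h_vlan_TCI;
	unsigned short h_vlan_encapsulated_proto;
};

/*
 * Hypothetical model: 'data' is the offset of skb->data inside 'buf' and
 * 'offset' is the distance from the MAC header to skb->data on entry.
 * Returns the offset of skb->data after the push/insert/pull sequence.
 */
static size_t model_vlan_push(unsigned char *buf, size_t data, size_t offset)
{
	assert(data >= offset + VLAN_HLEN);	/* enough headroom */

	data -= offset;				/* __skb_push(offset) */
	data -= VLAN_HLEN;			/* skb_push(VLAN_HLEN) */
	memmove(buf + data, buf + data + VLAN_HLEN, 2 * ETH_ALEN);
	data += offset;				/* __skb_pull(offset) */

	return data;
}
```

With the MAC header at buffer offset 8 and offset == ETH_HLEN, the returned
position is 14 bytes past the start of the new header, which is
offsetof(struct vlan_ethhdr_model, h_vlan_TCI).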

^ permalink raw reply

* Re: [v7, 4/5] powerpc/fsl: move mpc85xx.h to include/linux/fsl
From: Stephen Boyd @ 2016-04-01 19:03 UTC (permalink / raw)
  To: Yangbo Lu, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ,
	linux-clk-u79uwXL29TY76Z2rM5mHXA,
	linux-i2c-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-mmc-u79uwXL29TY76Z2rM5mHXA
  Cc: ulf.hansson-QSEj5FYQhm4dnm+yROfE0A, Zhao Qiang, Russell King,
	Bhupesh Sharma, Santosh Shilimkar, Jochen Friedrich,
	scott.wood-3arQi8VN3Tc, Rob Herring, Claudiu Manoil, Kumar Gala,
	leoyang.li-3arQi8VN3Tc, xiaobo.xie-3arQi8VN3Tc
In-Reply-To: <1459480051-3701-5-git-send-email-yangbo.lu-3arQi8VN3Tc@public.gmane.org>

On 03/31/2016 08:07 PM, Yangbo Lu wrote:
>  drivers/clk/clk-qoriq.c                                       | 3 +--
>

For clk part:

Acked-by: Stephen Boyd <sboyd-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply

* Re: [PATCH] net: mvneta: fix changing MTU when using per-cpu processing
From: David Miller @ 2016-04-01 19:18 UTC (permalink / raw)
  To: mw
  Cc: linux-kernel, linux-arm-kernel, netdev, linux,
	sebastian.hesselbarth, andrew, jason, thomas.petazzoni,
	gregory.clement, nadavh, alior, nitroshift, jaz
In-Reply-To: <1459516878-2802-1-git-send-email-mw@semihalf.com>

From: Marcin Wojtas <mw@semihalf.com>
Date: Fri,  1 Apr 2016 15:21:18 +0200

> After enabling per-cpu processing it appeared that under heavy load
> changing MTU can result in blocking all port's interrupts and transmitting
> data is not possible after the change.
> 
> This commit fixes above issue by disabling percpu interrupts for the
> time, when TXQs and RXQs are reconfigured.
> 
> Signed-off-by: Marcin Wojtas <mw@semihalf.com>

Applied, thanks.

When I reviewed this I was worried that this was yet another case where
the ndo op could be invoked in a potentially atomic or similar context,
whereby on_each_cpu() would be illegal to use.

But that appears to not be the case, and thus this change is just fine.

Thanks.

^ permalink raw reply

* Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: David Miller @ 2016-04-01 19:24 UTC (permalink / raw)
  To: eric.dumazet
  Cc: aduyck, herbert, tom, jesse, alexander.duyck, edumazet, netdev
In-Reply-To: <1459536543.6473.289.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 01 Apr 2016 11:49:03 -0700

> For example, TCP stack tracks per socket ID generation, even if it
> sends DF=1 frames. Damn useful for tcpdump analysis and drop
> inference.

Thanks for mentioning this, I never considered this use case.

> With your change, the resulting GRO packet would propagate the ID of
> first frag. Most GSO/GSO engines will then provide a ID sequence,
> which might not match original packets.
> 
> I do not particularly care, but it is worth mentioning that GRO+TSO
> would not be idempotent anymore.

Our eventual plan was to start emitting zero in the ID field for
outgoing TCP datagrams with DF set, since the issue that caused us to
generate incrementing IDs in the first place (buggy Microsoft SLHC
compression) we decided is not relevant and important enough to
accommodate any more.

So outside of your TCP behavior analysis case, there isn't a
compelling argument to keeping that code around any more, rather than
just put zero in the ID field.

I suppose we could keep the counter code around and allow it to be
enabled using a sysctl or socket option, but how strongly do you
really feel about this?

^ permalink raw reply

* [RFC v3 -next 0/2] virtio-net: Advised MTU feature
From: Aaron Conole @ 2016-04-01 19:32 UTC (permalink / raw)
  To: netdev, Michael S. Tsirkin, virtualization, linux-kernel,
	Paolo Abeni, Sergei Shtylyov, Pankaj Gupta

The following series adds the ability for a hypervisor to set an MTU on the
guest during feature negotiation phase. This is useful for VM orchestration
when, for instance, tunneling is involved and the MTU of the various systems
should be homogeneous.

The first patch adds the feature bit as described in the proposed VIRTIO spec
addition found at
https://lists.oasis-open.org/archives/virtio-dev/201603/msg00001.html
The second patch adds a user of the bit, and a warning when the guest changes
the MTU from the hypervisor advised MTU. Future patches may add more thorough
error handling.

v2:
* Whitespace and code style cleanups from Sergei Shtylyov and Paolo Abeni
* Additional test before printing a warning

v3:
* Removed the warning when changing MTU (which simplified the code)

Aaron Conole (2):
  virtio: Start feature MTU support
  virtio_net: Read the advised MTU

 drivers/net/virtio_net.c        | 8 ++++++++
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 11 insertions(+)

-- 
2.5.5

^ permalink raw reply

* [RFC v3 -net 1/2] virtio: Start feature MTU support
From: Aaron Conole @ 2016-04-01 19:32 UTC (permalink / raw)
  To: netdev, Michael S. Tsirkin, virtualization, linux-kernel,
	Paolo Abeni, Sergei Shtylyov, Pankaj Gupta
In-Reply-To: <1459539136-13948-1-git-send-email-aconole@redhat.com>

This commit adds the feature bit and associated mtu device entry for the
virtio network device. Future commits will make use of these bits to support
negotiated MTU.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
---
v2,v3:
* No change

 include/uapi/linux/virtio_net.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index ec32293..41a6a01 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -55,6 +55,7 @@
 #define VIRTIO_NET_F_MQ	22	/* Device supports Receive Flow
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
+#define VIRTIO_NET_F_MTU 25	/* Device supports Default MTU Negotiation */
 
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO	6	/* Host handles pkts w/ any GSO type */
@@ -73,6 +74,8 @@ struct virtio_net_config {
 	 * Legal values are between 1 and 0x8000
 	 */
 	__u16 max_virtqueue_pairs;
+	/* Default maximum transmit unit advice */
+	__u16 mtu;
 } __attribute__((packed));
 
 /*
-- 
2.5.5

^ permalink raw reply related

* [RFC v3 -next 2/2] virtio_net: Read the advised MTU
From: Aaron Conole @ 2016-04-01 19:32 UTC (permalink / raw)
  To: netdev, Michael S. Tsirkin, virtualization, linux-kernel,
	Paolo Abeni, Sergei Shtylyov, Pankaj Gupta
In-Reply-To: <1459539136-13948-1-git-send-email-aconole@redhat.com>

This patch checks the feature bit for the VIRTIO_NET_F_MTU feature. If it
exists, read the advised MTU and use it.

No proper error handling is provided for the case where a user changes the
negotiated MTU. A future commit will add proper error handling. Instead, a
warning is emitted if the guest changes the device MTU after previously
being given advice.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
---
v2:
* Whitespace cleanup in the last hunk
* Code style change around the pr_warn
* Additional test for mtu change before printing warning
v3:
* removed the mtu change warning

 drivers/net/virtio_net.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 49d84e5..2308083 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1450,6 +1450,7 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
 
 static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
 {
+	struct virtnet_info *vi = netdev_priv(dev);
 	if (new_mtu < MIN_MTU || new_mtu > MAX_MTU)
 		return -EINVAL;
 	dev->mtu = new_mtu;
@@ -1896,6 +1897,12 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
 		vi->has_cvq = true;
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_MTU)) {
+		dev->mtu = virtio_cread16(vdev,
+					  offsetof(struct virtio_net_config,
+						   mtu));
+	}
+
 	if (vi->any_header_sg)
 		dev->needed_headroom = vi->hdr_len;
 
@@ -2081,6 +2088,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ,
 	VIRTIO_NET_F_CTRL_MAC_ADDR,
 	VIRTIO_F_ANY_LAYOUT,
+	VIRTIO_NET_F_MTU,
 };
 
 static struct virtio_driver virtio_net_driver = {
-- 
2.5.5
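
For illustration, the probe-time logic of this patch (use the advised MTU
only when the feature bit was negotiated) can be modelled in userspace. This
is a sketch under stated assumptions: struct vdev_model and the model_*
helpers are hypothetical stand-ins for the virtio device and for
virtio_has_feature()/virtio_cread16(); only the bit number 25 comes from
patch 1/2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_NET_F_MTU 25	/* bit number taken from patch 1/2 */

/* Hypothetical device: negotiated feature bits plus the config-space MTU. */
struct vdev_model {
	uint64_t features;
	uint16_t cfg_mtu;
};

/* Stand-in for virtio_has_feature(). */
static bool model_has_feature(const struct vdev_model *v, unsigned int bit)
{
	assert(bit < 64);
	return (v->features >> bit) & 1;
}

/* Return the advised MTU if negotiated, otherwise keep the default. */
static unsigned int model_probe_mtu(const struct vdev_model *v,
				    unsigned int default_mtu)
{
	if (model_has_feature(v, VIRTIO_NET_F_MTU))
		return v->cfg_mtu;	/* stands in for virtio_cread16() */

	return default_mtu;
}
```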

^ permalink raw reply related

* Re: [PATCH] RDS: sync congestion map updating
From: santosh shilimkar @ 2016-04-01 19:47 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: Wengang Wang, leon-2ukJVAZIZ/Y, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <56FC927E.9090404-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

(cc-ing netdev)
On 3/30/2016 7:59 PM, Wengang Wang wrote:
>
>
> On 2016-03-31 09:51, Wengang Wang wrote:
>>
>>
>> On 2016-03-31 01:16, santosh shilimkar wrote:
>>> Hi Wengang,
>>>
>>> On 3/30/2016 9:19 AM, Leon Romanovsky wrote:
>>>> On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote:
>>>>> Problem is found that some among a lot of parallel RDS
>>>>> communications hang. In my test ten or so among 33 communications
>>>>> hang. The send requests got -ENOBUF error meaning the peer socket
>>>>> (port) is congested. But meanwhile, peer socket (port) is not
>>>>> congested.
>>>>>
>>>>> The congestion map updating can happen in two paths: one is in
>>>>> rds_recvmsg path and the other is when it receives packets from the
>>>>> hardware. There is no synchronization when updating the congestion
>>>>> map. So a bit operation (clearing) in the rds_recvmsg path can be
>>>>> skipped by another bit operation (setting) in hardware packet
>>>>> receiving path.
>>>>>
>
> To be more detailed.  Here, the two paths (user calls recvmsg and
> hardware receives data) are for different rds socks. thus the
> rds_sock->rs_recv_lock is not helpful to sync the updating on congestion
> map.
>
For the archive, let me try to conclude the thread. I synced
with Wengang off-list and came up with the fix below. I was under
the impression that __set_bit_le() was the atomic version. After fixing
it as in the patch (at the end of this email), the bug is addressed.

I will probably send this as fix for stable as well.


 From 5614b61f6fdcd6ae0c04e50b97efd13201762294 Mon Sep 17 00:00:00 2001
From: Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Date: Wed, 30 Mar 2016 23:26:47 -0700
Subject: [PATCH] RDS: Fix the atomicity for congestion map update

Two different threads with different rds sockets may be in
rds_recv_rcvbuf_delta() via receive path. If their ports
both map to the same word in the congestion map, then
using non-atomic ops to update it could cause the map to
be incorrect. Let's use atomics to avoid such an issue.

Full credit to Wengang <wen.gang.wang-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> for
finding the issue, analysing it and also pointing out
to offending code with spin lock based fix.

Signed-off-by: Wengang Wang <wen.gang.wang-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
  net/rds/cong.c |    4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/cong.c b/net/rds/cong.c
index e6144b8..6641bcf 100644
--- a/net/rds/cong.c
+++ b/net/rds/cong.c
@@ -299,7 +299,7 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port)
  	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
  	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;

-	__set_bit_le(off, (void *)map->m_page_addrs[i]);
+	set_bit_le(off, (void *)map->m_page_addrs[i]);
  }

  void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
@@ -313,7 +313,7 @@ void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
  	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
  	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;

-	__clear_bit_le(off, (void *)map->m_page_addrs[i]);
+	clear_bit_le(off, (void *)map->m_page_addrs[i]);
  }

  static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port)
-- 
1.7.1
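
For illustration, the difference the patch relies on is that an atomic bit
operation is a single read-modify-write on the shared word, so two updaters
touching different bits of the same word cannot lose each other's change.
Below is a minimal userspace sketch using C11 atomics. It is not the
kernel's set_bit_le()/clear_bit_le() implementation: struct cong_page_model
and the cong_* helpers are hypothetical, and the little-endian bit ordering
that the _le variants provide is ignored here.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_WORDS   4
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for one page of the congestion map. */
struct cong_page_model {
	atomic_ulong words[MODEL_WORDS];
};

/* Atomic OR: concurrent updates to other bits in the word are preserved. */
static void cong_set_bit(struct cong_page_model *p, size_t bit)
{
	assert(bit < MODEL_WORDS * BITS_PER_WORD);
	atomic_fetch_or(&p->words[bit / BITS_PER_WORD],
			1UL << (bit % BITS_PER_WORD));
}

/* Atomic AND with the inverted mask clears only the requested bit. */
static void cong_clear_bit(struct cong_page_model *p, size_t bit)
{
	assert(bit < MODEL_WORDS * BITS_PER_WORD);
	atomic_fetch_and(&p->words[bit / BITS_PER_WORD],
			 ~(1UL << (bit % BITS_PER_WORD)));
}

/* Returns 1 if the bit is set, 0 otherwise. */
static int cong_test_bit(struct cong_page_model *p, size_t bit)
{
	assert(bit < MODEL_WORDS * BITS_PER_WORD);
	return (int)((atomic_load(&p->words[bit / BITS_PER_WORD]) >>
		      (bit % BITS_PER_WORD)) & 1UL);
}
```

A plain "word |= mask" in place of atomic_fetch_or() would be a separate
load and store, which is the window in which a concurrent clear of another
bit in the same word can be overwritten.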

^ permalink raw reply related

* Re: [PATCH v3 net-next] net: ipv4: Consider failed nexthops in multipath routes
From: Julian Anastasov @ 2016-04-01 19:51 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev
In-Reply-To: <1459523824-29828-1-git-send-email-dsa@cumulusnetworks.com>


	Hello,

On Fri, 1 Apr 2016, David Ahern wrote:

> v3
> - Julian comments: changed use of dead in documentation to failed,
>   init state to NUD_REACHABLE which simplifies fib_good_nh, use of
>   nh_dev for neighbor lookup, fallback to first entry which is what
>   current logic does
> 
> v2
> - use rcu locking to avoid refcnts per Eric's suggestion
> - only consider neighbor info for nh_scope == RT_SCOPE_LINK per Julian's
>   comment
> - drop the 'state == NUD_REACHABLE' from the state check since it is
>   part of NUD_VALID (comment from Julian)
> - wrapped the use of the neigh in a sysctl
> 
>  Documentation/networking/ip-sysctl.txt | 10 ++++++++++
>  include/net/netns/ipv4.h               |  3 +++
>  net/ipv4/fib_semantics.c               | 32 ++++++++++++++++++++++++++++----
>  net/ipv4/sysctl_net_ipv4.c             | 11 +++++++++++
>  4 files changed, 52 insertions(+), 4 deletions(-)
> 

> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index d97268e8ff10..e08abf96824a 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -1559,21 +1559,45 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>  }
>  
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> +static bool fib_good_nh(const struct fib_nh *nh)
> +{
> +	int state = NUD_REACHABLE;
> +
> +	if (nh->nh_scope == RT_SCOPE_LINK) {
> +		struct neighbour *n = NULL;

	NULL is not needed anymore.

> +
> +		rcu_read_lock_bh();
> +
> +		n = __neigh_lookup_noref(&arp_tbl, &nh->nh_gw, nh->nh_dev);
> +		if (n)
> +			state = n->nud_state;
> +
> +		rcu_read_unlock_bh();
> +	}
> +
> +	return !!(state & NUD_VALID);
> +}
>  
>  void fib_select_multipath(struct fib_result *res, int hash)
>  {
>  	struct fib_info *fi = res->fi;
> +	struct net *net = fi->fib_net;
> +	unsigned char first_nhsel = 0;

	Looking at fib_table_lookup(), res->nh_sel is not 0
in all cases. I personally don't like that we do not
fall back properly, but to make this logic more correct we
can use something like this:

	bool first = false;

>  
>  	for_nexthops(fi) {
>  		if (hash > atomic_read(&nh->nh_upper_bound))
>  			continue;
>  
> -		res->nh_sel = nhsel;
> -		return;
> +		if (!net->ipv4.sysctl_fib_multipath_use_neigh ||
> +		    fib_good_nh(nh)) {
> +			res->nh_sel = nhsel;
> +			return;
> +		}
> +		if (!first_nhsel)
> +			first_nhsel = nhsel;

		if (!first) {
			res->nh_sel = nhsel;
			first = true;
		}

>  	} endfor_nexthops(fi);
>  
> -	/* Race condition: route has just become dead. */
> -	res->nh_sel = 0;
> +	res->nh_sel = first_nhsel;

	And then this is not needed anymore. Even setting
to 0 was not needed because 0 is not better than current
nh_sel when both are DEAD/LINKDOWN.

Regards
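
For illustration, the selection logic suggested above (take the first
matching nexthop that passes the neighbour check, otherwise fall back to the
first matching one, otherwise keep the current nh_sel) can be modelled in
userspace. This is only a sketch: struct nh_model and select_nexthop() are
hypothetical, the 'good' flag stands in for fib_good_nh(), and 'use_neigh'
stands in for the sysctl.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical nexthop: hash bound plus the result of the neighbour check. */
struct nh_model {
	int  upper_bound;	/* stands in for nh->nh_upper_bound */
	bool good;		/* stands in for fib_good_nh(nh) */
};

/*
 * Returns the selected nexthop index. 'cur_sel' is the value res->nh_sel
 * already holds; it is kept when no nexthop matches the hash at all.
 */
static int select_nexthop(const struct nh_model *nhs, size_t n, int hash,
			  bool use_neigh, int cur_sel)
{
	bool first = false;
	int sel = cur_sel;

	assert(nhs || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (hash > nhs[i].upper_bound)
			continue;

		/* Neighbour check disabled or passed: take it right away. */
		if (!use_neigh || nhs[i].good)
			return (int)i;

		/* Remember only the first matching candidate as fallback. */
		if (!first) {
			sel = (int)i;
			first = true;
		}
	}

	return sel;
}
```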

^ permalink raw reply

* [PATCH v2 -next] net/core/dev: Warn on a too-short GRO frame
From: Aaron Conole @ 2016-04-01 19:58 UTC (permalink / raw)
  To: netdev, Joe Perches

From: Aaron Conole <aconole@bytheb.org>

When signaling that a GRO frame is ready to be processed, the network stack
correctly checks length and aborts processing when a frame is less than 14
bytes. However, such a condition is really indicative of a broken driver,
and should be loudly signaled, rather than silently dropped as is the case
today.

Convert the condition to use net_warn_ratelimited() to ensure the stack
loudly complains about such broken drivers.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
---
v2:
* Convert from WARN_ON to net_warn_ratelimited

 net/core/dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index b9bcbe7..1be269e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4663,6 +4663,8 @@ static struct sk_buff *napi_frags_skb(struct napi_struct *napi)
 	if (unlikely(skb_gro_header_hard(skb, hlen))) {
 		eth = skb_gro_header_slow(skb, hlen, 0);
 		if (unlikely(!eth)) {
+			net_warn_ratelimited("%s: dropping impossible skb\n",
+					     __func__);
 			napi_reuse_skb(napi, skb);
 			return NULL;
 		}
-- 
2.5.5
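
For readers unfamiliar with the pattern, the idea behind the patch can be
modeled in user space roughly as below. This is a sketch only, not the
kernel's net_warn_ratelimited() implementation: struct warn_limit,
warn_allowed() and frame_len_ok() are hypothetical names, and the caller
passes the current time explicitly to keep the example deterministic.

```c
#include <stdbool.h>
#include <stdio.h>

#define SK_ETH_HLEN 14	/* Ethernet header length in bytes */

/* Hypothetical limiter: at most `burst` messages per `interval` ticks. */
struct warn_limit {
	long interval;
	int burst;
	long window_start;
	int printed;
};

static bool warn_allowed(struct warn_limit *rl, long now)
{
	/* Start a new window once the current one has expired. */
	if (now - rl->window_start >= rl->interval) {
		rl->window_start = now;
		rl->printed = 0;
	}
	if (rl->printed >= rl->burst)
		return false;
	rl->printed++;
	return true;
}

/* Returns true if the frame is long enough; warns (rate limited) if not. */
static bool frame_len_ok(struct warn_limit *rl, unsigned int len, long now)
{
	if (len >= SK_ETH_HLEN)
		return true;
	if (warn_allowed(rl, now))
		fprintf(stderr, "%s: dropping impossible frame (len %u)\n",
			__func__, len);
	return false;
}
```

The frame is dropped in every short-frame case; only the message is
suppressed once the burst for the current window is used up.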

^ permalink raw reply related

* Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Alexander Duyck @ 2016-04-01 19:58 UTC (permalink / raw)
  To: David Miller
  Cc: Eric Dumazet, Alex Duyck, Herbert Xu, Tom Herbert, Jesse Gross,
	Eric Dumazet, Netdev
In-Reply-To: <20160401.152405.915323132719949585.davem@davemloft.net>

On Fri, Apr 1, 2016 at 12:24 PM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 01 Apr 2016 11:49:03 -0700
>
>> For example, TCP stack tracks per socket ID generation, even if it
>> sends DF=1 frames. Damn useful for tcpdump analysis and drop
>> inference.
>
> Thanks for mentioning this, I never considered this use case.

RFC 6864 is pretty explicit about this: the IPv4 ID is used only for
fragmentation.  https://tools.ietf.org/html/rfc6864#section-4.1

The goal with this change is to try and keep most of the existing
behavior intact without violating this rule.  I would think the
sequence number should give you the ability to infer a drop in the
case of TCP.  In the case of UDP tunnels we are now getting a bit more
data since we were ignoring the outer IP header ID before.

>> With your change, the resulting GRO packet would propagate the ID of
>> first frag. Most GSO/GSO engines will then provide a ID sequence,
>> which might not match original packets.

Right.  But that is only in the case where the IP IDs did not already
increment or were left uninitialized meaning the transmitter was
probably already following RFC 6864 and chose a fixed value.  Odds are
in such a case we end up improving the performance if anything as
there are plenty of legacy systems out there that still require the
IPv4 ID increment in order to get LRO/GRO.

>> I do not particularly care, but it is worth mentioning that GRO+TSO
>> would not be idempotent anymore.

In the patch I mentioned we had already broken that.  I'm basically
just going through and fixing the cases for tunnels where we were
doing the outer header wrong while at the same time relaxing the
requirements for the inner header if DF is set.  I'll probably add
some documentation to the Documentation folder about it as well.  I'm
currently in the process of writing up documentation for GSO and GSO
partial for the upcoming patchset.  I can pretty easily throw in a few
comments about GRO as well.

> Our eventual plan was to start emitting zero in the ID field for
> outgoing TCP datagrams with DF set, since the issue that caused us to
> generate incrementing IDs in the first place (buggy Microsoft SLHC
> compression) we decided is not relevant and important enough to
> accommodate any more.

For the GSO partial stuff I was probably just going to have the IP ID
on the inner headers lurch forward in chunks equal to gso_segs when we
are doing the segmentation.  I didn't want to use a fixed value just
because that would likely make it easy to identify Linux devices acting
as a bump in the wire.  I figure if there are already sources that
weren't updating IP ID for their segmentation offloads then if we just
take that approach odds are we will blend in with the other devices
and be more difficult to single out.

Another reason for doing it this way is that different devices are
going to have different behaviors with GSO partial.  In the case of
the i40e driver it recognizes both inner and outer network headers so
it can increment both correctly.  In the case of igb and ixgbe they
can only support the outer header so the inner IP ID value would be
lurching by gso_size every time we move from one GSO frame to the
next.
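
As a rough illustration of the "lurch forward" behavior described above,
assuming the ID advances by the number of segments in each GSO frame
(next_gso_ip_id() is a hypothetical helper, not an existing kernel
function):

```c
#include <stdint.h>

/*
 * Hypothetical helper: the IP ID the next GSO frame would start with if
 * the ID advances by one per segment of the current frame. The IPv4 ID
 * field is 16 bits wide, so the value wraps around.
 */
static uint16_t next_gso_ip_id(uint16_t id, uint16_t gso_segs)
{
	return (uint16_t)(id + gso_segs);
}
```

A device that cannot update the inner header would then show the same
inner ID on every segment of one frame and a jump at the frame boundary.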

> So outside of your TCP behavior analysis case, there isn't a
> compelling argument to keeping that code around any more, rather than
> just put zero in the ID field.
>
> I suppose we could keep the counter code around and allow it to be
> enabled using a sysctl or socket option, but how strongly do you
> really feel about this?

I'm not suggesting we drop the counter code for transmit.  What RFC
6864 says is "Originating sources MAY set the IPv4 ID field of atomic
datagrams to any value."

For transmit we can leave the IP ID code as is.  For receive we should
not be snooping into the IP ID for any frames that have the DF bit set
as devices that have adopted RFC 6864 on their transmit path will end
up causing issues.
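
To make the receive-side rule concrete, here is a small sketch. RFC 6864
defines an atomic datagram as one with DF set, MF clear and a fragment
offset of zero. The mask names, ipv4_is_atomic() and
ip_id_allows_merge() are hypothetical and are not the code from the
patch; the second function only illustrates "do not look at the ID when
the datagram is atomic, otherwise expect it to increment".

```c
#include <stdbool.h>
#include <stdint.h>

/* Masks for the IPv4 flags/fragment-offset word, in host byte order. */
#define SK_IP_DF	0x4000	/* Don't Fragment */
#define SK_IP_MF	0x2000	/* More Fragments */
#define SK_IP_OFFSET	0x1FFF	/* fragment offset */

/* RFC 6864 atomic datagram: DF set, MF clear, fragment offset zero. */
static bool ipv4_is_atomic(uint16_t frag_off)
{
	return (frag_off & SK_IP_DF) &&
	       !(frag_off & (SK_IP_MF | SK_IP_OFFSET));
}

/*
 * Hypothetical check: does the IP ID permit merging a segment with the
 * previous one? For atomic datagrams the ID is not inspected at all;
 * otherwise it is expected to increment by one (with 16-bit wrap).
 */
static bool ip_id_allows_merge(uint16_t prev_frag_off, uint16_t prev_id,
			       uint16_t frag_off, uint16_t id)
{
	if (ipv4_is_atomic(prev_frag_off) && ipv4_is_atomic(frag_off))
		return true;

	return id == (uint16_t)(prev_id + 1);
}
```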

- Alex

^ permalink raw reply

* Re: [Odd commit author id merge via netdev]
From: Johannes Berg @ 2016-04-01 20:01 UTC (permalink / raw)
  To: santosh shilimkar, netdev, David S. Miller
In-Reply-To: <56FEB50E.4060004@oracle.com>

On Fri, 2016-04-01 at 10:51 -0700, santosh shilimkar wrote:
> Hi Dave,
> 
> I noticed something odd while checking the recent
> commits of mine in kernel.org tree made it via netdev.
> 
> Don't know if it's the patchwork tool doing this.
> Usual author line in my git objects:
> 	Author: Santosh Shilimkar <email-id>
> 
> But the commits going via your tree seems to be like below..
> 	Author: email-id <email-id>
> 
> Few more examples of the commits end of the email. Can this
> be fixed for future commits ? The git objects you pulled from
> my tree directly have right author format where as ones which
> are picked from patchworks seems to be odd.
> 

Patchwork does store this info somehow and re-use it, quite possibly
from the very first patch you ever sent. I think this bug was *just*
fixed in patchwork, but it'll probably be a while until that fix lands.

However, you can go and create a patchwork account with the real name,
associate it with all the email addresses you use and then I think
it'll pick it up. Not entirely sure though, you'll have to test it.

johannes

^ permalink raw reply

* [PATCH] ip6_tunnel: set rtnl_link_ops before calling register_netdevice
From: Thadeu Lima de Souza Cascardo @ 2016-04-01 20:17 UTC (permalink / raw)
  To: netdev

When creating an ip6tnl tunnel with ip tunnel, rtnl_link_ops is not set
before ip6_tnl_create2 is called. When register_netdevice is called, there
is no linkinfo attribute in the NEWLINK message because of that.

Setting rtnl_link_ops before calling register_netdevice fixes that.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
---
 net/ipv6/ip6_tunnel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index eb2ac4b..1f20345 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -252,12 +252,12 @@ static int ip6_tnl_create2(struct net_device *dev)
 
 	t = netdev_priv(dev);
 
+	dev->rtnl_link_ops = &ip6_link_ops;
 	err = register_netdevice(dev);
 	if (err < 0)
 		goto out;
 
 	strcpy(t->parms.name, dev->name);
-	dev->rtnl_link_ops = &ip6_link_ops;
 
 	dev_hold(dev);
 	ip6_tnl_link(ip6n, t);
-- 
2.5.0

^ permalink raw reply related
