Netdev List

* Re: [PATCH net-next v2 2/4] Documentation: net: phy: Add a paragraph about pause frames/flow control
From: Florian Fainelli @ 2016-11-28 17:33 UTC (permalink / raw)
  To: Sebastian Frias, netdev
  Cc: davem, andrew, martin.blumenstingl, mans, alexandre.torgue,
	peppe.cavallaro, timur, jbrunet
In-Reply-To: <1b7425f7-9183-69d4-76e8-42eefffeb1c6@laposte.net>

On 11/28/2016 02:38 AM, Sebastian Frias wrote:
> On 27/11/16 19:44, Florian Fainelli wrote:
>> Describe that the Ethernet MAC controller is ultimately responsible for
>> dealing with proper pause frames/flow control advertisement and
>> enabling, and that it is therefore allowed to have it change
>> phydev->supported/advertising with SUPPORTED_Pause and
>> SUPPORTED_AsymPause.
>>
>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>> ---
>>  Documentation/networking/phy.txt | 18 ++++++++++++++++--
>>  1 file changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
>> index 4b25c0f24201..9a42a9414cea 100644
>> --- a/Documentation/networking/phy.txt
>> +++ b/Documentation/networking/phy.txt
>> @@ -127,8 +127,9 @@ Letting the PHY Abstraction Layer do Everything
>>   values pruned from them which don't make sense for your controller (a 10/100
>>   controller may be connected to a gigabit capable PHY, so you would need to
>>   mask off SUPPORTED_1000baseT*).  See include/linux/ethtool.h for definitions
>> - for these bitfields. Note that you should not SET any bits, or the PHY may
>> - get put into an unsupported state.
>> + for these bitfields. Note that you should not SET any bits, except the
>> + SUPPORTED_Pause and SUPPORTED_AsymPause bits (see below), or the PHY may get
>> + put into an unsupported state.
>>  
>>   Lastly, once the controller is ready to handle network traffic, you call
>>   phy_start(phydev).  This tells the PAL that you are ready, and configures the
>> @@ -139,6 +140,19 @@ Letting the PHY Abstraction Layer do Everything
>>   When you want to disconnect from the network (even if just briefly), you call
>>   phy_stop(phydev).
>>  
>> +Pause frames / flow control
>> +
>> + The PHY does not participate directly in flow control/pause frames except by
>> + making sure that the SUPPORTED_Pause and SUPPORTED_AsymPause bits are set in
>> + MII_ADVERTISE to indicate towards the link partner that the Ethernet MAC
>> + controller supports such a thing. Since flow control/pause frames generation
>> + involves the Ethernet MAC driver, it is recommended that this driver takes care
>> + of properly indicating advertisement and support for such features by setting
>> + the SUPPORTED_Pause and SUPPORTED_AsymPause bits accordingly. This can be done
>> + either before or after phy_connect() 
> 
> If the bits are set after phy_connect(), how does the PHY framework know there's
> an update to the bits? Should some call be made?

You would most likely either call phy_start() to start the PHY state
machine (again) or re-negotiate the link, e.g. with
genphy_restart_aneg().
-- 
Florian
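
For illustration, a minimal user-space sketch of the sequence discussed above: the MAC driver sets the pause bits, and a changed advertisement means the link has to be re-negotiated. The two bit values are assumed to match include/uapi/linux/ethtool.h (which spells the second one SUPPORTED_Asym_Pause); `struct fake_phy` and `mac_enable_pause()` are hypothetical stand-ins for `struct phy_device` and driver code, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match include/uapi/linux/ethtool.h; redefined locally so the
 * sketch builds outside the kernel.
 */
#define SKETCH_SUPPORTED_Pause      (1u << 13)
#define SKETCH_SUPPORTED_Asym_Pause (1u << 14)

/* Hypothetical stand-in for the relevant fields of struct phy_device. */
struct fake_phy {
	uint32_t supported;
	uint32_t advertising;
	bool needs_renegotiation;
};

/* Let the MAC driver declare its flow-control capability.
 * Returns true when the advertisement changed; in a real driver this is
 * the point where phy_start() or genphy_restart_aneg() would be called.
 */
static bool mac_enable_pause(struct fake_phy *phy, bool sym, bool asym)
{
	const uint32_t mask = SKETCH_SUPPORTED_Pause |
			      SKETCH_SUPPORTED_Asym_Pause;
	uint32_t bits = 0;
	uint32_t old_adv;

	assert(phy);

	if (sym)
		bits |= SKETCH_SUPPORTED_Pause;
	if (asym)
		bits |= SKETCH_SUPPORTED_Asym_Pause;

	old_adv = phy->advertising;
	/* Only the two pause bits are touched; all other bits are kept. */
	phy->supported = (phy->supported & ~mask) | bits;
	phy->advertising = (phy->advertising & ~mask) | bits;

	if (phy->advertising != old_adv)
		phy->needs_renegotiation = true;

	return phy->advertising != old_adv;
}
```

Calling `mac_enable_pause()` a second time with the same arguments returns false, which models the case where no new negotiation is needed.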

^ permalink raw reply

* RE: [PATCH net-next] hv_netvsc: remove excessive logging on MTU change
From: Haiyang Zhang @ 2016-11-28 17:33 UTC (permalink / raw)
  To: Vitaly Kuznetsov, netdev@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, KY Srinivasan
In-Reply-To: <20161128172544.2491-1-vkuznets@redhat.com>



> -----Original Message-----
> From: Vitaly Kuznetsov [mailto:vkuznets@redhat.com]
> Sent: Monday, November 28, 2016 12:26 PM
> To: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org; KY Srinivasan <kys@microsoft.com>;
> Haiyang Zhang <haiyangz@microsoft.com>
> Subject: [PATCH net-next] hv_netvsc: remove excessive logging on MTU change
> 
> When we change MTU or the number of channels on a netvsc device we get the
> following logged:
> 
>  hv_netvsc bf5edba8...: net device safe to remove
>  hv_netvsc: hv_netvsc channel opened successfully
>  hv_netvsc bf5edba8...: Send section size: 6144, Section count:2560
>  hv_netvsc bf5edba8...: Device MAC 00:15:5d:1e:91:12 link state up
> 
> This information is useful as debug at most.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>

Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>

^ permalink raw reply

* Re: [PATCH net-next v2 6/6] tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING
From: Neal Cardwell @ 2016-11-28 17:26 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: David Miller, Soheil Hassas Yeganeh, francisyyan, Netdev,
	Eric Dumazet
In-Reply-To: <1480316838-154141-7-git-send-email-ycheng@google.com>

On Mon, Nov 28, 2016 at 2:07 AM, Yuchung Cheng <ycheng@google.com> wrote:
> From: Francis Yan <francisyyan@gmail.com>
>
> This patch exports the sender chronograph stats via the socket
> SO_TIMESTAMPING channel. Currently we can instrument how long a
> particular application unit of data was queued in TCP by tracking
> SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having
> these sender chronograph stats exported simultaneously along with
> these timestamps allows breaking down the various sender
> limitations further.  For example, a video server can tell if a particular
> chunk of video on a connection takes a long time to deliver because
> TCP was experiencing a small receive window. Before this patch, this
> could not be determined without packet traces.
>
> To prepare these stats, the user needs to set
> SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags
> while requesting other SOF_TIMESTAMPING TX timestamps. When the
> timestamps are available in the error queue, the stats are returned
> in a separate control message of type SCM_TIMESTAMPING_OPT_STATS,
> in a list of TLVs (struct nlattr) of types: TCP_NLA_BUSY_TIME,
> TCP_NLA_RWND_LIMITED, TCP_NLA_SNDBUF_LIMITED. Unit is microsecond.
>
> Signed-off-by: Francis Yan <francisyyan@gmail.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

neal
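
For illustration, a rough sketch of how an application might walk the TLV list once it has the payload of the SCM_TIMESTAMPING_OPT_STATS control message (for example via CMSG_DATA() after reading the error queue). Everything prefixed `sketch_`/`SKETCH_` is hypothetical: the header layout mirrors `struct nlattr`, but the numeric attribute types and the 64-bit payload width are assumptions, not values taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct nlattr: total length (header included) and type. */
struct sketch_nlattr {
	uint16_t nla_len;
	uint16_t nla_type;
};

#define SKETCH_NLA_ALIGN(len) (((size_t)(len) + 3) & ~(size_t)3)
#define SKETCH_NLA_HDRLEN SKETCH_NLA_ALIGN(sizeof(struct sketch_nlattr))

/* Hypothetical numbering of the three attribute types named in the patch. */
enum {
	SKETCH_NLA_BUSY_TIME = 1,
	SKETCH_NLA_RWND_LIMITED,
	SKETCH_NLA_SNDBUF_LIMITED,
	SKETCH_NLA_MAX
};

struct sender_stats {
	uint64_t usec[SKETCH_NLA_MAX]; /* indexed by attribute type */
};

/* Walk the attributes in buf; returns 0 on success, -1 on a malformed TLV. */
static int parse_opt_stats(const unsigned char *buf, size_t len,
			   struct sender_stats *out)
{
	size_t off = 0;

	assert(buf && out);
	memset(out, 0, sizeof(*out));

	while (len - off >= SKETCH_NLA_HDRLEN) {
		struct sketch_nlattr nla;
		size_t payload, step;

		memcpy(&nla, buf + off, sizeof(nla));
		if (nla.nla_len < SKETCH_NLA_HDRLEN || nla.nla_len > len - off)
			return -1;

		payload = nla.nla_len - SKETCH_NLA_HDRLEN;
		/* Unknown types and unexpected sizes are skipped, not fatal. */
		if (nla.nla_type > 0 && nla.nla_type < SKETCH_NLA_MAX &&
		    payload == sizeof(uint64_t))
			memcpy(&out->usec[nla.nla_type],
			       buf + off + SKETCH_NLA_HDRLEN, sizeof(uint64_t));

		/* Attributes are padded to 4 bytes; never step past the end. */
		step = SKETCH_NLA_ALIGN(nla.nla_len);
		off += step < len - off ? step : len - off;
	}

	return 0;
}
```

Skipping unknown attribute types keeps such a parser usable if more stats are added to the list later.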

^ permalink raw reply

* Re: [PATCH net-next v2 5/6] tcp: export sender limits chronographs to TCP_INFO
From: Neal Cardwell @ 2016-11-28 17:26 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: David Miller, Soheil Hassas Yeganeh, francisyyan, Netdev,
	Eric Dumazet
In-Reply-To: <1480316838-154141-6-git-send-email-ycheng@google.com>

On Mon, Nov 28, 2016 at 2:07 AM, Yuchung Cheng <ycheng@google.com> wrote:
> From: Francis Yan <francisyyan@gmail.com>
>
> This patch exports all the sender chronograph measurements collected
> in the previous patches to TCP_INFO interface. Note that busy time
> exported includes all the other sending limits (rwnd-limited,
> sndbuf-limited). Internally the time unit is jiffy but externally
> the measurements are in microseconds for future extensions.
>
> Signed-off-by: Francis Yan <francisyyan@gmail.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [PATCH net-next v2 4/6] tcp: instrument how long TCP is limited by insufficient send buffer
From: Neal Cardwell @ 2016-11-28 17:25 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: David Miller, Soheil Hassas Yeganeh, francisyyan, Netdev,
	Eric Dumazet
In-Reply-To: <1480316838-154141-5-git-send-email-ycheng@google.com>

On Mon, Nov 28, 2016 at 2:07 AM, Yuchung Cheng <ycheng@google.com> wrote:
> From: Francis Yan <francisyyan@gmail.com>
>
> This patch measures the amount of time when TCP runs out of new data
> to send to the network due to insufficient send buffer, while TCP
> is still busy delivering (i.e. write queue is not empty). The goal
> is to indicate when either the send buffer autotuning or the user's SO_SNDBUF
> setting has resulted in network under-utilization.
>
> The measurement starts conservatively by checking various conditions
> to minimize false claims (i.e. under-estimation is more likely).
> The measurement stops when the SOCK_NOSPACE flag is cleared. But it
> does not account for the time elapsed till the next application write.
> Also the measurement only starts if the sender is still busy sending
> data, s.t. the limit accounted is part of the total busy time.
>
> Signed-off-by: Francis Yan <francisyyan@gmail.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* [PATCH net-next] hv_netvsc: remove excessive logging on MTU change
From: Vitaly Kuznetsov @ 2016-11-28 17:25 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, K. Y. Srinivasan, Haiyang Zhang

When we change MTU or the number of channels on a netvsc device we get the
following logged:

 hv_netvsc bf5edba8...: net device safe to remove
 hv_netvsc: hv_netvsc channel opened successfully
 hv_netvsc bf5edba8...: Send section size: 6144, Section count:2560
 hv_netvsc bf5edba8...: Device MAC 00:15:5d:1e:91:12 link state up

This information is useful as debug at most.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 drivers/net/hyperv/netvsc.c       | 8 ++++----
 drivers/net/hyperv/rndis_filter.c | 6 +++---
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 720b5fa..d85da0d 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -410,8 +410,8 @@ static int netvsc_init_buf(struct hv_device *device)
 	net_device->send_section_cnt =
 		net_device->send_buf_size / net_device->send_section_size;
 
-	dev_info(&device->device, "Send section size: %d, Section count:%d\n",
-		 net_device->send_section_size, net_device->send_section_cnt);
+	netdev_dbg(ndev, "Send section size: %d, Section count:%d\n",
+		   net_device->send_section_size, net_device->send_section_cnt);
 
 	/* Setup state for managing the send buffer. */
 	net_device->map_words = DIV_ROUND_UP(net_device->send_section_cnt,
@@ -578,7 +578,7 @@ void netvsc_device_remove(struct hv_device *device)
 	 * At this point, no one should be accessing net_device
 	 * except in here
 	 */
-	dev_notice(&device->device, "net device safe to remove\n");
+	netdev_dbg(ndev, "net device safe to remove\n");
 
 	/* Now, we can close the channel safely */
 	vmbus_close(device->channel);
@@ -1380,7 +1380,7 @@ int netvsc_device_add(struct hv_device *device, void *additional_info)
 	}
 
 	/* Channel is opened */
-	pr_info("hv_netvsc channel opened successfully\n");
+	netdev_dbg(ndev, "hv_netvsc channel opened successfully\n");
 
 	/* If we're reopening the device we may have multiple queues, fill the
 	 * chn_table with the default channel to use it before subchannels are
diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index 9195d5d..8d90904 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1059,9 +1059,9 @@ int rndis_filter_device_add(struct hv_device *dev,
 
 	device_info->link_state = rndis_device->link_state;
 
-	dev_info(&dev->device, "Device MAC %pM link state %s\n",
-		 rndis_device->hw_mac_adr,
-		 device_info->link_state ? "down" : "up");
+	netdev_dbg(net, "Device MAC %pM link state %s\n",
+		   rndis_device->hw_mac_adr,
+		   device_info->link_state ? "down" : "up");
 
 	if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_5)
 		return 0;
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH net-next v2 3/6] tcp: instrument how long TCP is limited by receive window
From: Neal Cardwell @ 2016-11-28 17:25 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: David Miller, Soheil Hassas Yeganeh, francisyyan, Netdev,
	Eric Dumazet
In-Reply-To: <1480316838-154141-4-git-send-email-ycheng@google.com>

On Mon, Nov 28, 2016 at 2:07 AM, Yuchung Cheng <ycheng@google.com> wrote:
> From: Francis Yan <francisyyan@gmail.com>
>
> This patch measures the total time when the TCP stops sending because
> the receiver's advertised window is not large enough. Note that
> once the limit is lifted we are likely in the busy status if we
> have data pending.
>
> Signed-off-by: Francis Yan <francisyyan@gmail.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [PATCH net-next v2 2/6] tcp: instrument how long TCP is busy sending
From: Neal Cardwell @ 2016-11-28 17:24 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: David Miller, Soheil Hassas Yeganeh, francisyyan, Netdev,
	Eric Dumazet
In-Reply-To: <1480316838-154141-3-git-send-email-ycheng@google.com>

On Mon, Nov 28, 2016 at 2:07 AM, Yuchung Cheng <ycheng@google.com> wrote:
> From: Francis Yan <francisyyan@gmail.com>
>
> This patch measures TCP busy time, which is defined as the period
> of time when sender has data (or FIN) to send. The time starts when
> data is buffered and stops when the write queue is flushed by ACKs
> or error events.
>
> Note the busy time does not include SYN time, unless data is
> included in SYN (i.e. Fast Open). It does include FIN time even
> if the FIN carries no payload. Excluding pure FIN is possible but
> would incur one additional test in the fast path, which may not
> be worth it.
>
> Signed-off-by: Francis Yan <francisyyan@gmail.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [PATCH net-next v2 1/6] tcp: instrument tcp sender limits chronographs
From: Neal Cardwell @ 2016-11-28 17:23 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: David Miller, Soheil Hassas Yeganeh, francisyyan, Netdev,
	Eric Dumazet
In-Reply-To: <1480316838-154141-2-git-send-email-ycheng@google.com>

On Mon, Nov 28, 2016 at 2:07 AM, Yuchung Cheng <ycheng@google.com> wrote:
> From: Francis Yan <francisyyan@gmail.com>
>
> This patch implements the skeleton of the TCP chronograph
> instrumentation on sender side limits:
>
>         1) idle (unspec)
>         2) busy sending data other than 3-4 below
>         3) rwnd-limited
>         4) sndbuf-limited
>
> The limits are enumerated 'tcp_chrono'. Since a connection in
> theory can idle forever, we do not track the actual length of this
> uninteresting idle period. For the rest we track how long the sender
> spends in each limit. At any point during the life time of a
> connection, the sender must be in one of the four states.
>
> If there are multiple conditions worthy of tracking in a chronograph
> then the highest priority enum takes precedence over
> the other conditions. So that if something "more interesting"
> starts happening, stop the previous chrono and start a new one.
>
> The time unit is jiffy(u32) in order to save space in tcp_sock.
> This implies an application must sample the stats at least once every
> 49 days with a 1 ms jiffy.
>
> Signed-off-by: Francis Yan <francisyyan@gmail.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

neal
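
For illustration, a small user-space model of the priority rule and the u32 tick accounting described in the commit message. All names are hypothetical and this is not the code from the patch. In particular, the rule that stopping a limit falls back to "busy" is a simplifying assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Ordered by priority: a higher value preempts a lower one. */
enum sketch_chrono {
	SKETCH_CHRONO_UNSPEC = 0,	/* idle, not accounted */
	SKETCH_CHRONO_BUSY,
	SKETCH_CHRONO_RWND_LIMITED,
	SKETCH_CHRONO_SNDBUF_LIMITED,
	SKETCH_CHRONO_MAX
};

struct sketch_sock {
	enum sketch_chrono type;		/* chrono currently running */
	uint32_t start;				/* tick at which it started */
	uint32_t stat[SKETCH_CHRONO_MAX];	/* accumulated ticks per chrono */
};

/* Close the running chrono (if any) and switch to new_type. */
static void chrono_set(struct sketch_sock *sk, enum sketch_chrono new_type,
		       uint32_t now)
{
	/* Unsigned subtraction stays correct across one counter wrap. */
	if (sk->type > SKETCH_CHRONO_UNSPEC)
		sk->stat[sk->type] += now - sk->start;
	sk->start = now;
	sk->type = new_type;
}

/* Start a chrono only if it outranks the one already running. */
static void chrono_start(struct sketch_sock *sk, enum sketch_chrono type,
			 uint32_t now)
{
	assert(type > SKETCH_CHRONO_UNSPEC && type < SKETCH_CHRONO_MAX);
	if (type > sk->type)
		chrono_set(sk, type, now);
}

/* Stop the running chrono: a limit falls back to busy, busy to idle. */
static void chrono_stop(struct sketch_sock *sk, enum sketch_chrono type,
			uint32_t now)
{
	if (sk->type != type)
		return;
	chrono_set(sk, type == SKETCH_CHRONO_BUSY ? SKETCH_CHRONO_UNSPEC
						  : SKETCH_CHRONO_BUSY, now);
}

/* Days until a u32 tick counter wraps at hz ticks per second. */
static double sketch_wrap_days(unsigned int hz)
{
	assert(hz > 0);
	return 4294967296.0 / hz / 86400.0;
}
```

With 1 ms jiffies (hz = 1000) the wrap period is 2^32 ms, a little under 50 days, which is where the sampling bound in the commit message comes from.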

^ permalink raw reply

* Re: [PATCH] amd-xgbe: Fix unused suspend handlers build warning
From: David Miller @ 2016-11-28 17:19 UTC (permalink / raw)
  To: bp; +Cc: linux-kernel, thomas.lendacky, netdev
In-Reply-To: <20161126205352.19577-1-bp@alien8.de>

From: Borislav Petkov <bp@alien8.de>
Date: Sat, 26 Nov 2016 21:53:52 +0100

> From: Borislav Petkov <bp@suse.de>
> 
> Fix:
> 
>   drivers/net/ethernet/amd/xgbe/xgbe-main.c:835:12: warning: ‘xgbe_suspend’ defined
>     but not used [-Wunused-function]
>   drivers/net/ethernet/amd/xgbe/xgbe-main.c:855:12: warning: ‘xgbe_resume’ defined
>     but not used [-Wunused-function]
> 
> I see it during randconfig builds here.
> 
> Signed-off-by: Borislav Petkov <bp@suse.de>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] Set DEFAULT_TCP_CONG to bbr if DEFAULT_BBR is set
From: David Miller @ 2016-11-28 17:15 UTC (permalink / raw)
  To: jwollrath; +Cc: netdev
In-Reply-To: <20161125140526.2486-1-jwollrath@web.de>

From: Julian Wollrath <jwollrath@web.de>
Date: Fri, 25 Nov 2016 15:05:26 +0100

> Signed-off-by: Julian Wollrath <jwollrath@web.de>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: hns: Fix to conditionally convey RX checksum flag to stack
From: David Miller @ 2016-11-28 17:12 UTC (permalink / raw)
  To: salil.mehta; +Cc: yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel, linuxarm
In-Reply-To: <20161125133240.1264224-1-salil.mehta@huawei.com>

From: Salil Mehta <salil.mehta@huawei.com>
Date: Fri, 25 Nov 2016 13:32:40 +0000

> @@ -778,6 +778,35 @@ int hns_ae_get_regs_len(struct hnae_handle *handle)
>  	return total_num;
>  }
>  
> +static bool hns_ae_is_l3l4_csum_err(struct hnae_handle *handle)
> +{
> +	struct hns_ppe_cb *ppe_cb = hns_get_ppe_cb(handle);
> +	u32 regval;
> +	bool retval = false;
> +
> +	/* read PPE_HIS_PRO_ERR register and check for the checksum errors */
> +	regval = dsaf_read_dev(ppe_cb, PPE_HIS_PRO_ERR_REG);
> +

I don't see how a single register can properly provide error status for a ring
of pending received packets.

No matter how this register is implemented, it is either going to result in
packets erroneously being marked as having errors, or error status being
lost when multiple packets in a row have such errors.

For example, if you receive several packets in a row that have errors,
you'll read this register for the first one.  If this read clears the error
status, which I am guessing it does, then you won't see the error status
for the next packet that had one of these errors as well.

If you don't have something which is provided on a per-packet basis
then you can't determine the error properly.  Therefore you will just
have to always ignore the checksum if there is any error indicated in
the ring descriptor.
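
For illustration, a small user-space model of the failure mode described above. It assumes, as guessed in the review, that the register latches an error and is cleared by a read. `struct shared_err_reg`, `hw_receive()` and `process_ring()` are hypothetical names, not hns driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One device-wide latch instead of per-packet status. */
struct shared_err_reg {
	bool latched;
};

/* Hardware side: any bad packet sets the shared latch. */
static void hw_receive(struct shared_err_reg *reg, bool csum_err)
{
	if (csum_err)
		reg->latched = true;
}

/* Driver side: reading returns the latch and clears it (assumption). */
static bool read_and_clear(struct shared_err_reg *reg)
{
	bool v = reg->latched;

	reg->latched = false;
	return v;
}

/* A burst of n packets lands in the ring before the driver polls it.
 * flagged[i] is what the driver concludes for packet i; the return value
 * is the number of packets classified wrongly.
 */
static size_t process_ring(const bool *pkt_err, size_t n, bool *flagged)
{
	struct shared_err_reg reg = { false };
	size_t i, wrong = 0;

	assert(pkt_err && flagged);

	for (i = 0; i < n; i++)
		hw_receive(&reg, pkt_err[i]);

	for (i = 0; i < n; i++) {
		flagged[i] = read_and_clear(&reg);
		if (flagged[i] != pkt_err[i])
			wrong++;
	}

	return wrong;
}
```

For a burst of one good packet followed by two bad ones, the good packet is blamed and both bad packets pass as clean, which covers both outcomes named in the review.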

^ permalink raw reply

* Re: [PATCH 0/2] net: phy: realtek: fix RTL8211F TX-delay handling
From: David Miller @ 2016-11-28 17:07 UTC (permalink / raw)
  To: martin.blumenstingl-gM/Ye1E23mwN+BqQ9rBEUg
  Cc: f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, robh+dt-DgEjT+Ai2ygdnm+yROfE0A,
	mark.rutland-5wv7dgnIgG8, sean.wang-NuS5LvNUpcJWk0Htik3J/w,
	netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-amlogic-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	jbrunet-rdvid1DuHRBWk0Htik3J/w
In-Reply-To: <20161125131201.19994-1-martin.blumenstingl-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org>

From: Martin Blumenstingl <martin.blumenstingl-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org>
Date: Fri, 25 Nov 2016 14:11:59 +0100

> The RTL8211F PHY driver currently enables the TX-delay only when the
> phy-mode is PHY_INTERFACE_MODE_RGMII. This is incorrect, because there
> are three RGMII variations of the phy-mode which explicitly request the
> PHY to enable the RX and/or TX delay, while PHY_INTERFACE_MODE_RGMII
> specifies that the PHY should disable the RX and/or TX delays.
> 
> Additionally to the RTL8211F PHY driver change this contains a small
> update to the phy-mode documentation to clarify the purpose of the
> RGMII phy-modes.
> While this may not be perfect yet it's at least a start. Please feel
> free to drop this patch from this series and send an improved version
> yourself.
> 
> These patches are the results of recent discussions, see [0]
> 
> [0] http://lists.infradead.org/pipermail/linux-amlogic/2016-November/001688.html

Series applied, thanks Martin.
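
For illustration, a minimal sketch of the phy-mode semantics the cover letter describes: plain RGMII asks the PHY for no delay, the -ID variant for both delays, and -RXID/-TXID for one each. The enum and helper names are hypothetical stand-ins for the kernel's PHY_INTERFACE_MODE_RGMII* values and are not taken from the realtek driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the four RGMII phy-modes. */
enum sketch_phy_mode {
	SKETCH_MODE_RGMII,	/* PHY adds no delay */
	SKETCH_MODE_RGMII_ID,	/* PHY adds RX and TX delay */
	SKETCH_MODE_RGMII_RXID,	/* PHY adds RX delay only */
	SKETCH_MODE_RGMII_TXID	/* PHY adds TX delay only */
};

/* Should the PHY enable its internal TX delay for this mode? */
static bool phy_wants_tx_delay(enum sketch_phy_mode mode)
{
	switch (mode) {
	case SKETCH_MODE_RGMII_ID:
	case SKETCH_MODE_RGMII_TXID:
		return true;
	case SKETCH_MODE_RGMII:
	case SKETCH_MODE_RGMII_RXID:
		return false;
	}
	assert(!"unknown phy-mode");
	return false;
}

/* Should the PHY enable its internal RX delay for this mode? */
static bool phy_wants_rx_delay(enum sketch_phy_mode mode)
{
	switch (mode) {
	case SKETCH_MODE_RGMII_ID:
	case SKETCH_MODE_RGMII_RXID:
		return true;
	case SKETCH_MODE_RGMII:
	case SKETCH_MODE_RGMII_TXID:
		return false;
	}
	assert(!"unknown phy-mode");
	return false;
}
```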

^ permalink raw reply

* [patch net-next 0/4] mlxsw: couple of enhancements and fixes
From: Jiri Pirko @ 2016-11-28 17:01 UTC (permalink / raw)
  To: netdev; +Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz

From: Jiri Pirko <jiri@mellanox.com>

Couple of enhancements and fixes from Ido.

Ido Schimmel (4):
  mlxsw: resources: Add maximum buffer size
  mlxsw: spectrum_buffers: Limit size of pools
  mlxsw: core: Add missing rollback in error path
  mlxsw: core: Change order of operations in removal path

 drivers/net/ethernet/mellanox/mlxsw/core.c             | 3 ++-
 drivers/net/ethernet/mellanox/mlxsw/resources.h        | 2 ++
 drivers/net/ethernet/mellanox/mlxsw/spectrum_buffers.c | 3 +++
 3 files changed, 7 insertions(+), 1 deletion(-)

-- 
2.7.4

^ permalink raw reply

* [patch net-next 4/4] mlxsw: core: Change order of operations in removal path
From: Jiri Pirko @ 2016-11-28 17:01 UTC (permalink / raw)
  To: netdev; +Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz
In-Reply-To: <1480352486-18518-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

We call bus->init() before allocating 'lag.mapping'. Change the order of
operations in removal path to reflect that.

This makes the error path of mlxsw_core_bus_device_register() symmetric
with mlxsw_core_bus_device_unregister().

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 7a0ad39..4dc028b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -1188,8 +1188,8 @@ void mlxsw_core_bus_device_unregister(struct mlxsw_core *mlxsw_core)
 	mlxsw_thermal_fini(mlxsw_core->thermal);
 	devlink_unregister(devlink);
 	mlxsw_emad_fini(mlxsw_core);
-	mlxsw_core->bus->fini(mlxsw_core->bus_priv);
 	kfree(mlxsw_core->lag.mapping);
+	mlxsw_core->bus->fini(mlxsw_core->bus_priv);
 	free_percpu(mlxsw_core->pcpu_stats);
 	devlink_free(devlink);
 	mlxsw_core_driver_put(device_kind);
-- 
2.7.4

^ permalink raw reply related

* [patch net-next 3/4] mlxsw: core: Add missing rollback in error path
From: Jiri Pirko @ 2016-11-28 17:01 UTC (permalink / raw)
  To: netdev; +Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz
In-Reply-To: <1480352486-18518-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

Without this rollback, the thermal zone is still registered during the
error path, whereas its private data is freed upon the destruction of
the underlying bus device due to the use of devm_kzalloc(). This results
in use after free.

Fix this by calling mlxsw_thermal_fini() from the appropriate place in
the error path.

Fixes: a50c1e35650b ("mlxsw: core: Implement thermal zone")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index b21f88c..7a0ad39 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -1157,6 +1157,7 @@ int mlxsw_core_bus_device_register(const struct mlxsw_bus_info *mlxsw_bus_info,
 	if (mlxsw_core->driver->fini)
 		mlxsw_core->driver->fini(mlxsw_core);
 err_driver_init:
+	mlxsw_thermal_fini(mlxsw_core->thermal);
 err_thermal_init:
 err_hwmon_init:
 	devlink_unregister(devlink);
-- 
2.7.4
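
For illustration, a self-contained sketch of the goto-based unwinding that this patch and patch 4/4 are about: each error label undoes exactly the steps that succeeded before the failure, in reverse order, and the unregister path mirrors the same order. The step names and the trace log are hypothetical; the real function has more steps.

```c
#include <assert.h>
#include <string.h>

enum sketch_step {
	STEP_BUS,
	STEP_LAG,
	STEP_THERMAL,
	STEP_DRIVER
};

/* Upper-case letter = init succeeded, lower-case letter = matching fini. */
struct trace {
	char log[32];
};

static void note(struct trace *t, char c)
{
	size_t n = strlen(t->log);

	assert(n + 1 < sizeof(t->log));
	t->log[n] = c;
	t->log[n + 1] = '\0';
}

static int step_init(struct trace *t, char c, int fail)
{
	if (fail)
		return -1;
	note(t, c);
	return 0;
}

/* fail_at selects the step that fails; pass -1 for full success. */
static int device_register(struct trace *t, int fail_at)
{
	int err;

	err = step_init(t, 'B', fail_at == STEP_BUS);
	if (err)
		return err;
	err = step_init(t, 'L', fail_at == STEP_LAG);
	if (err)
		goto err_lag;
	err = step_init(t, 'T', fail_at == STEP_THERMAL);
	if (err)
		goto err_thermal;
	err = step_init(t, 'D', fail_at == STEP_DRIVER);
	if (err)
		goto err_driver;
	return 0;

err_driver:
	note(t, 't');	/* thermal fini: the kind of rollback this patch adds */
err_thermal:
	note(t, 'l');	/* free the LAG mapping */
err_lag:
	note(t, 'b');	/* bus fini comes last, mirroring init order */
	return err;
}

static void device_unregister(struct trace *t)
{
	note(t, 'd');
	note(t, 't');
	note(t, 'l');
	note(t, 'b');
}
```

A failure in the driver step leaves the trace "BLTtlb": everything that was set up is torn down in reverse order and nothing stays registered.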

^ permalink raw reply related

* [patch net-next 2/4] mlxsw: spectrum_buffers: Limit size of pools
From: Jiri Pirko @ 2016-11-28 17:01 UTC (permalink / raw)
  To: netdev; +Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz
In-Reply-To: <1480352486-18518-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

The shared buffer pools are containers whose size is used to calculate
the maximum usage for packets from / to a specific port / {port, PG/TC},
when dynamic threshold is employed.

While it's perfectly fine for the sum of the pools to exceed the maximum
size of the shared buffer, a single pool cannot.

Add a check when the pool size is set and forbid sizes larger than the
maximum size of the shared buffer.

Without the patch:
$ devlink sb pool set pci/0000:03:00.0 pool 0 size 999999999 thtype
dynamic
// No error is returned

With the patch:
$ devlink sb pool set pci/0000:03:00.0 pool 0 size 999999999 thtype
dynamic
devlink answers: Invalid argument

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_buffers.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_buffers.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_buffers.c
index bcaed8a..a746826 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_buffers.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_buffers.c
@@ -611,6 +611,9 @@ int mlxsw_sp_sb_pool_set(struct mlxsw_core *mlxsw_core,
 	u32 pool_size = MLXSW_SP_BYTES_TO_CELLS(size);
 	enum mlxsw_reg_sbpr_mode mode;
 
+	if (size > MLXSW_CORE_RES_GET(mlxsw_sp->core, MAX_BUFFER_SIZE))
+		return -EINVAL;
+
 	mode = (enum mlxsw_reg_sbpr_mode) threshold_type;
 	return mlxsw_sp_sb_pr_write(mlxsw_sp, pool, dir, mode, pool_size);
 }
-- 
2.7.4
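
For illustration, a user-space sketch of the check this patch adds: the requested size in bytes is compared against the device maximum before it is converted to cells. The cell size, the round-up conversion and the function names are assumptions made for the example; the real code reads the limit with MLXSW_CORE_RES_GET().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical cell size; the driver has its own constant. */
#define SKETCH_BYTES_PER_CELL 96u

/* Round up so that a partially used cell still counts (assumption). */
static uint32_t sketch_bytes_to_cells(uint32_t bytes)
{
	return bytes / SKETCH_BYTES_PER_CELL +
	       (bytes % SKETCH_BYTES_PER_CELL != 0);
}

/* Reject pool sizes above the shared buffer maximum, as in the patch.
 * On success the size in cells is stored in *cells_out.
 */
static int sketch_pool_set(uint32_t size, uint32_t max_buffer_size,
			   uint32_t *cells_out)
{
	assert(cells_out);

	if (size > max_buffer_size)
		return -EINVAL;

	*cells_out = sketch_bytes_to_cells(size);
	return 0;
}
```

Doing the comparison in bytes, before the conversion, matches the "/* Bytes */" unit noted for MAX_BUFFER_SIZE in patch 1/4.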

^ permalink raw reply related

* [patch net-next 1/4] mlxsw: resources: Add maximum buffer size
From: Jiri Pirko @ 2016-11-28 17:01 UTC (permalink / raw)
  To: netdev; +Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz
In-Reply-To: <1480352486-18518-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

We need to be able to limit the size of shared buffer pools, so query
the maximum size from the device during init.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/resources.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/resources.h b/drivers/net/ethernet/mellanox/mlxsw/resources.h
index 1c2119b..3c2171d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/resources.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/resources.h
@@ -47,6 +47,7 @@ enum mlxsw_res_id {
 	MLXSW_RES_ID_MAX_SYSTEM_PORT,
 	MLXSW_RES_ID_MAX_LAG,
 	MLXSW_RES_ID_MAX_LAG_MEMBERS,
+	MLXSW_RES_ID_MAX_BUFFER_SIZE,
 	MLXSW_RES_ID_MAX_CPU_POLICERS,
 	MLXSW_RES_ID_MAX_VRS,
 	MLXSW_RES_ID_MAX_RIFS,
@@ -70,6 +71,7 @@ static u16 mlxsw_res_ids[] = {
 	[MLXSW_RES_ID_MAX_SYSTEM_PORT] = 0x2502,
 	[MLXSW_RES_ID_MAX_LAG] = 0x2520,
 	[MLXSW_RES_ID_MAX_LAG_MEMBERS] = 0x2521,
+	[MLXSW_RES_ID_MAX_BUFFER_SIZE] = 0x2802,	/* Bytes */
 	[MLXSW_RES_ID_MAX_CPU_POLICERS] = 0x2A13,
 	[MLXSW_RES_ID_MAX_VRS] = 0x2C01,
 	[MLXSW_RES_ID_MAX_RIFS] = 0x2C02,
-- 
2.7.4

^ permalink raw reply related

* Re: stmmac ethernet in kernel 4.4: coalescing related pauses?
From: Lino Sanfilippo @ 2016-11-28 17:01 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, pavel, peppe.cavallaro, netdev, linux-kernel
In-Reply-To: <20161128.113031.964579744326063048.davem@davemloft.net>

On 28.11.2016 17:30, David Miller wrote:
> From: Lino Sanfilippo <lsanfil@marvell.com>
> Date: Mon, 28 Nov 2016 16:57:35 +0100
>
>> I wonder if the best fix would be indeed to deactivate irq coalescing
>> completely.
>> Does it make any sense at all to use it if a driver uses NAPI already?
>
> It absolutely does make sense, when it is implemented and functions
> properly.
>

Interesting. I always thought that NAPI and irq coalescing essentially do the
same thing, once in software and once with hw support. Did I misunderstand NAPI?

Regards,
Lino

^ permalink raw reply

* Re: [PATCH net-next 1/5] net: mvneta: Use cacheable memory to store the rx buffer virtual address
From: Gregory CLEMENT @ 2016-11-28 17:00 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: David S. Miller, linux-kernel, netdev, Arnd Bergmann,
	Jason Cooper, Andrew Lunn, Sebastian Hesselbarth,
	Thomas Petazzoni, linux-arm-kernel, Nadav Haklai, Marcin Wojtas,
	Dmitri Epshtein, Yelena Krivosheev
In-Reply-To: <20161128163548.70181560@xhacker>

Hi Jisheng,
 
 On Mon, Nov 28 2016, Jisheng Zhang <jszhang@marvell.com> wrote:

> Hi Gregory,
>
> On Fri, 25 Nov 2016 16:30:14 +0100 Gregory CLEMENT wrote:
>
>> Until now the virtual address of the received buffer was stored in the
>> cookie field of the rx descriptor. However, this field is only 32 bits wide,
>> which prevents using the driver on a 64-bit architecture.
>> 
>> With this patch the virtual address is stored in an array not shared with
>> the hardware (no more need to use the DMA API). Thanks to this, it is
>> possible to use cache contrary to the access of the rx descriptor member.
>> 
>> The change is done in the swbm path only because the hwbm uses the cookie
>> field, this also means that currently the hwbm is not usable in 64-bits.
>> 
>> Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
>> ---
>>  drivers/net/ethernet/marvell/mvneta.c | 96 ++++++++++++++++++++++++----
>>  1 file changed, 84 insertions(+), 12 deletions(-)
>> 
>> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
>> index 87274d4ab102..b6849f88cab7 100644
>> --- a/drivers/net/ethernet/marvell/mvneta.c
>> +++ b/drivers/net/ethernet/marvell/mvneta.c
>> @@ -561,6 +561,9 @@ struct mvneta_rx_queue {
>>  	u32 pkts_coal;
>>  	u32 time_coal;
>>  
>> +	/* Virtual address of the RX buffer */
>> +	void  **buf_virt_addr;
>
> can we store buf_phys_addr in cacheable memory as well?

Even if we store it in cacheable memory we will still need to store it
in the buffer descriptor as it is used by the hardware.
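
For illustration, a user-space sketch of the split discussed here: the physical address stays in the descriptor because the hardware reads it, while the full-width virtual address lives in a driver-private array indexed by the descriptor's position in the ring. The `sketch_` types are simplified, hypothetical versions of the mvneta structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified descriptor: fields shared with the hardware are 32 bits. */
struct sketch_rx_desc {
	uint32_t status;
	uint32_t buf_phys_addr;	/* consumed by the hardware */
	uint32_t buf_cookie;	/* too narrow for a 64-bit pointer */
};

struct sketch_rx_queue {
	size_t size;			/* number of descriptors */
	struct sketch_rx_desc *descs;	/* descriptor ring */
	void **buf_virt_addr;		/* CPU-only side array, same length */
};

/* Ring index of a descriptor, derived by pointer subtraction. */
static size_t sketch_desc_index(const struct sketch_rx_queue *rxq,
				const struct sketch_rx_desc *rx_desc)
{
	ptrdiff_t i = rx_desc - rxq->descs;

	assert(i >= 0 && (size_t)i < rxq->size);
	return (size_t)i;
}

/* Record both addresses for one descriptor. */
static void sketch_rx_desc_fill(struct sketch_rx_queue *rxq,
				struct sketch_rx_desc *rx_desc,
				uint32_t phys_addr, void *virt_addr)
{
	rx_desc->buf_phys_addr = phys_addr;
	rxq->buf_virt_addr[sketch_desc_index(rxq, rx_desc)] = virt_addr;
}

/* Look the virtual address back up on the receive path. */
static void *sketch_rx_desc_virt(const struct sketch_rx_queue *rxq,
				 const struct sketch_rx_desc *rx_desc)
{
	return rxq->buf_virt_addr[sketch_desc_index(rxq, rx_desc)];
}
```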

>
>> +
>>  	/* Virtual address of the RX DMA descriptors array */
>>  	struct mvneta_rx_desc *descs;
>>  
>> @@ -1573,10 +1576,14 @@ static void mvneta_tx_done_pkts_coal_set(struct mvneta_port *pp,
>>  
>>  /* Handle rx descriptor fill by setting buf_cookie and buf_phys_addr */
>>  static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
>> -				u32 phys_addr, u32 cookie)
>> +				u32 phys_addr, void *virt_addr,
>> +				struct mvneta_rx_queue *rxq)
>>  {
>> -	rx_desc->buf_cookie = cookie;
>> +	int i;
>> +
>>  	rx_desc->buf_phys_addr = phys_addr;
>> +	i = rx_desc - rxq->descs;
>> +	rxq->buf_virt_addr[i] = virt_addr;
>>  }
>>  
>>  /* Decrement sent descriptors counter */
>> @@ -1781,7 +1788,8 @@ EXPORT_SYMBOL_GPL(mvneta_frag_free);
>>  
>>  /* Refill processing for SW buffer management */
>>  static int mvneta_rx_refill(struct mvneta_port *pp,
>> -			    struct mvneta_rx_desc *rx_desc)
>> +			    struct mvneta_rx_desc *rx_desc,
>> +			    struct mvneta_rx_queue *rxq)
>>  
>>  {
>>  	dma_addr_t phys_addr;
>> @@ -1799,7 +1807,7 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
>>  		return -ENOMEM;
>>  	}
>>  
>> -	mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)data);
>> +	mvneta_rx_desc_fill(rx_desc, phys_addr, data, rxq);
>>  	return 0;
>>  }
>>  
>> @@ -1861,7 +1869,12 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
>>  
>>  	for (i = 0; i < rxq->size; i++) {
>>  		struct mvneta_rx_desc *rx_desc = rxq->descs + i;
>> -		void *data = (void *)rx_desc->buf_cookie;
>> +		void *data;
>> +
>> +		if (!pp->bm_priv)
>> +			data = rxq->buf_virt_addr[i];
>> +		else
>> +			data = (void *)(uintptr_t)rx_desc->buf_cookie;
>>  
>>  		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
>>  				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
>> @@ -1894,12 +1907,13 @@ static int mvneta_rx_swbm(struct mvneta_port *pp, int rx_todo,
>>  		unsigned char *data;
>>  		dma_addr_t phys_addr;
>>  		u32 rx_status, frag_size;
>> -		int rx_bytes, err;
>> +		int rx_bytes, err, index;
>>  
>>  		rx_done++;
>>  		rx_status = rx_desc->status;
>>  		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
>> -		data = (unsigned char *)rx_desc->buf_cookie;
>> +		index = rx_desc - rxq->descs;
>> +		data = (unsigned char *)rxq->buf_virt_addr[index];
>>  		phys_addr = rx_desc->buf_phys_addr;
>>  
>>  		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
>> @@ -1938,7 +1952,7 @@ static int mvneta_rx_swbm(struct mvneta_port *pp, int rx_todo,
>>  		}
>>  
>>  		/* Refill processing */
>> -		err = mvneta_rx_refill(pp, rx_desc);
>> +		err = mvneta_rx_refill(pp, rx_desc, rxq);
>>  		if (err) {
>>  			netdev_err(dev, "Linux processing - Can't refill\n");
>>  			rxq->missed++;
>> @@ -2020,7 +2034,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
>>  		rx_done++;
>>  		rx_status = rx_desc->status;
>>  		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
>> -		data = (unsigned char *)rx_desc->buf_cookie;
>> +		data = (u8 *)(uintptr_t)rx_desc->buf_cookie;
>>  		phys_addr = rx_desc->buf_phys_addr;
>>  		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
>>  		bm_pool = &pp->bm_priv->bm_pools[pool_id];
>> @@ -2708,6 +2722,57 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
>>  	return rx_done;
>>  }
>>  
>> +/* Refill processing for HW buffer management */
>> +static int mvneta_rx_hwbm_refill(struct mvneta_port *pp,
>> +				 struct mvneta_rx_desc *rx_desc)
>> +
>> +{
>> +	dma_addr_t phys_addr;
>> +	void *data;
>> +
>> +	data = mvneta_frag_alloc(pp->frag_size);
>> +	if (!data)
>> +		return -ENOMEM;
>> +
>> +	phys_addr = dma_map_single(pp->dev->dev.parent, data,
>> +				   MVNETA_RX_BUF_SIZE(pp->pkt_size),
>> +				   DMA_FROM_DEVICE);
>> +	if (unlikely(dma_mapping_error(pp->dev->dev.parent, phys_addr))) {
>> +		mvneta_frag_free(pp->frag_size, data);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	phys_addr += pp->rx_offset_correction;
>> +	rx_desc->buf_phys_addr = phys_addr;
>> +	rx_desc->buf_cookie = (uintptr_t)data;
>> +
>> +	return 0;
>> +}
>> +
>> +/* Handle rxq fill: allocates rxq skbs; called when initializing a port */
>> +static int mvneta_rxq_bm_fill(struct mvneta_port *pp,
>> +			      struct mvneta_rx_queue *rxq,
>> +			      int num)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < num; i++) {
>> +		memset(rxq->descs + i, 0, sizeof(struct mvneta_rx_desc));
>> +		if (mvneta_rx_hwbm_refill(pp, rxq->descs + i) != 0) {
>> +			netdev_err(pp->dev, "%s:rxq %d, %d of %d buffs  filled\n",
>> +				   __func__, rxq->id, i, num);
>> +			break;
>> +		}
>> +	}
>> +
>> +	/* Add this number of RX descriptors as non occupied (ready to
>> +	 * get packets)
>> +	 */
>> +	mvneta_rxq_non_occup_desc_add(pp, rxq, i);
>> +
>> +	return i;
>> +}
>> +
>>  /* Handle rxq fill: allocates rxq skbs; called when initializing a port */
>>  static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
>>  			   int num)
>> @@ -2716,7 +2781,7 @@ static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
>>  
>>  	for (i = 0; i < num; i++) {
>>  		memset(rxq->descs + i, 0, sizeof(struct mvneta_rx_desc));
>> -		if (mvneta_rx_refill(pp, rxq->descs + i) != 0) {
>> +		if (mvneta_rx_refill(pp, rxq->descs + i, rxq) != 0) {
>>  			netdev_err(pp->dev, "%s:rxq %d, %d of %d buffs  filled\n",
>>  				__func__, rxq->id, i, num);
>>  			break;
>> @@ -2784,14 +2849,21 @@ static int mvneta_rxq_init(struct mvneta_port *pp,
>>  		mvneta_rxq_buf_size_set(pp, rxq,
>>  					MVNETA_RX_BUF_SIZE(pp->pkt_size));
>>  		mvneta_rxq_bm_disable(pp, rxq);
>> +
>> +		rxq->buf_virt_addr = devm_kmalloc(pp->dev->dev.parent,
>> +						  rxq->size * sizeof(void *),
>> +						  GFP_KERNEL);
>
> I would suggest allocating this buffer during probe. Otherwise, there's
> a memory leak if we either change the MTU or close then open the eth in
> a loop, e.g.
>
> while true
> do
> 	ifconfig eth0 up
> 	ifconfig eth0 down
> done

Indeed, I will move it.

Thanks,

Gregory

>
> Thanks,
> Jisheng
>
>> +		if (!rxq->buf_virt_addr)
>> +			return -ENOMEM;
>> +
>> +		mvneta_rxq_fill(pp, rxq, rxq->size);
>>  	} else {
>>  		mvneta_rxq_bm_enable(pp, rxq);
>>  		mvneta_rxq_long_pool_set(pp, rxq);
>>  		mvneta_rxq_short_pool_set(pp, rxq);
>> +		mvneta_rxq_bm_fill(pp, rxq, rxq->size);
>>  	}
>>  
>> -	mvneta_rxq_fill(pp, rxq, rxq->size);
>> -
>>  	return 0;
>>  }
>>  
>

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
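
To illustrate the leak discussed in this message: devm_kmalloc() memory is
released only when the device is unbound, so allocating the table in
mvneta_rxq_init() adds one more allocation on every open or MTU change.
Below is a minimal userspace sketch of the suggested fix, which allocates once
at probe time and reuses the table on each open. All names in it (rxq_model,
model_probe, model_open, model_remove, live_allocs) are hypothetical
stand-ins and are not part of the mvneta driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mvneta_rx_queue; only the fields
 * needed for the illustration are modelled.
 */
struct rxq_model {
	size_t size;          /* number of RX descriptors */
	void **buf_virt_addr; /* one virtual address slot per descriptor */
};

/* Counts tables that are still allocated, to make a leak visible. */
static int live_allocs;

/* "probe": runs once per device, so the table is allocated exactly once. */
static int model_probe(struct rxq_model *rxq, size_t size)
{
	assert(size > 0);
	rxq->size = size;
	rxq->buf_virt_addr = calloc(size, sizeof(void *));
	if (!rxq->buf_virt_addr)
		return -1;
	live_allocs++;
	return 0;
}

/* "open" (ifconfig up): clears and reuses the table, allocates nothing. */
static int model_open(struct rxq_model *rxq)
{
	for (size_t i = 0; i < rxq->size; i++)
		rxq->buf_virt_addr[i] = NULL;
	return 0;
}

/* "remove": the only place where the table is released. */
static void model_remove(struct rxq_model *rxq)
{
	free(rxq->buf_virt_addr);
	rxq->buf_virt_addr = NULL;
	live_allocs--;
}
```

With this split, repeating the up/down loop from the review calls only
model_open(), so the number of live tables stays at one.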


* Re: [PATCH net-next 0/2] Fix support for the MV88E6097
From: David Miller @ 2016-11-28 16:59 UTC (permalink / raw)
  To: eichest; +Cc: andrew, vivien.didelot, netdev, stefan.eichenberger
In-Reply-To: <20161125084130.3210-1-stefan.eichenberger@netmodule.com>

From: Stefan Eichenberger <eichest@gmail.com>
Date: Fri, 25 Nov 2016 09:41:28 +0100

> This patchset fixes the following two issues for the MV88E6097:
> - Add missing definition of g1_irqs
> - Add missing comment

Series applied, thanks Stefan.


* Re: AF_VSOCK network namespace support
From: Jorgen S. Hansen @ 2016-11-28 15:24 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: netdev@vger.kernel.org, imbrenda@linux.vnet.ibm.com
In-Reply-To: <20161123145535.GA16465@stefanha-x1.localdomain>

Hi Stefan,

> On Nov 23, 2016, at 3:55 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> Hi Jorgen,
> There are two use cases where network namespace support in AF_VSOCK
> could be useful:
> 
> 1. Claudio Imbrenda pointed out that a machine cannot act as both host
>   and guest at the same time.  This is necessary for nested
>   virtualization.  Currently only one transport (the host side or the
>   guest side) can be registered at a time.

VMCI based AF_VSOCK relies on the VMCI driver for nested virtualization support. The VMCI driver is a combined host/guest driver with a routing component that will direct traffic either to VMs managed by the host “personality” of the driver or to the outer host. So any VMCI driver is able to function simultaneously as both a guest and a host driver, exactly to be able to support nested virtualization.

Since, for VMCI based vSocket, the host has a fixed CID (2), any traffic generated by an application inside a VM destined for CID 2 will be routed out of the VM (to the host - either a virtual or physical one). Any traffic for a CID > 2 will be directed towards VMs managed by the host personality of the VMCI driver.

Since VMCI predates nested virtualization, the solution above was partly a result of having to support existing configurations in a transparent way.
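
The routing rule described above can be sketched as follows. This is only an
illustration of the rule as stated in this message, not code from the VMCI
driver: the names HOST_CID, enum route, route_for_cid and route_examples are
hypothetical, and the handling of CIDs below 2 is not covered by the text, so
the sketch reports them as unspecified.

```c
#include <assert.h>
#include <stdint.h>

#define HOST_CID 2u /* fixed host CID, as stated in the text */

enum route {
	ROUTE_TO_OUTER_HOST, /* leaves the VM towards its (virtual or physical) host */
	ROUTE_TO_MANAGED_VM, /* handled by the host personality of the driver */
	ROUTE_UNSPECIFIED,   /* CIDs below 2: not described in this message */
};

/* Pick a direction for traffic based on the destination CID. */
static enum route route_for_cid(uint32_t dst_cid)
{
	if (dst_cid == HOST_CID)
		return ROUTE_TO_OUTER_HOST;
	if (dst_cid > HOST_CID)
		return ROUTE_TO_MANAGED_VM;
	return ROUTE_UNSPECIFIED;
}

/* Usage: a nested hypervisor reaches its own host via CID 2 and its
 * guests via any CID greater than 2.
 */
void route_examples(void)
{
	assert(route_for_cid(2) == ROUTE_TO_OUTER_HOST);
	assert(route_for_cid(3) == ROUTE_TO_MANAGED_VM);
}
```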

> 2. Users may wish to isolate the AF_VSOCK address namespace so that two
>   VMs have completely independent CID and ports (they could even use
>   the same CID and ports because they're in separate namespaces).  This
>   ensures that a host service visible to VM1 is not automatically
>   visible to VM2.

If the goal is to provide fine grained service access control, won’t this end up requiring a namespace per VM? For ESX, we have a mechanism to tag VMs that allows them to be granted access to a service offered through AF_VSOCK, but this is not part of the Linux hypervisor.

If the intent is to be able to support multi tenancy, then this sounds like a better fit. Also, in the multi tenancy case, isolating the other AFs is probably what you want as well.

> Network namespaces could solve both problems.
> 
> A drawback of namespaces is that existing configurations using network
> namespaces for IPv4/6 or other purposes break if AF_VSOCK gains network
> namespace support.  This is not a big problem for virtio-vsock if we
> implement namespace support soon since there are no existing users.
> 
> I wonder how other address families have solved this transition to
> network namespaces.  It's almost like we need fine-grained namespaces
> instead of a blanket network namespace that applies across all address
> families...
> 
> I'm playing around with the code now but wanted to get your thoughts in
> case you've already considered these problems.
> 
> Stefan

Thanks,
Jørgen


* Re: [PATCH net] sit: Set skb->protocol properly in ipip6_tunnel_xmit()
From: Eric Dumazet @ 2016-11-28 16:53 UTC (permalink / raw)
  To: David Miller, Alexander Duyck; +Cc: sfr, elicooper, netdev
In-Reply-To: <20161128.114703.154301432767947066.davem@davemloft.net>

On Mon, 2016-11-28 at 11:47 -0500, David Miller wrote:
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Sun, 27 Nov 2016 13:04:00 +1100
> 
> > [Just for Dave's information]
> > 
> > On Fri, 25 Nov 2016 13:50:17 +0800 Eli Cooper <elicooper@gmx.com> wrote:
> >>
> >> Similar to commit ae148b085876
> >> ("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"),
> >> sit tunnels also need to update skb->protocol; otherwise, TSO/GSO packets
> >> might not be properly segmented, which causes the packets to be dropped.
> >> 
> >> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
> >> Tested-by: Eli Cooper <elicooper@gmx.com>
> >> Cc: stable@vger.kernel.org
> >> Signed-off-by: Eli Cooper <elicooper@gmx.com>
> > 
> > I tested this patch and it does *not* solve my problem.
> 
> I'm torn on this patch, because it looked exactly like it would solve the
> kind of problem Stephen is running into.
> 
> Even though it doesn't fix his case, it seems correct to me.
> 
> I was wondering if it was also important to set the skb->protocol
> before the call to ip_tunnel_encap() but I couldn't find a dependency.
> 
> In any event I'd like to see some other people review this change
> before I apply it.
> 
> My only other guess for Stephen's problem is somehow the SKB headers
> aren't set up properly for what the GSO engine expects.

Well, mlx4 just works, and uses GSO engine just fine.

So my guess is this is a bug in Intel IGB driver.

Alexander, can you take a look?

Features for eth0:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: on
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
busy-poll: off [fixed]
hw-tc-offload: off [fixed]


* Re: [PATCH net-next] virtio-net: enable multiqueue by default
From: Michael S. Tsirkin @ 2016-11-28 16:52 UTC (permalink / raw)
  To: Jason Wang
  Cc: Neil Horman, Jeremy Eder, Hannes Frederic Sowa, netdev,
	linux-kernel, virtualization, Marko Myllynen, Maxime Coquelin
In-Reply-To: <1480048646-17536-1-git-send-email-jasowang@redhat.com>

On Fri, Nov 25, 2016 at 12:37:26PM +0800, Jason Wang wrote:
> We use a single queue even if multiqueue is enabled and let the admin
> enable it through ethtool later. This is used to avoid a possible
> regression (small packet TCP stream transmission). But this looks like
> overkill since:
> 
> - a single queue user can disable multiqueue when launching qemu
> - it brings extra trouble for management since it needs an extra admin
>   tool in the guest to enable multiqueue
> - multiqueue performs much better than single queue in most
>   cases
> 
> So this patch enables multiqueue by default: if #queues is less than or
> equal to #vcpu, enable all of the queue pairs; if #queues is greater
> than #vcpu, enable #vcpu queue pairs.
> 
> Cc: Hannes Frederic Sowa <hannes@redhat.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Neil Horman <nhorman@redhat.com>
> Cc: Jeremy Eder <jeder@redhat.com>
> Cc: Marko Myllynen <myllynen@redhat.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

OK, I still think we should handle CPU hotplug better,
but this can be done separately.

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  drivers/net/virtio_net.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index d4ac7a6..a21d93a 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1886,8 +1886,11 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	if (vi->any_header_sg)
>  		dev->needed_headroom = vi->hdr_len;
>  
> -	/* Use single tx/rx queue pair as default */
> -	vi->curr_queue_pairs = 1;
> +	/* Enable multiqueue by default */
> +	if (num_online_cpus() >= max_queue_pairs)
> +		vi->curr_queue_pairs = max_queue_pairs;
> +	else
> +		vi->curr_queue_pairs = num_online_cpus();
>  	vi->max_queue_pairs = max_queue_pairs;
>  
>  	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
> @@ -1918,6 +1921,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>  		goto free_unregister_netdev;
>  	}
>  
> +	virtnet_set_affinity(vi);
> +
>  	/* Assume link up if device can't report link status,
>  	   otherwise get link status from config. */
>  	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
> -- 
> 2.7.4
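
The default chosen by the patch above amounts to taking the smaller of the
number of online CPUs and the maximum number of queue pairs. A minimal
userspace sketch of that selection follows. The helper name
default_queue_pairs is hypothetical and does not exist in virtio_net.c, and
the plain integer arguments stand in for num_online_cpus() and
max_queue_pairs.

```c
#include <assert.h>

/* Hypothetical helper mirroring the if/else added to virtnet_probe():
 * use every queue pair when there are at least as many online CPUs as
 * queue pairs, otherwise use one queue pair per online CPU.
 */
static unsigned int default_queue_pairs(unsigned int online_cpus,
					unsigned int max_queue_pairs)
{
	assert(online_cpus > 0 && max_queue_pairs > 0);

	if (online_cpus >= max_queue_pairs)
		return max_queue_pairs;
	return online_cpus;
}
```

For example, a guest with 4 online CPUs and 8 available queue pairs would
start with 4 pairs, while a guest with 8 CPUs and 4 available pairs would
start with all 4.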


* Re: [PATCH net] sit: Set skb->protocol properly in ipip6_tunnel_xmit()
From: David Miller @ 2016-11-28 16:47 UTC (permalink / raw)
  To: sfr; +Cc: elicooper, netdev
In-Reply-To: <20161127130400.4a69ff1b@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Sun, 27 Nov 2016 13:04:00 +1100

> [Just for Dave's information]
> 
> On Fri, 25 Nov 2016 13:50:17 +0800 Eli Cooper <elicooper@gmx.com> wrote:
>>
>> Similar to commit ae148b085876
>> ("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"),
>> sit tunnels also need to update skb->protocol; otherwise, TSO/GSO packets
>> might not be properly segmented, which causes the packets to be dropped.
>> 
>> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
>> Tested-by: Eli Cooper <elicooper@gmx.com>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Eli Cooper <elicooper@gmx.com>
> 
> I tested this patch and it does *not* solve my problem.

I'm torn on this patch, because it looked exactly like it would solve the
kind of problem Stephen is running into.

Even though it doesn't fix his case, it seems correct to me.

I was wondering if it was also important to set the skb->protocol
before the call to ip_tunnel_encap() but I couldn't find a dependency.

In any event I'd like to see some other people review this change
before I apply it.

My only other guess for Stephen's problem is somehow the SKB headers
aren't set up properly for what the GSO engine expects.
