Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] b43legacy: Remove unused function
From: Larry Finger @ 2015-01-06 18:08 UTC (permalink / raw)
  To: Rickard Strandqvist
  Cc: Kalle Valo, linux-wireless, b43-dev, netdev, linux-kernel
In-Reply-To: <1420567175-17323-1-git-send-email-rickard_strandqvist@spectrumdigital.se>

On 01/06/2015 11:59 AM, Rickard Strandqvist wrote:
> Remove the function b43legacy_radio_set_tx_iq() that is not used anywhere.
>
> This was partially found by using a static code analysis program called cppcheck.
>
> Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
> ---
>   drivers/net/wireless/b43legacy/radio.c |   19 -------------------
>   drivers/net/wireless/b43legacy/radio.h |    1 -
>   2 files changed, 20 deletions(-)

Acked-by: Larry Finger <Larry.Finger@lwfinger.net>

Larry

>
> diff --git a/drivers/net/wireless/b43legacy/radio.c b/drivers/net/wireless/b43legacy/radio.c
> index 8961776..9501420 100644
> --- a/drivers/net/wireless/b43legacy/radio.c
> +++ b/drivers/net/wireless/b43legacy/radio.c
> @@ -1743,25 +1743,6 @@ u16 freq_r3A_value(u16 frequency)
>   	return value;
>   }
>
> -void b43legacy_radio_set_tx_iq(struct b43legacy_wldev *dev)
> -{
> -	static const u8 data_high[5] = { 0x00, 0x40, 0x80, 0x90, 0xD0 };
> -	static const u8 data_low[5]  = { 0x00, 0x01, 0x05, 0x06, 0x0A };
> -	u16 tmp = b43legacy_radio_read16(dev, 0x001E);
> -	int i;
> -	int j;
> -
> -	for (i = 0; i < 5; i++) {
> -		for (j = 0; j < 5; j++) {
> -			if (tmp == (data_high[i] | data_low[j])) {
> -				b43legacy_phy_write(dev, 0x0069, (i - j) << 8 |
> -						    0x00C0);
> -				return;
> -			}
> -		}
> -	}
> -}
> -
>   int b43legacy_radio_selectchannel(struct b43legacy_wldev *dev,
>   				  u8 channel,
>   				  int synthetic_pu_workaround)
> diff --git a/drivers/net/wireless/b43legacy/radio.h b/drivers/net/wireless/b43legacy/radio.h
> index bccb3d7..dd2976d 100644
> --- a/drivers/net/wireless/b43legacy/radio.h
> +++ b/drivers/net/wireless/b43legacy/radio.h
> @@ -92,7 +92,6 @@ void b43legacy_nrssi_hw_write(struct b43legacy_wldev *dev, u16 offset, s16 val);
>   void b43legacy_nrssi_hw_update(struct b43legacy_wldev *dev, u16 val);
>   void b43legacy_nrssi_mem_update(struct b43legacy_wldev *dev);
>
> -void b43legacy_radio_set_tx_iq(struct b43legacy_wldev *dev);
>   u16 b43legacy_radio_calibrationvalue(struct b43legacy_wldev *dev);
>
>   #endif /* B43legacy_RADIO_H_ */
>

^ permalink raw reply

* Re: [PATCH] net: wireless: ipw2x00: ipw2200.c:  Remove unused function
From: Kalle Valo @ 2015-01-06 18:04 UTC (permalink / raw)
  To: Rickard Strandqvist
  Cc: Stanislav Yakovlev, John W. Linville, linux-wireless, netdev,
	linux-kernel
In-Reply-To: <1419078599-31118-1-git-send-email-rickard_strandqvist@spectrumdigital.se>

Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> writes:

> Remove the function ipw_alive() that is not used anywhere.
>
> This was partially found by using a static code analysis program called cppcheck.
>
> Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>

Please improve the patch title and make it more unique like:

ipw2200: remove unused ipw_alive()

-- 
Kalle Valo

^ permalink raw reply

* Re: [PATCH] net: wireless: b43legacy: radio.c: Remove unused function
From: Rickard Strandqvist @ 2015-01-06 18:01 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Sedat Dilek, Larry Finger, Stefano Brivio, Network Development,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Linux Kernel Mailing List, b43-dev
In-Reply-To: <CACna6rxF1OKRZSpaL=+XzSV2kCpMH5LuB-B+eWV7zF8MJ-9tog-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

2015-01-05 7:34 GMT+01:00 Rafał Miłecki <zajec5-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
> On 3 January 2015 at 13:28, Rickard Strandqvist
> <rickard_strandqvist-IW2WV5XWFqGZkjO+N0TKoMugMpMbD5Xr@public.gmane.org> wrote:
>> 2015-01-02 22:34 GMT+01:00 Rafał Miłecki <zajec5-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>>>
>>> 1) I gave you Ack for the changes
>>> 2) You could drop "net: wireless: " or better use something Sedat proposed
>>
>> Nice, yes I miss the Ack :)
>>
>> I just got one more complaining about my subject-line in "net: wireless: "
>> I use some sed call for this, so it's easy to fix. I will now remove that
>> part hereinafter.
>
> Will you re-send this patch with a proper subject then, please?
>
> --
> Rafał


Hi Rafał

Ok, resent the patch.

Just a thought about this naming.
Now I have changed in radio.c and radio.h, what happens when I send
another patch for other files in the same directory, the caption will
then be identical.


Kind regards
Rickard Strandqvist
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH] b43legacy: Remove unused function
From: Rickard Strandqvist @ 2015-01-06 17:59 UTC (permalink / raw)
  To: Larry Finger, Stefano Brivio
  Cc: Rickard Strandqvist, Kalle Valo,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	b43-dev-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Remove the function b43legacy_radio_set_tx_iq() that is not used anywhere.

This was partially found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist-IW2WV5XWFqGZkjO+N0TKoMugMpMbD5Xr@public.gmane.org>
---
 drivers/net/wireless/b43legacy/radio.c |   19 -------------------
 drivers/net/wireless/b43legacy/radio.h |    1 -
 2 files changed, 20 deletions(-)

diff --git a/drivers/net/wireless/b43legacy/radio.c b/drivers/net/wireless/b43legacy/radio.c
index 8961776..9501420 100644
--- a/drivers/net/wireless/b43legacy/radio.c
+++ b/drivers/net/wireless/b43legacy/radio.c
@@ -1743,25 +1743,6 @@ u16 freq_r3A_value(u16 frequency)
 	return value;
 }
 
-void b43legacy_radio_set_tx_iq(struct b43legacy_wldev *dev)
-{
-	static const u8 data_high[5] = { 0x00, 0x40, 0x80, 0x90, 0xD0 };
-	static const u8 data_low[5]  = { 0x00, 0x01, 0x05, 0x06, 0x0A };
-	u16 tmp = b43legacy_radio_read16(dev, 0x001E);
-	int i;
-	int j;
-
-	for (i = 0; i < 5; i++) {
-		for (j = 0; j < 5; j++) {
-			if (tmp == (data_high[i] | data_low[j])) {
-				b43legacy_phy_write(dev, 0x0069, (i - j) << 8 |
-						    0x00C0);
-				return;
-			}
-		}
-	}
-}

^ permalink raw reply related

* Re: [PATCH net-next 1/3] net: add IPv4 routing FIB support for swdev
From: Scott Feldman @ 2015-01-06 17:51 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Netdev, Jiří Pírko, john fastabend, Thomas Graf,
	Jamal Hadi Salim, Andy Gospodarek, Roopa Prabhu
In-Reply-To: <1420552709.32369.50.camel@stressinduktion.org>

On Tue, Jan 6, 2015 at 5:58 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hi Scott,
>
> On Do, 2015-01-01 at 19:29 -0800, sfeldma@gmail.com wrote:
>> From: Scott Feldman <sfeldma@gmail.com>
>>
>> To offload IPv4 L3 routing functions to swdev device, the swdev device driver
>> implements two new ndo ops (ndo_switch_fib_ipv4_add/del).  The ops are called
>> by the core IPv4 FIB code when installing/removing FIB entries to/from the
>> kernel FIB.  On install, the driver should return 0 if FIB entry (route) can be
>> installed to device for offloading, -EOPNOTSUPP if route cannot be installed
>> due to device limitations, and other negative error code on failure to install
>> route to device.  On failure error code, the route is not installed to device,
>> and not installed in kernel FIB, and the return code is propagated back to the
>> user-space caller (via netlink).  An -EOPNOTSUPP error code is skipped for the
>> device but installed in the kernel FIB.
>>
>> The FIB entry (route) nexthop list is used to find the swdev device port to
>> anchor the ndo op call.  The route's fib_dev (the first nexthop's dev) is used
>> find the swdev port by recursively traversing the fib_dev's lower_dev list
>> until a swdev port is found.  The ndo op is called on this swdev port.
>>
>> Since the FIB entry is "naked" when push from the kernel, the driver/device
>> is responsible for resolving the route's nexthops to neighbor MAC addresses.
>> This can be done by the driver by monitoring NETEVENT_NEIGH_UPDATE
>> netevent notifier to watch for ARP activity.  Once a nexthop is resolved to
>> neighbor MAC address, it can be installed to the device and the device will
>> do the L3 routing offload in HW, for that nexthop.
>>
>> Signed-off-by: Scott Feldman <sfeldma@gmail.com>
>> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
>> ---
>>  include/linux/netdevice.h |   22 +++++++++++
>>  include/net/switchdev.h   |   18 +++++++++
>>  net/ipv4/fib_trie.c       |   17 ++++++++-
>>  net/switchdev/switchdev.c |   89 +++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 145 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 679e6e9..b66d22b 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -767,6 +767,8 @@ struct netdev_phys_item_id {
>>  typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
>>                                      struct sk_buff *skb);
>>
>> +struct fib_info;
>> +
>>  /*
>>   * This structure defines the management hooks for network devices.
>>   * The following hooks can be defined; unless noted otherwise, they are
>> @@ -1030,6 +1032,14 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
>>   * int (*ndo_switch_port_stp_update)(struct net_device *dev, u8 state);
>>   *   Called to notify switch device port of bridge port STP
>>   *   state change.
>> + * int (*ndo_sw_parent_fib_ipv4_add)(struct net_device *dev, __be32 dst,
>> + *                                int dst_len, struct fib_info *fi,
>> + *                                u8 tos, u8 type, u32 tb_id);
>> + *   Called to add IPv4 route to switch device.
>> + * int (*ndo_sw_parent_fib_ipv4_del)(struct net_device *dev, __be32 dst,
>> + *                                int dst_len, struct fib_info *fi,
>> + *                                u8 tos, u8 type, u32 tb_id);
>> + *   Called to delete IPv4 route from switch device.
>>   */
>>  struct net_device_ops {
>>       int                     (*ndo_init)(struct net_device *dev);
>> @@ -1189,6 +1199,18 @@ struct net_device_ops {
>>                                                           struct netdev_phys_item_id *psid);
>>       int                     (*ndo_switch_port_stp_update)(struct net_device *dev,
>>                                                             u8 state);
>> +     int                     (*ndo_switch_fib_ipv4_add)(struct net_device *dev,
>> +                                                        __be32 dst,
>> +                                                        int dst_len,
>> +                                                        struct fib_info *fi,
>> +                                                        u8 tos, u8 type,
>> +                                                        u32 tb_id);
>> +     int                     (*ndo_switch_fib_ipv4_del)(struct net_device *dev,
>> +                                                        __be32 dst,
>> +                                                        int dst_len,
>> +                                                        struct fib_info *fi,
>> +                                                        u8 tos, u8 type,
>> +                                                        u32 tb_id);
>>  #endif
>>  };
>
> At this point I would like to start the discussion about handling of the
> table ids/vrfs (again :) ): as I can see it, this version just passes
> table ids down to the driver layer and the rocker driver filters them by
> local/main table? This seems to be mostly fine for a first version but
> does not feel like it will integrate well with the rest of the linux
> networking ecosystem.
>
> Will hardware have the capabilities to do programmable matches like "ip
> rule" is currently capable to do? Should we plan for that? Do we want to
> support hardware which does support multiple tables/VRFs?

Good questions, thanks for bringing these up.

>
> I would like to present a first suggestion:
> My take on this would be strive towards an integration with ip-rule, so
> we add tables which will be offloaded to hardware. This happens only in
> situations where those tables will be the first match for incoming
> packets specified with an in-interface filter which has the capability
> to do the offloading (for example). The determination if the table is
> capable for hardware offloading should be done automatically, so if
> later hardware will be capable of doing ip rule like matches, we can
> just expand the check which flags the tables accordingly.

Sounds like a good suggestion to me.  We need to think about what the
swdev API looks like to the switch device driver.  Could you take a
stab at defining what integration with ip-rule looks like, code-wise,
at the swdev API layer?

With the rocker device we're prototyping with, the standard LPM on IP
dst is the normal L3 routing table structure.  Within that, table
priorities could be handled, so routes in one table take precedence
over routes in another table.  If we want to do policy routing, then
we'd need to use the ACL table in rocker to match on other fields
besides just IP dst.

> Anyway, if hardware supports multiple tables or VRFs, it is better to
> manage pass in a pointer where drivers can embed private data for
> management, I think.
>
> Thanks,
> Hannes
>
>
>

^ permalink raw reply

* Re: [net-next PATCH v1 08/11] net: rocker: add get flow API operation
From: John Fastabend @ 2015-01-06 17:50 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Thomas Graf, Jiří Pírko, Jamal Hadi Salim,
	simon.horman@netronome.com, Netdev, David S. Miller,
	Andy Gospodarek
In-Reply-To: <CAE4R7bDUiLGQnEHs1jemqjrQkzObN5jCkAa3LdWr90KGysrm5w@mail.gmail.com>

On 01/06/2015 08:57 AM, Scott Feldman wrote:
> On Tue, Jan 6, 2015 at 6:59 AM, John Fastabend <john.fastabend@gmail.com> wrote:
>> On 01/05/2015 11:40 PM, Scott Feldman wrote:
>>>
>>> On Wed, Dec 31, 2014 at 11:48 AM, John Fastabend
>>> <john.fastabend@gmail.com> wrote:
>>>>
>>>> Add operations to get flows. I wouldn't mind cleaning this code
>>>> up a bit but my first attempt to do this used macros which shortered
>>>> the code up but when I was done I decided it just made the code
>>>> unreadable and unmaintainable.
>>>>
>>>> I might think about it a bit more but this implementation albeit
>>>> a bit long and repeatative is easier to understand IMO.
>>>
>>>
>>> Dang, you put a lot of work into this one.
>>>
>>> Something doesn't feel right though.  In this case, rocker driver just
>>> happened to have cached all the flow/group stuff in hash tables in
>>> software, so you don't need to query thru to the device to extract the
>>> if_flow info.  What doesn't feel right is all the work need in the
>>> driver.  For each and every driver.  get_flows needs to go above
>>> driver, somehow.
>>
>>
>> Another option is to have a software cache in the flow_table.c I
>> was trying to avoid caching as I really don't expect 'get' operations
>> to be fast path and going to hardware seems good enough for me.
>> Other than its a bit annoying to write the mapping code.
>
> Caching in flow_table.c seems best to me as drivers/devices don't need
> to be involved and the cache can server multiple users of the API.
> Are there cases where the device could get flow table entries
> installed/deleted outside the API?  For example, if the device was
> learning MAC addresses, and did automatic table insertions.  We worked
> around that case with the recent L2 swdev support by pushing learned
> MAC addrs up to bridge's FDB so software and hardware tables stay
> synced.
>

OK I guess I'm convinced. I'll go ahead and cache the flow entries in
software. I'll work this into v2.

-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [net-next PATCH v1 04/11] rocker: add pipeline model for rocker switch
From: John Fastabend @ 2015-01-06 17:49 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Thomas Graf, Jiří Pírko, Jamal Hadi Salim,
	simon.horman@netronome.com, Netdev, David S. Miller,
	Andy Gospodarek
In-Reply-To: <CAE4R7bAHY8=TJ_hZ5X72bB5fca22zg2L8FgKJdRdRZGDWpSvcA@mail.gmail.com>

On 01/06/2015 09:16 AM, Scott Feldman wrote:
> On Tue, Jan 6, 2015 at 9:00 AM, John Fastabend <john.fastabend@gmail.com> wrote:
>> On 01/05/2015 11:01 PM, Scott Feldman wrote:
>>>> +
>>>> +struct net_flow_jump_table parse_ethernet[3] = {
>>>> +       {
>>>> +               .field = {
>>>> +                  .header = HEADER_ETHERNET,
>>>> +                  .field = HEADER_ETHERNET_ETHERTYPE,
>>>> +                  .type = NET_FLOW_FIELD_REF_ATTR_TYPE_U16,
>>>> +                  .value_u16 = 0x0800,
>
> ETH_P_IP, etc
>
>>>
>>>
>>> How is htons/ntohs conversions happening here?
>>
>>
>> my current stance is to leave everything in host order in the model
>> and let the drivers do conversions as needed. For example some drivers
>> want the vlan vid in host order others network order. I think its
>> more readable above then with hton*() throughout.
>
> Hmmm...I would argue adding htons/htonl makes it more readable in the
> sense that it's a reminder that this is a field in a network header,
> to be used for matching against packet headers, which use
> network-ordering.  Store the field in the order best for comparison
> with the raw pkt data.  Drivers may still need to do some conversion
> if the field is programmed in hardware in a diff order.
>

Easy enough here, but then when we set_flows what do we use
network-ordering or host? If it can be network-order in some
cases and host-order in others its hard to resolve pragmatically.
Humans at a CLI can most likely get it right for well known fields
such as VLAN IDs but for less common fields (maybe proprietary)
or management software it gets tricky.

I guess we could add a flag to indicate byte ordering.

-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* [PATCH] net: ethernet: cpsw: ignore VLAN ID 1
From: Felipe Balbi @ 2015-01-06 17:43 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Linux OMAP Mailing List, Felipe Balbi, stable,
	Mugunthan V N

CPSW completely hangs if we add, and later remove,
VLAN ID #1. What happens is that after removing
VLAN ID #1, no packets will be received by CPSW
rendering network unusable.

In order to "fix" the issue, we're returning -EINVAL
if anybody tries to add VLAN ID #1. While at that,
also filter out any ID > 4095 because we only have
12 bits for VLAN IDs.

Fixes: 3b72c2f (drivers: net:ethernet: cpsw: add support for VLAN)
Cc: <stable@vger.kernel.org> # v3.9+
Cc: Mugunthan V N <mugunthanvnm@ti.com>
Tested-by: Schuyler Patton <spatton@ti.com>
Signed-off-by: Felipe Balbi <balbi@ti.com>
---
 drivers/net/ethernet/ti/cpsw.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index e61ee8351272..028bb7f3de65 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1669,6 +1669,13 @@ static int cpsw_ndo_vlan_rx_add_vid(struct net_device *ndev,
 	if (vid == priv->data.default_vlan)
 		return 0;
 
+	/* NOTICE: CPSW does not support VID 1. We should
+	 * also filter out VID > 4095 as we only have 12
+	 * bits for VID entries
+	 */
+	if (vid == 1 || vid >= VLAN_N_VID)
+		return -EINVAL;
+
 	dev_info(priv->dev, "Adding vlanid %d to vlan filter\n", vid);
 	return cpsw_add_vlan_ale_entry(priv, vid);
 }
@@ -1682,6 +1689,13 @@ static int cpsw_ndo_vlan_rx_kill_vid(struct net_device *ndev,
 	if (vid == priv->data.default_vlan)
 		return 0;
 
+	/* NOTICE: CPSW does not support VID 1. We should
+	 * also filter out VID > 4095 as we only have 12
+	 * bits for VID entries
+	 */
+	if (vid == 1 || vid >= VLAN_N_VID)
+		return -EINVAL;
+
 	dev_info(priv->dev, "removing vlanid %d from vlan filter\n", vid);
 	ret = cpsw_ale_del_vlan(priv->ale, vid, 0);
 	if (ret != 0)
-- 
2.2.0

^ permalink raw reply related

* Re: [PATCH net-next v2 0/8] net: extend ethtool link mode bitmaps to 48 bits
From: David Decotigny @ 2015-01-06 17:36 UTC (permalink / raw)
  To: Amir Vadai
  Cc: Florian Fainelli, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
	Saeed Mahameed, David S. Miller, Jason Wang, Michael S. Tsirkin,
	Herbert Xu, Al Viro, Ben Hutchings, Masatake YAMATO, Xi Wang,
	Neil Horman, WANG Cong, Flavio Leitner, Tom Gundersen, Jiri Pirko,
	Vlad Yasevich, Eric W. Biederman, Venkata Duvvuru,
	Govindarajulu Varadarajan
In-Reply-To: <54ABE991.3040107@mellanox.com>

Interesting. It seems that the band-aid I was proposing is already
obsolete. We could still use the remaining reserved 16 bits to encode
5 more bits per mask (that is: 53 bits / mask total). But if I
understand you, it would allow us to survive only a few months longer,
as opposed to a few weeks.

One short-term alternative solution I can imagine is the following:
/* For example bitmap-based for variable length: */
struct ethtool_link_mode {
  __u32 cmd;
  __u8 autoneg :1;
  __u8 duplex :2;
 __u16 supported_nbits;
  __u16 advertising_nbits;
  __u16 lp_advertising_nbits;
  __u32 reserved[4];
  __u8 masks[0];
};
/* Or simpler, statically limited to 64b / mask, but easier to migrate
to for driver authors: */
struct ethtool_link_mode {
  __u32 cmd;
  __u8 autoneg :1;
  __u8 duplex :2;
   __u64 supported;
  __u64 advertising;
  __u64 lp_advertising;
  __u32 reserved[4];
};
#define ETHTOOL_GLINK_MODE 0x0000004a
#define ETHTOOL_SLINK_MODE 0x0000004b
struct ethtool_ops {
...
   int (*get_link_mode)(struct net_device *, struct ethtool_link_mode *);
   int (*set_link_mode)(struct net_device *, struct ethtool_link_mode *);
};

The same thing required for EEE.

I am not sure about moving the autoneg and duplex fields into the new
struct. Especially the "duplex" field.

Then the idea would be to update the ethtool user-space tool to try
get/set_link mode when reporting/changing the autoneg/advertising
settings.

Both will require significant effort from the driver authors.
Especially if the variable-length bitmap approach is preferred:
 - most drivers currently use simple bitwise arithmetic in their code,
and that goes far beyond get/set_settings, it is sometimes part of the
core driver logic. They will have to migrate to the bitmap API if they
want to use the larger bitmaps (note: no change needed if they are
happy with <= 32b / mask)
 - we would have to progressively deprecate the use of #define
ADVERTISED_1000baseT_Full in favor of an enum of the bit indices.

Any feedback welcome. In the meantime, I am going to propose a v3 of
current option with 53 bits / mask. I can also propose a prototype of
the scheme described above, please let me know.

On Tue, Jan 6, 2015 at 5:56 AM, Amir Vadai <amirv@mellanox.com> wrote:
> On 1/6/2015 4:54 AM, David Decotigny wrote:
>> This patch series increases the width of the supported/advertising
>> ethtool masks from 32 bits to 48. This should allow to breathe for a
>> couple more years (or... months?).
> Hi,
>
> Nice work.
> What worries me is that it is going to be for weeks more than years or
> months...
>
> Mellanox is about to release next month a driver for a new NIC, with 3
> new speeds * few link modes for each + new link modes for 10G.
> It seems that we will need to consume almost all the new bits.
>
> Amir

^ permalink raw reply

* Re: [PATCH net-next v3 0/5]: ixgbevf: Allow querying VFs RSS indirection table and key
From: Vlad Zolotarov @ 2015-01-06 17:30 UTC (permalink / raw)
  To: Greg Rose; +Cc: Gleb Natapov, netdev, Avi Kivity, jeffrey.t.kirsher
In-Reply-To: <CALgkqUqh4uMxhnTA9yOm2m0Z3tOZ1E2+Z8Y68Kv5HJ6vP=Fk2A@mail.gmail.com>


On 01/06/15 18:59, Greg Rose wrote:
> On Tue, Jan 6, 2015 at 2:58 AM, Vlad Zolotarov
> <vladz@cloudius-systems.com> wrote:
>> On 01/06/15 08:55, Gleb Natapov wrote:
>>> On Mon, Jan 05, 2015 at 03:54:52PM -0800, Greg Rose wrote:
>>>> On Mon, Jan 5, 2015 at 6:15 AM, Vlad Zolotarov
>>>> <vladz@cloudius-systems.com> wrote:
>>>>> Add the ethtool ops to VF driver to allow querying the RSS indirection
>>>>> table
>>>>> and RSS Random Key.
>>>>>
>>>>>    - PF driver: Add new VF-PF channel commands.
>>>>>    - VF driver: Utilize these new commands and add the corresponding
>>>>>                 ethtool callbacks.
>>>>>
>>>>> New in v3:
>>>>>      - Added a missing support for x550 devices.
>>>>>      - Mask the indirection table values according to PSRTYPE[n].RQPL.
>>>>>      - Minimized the number of added VF-PF commands.
>>>>>
>>>>> New in v2:
>>>>>      - Added a detailed description to patches 4 and 5.
>>>>>
>>>>> New in v1 (compared to RFC):
>>>>>      - Use "if-else" statement instead of a "switch-case" for a single
>>>>> option case.
>>>>>        More specifically: in cases where the newly added API version is
>>>>> the only one
>>>>>        allowed. We may consider using a "switch-case" back again when the
>>>>> list of
>>>>>        allowed API versions in these specific places grows up.
>>>>>
>>>>> Vlad Zolotarov (5):
>>>>>     ixgbe: Add a RETA query command to VF-PF channel API
>>>>>     ixgbevf: Add a RETA query code
>>>>>     ixgbe: Add GET_RSS_KEY command to VF-PF channel commands set
>>>>>     ixgbevf: Add RSS Key query code
>>>>>     ixgbevf: Add the appropriate ethtool ops to query RSS indirection
>>>>>       table and key
>>>>>
>>>>>    drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h      |  10 ++
>>>>>    drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c    |  91
>>>>> +++++++++++++++
>>>>>    drivers/net/ethernet/intel/ixgbevf/ethtool.c      |  43 +++++++
>>>>>    drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   4 +-
>>>>>    drivers/net/ethernet/intel/ixgbevf/mbx.h          |  10 ++
>>>>>    drivers/net/ethernet/intel/ixgbevf/vf.c           | 132
>>>>> ++++++++++++++++++++++
>>>>>    drivers/net/ethernet/intel/ixgbevf/vf.h           |   2 +
>>>>>    7 files changed, 291 insertions(+), 1 deletion(-)
>>>> I've given this code a review and I don't see a way to
>>>> set a policy in the PF driver as to whether this request should be
>>>> allowed or not.  We cannot enable this query by default - it is a
>>>> security risk. To make this acceptable you need to do a
>>>> couple of things.
>>>>
>>> Can you please elaborate on the security risk this information poses?
>>> Is toeplitz hash function cryptographically strong enough so that VF
>>> cannot reconstruct the hash key from hash result provided in packet
>>> descriptor? The abstract of this paper [1] claims it is not, but I do
>>> not have access to the full article unfortunately hence the question.
>>>
>>> [1]
>>> http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5503170&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D5503170
>>
>> I agree with Gleb here: when we started with just thinking about the idea of
>> this patch the possible security issue was the first thing that came into
>> our minds.
>> But eventually we couldn't come up with any security risk or attack example
>> that is exclusively caused by the fact that VF knows the indirection table
>> and/or RSS hash key of the PF.
>>
>> So, Greg, if we have missed anything and your have such an example could you
>> share it here, please?
> I don't have any examples and that is not my area of expertise.  But
> just because we can't think of a security risk or attack example
> doesn't mean there isn't one.
>
> Just add a policy hook so that the system admin can decide whether
> this information should be shared with the VFs and then we're covered
> for cases of both known and unknown exploits, risks, etc.

I absolutely disagree with u in regard of defining an RSS redirection 
table and RSS hash key as a security sensitive data. I don't know how u 
got to this conclusion.
However I don't want to argue about any longer. Let's move on.

Let's clarify one thing about this "hook". Do u agree that it should 
cover only the cases when VF shares the mentioned above data with PF - 
namely for all devices but x550?

thanks,
vlad

>
> Thanks,
>
> - Greg
>
>> Thanks,
>> vlad
>>
>>> --
>>>                          Gleb.
>>

^ permalink raw reply

* Re: TCP connection issues against Amazon S3
From: Eric Dumazet @ 2015-01-06 17:20 UTC (permalink / raw)
  To: Erik Grinaker; +Cc: linux-kernel, Yuchung Cheng, netdev
In-Reply-To: <D5C36777-4461-4B8D-B6C4-38B4276F42DC@bengler.no>

On Tue, 2015-01-06 at 16:11 +0000, Erik Grinaker wrote:
> > On 06 Jan 2015, at 16:04, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Tue, 2015-01-06 at 15:14 +0000, Erik Grinaker wrote:
> >> (CCing Yuchung, as his name comes up in the relevant commits)
> >> 
> >> After upgrading from Ubuntu 12.04.5 to 14.04.1 we have begun seeing
> >> intermittent TCP connection hangs for HTTP image requests against
> >> Amazon S3. 3-5% of requests will suddenly stall in the middle of the
> >> transfer before timing out. We see this problem across a range of
> >> servers, in several data centres and networks, all located in Norway.
> >> 
> >> A packet dump [1] shows repeated ACK retransmits for some of the
> >> requests. Using Ubuntu mainline kernels, we found the problem to have
> >> been introduced between 3.11.10 and 3.12.0, possibly in
> >> 0f7cc9a3c2bd89b15720dbf358e9b9e62af27126. The problem is also present
> >> in 3.18.1. Disabling tcp_window_scaling seems to solve it, but has
> >> obvious drawbacks for transfer speeds. Other sysctls do not seem to
> >> affect it.
> >> 
> >> I am not sure if this is fundamentally a kernel bug or a network
> >> issue, but we did not see this problem with older kernels.
> >> 
> >> [1] http://abstrakt.bengler.no/tcp-issues-s3.pcap.bz2
> > 
> > 
> > CC netdev
> > 
> > This looks like the bug we fixed here :
> > 
> > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=39bb5e62867de82b269b07df900165029b928359
> 
> Has that patch gone into a release? Because the problem persists with 3.18.1.

Patch is in 3.18.1 yes.

So thats a separate issue. 

Can you confirm pcap was taken at receiver (195.159.221.106), not sender
(54.231.136.74) , and on which host is running the 'buggy kernel' ?

If the sender is broken, changing the kernel on receiver wont help.

BTW not using sack (on 54.231.132.98) is terrible for performance in
lossy environments.

^ permalink raw reply

* [PATCH net-next] Driver: Vmxnet3: Make Rx ring 2 size configurable
From: Shrikrishna Khare @ 2015-01-06 17:20 UTC (permalink / raw)
  To: sbhatewara, pv-drivers, netdev, linux-kernel
  Cc: Shrikrishna Khare, Ramya Bolla

Rx ring 2 size can be configured by adjusting rx-jumbo parameter
of ethtool -G.

Signed-off-by: Ramya Bolla <bollar@vmware.com>
Signed-off-by: Shreyas Bhatewara <sbhatewara@vmware.com>
Signed-off-by: Shrikrishna Khare <skhare@vmware.com>
---
 drivers/net/vmxnet3/vmxnet3_defs.h    |    1 +
 drivers/net/vmxnet3/vmxnet3_drv.c     |    6 +++++-
 drivers/net/vmxnet3/vmxnet3_ethtool.c |   27 ++++++++++++++++++++-------
 drivers/net/vmxnet3/vmxnet3_int.h     |    6 ++++--
 4 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_defs.h b/drivers/net/vmxnet3/vmxnet3_defs.h
index 4d84912..25b6fa4 100644
--- a/drivers/net/vmxnet3/vmxnet3_defs.h
+++ b/drivers/net/vmxnet3/vmxnet3_defs.h
@@ -342,6 +342,7 @@ union Vmxnet3_GenericDesc {
 #define VMXNET3_TX_RING_MAX_SIZE   4096
 #define VMXNET3_TC_RING_MAX_SIZE   4096
 #define VMXNET3_RX_RING_MAX_SIZE   4096
+#define VMXNET3_RX_RING2_MAX_SIZE  2048
 #define VMXNET3_RC_RING_MAX_SIZE   8192
 
 /* a list of reasons for queue stop */
diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index afd2953..7af1f5c 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -2505,6 +2505,9 @@ vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
 	ring0_size = min_t(u32, ring0_size, VMXNET3_RX_RING_MAX_SIZE /
 			   sz * sz);
 	ring1_size = adapter->rx_queue[0].rx_ring[1].size;
+	ring1_size = (ring1_size + sz - 1) / sz * sz;
+	ring1_size = min_t(u32, ring1_size, VMXNET3_RX_RING2_MAX_SIZE /
+			   sz * sz);
 	comp_size = ring0_size + ring1_size;
 
 	for (i = 0; i < adapter->num_rx_queues; i++) {
@@ -2585,7 +2588,7 @@ vmxnet3_open(struct net_device *netdev)
 
 	err = vmxnet3_create_queues(adapter, adapter->tx_ring_size,
 				    adapter->rx_ring_size,
-				    VMXNET3_DEF_RX_RING_SIZE);
+				    adapter->rx_ring2_size);
 	if (err)
 		goto queue_err;
 
@@ -2964,6 +2967,7 @@ vmxnet3_probe_device(struct pci_dev *pdev,
 
 	adapter->tx_ring_size = VMXNET3_DEF_TX_RING_SIZE;
 	adapter->rx_ring_size = VMXNET3_DEF_RX_RING_SIZE;
+	adapter->rx_ring2_size = VMXNET3_DEF_RX_RING2_SIZE;
 
 	spin_lock_init(&adapter->cmd_lock);
 	adapter->adapter_pa = dma_map_single(&adapter->pdev->dev, adapter,
diff --git a/drivers/net/vmxnet3/vmxnet3_ethtool.c b/drivers/net/vmxnet3/vmxnet3_ethtool.c
index b7b5332..8a5a90e 100644
--- a/drivers/net/vmxnet3/vmxnet3_ethtool.c
+++ b/drivers/net/vmxnet3/vmxnet3_ethtool.c
@@ -447,12 +447,12 @@ vmxnet3_get_ringparam(struct net_device *netdev,
 	param->rx_max_pending = VMXNET3_RX_RING_MAX_SIZE;
 	param->tx_max_pending = VMXNET3_TX_RING_MAX_SIZE;
 	param->rx_mini_max_pending = 0;
-	param->rx_jumbo_max_pending = 0;
+	param->rx_jumbo_max_pending = VMXNET3_RX_RING2_MAX_SIZE;
 
 	param->rx_pending = adapter->rx_ring_size;
 	param->tx_pending = adapter->tx_ring_size;
 	param->rx_mini_pending = 0;
-	param->rx_jumbo_pending = 0;
+	param->rx_jumbo_pending = adapter->rx_ring2_size;
 }
 
 
@@ -461,7 +461,7 @@ vmxnet3_set_ringparam(struct net_device *netdev,
 		      struct ethtool_ringparam *param)
 {
 	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
-	u32 new_tx_ring_size, new_rx_ring_size;
+	u32 new_tx_ring_size, new_rx_ring_size, new_rx_ring2_size;
 	u32 sz;
 	int err = 0;
 
@@ -473,6 +473,10 @@ vmxnet3_set_ringparam(struct net_device *netdev,
 						VMXNET3_RX_RING_MAX_SIZE)
 		return -EINVAL;
 
+	if (param->rx_jumbo_pending == 0 ||
+	    param->rx_jumbo_pending > VMXNET3_RX_RING2_MAX_SIZE)
+		return -EINVAL;
+
 	/* if adapter not yet initialized, do nothing */
 	if (adapter->rx_buf_per_pkt == 0) {
 		netdev_err(netdev, "adapter not completely initialized, "
@@ -500,8 +504,15 @@ vmxnet3_set_ringparam(struct net_device *netdev,
 							   sz) != 0)
 		return -EINVAL;
 
-	if (new_tx_ring_size == adapter->tx_queue[0].tx_ring.size &&
-	    new_rx_ring_size == adapter->rx_queue[0].rx_ring[0].size) {
+	/* ring2 has to be a multiple of VMXNET3_RING_SIZE_ALIGN */
+	new_rx_ring2_size = (param->rx_jumbo_pending + VMXNET3_RING_SIZE_MASK) &
+				~VMXNET3_RING_SIZE_MASK;
+	new_rx_ring2_size = min_t(u32, new_rx_ring2_size,
+				  VMXNET3_RX_RING2_MAX_SIZE);
+
+	if (new_tx_ring_size == adapter->tx_ring_size &&
+	    new_rx_ring_size == adapter->rx_ring_size &&
+	    new_rx_ring2_size == adapter->rx_ring2_size) {
 		return 0;
 	}
 
@@ -522,7 +533,7 @@ vmxnet3_set_ringparam(struct net_device *netdev,
 		vmxnet3_rq_destroy_all(adapter);
 
 		err = vmxnet3_create_queues(adapter, new_tx_ring_size,
-			new_rx_ring_size, VMXNET3_DEF_RX_RING_SIZE);
+			new_rx_ring_size, new_rx_ring2_size);
 
 		if (err) {
 			/* failed, most likely because of OOM, try default
@@ -530,11 +541,12 @@ vmxnet3_set_ringparam(struct net_device *netdev,
 			netdev_err(netdev, "failed to apply new sizes, "
 				   "try the default ones\n");
 			new_rx_ring_size = VMXNET3_DEF_RX_RING_SIZE;
+			new_rx_ring2_size = VMXNET3_DEF_RX_RING2_SIZE;
 			new_tx_ring_size = VMXNET3_DEF_TX_RING_SIZE;
 			err = vmxnet3_create_queues(adapter,
 						    new_tx_ring_size,
 						    new_rx_ring_size,
-						    VMXNET3_DEF_RX_RING_SIZE);
+						    new_rx_ring2_size);
 			if (err) {
 				netdev_err(netdev, "failed to create queues "
 					   "with default sizes. Closing it\n");
@@ -549,6 +561,7 @@ vmxnet3_set_ringparam(struct net_device *netdev,
 	}
 	adapter->tx_ring_size = new_tx_ring_size;
 	adapter->rx_ring_size = new_rx_ring_size;
+	adapter->rx_ring2_size = new_rx_ring2_size;
 
 out:
 	clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
diff --git a/drivers/net/vmxnet3/vmxnet3_int.h b/drivers/net/vmxnet3/vmxnet3_int.h
index 5f0199f..048f020 100644
--- a/drivers/net/vmxnet3/vmxnet3_int.h
+++ b/drivers/net/vmxnet3/vmxnet3_int.h
@@ -69,10 +69,10 @@
 /*
  * Version numbers
  */
-#define VMXNET3_DRIVER_VERSION_STRING   "1.2.1.0-k"
+#define VMXNET3_DRIVER_VERSION_STRING   "1.3.1.0-k"
 
 /* a 32-bit int, each byte encode a verion number in VMXNET3_DRIVER_VERSION */
-#define VMXNET3_DRIVER_VERSION_NUM      0x01020100
+#define VMXNET3_DRIVER_VERSION_NUM      0x01030100
 
 #if defined(CONFIG_PCI_MSI)
 	/* RSS only makes sense if MSI-X is supported. */
@@ -352,6 +352,7 @@ struct vmxnet3_adapter {
 	/* Ring sizes */
 	u32 tx_ring_size;
 	u32 rx_ring_size;
+	u32 rx_ring2_size;
 
 	struct work_struct work;
 
@@ -384,6 +385,7 @@ struct vmxnet3_adapter {
 /* must be a multiple of VMXNET3_RING_SIZE_ALIGN */
 #define VMXNET3_DEF_TX_RING_SIZE    512
 #define VMXNET3_DEF_RX_RING_SIZE    256
+#define VMXNET3_DEF_RX_RING2_SIZE   128
 
 #define VMXNET3_MAX_ETH_HDR_SIZE    22
 #define VMXNET3_MAX_SKB_BUF_SIZE    (3*1024)
-- 
1.7.4.1

^ permalink raw reply related

* Re: [net-next PATCH v1 04/11] rocker: add pipeline model for rocker switch
From: Scott Feldman @ 2015-01-06 17:16 UTC (permalink / raw)
  To: John Fastabend
  Cc: Thomas Graf, Jiří Pírko, Jamal Hadi Salim,
	simon.horman@netronome.com, Netdev, David S. Miller,
	Andy Gospodarek
In-Reply-To: <54AC149A.6040401@gmail.com>

On Tue, Jan 6, 2015 at 9:00 AM, John Fastabend <john.fastabend@gmail.com> wrote:
> On 01/05/2015 11:01 PM, Scott Feldman wrote:
>>> +
>>> +struct net_flow_jump_table parse_ethernet[3] = {
>>> +       {
>>> +               .field = {
>>> +                  .header = HEADER_ETHERNET,
>>> +                  .field = HEADER_ETHERNET_ETHERTYPE,
>>> +                  .type = NET_FLOW_FIELD_REF_ATTR_TYPE_U16,
>>> +                  .value_u16 = 0x0800,

ETH_P_IP, etc

>>
>>
>> How is htons/ntohs conversions happening here?
>
>
> my current stance is to leave everything in host order in the model
> and let the drivers do conversions as needed. For example some drivers
> want the vlan vid in host order others network order. I think its
> more readable above then with hton*() throughout.

Hmmm...I would argue adding htons/htonl makes it more readable in the
sense that it's a reminder that this is a field in a network header,
to be used for matching against packet headers, which use
network-ordering.  Store the field in the order best for comparison
with the raw pkt data.  Drivers may still need to do some conversion
if the field is programmed in hardware in a diff order.

^ permalink raw reply

* Re: route/max_size sysctl in ipv4
From: Ani Sinha @ 2015-01-06 17:11 UTC (permalink / raw)
  To: Pádraig Brady; +Cc: David Miller, netdev@vger.kernel.org
In-Reply-To: <54ABD80F.7070602@draigBrady.com>

On Tue, Jan 6, 2015 at 4:41 AM, Pádraig Brady <P@draigbrady.com> wrote:
> On 06/01/15 00:56, Ani Sinha wrote:
>> On Mon, Jan 5, 2015 at 4:51 PM, David Miller <davem@davemloft.net> wrote:
>>> From: Ani Sinha <ani@arista.com>
>>> Date: Mon, 5 Jan 2015 16:43:30 -0800
>>>
>>>> On Mon, Jan 5, 2015 at 4:36 PM, David Miller <davem@davemloft.net> wrote:
>>>>> From: Ani Sinha <ani@arista.com>
>>>>> Date: Mon, 5 Jan 2015 15:48:11 -0800
>>>>>
>>>>>> I am looking at the code and it looks like since the route cache for
>>>>>> ipv4 was removed from the kernel, this sysctl parameter no longer
>>>>>> serves the same purpose. It does not look like it is even used in the
>>>>>> ipv4/route.c module. Is there an equivalent sysctl parameter limiting
>>>>>> the number of route entries in the kernel? Or is there now no
>>>>>> mechanism to limit the number of route entries?
>>>>>
>>>>> There is nothing to limit, since the cache was removed.
>>>>
>>>> Shouldn't the documentation be updated to reflect that? Also what's
>>>> the point of having a dummy variable that does nothing? Should we not
>>>> simply remove it?
>>>
>>> There is nothing to update, the behavior is completely transparent.
>>> Absolutely no cache entries exist, therefore the limit cannot be
>>> reached.
>>
>> I disagree. You are advertising a feature in an official documentation
>> that simply does not exist for ipv4. This is very confusing. If I did
>> not dig into the code, I wouldn't know that this particular knob is a
>> noop since the time the route cache was removed.
>
> You can't change APIs with impunity.
>
>>> The sysctl is kept so that scripts reading it don't suddenly stop
>>> working.  We can't just remove sysctl values.
>
> Perhaps /proc/sys/net/ipv4/route/max_size should always return 0 when read,
> and discard written values?

Why can't se simply change the documentation to reflect the fact that
this sysctl is no longer in operation?

^ permalink raw reply

* Re: [patch iproute2 2/2] iplink: print out addrgenmode attribute
From: Jiri Pirko @ 2015-01-06 17:08 UTC (permalink / raw)
  To: Thomas Haller; +Cc: netdev, stephen
In-Reply-To: <1420563153.2527.14.camel@hal>

Tue, Jan 06, 2015 at 05:52:33PM CET, thaller@redhat.com wrote:
>On Tue, 2015-01-06 at 17:23 +0100, Jiri Pirko wrote:
>> addrgenmode is currently write only by ip. So display this information
>> if provided by kernel as well.
>
>>  
>> +static void print_af_spec(FILE *fp, struct rtattr *af_spec_attr)
>> +{
>> +	struct rtattr *inet6_attr;
>> +	struct rtattr *tb[IFLA_INET6_MAX + 1];
>> +
>> +	inet6_attr = parse_rtattr_one_nested(AF_INET6, af_spec_attr);
>> +	if (!inet6_attr)
>> +		return;
>> +
>> +	parse_rtattr_nested(tb, IFLA_INET6_MAX, inet6_attr);
>> +
>> +	if (tb[IFLA_INET6_ADDR_GEN_MODE]) {
>> +		switch (rta_getattr_u8(tb[IFLA_INET6_ADDR_GEN_MODE])) {
>> +		case IN6_ADDR_GEN_MODE_EUI64:
>> +			fprintf(fp, "addrgenmode eui64 ");
>
>eui64 is the default and the behavior of older kernels.
>
>I dunno, would it be better not to print the default case?


This prints only when show_details is on. So I believe it is ok to print
the default value of addrgenmode (same is done as for other things)

^ permalink raw reply

* Re: [net 1/3] i40e: fix un-necessary Tx hangs
From: Jesse Brandeburg @ 2015-01-06 17:04 UTC (permalink / raw)
  To: Jeff Kirsher, davem; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <1420533864-13125-2-git-send-email-jeffrey.t.kirsher@intel.com>

On Tue, 6 Jan 2015 00:44:22 -0800
Jeff Kirsher <jeffrey.t.kirsher@intel.com> wrote:

> From: Jesse Brandeburg <jesse.brandeburg@intel.com>
> 
> When the driver was polling with interrupts disabled the hardware
> will occasionally not write back descriptors.  This patch causes
> the driver to detect this situation and force an interrupt to
> fire which will flush the stuck descriptor.  Does not conflict
> with NAPI because if we are already polling the napi_schedule is
> ignored.  Additionally the extra interrupts are rate limited, so
> don't cause a burden to the CPU.

Dave, sorry to do this but we discovered a bug in this patch (where
this patch impacts performance) and didn't interrupt Jeff before he sent
it (my mistake) I will prep a v2 for Jeff.

Please do not apply.

^ permalink raw reply

* Re: IPsec workshop at netdev01?
From: Florian Westphal @ 2015-01-06 17:00 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: netdev, Jamal Hadi Salim, Herbert Xu, David Miller
In-Reply-To: <20150106101936.GC31458@secunet.com>

Steffen Klassert <steffen.klassert@secunet.com> wrote:
> - We still lack a 32/64 bit compatibiltiy layer for IPsec, this issue
>   comes up from time to time. Some solutions were proposed in the past
>   but all had problems. The current behaviour is broken if someone tries
>   to configure IPsec with 32 bit tools on a 64 bit machine. Can we get
>   this right somehow or is it better to just return an error in this case?

FWIW I think
http://patchwork.ozlabs.org/patch/49465/

came closest to achieving full CONFIG_COMPAT support; since netlink is
no longer async now I'm not sure we'd still need additonal 32-compat syscalls
to make compat work for all cases.

So "its ugly as hell" is probably the only problem that is hard to avoid ;-)

^ permalink raw reply

* Re: [net-next PATCH v1 04/11] rocker: add pipeline model for rocker switch
From: John Fastabend @ 2015-01-06 17:00 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Thomas Graf, Jiří Pírko, Jamal Hadi Salim,
	simon.horman@netronome.com, Netdev, David S. Miller,
	Andy Gospodarek
In-Reply-To: <CAE4R7bB6-bKh6EaZXpsXQeLx5A=MCJTCadYQoP2K5cUfOEUiEw@mail.gmail.com>

On 01/05/2015 11:01 PM, Scott Feldman wrote:
> On Wed, Dec 31, 2014 at 11:47 AM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> This adds rocker support for the net_flow_get_* operations. With this
>> we can interrogate rocker.
>>
>> Here we see that for static configurations enabling the get operations
>> is simply a matter of defining a pipeline model and returning the
>> structures for the core infrastructure to encapsulate into netlink
>> messages.
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---
>>   drivers/net/ethernet/rocker/rocker.c          |   35 +
>>   drivers/net/ethernet/rocker/rocker_pipeline.h |  673 +++++++++++++++++++++++++
>>   2 files changed, 708 insertions(+)
>>   create mode 100644 drivers/net/ethernet/rocker/rocker_pipeline.h
>>
>> diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
>> index fded127..4c6787a 100644
>> --- a/drivers/net/ethernet/rocker/rocker.c
>> +++ b/drivers/net/ethernet/rocker/rocker.c
>> @@ -36,6 +36,7 @@
>>   #include <generated/utsrelease.h>
>>
>>   #include "rocker.h"
>> +#include "rocker_pipeline.h"
>>
>>   static const char rocker_driver_name[] = "rocker";
>>
>> @@ -3780,6 +3781,33 @@ static int rocker_port_switch_port_stp_update(struct net_device *dev, u8 state)
>>          return rocker_port_stp_update(rocker_port, state);
>>   }
>>
>> +#ifdef CONFIG_NET_FLOW_TABLES
>
> Can this #ifdef test be moved out of driver?  The if_flow core code
> can stub out operations if CONFIG_NET_FLOW_TABLES isn't defined.

here sure this is easy enough.

>
>> +static struct net_flow_table **rocker_get_tables(struct net_device *d)
>> +{
>> +       return rocker_table_list;
>> +}
>> +
>> +static struct net_flow_header **rocker_get_headers(struct net_device *d)
>> +{
>> +       return rocker_header_list;
>> +}
>> +
>> +static struct net_flow_action **rocker_get_actions(struct net_device *d)
>> +{
>> +       return rocker_action_list;
>> +}
>> +
>> +static struct net_flow_tbl_node **rocker_get_tgraph(struct net_device *d)
>> +{
>> +       return rocker_table_nodes;
>> +}
>> +
>> +static struct net_flow_hdr_node **rocker_get_hgraph(struct net_device *d)
>> +{
>> +       return rocker_header_nodes;
>> +}
>> +#endif
>> +
>>   static const struct net_device_ops rocker_port_netdev_ops = {
>>          .ndo_open                       = rocker_port_open,
>>          .ndo_stop                       = rocker_port_stop,
>> @@ -3794,6 +3822,13 @@ static const struct net_device_ops rocker_port_netdev_ops = {
>>          .ndo_bridge_getlink             = rocker_port_bridge_getlink,
>>          .ndo_switch_parent_id_get       = rocker_port_switch_parent_id_get,
>>          .ndo_switch_port_stp_update     = rocker_port_switch_port_stp_update,
>> +#ifdef CONFIG_NET_FLOW_TABLES
>
> same comment here

We could although then we need some 'depends on' logic in Kconfig
to be sure CONFIG_NET_FLOW_TABLES is enabled. I think we want to
be able to strip this code out of the main core code paths when
its not needed which means wrapping it in the ifdef in netdevice.h

>
>> +       .ndo_flow_get_tables            = rocker_get_tables,
>> +       .ndo_flow_get_headers           = rocker_get_headers,
>> +       .ndo_flow_get_actions           = rocker_get_actions,
>> +       .ndo_flow_get_tbl_graph         = rocker_get_tgraph,
>> +       .ndo_flow_get_hdr_graph         = rocker_get_hgraph,
>> +#endif
>>   };
>>
>>   /********************
>> diff --git a/drivers/net/ethernet/rocker/rocker_pipeline.h b/drivers/net/ethernet/rocker/rocker_pipeline.h
>> new file mode 100644
>> index 0000000..9544339
>> --- /dev/null
>> +++ b/drivers/net/ethernet/rocker/rocker_pipeline.h
>
> Add standard header info...copyright/license.
>
>> @@ -0,0 +1,673 @@
>> +#ifndef _MY_PIPELINE_H_
>> +#define _MY_PIPELINE_H_
>
> _ROCKER_PIPELINE_H_
>
>> +
>> +#include <linux/if_flow.h>
>> +
>> +/* header definition */
>> +#define HEADER_ETHERNET_SRC_MAC 1
>> +#define HEADER_ETHERNET_DST_MAC 2
>> +#define HEADER_ETHERNET_ETHERTYPE 3
>
> Use enum?
>

yep changed this all to enums, array_size macros and use a
simpler null terminator throughout. Thanks.

[...]

>> +/* headers graph */
>> +#define HEADER_INSTANCE_ETHERNET 1
>> +#define HEADER_INSTANCE_VLAN_OUTER 2
>> +#define HEADER_INSTANCE_IPV4 3
>> +#define HEADER_INSTANCE_IN_LPORT 4
>> +#define HEADER_INSTANCE_GOTO_TABLE 5
>> +#define HEADER_INSTANCE_GROUP_ID 6
>> +
>> +struct net_flow_jump_table parse_ethernet[3] = {
>> +       {
>> +               .field = {
>> +                  .header = HEADER_ETHERNET,
>> +                  .field = HEADER_ETHERNET_ETHERTYPE,
>> +                  .type = NET_FLOW_FIELD_REF_ATTR_TYPE_U16,
>> +                  .value_u16 = 0x0800,
>
> How is htons/ntohs conversions happening here?

my current stance is to leave everything in host order in the model
and let the drivers do conversions as needed. For example some drivers
want the vlan vid in host order others network order. I think its
more readable above then with hton*() throughout.

>
> Since these are network header fields, seems you want htons(0x0800).
>
>> +               },
>> +               .node = HEADER_INSTANCE_IPV4,
>> +       },
>> +       {
>> +               .field = {
>> +                  .header = HEADER_ETHERNET,
>> +                  .field = HEADER_ETHERNET_ETHERTYPE,
>> +                  .type = NET_FLOW_FIELD_REF_ATTR_TYPE_U16,
>> +                  .value_u16 = 0x8100,
>> +               },
>> +               .node = HEADER_INSTANCE_VLAN_OUTER,
>> +       },
>> +       {
>> +               .field = {0},
>> +               .node = 0,
>> +       },
>
> just use NULL,

Yep done throughout.

[...]

>> +/* table definition */
>> +struct net_flow_field_ref matches_ig_port[2] = {
>> +       { .instance = HEADER_INSTANCE_IN_LPORT,
>> +         .header = HEADER_METADATA,
>> +         .field = HEADER_METADATA_IN_LPORT,
>> +         .mask_type = NET_FLOW_MASK_TYPE_LPM},
>
> Need other mask type, not LPM.

v2 will have the additional mask types.

>
>
>> +struct net_flow_table *rocker_table_list[7] = {
>> +       &ingress_port_table,
>> +       &vlan_table,
>> +       &term_mac_table,
>> +       &ucast_routing_table,
>> +       &bridge_table,
>> +       &acl_table,
>> +       &null_table,
>> +};
>
> cool stuff

bit of work to get here but sort of fun to start defining
pipelines like this.

[...]

>> +struct net_flow_tbl_node *rocker_table_nodes[7] = {
>> +       &table_node_ingress_port,
>> +       &table_node_vlan,
>> +       &table_node_term_mac,
>> +       &table_node_ucast_routing,
>> +       &table_node_bridge,
>> +       &table_node_acl,
>> +       &table_node_nil,
>> +};
>
> Cool...getting tired but will review this again in v2

Great thanks for the detailed feedback.

>
>> +#endif /*_MY_PIPELINE_H*/
>
> ROCKER
>
>>


-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH net-next v3 0/5]: ixgbevf: Allow querying VFs RSS indirection table and key
From: Greg Rose @ 2015-01-06 16:59 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: Gleb Natapov, netdev, Avi Kivity, jeffrey.t.kirsher
In-Reply-To: <54ABBFEB.9010105@cloudius-systems.com>

On Tue, Jan 6, 2015 at 2:58 AM, Vlad Zolotarov
<vladz@cloudius-systems.com> wrote:
>
> On 01/06/15 08:55, Gleb Natapov wrote:
>>
>> On Mon, Jan 05, 2015 at 03:54:52PM -0800, Greg Rose wrote:
>>>
>>> On Mon, Jan 5, 2015 at 6:15 AM, Vlad Zolotarov
>>> <vladz@cloudius-systems.com> wrote:
>>>>
>>>> Add the ethtool ops to VF driver to allow querying the RSS indirection
>>>> table
>>>> and RSS Random Key.
>>>>
>>>>   - PF driver: Add new VF-PF channel commands.
>>>>   - VF driver: Utilize these new commands and add the corresponding
>>>>                ethtool callbacks.
>>>>
>>>> New in v3:
>>>>     - Added a missing support for x550 devices.
>>>>     - Mask the indirection table values according to PSRTYPE[n].RQPL.
>>>>     - Minimized the number of added VF-PF commands.
>>>>
>>>> New in v2:
>>>>     - Added a detailed description to patches 4 and 5.
>>>>
>>>> New in v1 (compared to RFC):
>>>>     - Use "if-else" statement instead of a "switch-case" for a single
>>>> option case.
>>>>       More specifically: in cases where the newly added API version is
>>>> the only one
>>>>       allowed. We may consider using a "switch-case" back again when the
>>>> list of
>>>>       allowed API versions in these specific places grows up.
>>>>
>>>> Vlad Zolotarov (5):
>>>>    ixgbe: Add a RETA query command to VF-PF channel API
>>>>    ixgbevf: Add a RETA query code
>>>>    ixgbe: Add GET_RSS_KEY command to VF-PF channel commands set
>>>>    ixgbevf: Add RSS Key query code
>>>>    ixgbevf: Add the appropriate ethtool ops to query RSS indirection
>>>>      table and key
>>>>
>>>>   drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h      |  10 ++
>>>>   drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c    |  91
>>>> +++++++++++++++
>>>>   drivers/net/ethernet/intel/ixgbevf/ethtool.c      |  43 +++++++
>>>>   drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   4 +-
>>>>   drivers/net/ethernet/intel/ixgbevf/mbx.h          |  10 ++
>>>>   drivers/net/ethernet/intel/ixgbevf/vf.c           | 132
>>>> ++++++++++++++++++++++
>>>>   drivers/net/ethernet/intel/ixgbevf/vf.h           |   2 +
>>>>   7 files changed, 291 insertions(+), 1 deletion(-)
>>>
>>> I've given this code a review and I don't see a way to
>>> set a policy in the PF driver as to whether this request should be
>>> allowed or not.  We cannot enable this query by default - it is a
>>> security risk. To make this acceptable you need to do a
>>> couple of things.
>>>
>> Can you please elaborate on the security risk this information poses?
>> Is toeplitz hash function cryptographically strong enough so that VF
>> cannot reconstruct the hash key from hash result provided in packet
>> descriptor? The abstract of this paper [1] claims it is not, but I do
>> not have access to the full article unfortunately hence the question.
>>
>> [1]
>> http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5503170&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D5503170
>
>
> I agree with Gleb here: when we started with just thinking about the idea of
> this patch the possible security issue was the first thing that came into
> our minds.
> But eventually we couldn't come up with any security risk or attack example
> that is exclusively caused by the fact that VF knows the indirection table
> and/or RSS hash key of the PF.
>
> So, Greg, if we have missed anything and your have such an example could you
> share it here, please?

I don't have any examples and that is not my area of expertise.  But
just because we can't think of a security risk or attack example
doesn't mean there isn't one.

Just add a policy hook so that the system admin can decide whether
this information should be shared with the VFs and then we're covered
for cases of both known and unknown exploits, risks, etc.

Thanks,

- Greg

>
> Thanks,
> vlad
>
>>
>> --
>>                         Gleb.
>
>

^ permalink raw reply

* Re: [net-next PATCH v1 08/11] net: rocker: add get flow API operation
From: Scott Feldman @ 2015-01-06 16:57 UTC (permalink / raw)
  To: John Fastabend
  Cc: Thomas Graf, Jiří Pírko, Jamal Hadi Salim,
	simon.horman@netronome.com, Netdev, David S. Miller,
	Andy Gospodarek
In-Reply-To: <54ABF85F.2080105@gmail.com>

On Tue, Jan 6, 2015 at 6:59 AM, John Fastabend <john.fastabend@gmail.com> wrote:
> On 01/05/2015 11:40 PM, Scott Feldman wrote:
>>
>> On Wed, Dec 31, 2014 at 11:48 AM, John Fastabend
>> <john.fastabend@gmail.com> wrote:
>>>
>>> Add operations to get flows. I wouldn't mind cleaning this code
>>> up a bit but my first attempt to do this used macros which shortered
>>> the code up but when I was done I decided it just made the code
>>> unreadable and unmaintainable.
>>>
>>> I might think about it a bit more but this implementation albeit
>>> a bit long and repeatative is easier to understand IMO.
>>
>>
>> Dang, you put a lot of work into this one.
>>
>> Something doesn't feel right though.  In this case, rocker driver just
>> happened to have cached all the flow/group stuff in hash tables in
>> software, so you don't need to query thru to the device to extract the
>> if_flow info.  What doesn't feel right is all the work need in the
>> driver.  For each and every driver.  get_flows needs to go above
>> driver, somehow.
>
>
> Another option is to have a software cache in the flow_table.c I
> was trying to avoid caching as I really don't expect 'get' operations
> to be fast path and going to hardware seems good enough for me.
> Other than its a bit annoying to write the mapping code.

Caching in flow_table.c seems best to me as drivers/devices don't need
to be involved and the cache can server multiple users of the API.
Are there cases where the device could get flow table entries
installed/deleted outside the API?  For example, if the device was
learning MAC addresses, and did automatic table insertions.  We worked
around that case with the recent L2 swdev support by pushing learned
MAC addrs up to bridge's FDB so software and hardware tables stay
synced.

^ permalink raw reply

* Re: [patch iproute2 2/2] iplink: print out addrgenmode attribute
From: Thomas Haller @ 2015-01-06 16:52 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, stephen
In-Reply-To: <1420561426-3118-2-git-send-email-jiri@resnulli.us>

[-- Attachment #1: Type: text/plain, Size: 810 bytes --]

On Tue, 2015-01-06 at 17:23 +0100, Jiri Pirko wrote:
> addrgenmode is currently write only by ip. So display this information
> if provided by kernel as well.

>  
> +static void print_af_spec(FILE *fp, struct rtattr *af_spec_attr)
> +{
> +	struct rtattr *inet6_attr;
> +	struct rtattr *tb[IFLA_INET6_MAX + 1];
> +
> +	inet6_attr = parse_rtattr_one_nested(AF_INET6, af_spec_attr);
> +	if (!inet6_attr)
> +		return;
> +
> +	parse_rtattr_nested(tb, IFLA_INET6_MAX, inet6_attr);
> +
> +	if (tb[IFLA_INET6_ADDR_GEN_MODE]) {
> +		switch (rta_getattr_u8(tb[IFLA_INET6_ADDR_GEN_MODE])) {
> +		case IN6_ADDR_GEN_MODE_EUI64:
> +			fprintf(fp, "addrgenmode eui64 ");

eui64 is the default and the behavior of older kernels.

I dunno, would it be better not to print the default case?


Thomas

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* [patch iproute2 1/2] libnetlink: add parse_rtattr_one_nested helper
From: Jiri Pirko @ 2015-01-06 16:23 UTC (permalink / raw)
  To: netdev; +Cc: stephen, thaller

Sometimes, it is more convenient to get only one specific nested attribute by
type. For example for IFLA_AF_SPEC where type is address family (AF_INET6).
So add this helper for this purpose.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/libnetlink.h |  4 ++++
 lib/libnetlink.c     | 12 ++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index fe7d5d3..f0faf2d 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -80,11 +80,15 @@ extern int parse_rtattr(struct rtattr *tb[], int max, struct rtattr *rta, int le
 extern int parse_rtattr_flags(struct rtattr *tb[], int max, struct rtattr *rta,
 			      int len, unsigned short flags);
 extern int parse_rtattr_byindex(struct rtattr *tb[], int max, struct rtattr *rta, int len);
+extern struct rtattr *parse_rtattr_one(int type, struct rtattr *rta, int len);
 extern int __parse_rtattr_nested_compat(struct rtattr *tb[], int max, struct rtattr *rta, int len);
 
 #define parse_rtattr_nested(tb, max, rta) \
 	(parse_rtattr((tb), (max), RTA_DATA(rta), RTA_PAYLOAD(rta)))
 
+#define parse_rtattr_one_nested(type, rta) \
+	(parse_rtattr_one(type, RTA_DATA(rta), RTA_PAYLOAD(rta)))
+
 #define parse_rtattr_nested_compat(tb, max, rta, data, len) \
 	({ data = RTA_PAYLOAD(rta) >= len ? RTA_DATA(rta) : NULL;	\
 		__parse_rtattr_nested_compat(tb, max, rta, len); })
diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index 8d504a9..8e70263 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -701,6 +701,18 @@ int parse_rtattr_byindex(struct rtattr *tb[], int max, struct rtattr *rta, int l
 	return i;
 }
 
+struct rtattr *parse_rtattr_one(int type, struct rtattr *rta, int len)
+{
+	while (RTA_OK(rta, len)) {
+		if (rta->rta_type == type)
+			return rta;
+		rta = RTA_NEXT(rta, len);
+	}
+	if (len)
+		fprintf(stderr, "!!!Deficit %d, rta_len=%d\n", len, rta->rta_len);
+	return NULL;
+}
+
 int __parse_rtattr_nested_compat(struct rtattr *tb[], int max, struct rtattr *rta,
 			         int len)
 {
-- 
1.9.3

^ permalink raw reply related

* [patch iproute2 2/2] iplink: print out addrgenmode attribute
From: Jiri Pirko @ 2015-01-06 16:23 UTC (permalink / raw)
  To: netdev; +Cc: stephen, thaller
In-Reply-To: <1420561426-3118-1-git-send-email-jiri@resnulli.us>

addrgenmode is currently write only by ip. So display this information
if provided by kernel as well.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 ip/ipaddress.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 4d99324..0a04d90 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -242,6 +242,29 @@ static void print_linktype(FILE *fp, struct rtattr *tb)
 	}
 }
 
+static void print_af_spec(FILE *fp, struct rtattr *af_spec_attr)
+{
+	struct rtattr *inet6_attr;
+	struct rtattr *tb[IFLA_INET6_MAX + 1];
+
+	inet6_attr = parse_rtattr_one_nested(AF_INET6, af_spec_attr);
+	if (!inet6_attr)
+		return;
+
+	parse_rtattr_nested(tb, IFLA_INET6_MAX, inet6_attr);
+
+	if (tb[IFLA_INET6_ADDR_GEN_MODE]) {
+		switch (rta_getattr_u8(tb[IFLA_INET6_ADDR_GEN_MODE])) {
+		case IN6_ADDR_GEN_MODE_EUI64:
+			fprintf(fp, "addrgenmode eui64 ");
+			break;
+		case IN6_ADDR_GEN_MODE_NONE:
+			fprintf(fp, "addrgenmode none ");
+			break;
+		}
+	}
+}
+
 static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 {
 	struct ifla_vf_mac *vf_mac;
@@ -634,6 +657,9 @@ int print_linkinfo(const struct sockaddr_nl *who,
 	if (do_link && tb[IFLA_LINKINFO] && show_details)
 		print_linktype(fp, tb[IFLA_LINKINFO]);
 
+	if (do_link && tb[IFLA_AF_SPEC] && show_details)
+		print_af_spec(fp, tb[IFLA_AF_SPEC]);
+
 	if (do_link && tb[IFLA_IFALIAS]) {
 		fprintf(fp, "%s    alias %s", _SL_,
 			rta_getattr_str(tb[IFLA_IFALIAS]));
-- 
1.9.3

^ permalink raw reply related

* Re: TCP connection issues against Amazon S3
From: Erik Grinaker @ 2015-01-06 16:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, Yuchung Cheng, netdev
In-Reply-To: <1420560253.32621.20.camel@edumazet-glaptop2.roam.corp.google.com>


> On 06 Jan 2015, at 16:04, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2015-01-06 at 15:14 +0000, Erik Grinaker wrote:
>> (CCing Yuchung, as his name comes up in the relevant commits)
>> 
>> After upgrading from Ubuntu 12.04.5 to 14.04.1 we have begun seeing
>> intermittent TCP connection hangs for HTTP image requests against
>> Amazon S3. 3-5% of requests will suddenly stall in the middle of the
>> transfer before timing out. We see this problem across a range of
>> servers, in several data centres and networks, all located in Norway.
>> 
>> A packet dump [1] shows repeated ACK retransmits for some of the
>> requests. Using Ubuntu mainline kernels, we found the problem to have
>> been introduced between 3.11.10 and 3.12.0, possibly in
>> 0f7cc9a3c2bd89b15720dbf358e9b9e62af27126. The problem is also present
>> in 3.18.1. Disabling tcp_window_scaling seems to solve it, but has
>> obvious drawbacks for transfer speeds. Other sysctls do not seem to
>> affect it.
>> 
>> I am not sure if this is fundamentally a kernel bug or a network
>> issue, but we did not see this problem with older kernels.
>> 
>> [1] http://abstrakt.bengler.no/tcp-issues-s3.pcap.bz2
> 
> 
> CC netdev
> 
> This looks like the bug we fixed here :
> 
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=39bb5e62867de82b269b07df900165029b928359

Has that patch gone into a release? Because the problem persists with 3.18.1.

> Could you post output of 'nstat' command ?

Sure:

#kernel
IpInReceives                    1676194            0.0
IpInDelivers                    1676115            0.0
IpOutRequests                   1163105            0.0
IcmpInErrors                    826                0.0
IcmpInEchoReps                  826                0.0
IcmpOutErrors                   824                0.0
IcmpOutTimeExcds                10                 0.0
IcmpOutTimestamps               814                0.0
IcmpMsgInType8                  826                0.0
IcmpMsgOutType0                 814                0.0
IcmpMsgOutType3                 10                 0.0
TcpActiveOpens                  2004               0.0
TcpPassiveOpens                 1102               0.0
TcpAttemptFails                 5                  0.0
TcpEstabResets                  34                 0.0
TcpInSegs                       1667699            0.0
TcpOutSegs                      1159990            0.0
TcpRetransSegs                  26                 0.0
TcpOutRsts                      504                0.0
UdpInDatagrams                  2002               0.0
UdpOutDatagrams                 2087               0.0
Ip6InReceives                   1                  0.0
Ip6InNoRoutes                   1                  0.0
Ip6OutRequests                  10                 0.0
Ip6OutDiscards                  3                  0.0
Ip6OutNoRoutes                  1                  0.0
Ip6OutMcastPkts                 7                  0.0
Ip6InOctets                     76                 0.0
Ip6OutOctets                    728                0.0
Ip6OutMcastOctets               520                0.0
Ip6InNoECTPkts                  1                  0.0
Icmp6OutMsgs                    4                  0.0
Icmp6OutNeighborSolicits        1                  0.0
Icmp6OutMLDv2Reports            3                  0.0
Icmp6OutType135                 1                  0.0
Icmp6OutType143                 3                  0.0
TcpExtEmbryonicRsts             5                  0.0
TcpExtPruneCalled               1                  0.0
TcpExtTW                        1897               0.0
TcpExtDelayedACKs               3058               0.0
TcpExtDelayedACKLocked          13                 0.0
TcpExtDelayedACKLost            2330               0.0
TcpExtTCPPrequeued              3084               0.0
TcpExtTCPDirectCopyFromPrequeue 10944              0.0
TcpExtTCPHPHits                 1246417            0.0
TcpExtTCPPureAcks               7512               0.0
TcpExtTCPHPAcks                 3219               0.0
TcpExtTCPSackRecovery           2                  0.0
TcpExtTCPLossUndo               4                  0.0
TcpExtTCPFastRetrans            2                  0.0
TcpExtTCPTimeouts               18                 0.0
TcpExtTCPLossProbes             145                0.0
TcpExtTCPLossProbeRecovery      125                0.0
TcpExtTCPSackRecoveryFail       1                  0.0
TcpExtTCPRcvCollapsed           22                 0.0
TcpExtTCPDSACKOldSent           43                 0.0
TcpExtTCPDSACKRecv              3                  0.0
TcpExtTCPAbortOnData            113                0.0
TcpExtTCPDSACKIgnoredNoUndo     3                  0.0
TcpExtTCPSpuriousRTOs           2                  0.0
TcpExtTCPSackShiftFallback      3                  0.0
TcpExtTCPRcvCoalesce            927994             0.0
TcpExtTCPOFOQueue               300911             0.0
TcpExtTCPOFOMerge               76                 0.0
TcpExtTCPSpuriousRtxHostQueues  24                 0.0
IpExtInBcastPkts                5588               0.0
IpExtInOctets                   2454079082         0.0
IpExtOutOctets                  56232776           0.0
IpExtInBcastOctets              3218688            0.0
IpExtInNoECTPkts                1676194            0.0

^ permalink raw reply

* Re: TCP connection issues against Amazon S3
From: Eric Dumazet @ 2015-01-06 16:04 UTC (permalink / raw)
  To: Erik Grinaker; +Cc: linux-kernel, Yuchung Cheng, netdev
In-Reply-To: <5DCDADEF-FF9C-4844-8A2C-62E2D3B3B8CE@bengler.no>

On Tue, 2015-01-06 at 15:14 +0000, Erik Grinaker wrote:
> (CCing Yuchung, as his name comes up in the relevant commits)
> 
> After upgrading from Ubuntu 12.04.5 to 14.04.1 we have begun seeing
> intermittent TCP connection hangs for HTTP image requests against
> Amazon S3. 3-5% of requests will suddenly stall in the middle of the
> transfer before timing out. We see this problem across a range of
> servers, in several data centres and networks, all located in Norway.
> 
> A packet dump [1] shows repeated ACK retransmits for some of the
> requests. Using Ubuntu mainline kernels, we found the problem to have
> been introduced between 3.11.10 and 3.12.0, possibly in
> 0f7cc9a3c2bd89b15720dbf358e9b9e62af27126. The problem is also present
> in 3.18.1. Disabling tcp_window_scaling seems to solve it, but has
> obvious drawbacks for transfer speeds. Other sysctls do not seem to
> affect it.
> 
> I am not sure if this is fundamentally a kernel bug or a network
> issue, but we did not see this problem with older kernels.
> 
> [1] http://abstrakt.bengler.no/tcp-issues-s3.pcap.bz2--


CC netdev

This looks like the bug we fixed here :

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=39bb5e62867de82b269b07df900165029b928359

Could you post output of 'nstat' command ?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox