Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v8 3/3] net: hisilicon: new hip04 ethernet driver
From: Alexander Graf @ 2014-12-07  9:49 UTC (permalink / raw)
  To: Ding Tianhong, Zhangfei Gao
  Cc: davem, linux, arnd, f.fainelli, sergei.shtylyov, mark.rutland,
	David.Laight, eric.dumazet, xuwei5, linux-arm-kernel, netdev,
	devicetree
In-Reply-To: <5483C977.2060308@huawei.com>



On 07.12.14 04:28, Ding Tianhong wrote:
> On 2014/12/7 8:42, Alexander Graf wrote:
>> On 19.04.14 03:13, Zhangfei Gao wrote:
>>> Support Hisilicon hip04 ethernet driver, including 100M / 1000M controller.
>>> The controller has no tx done interrupt, reclaim xmitted buffer in the poll.
>>>
>>> Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
>>
>> Is this driver still supposed to go upstream? I presume this was the
>> last submission and it's been quite some time ago :)
>>
> 
> yes, it is really a long time, but The hip04 did not support tx irq, 
> we couldn't get any better idea to fix this defect, do you have any suggestion?

Well, if hardware doesn't have a TX irq I don't see there's anything we
can do to fix that ;).

Dave, what's your take here? Should we keep a driver from going upstream
just because the hardware is partly broken? I'd really prefer to have an
upstream driver on that SoC rather than some random (eventually even
more broken) downstream code.


Alex

^ permalink raw reply

* Re: Where exactly will arch_fast_hash be used
From: Herbert Xu @ 2014-12-07  9:28 UTC (permalink / raw)
  To: George Spelvin; +Cc: dborkman, hannes, linux-kernel, netdev, tgraf
In-Reply-To: <20141207052041.20498.qmail@ns.horizon.com>

On Sun, Dec 07, 2014 at 12:20:41AM -0500, George Spelvin wrote:
>
> Can you point me to a source for that statement?

For a start why don't you print out the hashes of 1-255 and then
find out how easy it is to deduce the last bit of the hash result.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH] mac80211: Supporting of IFLA_INFO_KIND rtnl attribute
From: Marcel Holtmann @ 2014-12-07  9:06 UTC (permalink / raw)
  To: Vadim Kochan
  Cc: Johannes Berg, John W. Linville, David S. Miller, linux-wireless,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1417910690-2021-1-git-send-email-vadim4j-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Hi Vadim,

> It allows to identify the wireless kind of device for
> the user application, e.g.:
> 
>    # ip -d link
> 
>    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
>        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0
>    2: enp0s25: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
>        link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff promiscuity 0
>    3: wlp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
>        link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff promiscuity 0
>        wireless
>    4: wlp0s26u1u2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>        link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff promiscuity 0 
>        wireless 
> 
> Signed-off-by: Vadim Kochan <vadim4j-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
> net/mac80211/iface.c | 7 +++++++
> 1 file changed, 7 insertions(+)
> 
> diff --git a/net/mac80211/iface.c b/net/mac80211/iface.c
> index 653f5eb..b993c7d 100644
> --- a/net/mac80211/iface.c
> +++ b/net/mac80211/iface.c
> @@ -18,6 +18,7 @@
> #include <linux/rtnetlink.h>
> #include <net/mac80211.h>
> #include <net/ieee80211_radiotap.h>
> +#include <net/rtnetlink.h>
> #include "ieee80211_i.h"
> #include "sta_info.h"
> #include "debugfs_netdev.h"
> @@ -1624,6 +1625,10 @@ static void ieee80211_assign_perm_addr(struct ieee80211_local *local,
> 	mutex_unlock(&local->iflist_mtx);
> }
> 
> +static struct rtnl_link_ops wireless_link_ops __read_mostly = {
> +	.kind = "wireless",
> +};
> +

I would prefer if we use "wlan" here. Same as what we use for DEVTYPE and RFKILL switches. Being consistent across the kernel helps a lot. Since other subsystems including Bluetooth might want to identify their netdev as well and "wireless" would be just too generic.

Regards

Marcel

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next v3 2/2] rocker: remove swdev mode
From: Thomas Graf @ 2014-12-07  8:19 UTC (permalink / raw)
  To: roopa
  Cc: jiri, sfeldma, jhs, bcrl, john.fastabend, stephen, linville,
	vyasevic, netdev, davem, shm, gospo
In-Reply-To: <1417935267-6000-3-git-send-email-roopa@cumulusnetworks.com>

On 12/06/14 at 10:54pm, roopa@cumulusnetworks.com wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> 
> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> ---
>  drivers/net/ethernet/rocker/rocker.c |   18 +-----------------
>  include/linux/rtnetlink.h            |    2 +-
>  net/core/rtnetlink.c                 |   12 +++++++++---
>  3 files changed, 11 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
> index fded127..9f1d256 100644
> --- a/drivers/net/ethernet/rocker/rocker.c
> +++ b/drivers/net/ethernet/rocker/rocker.c
> @@ -3755,7 +3739,7 @@ static int rocker_port_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
>  				      u32 filter_mask)
>  {
>  	struct rocker_port *rocker_port = netdev_priv(dev);
> -	u16 mode = BRIDGE_MODE_SWDEV;
> +	u16 mode = -1;
        ^^^

I assume you meant s16

^ permalink raw reply

* [PATCH net-next] net/mlx4_en: ethtool force speed when asking for autoneg=off
From: Amir Vadai @ 2014-12-07  8:07 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Yevgeny Petrilin, Saeed Mahameed, Amir Vadai

From: Saeed Mahameed <saeedm@mellanox.com>

Use cmd->autoneg == AUTONEG_DISABLE as a user hint to force specific speed.
We don't want to rely on ethtool to calculate advertised link modes when
forcing specific speed, a user can request a specific speed and specify
"autoneg off" in ethtool command to give a hint for forcing this speed.

Move en_warn("port reset..") inside the "port reset" block.

Fixes: d48b3ab ("net/mlx4_en: Use PTYS register to set ethtool settings (Speed)")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index c45e06a..3045582 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -771,13 +771,13 @@ static int mlx4_en_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 	}
 
 	proto_admin = cpu_to_be32(ptys_adv);
-	if (speed >= 0 && speed != priv->port_state.link_speed)
+	if (speed >= 0 && ((speed != priv->port_state.link_speed) ||
+			   (cmd->autoneg == AUTONEG_DISABLE)))
 		/* If speed was set then speed decides :-) */
 		proto_admin = speed_set_ptys_admin(priv, speed,
 						   ptys_reg.eth_proto_cap);
 
 	proto_admin &= ptys_reg.eth_proto_cap;
-
 	if (proto_admin == ptys_reg.eth_proto_admin)
 		return 0; /* Nothing to change */
 
@@ -798,9 +798,9 @@ static int mlx4_en_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 		return ret;
 	}
 
-	en_warn(priv, "Port link mode changed, restarting port...\n");
 	mutex_lock(&priv->mdev->state_lock);
 	if (priv->port_up) {
+		en_warn(priv, "Port link mode changed, restarting port...\n");
 		mlx4_en_stop_port(dev, 1);
 		if (mlx4_en_start_port(dev))
 			en_err(priv, "Failed restarting port %d\n", priv->port);
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next v3 2/2] rocker: remove swdev mode
From: roopa @ 2014-12-07  6:54 UTC (permalink / raw)
  To: jiri, sfeldma, jhs, bcrl, tgraf, john.fastabend, stephen,
	linville, vyasevic
  Cc: netdev, davem, shm, gospo, Roopa Prabhu

From: Roopa Prabhu <roopa@cumulusnetworks.com>

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 drivers/net/ethernet/rocker/rocker.c |   18 +-----------------
 include/linux/rtnetlink.h            |    2 +-
 net/core/rtnetlink.c                 |   12 +++++++++---
 3 files changed, 11 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
index fded127..9f1d256 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -3700,27 +3700,11 @@ static int rocker_port_bridge_setlink(struct net_device *dev,
 {
 	struct rocker_port *rocker_port = netdev_priv(dev);
 	struct nlattr *protinfo;
-	struct nlattr *afspec;
 	struct nlattr *attr;
-	u16 mode;
 	int err;
 
 	protinfo = nlmsg_find_attr(nlh, sizeof(struct ifinfomsg),
 				   IFLA_PROTINFO);
-	afspec = nlmsg_find_attr(nlh, sizeof(struct ifinfomsg), IFLA_AF_SPEC);
-
-	if (afspec) {
-		attr = nla_find_nested(afspec, IFLA_BRIDGE_MODE);
-		if (attr) {
-			if (nla_len(attr) < sizeof(mode))
-				return -EINVAL;
-
-			mode = nla_get_u16(attr);
-			if (mode != BRIDGE_MODE_SWDEV)
-				return -EINVAL;
-		}
-	}
-
 	if (protinfo) {
 		attr = nla_find_nested(protinfo, IFLA_BRPORT_LEARNING);
 		if (attr) {
@@ -3755,7 +3739,7 @@ static int rocker_port_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
 				      u32 filter_mask)
 {
 	struct rocker_port *rocker_port = netdev_priv(dev);
-	u16 mode = BRIDGE_MODE_SWDEV;
+	u16 mode = -1;
 	u32 mask = BR_LEARNING | BR_LEARNING_SYNC;
 
 	return ndo_dflt_bridge_getlink(skb, pid, seq, dev, mode,
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 3b04190..dcfa06b 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -103,6 +103,6 @@ extern int ndo_dflt_fdb_del(struct ndmsg *ndm,
 			    u16 vid);
 
 extern int ndo_dflt_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
-				   struct net_device *dev, u16 mode,
+				   struct net_device *dev, s16 mode,
 				   u32 flags, u32 mask);
 #endif	/* __LINUX_RTNETLINK_H */
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 61cb7e7..b4e04b9 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2696,7 +2696,7 @@ static int brport_nla_put_flag(struct sk_buff *skb, u32 flags, u32 mask,
 }
 
 int ndo_dflt_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
-			    struct net_device *dev, u16 mode,
+			    struct net_device *dev, s16 mode,
 			    u32 flags, u32 mask)
 {
 	struct nlmsghdr *nlh;
@@ -2734,11 +2734,17 @@ int ndo_dflt_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
 	if (!br_afspec)
 		goto nla_put_failure;
 
-	if (nla_put_u16(skb, IFLA_BRIDGE_FLAGS, BRIDGE_FLAGS_SELF) ||
-	    nla_put_u16(skb, IFLA_BRIDGE_MODE, mode)) {
+	if (nla_put_u16(skb, IFLA_BRIDGE_FLAGS, BRIDGE_FLAGS_SELF)) {
 		nla_nest_cancel(skb, br_afspec);
 		goto nla_put_failure;
 	}
+
+	if (mode >= 0) {
+		if (nla_put_u16(skb, IFLA_BRIDGE_MODE, mode)) {
+			nla_nest_cancel(skb, br_afspec);
+			goto nla_put_failure;
+		}
+	}
 	nla_nest_end(skb, br_afspec);
 
 	protinfo = nla_nest_start(skb, IFLA_PROTINFO | NLA_F_NESTED);
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH net-next v3 1/2] bridge: remove mode 'swdev'
From: roopa @ 2014-12-07  6:54 UTC (permalink / raw)
  To: jiri, sfeldma, jhs, bcrl, tgraf, john.fastabend, stephen,
	linville, vyasevic
  Cc: netdev, davem, shm, gospo, Roopa Prabhu

From: Roopa Prabhu <roopa@cumulusnetworks.com>

swdev mode was introduced to indicate switchdev offloads
for bridging from user space. But user can
use BRIDGE_FLAGS_SELF to directly call into the
hw switch port driver today. swdev mode is not required anymore.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 include/uapi/linux/if_bridge.h |    1 -
 1 file changed, 1 deletion(-)

diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
index 296a556..da17e45 100644
--- a/include/uapi/linux/if_bridge.h
+++ b/include/uapi/linux/if_bridge.h
@@ -105,7 +105,6 @@ struct __fdb_entry {
 
 #define BRIDGE_MODE_VEB		0	/* Default loopback mode */
 #define BRIDGE_MODE_VEPA	1	/* 802.1Qbg defined VEPA mode */
-#define BRIDGE_MODE_SWDEV	2	/* Full switch device offload */
 
 /* Bridge management nested attributes
  * [IFLA_AF_SPEC] = {
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH net-next v3 0/2] remove bridge BRIDGE_MODE_SWDEV
From: roopa @ 2014-12-07  6:54 UTC (permalink / raw)
  To: jiri, sfeldma, jhs, bcrl, tgraf, john.fastabend, stephen,
	linville, vyasevic
  Cc: netdev, davem, shm, gospo, Roopa Prabhu

From: Roopa Prabhu <roopa@cumulusnetworks.com>

Roopa Prabhu (2):
  bridge: remove mode 'swdev'
  rocker: remove swdev mode

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>

 drivers/net/ethernet/rocker/rocker.c |   18 +-----------------
 include/linux/rtnetlink.h            |    2 +-
 include/uapi/linux/if_bridge.h       |    1 -
 net/core/rtnetlink.c                 |   12 +++++++++---
 4 files changed, 11 insertions(+), 22 deletions(-)

-- 
1.7.10.4

^ permalink raw reply

* Re: Where exactly will arch_fast_hash be used
From: George Spelvin @ 2014-12-07  5:20 UTC (permalink / raw)
  To: herbert; +Cc: dborkman, hannes, linux, linux-kernel, netdev, tgraf

If you want DoS-resistant hash tables, I'm working on adding SipHash
to the kernel.

This is a keyed pseudo-random function designed specifically for that
application.  I am starting with ext4 directory hashes, and then intended
to expand to secure sequence numbers (since it's far faster than MD5).

(I'm trying to figure out a good interface, since the crypto API
is a bit heavy for something to heavily optimized.)

But one comment caught my eye:
> Even if security wasn't an issue, straight CRC32 has really poor
> lower-order bit distribution, which makes it a terrible choice for
> a hash table that simply uses the lower-order bits.

Er... huh?  That's the first time I've heard that claim, and while I'm not
Philip Koopman or Guy Castagnoli, I thought I understood CRCs pretty well.

CRCs generally mix bits pretty well.  The sparse 16-bit CRCs chosen
for implementation simplicity had some limitations, but the Castagnoli
polynomial is quite dense.

And their mathematical symmetry means that the low bits really shouldn't
be any different from any other bits.  But if it is an issue, it's just
as easy work to shift down the correct number of high bits rather than
using the low.

Can you point me to a source for that statement?

^ permalink raw reply

* Re: [PATCH v8 3/3] net: hisilicon: new hip04 ethernet driver
From: Ding Tianhong @ 2014-12-07  3:28 UTC (permalink / raw)
  To: Alexander Graf, Zhangfei Gao
  Cc: davem, linux, arnd, f.fainelli, sergei.shtylyov, mark.rutland,
	David.Laight, eric.dumazet, xuwei5, linux-arm-kernel, netdev,
	devicetree
In-Reply-To: <5483A279.5000503@suse.de>

On 2014/12/7 8:42, Alexander Graf wrote:
> On 19.04.14 03:13, Zhangfei Gao wrote:
>> Support Hisilicon hip04 ethernet driver, including 100M / 1000M controller.
>> The controller has no tx done interrupt, reclaim xmitted buffer in the poll.
>>
>> Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> 
> Is this driver still supposed to go upstream? I presume this was the
> last submission and it's been quite some time ago :)
> 

yes, it is really a long time, but The hip04 did not support tx irq, 
we couldn't get any better idea to fix this defect, do you have any suggestion?

Regards
Ding

>> ---
>>  drivers/net/ethernet/hisilicon/Makefile    |    2 +-
>>  drivers/net/ethernet/hisilicon/hip04_eth.c |  846 ++++++++++++++++++++++++++++
>>  2 files changed, 847 insertions(+), 1 deletion(-)
>>  create mode 100644 drivers/net/ethernet/hisilicon/hip04_eth.c
>>
>> diff --git a/drivers/net/ethernet/hisilicon/Makefile b/drivers/net/ethernet/hisilicon/Makefile
>> index 1d6eb6e..17dec03 100644
> 
> [...]
> 
>> +static irqreturn_t hip04_mac_interrupt(int irq, void *dev_id)
>> +{
>> +	struct net_device *ndev = (struct net_device *) dev_id;
>> +	struct hip04_priv *priv = netdev_priv(ndev);
>> +	struct net_device_stats *stats = &ndev->stats;
>> +	u32 ists = readl_relaxed(priv->base + PPE_INTSTS);
>> +
>> +	writel_relaxed(DEF_INT_MASK, priv->base + PPE_RINT);
>> +
>> +	if (unlikely(ists & DEF_INT_ERR)) {
>> +		if (ists & (RCV_NOBUF | RCV_DROP))
>> +			stats->rx_errors++;
>> +			stats->rx_dropped++;
>> +			netdev_err(ndev, "rx drop\n");
> 
> Why would a user want to see this as a dmesg error? We already have
> stats for packet drops, no?
> 
>> +		if (ists & TX_DROP) {
>> +			stats->tx_dropped++;
>> +			netdev_err(ndev, "tx drop\n");
> 
> Same here.
> 
>> +		}
>> +	}
>> +
>> +	if (ists & RCV_INT) {
>> +		/* disable rx interrupt */
>> +		priv->reg_inten &= ~(RCV_INT);
>> +		writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN);
>> +		napi_schedule(&priv->napi);
>> +	}
>> +
>> +	return IRQ_HANDLED;
>> +}
> 
> [...]
> 
>> +
>> +static const struct of_device_id hip04_mac_match[] = {
>> +	{ .compatible = "hisilicon,hip04-mac" },
>> +	{ }
>> +};
> 
> This is missing the magic macro that actually would make module aliases
> work.
> 
> 
> Alex
> 
>> +
>> +static struct platform_driver hip04_mac_driver = {
>> +	.probe	= hip04_mac_probe,
>> +	.remove	= hip04_remove,
>> +	.driver	= {
>> +		.name		= DRV_NAME,
>> +		.owner		= THIS_MODULE,
>> +		.of_match_table	= hip04_mac_match,
>> +	},
>> +};
>> +module_platform_driver(hip04_mac_driver);
>> +
>> +MODULE_DESCRIPTION("HISILICON P04 Ethernet driver");
>> +MODULE_LICENSE("GPL v2");
>> +MODULE_ALIAS("platform:hip04-ether");
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> .
> 

^ permalink raw reply

* From:  Brian Neu
From: Brian Neu @ 2014-12-07  1:42 UTC (permalink / raw)
  To: David Smith, Taylor, ehart test, Theresa, Bind Users,
	Juliette Vara, Jennifer Waller, Polly Waters, oreilly,
	e1000 devel, Carolyn, netdev, Majordomo, Majordomo, Chris Mason,
	linux btrfs, pgsql novice, Tom Lane, majordomo, backuppc users


[-- Attachment #1.1: Type: text/plain, Size: 81 bytes --]

Hi!
How are you? http://library.skc.edu/noise.php people say it works!

Brian Neu

[-- Attachment #2: Type: text/plain, Size: 440 bytes --]

------------------------------------------------------------------------------
Dive into the World of Parallel Programming! The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. http://goparallel.sourceforge.net

[-- Attachment #3: Type: text/plain, Size: 257 bytes --]

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* Re: [PATCH v8 3/3] net: hisilicon: new hip04 ethernet driver
From: Alexander Graf @ 2014-12-07  0:42 UTC (permalink / raw)
  To: Zhangfei Gao
  Cc: davem, linux, arnd, f.fainelli, sergei.shtylyov, mark.rutland,
	David.Laight, eric.dumazet, xuwei5, linux-arm-kernel, netdev,
	devicetree
In-Reply-To: <1397869980-21187-4-git-send-email-zhangfei.gao@linaro.org>

On 19.04.14 03:13, Zhangfei Gao wrote:
> Support Hisilicon hip04 ethernet driver, including 100M / 1000M controller.
> The controller has no tx done interrupt, reclaim xmitted buffer in the poll.
> 
> Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>

Is this driver still supposed to go upstream? I presume this was the
last submission and it's been quite some time ago :)

> ---
>  drivers/net/ethernet/hisilicon/Makefile    |    2 +-
>  drivers/net/ethernet/hisilicon/hip04_eth.c |  846 ++++++++++++++++++++++++++++
>  2 files changed, 847 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/net/ethernet/hisilicon/hip04_eth.c
> 
> diff --git a/drivers/net/ethernet/hisilicon/Makefile b/drivers/net/ethernet/hisilicon/Makefile
> index 1d6eb6e..17dec03 100644

[...]

> +static irqreturn_t hip04_mac_interrupt(int irq, void *dev_id)
> +{
> +	struct net_device *ndev = (struct net_device *) dev_id;
> +	struct hip04_priv *priv = netdev_priv(ndev);
> +	struct net_device_stats *stats = &ndev->stats;
> +	u32 ists = readl_relaxed(priv->base + PPE_INTSTS);
> +
> +	writel_relaxed(DEF_INT_MASK, priv->base + PPE_RINT);
> +
> +	if (unlikely(ists & DEF_INT_ERR)) {
> +		if (ists & (RCV_NOBUF | RCV_DROP))
> +			stats->rx_errors++;
> +			stats->rx_dropped++;
> +			netdev_err(ndev, "rx drop\n");

Why would a user want to see this as a dmesg error? We already have
stats for packet drops, no?

> +		if (ists & TX_DROP) {
> +			stats->tx_dropped++;
> +			netdev_err(ndev, "tx drop\n");

Same here.

> +		}
> +	}
> +
> +	if (ists & RCV_INT) {
> +		/* disable rx interrupt */
> +		priv->reg_inten &= ~(RCV_INT);
> +		writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN);
> +		napi_schedule(&priv->napi);
> +	}
> +
> +	return IRQ_HANDLED;
> +}

[...]

> +
> +static const struct of_device_id hip04_mac_match[] = {
> +	{ .compatible = "hisilicon,hip04-mac" },
> +	{ }
> +};

This is missing the magic macro that actually would make module aliases
work.


Alex

> +
> +static struct platform_driver hip04_mac_driver = {
> +	.probe	= hip04_mac_probe,
> +	.remove	= hip04_remove,
> +	.driver	= {
> +		.name		= DRV_NAME,
> +		.owner		= THIS_MODULE,
> +		.of_match_table	= hip04_mac_match,
> +	},
> +};
> +module_platform_driver(hip04_mac_driver);
> +
> +MODULE_DESCRIPTION("HISILICON P04 Ethernet driver");
> +MODULE_LICENSE("GPL v2");
> +MODULE_ALIAS("platform:hip04-ether");
> 

^ permalink raw reply

* [PATCH] mac80211: Supporting of IFLA_INFO_KIND rtnl attribute
From: Vadim Kochan @ 2014-12-07  0:04 UTC (permalink / raw)
  To: johannes-cdvu00un1VgdHxzADdlk8Q, linville-2XuSBdqkA4R54TAoqtyWWQ,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: Vadim Kochan

It allows to identify the wireless kind of device for
the user application, e.g.:

    # ip -d link

    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0
    2: enp0s25: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
        link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff promiscuity 0
    3: wlp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
        link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff promiscuity 0
        wireless
    4: wlp0s26u1u2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
        link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff promiscuity 0 
        wireless 

Signed-off-by: Vadim Kochan <vadim4j-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 net/mac80211/iface.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/mac80211/iface.c b/net/mac80211/iface.c
index 653f5eb..b993c7d 100644
--- a/net/mac80211/iface.c
+++ b/net/mac80211/iface.c
@@ -18,6 +18,7 @@
 #include <linux/rtnetlink.h>
 #include <net/mac80211.h>
 #include <net/ieee80211_radiotap.h>
+#include <net/rtnetlink.h>
 #include "ieee80211_i.h"
 #include "sta_info.h"
 #include "debugfs_netdev.h"
@@ -1624,6 +1625,10 @@ static void ieee80211_assign_perm_addr(struct ieee80211_local *local,
 	mutex_unlock(&local->iflist_mtx);
 }
 
+static struct rtnl_link_ops wireless_link_ops __read_mostly = {
+	.kind = "wireless",
+};
+
 int ieee80211_if_add(struct ieee80211_local *local, const char *name,
 		     struct wireless_dev **new_wdev, enum nl80211_iftype type,
 		     struct vif_params *params)
@@ -1684,6 +1689,8 @@ int ieee80211_if_add(struct ieee80211_local *local, const char *name,
 		memcpy(sdata->vif.addr, ndev->dev_addr, ETH_ALEN);
 		memcpy(sdata->name, ndev->name, IFNAMSIZ);
 
+		ndev->rtnl_link_ops = &wireless_link_ops;
+
 		sdata->dev = ndev;
 	}
 
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH net-next 4/4] macvlan: play well with ipvlan device
From: Mahesh Bandewar @ 2014-12-06 23:53 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Mahesh Bandewar

If device is already used as an ipvlan port then refuse to
use it as a macvlan port at early stage of port creation.

	thost1:~# ip link add link eth0 ipvl0 type ipvlan
	thost1:~# echo $?
	0
	thost1:~# ip link add link eth0 mvl0 type macvlan
	RTNETLINK answers: Device or resource busy
	thost1:~# echo $?
	2
	thost1:~#

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 drivers/net/macvlan.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 9538674587aa..84b4008210c7 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -1056,6 +1056,9 @@ static int macvlan_port_create(struct net_device *dev)
 	if (dev->type != ARPHRD_ETHER || dev->flags & IFF_LOOPBACK)
 		return -EINVAL;
 
+	if (netif_is_ipvlan_port(dev))
+		return -EBUSY;
+
 	port = kzalloc(sizeof(*port), GFP_KERNEL);
 	if (port == NULL)
 		return -ENOMEM;
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next 3/4] ipvlan: move the device check function into netdevice.h
From: Mahesh Bandewar @ 2014-12-06 23:53 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Mahesh Bandewar

Move the port check [ipvlan_dev_master()] and device check
[ipvlan_dev_slave()] functions to netdevice.h and rename them
netif_is_ipvlan_port() and netif_is_ipvlan() resp. to be
consistent with macvlan api naming.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 drivers/net/ipvlan/ipvlan.h      | 10 ----------
 drivers/net/ipvlan/ipvlan_main.c | 10 +++++-----
 include/linux/netdevice.h        | 10 ++++++++++
 3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
index c44d29eca6c0..2729f64b3e7e 100644
--- a/drivers/net/ipvlan/ipvlan.h
+++ b/drivers/net/ipvlan/ipvlan.h
@@ -107,16 +107,6 @@ static inline struct ipvl_port *ipvlan_port_get_rtnl(const struct net_device *d)
 	return rtnl_dereference(d->rx_handler_data);
 }
 
-static inline bool ipvlan_dev_master(struct net_device *d)
-{
-	return d->priv_flags & IFF_IPVLAN_MASTER;
-}
-
-static inline bool ipvlan_dev_slave(struct net_device *d)
-{
-	return d->priv_flags & IFF_IPVLAN_SLAVE;
-}
-
 void ipvlan_adjust_mtu(struct ipvl_dev *ipvlan, struct net_device *dev);
 void ipvlan_set_port_mode(struct ipvl_port *port, u32 nval);
 void ipvlan_init_secret(void);
diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index a706a547d811..83cf9fb0804d 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -447,11 +447,11 @@ static int ipvlan_link_new(struct net *src_net, struct net_device *dev,
 	if (!phy_dev)
 		return -ENODEV;
 
-	if (ipvlan_dev_slave(phy_dev)) {
+	if (netif_is_ipvlan(phy_dev)) {
 		struct ipvl_dev *tmp = netdev_priv(phy_dev);
 
 		phy_dev = tmp->phy_dev;
-	} else if (!ipvlan_dev_master(phy_dev)) {
+	} else if (!netif_is_ipvlan_port(phy_dev)) {
 		err = ipvlan_port_create(phy_dev);
 		if (err < 0)
 			return err;
@@ -561,7 +561,7 @@ static int ipvlan_device_event(struct notifier_block *unused,
 	struct ipvl_port *port;
 	LIST_HEAD(lst_kill);
 
-	if (!ipvlan_dev_master(dev))
+	if (!netif_is_ipvlan_port(dev))
 		return NOTIFY_DONE;
 
 	port = ipvlan_port_get_rtnl(dev);
@@ -652,7 +652,7 @@ static int ipvlan_addr6_event(struct notifier_block *unused,
 	struct net_device *dev = (struct net_device *)if6->idev->dev;
 	struct ipvl_dev *ipvlan = netdev_priv(dev);
 
-	if (!ipvlan_dev_slave(dev))
+	if (!netif_is_ipvlan(dev))
 		return NOTIFY_DONE;
 
 	if (!ipvlan || !ipvlan->port)
@@ -724,7 +724,7 @@ static int ipvlan_addr4_event(struct notifier_block *unused,
 	struct ipvl_dev *ipvlan = netdev_priv(dev);
 	struct in_addr ip4_addr;
 
-	if (!ipvlan_dev_slave(dev))
+	if (!netif_is_ipvlan(dev))
 		return NOTIFY_DONE;
 
 	if (!ipvlan || !ipvlan->port)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1f49aac258f9..c31f74d76ebd 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3646,6 +3646,16 @@ static inline bool netif_is_macvlan_port(struct net_device *dev)
 	return dev->priv_flags & IFF_MACVLAN_PORT;
 }
 
+static inline bool netif_is_ipvlan(struct net_device *dev)
+{
+	return dev->priv_flags & IFF_IPVLAN_SLAVE;
+}
+
+static inline bool netif_is_ipvlan_port(struct net_device *dev)
+{
+	return dev->priv_flags & IFF_IPVLAN_MASTER;
+}
+
 static inline bool netif_is_bond_master(struct net_device *dev)
 {
 	return dev->flags & IFF_MASTER && dev->priv_flags & IFF_BONDING;
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next 2/4] ipvlan: play well with macvlan device
From: Mahesh Bandewar @ 2014-12-06 23:53 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Mahesh Bandewar

If a device is already a macvlan port then refuse to use it as
an ipvlan port in the early stage of port creation.

	thost1:~# ip link add link eth0 mvl0 type macvlan
	thost1:~# echo $?
	0
	thost1:~# ip link add link eth0 ipvl0 type ipvlan
	RTNETLINK answers: Device or resource busy
	thost1:~# echo $?
	2
	thost1:~#

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 drivers/net/ipvlan/ipvlan_main.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 96b71b0d78f6..a706a547d811 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -38,6 +38,12 @@ static int ipvlan_port_create(struct net_device *dev)
 		netdev_err(dev, "Master is either lo or non-ether device\n");
 		return -EINVAL;
 	}
+
+	if (netif_is_macvlan_port(dev)) {
+		netdev_err(dev, "Master is a macvlan port.\n");
+		return -EBUSY;
+	}
+
 	port = kzalloc(sizeof(struct ipvl_port), GFP_KERNEL);
 	if (!port)
 		return -ENOMEM;
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next 1/4] netdevice: Add a function to check macvlan port
From: Mahesh Bandewar @ 2014-12-06 23:53 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Mahesh Bandewar

Similar to a check for macvlan device, netif_is_macvlan(), add
another function to check if a device is used as macvlan port.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 include/linux/netdevice.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 29c92ee9ed56..1f49aac258f9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3641,6 +3641,11 @@ static inline bool netif_is_macvlan(struct net_device *dev)
 	return dev->priv_flags & IFF_MACVLAN;
 }
 
+static inline bool netif_is_macvlan_port(struct net_device *dev)
+{
+	return dev->priv_flags & IFF_MACVLAN_PORT;
+}
+
 static inline bool netif_is_bond_master(struct net_device *dev)
 {
 	return dev->flags & IFF_MASTER && dev->priv_flags & IFF_BONDING;
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next v2 2/4] net: tcp: add key management to congestion control
From: Daniel Borkmann @ 2014-12-06 23:39 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev
In-Reply-To: <1417909163-19135-1-git-send-email-dborkman@redhat.com>

For a per-route congestion control possibility, our aim is to store
a unique u32 key identifier into dst metrics, which can then be
mapped into a tcp_congestion_ops struct. We argue that having a
RTAX key entry is the most simple, generic and easy way to manage,
and also keeps the memory footprint of dst entries lower on 64 bit
than with storing a pointer directly, for example. Having a unique
key id also allows for decoupling actual TCP congestion control
module management from the FIB layer, i.e. we don't have to care
about expensive module refcounting inside the FIB at this point.

We first thought of using an IDR store for the realization, which
takes over dynamic assignment of unused key space and also performs
the key to pointer mapping in RCU. While doing so, we stumbled upon
the issue that due to the nature of dynamic key distribution, it
just so happens, arguably in very rare occasions, that excessive
module loads and unloads can lead to a possible reuse of previously
used key space. Thus, previously stale keys in the dst metric are
now being reassigned to a different congestion control algorithm,
which might lead to unexpected behaviour. One way to resolve this
would have been to walk FIBs on the actually rare occasion of a
module unload and reset the metric keys for each FIB in each netns,
but that's just very costly.

Therefore, we argue a better solution is to reuse the unique
congestion control algorithm name member and map that into u32 key
space through jhash. For that, we split the flags attribute (as it
currently uses 2 bits only anyway) into two u32 attributes, flags
and key, so that we can keep the cacheline boundary of 2 cachelines
on x86_64 and cache the precalculated key at registration time for
the fast path. On average we might expect 2 - 4 modules being loaded
worst case perhaps 15, so a key collision possibility is extremely
low, and guaranteed collision-free on LE/BE for all in-tree modules.
Overall this results in much simpler code, and all without the
overhead of an IDR. Due to the deterministic nature, modules can
now be unloaded, the congestion control algorithm for a specific
but unloaded key will fall back to the default one, and on module
reload time it will switch back to the expected algorithm
transparently.

Joint work with Florian Westphal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/net/inet_connection_sock.h |  3 +-
 include/net/tcp.h                  |  9 +++++-
 net/ipv4/tcp_cong.c                | 63 ++++++++++++++++++++++++++++++++++++--
 3 files changed, 71 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 848e85c..5976bde 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -98,7 +98,8 @@ struct inet_connection_sock {
 	const struct tcp_congestion_ops *icsk_ca_ops;
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
-	__u8			  icsk_ca_state;
+	__u8			  icsk_ca_state:7,
+				  icsk_ca_dst_locked:1;
 	__u8			  icsk_retransmits;
 	__u8			  icsk_pending;
 	__u8			  icsk_backoff;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f50f29faf..135b70c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -787,6 +787,8 @@ enum tcp_ca_ack_event_flags {
 #define TCP_CA_MAX	128
 #define TCP_CA_BUF_MAX	(TCP_CA_NAME_MAX*TCP_CA_MAX)
 
+#define TCP_CA_UNSPEC	0
+
 /* Algorithm can be set on socket without CAP_NET_ADMIN privileges */
 #define TCP_CONG_NON_RESTRICTED 0x1
 /* Requires ECN/ECT set on all packets */
@@ -794,7 +796,8 @@ enum tcp_ca_ack_event_flags {
 
 struct tcp_congestion_ops {
 	struct list_head	list;
-	unsigned long flags;
+	u32 key;
+	u32 flags;
 
 	/* initialize private data (optional) */
 	void (*init)(struct sock *sk);
@@ -841,6 +844,10 @@ u32 tcp_reno_ssthresh(struct sock *sk);
 void tcp_reno_cong_avoid(struct sock *sk, u32 ack, u32 acked);
 extern struct tcp_congestion_ops tcp_reno;
 
+struct tcp_congestion_ops *tcp_ca_find_key(u32 key);
+u32 tcp_ca_get_key_by_name(const char *name);
+char *tcp_ca_get_name_by_key(u32 key, char *buffer);
+
 static inline bool tcp_ca_needs_ecn(const struct sock *sk)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 38f2f8a..8fd348c 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -13,6 +13,7 @@
 #include <linux/types.h>
 #include <linux/list.h>
 #include <linux/gfp.h>
+#include <linux/jhash.h>
 #include <net/tcp.h>
 
 static DEFINE_SPINLOCK(tcp_cong_list_lock);
@@ -31,6 +32,19 @@ static struct tcp_congestion_ops *tcp_ca_find(const char *name)
 	return NULL;
 }
 
+/* Simple linear search, not much in here. */
+struct tcp_congestion_ops *tcp_ca_find_key(u32 key)
+{
+	struct tcp_congestion_ops *e;
+
+	list_for_each_entry_rcu(e, &tcp_cong_list, list) {
+		if (e->key == key)
+			return e;
+	}
+
+	return NULL;
+}
+
 /*
  * Attach new congestion control algorithm to the list
  * of available options.
@@ -45,9 +59,12 @@ int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
 		return -EINVAL;
 	}
 
+	ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
+
 	spin_lock(&tcp_cong_list_lock);
-	if (tcp_ca_find(ca->name)) {
-		pr_notice("%s already registered\n", ca->name);
+	if (ca->key == TCP_CA_UNSPEC || tcp_ca_find_key(ca->key)) {
+		pr_notice("%s already registered or non-unique key\n",
+			  ca->name);
 		ret = -EEXIST;
 	} else {
 		list_add_tail_rcu(&ca->list, &tcp_cong_list);
@@ -70,9 +87,48 @@ void tcp_unregister_congestion_control(struct tcp_congestion_ops *ca)
 	spin_lock(&tcp_cong_list_lock);
 	list_del_rcu(&ca->list);
 	spin_unlock(&tcp_cong_list_lock);
+
+	/* Wait for outstanding readers to complete before the
+	 * module gets removed entirely.
+	 *
+	 * A try_module_get() should fail by now as our module is
+	 * in "going" state since no refs are held anymore and
+	 * module_exit() handler being called.
+	 */
+	synchronize_rcu();
 }
 EXPORT_SYMBOL_GPL(tcp_unregister_congestion_control);
 
+u32 tcp_ca_get_key_by_name(const char *name)
+{
+	const struct tcp_congestion_ops *ca;
+	u32 key;
+
+	rcu_read_lock();
+	ca = tcp_ca_find(name);
+	key = ca ? ca->key : TCP_CA_UNSPEC;
+	rcu_read_unlock();
+
+	return key;
+}
+EXPORT_SYMBOL_GPL(tcp_ca_get_key_by_name);
+
+char *tcp_ca_get_name_by_key(u32 key, char *buffer)
+{
+	const struct tcp_congestion_ops *ca;
+	char *ret = NULL;
+
+	rcu_read_lock();
+	ca = tcp_ca_find_key(key);
+	if (ca)
+		ret = strncpy(buffer, ca->name,
+			      TCP_CA_NAME_MAX);
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tcp_ca_get_name_by_key);
+
 /* Assign choice of congestion control. */
 void tcp_assign_congestion_control(struct sock *sk)
 {
@@ -256,6 +312,9 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
 	struct tcp_congestion_ops *ca;
 	int err = 0;
 
+	if (icsk->icsk_ca_dst_locked)
+		return -EPERM;
+
 	rcu_read_lock();
 	ca = tcp_ca_find(name);
 
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v2 4/4] net: tcp: add per route congestion control
From: Daniel Borkmann @ 2014-12-06 23:39 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev
In-Reply-To: <1417909163-19135-1-git-send-email-dborkman@redhat.com>

This work adds the possibility to define a per route/destination congestion
control algorithm. Generally, this opens up the possibility for a machine
with different links to enforce specific congestion control algorithms with
optimal strategies for each of them based on their network characteristics,
even transparently for a single application listening on all links.

For our specific use case, this additionally facilitates deployment of DCTCP,
for example, applications can easily serve internal traffic/dsts in DCTCP and
external one with CUBIC. Other scenarios would also allow for utilizing e.g.
long living, low priority background flows for certain destinations/routes
while still being able for normal traffic to utilize the default congestion
control algorithm. We also thought about a per netns setting (where different
defaults are possible), but given its actually a link specific property, we
argue that a per route/destination setting is the most natural and flexible.

The administrator can utilize this through ip-route(8) by appending "congctl
[lock] <name>", where <name> denotes the name of a loaded congestion control
algorithm and the optional lock parameter allows to enforce the given algorithm
so that applications in user space would not be allowed to overwrite that
algorithm for that destination.

The dst metric lookups are being done when a dst entry is already available
in order to avoid a costly lookup and still before the algorithms are being
initialized, thus overhead is very low when the feature is not being used.
While the client side would need to drop the current reference on the module,
on server side this can actually even be avoided as we just got a flat-copied
socket clone.

In case a dst metric contains a key to a congestion control module that has
just been removed from the kernel, it will fallback to the default congestion
control discipline for those and not reassign anything. Should at a later
point in time that module be reinserted into the kernel, then this algorithm
will be used again as keys are deterministic/static.

Joint work with Florian Westphal.

Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/net/tcp.h        |  6 ++++++
 net/ipv4/tcp_ipv4.c      |  2 ++
 net/ipv4/tcp_minisocks.c | 30 ++++++++++++++++++++++++++----
 net/ipv4/tcp_output.c    | 21 +++++++++++++++++++++
 net/ipv6/tcp_ipv6.c      |  2 ++
 5 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 95bb237..b8fdc6b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -448,6 +448,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb);
 struct sock *tcp_create_openreq_child(struct sock *sk,
 				      struct request_sock *req,
 				      struct sk_buff *skb);
+void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst);
 struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 				  struct request_sock *req,
 				  struct dst_entry *dst);
@@ -636,6 +637,11 @@ static inline u32 tcp_rto_min_us(struct sock *sk)
 	return jiffies_to_usecs(tcp_rto_min(sk));
 }
 
+static inline bool tcp_ca_dst_locked(const struct dst_entry *dst)
+{
+	return dst_metric_locked(dst, RTAX_CC_ALGO);
+}
+
 /* Compute the actual receive window we are currently advertising.
  * Rcv_nxt can be after the window if our peer push more data
  * than the offered window.
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 33f5ff0..ed94fdc 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1340,6 +1340,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 	}
 	sk_setup_caps(newsk, dst);
 
+	tcp_ca_openreq_child(newsk, dst);
+
 	tcp_sync_mss(newsk, dst_mtu(dst));
 	newtp->advmss = dst_metric_advmss(dst);
 	if (tcp_sk(sk)->rx_opt.user_mss &&
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 63d2680..bc9216d 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -399,6 +399,32 @@ static void tcp_ecn_openreq_child(struct tcp_sock *tp,
 	tp->ecn_flags = inet_rsk(req)->ecn_ok ? TCP_ECN_OK : 0;
 }
 
+void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	u32 ca_key = dst_metric(dst, RTAX_CC_ALGO);
+	bool ca_got_dst = false;
+
+	if (ca_key != TCP_CA_UNSPEC) {
+		const struct tcp_congestion_ops *ca;
+
+		rcu_read_lock();
+		ca = tcp_ca_find_key(ca_key);
+		if (likely(ca && try_module_get(ca->owner))) {
+			icsk->icsk_ca_dst_locked = tcp_ca_dst_locked(dst);
+			icsk->icsk_ca_ops = ca;
+			ca_got_dst = true;
+		}
+		rcu_read_unlock();
+	}
+
+	if (!ca_got_dst && !try_module_get(icsk->icsk_ca_ops->owner))
+		tcp_assign_congestion_control(sk);
+
+	tcp_set_ca_state(sk, TCP_CA_Open);
+}
+EXPORT_SYMBOL_GPL(tcp_ca_openreq_child);
+
 /* This is not only more efficient than what we used to do, it eliminates
  * a lot of code duplication between IPv4/IPv6 SYN recv processing. -DaveM
  *
@@ -451,10 +477,6 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		newtp->snd_cwnd = TCP_INIT_CWND;
 		newtp->snd_cwnd_cnt = 0;
 
-		if (!try_module_get(newicsk->icsk_ca_ops->owner))
-			tcp_assign_congestion_control(newsk);
-
-		tcp_set_ca_state(newsk, TCP_CA_Open);
 		tcp_init_xmit_timers(newsk);
 		__skb_queue_head_init(&newtp->out_of_order_queue);
 		newtp->write_seq = newtp->pushed_seq = treq->snt_isn + 1;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f5bd4bd..5eb5676 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2916,6 +2916,25 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 }
 EXPORT_SYMBOL(tcp_make_synack);
 
+static void tcp_ca_dst_init(struct sock *sk, const struct dst_entry *dst)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	const struct tcp_congestion_ops *ca;
+	u32 ca_key = dst_metric(dst, RTAX_CC_ALGO);
+
+	if (ca_key == TCP_CA_UNSPEC)
+		return;
+
+	rcu_read_lock();
+	ca = tcp_ca_find_key(ca_key);
+	if (likely(ca && try_module_get(ca->owner))) {
+		module_put(icsk->icsk_ca_ops->owner);
+		icsk->icsk_ca_dst_locked = tcp_ca_dst_locked(dst);
+		icsk->icsk_ca_ops = ca;
+	}
+	rcu_read_unlock();
+}
+
 /* Do all connect socket setups that can be done AF independent. */
 static void tcp_connect_init(struct sock *sk)
 {
@@ -2941,6 +2960,8 @@ static void tcp_connect_init(struct sock *sk)
 	tcp_mtup_init(sk);
 	tcp_sync_mss(sk, dst_mtu(dst));
 
+	tcp_ca_dst_init(sk, dst);
+
 	if (!tp->window_clamp)
 		tp->window_clamp = dst_metric(dst, RTAX_WINDOW);
 	tp->advmss = dst_metric_advmss(dst);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d06af89..20f44fe 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1199,6 +1199,8 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 		inet_csk(newsk)->icsk_ext_hdr_len = (newnp->opt->opt_nflen +
 						     newnp->opt->opt_flen);
 
+	tcp_ca_openreq_child(newsk, dst);
+
 	tcp_sync_mss(newsk, dst_mtu(dst));
 	newtp->advmss = dst_metric_advmss(dst);
 	if (tcp_sk(sk)->rx_opt.user_mss &&
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v2 3/4] net: tcp: add RTAX_CC_ALGO fib handling
From: Daniel Borkmann @ 2014-12-06 23:39 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev
In-Reply-To: <1417909163-19135-1-git-send-email-dborkman@redhat.com>

This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
control metric to be set up and dumped back to user space.

While the internal representation of RTAX_CC_ALGO is handled as a u32
key, we avoided to expose this implementation detail to user space, thus
instead, we chose the netlink attribute that is being exchanged between
user space to be the actual congestion control algorithm name, similarly
as in the setsockopt(2) API in order to allow for maximum flexibility,
even for 3rd party modules.

It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as
it should have been stored in RTAX_FEATURES instead, we first thought
about reusing it for the congestion control key, but it brings more
complications and/or confusion than worth it. Trying to load a non
present congestion algorithm name will be rejected.

Joint work with Florian Westphal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/net/tcp.h              |  7 +++++++
 include/uapi/linux/rtnetlink.h |  2 ++
 net/core/rtnetlink.c           | 15 +++++++++++++--
 net/decnet/dn_fib.c            |  3 ++-
 net/decnet/dn_table.c          |  4 +++-
 net/ipv4/fib_semantics.c       | 14 ++++++++++++--
 net/ipv6/ip6_fib.c             | 15 ++++++++++++++-
 net/ipv6/route.c               |  3 ++-
 8 files changed, 55 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 135b70c..95bb237 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -846,7 +846,14 @@ extern struct tcp_congestion_ops tcp_reno;
 
 struct tcp_congestion_ops *tcp_ca_find_key(u32 key);
 u32 tcp_ca_get_key_by_name(const char *name);
+#ifdef CONFIG_INET
 char *tcp_ca_get_name_by_key(u32 key, char *buffer);
+#else
+static inline char *tcp_ca_get_name_by_key(u32 key, char *buffer)
+{
+	return NULL;
+}
+#endif
 
 static inline bool tcp_ca_needs_ecn(const struct sock *sk)
 {
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 9c9b8b4..d81f22d 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -389,6 +389,8 @@ enum {
 #define RTAX_INITRWND RTAX_INITRWND
 	RTAX_QUICKACK,
 #define RTAX_QUICKACK RTAX_QUICKACK
+	RTAX_CC_ALGO,
+#define RTAX_CC_ALGO RTAX_CC_ALGO
 	__RTAX_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 61cb7e7..3566f41 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -50,6 +50,7 @@
 #include <net/arp.h>
 #include <net/route.h>
 #include <net/udp.h>
+#include <net/tcp.h>
 #include <net/sock.h>
 #include <net/pkt_sched.h>
 #include <net/fib_rules.h>
@@ -669,9 +670,19 @@ int rtnetlink_put_metrics(struct sk_buff *skb, u32 *metrics)
 
 	for (i = 0; i < RTAX_MAX; i++) {
 		if (metrics[i]) {
+			if (i == RTAX_CC_ALGO - 1) {
+				char tmp[TCP_CA_NAME_MAX], *name;
+
+				name = tcp_ca_get_name_by_key(metrics[i], tmp);
+				if (!name)
+					continue;
+				if (nla_put_string(skb, i + 1, name))
+					goto nla_put_failure;
+			} else {
+				if (nla_put_u32(skb, i + 1, metrics[i]))
+					goto nla_put_failure;
+			}
 			valid++;
-			if (nla_put_u32(skb, i+1, metrics[i]))
-				goto nla_put_failure;
 		}
 	}
 
diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index d332aef..df48034 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -298,7 +298,8 @@ struct dn_fib_info *dn_fib_create_info(const struct rtmsg *r, struct nlattr *att
 			int type = nla_type(attr);
 
 			if (type) {
-				if (type > RTAX_MAX || nla_len(attr) < 4)
+				if (type > RTAX_MAX || type == RTAX_CC_ALGO ||
+				    nla_len(attr) < 4)
 					goto err_inval;
 
 				fi->fib_metrics[type-1] = nla_get_u32(attr);
diff --git a/net/decnet/dn_table.c b/net/decnet/dn_table.c
index 86e3807..3f19fcb 100644
--- a/net/decnet/dn_table.c
+++ b/net/decnet/dn_table.c
@@ -29,6 +29,7 @@
 #include <linux/route.h> /* RTF_xxx */
 #include <net/neighbour.h>
 #include <net/netlink.h>
+#include <net/tcp.h>
 #include <net/dst.h>
 #include <net/flow.h>
 #include <net/fib_rules.h>
@@ -273,7 +274,8 @@ static inline size_t dn_fib_nlmsg_size(struct dn_fib_info *fi)
 	size_t payload = NLMSG_ALIGN(sizeof(struct rtmsg))
 			 + nla_total_size(4) /* RTA_TABLE */
 			 + nla_total_size(2) /* RTA_DST */
-			 + nla_total_size(4); /* RTA_PRIORITY */
+			 + nla_total_size(4) /* RTA_PRIORITY */
+			 + nla_total_size(TCP_CA_NAME_MAX); /* RTAX_CC_ALGO */
 
 	/* space for nested metrics */
 	payload += nla_total_size((RTAX_MAX * nla_total_size(4)));
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index f99f41b..d2b7b55 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -360,7 +360,8 @@ static inline size_t fib_nlmsg_size(struct fib_info *fi)
 			 + nla_total_size(4) /* RTA_TABLE */
 			 + nla_total_size(4) /* RTA_DST */
 			 + nla_total_size(4) /* RTA_PRIORITY */
-			 + nla_total_size(4); /* RTA_PREFSRC */
+			 + nla_total_size(4) /* RTA_PREFSRC */
+			 + nla_total_size(TCP_CA_NAME_MAX); /* RTAX_CC_ALGO */
 
 	/* space for nested metrics */
 	payload += nla_total_size((RTAX_MAX * nla_total_size(4)));
@@ -859,7 +860,16 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 
 				if (type > RTAX_MAX)
 					goto err_inval;
-				val = nla_get_u32(nla);
+				if (type == RTAX_CC_ALGO) {
+					char tmp[TCP_CA_NAME_MAX];
+
+					nla_strlcpy(tmp, nla, sizeof(tmp));
+					val = tcp_ca_get_key_by_name(tmp);
+					if (val == TCP_CA_UNSPEC)
+						goto err_inval;
+				} else {
+					val = nla_get_u32(nla);
+				}
 				if (type == RTAX_ADVMSS && val > 65535 - 40)
 					val = 65535 - 40;
 				if (type == RTAX_MTU && val > 65535 - 15)
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index b2d1838..0998ac6 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -30,6 +30,7 @@
 #include <linux/slab.h>
 
 #include <net/ipv6.h>
+#include <net/tcp.h>
 #include <net/ndisc.h>
 #include <net/addrconf.h>
 
@@ -650,10 +651,22 @@ static int fib6_commit_metrics(struct dst_entry *dst,
 		int type = nla_type(nla);
 
 		if (type) {
+			u32 val;
+
 			if (type > RTAX_MAX)
 				return -EINVAL;
+			if (type == RTAX_CC_ALGO) {
+				char tmp[TCP_CA_NAME_MAX];
+
+				nla_strlcpy(tmp, nla, sizeof(tmp));
+				val = tcp_ca_get_key_by_name(tmp);
+				if (val == TCP_CA_UNSPEC)
+					return -EINVAL;
+			} else {
+				val = nla_get_u32(nla);
+			}
 
-			mp[type - 1] = nla_get_u32(nla);
+			mp[type - 1] = val;
 		}
 	}
 	return 0;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index c910831..818c99a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2534,7 +2534,8 @@ static inline size_t rt6_nlmsg_size(void)
 	       + nla_total_size(4) /* RTA_OIF */
 	       + nla_total_size(4) /* RTA_PRIORITY */
 	       + RTAX_MAX * nla_total_size(4) /* RTA_METRICS */
-	       + nla_total_size(sizeof(struct rta_cacheinfo));
+	       + nla_total_size(sizeof(struct rta_cacheinfo))
+	       + nla_total_size(TCP_CA_NAME_MAX); /* RTAX_CC_ALGO */
 }
 
 static int rt6_fill_node(struct net *net,
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v2 1/4] net: tcp: refactor reinitialization of congestion control
From: Daniel Borkmann @ 2014-12-06 23:39 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev
In-Reply-To: <1417909163-19135-1-git-send-email-dborkman@redhat.com>

We can just move this to an extra function and make the code
a bit more readable, no functional change.

Joint work with Florian Westphal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 net/ipv4/tcp_cong.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 27ead0d..38f2f8a 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -107,6 +107,18 @@ void tcp_init_congestion_control(struct sock *sk)
 		icsk->icsk_ca_ops->init(sk);
 }
 
+static void tcp_reinit_congestion_control(struct sock *sk,
+					  const struct tcp_congestion_ops *ca)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	tcp_cleanup_congestion_control(sk);
+	icsk->icsk_ca_ops = ca;
+
+	if (sk->sk_state != TCP_CLOSE && icsk->icsk_ca_ops->init)
+		icsk->icsk_ca_ops->init(sk);
+}
+
 /* Manage refcounts on socket close. */
 void tcp_cleanup_congestion_control(struct sock *sk)
 {
@@ -262,21 +274,13 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
 #endif
 	if (!ca)
 		err = -ENOENT;
-
 	else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) ||
 		   ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)))
 		err = -EPERM;
-
 	else if (!try_module_get(ca->owner))
 		err = -EBUSY;
-
-	else {
-		tcp_cleanup_congestion_control(sk);
-		icsk->icsk_ca_ops = ca;
-
-		if (sk->sk_state != TCP_CLOSE && icsk->icsk_ca_ops->init)
-			icsk->icsk_ca_ops->init(sk);
-	}
+	else
+		tcp_reinit_congestion_control(sk, ca);
  out:
 	rcu_read_unlock();
 	return err;
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v2 0/4] net: allow setting congctl via routing table
From: Daniel Borkmann @ 2014-12-06 23:39 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev

This is the second part of our work and allows for setting the congestion
control algorithm via routing table. For details, please see individual
patches.

Joint work with Florian Westphal, suggested by Hannes Frederic Sowa.

Patch for iproute2: http://patchwork.ozlabs.org/patch/418149/

Thanks!

v1->v2:
 - Very sorry, I noticed I had decnet disabled during testing.
   Added missing header include in decnet, rest as is.

Daniel Borkmann (4):
  net: tcp: refactor reinitialization of congestion control
  net: tcp: add key management to congestion control
  net: tcp: add RTAX_CC_ALGO fib handling
  net: tcp: add per route congestion control

 include/net/inet_connection_sock.h |  3 +-
 include/net/tcp.h                  | 22 +++++++++-
 include/uapi/linux/rtnetlink.h     |  2 +
 net/core/rtnetlink.c               | 15 ++++++-
 net/decnet/dn_fib.c                |  3 +-
 net/decnet/dn_table.c              |  4 +-
 net/ipv4/fib_semantics.c           | 14 +++++-
 net/ipv4/tcp_cong.c                | 87 ++++++++++++++++++++++++++++++++------
 net/ipv4/tcp_ipv4.c                |  2 +
 net/ipv4/tcp_minisocks.c           | 30 +++++++++++--
 net/ipv4/tcp_output.c              | 21 +++++++++
 net/ipv6/ip6_fib.c                 | 15 ++++++-
 net/ipv6/route.c                   |  3 +-
 net/ipv6/tcp_ipv6.c                |  2 +
 14 files changed, 197 insertions(+), 26 deletions(-)

-- 
1.7.11.7

^ permalink raw reply

* Re: [ovs-discuss] kernel panic receiving flooded VXLAN traffic with OVS
From: Nicholas Bastin @ 2014-12-06 22:47 UTC (permalink / raw)
  To: Jesse Gross; +Cc: discuss-yBygre7rU0TnMu66kgdUjQ@public.gmane.org, netdev
In-Reply-To: <CAEP_g=_mfSCH1250ezx_h8_yM_4FzsYcsS6EnGi99AFoWO_MKw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 1160 bytes --]

On Fri, Dec 5, 2014 at 4:51 PM, Jesse Gross <jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:

> I don't think there is anything inherently wrong with aggregating TCP
> segments in VXLAN that are not destined for the local host. This is
> conceptually the same as doing aggregation for TCP packets where we
> only perform L2 bridging - in theory we shouldn't look at the upper
> layers but it is fine as long as we faithfully reconstruct it on the
> way out.
>

But you don't faithfully reconstruct what the user originally sent -
in-path reassembly is always wrong, which is why hardware switches don't do
it (by default, anyhow).  If you configure a middlebox to do some kind of
assembly/translation/whatever work for you, that's fine, but something that
advertises itself as a "switch" or "router" should definitely not do this
by default.

If you reassemble frames you completely obviate any kind of PMTU-D or
configured MTU that your user is using, and this breaks a lot of paths.  We
completely disable all GRO/TSO/etc., but if you are able to determine that
a packet is not destined for the local host you should definitely not
mutate it.

--
Nick

[-- Attachment #1.2: Type: text/html, Size: 1701 bytes --]

[-- Attachment #2: Type: text/plain, Size: 141 bytes --]

_______________________________________________
discuss mailing list
discuss@openvswitch.org
http://openvswitch.org/mailman/listinfo/discuss

^ permalink raw reply

* Re: [net-next 04/14] ixgbe: remove CIAA/D register reads from bad VF check
From: Alex Williamson @ 2014-12-06 19:42 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: David Miller, emil.s.tantilov, netdev, nhorman, sassmann,
	jogreene
In-Reply-To: <1417841892.2399.2.camel@jtkirshe-mobl>

On Fri, 2014-12-05 at 20:58 -0800, Jeff Kirsher wrote:
> On Fri, 2014-12-05 at 20:49 -0800, David Miller wrote:
> > From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> > Date: Fri,  5 Dec 2014 09:52:43 -0800
> > 
> > > From: Emil Tantilov <emil.s.tantilov@intel.com>
> > > 
> > > Accessing the CIAA/D register can block access to the PCI config space.
> > > 
> > > This patch removes the read/write operations to the CIAA/D registers
> > > and makes use of standard kernel functions for accessing the PCI config
> > > space.
> > > 
> > > In addition it moves ixgbevf_check_for_bad_vf() into the watchdog subtask
> > > which reduces the frequency of the checks.
> > > 
> > > CC: Alex Williamson <alex.williamson@redhat.com>
> > > Reported-by: Alex Williamson <alex.williamson@redhat.com>
> > > Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
> > > Tested-by: Aaron Brown <aaron.f.brown@intel.com>
> > > Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> > 
> > Alex Willaimson stated that he'd like to see this for -stable, but I'm warning
> > right now that a change not appropriate for 'net' is not approperiate for
> > '-stable' either.
> 
> Agreed, only reason I did not send this to net (along with the other
> fixes by Emil) was that we are at -rc7 and do not consider these
> "critical" to try and squeeze in before the release.

I'm not trying to subvert any process, just trying to note that the
commit log is rather subtle for the severity of the bug this fixes and
this should be considered for stable once accepted.  Stable obviously
doesn't take anything that's not already upstream.  Thanks,

Alex

^ permalink raw reply

* Re: [PATCH 1/3] netdev: introduce new NETIF_F_HW_SWITCH_OFFLOAD feature flag for switch device offloads
From: Roopa Prabhu @ 2014-12-06 18:59 UTC (permalink / raw)
  To: Thomas Graf
  Cc: jiri, sfeldma, jhs, bcrl, john.fastabend, stephen, linville,
	nhorman, nicolas.dichtel, vyasevic, f.fainelli, buytenh, aviadr,
	netdev, davem, shm, gospo
In-Reply-To: <20141206101425.GA15999@casper.infradead.org>

On 12/6/14, 2:14 AM, Thomas Graf wrote:
> On 12/05/14 at 11:46pm, Roopa Prabhu wrote:
>> On 12/5/14, 2:43 PM, Thomas Graf wrote:
>>> On 12/04/14 at 06:26pm, roopa@cumulusnetworks.com wrote:
>>>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>>
>>>> This is a generic high level feature flag for all switch asic features today.
>>>>
>>>> switch drivers set this flag on switch ports. Logical devices like
>>>> bridge, bonds, vxlans can inherit this flag from their slaves/ports.
>>>>
>>>> I had to use SWITCH in the name to avoid ambiguity with other feature
>>>> flags. But, since i have been harping about not calling it 'switch',
>>>> I am welcome to any suggestions :)
>>>>
>>>> An alternative to using a feature flag is to use a IFF_HW_OFFLOAD
>>>> in net_device_flags.
>>> What does this flag indicate specifically? What driver would
>>> implement ndo_bridge_setlink() but not set this flag?
>>>
>>> I think it should be clearly documented when this flag is to bet set.
>> I mentioned it as an alternative because it was there in my RFC patch. There
>> is no code for it yet.
>> And I will get rid of the comment in v2.
> Sorry, I was referring to NETIF_F_HW_SWITCH_OFFLOAD.
Ah, ok.

> I do not
> understand the scope of this bit/flag yet.
> Can you give examples
> when to set this bit? At what point would the existing ixgbe FDB
> offload set this bit? Is there a case when ndo_bridge_setlink()
> is implemented but this bit is not set?
The switch asic drivers will set this flag on the switch ports. This 
flag will be used when you want to offload
kernel networking to hardware or in other words accelerate  the 
corresponding kernel network function.
  In the context of current patches it is the bridge driver. A kernel 
bridge that has any of these switch asic ports
  as bridge ports can inherit this flag and use this flag on the ports to
call the switch driver ndo ops to offload state to hw and thus hw 
accelerate the bridging function.

The existing l2 offload flags, example the hwmode and self flags used 
for nic drivers bridge/l2 offload,
does not help switch asic hardware completely. AFAICT, those are used 
today to manage the bridge in nic hw directly.
They don't involve the bridge driver.

And the most important thing is they require the user to specify this 
additional flag to offload.


In my proposal,
- The flag/bit helps switch asic drivers to indicate that they 
accelerate the kernel networking objects/functions
- The user does not have to specify a new flag to do so. A bridge 
created with switch asic ports will thus be accelerated if the switch 
driver supports it.
- The user can continue to directly manage l2 in nics (ixgbe) using the 
existing hwmode/self flags
- And it also does not stop users from using the 'self' flag to talk to 
the switch asic driver directly
- And involving the bridge driver makes sure the add/del notifications 
to user space go out after both kernel and hardware are programmed

Regarding ixgbe driver setting this flag: For the current vepa and veb 
modes I don't see a need for it.
But if there is a use case in the future to indicate hw accelerating the 
bridge driver, ixgbe driver can set this flag

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox