Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [Qemu-devel] tap devices not receiving packets from a bridge
From: Peter Lieven @ 2012-11-29 18:58 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel@nongnu.org, netdev, mst
In-Reply-To: <20121123070211.GC22787@stefanha-thinkpad.hitronhub.home>

Am 23.11.2012 08:02, schrieb Stefan Hajnoczi:
> On Thu, Nov 22, 2012 at 03:29:52PM +0100, Peter Lieven wrote:
>> is anyone aware of a problem with the linux network bridge that in very rare circumstances stops
>> a bridge from sending pakets to a tap device?
>>
>> My problem occurs in conjunction with vanilla qemu-kvm-1.2.0 and Ubuntu Kernel 3.2.0-34.53
>> which is based on Linux 3.2.33.
>>
>> I was not yet able to reproduce the issue, it happens in really rare cases. The symptom is that
>> the tap does not have any TX packets. RX is working fine. I see the packets coming in at
>> the physical interface on the host, but they are not forwarded to the tap interface.
>> The bridge itself has learnt the mac address of the vServer that is connected to the tap interface.
>> It does not help to toggle the bridge link status,  the tap interface status or the interface in the vServer.
>> It seems that problem occurs if a tap interface that has previously been used, but set to nonpersistent
>> is set persistent again and then is by chance assigned to the same vServer (=same mac address on same
>> bridge) again. Unfortunately it seems not to be reproducible.
> 
> Not sure but this patch from Michael Tsirkin may help - it solves an
> issue with persistent tap devices:
> 
> http://patchwork.ozlabs.org/patch/198598/

I have tried that patch (even if I do not use persistant taps), but it doesn't help.

What I can say now is that if a VM is not working with a tap - lets say tap10 then
it will not work with tap10, even if the vm is shut down. tap10 is set to non-persistant.
then the vm is started again, assigned occasionally again tap10 and is not working again.

BUT, if I use qemu-kvm-1.0.1 in the second case it is working. I have seen that there is
a lot of changes between 1.0.1 and 1.2.0 in the tap code. Maybe there is a bug in the
inititialization since then.

What also seem to have changed is that vlans have been removed. I was not aware of that,
so I still use vlans which are now emulated via hubs. Maybe this is related.

Peter

> 
> Stefan
> 

^ permalink raw reply

* Re: [PATCH net-next] doc: make the description of how tcp_ecn works more explicit and clear
From: Rick Jones @ 2012-11-29 18:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1354214521.3299.30.camel@edumazet-glaptop>

On 11/29/2012 10:42 AM, Eric Dumazet wrote:
> On Thu, 2012-11-29 at 13:16 -0500, David Miller wrote:
>> From: raj@tardy.usa.hp.com (Rick Jones)
>> Date: Wed, 28 Nov 2012 11:53:10 -0800 (PST)
>>
>>> From: Rick Jones <rick.jones2@hp.com>
>>>
>>> Make the description of how tcp_ecn works a bit more explicit and clear.
>>>
>>> Signed-off-by: Rick Jones <rick.jones2@hp.com>
>>
>> Applied, thanks Rick.

Am I correct in assuming that the documentation is supposed to word-wrap 
somewhere around 72 columns?  If so, as I have time for floor sweeping I 
can try to go through more of it.

>> I think we should change the default to one, to be honest.  I thought
>> that's what we were doing by now...

You weren't the only one - what triggered my looking at that description 
in the first place was an assertion in the tcpm mailing list that Linux 
defaulted to ecn enabled.

>> '2' made sense 10 years ago, but it doesn't really today.
>
> With 1 setting, I for example was enable to connect to a HP device,
> when I was still working for SFR.
>
> (It was an HTTP/HTTPS based administrative software, to manage HP c7000
> enclosures)

If you have some of the particulars, feel free to send them to me 
offline.  Being one of the cobbler's children I cannot make promises but 
I can try to see if whatever it was has evolved since then.

> I would suggest making a large scale experiment before doing this 2->1
> move.

Perhaps one or more of the "development oriented" (term?) distros can 
ship with a sysctl.conf file that sets it to one?  Or some companies 
with rather large Internet presence.

At the time of the tcpm message I went ahead and set it to one on 
netperf.org but that is far from a large scale experiment.  It has been 
a couple weeks and I've captured almost 250000 SYN segments (netperf.org 
isn't all that busy).  My recollection is that at least one search 
engine provider's bots were negotiating ECN and one noteable one was 
not.  I'd think a search engine provider's crawlers would be a large 
scale experiment.

rick jones

^ permalink raw reply

* Re: [PATCH resend net-next 2/3] myri10ge: Add vlan rx for better GRO perf.
From: Andrew Gallatin @ 2012-11-29 19:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20121129.131738.1067398537384405805.davem@davemloft.net>

On 11/29/12 13:17, David Miller wrote:
> From: Andrew Gallatin <gallatin@myri.com>
> Date: Wed, 28 Nov 2012 16:20:56 -0500
>
>> + if ((dev->features & (NETIF_F_HW_VLAN_RX)) == NETIF_F_HW_VLAN_RX &&
>> +	    (veh->h_vlan_proto == ntohs(ETH_P_8021Q))) {
>> +		/* fixup csum if needed */
>> +		if (skb->ip_summed == CHECKSUM_COMPLETE)
>> +			skb->csum = csum_sub(skb->csum,
>> + csum_partial(va + ETH_HLEN,
>> + VLAN_HLEN, 0));
>
> This indentation looks like spaghetti, verify that this kind of error
> doesn't exist in the rest of your patches, and resend the series.
>

Sorry.  Emacs victim...  I'll clean it up & re-submit.  I'd stupidly
assumed that checkpatch would verify indentation.  :(

Drew

^ permalink raw reply

* [PATCH net-next v1 3/3] net/mlx4_en: Set number of rx/tx channels using ethtool
From: Amir Vadai @ 2012-11-29 19:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: Amir Vadai, Or Gerlitz, Oren Duer, netdev
In-Reply-To: <1354216903-830-1-git-send-email-amirv@mellanox.com>

Add support to changing number of rx/tx channels using
ethtool ('ethtool -[lL]'). Where the number of tx channels specified in ethtool
is the number of rings per user priority - not total number of tx rings.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |   69 +++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_main.c    |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |   26 +++++----
 drivers/net/ethernet/mellanox/mlx4/en_tx.c      |    2 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |    8 ++-
 5 files changed, 93 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index dc8ccb4..681bc1b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -999,6 +999,73 @@ static int mlx4_en_set_rxnfc(struct net_device *dev, struct ethtool_rxnfc *cmd)
 	return err;
 }
 
+static void mlx4_en_get_channels(struct net_device *dev,
+		struct ethtool_channels *channel)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+
+	memset(channel, 0, sizeof(*channel));
+
+	channel->max_rx = MAX_RX_RINGS;
+	channel->max_tx = MLX4_EN_MAX_TX_RING_P_UP;
+
+	channel->rx_count = priv->rx_ring_num;
+	channel->tx_count = priv->tx_ring_num / MLX4_EN_NUM_UP;
+}
+
+static int mlx4_en_set_channels(struct net_device *dev,
+		struct ethtool_channels *channel)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct mlx4_en_dev *mdev = priv->mdev;
+	int port_up;
+	int err = 0;
+
+	if (channel->other_count || channel->combined_count ||
+	    channel->tx_count > channel->max_tx ||
+	    channel->rx_count > channel->max_rx ||
+	    !channel->tx_count || !channel->rx_count)
+		return -EINVAL;
+
+	mutex_lock(&mdev->state_lock);
+	if (priv->port_up) {
+		port_up = 1;
+		mlx4_en_stop_port(dev);
+	}
+
+	mlx4_en_free_resources(priv);
+
+	priv->num_tx_rings_p_up = channel->tx_count;
+	priv->tx_ring_num = channel->tx_count * MLX4_EN_NUM_UP;
+	priv->rx_ring_num = channel->rx_count;
+
+	err = mlx4_en_alloc_resources(priv);
+	if (err) {
+		en_err(priv, "Failed reallocating port resources\n");
+		goto out;
+	}
+
+	netif_set_real_num_tx_queues(dev, priv->tx_ring_num);
+	netif_set_real_num_rx_queues(dev, priv->rx_ring_num);
+
+	mlx4_en_setup_tc(dev, MLX4_EN_NUM_UP);
+
+	en_warn(priv, "Using %d TX rings\n", priv->tx_ring_num);
+	en_warn(priv, "Using %d RX rings\n", priv->rx_ring_num);
+
+	if (port_up) {
+		err = mlx4_en_start_port(dev);
+		if (err)
+			en_err(priv, "Failed starting port\n");
+	}
+
+	err = mlx4_en_moderation_update(priv);
+
+out:
+	mutex_unlock(&mdev->state_lock);
+	return err;
+}
+
 const struct ethtool_ops mlx4_en_ethtool_ops = {
 	.get_drvinfo = mlx4_en_get_drvinfo,
 	.get_settings = mlx4_en_get_settings,
@@ -1023,6 +1090,8 @@ const struct ethtool_ops mlx4_en_ethtool_ops = {
 	.get_rxfh_indir_size = mlx4_en_get_rxfh_indir_size,
 	.get_rxfh_indir = mlx4_en_get_rxfh_indir,
 	.set_rxfh_indir = mlx4_en_set_rxfh_indir,
+	.get_channels = mlx4_en_get_channels,
+	.set_channels = mlx4_en_set_channels,
 };
 
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_main.c b/drivers/net/ethernet/mellanox/mlx4/en_main.c
index a52922e..3a2b8c6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_main.c
@@ -250,7 +250,7 @@ static void *mlx4_en_add(struct mlx4_dev *dev)
 				rounddown_pow_of_two(max_t(int, MIN_RX_RINGS,
 							   min_t(int,
 								 dev->caps.num_comp_vectors,
-								 MAX_RX_RINGS)));
+								 DEF_RX_RINGS)));
 		} else {
 			mdev->profile.prof[i].rx_ring_num = rounddown_pow_of_two(
 				min_t(int, dev->caps.comp_pool/
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 2b23ca2..7d1287f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -47,11 +47,11 @@
 #include "mlx4_en.h"
 #include "en_port.h"
 
-static int mlx4_en_setup_tc(struct net_device *dev, u8 up)
+int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int i;
-	unsigned int q, offset = 0;
+	unsigned int offset = 0;
 
 	if (up && up != MLX4_EN_NUM_UP)
 		return -EINVAL;
@@ -59,10 +59,9 @@ static int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 	netdev_set_num_tc(dev, up);
 
 	/* Partition Tx queues evenly amongst UP's */
-	q = priv->tx_ring_num / up;
 	for (i = 0; i < up; i++) {
-		netdev_set_tc_queue(dev, i, q, offset);
-		offset += q;
+		netdev_set_tc_queue(dev, i, priv->num_tx_rings_p_up, offset);
+		offset += priv->num_tx_rings_p_up;
 	}
 
 	return 0;
@@ -1114,7 +1113,7 @@ int mlx4_en_start_port(struct net_device *dev)
 		/* Configure ring */
 		tx_ring = &priv->tx_ring[i];
 		err = mlx4_en_activate_tx_ring(priv, tx_ring, cq->mcq.cqn,
-			i / priv->mdev->profile.num_tx_rings_p_up);
+			i / priv->num_tx_rings_p_up);
 		if (err) {
 			en_err(priv, "Failed allocating Tx ring\n");
 			mlx4_en_deactivate_cq(priv, cq);
@@ -1564,10 +1563,13 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	int err;
 
 	dev = alloc_etherdev_mqs(sizeof(struct mlx4_en_priv),
-	    prof->tx_ring_num, prof->rx_ring_num);
+				 MAX_TX_RINGS, MAX_RX_RINGS);
 	if (dev == NULL)
 		return -ENOMEM;
 
+	netif_set_real_num_tx_queues(dev, prof->tx_ring_num);
+	netif_set_real_num_rx_queues(dev, prof->rx_ring_num);
+
 	SET_NETDEV_DEV(dev, &mdev->dev->pdev->dev);
 	dev->dev_id =  port - 1;
 
@@ -1586,15 +1588,17 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	priv->flags = prof->flags;
 	priv->ctrl_flags = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE |
 			MLX4_WQE_CTRL_SOLICITED);
+	priv->num_tx_rings_p_up = mdev->profile.num_tx_rings_p_up;
 	priv->tx_ring_num = prof->tx_ring_num;
-	priv->tx_ring = kzalloc(sizeof(struct mlx4_en_tx_ring) *
-			priv->tx_ring_num, GFP_KERNEL);
+
+	priv->tx_ring = kzalloc(sizeof(struct mlx4_en_tx_ring) * MAX_TX_RINGS,
+				GFP_KERNEL);
 	if (!priv->tx_ring) {
 		err = -ENOMEM;
 		goto out;
 	}
-	priv->tx_cq = kzalloc(sizeof(struct mlx4_en_cq) * priv->tx_ring_num,
-			GFP_KERNEL);
+	priv->tx_cq = kzalloc(sizeof(struct mlx4_en_cq) * MAX_RX_RINGS,
+			      GFP_KERNEL);
 	if (!priv->tx_cq) {
 		err = -ENOMEM;
 		goto out;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index b35094c..1f571d0 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -523,7 +523,7 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
 u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	u16 rings_p_up = priv->mdev->profile.num_tx_rings_p_up;
+	u16 rings_p_up = priv->num_tx_rings_p_up;
 	u8 up = 0;
 
 	if (dev->num_tc)
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index d3eba8b..334ec48 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -67,7 +67,8 @@
 
 #define MLX4_EN_PAGE_SHIFT	12
 #define MLX4_EN_PAGE_SIZE	(1 << MLX4_EN_PAGE_SHIFT)
-#define MAX_RX_RINGS		16
+#define DEF_RX_RINGS		16
+#define MAX_RX_RINGS		128
 #define MIN_RX_RINGS		4
 #define TXBB_SIZE		64
 #define HEADROOM		(2048 / TXBB_SIZE + 1)
@@ -118,6 +119,8 @@ enum {
 #define MLX4_EN_NUM_UP			8
 #define MLX4_EN_DEF_TX_RING_SIZE	512
 #define MLX4_EN_DEF_RX_RING_SIZE  	1024
+#define MAX_TX_RINGS			(MLX4_EN_MAX_TX_RING_P_UP * \
+					 MLX4_EN_NUM_UP)
 
 /* Target number of packets to coalesce with interrupt moderation */
 #define MLX4_EN_RX_COAL_TARGET	44
@@ -476,6 +479,7 @@ struct mlx4_en_priv {
 	u32 flags;
 #define MLX4_EN_FLAG_PROMISC	0x1
 #define MLX4_EN_FLAG_MC_PROMISC	0x2
+	u8 num_tx_rings_p_up;
 	u32 tx_ring_num;
 	u32 rx_ring_num;
 	u32 rx_skb_size;
@@ -596,6 +600,8 @@ int mlx4_en_QUERY_PORT(struct mlx4_en_dev *mdev, u8 port);
 extern const struct dcbnl_rtnl_ops mlx4_en_dcbnl_ops;
 #endif
 
+int mlx4_en_setup_tc(struct net_device *dev, u8 up);
+
 #ifdef CONFIG_RFS_ACCEL
 void mlx4_en_cleanup_filters(struct mlx4_en_priv *priv,
 			     struct mlx4_en_rx_ring *rx_ring);
-- 
1.7.8.2

^ permalink raw reply related

* [PATCH net-next v1 1/3] MAINTAINERS: Add Mellanox ethernet driver - mlx4_en
From: Amir Vadai @ 2012-11-29 19:21 UTC (permalink / raw)
  To: David S. Miller
  Cc: Amir Vadai, Or Gerlitz, Oren Duer, netdev, Yevgeny Petrilin
In-Reply-To: <1354216903-830-1-git-send-email-amirv@mellanox.com>

Set mlx4_en maintainer to Amir Vadai instead of Yevgeny Petrilin.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Cc: Yevgeny Petrilin <yevgenyp@mellanox.com>
---
 MAINTAINERS |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 5d72dd5..3c6e8cd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4822,6 +4822,14 @@ F:	Documentation/scsi/megaraid.txt
 F:	drivers/scsi/megaraid.*
 F:	drivers/scsi/megaraid/

+MELLANOX ETHERNET DRIVER (mlx4_en)
+M:	Amir Vadai <amirv@mellanox.com>
+L: 	netdev@vger.kernel.org
+S:	Supported
+W:	http://www.mellanox.com
+Q:	http://patchwork.ozlabs.org/project/netdev/list/
+F:	drivers/net/ethernet/mellanox/mlx4/en_*
+
 MEMORY MANAGEMENT
 L:	linux-mm@kvack.org
 W:	http://www.linux-mm.org
-- 
1.7.8.2

^ permalink raw reply related

* [PATCH net-next v1 0/3] mlx4_en: set number of rx/tx channels using ethtool
From: Amir Vadai @ 2012-11-29 19:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: Amir Vadai, Or Gerlitz, Oren Duer, netdev

1. Added a record in the MAINTAINERS file for the mlx4_en driver
2. Fix set_ringparam not to forget tx moderation info + remove code duplication
3. Add support to changing number of rx/tx channels using ethtool

---

Changes from V0:
- Added file pattern to MAINAINERS file

Amir Vadai (3):
  MAINTAINERS: Add Mellanox ethernet driver - mlx4_en
  net/mlx4_en: Fix TX moderation info loss after set_ringparam is
    called
  net/mlx4_en: Set number of rx/tx channels using ethtool

 MAINTAINERS                                     |    8 ++
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  128 +++++++++++++++++-----
 drivers/net/ethernet/mellanox/mlx4/en_main.c    |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |   26 +++--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c      |    2 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |    8 ++-
 6 files changed, 131 insertions(+), 43 deletions(-)

-- 
1.7.8.2

^ permalink raw reply

* [PATCH net-next v1 2/3] net/mlx4_en: Fix TX moderation info loss after set_ringparam is called
From: Amir Vadai @ 2012-11-29 19:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: Amir Vadai, Or Gerlitz, Oren Duer, netdev
In-Reply-To: <1354216903-830-1-git-send-email-amirv@mellanox.com>

We need to re-set tx moderation information after calling set_ringparam
else default tx moderation will be used.
Also avoid related code duplication, by putting it in a utility function.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |   59 ++++++++++++-----------
 1 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 9d0b88e..dc8ccb4 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -43,6 +43,34 @@
 #define EN_ETHTOOL_SHORT_MASK cpu_to_be16(0xffff)
 #define EN_ETHTOOL_WORD_MASK  cpu_to_be32(0xffffffff)
 
+static int mlx4_en_moderation_update(struct mlx4_en_priv *priv)
+{
+	int i;
+	int err = 0;
+
+	for (i = 0; i < priv->tx_ring_num; i++) {
+		priv->tx_cq[i].moder_cnt = priv->tx_frames;
+		priv->tx_cq[i].moder_time = priv->tx_usecs;
+		err = mlx4_en_set_cq_moder(priv, &priv->tx_cq[i]);
+		if (err)
+			return err;
+	}
+
+	if (priv->adaptive_rx_coal)
+		return 0;
+
+	for (i = 0; i < priv->rx_ring_num; i++) {
+		priv->rx_cq[i].moder_cnt = priv->rx_frames;
+		priv->rx_cq[i].moder_time = priv->rx_usecs;
+		priv->last_moder_time[i] = MLX4_EN_AUTO_CONF;
+		err = mlx4_en_set_cq_moder(priv, &priv->rx_cq[i]);
+		if (err)
+			return err;
+	}
+
+	return err;
+}
+
 static void
 mlx4_en_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *drvinfo)
 {
@@ -381,7 +409,6 @@ static int mlx4_en_set_coalesce(struct net_device *dev,
 			      struct ethtool_coalesce *coal)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	int err, i;
 
 	priv->rx_frames = (coal->rx_max_coalesced_frames ==
 			   MLX4_EN_AUTO_CONF) ?
@@ -397,14 +424,6 @@ static int mlx4_en_set_coalesce(struct net_device *dev,
 	    coal->tx_max_coalesced_frames != priv->tx_frames) {
 		priv->tx_usecs = coal->tx_coalesce_usecs;
 		priv->tx_frames = coal->tx_max_coalesced_frames;
-		for (i = 0; i < priv->tx_ring_num; i++) {
-			priv->tx_cq[i].moder_cnt = priv->tx_frames;
-			priv->tx_cq[i].moder_time = priv->tx_usecs;
-			if (mlx4_en_set_cq_moder(priv, &priv->tx_cq[i])) {
-				en_warn(priv, "Failed changing moderation "
-					      "for TX cq %d\n", i);
-			}
-		}
 	}
 
 	/* Set adaptive coalescing params */
@@ -414,18 +433,8 @@ static int mlx4_en_set_coalesce(struct net_device *dev,
 	priv->rx_usecs_high = coal->rx_coalesce_usecs_high;
 	priv->sample_interval = coal->rate_sample_interval;
 	priv->adaptive_rx_coal = coal->use_adaptive_rx_coalesce;
-	if (priv->adaptive_rx_coal)
-		return 0;
 
-	for (i = 0; i < priv->rx_ring_num; i++) {
-		priv->rx_cq[i].moder_cnt = priv->rx_frames;
-		priv->rx_cq[i].moder_time = priv->rx_usecs;
-		priv->last_moder_time[i] = MLX4_EN_AUTO_CONF;
-		err = mlx4_en_set_cq_moder(priv, &priv->rx_cq[i]);
-		if (err)
-			return err;
-	}
-	return 0;
+	return mlx4_en_moderation_update(priv);
 }
 
 static int mlx4_en_set_pauseparam(struct net_device *dev,
@@ -466,7 +475,6 @@ static int mlx4_en_set_ringparam(struct net_device *dev,
 	u32 rx_size, tx_size;
 	int port_up = 0;
 	int err = 0;
-	int i;
 
 	if (param->rx_jumbo_pending || param->rx_mini_pending)
 		return -EINVAL;
@@ -505,14 +513,7 @@ static int mlx4_en_set_ringparam(struct net_device *dev,
 			en_err(priv, "Failed starting port\n");
 	}
 
-	for (i = 0; i < priv->rx_ring_num; i++) {
-		priv->rx_cq[i].moder_cnt = priv->rx_frames;
-		priv->rx_cq[i].moder_time = priv->rx_usecs;
-		priv->last_moder_time[i] = MLX4_EN_AUTO_CONF;
-		err = mlx4_en_set_cq_moder(priv, &priv->rx_cq[i]);
-		if (err)
-			goto out;
-	}
+	err = mlx4_en_moderation_update(priv);
 
 out:
 	mutex_unlock(&mdev->state_lock);
-- 
1.7.8.2

^ permalink raw reply related

* Re: RDS: sendto() with very large buffer triggering WARNING: at mm/page_alloc.c:2403
From: Venkat Venkatsubra @ 2012-11-29 19:38 UTC (permalink / raw)
  To: Tommi Rantala; +Cc: netdev, rds-devel, Dave Jones
In-Reply-To: <CA+ydwtrVyun4LhMufd44ip4eXSwXp9Tw34+5yEdm3UdTVb9hQA@mail.gmail.com>

On 11/29/2012 2:39 AM, Tommi Rantala wrote:
> Hello,
>
> Is RDS supposed to cap the sendto() buffer size? Saw the WARNING while
> fuzzing with Trinity.
>
> #include<string.h>
> #include<arpa/inet.h>
> #include<sys/socket.h>
>
> static const char buf[1234000567];
>
> int main(void)
> {
>          int fd;
>          struct sockaddr_in sa;
>
>          fd = socket(21 /* AF_RDS */, SOCK_SEQPACKET, 0);
>          if (fd<  0)
>                  return 1;
>
>          memset(&sa, 0, sizeof(sa));
>          sa.sin_family = AF_INET;
>          sa.sin_addr.s_addr = inet_addr("127.0.0.1");
>          sa.sin_port = htons(11111);
>
>          bind(fd, (struct sockaddr *)&sa, sizeof(sa));
>
>          sendto(fd, buf, sizeof(buf), 0, (struct sockaddr *)&sa, sizeof(sa));
>
>          return 0;
> }
>
> $ strace -e sendto ./rds-sendto
> sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 1234000567, 0, {sa_family=AF_INET, sin_port=htons(11111),
> sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ENOMEM (Cannot allocate
> memory)
>
> [ 7421.592595] ------------[ cut here ]------------
> [ 7421.592621] WARNING: at mm/page_alloc.c:2403
> __alloc_pages_nodemask+0x2c0/0x9f0()
> [ 7421.592628] Hardware name: EB1012
> [ 7421.592633] Modules linked in:
> [ 7421.592645] Pid: 3082, comm: rds-sendto Not tainted 3.7.0-rc7+ #58
> [ 7421.592650] Call Trace:
> [ 7421.592667]  [<ffffffff810a197b>] warn_slowpath_common+0x7b/0xc0
> [ 7421.592678]  [<ffffffff810a19d5>] warn_slowpath_null+0x15/0x20
> [ 7421.592689]  [<ffffffff81171900>] __alloc_pages_nodemask+0x2c0/0x9f0
> [ 7421.592700]  [<ffffffff81107c90>] ? __lock_acquire+0x3a0/0x9f0
> [ 7421.592711]  [<ffffffff81107c90>] ? __lock_acquire+0x3a0/0x9f0
> [ 7421.592724]  [<ffffffff811ac6bf>] alloc_pages_current+0x7f/0xf0
> [ 7421.592735]  [<ffffffff8116cc19>] __get_free_pages+0x9/0x40
> [ 7421.592746]  [<ffffffff811b3d8a>] kmalloc_order_trace+0x3a/0x190
> [ 7421.592755]  [<ffffffff81107c90>] ? __lock_acquire+0x3a0/0x9f0
> [ 7421.592765]  [<ffffffff811b4f59>] __kmalloc+0x229/0x240
> [ 7421.592778]  [<ffffffff81d1b06e>] rds_message_alloc+0x1e/0xa0
> [ 7421.592789]  [<ffffffff81d1db66>] rds_sendmsg+0x196/0x720
> [ 7421.592802]  [<ffffffff81a72a80>] ? sock_update_classid+0xf0/0x2b0
> [ 7421.592813]  [<ffffffff81a6abec>] sock_sendmsg+0xdc/0xf0
> [ 7421.592828]  [<ffffffff8118e9e5>] ? might_fault+0x85/0x90
> [ 7421.592838]  [<ffffffff8118e99c>] ? might_fault+0x3c/0x90
> [ 7421.592848]  [<ffffffff81a6e0fa>] sys_sendto+0xfa/0x130
> [ 7421.592859]  [<ffffffff8110953d>] ? trace_hardirqs_on_caller+0x10d/0x1a0
> [ 7421.592868]  [<ffffffff811095dd>] ? trace_hardirqs_on+0xd/0x10
> [ 7421.592881]  [<ffffffff81e94c9d>] ? _raw_spin_unlock_irq+0x3d/0x70
> [ 7421.592892]  [<ffffffff810bab44>] ? ptrace_notify+0x74/0x90
> [ 7421.592904]  [<ffffffff81e96250>] tracesys+0xdd/0xe2
> [ 7421.592911] ---[ end trace d9d681d0d60abf69 ]---
>
> Tommi
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
The man page of rds says this:
        The  default values for the send and receive buffer size are 
controlled
        by the A given  RDS  socket  has  limited  transmit  buffer  
space.  It
        defaults  to  the  system  wide  socket  send  buffer  size  set 
in the
        wmem_default and rmem_default sysctls, respectively. They can 
be  tuned
        by  the application through the SO_SNDBUF and SO_RCVBUF socket 
options.

rds_sendmsg (net/rds/send.c) checks this limit a bit later (after 
rds_message_alloc()):
         while (!rds_send_queue_rm(rs, conn, rm, rs->rs_bound_port,
                                   dport, &queued)) {
                 rds_stats_inc(s_send_queue_full);
                 /* XXX make sure this is reasonable */
                 if (payload_len > rds_sk_sndbuf(rs)) {
                         ret = -EMSGSIZE;
                         goto out;
                 }
                 ....
Venkat

^ permalink raw reply

* Re: [PATCH resend net-next 2/3] myri10ge: Add vlan rx for better GRO perf.
From: Joe Perches @ 2012-11-29 19:47 UTC (permalink / raw)
  To: Andrew Gallatin; +Cc: David Miller, netdev
In-Reply-To: <50B7B542.6090804@myri.com>

On Thu, 2012-11-29 at 14:19 -0500, Andrew Gallatin wrote:
> On 11/29/12 13:17, David Miller wrote:
> > From: Andrew Gallatin <gallatin@myri.com>
> > Date: Wed, 28 Nov 2012 16:20:56 -0500
> >
> >> + if ((dev->features & (NETIF_F_HW_VLAN_RX)) == NETIF_F_HW_VLAN_RX &&
> >> +	    (veh->h_vlan_proto == ntohs(ETH_P_8021Q))) {
> >> +		/* fixup csum if needed */
> >> +		if (skb->ip_summed == CHECKSUM_COMPLETE)
> >> +			skb->csum = csum_sub(skb->csum,
> >> + csum_partial(va + ETH_HLEN,
> >> + VLAN_HLEN, 0));
> >
> > This indentation looks like spaghetti, verify that this kind of error
> > doesn't exist in the rest of your patches, and resend the series.
> >
> 
> Sorry.  Emacs victim...  I'll clean it up & re-submit.  I'd stupidly
> assumed that checkpatch would verify indentation.  :(

checkpatch --strict

^ permalink raw reply

* Re: iputils: ping -I <iface>
From: Ben Greear @ 2012-11-29 19:48 UTC (permalink / raw)
  To: Jan Synacek; +Cc: YOSHIFUJI Hideaki, netdev
In-Reply-To: <50B76D5B.8010804@redhat.com>

On 11/29/2012 06:12 AM, Jan Synacek wrote:
> Hello,
>
> There seems to be a bug(?) when calling ping with -I lo:
>
> $ ping -I lo kernel.org
>
> PING kernel.org (149.20.4.69) from 192.168.1.10 lo: 56(84) bytes of data.
> ^C
>
> Note that 192.168.1.10 is my primary interface's address (em1). However, no
> replies are coming back.
>
> $ ping -I em1 kernel.org
>
> PING kernel.org (149.20.4.69) from 192.168.1.10 em1: 56(84) bytes of data.
> 64 bytes from pub2.kernel.org (149.20.4.69): icmp_seq=1 ttl=42 time=202 ms
> 64 bytes from pub2.kernel.org (149.20.4.69): icmp_seq=2 ttl=42 time=187 ms
> ^C
>
> Works as expected.
>
> I know that binding to loopback probably doesn't make much sense, but I think
> that ping should be able to cope with that.

I think it would be wrong if ping worked as you suggest.  Binding to an
interface means use that interface as the source of your packets, and having
it bind hard helps when using systems with multiple NICs on same subnet
(or possibly, same IP).

> Also, it would be nice to mention the difference between -I <ip> and -I <iface>
> in the manpage.

In my opinion, -I <iface> should use SO_BINDTODEVICE, but at least in
older versions of ping it did not.

Thanks,
Ben

>
> I don't understand the problem clearly enough to write a patch.
>
> Regards,
>


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [PATCH 1/5] cxgb4: Add T4 filter support
From: Ben Hutchings @ 2012-11-29 19:51 UTC (permalink / raw)
  To: Vipul Pandya
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	roland-BHEL68pLQRGGvPXPguhicg, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	divy-ut6Up61K2wZBDgjK7y7TUQ, dm-ut6Up61K2wZBDgjK7y7TUQ,
	kumaras-ut6Up61K2wZBDgjK7y7TUQ,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	abhishek-ut6Up61K2wZBDgjK7y7TUQ
In-Reply-To: <1354200745-23598-2-git-send-email-vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

On Thu, 2012-11-29 at 20:22 +0530, Vipul Pandya wrote:
> The T4 architecture is capable of filtering ingress packets at line rate
> using the rule in TCAM. If packet hits a rule in the TCAM then it can be either
> dropped or passed to the receive queues based on a rule settings.
> 
> This patch adds framework for managing filters and to use T4's filter
> capabilities. It constructs a Firmware Filter Work Request which writes the
> filter at a specified index to get the work done. It hosts shadow copy of
> ingress filter entry to check field size limitations and save memory in the
> case where the filter table is large.
> 
> Signed-off-by: Vipul Pandya <vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
> ---
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |  141 +++++++++++
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  309 ++++++++++++++++++++++-
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  |    2 +
>  drivers/net/ethernet/chelsio/cxgb4/l2t.c        |   34 +++
>  drivers/net/ethernet/chelsio/cxgb4/l2t.h        |    3 +
>  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c      |   23 ++-
>  drivers/net/ethernet/chelsio/cxgb4/t4_msg.h     |    1 +
>  drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h   |  279 ++++++++++++++++++++
>  8 files changed, 787 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
> index 378988b..8cfc1ba 100644
> --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
> +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
> @@ -545,6 +545,139 @@ struct adapter {
>  	spinlock_t stats_lock;
>  };
>  
> +/**
[...]

'/**' introduces a kernel-doc comment; please don't use it for any
comments not in that format.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH resend net-next 2/3] myri10ge: Add vlan rx for better GRO perf.
From: Andrew Gallatin @ 2012-11-29 19:53 UTC (permalink / raw)
  To: Joe Perches; +Cc: David Miller, netdev
In-Reply-To: <1354218478.26201.3.camel@joe-AO722>

On 11/29/12 14:47, Joe Perches wrote:
> On Thu, 2012-11-29 at 14:19 -0500, Andrew Gallatin wrote:
>> On 11/29/12 13:17, David Miller wrote:
>>> From: Andrew Gallatin <gallatin@myri.com>
>>> Date: Wed, 28 Nov 2012 16:20:56 -0500
>>>
>>>> + if ((dev->features & (NETIF_F_HW_VLAN_RX)) == NETIF_F_HW_VLAN_RX &&
>>>> +	    (veh->h_vlan_proto == ntohs(ETH_P_8021Q))) {
>>>> +		/* fixup csum if needed */
>>>> +		if (skb->ip_summed == CHECKSUM_COMPLETE)
>>>> +			skb->csum = csum_sub(skb->csum,
>>>> + csum_partial(va + ETH_HLEN,
>>>> + VLAN_HLEN, 0));
>>>
>>> This indentation looks like spaghetti, verify that this kind of error
>>> doesn't exist in the rest of your patches, and resend the series.
>>>
>>
>> Sorry.  Emacs victim...  I'll clean it up & re-submit.  I'd stupidly
>> assumed that checkpatch would verify indentation.  :(
>
> checkpatch --strict
>

That catches one formatting error in the patch, but does not complain
about the indentation issues.

Drew

^ permalink raw reply

* Re: [net-next PATCH V2 9/9] net: increase frag queue hash size and cache-line
From: Jesper Dangaard Brouer @ 2012-11-29 20:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
	Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
	Herbert Xu, David Laight
In-Reply-To: <1354208113.14302.1855.camel@edumazet-glaptop>

On Thu, 2012-11-29 at 08:55 -0800, Eric Dumazet wrote:
> On Thu, 2012-11-29 at 17:16 +0100, Jesper Dangaard Brouer wrote:
> > Increase frag queue hash size and assure cache-line alignment to
> > avoid false sharing.  Hash size is set to 256, because I have
> > observed 206 frag queues in use at 4x10G with packet size 4416 bytes
> > (three fragments).
> > 
[...]
> >  struct inet_frag_bucket {
> >  	struct hlist_head	chain;
> >  	spinlock_t		chain_lock;
> > -};
> > +} ____cacheline_aligned_in_smp;
> >  
> 
> This is a waste of memory.

Do keep in mind this is only 16 Kbytes (256 * 64 bytes = 16384 bytes).


> Most linux powered devices dont care at all about fragments.
> 
> Just increase hashsz if you really want, and rely on hash dispersion
> to avoid false sharing.

I must agree, that it is perhaps better usage of the memory to just
increase the hashsz (and drop ____cacheline_aligned_in_smp), especially
with the measured performance gain.

> You gave no performance results for this patch anyway.

Yes, I did! -- See cover-mail patch 08 vs 09.
But the gain is really too small, to argue for this cache alignment.


Patch-08:
  2x10G size(4416)  result:(5024+4925)= 9949 Mbit/s
                 V2 result:(5140+5206)=10346 Mbit/s

  4x10G size(4416)  result:(4156+4714+4300+3985)=17155 Mbit/s
                 V2 result:(4341+4607+3963+4450)=17361 Mbit/s
                       (gen:6614+5330+7745+5366 =25055 Mbit/s)

Patch-09:
  2x10G size(4416)  result:(5421+5268)=10689 Mbit/s
                 V2 result:(5377+5336)=10713 Mbit/s

  4x10G size(4416) result:(4890+4364+4139+4530)=17923 Mbit/s
                V2 result:(3860+4533+4936+4519)=17848 Mbit/s
                      (gen:5170+6873+5215+7632 =24890 Mbit/s)
  
Improvements Patch 08 -> 09:

 2x10G size(4416):
   RunV1 (10689-9949) =740 Mbit/s
   RunV2 (10713-10346)=367 Mbit/s

 4x10G size(4416):
   RunV1 (17923-17155)=768 Mbit/s
   RunV2 (17848-17361)=487 Mbit/s

Its consistently better performance, but given magnitude the other
improvements, I don't want to argue over "wasting" 16Kbytes kernel
memory.

I have some debug patches for dumping the content of the hash, which
shows that at 4x10G size(4416) three frags, 206 frag queues, cross CPU
collisions occur anyhow.

Lets focus on the other patches instead.

--Jesper

^ permalink raw reply

* [PATCH v3 net-next 0/2] myri10ge: LRO to GRO conversion
From: Andrew Gallatin @ 2012-11-29 21:07 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Hi,

The following patchset is a resubmission of '[PATCH resend net-next 0/3]
myri10ge: LRO to GRO conversion', and converts myri10ge from using
the old inet_lro interface to GRO, and to do vlan tag decap in
the driver so as to not suffer a performance penalty for vlan
tagged traffic due to the conversion.

Changes this time are:

- Clean up some messy indenting & parens in
   "2/3 myri10ge-Add-vlan-rx-for-better-GRO-perf"
   and store the vlan hdr csum in a variable, so now we are not calling
   csum_partial() nested so deeply it spreads across 3 lines

- Folded 3/3 myri10ge-Use-skb_fill_page_desc into 1/3
   myri10ge-Convert-from-LRO-to-GRO.patch since it is really just small
   part of the LRO removal cleanup.  It was originally a separate patch
   because I noticed the cleanup at the last second, and was too lazy
   to fold it into the first patch where it belonged.

Note that a naive LRO->GRO conversion of myri10ge will result in a
performance regression for vlan tagged frames.  This is because
myri10ge does not offer hardware vlan tag offload, and because GRO
requires hardware vlan tag offload to aggregate vlan tagged frames.

To address this performance regression, I have implemented vlan tag
popping in the myri10ge driver, as it seems to be the lesser of two
evils.  As eric.dumazet@gmail.com commented when I asked about this on
netdev: "Given GRO assumes NIC does hardware vlan
offloading, I guess I would chose to do that.  It seems unfortunate to
add vlan decap in GRO path, already very complex."

Andrew Gallatin (2):
myri10ge: Convert from LRO to GRO
myri10ge: Add vlan rx for better GRO perf.

  drivers/net/ethernet/myricom/Kconfig             |    1 -
  drivers/net/ethernet/myricom/myri10ge/myri10ge.c |  275 
++++++----------------
  2 files changed, 74 insertions(+), 202 deletions(-)

^ permalink raw reply

* [PATCH v3 net-next 1/2] myri10ge: Convert from LRO to GRO
From: Andrew Gallatin @ 2012-11-29 21:07 UTC (permalink / raw)
  To: David Miller; +Cc: netdev



Convert myri10ge from LRO to GRO, and simplify the driver by removing
various LRO-related code which is no longer needed including
ndo_fix_features op, custom skb building from frags, and LRO
header parsing.

Signed-off-by: Andrew Gallatin <gallatin@myri.com>
---
  drivers/net/ethernet/myricom/Kconfig             |    1 -
  drivers/net/ethernet/myricom/myri10ge/myri10ge.c |  236 
++++------------------
  2 files changed, 34 insertions(+), 203 deletions(-)

diff --git a/drivers/net/ethernet/myricom/Kconfig 
b/drivers/net/ethernet/myricom/Kconfig
index 540f0c6..3932d08 100644
--- a/drivers/net/ethernet/myricom/Kconfig
+++ b/drivers/net/ethernet/myricom/Kconfig
@@ -23,7 +23,6 @@ config MYRI10GE
  	depends on PCI && INET
  	select FW_LOADER
  	select CRC32
-	select INET_LRO
  	---help---
  	  This driver supports Myricom Myri-10G Dual Protocol interface in
  	  Ethernet mode. If the eeprom on your board is not recent enough,
diff --git a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c 
b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
index 83516e3..84207c0 100644
--- a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
+++ b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
@@ -50,7 +50,6 @@
  #include <linux/etherdevice.h>
  #include <linux/if_ether.h>
  #include <linux/if_vlan.h>
-#include <linux/inet_lro.h>
  #include <linux/dca.h>
  #include <linux/ip.h>
  #include <linux/inet.h>
@@ -96,8 +95,6 @@ MODULE_LICENSE("Dual BSD/GPL");

  #define MYRI10GE_EEPROM_STRINGS_SIZE 256
  #define MYRI10GE_MAX_SEND_DESC_TSO ((65536 / 2048) * 2)
-#define MYRI10GE_MAX_LRO_DESCRIPTORS 8
-#define MYRI10GE_LRO_MAX_PKTS 64

  #define MYRI10GE_NO_CONFIRM_DATA htonl(0xffffffff)
  #define MYRI10GE_NO_RESPONSE_RESULT 0xffffffff
@@ -165,8 +162,6 @@ struct myri10ge_rx_done {
  	dma_addr_t bus;
  	int cnt;
  	int idx;
-	struct net_lro_mgr lro_mgr;
-	struct net_lro_desc lro_desc[MYRI10GE_MAX_LRO_DESCRIPTORS];
  };

  struct myri10ge_slice_netstats {
@@ -338,11 +333,6 @@ static int myri10ge_debug = -1;	/* defaults above */
  module_param(myri10ge_debug, int, 0);
  MODULE_PARM_DESC(myri10ge_debug, "Debug level (0=none,...,16=all)");

-static int myri10ge_lro_max_pkts = MYRI10GE_LRO_MAX_PKTS;
-module_param(myri10ge_lro_max_pkts, int, S_IRUGO);
-MODULE_PARM_DESC(myri10ge_lro_max_pkts,
-		 "Number of LRO packets to be aggregated");
-
  static int myri10ge_fill_thresh = 256;
  module_param(myri10ge_fill_thresh, int, S_IRUGO | S_IWUSR);
  MODULE_PARM_DESC(myri10ge_fill_thresh, "Number of empty rx slots 
allowed");
@@ -1197,36 +1187,6 @@ static inline void myri10ge_vlan_ip_csum(struct 
sk_buff *skb, __wsum hw_csum)
  	}
  }

-static inline void
-myri10ge_rx_skb_build(struct sk_buff *skb, u8 * va,
-		      struct skb_frag_struct *rx_frags, int len, int hlen)
-{
-	struct skb_frag_struct *skb_frags;
-
-	skb->len = skb->data_len = len;
-	/* attach the page(s) */
-
-	skb_frags = skb_shinfo(skb)->frags;
-	while (len > 0) {
-		memcpy(skb_frags, rx_frags, sizeof(*skb_frags));
-		len -= skb_frag_size(rx_frags);
-		skb_frags++;
-		rx_frags++;
-		skb_shinfo(skb)->nr_frags++;
-	}
-
-	/* pskb_may_pull is not available in irq context, but
-	 * skb_pull() (for ether_pad and eth_type_trans()) requires
-	 * the beginning of the packet in skb_headlen(), move it
-	 * manually */
-	skb_copy_to_linear_data(skb, va, hlen);
-	skb_shinfo(skb)->frags[0].page_offset += hlen;
-	skb_frag_size_sub(&skb_shinfo(skb)->frags[0], hlen);
-	skb->data_len -= hlen;
-	skb->tail += hlen;
-	skb_pull(skb, MXGEFW_PAD);
-}
-
  static void
  myri10ge_alloc_rx_pages(struct myri10ge_priv *mgp, struct 
myri10ge_rx_buf *rx,
  			int bytes, int watchdog)
@@ -1304,18 +1264,14 @@ myri10ge_unmap_rx_page(struct pci_dev *pdev,
  	}
  }

-#define MYRI10GE_HLEN 64	/* The number of bytes to copy from a
-				 * page into an skb */
-
  static inline int
-myri10ge_rx_done(struct myri10ge_slice_state *ss, int len, __wsum csum,
-		 bool lro_enabled)
+myri10ge_rx_done(struct myri10ge_slice_state *ss, int len, __wsum csum)
  {
  	struct myri10ge_priv *mgp = ss->mgp;
  	struct sk_buff *skb;
-	struct skb_frag_struct rx_frags[MYRI10GE_MAX_FRAGS_PER_FRAME];
+	struct skb_frag_struct *rx_frags;
  	struct myri10ge_rx_buf *rx;
-	int i, idx, hlen, remainder, bytes;
+	int i, idx, remainder, bytes;
  	struct pci_dev *pdev = mgp->pdev;
  	struct net_device *dev = mgp->dev;
  	u8 *va;
@@ -1332,67 +1288,47 @@ myri10ge_rx_done(struct myri10ge_slice_state 
*ss, int len, __wsum csum,
  	idx = rx->cnt & rx->mask;
  	va = page_address(rx->info[idx].page) + rx->info[idx].page_offset;
  	prefetch(va);
+
+	skb = napi_get_frags(&ss->napi);
+	if (unlikely(skb == NULL)) {
+		ss->stats.rx_dropped++;
+		for (i = 0, remainder = len; remainder > 0; i++) {
+			myri10ge_unmap_rx_page(pdev, &rx->info[idx], bytes);
+			put_page(rx->info[idx].page);
+			rx->cnt++;
+			idx = rx->cnt & rx->mask;
+			remainder -= MYRI10GE_ALLOC_SIZE;
+		}
+		return 0;
+	}
+	rx_frags = skb_shinfo(skb)->frags;
  	/* Fill skb_frag_struct(s) with data from our receive */
  	for (i = 0, remainder = len; remainder > 0; i++) {
  		myri10ge_unmap_rx_page(pdev, &rx->info[idx], bytes);
-		__skb_frag_set_page(&rx_frags[i], rx->info[idx].page);
-		rx_frags[i].page_offset = rx->info[idx].page_offset;
-		if (remainder < MYRI10GE_ALLOC_SIZE)
-			skb_frag_size_set(&rx_frags[i], remainder);
-		else
-			skb_frag_size_set(&rx_frags[i], MYRI10GE_ALLOC_SIZE);
+		skb_fill_page_desc(skb, i, rx->info[idx].page,
+				   rx->info[idx].page_offset,
+				   remainder < MYRI10GE_ALLOC_SIZE ?
+				   remainder : MYRI10GE_ALLOC_SIZE);
  		rx->cnt++;
  		idx = rx->cnt & rx->mask;
  		remainder -= MYRI10GE_ALLOC_SIZE;
  	}

-	if (lro_enabled) {
-		rx_frags[0].page_offset += MXGEFW_PAD;
-		skb_frag_size_sub(&rx_frags[0], MXGEFW_PAD);
-		len -= MXGEFW_PAD;
-		lro_receive_frags(&ss->rx_done.lro_mgr, rx_frags,
-				  /* opaque, will come back in get_frag_header */
-				  len, len,
-				  (void *)(__force unsigned long)csum, csum);
+	/* remove padding */
+	rx_frags[0].page_offset += MXGEFW_PAD;
+	rx_frags[0].size -= MXGEFW_PAD;
+	len -= MXGEFW_PAD;

-		return 1;
-	}
-
-	hlen = MYRI10GE_HLEN > len ? len : MYRI10GE_HLEN;
-
-	/* allocate an skb to attach the page(s) to. This is done
-	 * after trying LRO, so as to avoid skb allocation overheads */
-
-	skb = netdev_alloc_skb(dev, MYRI10GE_HLEN + 16);
-	if (unlikely(skb == NULL)) {
-		ss->stats.rx_dropped++;
-		do {
-			i--;
-			__skb_frag_unref(&rx_frags[i]);
-		} while (i != 0);
-		return 0;
-	}
-
-	/* Attach the pages to the skb, and trim off any padding */
-	myri10ge_rx_skb_build(skb, va, rx_frags, len, hlen);
-	if (skb_frag_size(&skb_shinfo(skb)->frags[0]) <= 0) {
-		skb_frag_unref(skb, 0);
-		skb_shinfo(skb)->nr_frags = 0;
-	} else {
-		skb->truesize += bytes * skb_shinfo(skb)->nr_frags;
+	skb->len = len;
+	skb->data_len = len;
+	skb->truesize += len;
+	if (dev->features & NETIF_F_RXCSUM) {
+		skb->ip_summed = CHECKSUM_COMPLETE;
+		skb->csum = csum;
  	}
-	skb->protocol = eth_type_trans(skb, dev);
  	skb_record_rx_queue(skb, ss - &mgp->ss[0]);

-	if (dev->features & NETIF_F_RXCSUM) {
-		if ((skb->protocol == htons(ETH_P_IP)) ||
-		    (skb->protocol == htons(ETH_P_IPV6))) {
-			skb->csum = csum;
-			skb->ip_summed = CHECKSUM_COMPLETE;
-		} else
-			myri10ge_vlan_ip_csum(skb, csum);
-	}
-	netif_receive_skb(skb);
+	napi_gro_frags(&ss->napi);
  	return 1;
  }

@@ -1480,18 +1416,11 @@ myri10ge_clean_rx_done(struct 
myri10ge_slice_state *ss, int budget)
  	u16 length;
  	__wsum checksum;

-	/*
-	 * Prevent compiler from generating more than one ->features memory
-	 * access to avoid theoretical race condition with functions that
-	 * change NETIF_F_LRO flag at runtime.
-	 */
-	bool lro_enabled = !!(ACCESS_ONCE(mgp->dev->features) & NETIF_F_LRO);
-
  	while (rx_done->entry[idx].length != 0 && work_done < budget) {
  		length = ntohs(rx_done->entry[idx].length);
  		rx_done->entry[idx].length = 0;
  		checksum = csum_unfold(rx_done->entry[idx].checksum);
-		rx_ok = myri10ge_rx_done(ss, length, checksum, lro_enabled);
+		rx_ok = myri10ge_rx_done(ss, length, checksum);
  		rx_packets += rx_ok;
  		rx_bytes += rx_ok * (unsigned long)length;
  		cnt++;
@@ -1503,9 +1432,6 @@ myri10ge_clean_rx_done(struct myri10ge_slice_state 
*ss, int budget)
  	ss->stats.rx_packets += rx_packets;
  	ss->stats.rx_bytes += rx_bytes;

-	if (lro_enabled)
-		lro_flush_all(&rx_done->lro_mgr);
-
  	/* restock receive rings if needed */
  	if (ss->rx_small.fill_cnt - ss->rx_small.cnt < myri10ge_fill_thresh)
  		myri10ge_alloc_rx_pages(mgp, &ss->rx_small,
@@ -1779,7 +1705,6 @@ static const char 
myri10ge_gstrings_slice_stats[][ETH_GSTRING_LEN] = {
  	"tx_pkt_start", "tx_pkt_done", "tx_req", "tx_done",
  	"rx_small_cnt", "rx_big_cnt",
  	"wake_queue", "stop_queue", "tx_linearized",
-	"LRO aggregated", "LRO flushed", "LRO avg aggr", "LRO no_desc",
  };

  #define MYRI10GE_NET_STATS_LEN      21
@@ -1880,14 +1805,6 @@ myri10ge_get_ethtool_stats(struct net_device *netdev,
  		data[i++] = (unsigned int)ss->tx.wake_queue;
  		data[i++] = (unsigned int)ss->tx.stop_queue;
  		data[i++] = (unsigned int)ss->tx.linearized;
-		data[i++] = ss->rx_done.lro_mgr.stats.aggregated;
-		data[i++] = ss->rx_done.lro_mgr.stats.flushed;
-		if (ss->rx_done.lro_mgr.stats.flushed)
-			data[i++] = ss->rx_done.lro_mgr.stats.aggregated /
-			    ss->rx_done.lro_mgr.stats.flushed;
-		else
-			data[i++] = 0;
-		data[i++] = ss->rx_done.lro_mgr.stats.no_desc;
  	}
  }

@@ -2271,67 +2188,6 @@ static void myri10ge_free_irq(struct 
myri10ge_priv *mgp)
  		pci_disable_msix(pdev);
  }

-static int
-myri10ge_get_frag_header(struct skb_frag_struct *frag, void **mac_hdr,
-			 void **ip_hdr, void **tcpudp_hdr,
-			 u64 * hdr_flags, void *priv)
-{
-	struct ethhdr *eh;
-	struct vlan_ethhdr *veh;
-	struct iphdr *iph;
-	u8 *va = skb_frag_address(frag);
-	unsigned long ll_hlen;
-	/* passed opaque through lro_receive_frags() */
-	__wsum csum = (__force __wsum) (unsigned long)priv;
-
-	/* find the mac header, aborting if not IPv4 */
-
-	eh = (struct ethhdr *)va;
-	*mac_hdr = eh;
-	ll_hlen = ETH_HLEN;
-	if (eh->h_proto != htons(ETH_P_IP)) {
-		if (eh->h_proto == htons(ETH_P_8021Q)) {
-			veh = (struct vlan_ethhdr *)va;
-			if (veh->h_vlan_encapsulated_proto != htons(ETH_P_IP))
-				return -1;
-
-			ll_hlen += VLAN_HLEN;
-
-			/*
-			 *  HW checksum starts ETH_HLEN bytes into
-			 *  frame, so we must subtract off the VLAN
-			 *  header's checksum before csum can be used
-			 */
-			csum = csum_sub(csum, csum_partial(va + ETH_HLEN,
-							   VLAN_HLEN, 0));
-		} else {
-			return -1;
-		}
-	}
-	*hdr_flags = LRO_IPV4;
-
-	iph = (struct iphdr *)(va + ll_hlen);
-	*ip_hdr = iph;
-	if (iph->protocol != IPPROTO_TCP)
-		return -1;
-	if (ip_is_fragment(iph))
-		return -1;
-	*hdr_flags |= LRO_TCP;
-	*tcpudp_hdr = (u8 *) (*ip_hdr) + (iph->ihl << 2);
-
-	/* verify the IP checksum */
-	if (unlikely(ip_fast_csum((u8 *) iph, iph->ihl)))
-		return -1;
-
-	/* verify the  checksum */
-	if (unlikely(csum_tcpudp_magic(iph->saddr, iph->daddr,
-				       ntohs(iph->tot_len) - (iph->ihl << 2),
-				       IPPROTO_TCP, csum)))
-		return -1;
-
-	return 0;
-}
-
  static int myri10ge_get_txrx(struct myri10ge_priv *mgp, int slice)
  {
  	struct myri10ge_cmd cmd;
@@ -2402,7 +2258,6 @@ static int myri10ge_open(struct net_device *dev)
  	struct myri10ge_cmd cmd;
  	int i, status, big_pow2, slice;
  	u8 *itable;
-	struct net_lro_mgr *lro_mgr;

  	if (mgp->running != MYRI10GE_ETH_STOPPED)
  		return -EBUSY;
@@ -2513,19 +2368,6 @@ static int myri10ge_open(struct net_device *dev)
  			goto abort_with_rings;
  		}

-		lro_mgr = &ss->rx_done.lro_mgr;
-		lro_mgr->dev = dev;
-		lro_mgr->features = LRO_F_NAPI;
-		lro_mgr->ip_summed = CHECKSUM_COMPLETE;
-		lro_mgr->ip_summed_aggr = CHECKSUM_UNNECESSARY;
-		lro_mgr->max_desc = MYRI10GE_MAX_LRO_DESCRIPTORS;
-		lro_mgr->lro_arr = ss->rx_done.lro_desc;
-		lro_mgr->get_frag_header = myri10ge_get_frag_header;
-		lro_mgr->max_aggr = myri10ge_lro_max_pkts;
-		lro_mgr->frag_align_pad = 2;
-		if (lro_mgr->max_aggr > MAX_SKB_FRAGS)
-			lro_mgr->max_aggr = MAX_SKB_FRAGS;
-
  		/* must happen prior to any irq */
  		napi_enable(&(ss)->napi);
  	}
@@ -3143,15 +2985,6 @@ static int myri10ge_set_mac_address(struct 
net_device *dev, void *addr)
  	return 0;
  }

-static netdev_features_t myri10ge_fix_features(struct net_device *dev,
-	netdev_features_t features)
-{
-	if (!(features & NETIF_F_RXCSUM))
-		features &= ~NETIF_F_LRO;
-
-	return features;
-}
-
  static int myri10ge_change_mtu(struct net_device *dev, int new_mtu)
  {
  	struct myri10ge_priv *mgp = netdev_priv(dev);
@@ -3878,7 +3711,6 @@ static const struct net_device_ops 
myri10ge_netdev_ops = {
  	.ndo_get_stats64	= myri10ge_get_stats,
  	.ndo_validate_addr	= eth_validate_addr,
  	.ndo_change_mtu		= myri10ge_change_mtu,
-	.ndo_fix_features	= myri10ge_fix_features,
  	.ndo_set_rx_mode	= myri10ge_set_multicast_list,
  	.ndo_set_mac_address	= myri10ge_set_mac_address,
  };
@@ -4018,7 +3850,7 @@ static int myri10ge_probe(struct pci_dev *pdev, 
const struct pci_device_id *ent)

  	netdev->netdev_ops = &myri10ge_netdev_ops;
  	netdev->mtu = myri10ge_initial_mtu;
-	netdev->hw_features = mgp->features | NETIF_F_LRO | NETIF_F_RXCSUM;
+	netdev->hw_features = mgp->features | NETIF_F_RXCSUM;
  	netdev->features = netdev->hw_features;

  	if (dac_enabled)
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH v3 net-next 2/2] myri10ge: Add vlan rx for better GRO perf.
From: Andrew Gallatin @ 2012-11-29 21:07 UTC (permalink / raw)
  To: David Miller; +Cc: netdev



Unlike LRO, GRO requires that vlan tags be removed before
aggregation can occur.  Since the myri10ge NIC does not support
hardware vlan tag offload, we must remove the tag in the driver
to achieve performance comparable to LRO for vlan tagged frames.

Thanks to Eric Duzamet for his help simplifying the original patch.

Signed-off-by: Andrew Gallatin <gallatin@myri.com>
---
  drivers/net/ethernet/myricom/myri10ge/myri10ge.c |   41 
++++++++++++++++++++++
  1 file changed, 41 insertions(+)

diff --git a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c 
b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
index 84207c0..2fc984a 100644
--- a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
+++ b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
@@ -1264,6 +1264,42 @@ myri10ge_unmap_rx_page(struct pci_dev *pdev,
  	}
  }

+/*
+ * GRO does not support acceleration of tagged vlan frames, and
+ * this NIC does not support vlan tag offload, so we must pop
+ * the tag ourselves to be able to achieve GRO performance that
+ * is comparable to LRO.
+ */
+
+static inline void
+myri10ge_vlan_rx(struct net_device *dev, void *addr, struct sk_buff *skb)
+{
+	u8 *va;
+	struct vlan_ethhdr *veh;
+	struct skb_frag_struct *frag;
+	__wsum vsum;
+
+	va = addr;
+	va += MXGEFW_PAD;
+	veh = (struct vlan_ethhdr *)va;
+	if ((dev->features & NETIF_F_HW_VLAN_RX) == NETIF_F_HW_VLAN_RX &&
+	    veh->h_vlan_proto == ntohs(ETH_P_8021Q)) {
+		/* fixup csum if needed */
+		if (skb->ip_summed == CHECKSUM_COMPLETE) {
+			vsum = csum_partial(va + ETH_HLEN, VLAN_HLEN, 0);
+			skb->csum = csum_sub(skb->csum, vsum);
+		}
+		/* pop tag */
+		__vlan_hwaccel_put_tag(skb, ntohs(veh->h_vlan_TCI));
+		memmove(va + VLAN_HLEN, va, 2 * ETH_ALEN);
+		skb->len -= VLAN_HLEN;
+		skb->data_len -= VLAN_HLEN;
+		frag = skb_shinfo(skb)->frags;
+		frag->page_offset += VLAN_HLEN;
+		skb_frag_size_set(frag, skb_frag_size(frag) - VLAN_HLEN);
+	}
+}
+
  static inline int
  myri10ge_rx_done(struct myri10ge_slice_state *ss, int len, __wsum csum)
  {
@@ -1326,6 +1362,7 @@ myri10ge_rx_done(struct myri10ge_slice_state *ss, 
int len, __wsum csum)
  		skb->ip_summed = CHECKSUM_COMPLETE;
  		skb->csum = csum;
  	}
+	myri10ge_vlan_rx(mgp->dev, va, skb);
  	skb_record_rx_queue(skb, ss - &mgp->ss[0]);

  	napi_gro_frags(&ss->napi);
@@ -3851,6 +3888,10 @@ static int myri10ge_probe(struct pci_dev *pdev, 
const struct pci_device_id *ent)
  	netdev->netdev_ops = &myri10ge_netdev_ops;
  	netdev->mtu = myri10ge_initial_mtu;
  	netdev->hw_features = mgp->features | NETIF_F_RXCSUM;
+
+	/* fake NETIF_F_HW_VLAN_RX for good GRO performance */
+	netdev->hw_features |= NETIF_F_HW_VLAN_RX;
+
  	netdev->features = netdev->hw_features;

  	if (dac_enabled)
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH][RESEND] bonding: delete migrated IP addresses from the rlb hash table
From: Jay Vosburgh @ 2012-11-29 21:10 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: Andy Gospodarek, netdev, David S. Miller
In-Reply-To: <20121128144214.GB2485@midget.suse.cz>

Jiri Bohac <jbohac@suse.cz> wrote:

>Bonding in balance-alb mode records information from ARP packets
>passing through the bond in a hash table (rx_hashtbl).
>
>At certain situations (e.g. link change of a slave),
>rlb_update_rx_clients() will send out ARP packets to update ARP
>caches of other hosts on the network to achieve RX load
>balancing.
>
>The problem is that once an IP address is recorded in the hash
>table, it stays there indefinitely. If this IP address is
>migrated to a different host in the network, bonding still sends
>out ARP packets that poison other systems' ARP caches with
>invalid information.
>
>This patch solves this by looking at all incoming ARP packets,
>and checking if the source IP address is one of the source
>addresses stored in the rx_hashtbl. If it is, but the MAC
>addresses differ, the corresponding hash table entries are
>removed. Thus, when an IP address is migrated, the first ARP
>broadcast by its new owner will purge the offending entries of
>rx_hashtbl.
>
>The hash table is hashed by ip_dst. To be able to do the above
>check efficiently (not walking the whole hash table), we need a
>reverse mapping (by ip_src).
>
>I added three new members in struct rlb_client_info:
>   rx_hashtbl[x].src_first will point to the start of a list of
>      entries for which hash(ip_src) == x.
>   The list is linked with src_next and src_prev.
>
>When an incoming ARP packet arrives at rlb_arp_recv()
>rlb_purge_src_ip() can quickly walk only the entries on the
>corresponding lists, i.e. the entries that are likely to contain
>the offending IP address.
>
>To avoid confusion, I renamed these existing fields of struct 
>rlb_client_info:
>	next -> used_next
>	prev -> used_prev
>	rx_hashtbl_head -> rx_hashtbl_used_head
>
>(The current linked list is _not_ a list of hash table
>entries with colliding ip_dst. It's a list of entries that are
>being used; its purpose is to avoid walking the whole hash table
>when looking for used entries.)
>
>Signed-off-by: Jiri Bohac <jbohac@suse.cz>

Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* [RFC PATCH 0/2] Fix some multiqueue TUN problems
From: Paul Moore @ 2012-11-29 22:06 UTC (permalink / raw)
  To: netdev, linux-security-module, selinux; +Cc: jasowang

A relatively short patchset to fix some problems that have arisen from
the multiqueue TUN work.  I'm sending this a RFC for now as it changes
the LSM TUN interfaces and therefore crosses a few boundaries,
e.g. SELinux, although it does so only in an effort to restore the
original behavior which was lost in the multiqueue conversion.

I'm not particularly enthused with the idea of passing a void** as a
parameter to security_tun_dev_alloc_security() but the alternatives
would be to move the tun_struct out of tun.c or have the LSM interface
return a pointer (which would require us to return a "fake" non-NULL
pointer when the LSM was disabled) ... neither of these seemed like good
alternatives to me.

I've compiled the code and booted it without drama, but I haven't really
stressed it too much so buyer beware at this point.  Comments are always
welcome ...

---

Paul Moore (2):
      tun: correctly report an error in tun_flow_init()
      tun: fix LSM/SELinux labeling of tun/tap devices

 drivers/net/tun.c                 |   16 ++++++++++----
 include/linux/security.h          |   37 ++++++++++++++++++++++-----------
 security/capability.c             |   14 +++++++++---
 security/security.c               |   22 ++++++++++++-------
 security/selinux/hooks.c          |   42 +++++++++++++++++++++++--------------
 security/selinux/include/objsec.h |    4 ++++
 6 files changed, 90 insertions(+), 45 deletions(-)

^ permalink raw reply

* [RFC PATCH 1/2] tun: correctly report an error in tun_flow_init()
From: Paul Moore @ 2012-11-29 22:06 UTC (permalink / raw)
  To: netdev, linux-security-module, selinux; +Cc: jasowang
In-Reply-To: <20121129215724.30020.69464.stgit@sifl>

On error, the error code from tun_flow_init() is lost inside
tun_set_iff(), this patch fixes this by assigning the tun_flow_init()
error code to the "err" variable which is returned by
the tun_flow_init() function on error.

Signed-off-by: Paul Moore <pmoore@redhat.com>
---
 drivers/net/tun.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 607a3a5..877ffe2 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1605,7 +1605,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 
 		tun_net_init(dev);
 
-		if (tun_flow_init(tun))
+		err = tun_flow_init(tun);
+		if (err < 0)
 			goto err_free_dev;
 
 		dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |

^ permalink raw reply related

* [RFC PATCH 2/2] tun: fix LSM/SELinux labeling of tun/tap devices
From: Paul Moore @ 2012-11-29 22:06 UTC (permalink / raw)
  To: netdev, linux-security-module, selinux; +Cc: jasowang
In-Reply-To: <20121129215724.30020.69464.stgit@sifl>

This patch corrects some problems with LSM/SELinux that were introduced
with the multiqueue patchset.  The problem stems from the fact that the
multiqueue work changed the relationship between the tun device and its
associated socket; before the socket persisted for the life of the
device, however after the multiqueue changes the socket only persisted
for the life of the userspace connection (fd open).  For non-persistent
devices this is not an issue, but for persistent devices this can cause
the tun device to lose its SELinux label.

We correct this problem by adding an opaque LSM security blob to the
tun device struct which allows us to have the LSM security state, e.g.
SELinux labeling information, persist for the lifetime of the tun
device.

Signed-off-by: Paul Moore <pmoore@redhat.com>
---
 drivers/net/tun.c                 |   13 ++++++++---
 include/linux/security.h          |   37 ++++++++++++++++++++++-----------
 security/capability.c             |   14 +++++++++---
 security/security.c               |   22 ++++++++++++-------
 security/selinux/hooks.c          |   42 +++++++++++++++++++++++--------------
 security/selinux/include/objsec.h |    4 ++++
 6 files changed, 88 insertions(+), 44 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 877ffe2..85cc924 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -182,6 +182,7 @@ struct tun_struct {
 	struct hlist_head flows[TUN_NUM_FLOW_ENTRIES];
 	struct timer_list flow_gc_timer;
 	unsigned long ageing_time;
+	void *security;
 };
 
 static inline u32 tun_hashfn(u32 rxhash)
@@ -465,6 +466,10 @@ static int tun_attach(struct tun_struct *tun, struct file *file)
 	struct tun_file *tfile = file->private_data;
 	int err;
 
+	err = security_tun_dev_attach(tfile->socket.sk, tun->security);
+	if (err < 0)
+		goto out;
+
 	err = -EINVAL;
 	if (rcu_dereference_protected(tfile->tun, lockdep_rtnl_is_held()))
 		goto out;
@@ -1365,6 +1370,7 @@ static void tun_free_netdev(struct net_device *dev)
 	struct tun_struct *tun = netdev_priv(dev);
 
 	tun_flow_uninit(tun);
+	security_tun_dev_free_security(tun->security);
 	free_netdev(dev);
 }
 
@@ -1548,9 +1554,6 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 
 		if (tun_not_capable(tun))
 			return -EPERM;
-		err = security_tun_dev_attach(tfile->socket.sk);
-		if (err < 0)
-			return err;
 
 		err = tun_attach(tun, file);
 		if (err < 0)
@@ -1601,7 +1604,9 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 
 		spin_lock_init(&tun->lock);
 
-		security_tun_dev_post_create(&tfile->sk);
+		err = security_tun_dev_alloc_security(&tun->security);
+		if (err < 0)
+			goto err_free_dev;
 
 		tun_net_init(dev);
 
diff --git a/include/linux/security.h b/include/linux/security.h
index 05e88bd..260e151 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -983,17 +983,23 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  *	tells the LSM to decrement the number of secmark labeling rules loaded
  * @req_classify_flow:
  *	Sets the flow's sid to the openreq sid.
+ * @tun_dev_alloc_security:
+ *	This hook allows a module to allocate a security structure for a TUN
+ *	device.
+ *	@security pointer to a security structure pointer.
+ *	Returns a zero on success, negative values on failure.
+ * @tun_dev_free_security:
+ *	This hook allows a module to free the security structure for a TUN
+ *	device.
+ *	@security pointer to the TUN device's security structure
  * @tun_dev_create:
  *	Check permissions prior to creating a new TUN device.
- * @tun_dev_post_create:
- *	This hook allows a module to update or allocate a per-socket security
- *	structure.
- *	@sk contains the newly created sock structure.
  * @tun_dev_attach:
  *	Check permissions prior to attaching to a persistent TUN device.  This
  *	hook can also be used by the module to update any security state
  *	associated with the TUN device's sock structure.
  *	@sk contains the existing sock structure.
+ *	@security pointer to the TUN device's security structure.
  *
  * Security hooks for XFRM operations.
  *
@@ -1613,9 +1619,10 @@ struct security_operations {
 	void (*secmark_refcount_inc) (void);
 	void (*secmark_refcount_dec) (void);
 	void (*req_classify_flow) (const struct request_sock *req, struct flowi *fl);
-	int (*tun_dev_create)(void);
-	void (*tun_dev_post_create)(struct sock *sk);
-	int (*tun_dev_attach)(struct sock *sk);
+	int (*tun_dev_alloc_security) (void **security);
+	void (*tun_dev_free_security) (void *security);
+	int (*tun_dev_create) (void);
+	int (*tun_dev_attach) (struct sock *sk, void *security);
 #endif	/* CONFIG_SECURITY_NETWORK */
 
 #ifdef CONFIG_SECURITY_NETWORK_XFRM
@@ -2553,9 +2560,10 @@ void security_inet_conn_established(struct sock *sk,
 int security_secmark_relabel_packet(u32 secid);
 void security_secmark_refcount_inc(void);
 void security_secmark_refcount_dec(void);
+int security_tun_dev_alloc_security(void **security);
+void security_tun_dev_free_security(void *security);
 int security_tun_dev_create(void);
-void security_tun_dev_post_create(struct sock *sk);
-int security_tun_dev_attach(struct sock *sk);
+int security_tun_dev_attach(struct sock *sk, void *security);
 
 #else	/* CONFIG_SECURITY_NETWORK */
 static inline int security_unix_stream_connect(struct sock *sock,
@@ -2720,16 +2728,21 @@ static inline void security_secmark_refcount_dec(void)
 {
 }
 
-static inline int security_tun_dev_create(void)
+static inline int security_tun_dev_alloc_security(void **security)
 {
 	return 0;
 }
 
-static inline void security_tun_dev_post_create(struct sock *sk)
+static inline void security_tun_dev_free_security(void *security)
 {
 }
 
-static inline int security_tun_dev_attach(struct sock *sk)
+static inline int security_tun_dev_create(void)
+{
+	return 0;
+}
+
+static inline int security_tun_dev_attach(struct sock *sk, void *security)
 {
 	return 0;
 }
diff --git a/security/capability.c b/security/capability.c
index b14a30c..fd6e2dc 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -704,16 +704,21 @@ static void cap_req_classify_flow(const struct request_sock *req,
 {
 }
 
-static int cap_tun_dev_create(void)
+static int cap_tun_dev_alloc_security(void **security)
 {
 	return 0;
 }
 
-static void cap_tun_dev_post_create(struct sock *sk)
+static void cap_tun_dev_free_security(void *security)
+{
+}
+
+static int cap_tun_dev_create(void)
 {
+	return 0;
 }
 
-static int cap_tun_dev_attach(struct sock *sk)
+static int cap_tun_dev_attach(struct sock *sk, void *security)
 {
 	return 0;
 }
@@ -1044,8 +1049,9 @@ void __init security_fixup_ops(struct security_operations *ops)
 	set_to_cap_if_null(ops, secmark_refcount_inc);
 	set_to_cap_if_null(ops, secmark_refcount_dec);
 	set_to_cap_if_null(ops, req_classify_flow);
+	set_to_cap_if_null(ops, tun_dev_alloc_security);
+	set_to_cap_if_null(ops, tun_dev_free_security);
 	set_to_cap_if_null(ops, tun_dev_create);
-	set_to_cap_if_null(ops, tun_dev_post_create);
 	set_to_cap_if_null(ops, tun_dev_attach);
 #endif	/* CONFIG_SECURITY_NETWORK */
 #ifdef CONFIG_SECURITY_NETWORK_XFRM
diff --git a/security/security.c b/security/security.c
index 8dcd4ae..613ad36 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1244,21 +1244,27 @@ void security_secmark_refcount_dec(void)
 }
 EXPORT_SYMBOL(security_secmark_refcount_dec);
 
-int security_tun_dev_create(void)
+int security_tun_dev_alloc_security(void **security)
 {
-	return security_ops->tun_dev_create();
+	return security_ops->tun_dev_alloc_security(security);
 }
-EXPORT_SYMBOL(security_tun_dev_create);
+EXPORT_SYMBOL(security_tun_dev_alloc_security);
 
-void security_tun_dev_post_create(struct sock *sk)
+void security_tun_dev_free_security(void *security)
 {
-	return security_ops->tun_dev_post_create(sk);
+	security_ops->tun_dev_free_security(security);
 }
-EXPORT_SYMBOL(security_tun_dev_post_create);
+EXPORT_SYMBOL(security_tun_dev_free_security);
+
+int security_tun_dev_create(void)
+{
+	return security_ops->tun_dev_create();
+}
+EXPORT_SYMBOL(security_tun_dev_create);
 
-int security_tun_dev_attach(struct sock *sk)
+int security_tun_dev_attach(struct sock *sk, void *security)
 {
-	return security_ops->tun_dev_attach(sk);
+	return security_ops->tun_dev_attach(sk, security);
 }
 EXPORT_SYMBOL(security_tun_dev_attach);
 
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 61a5336..67b3423 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4414,40 +4414,49 @@ static int selinux_tun_dev_create(void)
 			    NULL);
 }
 
-static void selinux_tun_dev_post_create(struct sock *sk)
+static int selinux_tun_dev_alloc_security(void **security)
 {
-	struct sk_security_struct *sksec = sk->sk_security;
+	struct tun_security_struct *tunsec;
 
-	/* we don't currently perform any NetLabel based labeling here and it
-	 * isn't clear that we would want to do so anyway; while we could apply
-	 * labeling without the support of the TUN user the resulting labeled
-	 * traffic from the other end of the connection would almost certainly
-	 * cause confusion to the TUN user that had no idea network labeling
-	 * protocols were being used */
+	tunsec = kzalloc(sizeof(*tunsec), GFP_KERNEL);
+	if (!tunsec)
+		return -ENOMEM;
+	tunsec->sid = current_sid();
 
-	/* see the comments in selinux_tun_dev_create() about why we don't use
-	 * the sockcreate SID here */
+	*security = tunsec;
+	return 0;
+}
 
-	sksec->sid = current_sid();
-	sksec->sclass = SECCLASS_TUN_SOCKET;
+static void selinux_tun_dev_free_security(void *security)
+{
+	kfree(security);
 }
 
-static int selinux_tun_dev_attach(struct sock *sk)
+static int selinux_tun_dev_attach(struct sock *sk, void *security)
 {
+	struct tun_security_struct *tunsec = security;
 	struct sk_security_struct *sksec = sk->sk_security;
 	u32 sid = current_sid();
 	int err;
 
+	/* we don't currently perform any NetLabel based labeling here and it
+	 * isn't clear that we would want to do so anyway; while we could apply
+	 * labeling without the support of the TUN user the resulting labeled
+	 * traffic from the other end of the connection would almost certainly
+	 * cause confusion to the TUN user that had no idea network labeling
+	 * protocols were being used */
+
 	err = avc_has_perm(sid, sksec->sid, SECCLASS_TUN_SOCKET,
 			   TUN_SOCKET__RELABELFROM, NULL);
 	if (err)
 		return err;
-	err = avc_has_perm(sid, sid, SECCLASS_TUN_SOCKET,
+	err = avc_has_perm(sid, tunsec->sid, SECCLASS_TUN_SOCKET,
 			   TUN_SOCKET__RELABELTO, NULL);
 	if (err)
 		return err;
 
-	sksec->sid = sid;
+	sksec->sid = tunsec->sid;
+	sksec->sclass = SECCLASS_TUN_SOCKET;
 
 	return 0;
 }
@@ -5642,8 +5651,9 @@ static struct security_operations selinux_ops = {
 	.secmark_refcount_inc =		selinux_secmark_refcount_inc,
 	.secmark_refcount_dec =		selinux_secmark_refcount_dec,
 	.req_classify_flow =		selinux_req_classify_flow,
+	.tun_dev_alloc_security =	selinux_tun_dev_alloc_security,
+	.tun_dev_free_security =	selinux_tun_dev_free_security,
 	.tun_dev_create =		selinux_tun_dev_create,
-	.tun_dev_post_create = 		selinux_tun_dev_post_create,
 	.tun_dev_attach =		selinux_tun_dev_attach,
 
 #ifdef CONFIG_SECURITY_NETWORK_XFRM
diff --git a/security/selinux/include/objsec.h b/security/selinux/include/objsec.h
index 26c7eee..aa47bca 100644
--- a/security/selinux/include/objsec.h
+++ b/security/selinux/include/objsec.h
@@ -110,6 +110,10 @@ struct sk_security_struct {
 	u16 sclass;			/* sock security class */
 };
 
+struct tun_security_struct {
+	u32 sid;			/* SID for the tun device sockets */
+};
+
 struct key_security_struct {
 	u32 sid;	/* SID of key */
 };


^ permalink raw reply related

* Re: [PATCH v2 3/3] pppoatm: protect against freeing of vcc
From: David Woodhouse @ 2012-11-29 22:17 UTC (permalink / raw)
  To: chas williams - CONTRACTOR
  Cc: Krzysztof Mazur, David Laight, davem, netdev, linux-kernel,
	nathan
In-Reply-To: <20121129132902.04373f49@thirdoffive.cmf.nrl.navy.mil>

[-- Attachment #1: Type: text/plain, Size: 3894 bytes --]

On Thu, 2012-11-29 at 13:29 -0500, chas williams - CONTRACTOR wrote:
> On Thu, 29 Nov 2012 18:11:48 +0000
> David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > We do see the 'packet received for unknown VCC' complaint, after we
> > reboot the host without resetting the card. And as shown in the commit I
> > just referenced, we rely on the !ATM_VF_READY check in order to prevent
> > a use-after-free when the tasklet is sending packets up a VCC that's
> > just been closed.
> 
> well that behavior is just crap.  why is it delivering cells for a
> vpi/vci pair that is not open?

In the reboot case... Because it *was* open and the device has no way of
knowing that the host just rebooted. We don't reset the card on loading
the driver, because that would cause an ADSL resync.

Perhaps we could send a 'close all VCCs' command to the firmware though.
Nathan, could we add that to the firmware?

Or we could just respond to any unwanted incoming packet by sending a
close for that specific VCC. And be careful about potential races with
open().

In the close case... The *tasklet* is running to process an incoming
packet, finds an open and active VCC and is *about* to send a packet up
to it. Meanwhile, our close() runs and the VCC is destroyed. And then
the tasklet... oops, use-after-free and crash.

Hence commit 1f6ea6e51 which makes the tasklet refuse to process RX for
a VCC which doesn't have the ATM_VF_READY flag set. Because it knows
it's *being* closed. And the close() routine waits for any *existing*
tasklet run to finish, to make sure nobody's referencing the VCC, before
it returns and allows vcc_destroy_socket() to complete. It's the RCU
principle, basically.

> > You mean that (e.g.) pppoatm_push(vcc, NULL) should be waiting for any
> > pending TX skb (that has already been passed off to the driver) to
> > *complete*? How would it even do that?
> 
> i think the order of the vcc_destroy_socket() operations is a bit
> wrong.  it should call detach protocol (i.e. push a NULL).  this should
> cause the attached protocol to stop any future sends and receives, and
> it CAN sleep in this context (and only this context -- generally
> sending cannot sleep which is why this might seem confusing) to do
> whatever it needs to do to wait for the attached protocol to clean up
> queues, flush data etc.
> 
> then the driver close() should be called.  this takes care of cleaning
> up any pending tx or rx that is in the hardware.  and of course,
> close() can sleep since it will be in a interrutible context.

This is basically what Krzysztof's patch 1/7 was doing, which we've now
dropped from the series.

There are issues with *either* ordering.

The current case is that we call vcc->dev->ops->close(vcc) *first*,
before vcc->push(vcc, NULL). And that means that the device is told to
close the VCC while the protocol may still be pushing packets to it.
Hence the patches to both pppoatm and br2684 to make them check for
ATM_VF_READY and *stop* pushing packets if it's not set.

If we flip it round and tell the protocol first, then it tears down all
its data structures while the driver is still happily calling its
->pop() on transmitted skbs. Which leads to the panic in br2684_pop()
that we've *also* seen (because we weren't actually flushing the TX
packets in the driver's close(), which had the same effect). Yes, you
suggest that the protocol could keep track of the skbs it's sent down
and wait for them... but surely it's better to let the driver get called
first and *abort* them?

At this point, I think we're better off as we are (with Krzysztof's
patch 1/7 dropped, and leaving vcc->dev->ops->close() being called
before vcc->push(NULL). We've fairly much solved the issues with that
arrangement, by checking ATM_VF_READY in the protocols' ->push()
functions.

-- 
dwmw2

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

^ permalink raw reply

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
From: Jesper Dangaard Brouer @ 2012-11-29 22:17 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert
In-Reply-To: <20121129.124427.1093031685966728935.davem@davemloft.net>

On Thu, 2012-11-29 at 12:44 -0500, David Miller wrote:
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Thu, 29 Nov 2012 17:11:09 +0100
> 
> > The fragmentation evictor system have a very unfortunate eviction
> > system for killing fragment, when the system is put under pressure.
> > If packets are coming in too fast, the evictor code kills "warm"
> > fragments too quickly.  Resulting in a massive performance drop,
> > because we drop frag lists where we have already queue up a lot of
> > fragments/work, which gets killed before they have a chance to
> > complete.
> 
> I think this is a trade-off where the decision is somewhat
> arbitrary.
> 
> If you kill warm entries, the sending of all of the fragments is
> wasted.  If you retain warm entries and drop incoming new fragments,
> well then the sending of all of those newer fragments is wasted too.
> 
> The only way I could see this making sense is if some "probability
> of fulfillment" was taken into account.  For example, if you have
> more than half of the fragments already, then yes it may be
> advisable to retain the warm entry.

I disagree, once we have spend energy on allocation a frag queue and
adding just a single packet, while the entry is warn we must not drop
it.

I believe, we need the concept of "warn" entries.  Without it you scheme
is dangerous, as we would retain entries with missed/dropped fragments
too long.

> Otherwise, as I said, the decision seems arbitrary.

This patch/system actually includes a "promise/probability of
fulfillment". Let me explain.

We allow "warn" entries to complete, by allowing (new) fragments/packets
for these entries (present in the frag queue).  While we don't allow the
system to create new entries.  This creates the selection we interested
in (as we must drop some packets given the arrival rate bigger than the
processing rate).

(This might be hard to follow)
While blocking new frag queue entries, we are dropping packets.  When
fragments complete, the "window" opens up again.  Creating a frag queue
entry, in this situation, might be for a fragment which already have
lost some packets, thus wasting resources.  BUT its not such a big
problem, because our evictor system will quickly kill this fragment
queue, due to the LRU list (and the fq getting "cold").

I have through of keeping track of fragment id's of dropped fragments,
but is a scalability problem of its own to maintain this list.  And I'm
not sure how effective its going to be (as we already get some of the
effect "automatically", as descrived above).

> Let's take a step back and think about why this is happening at all.
> 
> I wonder how reasonable the high and low thresholds really are.  Even
> once you move them to per-cpu, I think the limits are far too small.

It is VERY important that you understand/realize that throwing more
memory at the problem is NOT the solution.  It a classical queue theory
situation where the arrival rate is bigger than the processing
capabilities. Thus, we must drop packets, or else the NIC will do it for
us... for fragments we need do this "dropping" more intelligent.

For example lets give a threshold of 2000 MBytes:

[root@dragon ~]# sysctl -w net/ipv4/ipfrag_high_thresh=$(((1024**2*2000)))
net.ipv4.ipfrag_high_thresh = 2097152000

[root@dragon ~]# sysctl -w net/ipv4/ipfrag_low_thresh=$(((1024**2*2000)-655350))
net.ipv4.ipfrag_low_thresh = 2096496650

4x10 Netperf adjusted output:
 Socket  Message  Elapsed      Messages
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec

 229376   65507   20.00      298685      0    7826.35
 212992           20.00          27              0.71

 229376   65507   20.00      366668      0    9607.71
 212992           20.00          13              0.34

 229376   65507   20.00      254790      0    6676.20
 212992           20.00          14              0.37

 229376   65507   20.00      309293      0    8104.33
 212992           20.00          15              0.39

Can we agree that the current evictor strategy is broken?

> I'm under the impression that it's common for skb->truesize for 1500
> MTU frames to be something rounded up to the next power of 2, so
> 2048 bytes, or something like that.  Then add in the sk_buff control
> overhead, as well as the inet_frag head.

(Plus, the size of the frag queue struct, which is also being accounted
for).

> So a 64K fragmented frame probably consumes close to 100K.
> 
> So once we have three 64K frames in flight, we're already over the
> high threshold and will start dropping things.
> 
> That's beyond stupid.

Yes, I would say the interval between ipfrag_high_thresh and
ipfrag_low_thresh seems quite a stupid default.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock
From: Jesper Dangaard Brouer @ 2012-11-29 22:33 UTC (permalink / raw)
  To: Eric Dumazet, David Laight
  Cc: David Miller, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert
In-Reply-To: <1354211659.3299.15.camel@edumazet-glaptop>

On Thu, 2012-11-29 at 09:54 -0800, Eric Dumazet wrote:
> On Thu, 2012-11-29 at 12:48 -0500, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Thu, 29 Nov 2012 09:43:24 -0800
> > 
> > > Use a schem with a hash table of 256 (or 1024) slots.
> > > 
> > > Each slot/bucket has : 
> > >   - Its own spinlock.
> > >   - List of items
> > >   - A limit of 5 (or so) elems in the list.
> > > 
> > > No more LRU, no more rehash (thanks to jhash and the random seed at boot
> > > or first frag created), no more reader-writer lock.
> > > 
> > > Use a percpu_counter to implement ipfrag_low_thresh/ipfrag_high_thresh
> > 
> > If we limit the chain sizes to 5 elements, there is no need for
> > any thresholds at all.
> 
> One element can hold about 100KB.
> 
> I guess some systems could have some worries if we consume 1024 * 5 *
> 100 KB

1024 * 5 * 100k = 512 MB  -- That's just crasy!
I guess the embedded guys is going to choke reading this! 

Look at what I have achieved with 256KBytes per CPU...

--Jesper

^ permalink raw reply

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
From: Eric Dumazet @ 2012-11-29 23:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert
In-Reply-To: <1354227470.11754.348.camel@localhost>

On Thu, 2012-11-29 at 23:17 +0100, Jesper Dangaard Brouer wrote:

> For example lets give a threshold of 2000 MBytes:
> 
> [root@dragon ~]# sysctl -w net/ipv4/ipfrag_high_thresh=$(((1024**2*2000)))
> net.ipv4.ipfrag_high_thresh = 2097152000
> 
> [root@dragon ~]# sysctl -w net/ipv4/ipfrag_low_thresh=$(((1024**2*2000)-655350))
> net.ipv4.ipfrag_low_thresh = 2096496650
> 
> 4x10 Netperf adjusted output:
>  Socket  Message  Elapsed      Messages
>  Size    Size     Time         Okay Errors   Throughput
>  bytes   bytes    secs            #      #   10^6bits/sec
> 
>  229376   65507   20.00      298685      0    7826.35
>  212992           20.00          27              0.71
> 
>  229376   65507   20.00      366668      0    9607.71
>  212992           20.00          13              0.34
> 
>  229376   65507   20.00      254790      0    6676.20
>  212992           20.00          14              0.37
> 
>  229376   65507   20.00      309293      0    8104.33
>  212992           20.00          15              0.39
> 
> Can we agree that the current evictor strategy is broken?

Not really, you drop packets because of another limit.

^ permalink raw reply

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
From: Eric Dumazet @ 2012-11-29 23:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert
In-Reply-To: <1354227470.11754.348.camel@localhost>

On Thu, 2012-11-29 at 23:17 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 2012-11-29 at 12:44 -0500, David Miller wrote:

> It is VERY important that you understand/realize that throwing more
> memory at the problem is NOT the solution.  It a classical queue theory
> situation where the arrival rate is bigger than the processing
> capabilities. Thus, we must drop packets, or else the NIC will do it for
> us... for fragments we need do this "dropping" more intelligent.

Thats the typical head/tail drop choice. There is no bad/good choice.

Implementing head drop or tail drop is not a matter of only dealing with
regular traffic. We also can face DOS attacks, with packets of different
sizes and have to choose a compromise.

For example, it seems we have no protection against an attack using
small frags (but big truesize). So a particular 'packet' could consume
all the truesize we allowed for the whole frags queue.

skb_try_coalesce() cant always deal with this kind of thing.

Basically, thats why using frags in the first place is a bad choice.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox