Netdev List
 help / color / mirror / Atom feed
* Re: Configuring ethernet link fails with No such device
From: David Miller @ 2016-04-12 15:44 UTC (permalink / raw)
  To: bob.ham
  Cc: fabio.estevam, systemd-devel, netdev, bryan.wu, u.kleine-koenig,
	l.stach
In-Reply-To: <1460451492.6333.6.camel@collabora.com>

From: Bob Ham <bob.ham@collabora.com>
Date: Tue, 12 Apr 2016 09:58:12 +0100

> On Mon, 2016-04-11 at 15:46 -0700, Stefan Agner wrote:
> 
>> Or in other words: Is this a Kernel or systemd issue?
> 
> From what I recall, both; an issue with the FEC driver, and issues in
> systemd/udevd's handling of link-level settings.

This is my impression of the situation as well.
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel

^ permalink raw reply

* Re: pull request: bluetooth-next 2016-04-12
From: David Miller @ 2016-04-12 15:58 UTC (permalink / raw)
  To: johan.hedberg; +Cc: netdev, linux-bluetooth
In-Reply-To: <20160412080815.GA17516@t440s>

From: Johan Hedberg <johan.hedberg@gmail.com>
Date: Tue, 12 Apr 2016 11:08:15 +0300

> Here's a set of Bluetooth & 802.15.4 patches intended for the 4.7 kernel:
> 
>  - Fix for race condition in vhci driver
>  - Memory leak fix for ieee802154/adf7242 driver
>  - Improvements to deal with single-mode (LE-only) Bluetooth controllers
>  - Fix for allowing the BT_SECURITY_FIPS security level
>  - New BCM2E71 ACPI ID
>  - NULL pointer dereference fix fox hci_ldisc driver
> 
> Let me know if there are any issues pulling. Thanks.

Pulled, thanks Johan.

^ permalink raw reply

* Re: [RFC v5 0/5] Add virtio transport for AF_VSOCK
From: Ian Campbell @ 2016-04-12 16:07 UTC (permalink / raw)
  To: Michael S. Tsirkin, Stefan Hajnoczi
  Cc: marius vlad, Stefan Hajnoczi, kvm, netdev, virtualization,
	Claudio Imbrenda, Matt Benjamin, Greg Kurz, Christoffer Dall
In-Reply-To: <20160411154517-mutt-send-email-mst@redhat.com>

Some how Stefan's reply disapeared from my INBOX (although I did see
it) so replying here.

On Mon, 2016-04-11 at 15:54 +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 11, 2016 at 11:45:48AM +0100, Stefan Hajnoczi wrote:
> > 
> > On Fri, Apr 08, 2016 at 04:35:05PM +0100, Ian Campbell wrote:
> > > 
> > > On Fri, 2016-04-01 at 15:23 +0100, Stefan Hajnoczi wrote:
> > > > 
> > > > This series is based on Michael Tsirkin's vhost branch (v4.5-rc6).
> > > > 
> > > > I'm about to process Claudio Imbrenda's locking fixes for virtio-vsock but
> > > > first I want to share the latest version of the code.  Several people are
> > > > playing with vsock now so sharing the latest code should avoid duplicate work.
> > > Thanks for this, I've been using it in my project and it mostly seems
> > > fine.
> > > 
> > > One wrinkle I came across, which I'm not sure if it is by design or a
> > > problem is that I can see this sequence coming from the guest (with
> > > other activity in between):
> > > 
> > >     1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
> > >     2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
> > >     3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX
> > > 
> > > I orignally had my backend close things down at #2, however this meant
> > > that when #3 arrived it was for a non-existent socket (or, worse, an
> > > active one if the ports got reused). I checked v5 of the spec
> > > proposal[0] which says:
> > >     If these bits are set and there are no more virtqueue buffers
> > >     pending the socket is disconnected.
> > > 
> > > but I'm not entirely sure if this behaviour contradicts this or not
> > > (the bits have both been set at #2, but not at the same time).
> > > 
> > > BTW, how does one tell if there are no more virtqueue buffers pending
> > > or not while processing the op?
> > #2 is odd.  The shutdown bits are sticky so they cannot be cleared once
> > set.  I would have expected just #1 and #3.  The behavior you observe
> > look like a bug.
> > 
> > The spec text does not convey the meaning of OP_SHUTDOWN well.
> > OP_SHUTDOWN SHUTDOWN_TX|SHUTDOWN_RX means no further rx/tx is possible
> > for this connection.  "there are no more virtqueue buffers pending the
> > socket" really means that this isn't an immediate close from the
> > perspective of the application.  If the application still has unread rx
> > buffers then the socket stays readable until the rx data has been fully
> > read.

Thanks, distinguishing the local buffer to the application from the
vring would make that clearer. Perhaps by not talking about "virtqueue
buffers" since they sound like a vring thing.

However, as Michael observes I'm not sure that's the whole story.

> Yes but you also wrote:
> 	If these bits are set and there are no more virtqueue buffers
> 	pending the socket is disconnected.
> 
> how does remote know that there are no buffers pending and so it's safe
> to reuse the same source/destination address now?

Indeed this is one of the things I struggled with. e.g. If I send a
SHUTDOWN_RX to my peer am I supposed to wait for that buffer to come
back (so I know the peer has seen it) and then wait for an entire
"cycle" of the TX ring to know there is nothing still in flight? That's
some tricky book-keeping.

>   Maybe destination
> should send RST at that point?

i.e. upon receipt of SHUTDOWN_RX|SHUTDOWN_TX from the peer you are
expected to send a RST. When the peer observes that then they know
there is no further data in that connection on the ring?

That sounds like it would be helpful.

> > > Another thing I noticed, which is really more to do with the generic
> > > AF_VSOCK bits than anything to do with your patches is that there is no
> > > limitations on which vsock ports a non-privileged user can bind to and
> > > relatedly that there is no netns support so e.g. users in unproivileged
> > > containers can bind to any vsock port and talk to the host, which might
> > > be undesirable. For my use for now I just went with the big hammer
> > > approach of denying access from anything other than init_net
> > > namespace[1] while I consider what the right answer is.
> > From the vhost point of view each netns should have its own AF_VSOCK
> > namespace.  This way two containers could act as "the host" (CID 2) for
> > their respective guests.

When you say "should" you mean that's the intended design as opposed to
what the current code is actually doing, right?

Ian.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [RFC PATCH net-next] sunvnet: Add ethtool support for netdev stastics
From: Atish Patra @ 2016-04-12 16:19 UTC (permalink / raw)
  To: netdev; +Cc: aaron.young, sowmini.varadhan, atish.patra

Currently, sunvnet does not support statstics display
via sysfs entry using ethtool.

This patch enables the basic support in sunvnet that
displays the all the netdev stastics for the specific
interface using ethtool. Other sunvnet statstics and
support for ldmvsw will be added later.

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 drivers/net/ethernet/sun/sunvnet.c        |   26 +++++++++++++
 drivers/net/ethernet/sun/sunvnet_common.c |   47 ++++++++++++++++++++++++
 drivers/net/ethernet/sun/sunvnet_common.h |   57 +++++++++++++++++++++++++++++
 3 files changed, 130 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index a2f9b47..7c8aa98 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -77,10 +77,36 @@ static void vnet_set_msglevel(struct net_device *dev, u32 value)
 	vp->msg_enable = value;
 }
 
+static int vnet_get_sset_count(struct net_device *netdev, int sset)
+{
+	int scount;
+
+	scount = sunvnet_get_sset_count_common(netdev, sset,
+					       SUNVNET_ETHTOOL);
+	return scount;
+}
+
+static void vnet_get_ethtool_stats(struct net_device *netdev,
+				   struct ethtool_stats *stats, u64 *data)
+{
+	sunvnet_get_ethtool_stats_common(netdev, stats,
+					 data, SUNVNET_ETHTOOL);
+}
+
+static void vnet_get_strings(struct net_device *netdev,
+			     u32 stringset, u8 *data)
+{
+	sunvnet_get_strings_common(netdev, stringset,
+				   data, SUNVNET_ETHTOOL);
+}
+
 static const struct ethtool_ops vnet_ethtool_ops = {
 	.get_drvinfo		= vnet_get_drvinfo,
 	.get_msglevel		= vnet_get_msglevel,
 	.set_msglevel		= vnet_set_msglevel,
+	.get_strings            = vnet_get_strings,
+	.get_ethtool_stats      = vnet_get_ethtool_stats,
+	.get_sset_count         = vnet_get_sset_count,
 	.get_link		= ethtool_op_get_link,
 };
 
diff --git a/drivers/net/ethernet/sun/sunvnet_common.c b/drivers/net/ethernet/sun/sunvnet_common.c
index 904a5a1..2781740 100644
--- a/drivers/net/ethernet/sun/sunvnet_common.c
+++ b/drivers/net/ethernet/sun/sunvnet_common.c
@@ -1730,3 +1730,50 @@ void sunvnet_port_rm_txq_common(struct vnet_port *port)
 						port->q_index));
 }
 EXPORT_SYMBOL_GPL(sunvnet_port_rm_txq_common);
+
+int sunvnet_get_sset_count_common(struct net_device *netdev,
+				  int sset, unsigned int drv_type)
+{
+	switch (sset) {
+	case ETH_SS_STATS:
+		return VNET_NETDEV_STATS_LEN;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+EXPORT_SYMBOL_GPL(sunvnet_get_sset_count_common);
+
+void sunvnet_get_ethtool_stats_common(struct net_device *netdev,
+				      struct ethtool_stats *stats, u64 *data,
+				      unsigned int drv_type)
+{
+	int i = 0;
+	int j;
+	char *p;
+
+	for (j = 0; j < VNET_NETDEV_STATS_LEN; j++) {
+		p = (char *)netdev + vnet_gstrings_net_stats[j].stat_offset;
+		data[i++] = (vnet_gstrings_net_stats[j].sizeof_stat ==
+			     sizeof(u64)) ? *(u64 *)p : *(u32 *)p;
+	}
+}
+EXPORT_SYMBOL_GPL(sunvnet_get_ethtool_stats_common);
+
+void sunvnet_get_strings_common(struct net_device *netdev,
+				u32 stringset, u8 *data,
+				unsigned int drv_type)
+{
+	u8 *p = data;
+	int i;
+
+	switch (stringset) {
+	case ETH_SS_STATS:
+		for (i = 0; i < VNET_NETDEV_STATS_LEN; i++) {
+			snprintf(p, ETH_GSTRING_LEN, "%s",
+				 vnet_gstrings_net_stats[i].stat_string);
+			p += ETH_GSTRING_LEN;
+		}
+		break;
+	}
+}
+EXPORT_SYMBOL_GPL(sunvnet_get_strings_common);
diff --git a/drivers/net/ethernet/sun/sunvnet_common.h b/drivers/net/ethernet/sun/sunvnet_common.h
index bd36528..9618d42 100644
--- a/drivers/net/ethernet/sun/sunvnet_common.h
+++ b/drivers/net/ethernet/sun/sunvnet_common.h
@@ -72,6 +72,55 @@ struct vnet_port {
 	u16			q_index;
 };
 
+/* Ethtool support strings and macros */
+enum {
+	SUNVNET_ETHTOOL,
+	VSW_ETHTOOL
+};
+
+struct vnet_stats {
+	char stat_string[ETH_GSTRING_LEN];
+	u8 sizeof_stat;
+	u64 stat_offset;
+};
+
+#define VNET_STAT(_type, _name, _stat) { \
+	.stat_string = _name, \
+	.sizeof_stat = FIELD_SIZEOF(_type, _stat), \
+	.stat_offset = offsetof(_type, _stat) \
+}
+
+#define VNET_NETDEV_STAT(_name, _net_stat) \
+		VNET_STAT(struct net_device, _name, _net_stat)
+
+static const struct vnet_stats vnet_gstrings_net_stats[] = {
+	VNET_NETDEV_STAT("rx_packets", stats.rx_packets),
+	VNET_NETDEV_STAT("tx_packets", stats.tx_packets),
+	VNET_NETDEV_STAT("rx_bytes", stats.rx_bytes),
+	VNET_NETDEV_STAT("tx_bytes", stats.tx_bytes),
+	VNET_NETDEV_STAT("rx_errors", stats.rx_errors),
+	VNET_NETDEV_STAT("tx_errors", stats.tx_errors),
+	VNET_NETDEV_STAT("rx_dropped", stats.rx_dropped),
+	VNET_NETDEV_STAT("tx_dropped", stats.tx_dropped),
+	VNET_NETDEV_STAT("multicast", stats.multicast),
+	VNET_NETDEV_STAT("collisions", stats.collisions),
+	VNET_NETDEV_STAT("rx_length_errors", stats.rx_length_errors),
+	VNET_NETDEV_STAT("rx_over_errors", stats.rx_over_errors),
+	VNET_NETDEV_STAT("rx_crc_errors", stats.rx_crc_errors),
+	VNET_NETDEV_STAT("rx_frame_errors", stats.rx_frame_errors),
+	VNET_NETDEV_STAT("rx_fifo_errors", stats.rx_fifo_errors),
+	VNET_NETDEV_STAT("rx_missed_errors", stats.rx_missed_errors),
+	VNET_NETDEV_STAT("tx_aborted_errors", stats.tx_aborted_errors),
+	VNET_NETDEV_STAT("tx_carrier_errors", stats.tx_carrier_errors),
+	VNET_NETDEV_STAT("tx_fifo_errors", stats.tx_fifo_errors),
+	VNET_NETDEV_STAT("tx_heartbeat_errors", stats.tx_heartbeat_errors),
+	VNET_NETDEV_STAT("tx_window_errors", stats.tx_window_errors),
+	VNET_NETDEV_STAT("rx_compressed", stats.rx_compressed),
+	VNET_NETDEV_STAT("tx_compressed", stats.tx_compressed),
+};
+
+#define VNET_NETDEV_STATS_LEN ARRAY_SIZE(vnet_gstrings_net_stats)
+
 static inline struct vnet_port *to_vnet_port(struct vio_driver_state *vio)
 {
 	return container_of(vio, struct vnet_port, vio);
@@ -141,5 +190,13 @@ void sunvnet_port_free_tx_bufs_common(struct vnet_port *port);
 bool sunvnet_port_is_up_common(struct vnet_port *vnet);
 void sunvnet_port_add_txq_common(struct vnet_port *port);
 void sunvnet_port_rm_txq_common(struct vnet_port *port);
+int sunvnet_get_sset_count_common(struct net_device *netdev, int sset,
+				  unsigned int drv_type);
+void sunvnet_get_ethtool_stats_common(struct net_device *netdev,
+				      struct ethtool_stats *stats, u64 *data,
+				      unsigned int drv_type);
+void sunvnet_get_strings_common(struct net_device *netdev,
+				u32 stringset, u8 *data,
+				unsigned int drv_type);
 
 #endif /* _SUNVNETCOMMON_H */
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH net v3] net: sched: do not requeue a NULL skb
From: Eric Dumazet @ 2016-04-12 16:23 UTC (permalink / raw)
  To: Lars Persson; +Cc: netdev, jhs, linux-kernel, xiyou.wangcong, Lars Persson
In-Reply-To: <1460443552-26710-1-git-send-email-larper@axis.com>

On Tue, 2016-04-12 at 08:45 +0200, Lars Persson wrote:
> A failure in validate_xmit_skb_list() triggered an unconditional call
> to dev_requeue_skb with skb=NULL. This slowly grows the queue
> discipline's qlen count until all traffic through the queue stops.

> 
> Fixes: 55a93b3ea780 ("qdisc: validate skb without holding lock")
> Signed-off-by: Lars Persson <larper@axis.com>
> ---


Acked-by: Eric Dumazet <edumazet@google.com>

Thanks !

^ permalink raw reply

* [PATCH 4.6 fix] bgmac: fix MAC soft-reset bit for corerev > 4
From: Rafał Miłecki @ 2016-04-12 16:27 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Hauke Mehrtens, Felix Fietkau, Rafał Miłecki

From: Felix Fietkau <nbd@openwrt.org>

Only core revisions older than 4 use BGMAC_CMDCFG_SR_REV0. This mainly
fixes support for BCM4708A0KF SoCs with Ethernet core rev 5 (it means
only some devices as most of BCM4708A0KF-s got core rev 4).
This was tested for regressions on BCM47094 which doesn't seem to care
which bit gets used.

Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Rafał Miłecki <zajec5@gmail.com>
---
If it's OK for everyone, I suggest taking this patch into net.git repo.
It obviously can't regress old devices and was well tested on new ones
by OpenWrt users.
---
 drivers/net/ethernet/broadcom/bgmac.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bgmac.h b/drivers/net/ethernet/broadcom/bgmac.h
index 4fbb093..9a03c14 100644
--- a/drivers/net/ethernet/broadcom/bgmac.h
+++ b/drivers/net/ethernet/broadcom/bgmac.h
@@ -199,9 +199,9 @@
 #define  BGMAC_CMDCFG_TAI			0x00000200
 #define  BGMAC_CMDCFG_HD			0x00000400	/* Set if in half duplex mode */
 #define  BGMAC_CMDCFG_HD_SHIFT			10
-#define  BGMAC_CMDCFG_SR_REV0			0x00000800	/* Set to reset mode, for other revs */
-#define  BGMAC_CMDCFG_SR_REV4			0x00002000	/* Set to reset mode, only for core rev 4 */
-#define  BGMAC_CMDCFG_SR(rev)  ((rev == 4) ? BGMAC_CMDCFG_SR_REV4 : BGMAC_CMDCFG_SR_REV0)
+#define  BGMAC_CMDCFG_SR_REV0			0x00000800	/* Set to reset mode, for core rev 0-3 */
+#define  BGMAC_CMDCFG_SR_REV4			0x00002000	/* Set to reset mode, for core rev >= 4 */
+#define  BGMAC_CMDCFG_SR(rev)  ((rev >= 4) ? BGMAC_CMDCFG_SR_REV4 : BGMAC_CMDCFG_SR_REV0)
 #define  BGMAC_CMDCFG_ML			0x00008000	/* Set to activate mac loopback mode */
 #define  BGMAC_CMDCFG_AE			0x00400000
 #define  BGMAC_CMDCFG_CFE			0x00800000
-- 
1.8.4.5

^ permalink raw reply related

* Re: [RFC v5 0/5] Add virtio transport for AF_VSOCK
From: Ian Campbell @ 2016-04-12 16:37 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: marius vlad, kvm, Michael S. Tsirkin, ijc, virtualization,
	Claudio Imbrenda, Matt Benjamin, netdev, Greg Kurz,
	Christoffer Dall
In-Reply-To: <20160412135957.GB29405@stefanha-x1.localdomain>


[-- Attachment #1.1: Type: text/plain, Size: 6431 bytes --]

For some reason your mails in this thread only appear in the gmail web UI
and not on the IMAP version of my mailbox (my own and Michael's mails are
fine).

So I'm replying via the web interface, sorry for the inevitable formatting
mess :-/

I've CCd another mailbox in the hopes of getting your mails in that IMAP
folder instead/aswell so I can avoid this next time.

On 12 April 2016 at 14:59, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> > > > One wrinkle I came across, which I'm not sure if it is by design or a
> > > > problem is that I can see this sequence coming from the guest (with
> > > > other activity in between):
> > > >
> > > >     1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
> > > >     2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
> > > >     3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX
>
> How did you trigger this sequence?  I'd like to reproduce it.
>

Nothing magic. I've written some logging into my backend and captured the
result for a simple backend initiated connection.

In the log "TX" and "RX" indicate the thread doing the processing (with
"TX" being the one which processes the guest's TX ring, i.e. data coming
from the guest to the host). "<=" indicates a buffer going from guest to
host and "=>" is from host to guest. NB that guest to host replies are
queued synchronously by the TX thread onto the RX ring which is why the
somewhat odd looking "TX: =>" combination can occur. A host initiated
connection also happens from the TX thread in the same way.

The trace is of a simple request response (which both fit in one buffer in
each direction), the lines without an "?X:" prefix are my
annotations/guesses as to what is going on:

TX: =>SRC:00000002.00010002 DST:00000003.00000948
TX:   LEN:00000000 TYPE:0001 OP:1=REQUEST
TX:  FLAGS:00000000 BUF_ALLOC:00008000 FWD_CNT:00000000

TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX:   LEN:00000000 TYPE:0001 OP:2=RESPONSE
TX:  FLAGS:00000000 BUF_ALLOC:00040000 FWD_CNT:00000000

REQUEST + RESPONSE == Channel open successfully

RX: =>SRC:00000002.00010002 DST:00000003.00000948
RX:   LEN:0000005e TYPE:0001 OP:5=RW
RX:  FLAGS:00000000 BUF_ALLOC:00008000 FWD_CNT:00000000

Host sends a request to the guest

TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX:   LEN:00000000 TYPE:0001 OP:6=CREDIT_UPDATE
TX:  FLAGS:00000000 BUF_ALLOC:00040000 FWD_CNT:0000005e

Guest replies with a credit update

TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX:   LEN:00000091 TYPE:0001 OP:5=RW
TX:  FLAGS:00000000 BUF_ALLOC:00040000 FWD_CNT:0000005e

Guest replies with the answer to the request

RX: =>SRC:00000002.00010002 DST:00000003.00000948
RX:   LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
RX:  FLAGS:00000002 BUF_ALLOC:00008000 FWD_CNT:00000091

Host has sent its only request, so host app must have done
shutdown(SHUT_WR) I suppose and host therefore sends SHUTDOWN_TX.

TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX:   LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
TX:  FLAGS:00000001 BUF_ALLOC:00040000 FWD_CNT:0000005e

Guest SHUTDOWN_RX. I'm not sure if this is a direct kernel response to the
SHUTDOWN_TX or if the application inside the guest saw an EOF when reading
the socket and did the corresponding shutdown(SHUT_RD).

TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX:   LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
TX:  FLAGS:00000002 BUF_ALLOC:00040000 FWD_CNT:0000005e

Guest SHUTDOWN_TX, I presume that having sent the only response it is going
to it then does shutdown(SHUT_WR).

TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX:   LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
TX:  FLAGS:00000003 BUF_ALLOC:00040000 FWD_CNT:0000005e

Guest shuts down both directions.

Perhaps the guest end is turning shutdown(foo) directly into a vsock
message without or-ing in the current state?

> > > I orignally had my backend close things down at #2, however this meant
> > > > that when #3 arrived it was for a non-existent socket (or, worse, an
> > > > active one if the ports got reused). I checked v5 of the spec
> > > > proposal[0] which says:
> > > >     If these bits are set and there are no more virtqueue buffers
> > > >     pending the socket is disconnected.
> > > >
> > > > but I'm not entirely sure if this behaviour contradicts this or not
> > > > (the bits have both been set at #2, but not at the same time).
> > > >
> > > > BTW, how does one tell if there are no more virtqueue buffers pending
> > > > or not while processing the op?
> > >
> > > #2 is odd.  The shutdown bits are sticky so they cannot be cleared once
> > > set.  I would have expected just #1 and #3.  The behavior you observe
> > > look like a bug.
> > >
> > > The spec text does not convey the meaning of OP_SHUTDOWN well.
> > > OP_SHUTDOWN SHUTDOWN_TX|SHUTDOWN_RX means no further rx/tx is possible
> > > for this connection.  "there are no more virtqueue buffers pending the
> > > socket" really means that this isn't an immediate close from the
> > > perspective of the application.  If the application still has unread rx
> > > buffers then the socket stays readable until the rx data has been fully
> > > read.
> >
> > Yes but you also wrote:
> >       If these bits are set and there are no more virtqueue buffers
> >       pending the socket is disconnected.
> >
> > how does remote know that there are no buffers pending and so it's safe
> > to reuse the same source/destination address now?  Maybe destination
> > should send RST at that point?
>
> You are right, the source/destination address could be reused while the
> remote still has the connection in their table.  Connection
> establishment would fail with a RST reply.
>
> I can think of two solutions:
>
> 1. Implementations must remove connections from their table as soon as
>    SHUTDOWN_TX|SHUTDOWN_RX is received.  This way the source/destination
>    address tuple can be reused immediately, i.e. new connections with
>    the same source/destination would be possible while an application is
>    still draining the receive buffers of an old connection.
>
> 2. Extend the connection lifecycle so that an A->B
>    SHUTDOWN_TX|SHUTDOWN_RX must be followed by a by a B->A RST to close
>    a connection.  This way the source/destination address is only in use
>    once at a time.
>
> Option #2 seems safer because there is no overlap in source/destination
> address usage.
>

I agree (I hadn't seen this when I replied to Michael similar suggestion).

Ian. (sorry again for the formatting)

[-- Attachment #1.2: Type: text/html, Size: 8558 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: Inter-VRF routing on a single machine
From: David Ahern @ 2016-04-12 16:48 UTC (permalink / raw)
  To: Darwin Dingel, netdev
In-Reply-To: <CAAJwJQkfVWkdL+tOvcu5C9XBwTpNKcB8qvGeDUO32Tq6nW=xMw@mail.gmail.com>

On 4/12/16 4:09 AM, Darwin Dingel wrote:
> Hi All,
>
> Have anyone tried the following setup on a single machine with 2 TCP
> sockets on different VRF's and succeeded?
>
> - client_socket on VRF1
> - server_socket on VRF2
> - ip rules and iproutes for inter-VRF set up
> - client_socket sends TCP connect to server_socket. skb was sent using
> VRF1 interface
> - skb received in loopback interface

That is the key problem there.

> - TCP code got SYN but cannot route back to VRF1 to send ACK.
>
> I was wondering if this is a known limitation of VRF as of the moment,
> or could work with proper iprules/iproute.

In general local (within a single system) routing does not work with top 
of tree. e.g., within a VRF you can not connect sockets or ping a VRF 
local address. Inter-vrf connections within a system also do not work.

I have patches from our 4.1 kernel that I have rebased to top of tree. I 
hope to test and send those out in the next week or so. It addresses the 
first problem -- connections within a VRF. While it does not resolve 
your problem of connecting across VRFs within a system I think it is the 
foundation for how to fix it.

^ permalink raw reply

* Re: TCP reaching to maximum throughput after a long time
From: Yuchung Cheng @ 2016-04-12 17:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Machani, Yaniv, netdev, David S. Miller, Eric Dumazet,
	Neal Cardwell, Nandita Dukkipati, open list, Kama, Meirav
In-Reply-To: <1460472764.6473.589.camel@edumazet-glaptop3.roam.corp.google.com>

On Tue, Apr 12, 2016 at 7:52 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On Tue, 2016-04-12 at 12:17 +0000, Machani, Yaniv wrote:
> > Hi,
> > After updating from Kernel 3.14 to Kernel 4.4 we have seen a TCP performance degradation over Wi-Fi.
> > In 3.14 kernel, TCP got to its max throughout after less than a second, while in the 4.4  it is taking ~20-30 seconds.
> > UDP TX/RX and TCP RX performance is as expected.
> > We are using a Beagle Bone Black and a WiLink8 device.
> >
> > Were there any related changes that might cause such behavior ?
> > Kernel configuration and sysctl values were compared, but no significant differences have been found.
> >
> > See a log of the behavior below :
> > -----------------------------------------------------------
> > Client connecting to 10.2.46.5, TCP port 5001
> > TCP window size:  320 KByte (WARNING: requested  256 KByte)
> > ------------------------------------------------------------
> > [  3] local 10.2.46.6 port 49282 connected with 10.2.46.5 port 5001
> > [ ID] Interval       Transfer     Bandwidth
> > [  3]  0.0- 1.0 sec  5.75 MBytes  48.2 Mbits/sec
> > [  3]  1.0- 2.0 sec  6.50 MBytes  54.5 Mbits/sec
> > [  3]  2.0- 3.0 sec  6.50 MBytes  54.5 Mbits/sec
> > [  3]  3.0- 4.0 sec  6.50 MBytes  54.5 Mbits/sec
> > [  3]  4.0- 5.0 sec  6.75 MBytes  56.6 Mbits/sec
> > [  3]  5.0- 6.0 sec  3.38 MBytes  28.3 Mbits/sec
> > [  3]  6.0- 7.0 sec  6.38 MBytes  53.5 Mbits/sec
> > [  3]  7.0- 8.0 sec  6.88 MBytes  57.7 Mbits/sec
> > [  3]  8.0- 9.0 sec  7.12 MBytes  59.8 Mbits/sec
> > [  3]  9.0-10.0 sec  7.12 MBytes  59.8 Mbits/sec
> > [  3] 10.0-11.0 sec  7.12 MBytes  59.8 Mbits/sec
> > [  3] 11.0-12.0 sec  7.25 MBytes  60.8 Mbits/sec
> > [  3] 12.0-13.0 sec  7.12 MBytes  59.8 Mbits/sec
> > [  3] 13.0-14.0 sec  7.25 MBytes  60.8 Mbits/sec
> > [  3] 14.0-15.0 sec  7.62 MBytes  64.0 Mbits/sec
> > [  3] 15.0-16.0 sec  7.88 MBytes  66.1 Mbits/sec
> > [  3] 16.0-17.0 sec  8.12 MBytes  68.2 Mbits/sec
> > [  3] 17.0-18.0 sec  8.25 MBytes  69.2 Mbits/sec
> > [  3] 18.0-19.0 sec  8.50 MBytes  71.3 Mbits/sec
> > [  3] 19.0-20.0 sec  8.88 MBytes  74.4 Mbits/sec
> > [  3] 20.0-21.0 sec  8.75 MBytes  73.4 Mbits/sec
> > [  3] 21.0-22.0 sec  8.62 MBytes  72.4 Mbits/sec
> > [  3] 22.0-23.0 sec  8.75 MBytes  73.4 Mbits/sec
> > [  3] 23.0-24.0 sec  8.50 MBytes  71.3 Mbits/sec
> > [  3] 24.0-25.0 sec  8.62 MBytes  72.4 Mbits/sec
> > [  3] 25.0-26.0 sec  8.62 MBytes  72.4 Mbits/sec
> > [  3] 26.0-27.0 sec  8.62 MBytes  72.4 Mbits/sec
> >
>
> CC netdev, where this is better discussed.
>
> This could be a lot of different factors, and caused by a sender
> problem, a receiver problem, ...
>
> TCP behavior depends on the drivers, so maybe a change there can explain
> this.
>
> You could capture the first 5000 frames of the flow and post the pcap ?
> (-s 128 to capture only the headers)
pcap would be really helpful indeed. if possible please capture on
both 4.4 and 3.14 kernels.

>
> tcpdump -p -s 128 -i eth0 -c 5000 host 10.2.46.5 -w flow.pcap
>
>
> Also, while test is running, you could fetch
> ss -temoi dst 10.2.46.5:5001
>
>
>
>
>

^ permalink raw reply

* [PATCH net 0/2] Fixes for SO_REUSEPORT and mixed v4/v6 sockets
From: Craig Gallek @ 2016-04-12 17:11 UTC (permalink / raw)
  To: davem; +Cc: netdev

From: Craig Gallek <kraig@google.com>

Recent changes to the datastructures associated with SO_REUSEPORT broke
an existing behavior when equivalent SO_REUSEPORT sockets are created
using both AF_INET and AF_INET6.  This patch series restores the previous
behavior and includes a test to validate it.

This series should be a trivial merge to stable kernels (if deemed
necessary), but will have conflicts in net-next.  The following patches
recently replaced the use of hlist_nulls with hlists for UDP and TCP
socket lists:
ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")

If this series is accepted, I will send an RFC for the net-next change
to assist with the merge.

Craig Gallek (2):
  soreuseport: fix ordering for mixed v4/v6 sockets
  soreuseport: test mixed v4/v6 sockets

 include/linux/rculist_nulls.h                     |  39 ++++
 include/net/sock.h                                |   6 +-
 net/ipv4/udp.c                                    |   9 +-
 tools/testing/selftests/net/.gitignore            |   1 +
 tools/testing/selftests/net/Makefile              |   2 +-
 tools/testing/selftests/net/reuseport_dualstack.c | 208 ++++++++++++++++++++++
 6 files changed, 261 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/net/reuseport_dualstack.c

-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply

* [PATCH net 1/2] soreuseport: fix ordering for mixed v4/v6 sockets
From: Craig Gallek @ 2016-04-12 17:11 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1460481086-12982-1-git-send-email-kraigatgoog@gmail.com>

From: Craig Gallek <kraig@google.com>

With the SO_REUSEPORT socket option, it is possible to create sockets
in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
the AF_INET6 sockets.

Prior to the commits referenced below, an incoming IPv4 packet would
always be routed to a socket of type AF_INET when this mixed-mode was used.
After those changes, the same packet would be routed to the most recently
bound socket (if this happened to be an AF_INET6 socket, it would
have an IPv4 mapped IPv6 address).

The change in behavior occurred because the recent SO_REUSEPORT optimizations
short-circuit the socket scoring logic as soon as they find a match.  They
did not take into account the scoring logic that favors AF_INET sockets
over AF_INET6 sockets in the event of a tie.

To fix this problem, this patch changes the insertion order of AF_INET
and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
the tail of the list.  This will force AF_INET sockets to always be
considered first.

Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")

Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/rculist_nulls.h | 39 +++++++++++++++++++++++++++++++++++++++
 include/net/sock.h            |  6 +++++-
 net/ipv4/udp.c                |  9 +++++++--
 3 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
index 1c33dd7da4a7..4ae95f7e8597 100644
--- a/include/linux/rculist_nulls.h
+++ b/include/linux/rculist_nulls.h
@@ -98,6 +98,45 @@ static inline void hlist_nulls_add_head_rcu(struct hlist_nulls_node *n,
 	if (!is_a_nulls(first))
 		first->pprev = &n->next;
 }
+
+/**
+ * hlist_nulls_add_tail_rcu
+ * @n: the element to add to the hash list.
+ * @h: the list to add to.
+ *
+ * Description:
+ * Adds the specified element to the end of the specified hlist_nulls,
+ * while permitting racing traversals.  NOTE: tail insertion requires
+ * list traversal.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_nulls_add_head_rcu()
+ * or hlist_nulls_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_nulls_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs.  Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+static inline void hlist_nulls_add_tail_rcu(struct hlist_nulls_node *n,
+					struct hlist_nulls_head *h)
+{
+	struct hlist_nulls_node *i, *last = NULL;
+
+	for (i = hlist_nulls_first_rcu(h); !is_a_nulls(i);
+	     i = hlist_nulls_next_rcu(i))
+		last = i;
+
+	if (last) {
+		n->next = last->next;
+		n->pprev = &last->next;
+		rcu_assign_pointer(hlist_nulls_next_rcu(last), n);
+	} else {
+		hlist_nulls_add_head_rcu(n, h);
+	}
+}
+
 /**
  * hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type
  * @tpos:	the type * to use as a loop cursor.
diff --git a/include/net/sock.h b/include/net/sock.h
index 255d3e03727b..121ffc115c4f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -630,7 +630,11 @@ static inline void sk_add_node_rcu(struct sock *sk, struct hlist_head *list)
 
 static inline void __sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
 {
-	hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
+	if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
+	    sk->sk_family == AF_INET6)
+		hlist_nulls_add_tail_rcu(&sk->sk_nulls_node, list);
+	else
+		hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
 }
 
 static inline void sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 08eed5e16df0..a2e7f55a1f61 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -339,8 +339,13 @@ found:
 
 		hslot2 = udp_hashslot2(udptable, udp_sk(sk)->udp_portaddr_hash);
 		spin_lock(&hslot2->lock);
-		hlist_nulls_add_head_rcu(&udp_sk(sk)->udp_portaddr_node,
-					 &hslot2->head);
+		if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
+			sk->sk_family == AF_INET6)
+			hlist_nulls_add_tail_rcu(&udp_sk(sk)->udp_portaddr_node,
+						 &hslot2->head);
+		else
+			hlist_nulls_add_head_rcu(&udp_sk(sk)->udp_portaddr_node,
+						 &hslot2->head);
 		hslot2->count++;
 		spin_unlock(&hslot2->lock);
 	}
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH net 2/2] soreuseport: test mixed v4/v6 sockets
From: Craig Gallek @ 2016-04-12 17:11 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1460481086-12982-1-git-send-email-kraigatgoog@gmail.com>

From: Craig Gallek <kraig@google.com>

Test to validate the behavior of SO_REUSEPORT sockets that are
created with both AF_INET and AF_INET6.  See the commit prior to this
for a description of this behavior.

Signed-off-by: Craig Gallek <kraig@google.com>
---
 tools/testing/selftests/net/.gitignore            |   1 +
 tools/testing/selftests/net/Makefile              |   2 +-
 tools/testing/selftests/net/reuseport_dualstack.c | 208 ++++++++++++++++++++++
 3 files changed, 210 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/reuseport_dualstack.c

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index 69bb3fc38fb2..0840684deb7d 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -3,3 +3,4 @@ psock_fanout
 psock_tpacket
 reuseport_bpf
 reuseport_bpf_cpu
+reuseport_dualstack
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index c658792d47b4..0e5340742620 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -4,7 +4,7 @@ CFLAGS = -Wall -O2 -g
 
 CFLAGS += -I../../../../usr/include/
 
-NET_PROGS = socket psock_fanout psock_tpacket reuseport_bpf reuseport_bpf_cpu
+NET_PROGS = socket psock_fanout psock_tpacket reuseport_bpf reuseport_bpf_cpu reuseport_dualstack
 
 all: $(NET_PROGS)
 %: %.c
diff --git a/tools/testing/selftests/net/reuseport_dualstack.c b/tools/testing/selftests/net/reuseport_dualstack.c
new file mode 100644
index 000000000000..90958aaaafb9
--- /dev/null
+++ b/tools/testing/selftests/net/reuseport_dualstack.c
@@ -0,0 +1,208 @@
+/*
+ * It is possible to use SO_REUSEPORT to open multiple sockets bound to
+ * equivalent local addresses using AF_INET and AF_INET6 at the same time.  If
+ * the AF_INET6 socket has IPV6_V6ONLY set, it's clear which socket should
+ * receive a given incoming packet.  However, when it is not set, incoming v4
+ * packets should prefer the AF_INET socket(s).  This behavior was defined with
+ * the original SO_REUSEPORT implementation, but broke with
+ * e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
+ * This test creates these mixed AF_INET/AF_INET6 sockets and asserts the
+ * AF_INET preference for v4 packets.
+ */
+
+#define _GNU_SOURCE
+
+#include <arpa/inet.h>
+#include <errno.h>
+#include <error.h>
+#include <linux/in.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/epoll.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <unistd.h>
+
+static const int PORT = 8888;
+
+static void build_rcv_fd(int family, int proto, int *rcv_fds, int count)
+{
+	struct sockaddr_storage addr;
+	struct sockaddr_in  *addr4;
+	struct sockaddr_in6 *addr6;
+	int opt, i;
+
+	switch (family) {
+	case AF_INET:
+		addr4 = (struct sockaddr_in *)&addr;
+		addr4->sin_family = AF_INET;
+		addr4->sin_addr.s_addr = htonl(INADDR_ANY);
+		addr4->sin_port = htons(PORT);
+		break;
+	case AF_INET6:
+		addr6 = (struct sockaddr_in6 *)&addr;
+		addr6->sin6_family = AF_INET6;
+		addr6->sin6_addr = in6addr_any;
+		addr6->sin6_port = htons(PORT);
+		break;
+	default:
+		error(1, 0, "Unsupported family %d", family);
+	}
+
+	for (i = 0; i < count; ++i) {
+		rcv_fds[i] = socket(family, proto, 0);
+		if (rcv_fds[i] < 0)
+			error(1, errno, "failed to create receive socket");
+
+		opt = 1;
+		if (setsockopt(rcv_fds[i], SOL_SOCKET, SO_REUSEPORT, &opt,
+			       sizeof(opt)))
+			error(1, errno, "failed to set SO_REUSEPORT");
+
+		if (bind(rcv_fds[i], (struct sockaddr *)&addr, sizeof(addr)))
+			error(1, errno, "failed to bind receive socket");
+
+		if (proto == SOCK_STREAM && listen(rcv_fds[i], 10))
+			error(1, errno, "failed to listen on receive port");
+	}
+}
+
+static void send_from_v4(int proto)
+{
+	struct sockaddr_in  saddr, daddr;
+	int fd;
+
+	saddr.sin_family = AF_INET;
+	saddr.sin_addr.s_addr = htonl(INADDR_ANY);
+	saddr.sin_port = 0;
+
+	daddr.sin_family = AF_INET;
+	daddr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+	daddr.sin_port = htons(PORT);
+
+	fd = socket(AF_INET, proto, 0);
+	if (fd < 0)
+		error(1, errno, "failed to create send socket");
+
+	if (bind(fd, (struct sockaddr *)&saddr, sizeof(saddr)))
+		error(1, errno, "failed to bind send socket");
+
+	if (connect(fd, (struct sockaddr *)&daddr, sizeof(daddr)))
+		error(1, errno, "failed to connect send socket");
+
+	if (send(fd, "a", 1, 0) < 0)
+		error(1, errno, "failed to send message");
+
+	close(fd);
+}
+
+static int receive_once(int epfd, int proto)
+{
+	struct epoll_event ev;
+	int i, fd;
+	char buf[8];
+
+	i = epoll_wait(epfd, &ev, 1, -1);
+	if (i < 0)
+		error(1, errno, "epoll_wait failed");
+
+	if (proto == SOCK_STREAM) {
+		fd = accept(ev.data.fd, NULL, NULL);
+		if (fd < 0)
+			error(1, errno, "failed to accept");
+		i = recv(fd, buf, sizeof(buf), 0);
+		close(fd);
+	} else {
+		i = recv(ev.data.fd, buf, sizeof(buf), 0);
+	}
+
+	if (i < 0)
+		error(1, errno, "failed to recv");
+
+	return ev.data.fd;
+}
+
+static void test(int *rcv_fds, int count, int proto)
+{
+	struct epoll_event ev;
+	int epfd, i, test_fd;
+	uint16_t test_family;
+	socklen_t len;
+
+	epfd = epoll_create(1);
+	if (epfd < 0)
+		error(1, errno, "failed to create epoll");
+
+	ev.events = EPOLLIN;
+	for (i = 0; i < count; ++i) {
+		ev.data.fd = rcv_fds[i];
+		if (epoll_ctl(epfd, EPOLL_CTL_ADD, rcv_fds[i], &ev))
+			error(1, errno, "failed to register sock epoll");
+	}
+
+	send_from_v4(proto);
+
+	test_fd = receive_once(epfd, proto);
+	if (getsockopt(test_fd, SOL_SOCKET, SO_DOMAIN, &test_family, &len))
+		error(1, errno, "failed to read socket domain");
+	if (test_family != AF_INET)
+		error(1, 0, "expected to receive on v4 socket but got v6 (%d)",
+		      test_family);
+
+	close(epfd);
+}
+
+int main(void)
+{
+	int rcv_fds[32], i;
+
+	fprintf(stderr, "---- UDP IPv4 created before IPv6 ----\n");
+	build_rcv_fd(AF_INET, SOCK_DGRAM, rcv_fds, 5);
+	build_rcv_fd(AF_INET6, SOCK_DGRAM, &(rcv_fds[5]), 5);
+	test(rcv_fds, 10, SOCK_DGRAM);
+	for (i = 0; i < 10; ++i)
+		close(rcv_fds[i]);
+
+	fprintf(stderr, "---- UDP IPv6 created before IPv4 ----\n");
+	build_rcv_fd(AF_INET6, SOCK_DGRAM, rcv_fds, 5);
+	build_rcv_fd(AF_INET, SOCK_DGRAM, &(rcv_fds[5]), 5);
+	test(rcv_fds, 10, SOCK_DGRAM);
+	for (i = 0; i < 10; ++i)
+		close(rcv_fds[i]);
+
+	/* NOTE: UDP socket lookups traverse a different code path when there
+	 * are > 10 sockets in a group.
+	 */
+	fprintf(stderr, "---- UDP IPv4 created before IPv6 (large) ----\n");
+	build_rcv_fd(AF_INET, SOCK_DGRAM, rcv_fds, 16);
+	build_rcv_fd(AF_INET6, SOCK_DGRAM, &(rcv_fds[16]), 16);
+	test(rcv_fds, 32, SOCK_DGRAM);
+	for (i = 0; i < 32; ++i)
+		close(rcv_fds[i]);
+
+	fprintf(stderr, "---- UDP IPv6 created before IPv4 (large) ----\n");
+	build_rcv_fd(AF_INET6, SOCK_DGRAM, rcv_fds, 16);
+	build_rcv_fd(AF_INET, SOCK_DGRAM, &(rcv_fds[16]), 16);
+	test(rcv_fds, 32, SOCK_DGRAM);
+	for (i = 0; i < 32; ++i)
+		close(rcv_fds[i]);
+
+	fprintf(stderr, "---- TCP IPv4 created before IPv6 ----\n");
+	build_rcv_fd(AF_INET, SOCK_STREAM, rcv_fds, 5);
+	build_rcv_fd(AF_INET6, SOCK_STREAM, &(rcv_fds[5]), 5);
+	test(rcv_fds, 10, SOCK_STREAM);
+	for (i = 0; i < 10; ++i)
+		close(rcv_fds[i]);
+
+	fprintf(stderr, "---- TCP IPv6 created before IPv4 ----\n");
+	build_rcv_fd(AF_INET6, SOCK_STREAM, rcv_fds, 5);
+	build_rcv_fd(AF_INET, SOCK_STREAM, &(rcv_fds[5]), 5);
+	test(rcv_fds, 10, SOCK_STREAM);
+	for (i = 0; i < 10; ++i)
+		close(rcv_fds[i]);
+
+	fprintf(stderr, "SUCCESS\n");
+	return 0;
+}
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Alexei Starovoitov @ 2016-04-12 17:20 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: lsf@lists.linux-foundation.org, James Bottomley, Sagi Grimberg,
	Tom Herbert, Brenden Blanco, Christoph Hellwig, linux-mm,
	netdev@vger.kernel.org, Bart Van Assche,
	lsf-pc@lists.linux-foundation.org
In-Reply-To: <20160412081649.4cb4f9db@redhat.com>

On Tue, Apr 12, 2016 at 08:16:49AM +0200, Jesper Dangaard Brouer wrote:
> 
> On Mon, 11 Apr 2016 15:21:26 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Mon, Apr 11, 2016 at 11:41:57PM +0200, Jesper Dangaard Brouer wrote:
> > > 
> > > On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg <sagi@grimberg.me> wrote:
> > >   
> [...]
> > > > 
> > > > If we go down this road how about also attaching some driver opaques
> > > > to the page sets?  
> > > 
> > > That was the ultimate plan... to leave some opaques bytes left in the
> > > page struct that drivers could use.
> > > 
> > > In struct page I would need a pointer back to my page_pool struct and a
> > > page flag.  Then, I would need room to store the dma_unmap address.
> > > (And then some of the usual fields are still needed, like the refcnt,
> > > and reusing some of the list constructs).  And a zero-copy cross-domain
> > > id.  
> > 
> > I don't think we need to add anything to struct page.
> > This is supposed to be small cache of dma_mapped pages with lockless access.
> > It can be implemented as an array or link list where every element
> > is dma_addr and pointer to page. If it is full, dma_unmap_page+put_page to
> > send it to back to page allocator.
> 
> It sounds like the Intel drivers recycle facility, where they split the
> page into two parts, and keep page in RX-ring, by swapping to other
> half of page, if page_count(page) is <= 2.  Thus, they use the atomic
> page ref count to synchronize on.

actually I'm proposing the opposite. one page = one packet.
I'm perfectly happy to waste half a page, since number of such pages is small
and performance matter more. Typical performance vs memory tradeoff.

> Thus, we end-up having two atomic operations per RX packet, on the page
> refcnt.  Where DPDK have zero...

the page recycling cache should have zero atomic ops per packet
otherwise it's non starter.

> By fully taking over the page as an allocator, almost like slab. I can
> optimize the common case (of the packet-page getting allocated and
> free'ed on the same CPU), and remove these atomic operations.

slub is doing local cmpxchg. 40G networking cannot afford it per packet.
If it's amortized due to batching that will be ok.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH net] bpf/verifier: reject invalid LD_ABS | BPF_DW instruction
From: Alexei Starovoitov @ 2016-04-12 17:26 UTC (permalink / raw)
  To: David S . Miller; +Cc: Daniel Borkmann, netdev

verifier must check for reserved size bits in instruction opcode and
reject BPF_LD | BPF_ABS | BPF_DW and BPF_LD | BPF_IND | BPF_DW instructions,
otherwise interpreter will WARN_RATELIMIT on them during execution.

Fixes: ddd872bc3098 ("bpf: verifier: add checks for BPF_ABS | BPF_IND instructions")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 kernel/bpf/verifier.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2e08f8e9b771..618ef77c302a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1374,6 +1374,7 @@ static int check_ld_abs(struct verifier_env *env, struct bpf_insn *insn)
 	}
 
 	if (insn->dst_reg != BPF_REG_0 || insn->off != 0 ||
+	    BPF_SIZE(insn->code) == BPF_DW ||
 	    (mode == BPF_ABS && insn->src_reg != BPF_REG_0)) {
 		verbose("BPF_LD_ABS uses reserved fields\n");
 		return -EINVAL;
-- 
2.8.0

^ permalink raw reply related

* Re: [PATCH] mwifiex: fix possible NULL dereference
From: Rustad, Mark D @ 2016-04-12 17:43 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Sudip Mukherjee, Amitkumar Karwar, Nishant Sarmukadam, Kalle Valo,
	linux-kernel@vger.kernel.org, open list:TI WILINK WIRELES...,
	netdev, Sudip Mukherjee
In-Reply-To: <CAHp75VfO3A+HBjkjc9WC9jzB9gtYrVM6izLmpiqLntChGOzjMA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1971 bytes --]

Andy Shevchenko <andy.shevchenko@gmail.com> wrote:

> On Mon, Apr 11, 2016 at 6:27 PM, Sudip Mukherjee
> <sudipm.mukherjee@gmail.com> wrote:
>> From: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
>>
>> We have a check for card just after dereferencing it. So if it is NULL
>> we have already dereferenced it before its check. Lets dereference it
>> after checking card for NULL.
>
> IIUC the code does nothing with dereference.
>
> I would have told NAK if I would have been maintainer.
>
>> Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
>> ---
>>  drivers/net/wireless/marvell/mwifiex/pcie.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c  
>> b/drivers/net/wireless/marvell/mwifiex/pcie.c
>> index edf8b07..84562d0 100644
>> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
>> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
>> @@ -2884,10 +2884,11 @@ static void mwifiex_unregister_dev(struct  
>> mwifiex_adapter *adapter)
>>  {
>>         struct pcie_service_card *card = adapter->card;
>
> Let's say it's 0.
>
>>        const struct mwifiex_pcie_card_reg *reg;
>> -       struct pci_dev *pdev = card->dev;
>
> This would be equal to offset of dev member in pcie_service_card struct.
>
> Nothing wrong here.

Actually, that is not true. The dereference of card tells the compiler that  
card can't be NULL, so it is free to eliminate the check below.  
Unbelievably, this can even happen for a reference such as &ptr->thing  
where the pointer isn't even actually dereferenced at all!

>> +       struct pci_dev *pdev;
>>         int i;
>>
>>         if (card) {
>> +               pdev = card->dev;
>>                 if (card->msix_enable) {
>>                         for (i = 0; i < MWIFIEX_NUM_MSIX_VECTORS; i++)
>>                                 synchronize_irq(card->msix_entries[i].vector);
>> --
>> 1.9.1
>
>
>
> --
> With Best Regards,
> Andy Shevchenko

^ permalink raw reply

* Issue with ping source address display
From: Daniele Orlandi @ 2016-04-12 17:52 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 4826 bytes --]


Hello,

More than one year ago I posted the following message but it hasn't
received a reply, now I've been stung by a similar issue, you may want
to investigate:


I noticed that when ping receives ICMP messages from different sources
the first IP address is always used and displayed:


vihai@seviolab:~$ ping -V
ping utility, iputils-s20121221

This is a (simulated) flapping route:

vihai@seviolab:~$ ping 10.254.10.140
PING 10.254.10.140 (10.254.10.140) 56(84) bytes of data.
From 192.168.1.1 icmp_seq=1 Destination Host Unreachable
From 192.168.1.1 icmp_seq=2 Destination Host Unreachable
From 192.168.1.1 icmp_seq=3 Destination Host Unreachable
64 bytes from 192.168.1.1: icmp_seq=4 ttl=61 time=24.7 ms
64 bytes from 192.168.1.1: icmp_seq=5 ttl=61 time=25.6 ms
64 bytes from 192.168.1.1: icmp_seq=6 ttl=61 time=69.6 ms
From 192.168.1.1 icmp_seq=7 Destination Host Unreachable
From 192.168.1.1 icmp_seq=8 Destination Host Unreachable
From 192.168.1.1 icmp_seq=9 Destination Host Unreachable
^C
--- 10.254.10.140 ping statistics ---
9 packets transmitted, 3 received, +6 errors, 66% packet loss, time 8001ms
rtt min/avg/max/mdev = 24.797/40.061/69.692/20.955 ms


The sources, however are different:

vihai@seviolab:~$ sudo tcpdump -n icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
^[OA^[OA^[OA17:09:55.932981 IP 192.168.1.21 > 10.254.10.140: ICMP echo
request, id 9278, seq 1, length 64
17:09:55.933234 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
unreachable, length 92
17:09:56.933169 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 2, length 64
17:09:56.933416 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
unreachable, length 92
17:09:57.933160 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 3, length 64
17:09:57.933404 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
unreachable, length 92
17:09:58.933163 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 4, length 64
17:09:58.957939 IP 10.254.10.140 > 192.168.1.21: ICMP echo reply, id
9278, seq 4, length 64
17:09:59.935050 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 5, length 64
17:09:59.960724 IP 10.254.10.140 > 192.168.1.21: ICMP echo reply, id
9278, seq 5, length 64
17:10:00.936177 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 6, length 64
17:10:01.005849 IP 10.254.10.140 > 192.168.1.21: ICMP echo reply, id
9278, seq 6, length 64
17:10:01.936313 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 7, length 64
17:10:01.936626 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
unreachable, length 92
17:10:02.935321 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 8, length 64
17:10:02.935591 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
unreachable, length 92
17:10:03.934322 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 9, length 64
17:10:03.934613 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
unreachable, length 92




Tried with a different ping implementation (RouterOS) and the behaviour
seems correct:

[vihai@SevioLab SW1] > ping 10.254.10.140
HOST                                     SIZE TTL TIME  STATUS
192.168.1.1                             84  64 0ms   host unreachable
192.168.1.1                             84  64 0ms   host unreachable
192.168.1.1                             84  64 0ms   host unreachable
10.254.10.140                           56  61 20ms
10.254.10.140                           56  61 46ms
10.254.10.140                           56  61 37ms
192.168.1.1                             84  64 0ms   host unreachable
192.168.1.1                             84  64 0ms   host unreachable
192.168.1.1                             84  64 0ms   host unreachable
    sent=9 received=3 packet-loss=66% min-rtt=20ms avg-rtt=34ms max-rtt=46ms


Recently I was pinging with IPv6, a router in between filtered the
packet, however the shown source address was not the right one:

root@monitor:~# ping6 -i 0.2 www.google.com
PING www.google.com(mil01s25-in-x04.1e100.net) 56 data bytes
From mil01s25-in-x04.1e100.net icmp_seq=1 Destination unreachable: No route
From mil01s25-in-x04.1e100.net icmp_seq=2 Destination unreachable: No route
From mil01s25-in-x04.1e100.net icmp_seq=3 Destination unreachable: No route

19:33:19.589285 IP6 2a01:2d8:aca0:fce:944e:c8ff:fe4d:96de >
mil01s25-in-x04.1e100.net: ICMP6, echo request, seq 1, length 64
19:33:19.611666 IP6 mix-br2.intercom.it >
2a01:2d8:aca0:fce:944e:c8ff:fe4d:96de: ICMP6, destination unreachable,
unreachable route mil01s25-in-x04.1e100.net, length 112


Bye,


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4326 bytes --]

^ permalink raw reply

* RE: TCP reaching to maximum throughput after a long time
From: Machani, Yaniv @ 2016-04-12 19:31 UTC (permalink / raw)
  To: Ben Greear, netdev
  Cc: Eric Dumazet, David S. Miller, Eric Dumazet, Neal Cardwell,
	Yuchung Cheng, Nandita Dukkipati, open list, Kama, Meirav
In-Reply-To: <570D0E94.3080006@candelatech.com>

On Tue, Apr 12, 2016 at 18:04:52, Ben Greear wrote:
> On 04/12/2016 07:52 AM, Eric Dumazet wrote:
> > On Tue, 2016-04-12 at 12:17 +0000, Machani, Yaniv wrote:
>>>
> 
> If you are using 'Cubic' TCP congestion control, then please try 
> something different.
> It was broken last I checked, at least when used with the ath10k driver.
> 

Thanks Ben, this indeed seems to be the issue !
Switching to reno got me to max throughput instantly.

I'm still looking through the thread you have shared, but from what I understand there is no planned fix for it ?

--
Thanks,
Yaniv


^ permalink raw reply

* [PATCH net-next] ibmvnic: Defer tx completion processing using a wait queue
From: John Allen @ 2016-04-12 19:38 UTC (permalink / raw)
  To: Thomas Falcon; +Cc: netdev, linuxppc-dev

Moves tx completion processing out of interrupt context, deferring work
using a wait queue. With this work now deferred, we must account for the
possibility that skbs can be sent faster than we can process completion
requests in which case the tx buffer will overflow. If the tx buffer is
full, ibmvnic_xmit will return NETDEV_TX_BUSY and stop the current tx
queue. Subsequently, the queue will be restarted in ibmvnic_complete_tx
when all pending tx completion requests have been cleared.

Signed-off-by: John Allen <jallen@linux.vnet.ibm.com>
---
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 864cb21..641e340 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -105,6 +105,7 @@ static int pending_scrq(struct ibmvnic_adapter *,
 			struct ibmvnic_sub_crq_queue *);
 static union sub_crq *ibmvnic_next_scrq(struct ibmvnic_adapter *,
 					struct ibmvnic_sub_crq_queue *);
+static int ibmvnic_tx_work(void *data);
 static int ibmvnic_poll(struct napi_struct *napi, int data);
 static void send_map_query(struct ibmvnic_adapter *adapter);
 static void send_request_map(struct ibmvnic_adapter *, dma_addr_t, __be32, u8);
@@ -437,6 +438,17 @@ static int ibmvnic_open(struct net_device *netdev)

 		tx_pool->consumer_index = 0;
 		tx_pool->producer_index = 0;
+
+		init_waitqueue_head(&tx_pool->ibmvnic_tx_comp_q);
+		tx_pool->work_thread =
+			kthread_run(ibmvnic_tx_work, adapter->tx_scrq[i],
+				    "%s_%s_%d",
+				    IBMVNIC_NAME, adapter->netdev->name, i);
+		if (IS_ERR(tx_pool->work_thread)) {
+			dev_err(dev, "Couldn't create kernel thread: %ld\n",
+				PTR_ERR(tx_pool->work_thread));
+			goto thread_failed;
+		}
 	}
 	adapter->bounce_buffer_size =
 	    (netdev->mtu + ETH_HLEN - 1) / PAGE_SIZE + 1;
@@ -477,6 +489,9 @@ bounce_map_failed:
 bounce_alloc_failed:
 	i = tx_subcrqs - 1;
 	kfree(adapter->tx_pool[i].free_map);
+thread_failed:
+	for (j = 0; j < i; j++)
+		kthread_stop(adapter->tx_pool[j].work_thread);
 tx_fm_alloc_failed:
 	free_long_term_buff(adapter, &adapter->tx_pool[i].long_term_buff);
 tx_ltb_alloc_failed:
@@ -731,6 +746,16 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 	}

 	index = tx_pool->free_map[tx_pool->consumer_index];
+	/* tx queue full */
+	if ((index + 1) % adapter->max_tx_entries_per_subcrq ==
+	    tx_pool->free_map[tx_pool->producer_index]) {
+		netif_tx_stop_queue(netdev_get_tx_queue(netdev, queue_num));
+		tx_send_failed++;
+		tx_dropped++;
+		ret = NETDEV_TX_BUSY;
+		goto out;
+	}
+
 	offset = index * adapter->req_mtu;
 	dst = tx_pool->long_term_buff.buff + offset;
 	memset(dst, 0, adapter->req_mtu);
@@ -1314,6 +1339,7 @@ static int ibmvnic_complete_tx(struct ibmvnic_adapter *adapter,
 {
 	struct device *dev = &adapter->vdev->dev;
 	struct ibmvnic_tx_buff *txbuff;
+	struct netdev_queue *txq;
 	union sub_crq *next;
 	int index;
 	int i, j;
@@ -1361,6 +1387,10 @@ restart_loop:
 		next->tx_comp.first = 0;
 	}

+	txq = netdev_get_tx_queue(adapter->netdev, scrq->pool_index);
+	if (netif_tx_queue_stopped(txq))
+		netif_tx_wake_queue(txq);
+
 	enable_scrq_irq(adapter, scrq);

 	if (pending_scrq(adapter, scrq)) {
@@ -1371,13 +1401,35 @@ restart_loop:
 	return 0;
 }

+static int ibmvnic_tx_work(void *data)
+{
+	int rc;
+	struct ibmvnic_sub_crq_queue *scrq = data;
+	struct ibmvnic_adapter *adapter = scrq->adapter;
+
+	while (1) {
+		rc = wait_event_interruptible(adapter->
+					      tx_pool[scrq->pool_index].
+					      ibmvnic_tx_comp_q,
+					      pending_scrq(adapter, scrq));
+		BUG_ON(rc);
+
+		if (kthread_should_stop())
+			break;
+
+		disable_scrq_irq(adapter, scrq);
+
+		ibmvnic_complete_tx(adapter, scrq);
+	}
+	return 0;
+}
+
 static irqreturn_t ibmvnic_interrupt_tx(int irq, void *instance)
 {
 	struct ibmvnic_sub_crq_queue *scrq = instance;
 	struct ibmvnic_adapter *adapter = scrq->adapter;

-	disable_scrq_irq(adapter, scrq);
-	ibmvnic_complete_tx(adapter, scrq);
+	wake_up(&adapter->tx_pool[scrq->pool_index].ibmvnic_tx_comp_q);

 	return IRQ_HANDLED;
 }

^ permalink raw reply related

* Re: Issue with ping source address display
From: Julian Anastasov @ 2016-04-12 19:38 UTC (permalink / raw)
  To: Daniele Orlandi; +Cc: netdev
In-Reply-To: <570D35CF.9070006@orlandi.com>


	Hello,

On Tue, 12 Apr 2016, Daniele Orlandi wrote:

> I noticed that when ping receives ICMP messages from different sources
> the first IP address is always used and displayed:
> 
> 
> vihai@seviolab:~$ ping -V
> ping utility, iputils-s20121221

	What is the kernel version?

Regards

^ permalink raw reply

* Re: Issue with ping source address display
From: Daniele Orlandi @ 2016-04-12 19:47 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: netdev
In-Reply-To: <alpine.LFD.2.11.1604122235020.1611@ja.home.ssi.bg>

[-- Attachment #1: Type: text/plain, Size: 696 bytes --]

On 12/04/2016 21:38, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Tue, 12 Apr 2016, Daniele Orlandi wrote:
> 
>> I noticed that when ping receives ICMP messages from different sources
>> the first IP address is always used and displayed:
>>
>>
>> vihai@seviolab:~$ ping -V
>> ping utility, iputils-s20121221
> 
> 	What is the kernel version?

Much time passed and I don't remember, however the latest test with IPv6
ping was made on this host:

root@monitor:~# uname -a
Linux monitor 4.4.0-17-generic #33-Ubuntu SMP Tue Mar 29 17:17:28 UTC
2016 x86_64 x86_64 x86_64 GNU/Linux

root@monitor:~# ping -V
ping utility, iputils-s20121221

Thanks,

-- 
  Daniele Orlandi


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4326 bytes --]

^ permalink raw reply

* Re: [PATCH v3 1/2] sctp: compress bit-wide flags to a bitfield on sctp_sock
From: Neil Horman @ 2016-04-12 19:50 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: netdev, Vlad Yasevich, linux-sctp, David Laight, Jakub Sitnicki
In-Reply-To: <1d6d26426e6f1eee2851aed47b77ffe8bc335210.1460144373.git.marcelo.leitner@gmail.com>

On Fri, Apr 08, 2016 at 04:41:27PM -0300, Marcelo Ricardo Leitner wrote:
> It wastes space and gets worse as we add new flags, so convert bit-wide
> flags to a bitfield.
> 
> Currently it already saves 4 bytes in sctp_sock, which are left as holes
> in it for now. The whole struct needs packing, which should be done in
> another patch.
> 
> Note that do_auto_asconf cannot be merged, as explained in the comment
> before it.
> 
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
>  include/net/sctp/structs.h | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index 6df1ce7a411c548bda4163840a90578b6e1b4cfe..1a6a626904bba4223b7921bbb4be41c2550271a7 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -210,14 +210,14 @@ struct sctp_sock {
>  	int user_frag;
>  
>  	__u32 autoclose;
> -	__u8 nodelay;
> -	__u8 disable_fragments;
> -	__u8 v4mapped;
> -	__u8 frag_interleave;
>  	__u32 adaptation_ind;
>  	__u32 pd_point;
> -	__u8 recvrcvinfo;
> -	__u8 recvnxtinfo;
> +	__u16	nodelay:1,
> +		disable_fragments:1,
> +		v4mapped:1,
> +		frag_interleave:1,
> +		recvrcvinfo:1,
> +		recvnxtinfo:1;
>  
>  	atomic_t pd_mode;
>  	/* Receive to here while partial delivery is in effect. */
> -- 
> 2.5.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

I've not run it myself, but this series looks reasonable

Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply

* Re: TCP reaching to maximum throughput after a long time
From: Ben Greear @ 2016-04-12 20:11 UTC (permalink / raw)
  To: Machani, Yaniv, netdev
  Cc: Eric Dumazet, David S. Miller, Eric Dumazet, Neal Cardwell,
	Yuchung Cheng, Nandita Dukkipati, open list, Kama, Meirav
In-Reply-To: <AE1C82FB3D0EC64DB1F752C81CBD110139101889@DFRE01.ent.ti.com>

On 04/12/2016 12:31 PM, Machani, Yaniv wrote:
> On Tue, Apr 12, 2016 at 18:04:52, Ben Greear wrote:
>> On 04/12/2016 07:52 AM, Eric Dumazet wrote:
>>> On Tue, 2016-04-12 at 12:17 +0000, Machani, Yaniv wrote:
>>>>
>>
>> If you are using 'Cubic' TCP congestion control, then please try
>> something different.
>> It was broken last I checked, at least when used with the ath10k driver.
>>
>
> Thanks Ben, this indeed seems to be the issue !
> Switching to reno got me to max throughput instantly.
>
> I'm still looking through the thread you have shared, but from what I understand there is no planned fix for it ?

I think at the time it was blamed on ath10k and no one cared to try to fix it.

Or, maybe no one really uses CUBIC anymore?

Either way, I have no plans to try to fix CUBIC, but maybe someone who knows
this code better could give it a try.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [PATCH net-next] ibmvnic: Defer tx completion processing using a wait queue
From: Eric Dumazet @ 2016-04-12 20:12 UTC (permalink / raw)
  To: John Allen; +Cc: Thomas Falcon, netdev, linuxppc-dev
In-Reply-To: <570D4EBC.60409@linux.vnet.ibm.com>

On Tue, 2016-04-12 at 14:38 -0500, John Allen wrote:
> Moves tx completion processing out of interrupt context, deferring work
> using a wait queue. With this work now deferred, we must account for the
> possibility that skbs can be sent faster than we can process completion
> requests in which case the tx buffer will overflow. If the tx buffer is
> full, ibmvnic_xmit will return NETDEV_TX_BUSY and stop the current tx
> queue. Subsequently, the queue will be restarted in ibmvnic_complete_tx
> when all pending tx completion requests have been cleared.

1) Why is this needed ?

2) If it is needed, why is this not done in a generic way, so that other
drivers can use this ?

Thanks.

^ permalink raw reply

* Re: TCP reaching to maximum throughput after a long time
From: Eric Dumazet @ 2016-04-12 20:17 UTC (permalink / raw)
  To: Ben Greear
  Cc: Machani, Yaniv, netdev, David S. Miller, Eric Dumazet,
	Neal Cardwell, Yuchung Cheng, Nandita Dukkipati, open list,
	Kama, Meirav
In-Reply-To: <570D566B.8050302@candelatech.com>

On Tue, 2016-04-12 at 13:11 -0700, Ben Greear wrote:
> On 04/12/2016 12:31 PM, Machani, Yaniv wrote:
> > On Tue, Apr 12, 2016 at 18:04:52, Ben Greear wrote:
> >> On 04/12/2016 07:52 AM, Eric Dumazet wrote:
> >>> On Tue, 2016-04-12 at 12:17 +0000, Machani, Yaniv wrote:
> >>>>
> >>
> >> If you are using 'Cubic' TCP congestion control, then please try
> >> something different.
> >> It was broken last I checked, at least when used with the ath10k driver.
> >>
> >
> > Thanks Ben, this indeed seems to be the issue !
> > Switching to reno got me to max throughput instantly.
> >
> > I'm still looking through the thread you have shared, but from what I understand there is no planned fix for it ?
> 
> I think at the time it was blamed on ath10k and no one cared to try to fix it.
> 
> Or, maybe no one really uses CUBIC anymore?
> 
> Either way, I have no plans to try to fix CUBIC, but maybe someone who knows
> this code better could give it a try.

Well, cubic seems to work in many cases, assuming they are not too many
drops.

Assuming one flow can get nominal speed in few RTT is kind a dream, and
so far nobody claimed a CC was able to do that, while still being fair
and resilient.

TCP CC are full of heuristics, and by definition heuristics that were
working 6 years ago might need to be refreshed.

We are still maintaining Cubic for sure.

^ permalink raw reply

* Re: TCP reaching to maximum throughput after a long time
From: Ben Greear @ 2016-04-12 20:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Machani, Yaniv, netdev, David S. Miller, Eric Dumazet,
	Neal Cardwell, Yuchung Cheng, Nandita Dukkipati, open list,
	Kama, Meirav
In-Reply-To: <1460492253.6473.596.camel@edumazet-glaptop3.roam.corp.google.com>

On 04/12/2016 01:17 PM, Eric Dumazet wrote:
> On Tue, 2016-04-12 at 13:11 -0700, Ben Greear wrote:
>> On 04/12/2016 12:31 PM, Machani, Yaniv wrote:
>>> On Tue, Apr 12, 2016 at 18:04:52, Ben Greear wrote:
>>>> On 04/12/2016 07:52 AM, Eric Dumazet wrote:
>>>>> On Tue, 2016-04-12 at 12:17 +0000, Machani, Yaniv wrote:
>>>>>>
>>>>
>>>> If you are using 'Cubic' TCP congestion control, then please try
>>>> something different.
>>>> It was broken last I checked, at least when used with the ath10k driver.
>>>>
>>>
>>> Thanks Ben, this indeed seems to be the issue !
>>> Switching to reno got me to max throughput instantly.
>>>
>>> I'm still looking through the thread you have shared, but from what I understand there is no planned fix for it ?
>>
>> I think at the time it was blamed on ath10k and no one cared to try to fix it.
>>
>> Or, maybe no one really uses CUBIC anymore?
>>
>> Either way, I have no plans to try to fix CUBIC, but maybe someone who knows
>> this code better could give it a try.
>
> Well, cubic seems to work in many cases, assuming they are not too many
> drops.
>
> Assuming one flow can get nominal speed in few RTT is kind a dream, and
> so far nobody claimed a CC was able to do that, while still being fair
> and resilient.
>
> TCP CC are full of heuristics, and by definition heuristics that were
> working 6 years ago might need to be refreshed.
>
> We are still maintaining Cubic for sure.

It worked well enough for years that I didn't even know other algorithms were
available.  It was broken around 4.0 time, and I reported it to the list,
and no one seemed to really care enough to do anything about it.  I changed
to reno and ignored the problem as well.

It is trivially easy to see the regression when using ath10k NIC, and from this email
thread, I guess other NICs have similar issues.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox