[RFC PATCH net-next 0/3] net: batched receive in GRO path

Netdev List
 help / color / mirror / Atom feed

* [RFC PATCH net-next 0/3] net: batched receive in GRO path
@ 2019-07-09 19:27 Edward Cree
  2019-07-09 19:28 ` [RFC PATCH net-next 1/3] sfc: don't score irq moderation points for GRO Edward Cree
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Edward Cree @ 2019-07-09 19:27 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eric Dumazet

This series listifies part of GRO processing, in a manner which allows those
 packets which are not GROed (i.e. for which dev_gro_receive returns
 GRO_NORMAL) to be passed on to the listified regular receive path.
dev_gro_receive() itself is not listified, nor the per-protocol GRO
 callback, since GRO's need to hold packets on lists under napi->gro_hash
 makes keeping the packets on other lists awkward, and since the GRO control
 block state of held skbs can refer only to one 'new' skb at a time.
Instead, when napi_frags_finish() handles a GRO_NORMAL result, stash the skb
 onto a list in the napi struct, which is received at the end of the napi
 poll or when its length exceeds the (new) sysctl net.core.gro_normal_batch.
Unlike my previous design ([1]), this does not require changes in drivers,
 and also does not prevent the re-use of napi->skb after GRO_MERGED_FREE or
 GRO_DROP.

Performance figures with this series, collected on a back-to-back pair of
 Solarflare sfn8522-r2 NICs with 120-second NetPerf tests.  In the stats,
 sample size n for old and new code is 6 runs each; p is from a Welch t-test.
Tests were run both with GRO enabled and disabled, the latter simulating
 uncoalesceable packets (e.g. due to IP or TCP options).  The receive side
 (which was the device under test) had the NetPerf process pinned to one CPU,
 and the device interrupts pinned to a second CPU.  CPU utilisation figures
 (used in cases of line-rate performance) are summed across all CPUs.
Where not specified (as batch=), net.core.gro_normal_batch was set to 8.
The net-next baseline used for these tests was commit 7d30a7f6424e.
TCP 4 streams, GRO on: all results line rate (9.415Gbps)
net-next: 210.3% cpu
after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
after #3: 191.7% cpu (- 8.9%, p=0.102 vs net-next)
TCP 4 streams, GRO off:
after #1: 7.785 Gbps
after #3: 8.387 Gbps (+ 7.7%, p=0.215 vs #1, but note *)
TCP 1 stream, GRO on: all results line rate & ~200% cpu.
TCP 1 stream, GRO off:
after #1: 6.444 Gbps
after #3: 7.363 Gbps (+14.3%, p=0.003 vs #1)
batch=16: 7.199 Gbps
batch= 4: 7.354 Gbps
batch= 0: 5.899 Gbps
TCP 100 RR, GRO off:
net-next: 995.083 us
after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)

(*) These tests produced a mixture of line-rate and below-line-rate results,
 meaning that statistically speaking the results were 'censored' by the
 upper bound, and were thus not normally distributed, making a Welch t-test
 mathematically invalid.  I therefore also calculated estimators according
 to [2], which gave the following:
after #1: 8.155 Gbps
after #3: 8.716 Gbps (+ 6.9%, p=0.291 vs #1)
(though my procedure for determining ν wasn't mathematically well-founded
 either, so take that p-value with a grain of salt).

Conclusion:
* Patch #1 is a fairly unambiguous improvement.
* Patch #3 has no statistically significant effect when GRO is active.
* Any effect of patch #3 on latency is within statistical noise.
* When GRO is inactive, patch #3 improves bandwidth, though for multiple
  streams the effect is smaller (possibly owing to the line-rate limit).
* The optimal batch size for this setup appears to be around 8.
* Setting the batch size to zero gives worse performance than before the
  patch; perhaps a static key is needed?
* Drivers which, unlike sfc, pass UDP traffic to GRO would expect to see a
  benefit from gaining access to batching.

Notes for future thought: in principle if we passed the napi pointer to
 napi_gro_complete(), it could add its superframe skb to napi->rx_list,
 rather than immediately netif_receive_skb_internal()ing it.  Without that
 I'm not sure if there's a possibility of OoO between normal and GROed SKBs
 on the same flow.

[1]: http://patchwork.ozlabs.org/cover/997844/
[2]: Cohen 1959, doi: 10.1080/00401706.1959.10489859

Edward Cree (3):
  sfc: don't score irq moderation points for GRO
  sfc: falcon: don't score irq moderation points for GRO
  net: use listified RX for handling GRO_NORMAL skbs

 drivers/net/ethernet/sfc/falcon/rx.c |  5 +----
 drivers/net/ethernet/sfc/rx.c        |  5 +----
 include/linux/netdevice.h            |  3 +++
 net/core/dev.c                       | 32 ++++++++++++++++++++++++++--
 net/core/sysctl_net_core.c           |  8 +++++++
 5 files changed, 43 insertions(+), 10 deletions(-)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC PATCH net-next 1/3] sfc: don't score irq moderation points for GRO
  2019-07-09 19:27 [RFC PATCH net-next 0/3] net: batched receive in GRO path Edward Cree
@ 2019-07-09 19:28 ` Edward Cree
  2019-07-09 19:29 ` [RFC PATCH net-next 2/3] sfc: falcon: " Edward Cree
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Edward Cree @ 2019-07-09 19:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eric Dumazet

We already scored points when handling the RX event, no-one else does this,
 and looking at the history it appears this was originally meant to only
 score on merges, not on GRO_NORMAL.  Moreover, it gets in the way of
 changing GRO to not immediately pass GRO_NORMAL skbs to the stack.
Performance testing with four TCP streams received on a single CPU (where
 throughput was line rate of 9.4Gbps in all tests) showed a 13.7% reduction
 in RX CPU usage (n=6, p=0.03).

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/rx.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index d5db045535d3..85ec07f5a674 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -412,7 +412,6 @@ efx_rx_packet_gro(struct efx_channel *channel, struct efx_rx_buffer *rx_buf,
 		  unsigned int n_frags, u8 *eh)
 {
 	struct napi_struct *napi = &channel->napi_str;
-	gro_result_t gro_result;
 	struct efx_nic *efx = channel->efx;
 	struct sk_buff *skb;
 
@@ -449,9 +448,7 @@ efx_rx_packet_gro(struct efx_channel *channel, struct efx_rx_buffer *rx_buf,
 
 	skb_record_rx_queue(skb, channel->rx_queue.core_index);
 
-	gro_result = napi_gro_frags(napi);
-	if (gro_result != GRO_DROP)
-		channel->irq_mod_score += 2;
+	napi_gro_frags(napi);
 }
 
 /* Allocate and construct an SKB around page fragments */


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH net-next 2/3] sfc: falcon: don't score irq moderation points for GRO
  2019-07-09 19:27 [RFC PATCH net-next 0/3] net: batched receive in GRO path Edward Cree
  2019-07-09 19:28 ` [RFC PATCH net-next 1/3] sfc: don't score irq moderation points for GRO Edward Cree
@ 2019-07-09 19:29 ` Edward Cree
  2019-07-09 19:29 ` [RFC PATCH net-next 3/3] net: use listified RX for handling GRO_NORMAL skbs Edward Cree
  2019-07-10  7:27 ` [RFC PATCH net-next 0/3] net: batched receive in GRO path Paolo Abeni
  3 siblings, 0 replies; 14+ messages in thread
From: Edward Cree @ 2019-07-09 19:29 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eric Dumazet

Same rationale as for sfc, except that this wasn't performance-tested.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/falcon/rx.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/sfc/falcon/rx.c b/drivers/net/ethernet/sfc/falcon/rx.c
index fd850d3d8ec0..05ea3523890a 100644
--- a/drivers/net/ethernet/sfc/falcon/rx.c
+++ b/drivers/net/ethernet/sfc/falcon/rx.c
@@ -424,7 +424,6 @@ ef4_rx_packet_gro(struct ef4_channel *channel, struct ef4_rx_buffer *rx_buf,
 		  unsigned int n_frags, u8 *eh)
 {
 	struct napi_struct *napi = &channel->napi_str;
-	gro_result_t gro_result;
 	struct ef4_nic *efx = channel->efx;
 	struct sk_buff *skb;
 
@@ -460,9 +459,7 @@ ef4_rx_packet_gro(struct ef4_channel *channel, struct ef4_rx_buffer *rx_buf,
 
 	skb_record_rx_queue(skb, channel->rx_queue.core_index);
 
-	gro_result = napi_gro_frags(napi);
-	if (gro_result != GRO_DROP)
-		channel->irq_mod_score += 2;
+	napi_gro_frags(napi);
 }
 
 /* Allocate and construct an SKB around page fragments */


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH net-next 3/3] net: use listified RX for handling GRO_NORMAL skbs
  2019-07-09 19:27 [RFC PATCH net-next 0/3] net: batched receive in GRO path Edward Cree
  2019-07-09 19:28 ` [RFC PATCH net-next 1/3] sfc: don't score irq moderation points for GRO Edward Cree
  2019-07-09 19:29 ` [RFC PATCH net-next 2/3] sfc: falcon: " Edward Cree
@ 2019-07-09 19:29 ` Edward Cree
  2019-07-10  7:27 ` [RFC PATCH net-next 0/3] net: batched receive in GRO path Paolo Abeni
  3 siblings, 0 replies; 14+ messages in thread
From: Edward Cree @ 2019-07-09 19:29 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eric Dumazet

When GRO decides not to coalesce a packet, in napi_frags_finish(), instead
 of passing it to the stack immediately, place it on a list in the napi
 struct.  Then, at flush time (napi_complete_done() or napi_poll()), call
 netif_receive_skb_list_internal() on the list.  We'd like to do that in
 napi_gro_flush(), but it's not called if !napi->gro_bitmask, so we have to
 do it in the callers instead.  (There are a handful of drivers that call
 napi_gro_flush() themselves, but it's not clear why, or whether this will
 affect them.)
Because a full 64 packets is an inefficiently large batch, also consume the
 list whenever it exceeds gro_normal_batch, a new net/core sysctl that
 defaults to 8.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/netdevice.h  |  3 +++
 net/core/dev.c             | 32 ++++++++++++++++++++++++++++++--
 net/core/sysctl_net_core.c |  8 ++++++++
 3 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 88292953aa6f..55ac223553f8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -332,6 +332,8 @@ struct napi_struct {
 	struct net_device	*dev;
 	struct gro_list		gro_hash[GRO_HASH_BUCKETS];
 	struct sk_buff		*skb;
+	struct list_head	rx_list; /* Pending GRO_NORMAL skbs */
+	int			rx_count; /* length of rx_list */
 	struct hrtimer		timer;
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
@@ -4239,6 +4241,7 @@ extern int		dev_weight_rx_bias;
 extern int		dev_weight_tx_bias;
 extern int		dev_rx_weight;
 extern int		dev_tx_weight;
+extern int		gro_normal_batch;
 
 bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
 struct net_device *netdev_upper_get_next_dev_rcu(struct net_device *dev,
diff --git a/net/core/dev.c b/net/core/dev.c
index fc676b2610e3..4b6f2ec67fbc 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3963,6 +3963,8 @@ int dev_weight_rx_bias __read_mostly = 1;  /* bias for backlog weight */
 int dev_weight_tx_bias __read_mostly = 1;  /* bias for output_queue quota */
 int dev_rx_weight __read_mostly = 64;
 int dev_tx_weight __read_mostly = 64;
+/* Maximum number of GRO_NORMAL skbs to batch up for list-RX */
+int gro_normal_batch __read_mostly = 8;
 
 /* Called with irq disabled */
 static inline void ____napi_schedule(struct softnet_data *sd,
@@ -5742,6 +5744,26 @@ struct sk_buff *napi_get_frags(struct napi_struct *napi)
 }
 EXPORT_SYMBOL(napi_get_frags);
 
+/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
+static void gro_normal_list(struct napi_struct *napi)
+{
+	if (!napi->rx_count)
+		return;
+	netif_receive_skb_list_internal(&napi->rx_list);
+	INIT_LIST_HEAD(&napi->rx_list);
+	napi->rx_count = 0;
+}
+
+/* Queue one GRO_NORMAL SKB up for list processing.  If batch size exceeded,
+ * pass the whole batch up to the stack.
+ */
+static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb)
+{
+	list_add_tail(&skb->list, &napi->rx_list);
+	if (++napi->rx_count >= gro_normal_batch)
+		gro_normal_list(napi);
+}
+
 static gro_result_t napi_frags_finish(struct napi_struct *napi,
 				      struct sk_buff *skb,
 				      gro_result_t ret)
@@ -5751,8 +5773,8 @@ static gro_result_t napi_frags_finish(struct napi_struct *napi,
 	case GRO_HELD:
 		__skb_push(skb, ETH_HLEN);
 		skb->protocol = eth_type_trans(skb, skb->dev);
-		if (ret == GRO_NORMAL && netif_receive_skb_internal(skb))
-			ret = GRO_DROP;
+		if (ret == GRO_NORMAL)
+			gro_normal_one(napi, skb);
 		break;
 
 	case GRO_DROP:
@@ -6029,6 +6051,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 				 NAPIF_STATE_IN_BUSY_POLL)))
 		return false;
 
+	gro_normal_list(n);
+
 	if (n->gro_bitmask) {
 		unsigned long timeout = 0;
 
@@ -6267,6 +6291,8 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 	napi->timer.function = napi_watchdog;
 	init_gro_hash(napi);
 	napi->skb = NULL;
+	INIT_LIST_HEAD(&napi->rx_list);
+	napi->rx_count = 0;
 	napi->poll = poll;
 	if (weight > NAPI_POLL_WEIGHT)
 		netdev_err_once(dev, "%s() called with weight %d\n", __func__,
@@ -6363,6 +6389,8 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
 		goto out_unlock;
 	}
 
+	gro_normal_list(n);
+
 	if (n->gro_bitmask) {
 		/* flush too old packets
 		 * If HZ < 1000, flush all packets.
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index f9204719aeee..dba52f53eace 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -569,6 +569,14 @@ static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_do_static_key,
 	},
+	{
+		.procname	= "gro_normal_batch",
+		.data		= &gro_normal_batch,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
 	{ }
 };
 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 0/3] net: batched receive in GRO path
  2019-07-09 19:27 [RFC PATCH net-next 0/3] net: batched receive in GRO path Edward Cree
                   ` (2 preceding siblings ...)
  2019-07-09 19:29 ` [RFC PATCH net-next 3/3] net: use listified RX for handling GRO_NORMAL skbs Edward Cree
@ 2019-07-10  7:27 ` Paolo Abeni
  2019-07-10 14:52   ` Edward Cree
  3 siblings, 1 reply; 14+ messages in thread
From: Paolo Abeni @ 2019-07-10  7:27 UTC (permalink / raw)
  To: Edward Cree, David Miller; +Cc: netdev, Eric Dumazet

Hi,

On Tue, 2019-07-09 at 20:27 +0100, Edward Cree wrote:
> Where not specified (as batch=), net.core.gro_normal_batch was set to 8.
> The net-next baseline used for these tests was commit 7d30a7f6424e.
> TCP 4 streams, GRO on: all results line rate (9.415Gbps)
> net-next: 210.3% cpu
> after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
> after #3: 191.7% cpu (- 8.9%, p=0.102 vs net-next)
> TCP 4 streams, GRO off:
> after #1: 7.785 Gbps
> after #3: 8.387 Gbps (+ 7.7%, p=0.215 vs #1, but note *)
> TCP 1 stream, GRO on: all results line rate & ~200% cpu.
> TCP 1 stream, GRO off:
> after #1: 6.444 Gbps
> after #3: 7.363 Gbps (+14.3%, p=0.003 vs #1)
> batch=16: 7.199 Gbps
> batch= 4: 7.354 Gbps
> batch= 0: 5.899 Gbps
> TCP 100 RR, GRO off:
> net-next: 995.083 us
> after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
> after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)
> 
> (*) These tests produced a mixture of line-rate and below-line-rate results,
>  meaning that statistically speaking the results were 'censored' by the
>  upper bound, and were thus not normally distributed, making a Welch t-test
>  mathematically invalid.  I therefore also calculated estimators according
>  to [2], which gave the following:
> after #1: 8.155 Gbps
> after #3: 8.716 Gbps (+ 6.9%, p=0.291 vs #1)
> (though my procedure for determining ν wasn't mathematically well-founded
>  either, so take that p-value with a grain of salt).

I'm toying with a patch similar to your 3/3 (most relevant difference
being the lack of a limit to the batch size), on top of ixgbe (which
sends all the pkts to the GRO engine), and I'm observing more
controversial results (UDP only):

* when a single rx queue is running, I see a just-above-noise
peformance delta
* when multiple rx queues are running, I observe measurable regressions
(note: I use small pkts, still well under line rate even with multiple
rx queues)

I'll try to test your patch in the following days.

Side note: I think that in patch 3/3, it's necessary to add a call to
gro_normal_list() also inside napi_busy_loop().

Cheers,

Paolo





^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 0/3] net: batched receive in GRO path
  2019-07-10  7:27 ` [RFC PATCH net-next 0/3] net: batched receive in GRO path Paolo Abeni
@ 2019-07-10 14:52   ` Edward Cree
  2019-07-10 15:41     ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Edward Cree @ 2019-07-10 14:52 UTC (permalink / raw)
  To: Paolo Abeni, David Miller; +Cc: netdev, Eric Dumazet

On 10/07/2019 08:27, Paolo Abeni wrote:
> I'm toying with a patch similar to your 3/3 (most relevant difference
> being the lack of a limit to the batch size), on top of ixgbe (which
> sends all the pkts to the GRO engine), and I'm observing more
> controversial results (UDP only):
>
> * when a single rx queue is running, I see a just-above-noise
> peformance delta
> * when multiple rx queues are running, I observe measurable regressions
> (note: I use small pkts, still well under line rate even with multiple
> rx queues)
>
> I'll try to test your patch in the following days.
I look forward to it.

> Side note: I think that in patch 3/3, it's necessary to add a call to
> gro_normal_list() also inside napi_busy_loop().
Hmm, I was caught out by the call to napi_poll() actually being a local
 function pointer, not the static function of the same name.  How did a
 shadow like that ever get allowed?
But in that case I _really_ don't understand napi_busy_loop(); nothing
 in it seems to ever flush GRO, so it's relying on either
 (1) stuff getting flushed because the bucket runs out of space, or
 (2) the next napi poll after busy_poll_stop() doing the flush.
What am I missing, and where exactly in napi_busy_loop() should the
 gro_normal_list() call go?

-Ed

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 0/3] net: batched receive in GRO path
  2019-07-10 14:52   ` Edward Cree
@ 2019-07-10 15:41     ` Eric Dumazet
  2019-07-10 16:47       ` Edward Cree
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2019-07-10 15:41 UTC (permalink / raw)
  To: Edward Cree, Paolo Abeni, David Miller; +Cc: netdev, Eric Dumazet



On 7/10/19 4:52 PM, Edward Cree wrote:

> Hmm, I was caught out by the call to napi_poll() actually being a local
>  function pointer, not the static function of the same name.  How did a
>  shadow like that ever get allowed?
> But in that case I _really_ don't understand napi_busy_loop(); nothing
>  in it seems to ever flush GRO, so it's relying on either
>  (1) stuff getting flushed because the bucket runs out of space, or
>  (2) the next napi poll after busy_poll_stop() doing the flush.
> What am I missing, and where exactly in napi_busy_loop() should the
>  gro_normal_list() call go?

Please look at busy_poll_stop()


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 0/3] net: batched receive in GRO path
  2019-07-10 15:41     ` Eric Dumazet
@ 2019-07-10 16:47       ` Edward Cree
  2019-07-10 17:39         ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Edward Cree @ 2019-07-10 16:47 UTC (permalink / raw)
  To: Eric Dumazet, Paolo Abeni, David Miller; +Cc: netdev

On 10/07/2019 16:41, Eric Dumazet wrote:
> On 7/10/19 4:52 PM, Edward Cree wrote:
>> Hmm, I was caught out by the call to napi_poll() actually being a local
>>  function pointer, not the static function of the same name.  How did a
>>  shadow like that ever get allowed?
>> But in that case I _really_ don't understand napi_busy_loop(); nothing
>>  in it seems to ever flush GRO, so it's relying on either
>>  (1) stuff getting flushed because the bucket runs out of space, or
>>  (2) the next napi poll after busy_poll_stop() doing the flush.
>> What am I missing, and where exactly in napi_busy_loop() should the
>>  gro_normal_list() call go?
> Please look at busy_poll_stop()
I did look there, but now I've looked again and harder, and I think I get it:
busy_poll_stop() calls napi->poll(), which (eventually, possibly in the
 subsequent poll that we schedule if rc == budget) calls napi_complete_done()
 which does the flush.
In which case, the same applies to the napi->rx_list, which similarly gets
 handled in napi_complete_done().  So I don't think napi_busy_loop() needs a
 gro_normal_list() adding to it(?)

As a general rule, I think we need to gro_normal_list() in those places, and
 only those places, that call napi_gro_flush().  But as I mentioned in the
 patch 3/3 description, I'm still confused by the (few) drivers that call
 napi_gro_flush() themselves.

-Ed

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 0/3] net: batched receive in GRO path
  2019-07-10 16:47       ` Edward Cree
@ 2019-07-10 17:39         ` Eric Dumazet
  2019-07-12 15:59           ` Edward Cree
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2019-07-10 17:39 UTC (permalink / raw)
  To: Edward Cree, Eric Dumazet, Paolo Abeni, David Miller; +Cc: netdev



On 7/10/19 6:47 PM, Edward Cree wrote:
> On 10/07/2019 16:41, Eric Dumazet wrote:
>> On 7/10/19 4:52 PM, Edward Cree wrote:
>>> Hmm, I was caught out by the call to napi_poll() actually being a local
>>>  function pointer, not the static function of the same name.  How did a
>>>  shadow like that ever get allowed?
>>> But in that case I _really_ don't understand napi_busy_loop(); nothing
>>>  in it seems to ever flush GRO, so it's relying on either
>>>  (1) stuff getting flushed because the bucket runs out of space, or
>>>  (2) the next napi poll after busy_poll_stop() doing the flush.
>>> What am I missing, and where exactly in napi_busy_loop() should the
>>>  gro_normal_list() call go?
>> Please look at busy_poll_stop()
> I did look there, but now I've looked again and harder, and I think I get it:
> busy_poll_stop() calls napi->poll(), which (eventually, possibly in the
>  subsequent poll that we schedule if rc == budget) calls napi_complete_done()
>  which does the flush.
> In which case, the same applies to the napi->rx_list, which similarly gets
>  handled in napi_complete_done().  So I don't think napi_busy_loop() needs a
>  gro_normal_list() adding to it(?)

I advise you to try busypoll then, with TCP_RR, with say 50 usec delay in /proc/sys/net/core/busy_read

Holding a small packet in the list up to the point we call busy_poll_stop()
will basically make busypoll non working anymore.

napi_complete_done() has special behavior when busy polling is active.


> 
> As a general rule, I think we need to gro_normal_list() in those places, and
>  only those places, that call napi_gro_flush().  But as I mentioned in the
>  patch 3/3 description, I'm still confused by the (few) drivers that call
>  napi_gro_flush() themselves.
> 
> -Ed
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 0/3] net: batched receive in GRO path
  2019-07-10 17:39         ` Eric Dumazet
@ 2019-07-12 15:59           ` Edward Cree
  2019-07-12 16:48             ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Edward Cree @ 2019-07-12 15:59 UTC (permalink / raw)
  To: Eric Dumazet, Paolo Abeni, David Miller; +Cc: netdev

On 10/07/2019 18:39, Eric Dumazet wrote:
> Holding a small packet in the list up to the point we call busy_poll_stop()
> will basically make busypoll non working anymore.
>
> napi_complete_done() has special behavior when busy polling is active.
Yep, I get it now, sorry for being dumb :)
Essentially we're saying that things coalesced by GRO are 'bulk' traffic and
 can wait around, but the rest is the stuff we're polling for for low latency.
I'm putting a gro_normal_list() call after the trace_napi_poll() in
 napi_busy_loop() and testing that, let's see how it goes...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 0/3] net: batched receive in GRO path
  2019-07-12 15:59           ` Edward Cree
@ 2019-07-12 16:48             ` Eric Dumazet
  2019-07-12 18:30               ` David Miller
  2019-07-24 21:49               ` Edward Cree
  0 siblings, 2 replies; 14+ messages in thread
From: Eric Dumazet @ 2019-07-12 16:48 UTC (permalink / raw)
  To: Edward Cree, Eric Dumazet, Paolo Abeni, David Miller; +Cc: netdev



On 7/12/19 5:59 PM, Edward Cree wrote:
> On 10/07/2019 18:39, Eric Dumazet wrote:
>> Holding a small packet in the list up to the point we call busy_poll_stop()
>> will basically make busypoll non working anymore.
>>
>> napi_complete_done() has special behavior when busy polling is active.
> Yep, I get it now, sorry for being dumb :)
> Essentially we're saying that things coalesced by GRO are 'bulk' traffic and
>  can wait around, 


GRO can still be beneficial even when busypolling, since TCP stack
will send a single ACK back, and a read()/recv() will copy the whole train
instead of a single MSS.

I should have mentioned that we have a patch that I forgot to upstream adding
the PSH flag to all TSO packets, meaning the receiver can automatically learn
the boundary of a GRO packet and not have to wait for the napi->poll() end
(busypolling or not)

> but the rest is the stuff we're polling for for low latency.
> I'm putting a gro_normal_list() call after the trace_napi_poll() in
>  napi_busy_loop() and testing that, let's see how it goes...
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 0/3] net: batched receive in GRO path
  2019-07-12 16:48             ` Eric Dumazet
@ 2019-07-12 18:30               ` David Miller
  2019-07-24 21:49               ` Edward Cree
  1 sibling, 0 replies; 14+ messages in thread
From: David Miller @ 2019-07-12 18:30 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ecree, pabeni, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 12 Jul 2019 18:48:29 +0200

> I should have mentioned that we have a patch that I forgot to
> upstream adding the PSH flag to all TSO packets, meaning the
> receiver can automatically learn the boundary of a GRO packet and
> not have to wait for the napi->poll() end (busypolling or not)

Wow, I thought we were already doing this...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 0/3] net: batched receive in GRO path
  2019-07-12 16:48             ` Eric Dumazet
  2019-07-12 18:30               ` David Miller
@ 2019-07-24 21:49               ` Edward Cree
  2019-07-30 20:02                 ` Edward Cree
  1 sibling, 1 reply; 14+ messages in thread
From: Edward Cree @ 2019-07-24 21:49 UTC (permalink / raw)
  To: Eric Dumazet, Paolo Abeni, David Miller; +Cc: netdev

On 12/07/2019 17:48, Eric Dumazet wrote:
>> but the rest is the stuff we're polling for for low latency.
>> I'm putting a gro_normal_list() call after the trace_napi_poll() in
>>  napi_busy_loop() and testing that, let's see how it goes...
One thing that's causing me some uncertainty: busy_poll_stop() does a
 napi->poll(), which can potentially gro_normal_one() something.  But
 when I tried to put a gro_normal_list() just after that, I ran into
 list corruption because it could race against the one in
 napi_complete_done().  I'm not entirely sure how, my current theory
 goes something like:
- clear_bit(IN_BUSY_POLL)
- task switch, start napi poll
- get as far as starting gro_normal_list()
- task switch back to busy_poll_stop()
- local_bh_disable()
- do a napi poll
- start gro_normal_list()
- list corruption ensues as we have two instances of
  netif_receive_skb_list_internal() trying to consume the same list
But I may be wildly mistaken.
Questions that arise from that:
1) Is it safe to potentially be adding to the rx_list (gro_normal_one(),
   which in theory can end up calling gro_normal_list() as well) within
   busy_poll_stop()?  I haven't ever seen a splat from that, but it seems
   every bit as possible as what I have been seeing.
2) Why does busy_poll_stop() not do its local_bh_disable() *before*
   clearing napi state bits, which (if I'm understanding correctly) would
   ensure an ordinary napi poll can't race with the one in busy_poll_stop()?

Apart from that I have indeed established that with the patches as posted
 busy-polling latency is awful, but adding a gro_normal_list() into
 napi_busy_loop() fixes that, as expected.

-Ed

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 0/3] net: batched receive in GRO path
  2019-07-24 21:49               ` Edward Cree
@ 2019-07-30 20:02                 ` Edward Cree
  0 siblings, 0 replies; 14+ messages in thread
From: Edward Cree @ 2019-07-30 20:02 UTC (permalink / raw)
  To: Eric Dumazet, Paolo Abeni, David Miller; +Cc: netdev

On 24/07/2019 22:49, Edward Cree wrote:
> One thing that's causing me some uncertainty: busy_poll_stop() does a
>  napi->poll(), which can potentially gro_normal_one() something.  But
>  when I tried to put a gro_normal_list() just after that, I ran into
>  list corruption because it could race against the one in
>  napi_complete_done().
Turns out that the bh_disables are a red herring, we're racing against a
 napi poll on a different CPU.
I *think* that the sequence of events is
- enter busy_poll_stop
- enter napi->poll
- napi->poll calls napi_complete_done(), which schedules a new napi
- that new napi runs, on another CPU
- meanwhile, busy_poll_stop returns from napi->poll; if it then does a
  gro_normal_list it can race with the one on the other CPU from the new
  napi-poll.
Which means that...
> Questions that arise from that:
> 1) Is it safe to potentially be adding to the rx_list (gro_normal_one(),
>    which in theory can end up calling gro_normal_list() as well) within
>    busy_poll_stop()?  I haven't ever seen a splat from that, but it seems
>    every bit as possible as what I have been seeing.
... this isn't a problem, because napi->poll() will do all of its
 gro_normal_one()s before it calls napi_complete_done().

Just gathering some final performance numbers, then I'll post the updated
 patches (hopefully) later tonight.

-Ed

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-07-30 20:02 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-07-09 19:27 [RFC PATCH net-next 0/3] net: batched receive in GRO path Edward Cree
2019-07-09 19:28 ` [RFC PATCH net-next 1/3] sfc: don't score irq moderation points for GRO Edward Cree
2019-07-09 19:29 ` [RFC PATCH net-next 2/3] sfc: falcon: " Edward Cree
2019-07-09 19:29 ` [RFC PATCH net-next 3/3] net: use listified RX for handling GRO_NORMAL skbs Edward Cree
2019-07-10  7:27 ` [RFC PATCH net-next 0/3] net: batched receive in GRO path Paolo Abeni
2019-07-10 14:52   ` Edward Cree
2019-07-10 15:41     ` Eric Dumazet
2019-07-10 16:47       ` Edward Cree
2019-07-10 17:39         ` Eric Dumazet
2019-07-12 15:59           ` Edward Cree
2019-07-12 16:48             ` Eric Dumazet
2019-07-12 18:30               ` David Miller
2019-07-24 21:49               ` Edward Cree
2019-07-30 20:02                 ` Edward Cree

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox