netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Alexander Lobakin <alobakin@dlink.ru>
To: Saeed Mahameed <saeedm@mellanox.com>
Cc: ecree@solarflare.com, Maxim Mikityanskiy <maximmi@mellanox.com>,
	Jiri Pirko <jiri@mellanox.com>,
	edumazet@google.com, netdev@vger.kernel.org, davem@davemloft.net,
	Tariq Toukan <tariqt@mellanox.com>
Subject: Re: [PATCH net] net: Fix packet reordering caused by GRO and listified RX cooperation
Date: Sat, 18 Jan 2020 13:05:19 +0300	[thread overview]
Message-ID: <da13831f11d0141728a96954685fdf40@dlink.ru> (raw)
In-Reply-To: <7939223efeb4ed9523a802702874be9b8f37f231.camel@mellanox.com>

Hi Saeed,

Saeed Mahameed wrote 18.01.2020 01:47:
> On Fri, 2020-01-17 at 15:09 +0000, Maxim Mikityanskiy wrote:
>> Commit 6570bc79c0df ("net: core: use listified Rx for GRO_NORMAL in
>> napi_gro_receive()") introduces batching of GRO_NORMAL packets in
>> napi_skb_finish. However, dev_gro_receive, that is called just before
>> napi_skb_finish, can also pass skbs to the networking stack: e.g.,
>> when
>> the GRO session is flushed, napi_gro_complete is called, which passes
>> pp
>> directly to netif_receive_skb_internal, skipping napi->rx_list. It
>> means
>> that the packet stored in pp will be handled by the stack earlier
>> than
>> the packets that arrived before, but are still waiting in napi-
>> >rx_list.
>> It leads to TCP reorderings that can be observed in the TCPOFOQueue
>> counter in netstat.
>> 
>> This commit fixes the reordering issue by making napi_gro_complete
>> also
>> use napi->rx_list, so that all packets going through GRO will keep
>> their
>> order.
>> 
>> Fixes: 6570bc79c0df ("net: core: use listified Rx for GRO_NORMAL in
>> napi_gro_receive()")
>> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
>> Cc: Alexander Lobakin <alobakin@dlink.ru>
>> Cc: Edward Cree <ecree@solarflare.com>
>> ---
>> Alexander and Edward, please verify the correctness of this patch. If
>> it's necessary to pass that SKB to the networking stack right away, I
>> can change this patch to flush napi->rx_list by calling
>> gro_normal_list
>> first, instead of putting the SKB in the list.
>> 
> 
> actually this will break performance of traffic that needs to skip
> gro.. and we will loose bulking, so don't do it :)
> 
> But your point is valid when napi_gro_complete() is called outside of
> napi_gro_receive() path.
> 
> see below..
> 
>>  net/core/dev.c | 55 +++++++++++++++++++++++++-----------------------
>> --
>>  1 file changed, 28 insertions(+), 27 deletions(-)
>> 
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 0ad39c87b7fd..db7a105bbc77 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -5491,9 +5491,29 @@ static void flush_all_backlogs(void)
>>  	put_online_cpus();
>>  }
>> 
>> +/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
>> +static void gro_normal_list(struct napi_struct *napi)
>> +{
>> +	if (!napi->rx_count)
>> +		return;
>> +	netif_receive_skb_list_internal(&napi->rx_list);
>> +	INIT_LIST_HEAD(&napi->rx_list);
>> +	napi->rx_count = 0;
>> +}
>> +
>> +/* Queue one GRO_NORMAL SKB up for list processing. If batch size
>> exceeded,
>> + * pass the whole batch up to the stack.
>> + */
>> +static void gro_normal_one(struct napi_struct *napi, struct sk_buff
>> *skb)
>> +{
>> +	list_add_tail(&skb->list, &napi->rx_list);
>> +	if (++napi->rx_count >= gro_normal_batch)
>> +		gro_normal_list(napi);
>> +}
>> +
>>  INDIRECT_CALLABLE_DECLARE(int inet_gro_complete(struct sk_buff *,
>> int));
>>  INDIRECT_CALLABLE_DECLARE(int ipv6_gro_complete(struct sk_buff *,
>> int));
>> -static int napi_gro_complete(struct sk_buff *skb)
>> +static int napi_gro_complete(struct napi_struct *napi, struct
>> sk_buff *skb)
>>  {
>>  	struct packet_offload *ptype;
>>  	__be16 type = skb->protocol;
>> @@ -5526,7 +5546,8 @@ static int napi_gro_complete(struct sk_buff
>> *skb)
>>  	}
>> 
>>  out:
>> -	return netif_receive_skb_internal(skb);
>> +	gro_normal_one(napi, skb);
>> +	return NET_RX_SUCCESS;
>>  }
>> 
> 
> The patch looks fine when napi_gro_complete() is called form
> napi_gro_receive() path.
> 
> But napi_gro_complete() is also used by napi_gro_flush() which is
> called in other contexts, which might break, if they really meant to
> flush to the stack..
> 
> examples:
> 1. drives that use napi_gro_flush() which is not "eventually" followed
> by napi_complete_done(), might break.. possible bug in those drivers
> though. drivers must always return with napi_complete_done();

Drivers *should not* use napi_gro_flush() by themselves. This was
discussed several times here and at the moment me and Edward are
waiting for proper NAPI usage in iwlwifi driver to unexport this
one and make it static.

> 2. the following code in napi_complete_done()
> 
> /* When the NAPI instance uses a timeout and keeps postponing
>  * it, we need to bound somehow the time packets are kept in
>  * the GRO layer
>  */
>   napi_gro_flush(n, !!timeout);
> 
> with the new implementation we won't really flush to the stack unless

Oh, I got this one. This is really an issue. gro_normal_list() is
called earlier than napi_gro_flush() in napi_complete_done(), so
several skbs might stuck in napi->rx_list until next NAPI session.
Thanks for pointing this out, I missed it.

> one possible solution: is to call gro_normal_list(napi); inside:
> napi_gro_flush() ?
> 
> another possible solution:
> allays make sure to follow napi_gro_flush(); with gro_normal_list(n);
> 
> since i see two places in dev.c where we do:
> 
> gro_normal_list(n);
> if (cond) {
>    napi_gro_flush();
> }
> 
> instead, we can change them to:
> 
> if (cond) {
>    /* flush gro to napi->rx_list, with your implementation  */
>    napi_gro_flush();
> }
> gro_normal_list(n); /* Now flush to the stack */
> 
> And your implementation will be correct for such use cases.

I think this one would be more straightforward and correct.
But this needs tests for sure. I could do them only Monday, 20
unfortunately.

Or we can call gro_normal_list() directly in napi_gro_complete()
as Maxim proposed as alternative solution.
I'd like to see what Edward thinks about it. But this one really
needs to be handled either way.

>>  static void __napi_gro_flush_chain(struct napi_struct *napi, u32
>> index,
>> @@ -5539,7 +5560,7 @@ static void __napi_gro_flush_chain(struct
>> napi_struct *napi, u32 index,
>>  		if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
>>  			return;
>>  		skb_list_del_init(skb);
>> -		napi_gro_complete(skb);
>> +		napi_gro_complete(napi, skb);
>>  		napi->gro_hash[index].count--;
>>  	}
>> 
>> @@ -5641,7 +5662,7 @@ static void gro_pull_from_frag0(struct sk_buff
>> *skb, int grow)
>>  	}
>>  }
>> 
>> -static void gro_flush_oldest(struct list_head *head)
>> +static void gro_flush_oldest(struct napi_struct *napi, struct
>> list_head *head)
>>  {
>>  	struct sk_buff *oldest;
>> 
>> @@ -5657,7 +5678,7 @@ static void gro_flush_oldest(struct list_head
>> *head)
>>  	 * SKB to the chain.
>>  	 */
>>  	skb_list_del_init(oldest);
>> -	napi_gro_complete(oldest);
>> +	napi_gro_complete(napi, oldest);
>>  }
>> 
>>  INDIRECT_CALLABLE_DECLARE(struct sk_buff *inet_gro_receive(struct
>> list_head *,
>> @@ -5733,7 +5754,7 @@ static enum gro_result dev_gro_receive(struct
>> napi_struct *napi, struct sk_buff
>> 
>>  	if (pp) {
>>  		skb_list_del_init(pp);
>> -		napi_gro_complete(pp);
>> +		napi_gro_complete(napi, pp);
>>  		napi->gro_hash[hash].count--;
>>  	}
>> 
>> @@ -5744,7 +5765,7 @@ static enum gro_result dev_gro_receive(struct
>> napi_struct *napi, struct sk_buff
>>  		goto normal;
>> 
>>  	if (unlikely(napi->gro_hash[hash].count >= MAX_GRO_SKBS)) {
>> -		gro_flush_oldest(gro_head);
>> +		gro_flush_oldest(napi, gro_head);
>>  	} else {
>>  		napi->gro_hash[hash].count++;
>>  	}
>> @@ -5802,26 +5823,6 @@ struct packet_offload
>> *gro_find_complete_by_type(__be16 type)
>>  }
>>  EXPORT_SYMBOL(gro_find_complete_by_type);
>> 
>> -/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
>> -static void gro_normal_list(struct napi_struct *napi)
>> -{
>> -	if (!napi->rx_count)
>> -		return;
>> -	netif_receive_skb_list_internal(&napi->rx_list);
>> -	INIT_LIST_HEAD(&napi->rx_list);
>> -	napi->rx_count = 0;
>> -}
>> -
>> -/* Queue one GRO_NORMAL SKB up for list processing. If batch size
>> exceeded,
>> - * pass the whole batch up to the stack.
>> - */
>> -static void gro_normal_one(struct napi_struct *napi, struct sk_buff
>> *skb)
>> -{
>> -	list_add_tail(&skb->list, &napi->rx_list);
>> -	if (++napi->rx_count >= gro_normal_batch)
>> -		gro_normal_list(napi);
>> -}
>> -
>>  static void napi_skb_free_stolen_head(struct sk_buff *skb)
>>  {
>>  	skb_dst_drop(skb);

Regards,
ᚷ ᛖ ᚢ ᚦ ᚠ ᚱ

  reply	other threads:[~2020-01-18 10:08 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-17 15:09 [PATCH net] net: Fix packet reordering caused by GRO and listified RX cooperation Maxim Mikityanskiy
2020-01-17 16:09 ` Alexander Lobakin
2020-01-17 22:47 ` Saeed Mahameed
2020-01-18 10:05   ` Alexander Lobakin [this message]
2020-01-20  9:44     ` Alexander Lobakin
2020-01-20 14:39       ` Edward Cree
2020-01-20 14:55         ` Alexander Lobakin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=da13831f11d0141728a96954685fdf40@dlink.ru \
    --to=alobakin@dlink.ru \
    --cc=davem@davemloft.net \
    --cc=ecree@solarflare.com \
    --cc=edumazet@google.com \
    --cc=jiri@mellanox.com \
    --cc=maximmi@mellanox.com \
    --cc=netdev@vger.kernel.org \
    --cc=saeedm@mellanox.com \
    --cc=tariqt@mellanox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).