From mboxrd@z Thu Jan 1 00:00:00 1970
From: Hannes Frederic Sowa
Subject: Re: [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions
Date: Fri, 26 Sep 2014 13:06:35 +0200
Message-ID: <1411729595.1776139.172032293.6FCC3EBB@webmail.messagingengine.com>
References: <20140924160932.9721.56450.stgit@localhost>
 <20140924161047.9721.43080.stgit@localhost>
 <1411579395.15395.41.camel@edumazet-glaptop2.roam.corp.google.com>
 <20140924195831.6fb91051@redhat.com>
 <1411586550.15395.46.camel@edumazet-glaptop2.roam.corp.google.com>
 <1411611140.16953.1.camel@edumazet-glaptop2.roam.corp.google.com>
 <1411688409.16953.64.camel@edumazet-glaptop2.roam.corp.google.com>
 <1411711496.16953.92.camel@edumazet-glaptop2.roam.corp.google.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Jesper Dangaard Brouer, Linux Netdev List, "David S. Miller",
 Alexander Duyck, Toke Høiland-Jørgensen, Florian Westphal,
 Jamal Hadi Salim, Dave Taht, John Fastabend, Daniel Borkmann
To: Eric Dumazet, Tom Herbert
Return-path:
Received: from out3-smtp.messagingengine.com ([66.111.4.27]:41361 "EHLO
 out3-smtp.messagingengine.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1752149AbaIZLGg convert rfc822-to-8bit (ORCPT );
 Fri, 26 Sep 2014 07:06:36 -0400
Received: from compute5.internal (compute5.nyi.internal [10.202.2.45])
 by gateway2.nyi.internal (Postfix) with ESMTP id EED3020B4C for ;
 Fri, 26 Sep 2014 07:06:35 -0400 (EDT)
In-Reply-To: <1411711496.16953.92.camel@edumazet-glaptop2.roam.corp.google.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

On Fri, Sep 26, 2014, at 08:04, Eric Dumazet wrote:
> From: Eric Dumazet
>
> While doing high throughput test on a BQL enabled NIC,
> I found a very high cost in ndo_start_xmit() when accessing BQL data.
>
> It turned out the problem was caused by compiler trying to be
> smart, but involving a bad MESI transaction :
>
>   0.05 │  mov  0xc0(%rax),%edi  // LOAD dql->num_queued
>   0.48 │  mov  %edx,0xc8(%rax)  // STORE dql->last_obj_cnt = count
>  58.23 │  add  %edx,%edi
>   0.58 │  cmp  %edi,0xc4(%rax)
>   0.76 │  mov  %edi,0xc0(%rax)  // STORE dql->num_queued += count
>   0.72 │  js   bd8
>
>
> I got an incredible 10 % gain [1] by making sure cpu do not attempt
> to get the cache line in Shared mode, but directly requests for
> ownership.
>
> New code :
>   mov  %edx,0xc8(%rax)  // STORE dql->last_obj_cnt = count
>   add  %edx,0xc0(%rax)  // RMW dql->num_queued += count
>   mov  0xc4(%rax),%ecx  // LOAD dql->adj_limit
>   mov  0xc0(%rax),%edx  // LOAD dql->num_queued
>   cmp  %edx,%ecx
>
> The TX completion was running from another cpu, with high interrupts
> rate.
>
> Note that I am using barrier() as a soft hint, as mb() here could be
> too heavy cost.
>
> [1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled.
>
> Signed-off-by: Eric Dumazet
> ---
>  include/linux/dynamic_queue_limits.h | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/dynamic_queue_limits.h
> b/include/linux/dynamic_queue_limits.h
> index 5621547d631b..a4be70398ce1 100644
> --- a/include/linux/dynamic_queue_limits.h
> +++ b/include/linux/dynamic_queue_limits.h
> @@ -73,14 +73,22 @@ static inline void dql_queued(struct dql *dql,
> unsigned int count)
>  {
>  	BUG_ON(count > DQL_MAX_OBJECT);
>  
> -	dql->num_queued += count;
>  	dql->last_obj_cnt = count;
> +
> +	/* We want to force a write first, so that cpu do not attempt
> +	 * to get cache line containing last_obj_cnt, num_queued,
> adj_limit
> +	 * in Shared state, but directly does a Request For Ownership
> +	 * It is only a hint, we use barrier() only.
> +	 */
> +	barrier();
> +
> +	dql->num_queued += count;
>  }

I thought that prefetchw() would be the canonical way to solve write
stalls in CPUs and prepare the specific cache line to be written to.
Interesting, thanks Eric.

Thanks,
Hannes
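
For comparison, here is a minimal userspace sketch of the two approaches
mentioned in this thread: the write-first ordering from the patch, and a
prefetch-for-write hint. It is an illustration under assumptions, not
kernel code. struct dql_sketch, DQL_MAX_OBJECT_SKETCH and
compiler_barrier() are hypothetical stand-ins for struct dql,
DQL_MAX_OBJECT and barrier(); assert() stands in for BUG_ON(); and
__builtin_prefetch(addr, 1) (GCC/Clang) stands in for prefetchw().
Both are only hints: the barrier constrains the compiler, not the CPU,
and whether the builtin emits a prefetch-for-write instruction depends
on the target and compiler flags.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the kernel's DQL_MAX_OBJECT. */
#define DQL_MAX_OBJECT_SKETCH (UINT_MAX / 16)

/* Compiler-only barrier: forbids the compiler from reordering memory
 * accesses across it. It emits no fence instruction. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Only the three fields named in the patch comment; the field order
 * follows the offsets in the quoted disassembly (0xc0, 0xc4, 0xc8). */
struct dql_sketch {
	unsigned int num_queued;
	unsigned int adj_limit;
	unsigned int last_obj_cnt;
};

/* Write-first variant: the plain store to last_obj_cnt is emitted
 * before the read-modify-write of num_queued, so the first access to
 * the cache line is a write rather than a load. */
static inline void dql_queued_write_first(struct dql_sketch *dql,
					  unsigned int count)
{
	assert(count <= DQL_MAX_OBJECT_SKETCH);

	dql->last_obj_cnt = count;
	compiler_barrier();
	dql->num_queued += count;
}

/* Prefetch variant: hint that the line will be written (second
 * argument 1 means "prepare for write"), then update in the
 * pre-patch order. */
static inline void dql_queued_prefetchw(struct dql_sketch *dql,
					unsigned int count)
{
	assert(count <= DQL_MAX_OBJECT_SKETCH);

	__builtin_prefetch(dql, 1);
	dql->num_queued += count;
	dql->last_obj_cnt = count;
}
```

Both functions leave the same values in the structure; they differ only
in how the first access to the cache line is presented to the CPU. Which
one performs better would have to be measured on the workload in
question.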