Re: [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue

All of lore.kernel.org
 help / color / mirror / Atom feed

From: John Fastabend <john.fastabend@gmail.com>
To: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	fw@strlen.de, jhs@mojatatu.com, eric.dumazet@gmail.com,
	netdev@vger.kernel.org
Subject: Re: [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue
Date: Fri, 15 Jul 2016 15:18:12 -0700	[thread overview]
Message-ID: <57896124.6090402@gmail.com> (raw)
In-Reply-To: <20160715132329.0d04ac42@redhat.com>

On 16-07-15 04:23 AM, Jesper Dangaard Brouer wrote:
> On Thu, 14 Jul 2016 17:07:33 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
>> On 16-07-14 04:42 PM, Alexei Starovoitov wrote:
>>> On Wed, Jul 13, 2016 at 11:23:12PM -0700, John Fastabend wrote:  
>>>> This converts the pfifo_fast qdisc to use the alf_queue enqueue and
>>>> dequeue routines then sets the NOLOCK bit.
>>>>
>>>> This also removes the logic used to pick the next band to dequeue from
>>>> and instead just checks each alf_queue for packets from top priority
>>>> to lowest. This might need to be a bit more clever but seems to work
>>>> for now.
>>>>
>>>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>>>> ---
>>>>  net/sched/sch_generic.c |  131 +++++++++++++++++++++++++++--------------------  
>>>   
>>>>  static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
>>>>  			      struct sk_buff **to_free)
>>>>  {
>>>> -	return qdisc_drop(skb, qdisc, to_free);
>>>> +	err = skb_array_produce_bh(q, skb);  
>>> ..  
>>>>  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>>  {
>>>> +		skb = skb_array_consume_bh(q);  
>>>
>>> For this particular qdisc the performance gain should come from
>>> granularityof spin_lock, right?  
>>
>> And the fact that the consumer and producer are using different
>> locks now.
> 
> Yes. Splitting up enqueue'ers (producer's) from the dequeuer (consumer)
> is an important step, because today the qdisc layer have this problem
> that enqueue'ers can starve the single dequeuer.  The current
> mitigation tricks are the enq busy_lock and bulk dequeue.
> 
> As John says, using skb_array cause producers and consumer to use
> different locks.
> 
>>> Before we were taking the lock much earlier. Here we keep the lock,
>>> but for the very short time.
>>>         original pps        lockless        diff
>>>         1       1418168             1269450         -148718
>>>         2       1587390             1553408         -33982
>>>         4       1084961             1683639         +598678
>>>         8       989636              1522723         +533087
>>>         12      1014018             1348172         +334154
>>>
> 

I was able to recover the performance loss here and actually improve it
by fixing a few things in the patchset. Namely qdisc_run was
being called in a few places unnecessarily creating a fairly large per
packet cost overhead and then using the _bh locks was costing quite a
bit and is not needed as Jesper pointer out.

So new pps data here in somewhat raw format. I ran five iterations of
each thread count (1,2,4,8,12)

nolock (pfifo_fast)
1:  1440293 1421602 1409553 1393469 1424543
2:  1754890 1819292 1727948 1797711 1743427
4:  3282665 3344095 3315220 3332777 3348972
8:  2940079 1644450 2950777 2922085 2946310
12: 2042084 2610060 2857581 3493162 3104611

lock (pfifo_fast)
1:  1471479 1469142 1458825 1456788 1453952
2:  1746231 1749490 1753176 1753780 1755959
4:  1119626 1120515 1121478 1119220 1121115
8:  1001471  999308 1000318 1000776 1000384
12:  989269  992122  991590  986581  990430

nolock (mq)
1:   1435952  1459523  1448860  1385451   1435031
2:   2850662  2855702  2859105  2855443   2843382
4:   5288135  5271192  5252242  5270192   5311642
8:  10042731 10018063  9891813  9968382   9956727
12: 13265277 13384199 13438955 13363771  13436198

lock (mq)
1:   1448374  1444208  1437459  1437088  1452453
2:   2687963  2679221  2651059  2691630  2667479
4:   5153884  4684153  5091728  4635261  4902381
8:   9292395  9625869  9681835  9711651  9660498
12: 13553918 13682410 14084055 13946138 13724726

So then if we just use the first test example because I'm being a
bit lazy and don't want to calculate the avg/mean/whatever we get
a pfifo_fast chart like,

      locked             nolock           diff
---------------------------------------------------
1     1471479            1440293          −  31186
2     1746231            1754890          +   8659
4     1119626            3282665          +2163039
8     1119626            2940079          +1820453
12     989269            2857581*         +1868312

[*] I pulled the 3rd iteration here as the 1st one seems off

And the mq chart looks reasonable again with these changes,


       locked            nolock           diff
---------------------------------------------------
1       1448374          1435952          -  12422
2       2687963          2850662          + 162699
4       5153884          5288135          + 134251
8       9292395         10042731          + 750336
12     13553918         13265277          - 288641

So the mq case is a bit of a wash from my point of view which I sort
of expected seeing in this test case there is no contention on the
enqueue()/producer or dequeue()/consumer case when running pktgen
at 1 thread per qdisc/queue. A better test would be to fire up a few
thousand udp sessions and bang on the qdiscs to get contention on the
enqueue side. I'll try this next. On another note the variance is a
touch concerning in the data above for the no lock case so might look
into that a bit more to see why we can get 1mpps swing in one of those
cases I sort of wonder if something kicked off on my test machine
to cause that.

Also I'm going to take a look at Jesper's microbenchmark numbers but I
think if I can convince myself that using skb_array helps or at least
does no harm I might push to have this include with skb_array and then
work on optimizing the ring type/kind/etc. as a follow up patch.
Additionally it does seem to provide goodness on the pfifo_fast single
queue case.

Final point is there are more optimizations we can do once the enqueue
and dequeue is separated. For example two fairly easy things include
removing HARD_TX_LOCK nn NICs with a ring per core and adding bulk
dequeue() to the skb_array or alf queue or whatever object we end up
on. And I expect this will provide additional perf boost.

Thanks,
John

next prev parent reply	other threads:[~2016-07-15 22:18 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-14  6:19 [RFC PATCH v2 00/10] running qdiscs without qdisc_lock John Fastabend
2016-07-14  6:19 ` [RFC PATCH v2 01/10] net: sched: allow qdiscs to handle locking John Fastabend
2016-07-14  6:44   ` John Fastabend
2016-07-14  6:20 ` [RFC PATCH v2 02/10] net: sched: qdisc_qlen for per cpu logic John Fastabend
2016-07-14  6:20 ` [RFC PATCH v2 03/10] net: sched: provide per cpu qstat helpers John Fastabend
2016-07-14  6:21 ` [RFC PATCH v2 04/10] net: sched: a dflt qdisc may be used with per cpu stats John Fastabend
2016-07-14  6:21 ` [RFC PATCH v2 05/10] net: sched: per cpu gso handlers John Fastabend
2016-07-14  6:22 ` [RFC PATCH v2 06/10] net: sched: support qdisc_reset on NOLOCK qdisc John Fastabend
2016-07-14  6:22 ` [RFC PATCH v2 07/10] net: sched: support skb_bad_tx with lockless qdisc John Fastabend
2016-07-14  6:23 ` [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue John Fastabend
2016-07-14 15:11   ` Jesper Dangaard Brouer
2016-07-15  0:09     ` John Fastabend
2016-07-15 10:09       ` Jesper Dangaard Brouer
2016-07-15 17:29         ` John Fastabend
2016-07-14 23:42   ` Alexei Starovoitov
2016-07-15  0:07     ` John Fastabend
2016-07-15 11:23       ` Jesper Dangaard Brouer
2016-07-15 22:18         ` John Fastabend [this message]
2016-07-15 22:48           ` Alexei Starovoitov
2016-07-14  6:23 ` [RFC PATCH v2 09/10] net: sched: helper to sum qlen John Fastabend
2016-07-14  6:24 ` [RFC PATCH v2 10/10] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq John Fastabend

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=57896124.6090402@gmail.com \
    --to=john.fastabend@gmail.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=brouer@redhat.com \
    --cc=eric.dumazet@gmail.com \
    --cc=fw@strlen.de \
    --cc=jhs@mojatatu.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.