netdev.vger.kernel.org archive mirror
From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: netdev@vger.kernel.org, therbert@google.com,
	"David S. Miller" <davem@davemloft.net>,
	Alexander Duyck <alexander.h.duyck@intel.com>,
	toke@toke.dk, Florian Westphal <fw@strlen.de>,
	jhs@mojatatu.com, Dave Taht <dave.taht@gmail.com>,
	John Fastabend <john.r.fastabend@intel.com>,
	Daniel Borkmann <dborkman@redhat.com>,
	Hannes Frederic Sowa <hannes@stressinduktion.org>,
	brouer@redhat.com
Subject: Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
Date: Wed, 24 Sep 2014 19:58:31 +0200	[thread overview]
Message-ID: <20140924195831.6fb91051@redhat.com> (raw)
In-Reply-To: <1411579395.15395.41.camel@edumazet-glaptop2.roam.corp.google.com>

On Wed, 24 Sep 2014 10:23:15 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2014-09-24 at 18:12 +0200, Jesper Dangaard Brouer wrote:
> > Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
> > sending/processing an entire skb list.
> > 
> > This patch implements qdisc bulk dequeue, by allowing multiple packets
> > to be dequeued in dequeue_skb().
> > 
> > The optimization principle for this is twofold: (1) amortize the
> > locking cost and (2) avoid the expensive tailptr update for notifying HW.
> >  (1) Several packets are dequeued while holding the qdisc root_lock,
> > amortizing the locking cost over several packets.  The dequeued SKB
> > list is processed under the TXQ lock in dev_hard_start_xmit(), thus
> > also amortizing the cost of the TXQ lock.
> >  (2) Furthermore, dev_hard_start_xmit() will utilize the skb->xmit_more
> > API to delay the HW tailptr update, which also reduces the cost per
> > packet.
> > 
> > One restriction of the new API is that every SKB must belong to the
> > same TXQ.  This patch takes the easy way out, by restricting bulk
> > dequeue to qdiscs with the TCQ_F_ONETXQUEUE flag, which specifies
> > that the qdisc has only a single TXQ attached.
> > 
> > Some detail about the flow: dev_hard_start_xmit() will process the skb
> > list, and transmit packets individually towards the driver (see
> > xmit_one()).  If the driver stops midway through the list, the
> > remaining skb list is returned by dev_hard_start_xmit().  In
> > sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
> > 
> > To avoid overshooting the HW limits, which results in requeuing, the
> > patch limits the number of bytes dequeued, based on the driver's BQL
> > limits.  In effect, bulking will only happen for BQL-enabled drivers.
> > Besides the byte limit from BQL, bulking is also limited to a maximum
> > of 8 packets, to avoid any issues with the available descriptors in HW.
> > 
> > For now, as a conservative approach, don't bulk dequeue GSO and
> > segmented GSO packets, because testing showed requeuing occurring
> > with segmented GSO packets.
> > 
> > Joint work with Hannes, Daniel and Florian.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> > Signed-off-by: Florian Westphal <fw@strlen.de>
> > 
> > ---
> > V4:
> > - Patch rewritten in the Red Hat Neuchatel office, joint work with
> >   Hannes, Daniel and Florian.
> > - Conservative approach of only bulking on BQL-enabled drivers
> > - No user-tunable parameter, but limit bulking to 8 packets.
> > - For now, avoid bulking GSO packets.
> > 
[...]
> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> > index 19696eb..6fba089 100644
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -56,6 +56,42 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
> >  	return 0;
> >  }
> >  
> > +static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q,
> > +					    struct sk_buff *head_skb,
> > +					    int bytelimit)
> > +{
> > +	struct sk_buff *skb, *tail_skb = head_skb;
> > +	int budget = 8; /* Arbitrary, but conservatively chosen limit */
> > +
> > +	while (bytelimit > 0 && --budget > 0) {
> > +		/* For now, don't bulk dequeue GSO (or GSO segmented) pkts */
> 
> Hmm... this is a serious limitation.

We plan to remove this restriction at a later point; for now it is just
to be conservative.

 
> > +		if (tail_skb->next || skb_is_gso(tail_skb))
> > +			break;
> > +
> > +		skb = q->dequeue(q);
> > +		if (!skb)
> > +			break;
> > +
> > +		bytelimit -= skb->len; /* covers GSO len */
> 
> Not really, use qdisc_pkt_len(skb) instead, to have a better byte count.

Understood, will update patch tomorrow.

 
> > +		skb = validate_xmit_skb(skb, qdisc_dev(q));
> > +		if (!skb)
> > +			break;
> > +
> > +		/* "skb" can be a skb list after validate call above
> > +		 * (GSO segmented), but it is okay to append it to
> > +		 * current tail_skb->next, because next round will exit
> > +		 * in case "tail_skb->next" is a skb list.
> > +		 */
> 
> It would be trivial to change validate_xmit_skb() to return the head and
> tail of the chain. Walking the chain would hit hot cache lines, but it
> is better not having to walk it.

Yes, we could do this when we later add support for GSO segmented packets.


> > +		tail_skb->next = skb;
> > +		tail_skb = skb;
> 
> So that here we do : tail_skb = tail;
> 
> > +	}
> > +
> > +	return head_skb;
> > +}
> > +
> > +/* Note that dequeue_skb can possibly return a SKB list (via skb->next).
> > + * A requeued skb (via q->gso_skb) can also be a SKB list.
> > + */
> >  static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
> >  {
> >  	struct sk_buff *skb = q->gso_skb;
> > @@ -70,10 +106,17 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
> >  		} else
> >  			skb = NULL;
> >  	} else {
> > -		if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) {
> > +		if (!(q->flags & TCQ_F_ONETXQUEUE) ||
> > +		    !netif_xmit_frozen_or_stopped(txq)) {
> > +			int bytelimit = qdisc_avail_bulklimit(txq);
> > +
> >  			skb = q->dequeue(q);
> > -			if (skb)
> > +			if (skb) {
> > +				bytelimit -= skb->len;
> 
>   qdisc_pkt_len(skb)

Okay, will update patch tomorrow.
 
> >  				skb = validate_xmit_skb(skb, qdisc_dev(q));
> > +			}
> > +			if (skb && qdisc_may_bulk(q))
> > +				skb = try_bulk_dequeue_skb(q, skb, bytelimit);
> >  		}
> >  	}
> >  
> 
> It looks good, but we really need to take care of TSO packets.

As noted above, TSO packets are planned as a future feature.  Let's
first see if this works well, without introducing any HoL blocking
issues or other regressions.


> pktgen is nice, but do not represent the majority of the traffic we send
> from high performance host where we want this bulk dequeue thing ;)

This patch is actually targeted towards more normal use-cases.
Pktgen cannot even use this work, as it bypasses the qdisc layer...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
