All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: netdev@vger.kernel.org, therbert@google.com,
	"David S. Miller" <davem@davemloft.net>,
	Alexander Duyck <alexander.h.duyck@intel.com>,
	toke@toke.dk, Florian Westphal <fw@strlen.de>,
	jhs@mojatatu.com, Dave Taht <dave.taht@gmail.com>,
	John Fastabend <john.r.fastabend@intel.com>,
	Daniel Borkmann <dborkman@redhat.com>,
	Hannes Frederic Sowa <hannes@stressinduktion.org>,
	brouer@redhat.com
Subject: Re: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
Date: Wed, 24 Sep 2014 19:58:31 +0200	[thread overview]
Message-ID: <20140924195831.6fb91051@redhat.com> (raw)
In-Reply-To: <1411579395.15395.41.camel@edumazet-glaptop2.roam.corp.google.com>

On Wed, 24 Sep 2014 10:23:15 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2014-09-24 at 18:12 +0200, Jesper Dangaard Brouer wrote:
> > Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
> > sending/processing an entire skb list.
> > 
> > This patch implements qdisc bulk dequeue, by allowing multiple packets
> > to be dequeued in dequeue_skb().
> > 
> > The optimization principle for this is two fold, (1) to amortize
> > locking cost and (2) avoid expensive tailptr update for notifying HW.
> >  (1) Several packets are dequeued while holding the qdisc root_lock,
> > amortizing locking cost over several packet.  The dequeued SKB list is
> > processed under the TXQ lock in dev_hard_start_xmit(), thus also
> > amortizing the cost of the TXQ lock.
> >  (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more
> > API to delay HW tailptr update, which also reduces the cost per
> > packet.
> > 
> > One restriction of the new API is that every SKB must belong to the
> > same TXQ.  This patch takes the easy way out, by restricting bulk
> > dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the
> > qdisc only have attached a single TXQ.
> > 
> > Some detail about the flow; dev_hard_start_xmit() will process the skb
> > list, and transmit packets individually towards the driver (see
> > xmit_one()).  In case the driver stops midway in the list, the
> > remaining skb list is returned by dev_hard_start_xmit().  In
> > sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
> > 
> > To avoid overshooting the HW limits, which results in requeuing, the
> > patch limits the amount of bytes dequeued, based on the drivers BQL
> > limits.  In-effect bulking will only happen for BQL enabled drivers.
> > Besides the bytelimit from BQL, also limit bulking to maximum 8
> > packets to avoid any issues with available descriptor in HW.
> > 
> > For now, as a conservative approach, don't bulk dequeue GSO and
> > segmented GSO packets, because testing showed requeuing occuring
> > with segmented GSO packets.
> > 
> > Jointed work with Hannes, Daniel and Florian.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> > Signed-off-by: Florian Westphal <fw@strlen.de>
> > 
> > ---
> > V4:
> > - Patch rewritten in the Red Hat Neuchatel office jointed work with
> >   Hannes, Daniel and Florian.
> > - Conservative approach of only using on BQL enabled drivers
> > - No user tunable parameter, but limit bulking to 8 packets.
> > - For now, avoid bulking GSO packets packets.
> > 
[...]
> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> > index 19696eb..6fba089 100644
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -56,6 +56,42 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
> >  	return 0;
> >  }
> >  
> > +static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q,
> > +					    struct sk_buff *head_skb,
> > +					    int bytelimit)
> > +{
> > +	struct sk_buff *skb, *tail_skb = head_skb;
> > +	int budget = 8; /* Arbitrary, but conservatively choosen limit */
> > +
> > +	while (bytelimit > 0 && --budget > 0) {
> > +		/* For now, don't bulk dequeue GSO (or GSO segmented) pkts */
> 
> Hmm... this is a serious limitation.

We plan to remove this at a later point, this is just to be conservative.

 
> > +		if (tail_skb->next || skb_is_gso(tail_skb))
> > +			break;
> > +
> > +		skb = q->dequeue(q);
> > +		if (!skb)
> > +			break;
> > +
> > +		bytelimit -= skb->len; /* covers GSO len */
> 
> Not really, use qdisc_pkt_len(skb) instead, to have a better byte count.

Understood, will update patch tomorrow.

 
> > +		skb = validate_xmit_skb(skb, qdisc_dev(q));
> > +		if (!skb)
> > +			break;
> > +
> > +		/* "skb" can be a skb list after validate call above
> > +		 * (GSO segmented), but it is okay to append it to
> > +		 * current tail_skb->next, because next round will exit
> > +		 * in-case "tail_skb->next" is a skb list.
> > +		 */
> 
> It would be trivial to change validate_xmit_skb() to return the head and
> tail of the chain. Walking the chain would hit hot cache lines, but it
> is better not having to walk it.

Yes, we could do this when we later add support for GSO segmented packets.


> > +		tail_skb->next = skb;
> > +		tail_skb = skb;
> 
> So that here we do : tail_skb = tail;
> 
> > +	}
> > +
> > +	return head_skb;
> > +}
> > +
> > +/* Note that dequeue_skb can possibly return a SKB list (via skb->next).
> > + * A requeued skb (via q->gso_skb) can also be a SKB list.
> > + */
> >  static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
> >  {
> >  	struct sk_buff *skb = q->gso_skb;
> > @@ -70,10 +106,17 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
> >  		} else
> >  			skb = NULL;
> >  	} else {
> > -		if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) {
> > +		if (!(q->flags & TCQ_F_ONETXQUEUE) ||
> > +		    !netif_xmit_frozen_or_stopped(txq)) {
> > +			int bytelimit = qdisc_avail_bulklimit(txq);
> > +
> >  			skb = q->dequeue(q);
> > -			if (skb)
> > +			if (skb) {
> > +				bytelimit -= skb->len;
> 
>   qdisc_pkt_len(skb)

Okay, will update patch tomorrow.
 
> >  				skb = validate_xmit_skb(skb, qdisc_dev(q));
> > +			}
> > +			if (skb && qdisc_may_bulk(q))
> > +				skb = try_bulk_dequeue_skb(q, skb, bytelimit);
> >  		}
> >  	}
> >  
> 
> It looks good, but we really need to take care of TSO packets.

As notes above, TSO packets are planned as a future feature.  Lets
first see if this works well, without introducing any HoL blocking
issues or other regressions.


> pktgen is nice, but do not represent the majority of the traffic we send
> from high performance host where we want this bulk dequeue thing ;)

This patch is actually targetted towards more normal use-cases.
Pktgen cannot even use this work, as it bypass the qdisc layer...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

  reply	other threads:[~2014-09-24 18:29 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-09-24 16:10 [net-next PATCH 0/1 V4] qdisc bulk dequeuing and utilizing delayed tailptr updates Jesper Dangaard Brouer
2014-09-24 16:12 ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
2014-09-24 17:23   ` Eric Dumazet
2014-09-24 17:58     ` Jesper Dangaard Brouer [this message]
2014-09-24 18:34       ` Tom Herbert
2014-09-24 19:22         ` Eric Dumazet
2014-09-25  2:12           ` Eric Dumazet
2014-09-25  2:38             ` Tom Herbert
2014-09-25  2:58               ` Dave Taht
2014-09-26 18:06                 ` Eric Dumazet
2014-09-25 23:40             ` Eric Dumazet
2014-09-26  6:04               ` [PATCH net-next] dql: dql_queued() should write first to reduce bus transactions Eric Dumazet
2014-09-26  8:47                 ` Jesper Dangaard Brouer
2014-09-26 11:06                 ` Hannes Frederic Sowa
2014-09-26 12:02                   ` Eric Dumazet
2014-09-28 21:43                 ` David Miller
2014-09-26  9:23               ` [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE David Laight
2014-09-26 13:16                 ` David Laight
2014-09-26 13:38                   ` Eric Dumazet
2014-09-24 22:13       ` Jamal Hadi Salim
2014-09-25  8:25         ` Jesper Dangaard Brouer
2014-09-25 12:48           ` Jamal Hadi Salim
2014-09-25 14:40             ` Tom Herbert
2014-09-25 14:57               ` Jesper Dangaard Brouer
2014-09-25 15:05                 ` Tom Herbert
2014-09-25 15:23                   ` Jesper Dangaard Brouer
2014-09-25 15:58                     ` Tom Herbert
2014-09-29 20:23                       ` Jesper Dangaard Brouer
2014-09-29 20:14                     ` Jesper Dangaard Brouer
2014-09-25 15:12                 ` Eric Dumazet
2014-09-25 13:52           ` Dave Taht

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140924195831.6fb91051@redhat.com \
    --to=brouer@redhat.com \
    --cc=alexander.h.duyck@intel.com \
    --cc=dave.taht@gmail.com \
    --cc=davem@davemloft.net \
    --cc=dborkman@redhat.com \
    --cc=eric.dumazet@gmail.com \
    --cc=fw@strlen.de \
    --cc=hannes@stressinduktion.org \
    --cc=jhs@mojatatu.com \
    --cc=john.r.fastabend@intel.com \
    --cc=netdev@vger.kernel.org \
    --cc=therbert@google.com \
    --cc=toke@toke.dk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.