Re: [PATCH RFC] sched: split classification and enqueue

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Stephen Hemminger <stephen@networkplumber.org>
To: Florian Westphal <fw@strlen.de>
Cc: <netdev@vger.kernel.org>
Subject: Re: [PATCH RFC] sched: split classification and enqueue
Date: Wed, 22 Jun 2016 11:15:52 -0700	[thread overview]
Message-ID: <20160622111552.30f64c35@xeon-e3> (raw)
In-Reply-To: <1466589835-18521-1-git-send-email-fw@strlen.de>

On Wed, 22 Jun 2016 12:03:55 +0200
Florian Westphal <fw@strlen.de> wrote:

> Currently classification and enqueue is done in a single step.
> 
> core acquires the qdisc lock, then calls the ->enqueue() function
> of the qdisc.
> 
> Its the job of the qdisc and its attached classifiers to figure out what
> to do next.
> 
> Typically the enqueue function will call tc_classify() to lookup a
> child class, then call ->enqueue of the child qdisc.
> 
> This can repeat a number of times until a leaf qdisc is reached; this leaf
> will do the real enqueue operation (pfifo for example).
> 
> While this approach gives qdiscs and the classifier/action subsystem
> a lot of control, it has one major drawback:  The root qdisc lock
> is held for a prolonged period of time while we recurse through
> the qdisc hierarchy from root to leaf.
> 
> This (unfinished!) hack splits classification and enqueue into
> two steps.
> 
> Before enqueueing the packet and *before* acquiring the root qdisc lock,
> the new qdisc ->classify() function is invoked.
> 
> This function -- much like enqueue in the current scheme -- looks up
> a child class and/or determines the next qdisc where the packet needs
> to be handled via the classifier action subsystem.
> Then it invokes the new ->classify() hook of the child, which can repeat
> until the leaf qdisc is reached.
> 
> Whenever a classify operation has succeeded, each classify "level"
> stores itself (qdisc) and a cookie (typically just a pointer to the
> class) in the newly added Qdisc_classify_state struct.
> 
> After qdisc_classify returns, this struct contains the path that
> the skb is supposed to take through the qdisc hierarchy in *reverse*
> order, i.e. starting from the leaf up to the root.
> 
> Then, the root qdisc lock is taken and the *leaf* (first entry in
> Qdisc_classify_state) ->enqueue function is invoked.
> 
> If this succeeds, the kernel will invoke the new ->senqeue
> function for all the remaining entries in Qdisc_classify_state.
> 
> This function is responsible to perform needed qdisc internal actions,
> f.e. scheduling a class for transmit.
> 
> This reduces the amount of work done under qdisc root lock.
> Very simple test w. hfsc + u32 filter, gso off, 10G link and 20 netperf
> TCP_STREAM:
> 
>   before: 8.2 GB/s
>   after: 9.4 GB/s (same as w. mq+fq_codel)
> 
> Known issues with this patch:
>  - only converted a few qdiscs
>  - qdisc stats will race since they're changed outside of root lock.
>  - need to protect vs. class removal during classify ops
>  - mirred needs some work for this (but not worse than current situation)
>  - use-after-free in some places, need to ensure that skb remains valid
>    iff enqueue returns != DROP.
>  - senqeue is a horrid name.  It is probably better to split qdiscs
>    into leaf and non-leaf ones, where leaf qdiscs implement ->enqueue() and
>    the others a notify_enqueue() (which doesn't return a value).
> 
> So far I did not see any fundamental issues with this approach.
> 
> If you see any, I'd be happy to learn about them before I spend more
> cycles on this.  The main work to be done is to convert qstats to
> percpu counters and then convert all the qdiscs to this new scheme,
> and of course to extend this to all intree schedulers.
> 
> If you attend NF workshop in Amsterdam feel free to yell at me in person there.
> ---
>  include/net/sch_generic.h | 50 +++++++++++++++++++++++++++++++++
>  net/core/dev.c            | 10 ++++++-
>  net/sched/sch_drr.c       | 38 +++++++++++++++++++++++--
>  net/sched/sch_fq_codel.c  | 71 +++++++++++++++++++++++++++++++++++++++++++++--
>  net/sched/sch_generic.c   | 32 +++++++++++++++++++++
>  net/sched/sch_hfsc.c      | 38 +++++++++++++++++++++++--
>  6 files changed, 232 insertions(+), 7 deletions(-)
> 

The architecture of linux packet scheduler makes this too racy.
Classification can be attached to different nodes on the hierarchy, and the
hierarchy could change underneath.

It is a good idea, but maybe would be better to have a separate classification
step before scheduler (use BPF) and put meta-data in skb. That way it could
be safely lockless and atomic.

     prev parent reply	other threads:[~2016-06-22 18:17 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-22 10:03 [PATCH RFC] sched: split classification and enqueue Florian Westphal
2016-06-22 17:05 ` Cong Wang
2016-06-22 17:23   ` Florian Westphal
2016-06-22 17:29 ` Alexei Starovoitov
2016-06-22 17:41   ` Florian Westphal
2016-06-22 18:15 ` Stephen Hemminger [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160622111552.30f64c35@xeon-e3 \
    --to=stephen@networkplumber.org \
    --cc=fw@strlen.de \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).