All of lore.kernel.org
 help / color / mirror / Atom feed
From: Stephen Hemminger <stephen@networkplumber.org>
To: Florian Westphal <fw@strlen.de>
Cc: <netdev@vger.kernel.org>
Subject: Re: [PATCH RFC] sched: split classification and enqueue
Date: Wed, 22 Jun 2016 11:15:52 -0700	[thread overview]
Message-ID: <20160622111552.30f64c35@xeon-e3> (raw)
In-Reply-To: <1466589835-18521-1-git-send-email-fw@strlen.de>

On Wed, 22 Jun 2016 12:03:55 +0200
Florian Westphal <fw@strlen.de> wrote:

> Currently classification and enqueue is done in a single step.
> 
> core acquires the qdisc lock, then calls the ->enqueue() function
> of the qdisc.
> 
> Its the job of the qdisc and its attached classifiers to figure out what
> to do next.
> 
> Typically the enqueue function will call tc_classify() to lookup a
> child class, then call ->enqueue of the child qdisc.
> 
> This can repeat a number of times until a leaf qdisc is reached; this leaf
> will do the real enqueue operation (pfifo for example).
> 
> While this approach gives qdiscs and the classifier/action subsystem
> a lot of control, it has one major drawback:  The root qdisc lock
> is held for a prolonged period of time while we recurse through
> the qdisc hierarchy from root to leaf.
> 
> This (unfinished!) hack splits classification and enqueue into
> two steps.
> 
> Before enqueueing the packet and *before* acquiring the root qdisc lock,
> the new qdisc ->classify() function is invoked.
> 
> This function -- much like enqueue in the current scheme -- looks up
> a child class and/or determines the next qdisc where the packet needs
> to be handled via the classifier action subsystem.
> Then it invokes the new ->classify() hook of the child, which can repeat
> until the leaf qdisc is reached.
> 
> Whenever a classify operation has succeeded, each classify "level"
> stores itself (qdisc) and a cookie (typically just a pointer to the
> class) in the newly added Qdisc_classify_state struct.
> 
> After qdisc_classify returns, this struct contains the path that
> the skb is supposed to take through the qdisc hierarchy in *reverse*
> order, i.e. starting from the leaf up to the root.
> 
> Then, the root qdisc lock is taken and the *leaf* (first entry in
> Qdisc_classify_state) ->enqueue function is invoked.
> 
> If this succeeds, the kernel will invoke the new ->senqeue
> function for all the remaining entries in Qdisc_classify_state.
> 
> This function is responsible to perform needed qdisc internal actions,
> f.e. scheduling a class for transmit.
> 
> This reduces the amount of work done under qdisc root lock.
> Very simple test w. hfsc + u32 filter, gso off, 10G link and 20 netperf
> TCP_STREAM:
> 
>   before: 8.2 GB/s
>   after: 9.4 GB/s (same as w. mq+fq_codel)
> 
> Known issues with this patch:
>  - only converted a few qdiscs
>  - qdisc stats will race since they're changed outside of root lock.
>  - need to protect vs. class removal during classify ops
>  - mirred needs some work for this (but not worse than current situation)
>  - use-after-free in some places, need to ensure that skb remains valid
>    iff enqueue returns != DROP.
>  - senqeue is a horrid name.  It is probably better to split qdiscs
>    into leaf and non-leaf ones, where leaf qdiscs implement ->enqueue() and
>    the others a notify_enqueue() (which doesn't return a value).
> 
> So far I did not see any fundamental issues with this approach.
> 
> If you see any, I'd be happy to learn about them before I spend more
> cycles on this.  The main work to be done is to convert qstats to
> percpu counters and then convert all the qdiscs to this new scheme,
> and of course to extend this to all intree schedulers.
> 
> If you attend NF workshop in Amsterdam feel free to yell at me in person there.
> ---
>  include/net/sch_generic.h | 50 +++++++++++++++++++++++++++++++++
>  net/core/dev.c            | 10 ++++++-
>  net/sched/sch_drr.c       | 38 +++++++++++++++++++++++--
>  net/sched/sch_fq_codel.c  | 71 +++++++++++++++++++++++++++++++++++++++++++++--
>  net/sched/sch_generic.c   | 32 +++++++++++++++++++++
>  net/sched/sch_hfsc.c      | 38 +++++++++++++++++++++++--
>  6 files changed, 232 insertions(+), 7 deletions(-)
> 

The architecture of linux packet scheduler makes this too racy.
Classification can be attached to different nodes on the hierarchy, and the
hierarchy could change underneath.

It is a good idea, but maybe would be better to have a separate classification
step before scheduler (use BPF) and put meta-data in skb. That way it could
be safely lockless and atomic.

      parent reply	other threads:[~2016-06-22 18:17 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-22 10:03 [PATCH RFC] sched: split classification and enqueue Florian Westphal
2016-06-22 17:05 ` Cong Wang
2016-06-22 17:23   ` Florian Westphal
2016-06-22 17:29 ` Alexei Starovoitov
2016-06-22 17:41   ` Florian Westphal
2016-06-22 18:15 ` Stephen Hemminger [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160622111552.30f64c35@xeon-e3 \
    --to=stephen@networkplumber.org \
    --cc=fw@strlen.de \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.