From: Stephen Hemminger <stephen@networkplumber.org>
To: Florian Westphal <fw@strlen.de>
Cc: <netdev@vger.kernel.org>
Subject: Re: [PATCH RFC] sched: split classification and enqueue
Date: Wed, 22 Jun 2016 11:15:52 -0700 [thread overview]
Message-ID: <20160622111552.30f64c35@xeon-e3> (raw)
In-Reply-To: <1466589835-18521-1-git-send-email-fw@strlen.de>
On Wed, 22 Jun 2016 12:03:55 +0200
Florian Westphal <fw@strlen.de> wrote:
> Currently classification and enqueue is done in a single step.
>
> core acquires the qdisc lock, then calls the ->enqueue() function
> of the qdisc.
>
> Its the job of the qdisc and its attached classifiers to figure out what
> to do next.
>
> Typically the enqueue function will call tc_classify() to lookup a
> child class, then call ->enqueue of the child qdisc.
>
> This can repeat a number of times until a leaf qdisc is reached; this leaf
> will do the real enqueue operation (pfifo for example).
>
> While this approach gives qdiscs and the classifier/action subsystem
> a lot of control, it has one major drawback: The root qdisc lock
> is held for a prolonged period of time while we recurse through
> the qdisc hierarchy from root to leaf.
>
> This (unfinished!) hack splits classification and enqueue into
> two steps.
>
> Before enqueueing the packet and *before* acquiring the root qdisc lock,
> the new qdisc ->classify() function is invoked.
>
> This function -- much like enqueue in the current scheme -- looks up
> a child class and/or determines the next qdisc where the packet needs
> to be handled via the classifier action subsystem.
> Then it invokes the new ->classify() hook of the child, which can repeat
> until the leaf qdisc is reached.
>
> Whenever a classify operation has succeeded, each classify "level"
> stores itself (qdisc) and a cookie (typically just a pointer to the
> class) in the newly added Qdisc_classify_state struct.
>
> After qdisc_classify returns, this struct contains the path that
> the skb is supposed to take through the qdisc hierarchy in *reverse*
> order, i.e. starting from the leaf up to the root.
>
> Then, the root qdisc lock is taken and the *leaf* (first entry in
> Qdisc_classify_state) ->enqueue function is invoked.
>
> If this succeeds, the kernel will invoke the new ->senqeue
> function for all the remaining entries in Qdisc_classify_state.
>
> This function is responsible to perform needed qdisc internal actions,
> f.e. scheduling a class for transmit.
>
> This reduces the amount of work done under qdisc root lock.
> Very simple test w. hfsc + u32 filter, gso off, 10G link and 20 netperf
> TCP_STREAM:
>
> before: 8.2 GB/s
> after: 9.4 GB/s (same as w. mq+fq_codel)
>
> Known issues with this patch:
> - only converted a few qdiscs
> - qdisc stats will race since they're changed outside of root lock.
> - need to protect vs. class removal during classify ops
> - mirred needs some work for this (but not worse than current situation)
> - use-after-free in some places, need to ensure that skb remains valid
> iff enqueue returns != DROP.
> - senqeue is a horrid name. It is probably better to split qdiscs
> into leaf and non-leaf ones, where leaf qdiscs implement ->enqueue() and
> the others a notify_enqueue() (which doesn't return a value).
>
> So far I did not see any fundamental issues with this approach.
>
> If you see any, I'd be happy to learn about them before I spend more
> cycles on this. The main work to be done is to convert qstats to
> percpu counters and then convert all the qdiscs to this new scheme,
> and of course to extend this to all intree schedulers.
>
> If you attend NF workshop in Amsterdam feel free to yell at me in person there.
> ---
> include/net/sch_generic.h | 50 +++++++++++++++++++++++++++++++++
> net/core/dev.c | 10 ++++++-
> net/sched/sch_drr.c | 38 +++++++++++++++++++++++--
> net/sched/sch_fq_codel.c | 71 +++++++++++++++++++++++++++++++++++++++++++++--
> net/sched/sch_generic.c | 32 +++++++++++++++++++++
> net/sched/sch_hfsc.c | 38 +++++++++++++++++++++++--
> 6 files changed, 232 insertions(+), 7 deletions(-)
>
The architecture of linux packet scheduler makes this too racy.
Classification can be attached to different nodes on the hierarchy, and the
hierarchy could change underneath.
It is a good idea, but maybe would be better to have a separate classification
step before scheduler (use BPF) and put meta-data in skb. That way it could
be safely lockless and atomic.
prev parent reply other threads:[~2016-06-22 18:17 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-06-22 10:03 [PATCH RFC] sched: split classification and enqueue Florian Westphal
2016-06-22 17:05 ` Cong Wang
2016-06-22 17:23 ` Florian Westphal
2016-06-22 17:29 ` Alexei Starovoitov
2016-06-22 17:41 ` Florian Westphal
2016-06-22 18:15 ` Stephen Hemminger [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160622111552.30f64c35@xeon-e3 \
--to=stephen@networkplumber.org \
--cc=fw@strlen.de \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).