From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Dumazet
Subject: Re: Per-CPU Queueing for QoS
Date: Mon, 13 Nov 2017 15:08:36 -0800
Message-ID: <1510614516.2849.157.camel@edumazet-glaptop3.roam.corp.google.com>
References: <20171112161431.04d45345@xeon-e3>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Cc: Michael Ma, Stephen Hemminger, Linux Kernel Network Developers,
 jianjun.duan@alibaba-inc.com, xiangning.yu@alibaba-inc.com
To: Alexander Duyck
Sender: netdev-owner@vger.kernel.org
List-ID:

On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma wrote:
> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger:
> >> On Sun, 12 Nov 2017 13:43:13 -0800
> >> Michael Ma wrote:
> >>
> >>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
> >>>
> >>> Thanks,
> >>> Michael
> >>>
> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma wrote:
> >>> >
> >>> > Currently txq/qdisc selection is based on flow hash, so packets from
> >>> > the same flow keep their order when they enter the qdisc/txq, which
> >>> > avoids the out-of-order problem.
> >>> >
> >>> > To improve the concurrency of the QoS algorithm we plan to have multiple
> >>> > per-cpu queues for a single TC class and do busy polling from a
> >>> > per-class thread to drain these queues. If we can do this frequently
> >>> > enough, the out-of-order situation in this polling thread should not be
> >>> > that bad.
> >>> >
> >>> > To give more details - in the send path we introduce per-cpu per-class
> >>> > queues so that packets from the same class and the same core will be
> >>> > enqueued to the same place. A per-class thread then polls the queues
> >>> > belonging to its class from all the cpus and aggregates them into
> >>> > another per-class queue. This can effectively reduce contention but
> >>> > inevitably introduces a potential out-of-order issue.
> >>> >
> >>> > Any concern/suggestion for working towards this direction?
> >>
> >> In general, there are no meta design discussions in Linux development.
> >> Several developers have tried to do lockless
> >> qdisc and similar things in the past.
> >>
> >> The devil is in the details, show us the code.
> >
> > Thanks for the response, Stephen. The code is fairly straightforward;
> > we have a per-cpu per-class queue defined like this:
> >
> > struct bandwidth_group
> > {
> >         struct skb_list queues[MAX_CPU_COUNT];
> >         struct skb_list drain;
> > };
> >
> > The "drain" queue is used to aggregate the per-cpu queues belonging to the
> > same class. In the enqueue function, we determine the cpu where the
> > packet is processed and enqueue it to the corresponding per-cpu queue:
> >
> > int cpu;
> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
> >
> > cpu = get_cpu();
> > skb_list_append(&bwg->queues[cpu], skb);
> > put_cpu();
> >
> > Here we don't check the flow of the packet, so if there is task
> > migration, or multiple threads sending packets through the same flow, we
> > can theoretically have packets enqueued to different queues and
> > aggregated to the "drain" queue out of order.
> >
> > Also, AFAIK there is no lockless htb-like qdisc implementation
> > currently; however, if there is already a similar effort ongoing, please
> > let me know.
>
> The question I would have is how this would differ from using XPS w/
> mqprio? Would this be a classful qdisc like HTB or a classless one
> like mqprio?
>
> From what I can tell, XPS would be able to get you your per-cpu
> functionality; the benefit of it, though, is that it would avoid
> out-of-order issues for sockets originating on the local system. The
> only thing I see as an issue right now is that the rate limiting with
> mqprio is assumed to be handled in hardware via mechanisms such as
> DCB.

I think one of the key points was in:

"do busy polling from a per-class thread to drain these queues."

I mentioned this idea in the TX path of:

https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf