From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Dumazet
Subject: Re: Per-CPU Queueing for QoS
Date: Mon, 13 Nov 2017 15:08:36 -0800
Message-ID: <1510614516.2849.157.camel@edumazet-glaptop3.roam.corp.google.com>
References: <20171112161431.04d45345@xeon-e3>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Cc: Michael Ma, Stephen Hemminger, Linux Kernel Network Developers,
 jianjun.duan@alibaba-inc.com, xiangning.yu@alibaba-inc.com
To: Alexander Duyck
Sender: netdev-owner@vger.kernel.org
List-ID:

On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma wrote:
> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger:
> >> On Sun, 12 Nov 2017 13:43:13 -0800
> >> Michael Ma wrote:
> >>
> >>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
> >>>
> >>> Thanks,
> >>> Michael
> >>>
> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma wrote:
> >>> >
> >>> > Currently txq/qdisc selection is based on flow hash, so packets from
> >>> > the same flow keep their order when they enter the qdisc/txq, which
> >>> > avoids the out-of-order problem.
> >>> >
> >>> > To improve the concurrency of the QoS algorithm we plan to have multiple
> >>> > per-cpu queues for a single TC class and do busy polling from a
> >>> > per-class thread to drain these queues. If we can do this frequently
> >>> > enough, the out-of-order situation in this polling thread should not be
> >>> > that bad.
> >>> >
> >>> > To give more details - in the send path we introduce per-cpu per-class
> >>> > queues so that packets from the same class and the same core will be
> >>> > enqueued to the same place. A per-class thread then polls the queues
> >>> > belonging to its class from all the cpus and aggregates them into
> >>> > another per-class queue. This can effectively reduce contention but
> >>> > inevitably introduces a potential out-of-order issue.
> >>> >
> >>> > Any concern/suggestion for working towards this direction?
> >>
> >> In general, there are no meta design discussions in Linux development.
> >> Several developers have tried to do lockless
> >> qdisc and similar things in the past.
> >>
> >> The devil is in the details, show us the code.
> >
> > Thanks for the response, Stephen. The code is fairly straightforward;
> > we have a per-cpu per-class queue defined like this:
> >
> > struct bandwidth_group
> > {
> >         struct skb_list queues[MAX_CPU_COUNT];
> >         struct skb_list drain;
> > };
> >
> > The "drain" queue is used to aggregate the per-cpu queues belonging to the
> > same class. In the enqueue function, we determine the cpu where the
> > packet is processed and enqueue it to the corresponding per-cpu queue:
> >
> > int cpu;
> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
> >
> > cpu = get_cpu();
> > skb_list_append(&bwg->queues[cpu], skb);
> > put_cpu();
> >
> > Here we don't check the flow of the packet, so if there is task
> > migration, or multiple threads sending packets through the same flow, we
> > can theoretically have packets enqueued to different queues and
> > aggregated to the "drain" queue out of order.
> >
> > Also, AFAIK there is no lockless htb-like qdisc implementation
> > currently; however, if there is already a similar effort ongoing, please
> > let me know.
>
> The question I would have is how this would differ from using XPS w/
> mqprio? Would this be a classful qdisc like HTB or a classless one
> like mqprio?
>
> From what I can tell, XPS would be able to get you your per-cpu
> functionality; the benefit of it, though, is that it would avoid
> out-of-order issues for sockets originating on the local system. The
> only thing I see as an issue right now is that the rate limiting with
> mqprio is assumed to be handled in hardware via mechanisms such as
> DCB.

I think one of the key points was in:

"do busy polling from a per-class thread to drain these queues."

I mentioned this idea in the TX path of:

https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf