Date: Wed, 29 Nov 2017 09:01:49 +0300
From: Roman Kagan
Subject: Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
To: "Longpeng (Mike)"
Cc: Eduardo Habkost, "Denis V. Lunev", "Michael S. Tsirkin", Denis Plotnikov,
 pbonzini@redhat.com, rth@twiddle.net, qemu-devel@nongnu.org, Gonglei,
 huangpeng, zhaoshenglong, herongguang.he@huawei.com
Message-ID: <20171129060149.GD2374@rkaganip.lan>
In-Reply-To: <5A1E43A6.1010607@huawei.com>
References: <1511530010-511740-1-git-send-email-dplotnikov@virtuozzo.com>
 <20171128195817.GA29077@localhost.localdomain>
 <2cd31202-27c3-983f-85a9-6814ff504706@virtuozzo.com>
 <20171128211326.GV3037@localhost.localdomain>
 <5A1E43A6.1010607@huawei.com>

On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote:
> On 2017/11/29 5:13, Eduardo Habkost wrote:
> >
> > [CCing the people who were copied in the original patch that
> > enabled l3cache]
> >
> > On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> >> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> >>> Hi,
> >>>
> >>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> >>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> >>>> introduced exposing an L3 cache to the guest and enabled it by default.
> >>>>
> >>>> The motivation behind it was that in the Linux scheduler, when waking up
> >>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> >>>> directly, without sending a reschedule IPI.  The reduction in the IPI
> >>>> count led to a performance gain.
> >>>>
> >>>> However, this isn't the whole story.  Once the task is on the target
> >>>> CPU's runqueue, it may have to preempt the current task on that CPU, be
> >>>> it the idle task putting the CPU to sleep or just another running task.
> >>>> For that a reschedule IPI will have to be issued, too.  Only when that
> >>>> other CPU has been running a normal task for too little time will the
> >>>> fairness constraints prevent the preemption and thus the IPI.
> >>>>
>
> Agree. :)
>
> Our testing VM is Suse11 guest with idle=poll at that time and now I realize
                                      ^^^^^^^^^

Oh, that's a whole lot of difference!  I wish you had mentioned that in
that patch.

> that Suse11 has a BUG in its scheduler.
>
> For RHEL 7.3 or an upstream kernel, in ttwu_queue_remote(), a RES IPI is
> issued only if rq->idle is not polling:
> '''
> static void ttwu_queue_remote(struct task_struct *p, int cpu)
> {
>         struct rq *rq = cpu_rq(cpu);
>
>         if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
>                 if (!set_nr_if_polling(rq->idle))
>                         smp_send_reschedule(cpu);
>                 else
>                         trace_sched_wake_idle_without_ipi(cpu);
>         }
> }
> '''
>
> But Suse11 does not do this check; it sends a RES IPI unconditionally.
>
> >>>> This boils down to the improvement being only achievable in workloads
> >>>> with many actively switching tasks.  We had no access to the
> >>>> (proprietary?) SAP HANA benchmark the commit referred to, but the
> >>>> pattern is also reproduced with "perf bench sched messaging -g 1"
> >>>> on a 1-socket, 8-core vCPU topology, where we see indeed:
> >>>>
> >>>>   l3-cache    #res IPI /s    #time / 10000 loops
> >>>>   off         560K           1.8 sec
> >>>>   on          40K            0.9 sec
> >>>>
> >>>> Now there's a downside: with an L3 cache the Linux scheduler is more
> >>>> eager to wake up tasks on sibling CPUs, resulting in unnecessary
> >>>> cross-vCPU interactions and therefore excessive halts and IPIs.
> >>>> E.g. "perf bench sched pipe -i 100000" gives
> >>>>
> >>>>   l3-cache    #res IPI /s    #HLT /s    #time / 100000 loops
> >>>>   off         200 (no K)     230        0.2 sec
> >>>>   on          400K           330K       0.5 sec
> >>>>
> I guess this issue could be resolved by disabling SD_WAKE_AFFINE.

But that requires extra tuning in the guest, which is even less likely to
happen in the cloud case where VM admin != host admin.

> As Gonglei said:
> 1. the L3 cache relates to the user experience.
> 2. glibc gets the cache info directly via CPUID, so it also relates to
>    memory performance.
>
> What's more, the L3 cache relates to the sched_domain, which is important
> to the (load) balancer when the system is busy.
>
> All this doesn't mean the patch is insignificant; I just think we should do
> more research before deciding.  I'll do some tests, thanks. :)

Looking forward to it, thanks!

Roman.
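
P.S. Not from the patch, just a quick illustration of where the difference
becomes visible to the guest: the cache topology that both the scheduler and
glibc rely on comes from CPUID leaf 4, and the l3-cache property only changes
whether an L3 entry shows up there.  Assuming an x86 guest built with GCC's
<cpuid.h>, something like the sketch below can be run inside a VM started
with "-cpu host,l3-cache=off" and again with the default to compare:

'''
/* Illustration only: enumerate the cache hierarchy the guest derives
 * from CPUID leaf 4.  With l3-cache=off the loop should stop after L2;
 * with the default an additional unified L3 entry is reported. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    for (unsigned int i = 0; ; i++) {
        __cpuid_count(4, i, eax, ebx, ecx, edx);

        unsigned int type = eax & 0x1f;              /* 0 = no more caches */
        if (!type)
            break;

        unsigned int level = (eax >> 5) & 0x7;
        unsigned int line  = (ebx & 0xfff) + 1;          /* line size, bytes  */
        unsigned int parts = ((ebx >> 12) & 0x3ff) + 1;  /* line partitions   */
        unsigned int ways  = ((ebx >> 22) & 0x3ff) + 1;  /* associativity     */
        unsigned int sets  = ecx + 1;

        printf("L%u %s cache: %u KiB\n", level,
               type == 1 ? "data" : type == 2 ? "instruction" : "unified",
               ways * parts * line * sets / 1024);
    }
    return 0;
}
'''

The same information is what glibc ends up reporting through
sysconf(_SC_LEVEL3_CACHE_SIZE), which is the "user experience" angle
Gonglei mentioned.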