From mboxrd@z Thu Jan 1 00:00:00 1970
From: Radim Krčmář
Subject: Re: [PATCH 2/2] x86/idle: use dynamic halt poll
Date: Tue, 4 Jul 2017 16:13:23 +0200
Message-ID: <20170704141322.GC30880@potion>
References: <4444ffc8-9e7b-5bd2-20da-af422fe834cc@redhat.com>
 <2245bef7-b668-9265-f3f8-3b63d71b1033@gmail.com>
 <7d085956-2573-212f-44f4-86104beba9bb@gmail.com>
 <05ec7efc-fb9c-ae24-5770-66fc472545a4@redhat.com>
 <20170627134043.GA1487@potion>
 <2771f905-d1b0-b118-9ae9-db5fb87f877c@redhat.com>
 <20170627142251.GB1487@potion>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Paolo Bonzini, Wanpeng Li, Thomas Gleixner, Ingo Molnar,
 "H. Peter Anvin", the arch/x86 maintainers, Jonathan Corbet,
 tony.luck@intel.com, Borislav Petkov, Peter Zijlstra,
 mchehab@kernel.org, Andrew Morton, krzk@kernel.org,
 jpoimboe@redhat.com, Andy Lutomirski, Christian Borntraeger,
 Thomas Garnier, Robert Gerst, Mathias Krause,
 douly.fnst@cn.fujitsu.com, Nicolai Stange, Frederic Weisbecker,
 dvlasenk@redhat.com,
To: Yang Zhang
Return-path:
Content-Disposition: inline
In-Reply-To:
Sender: linux-doc-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

2017-07-03 17:28+0800, Yang Zhang:
> The background is that we (Alibaba Cloud) do get more and more complaints
> from our customers in both KVM and Xen compared to bare-metal. After
> investigations, the root cause is known to us: big cost in message passing
> workload (David showed it at KVM Forum 2015)
>
> A typical message workload like below:
> vcpu 0                          vcpu 1
> 1. send ipi                     2.  doing hlt
> 3. go into idle                 4.  receive ipi and wake up from hlt
> 5. write APIC time twice        6.  write APIC time twice to
>    to stop sched timer              reprogram sched timer

One write is enough to disable/re-enable the APIC timer -- why does
Linux use two?

> 7. doing hlt                    8.  handle task and send ipi to
>                                     vcpu 0
> 9. same to 4.                   10. same to 3
>
> One transaction will introduce about 12 vmexits (2 hlt and 10 msr write).
> The cost of such vmexits will degrade performance severely.

Yeah, sounds like too much ... I understood that there are

  IPI from 1 to 2
  4 * APIC timer
  IPI from 2 to 1

which adds to 6 MSR writes -- what are the other 4?

> Linux kernel
> already provides idle=poll to mitigate the trend. But it only eliminates the
> IPI and hlt vmexit. It has nothing to do with start/stop sched timer. A
> compromise would be to turn off NOHZ kernel, but it is not the default
> config for new distributions. Same for halt-poll in KVM, it only solves the
> cost from schedule in/out in host and can not help such workload much.
>
> The purpose of this patch we want to improve current idle=poll mechanism to

Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow down
the sibling hyperthread.  MWAIT solves the IPI problem, but doesn't get
rid of the timer one.

> use dynamic polling and do poll before touch sched timer. It should not be a
> virtualization specific feature but seems bare metal has low cost to
> access the MSR. So I want to only enable it in VM. Though the idea below the
> patch may not be so perfect to fit all conditions, it looks no worse than now.

It adds code to hot-paths (interrupt handlers) while trying to optimize an
idle-path, which is suspicious.

> How about we keep current implementation and I integrate the patch to
> para-virtualize part as Paolo suggested? We can continue to discuss it and I
> will continue to refine it if anyone has better suggestions?

I think there is a nicer solution to avoid the expensive timer rewrite:
Linux uses one-shot APIC timers and getting the timer interrupt is about
as expensive as programming the timer, so the guest can keep the timer
armed, but not re-arm it after the expiration if the CPU is idle.

This should also mitigate the problem with short idle periods, but the
optimized window is anywhere between 0 and 1ms.

Do you see disadvantages of this combined with MWAIT?

Thanks.
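
For illustration only, the timer idea (keep the one-shot timer armed
across idle, and skip the re-arm if it expires while the CPU is idle)
could be modelled in user space roughly as follows.  This is a sketch,
not kernel code: sim_timer, sim_program, sim_cancel and sim_run are
hypothetical names, each simulated timer write is assumed to stand for
one MSR-write vmexit in a guest, and time is in arbitrary units.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct sim_timer {
	bool armed;          /* one-shot timer currently programmed */
	uint64_t deadline;   /* expiry time, arbitrary units */
	unsigned int writes; /* simulated timer-register writes */
	unsigned int irqs;   /* simulated timer interrupts taken */
};

static void sim_program(struct sim_timer *t, uint64_t deadline)
{
	t->armed = true;
	t->deadline = deadline;
	t->writes++;
}

static void sim_cancel(struct sim_timer *t)
{
	t->armed = false;
	t->writes++;
}

/*
 * Run `rounds` cycles of (busy for busy_len, then idle for idle_len).
 *
 * lazy == false: cancel the timer on idle entry, re-program on idle exit.
 * lazy == true:  leave the timer armed on idle entry; if it expires while
 *                idle, take the interrupt but do not re-arm; re-program on
 *                idle exit only if the timer is no longer armed.
 */
static struct sim_timer sim_run(bool lazy, unsigned int rounds,
				uint64_t busy_len, uint64_t idle_len,
				uint64_t period)
{
	struct sim_timer t = {0};
	uint64_t now = 0;

	sim_program(&t, now + period); /* initial arm */

	for (unsigned int i = 0; i < rounds; i++) {
		/* Busy phase: an expiring tick is handled and re-armed. */
		uint64_t busy_end = now + busy_len;

		while (t.armed && t.deadline <= busy_end) {
			t.irqs++;
			sim_program(&t, t.deadline + period);
		}
		now = busy_end;

		/* Idle entry. */
		if (!lazy)
			sim_cancel(&t);

		/* Idle phase: only the lazy policy can see an expiry here. */
		uint64_t idle_end = now + idle_len;

		if (lazy && t.armed && t.deadline <= idle_end) {
			t.irqs++;
			t.armed = false; /* expired while idle: no re-arm */
		}
		now = idle_end;

		/* Idle exit: re-program only if nothing is pending. */
		if (!t.armed)
			sim_program(&t, now + period);
	}
	return t;
}
```

With 10 rounds of 10 units busy and 10 units idle and a period of 1000,
this model counts 21 writes for the cancel/re-program policy and 1 for
the keep-armed policy.  When a single idle period outlasts the timer,
the keep-armed policy pays one interrupt plus one write after the
initial arm, where the other policy pays two writes.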