From: "Paul E. McKenney"
Subject: Re: [RFC] Make need_resched() return true when rcu_urgent_qs requested
Date: Tue, 17 Jul 2018 05:56:53 -0700
Message-ID: <20180717125653.GH12945@linux.vnet.ibm.com>
In-Reply-To: <1531815548.19223.23.camel@infradead.org>
Reply-To: paulmck@linux.vnet.ibm.com
To: David Woodhouse
Cc: Christian Borntraeger, Peter Zijlstra, mhillenb@amazon.de, linux-kernel, kvm

On Tue, Jul 17, 2018 at 10:19:08AM +0200, David Woodhouse wrote:
> On Mon, 2018-07-16 at 08:40 -0700, Paul E. McKenney wrote:
> > Most of the weekend was devoted to testing today's upcoming pull request,
> > but I did get a bit more testing done on this.
> >
> > I was able to make this happen more often by tweaking rcutorture a
> > bit, but I still do not yet have statistically significant results.
> > Nevertheless, I have thus far only seen failures with David's patch or
> > with both David's and my patch. And I actually got a full-up rcutorture
> > failure (a too-short grace period) in addition to the aforementioned
> > close calls.
> >
> > Over this coming week I expect to devote significant testing time to
> > the commit just prior to David's in my stack.
> > If I don't see failures on that commit, we will need to spend some
> > quality time with the KVM folks on whether or not kvm_x86_ops->run()
> > and friends have the option of failing to return, but instead causing
> > control to pop up somewhere else.
> > Or someone could tell me how I am being blind to some obvious bug in
> > the two commits that allow RCU to treat KVM guest-OS execution as an
> > extended quiescent state. ;-)
>
> One thing we can try, if my patch is implicated, is moving the calls to
> rcu_kvm_en{ter,xit} closer to the actual VM entry. Let's try putting
> them around the large asm block in arch/x86/kvm/vmx.c::vmx_vcpu_run()
> for example. If that fixes it, then we know we've missed something else
> interesting that's happening in the middle.

I don't have enough data to say anything with too much certainty, but
my patch has rcu_kvm_en{ter,xit}() quite a bit farther apart than yours
does, and I am not seeing massive increases in error rate in my patch
compared to yours. Which again might or might not mean anything. Plus
I haven't proven that your patch isn't an innocent bystander yet. If it
isn't just an innocent bystander, that will take most of this week to
demonstrate given current failure rates. I am also working on improving
rcutorture diagnostics, which should help me work out how to change
rcutorture so as to find this more quickly.

> Testing on Skylake shows a guest CPUID goes from ~3000 cycles to ~3500
> with this patch, so in the next iteration it definitely needs to be
> ifdef CONFIG_NO_HZ_FULL anyway, because it's actually required there
> (AFAICT) and it's too expensive otherwise as Christian pointed out.

Makes sense!

                                                        Thanx, Paul