From: Alex Bennée
To: Richard Henderson
Cc: peter.maydell@linaro.org, pbonzini@redhat.com, edgar.iglesias@xilinx.com, cota@braap.org, qemu-devel@nongnu.org, Peter Crosthwaite, "open list:ARM"
Subject: Re: [Qemu-devel] [RFC DEBUG PATCH 3/3] translate-a64: fix lookup_tb_ptr hang (DEBUG!)
Date: Sat, 10 Jun 2017 09:51:26 +0100
Message-ID: <87vao4b4z5.fsf@linaro.org>
References: <20170609170100.3599-1-alex.bennee@linaro.org> <20170609170100.3599-4-alex.bennee@linaro.org>

Richard Henderson writes:

> On 06/09/2017 10:01 AM, Alex Bennée wrote:
>> THIS IS A DEBUG PATCH DO NOT MERGE
>>
>> I include all the comments to show my working. I was trying to
>> isolate which instructions cause the problem. It turns out it is the
>> RET instruction. I don't understand why because AFAICT it is a
>> pretty much a BR instruction.
>
> Yeah, same thing for Alpha.
>
> It has been my guess that not chaining through RET means that we get
> back to the main loop regularly and often, letting interrupts be
> recognized in a timely manner.
>
> I can't figure out why that would be, however, since interrupts
> *ought* to be setting icount_decr, and the TB to which we chain *is*
> checking that to return to the main loop.

Indeed - if that was broken a lot more stuff wouldn't work.

> Since changing the timing affects the outcome (e.g. -d exec), it
> follows that this *must* be some sort of race condition. But since
> this still happens with single-threaded mode, I can't imagine what
> sort of race condition it might be.

Apart from timer expiry I can't think what other interactions the other
threads have on the main TCG thread. I guess there is IO but my test
hangs way before the kernel starts poking the disk. Is there an
interaction between IRQs and QEMU's serial driver?

>
> More data points. I removed the tb_htable_lookup, and that by itself
> is enough to fix Alpha booting. But it doesn't help the aarch64
> kernel+image that I have. Which does still boot with -d nochain
> (which, along with disabling goto_tb chaining, also disables all
> goto_ptr).

I wonder what is different about your aarch64 image and mine then?
Because mine works just with suppressing the chaining for RET.

>
> Not really sure where to go from here.

I would agree with Emilio that we revert but I can't quite shake the
feeling we are missing an underlying problem. Would just skipping the
htable lookup (but keeping the tb_jmp_cache) be an OK fix for now? Have
we just been lucky that whatever mechanism causes the "hang" wasn't due
to?

>
>
> r~

--
Alex Bennée
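
For reference, the expectation discussed in the thread (an interrupt sets icount_decr, and every chained TB checks it on entry and returns to the main loop) can be modelled with a toy sketch. This is not QEMU code: `ToyCPU`, `ToyTB`, `run_chained` and the `exit_request` flag are hypothetical names, with `exit_request` standing in for icount_decr going negative. It only shows why, under that assumption, even an endless chain of TBs should drop back to the main loop one TB after an interrupt is raised.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ToyCPU:
    # Hypothetical stand-in for icount_decr going negative on an interrupt.
    exit_request: bool = False
    executed: List[str] = field(default_factory=list)


@dataclass(eq=False)
class ToyTB:
    name: str
    next_tb: Optional["ToyTB"] = None  # successor this TB is chained to


def run_chained(cpu: ToyCPU, tb: Optional[ToyTB],
                interrupt_after: Optional[str] = None) -> Optional[ToyTB]:
    """Follow the chain without returning to the main loop until a TB's
    entry check sees the exit request. Returns the TB that was about to run."""
    while tb is not None:
        # Entry check: every TB tests the flag before doing any work.
        if cpu.exit_request:
            return tb
        cpu.executed.append(tb.name)
        # Model an interrupt arriving while this TB is executing.
        if interrupt_after == tb.name:
            cpu.exit_request = True
        tb = tb.next_tb
    return None


# Two TBs chained into an endless loop: a -> b -> a -> ...
a = ToyTB("a")
b = ToyTB("b", next_tb=a)
a.next_tb = b

cpu = ToyCPU()
pending = run_chained(cpu, a, interrupt_after="b")
print(cpu.executed)   # ['a', 'b']
print(pending.name)   # a
```

In this model the chain is left at the first TB entered after the flag is set, so a hang would need either the flag not being set or an entry check being skipped, which is the part the thread has not been able to pin down.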