From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 14 Apr 2016 11:13:03 +0100
From: "Dr. David Alan Gilbert"
Message-ID: <20160414101302.GC2252@work-vm>
References: <20160412175501.GB6415@work-vm> <20160413080545.GA2270@work-vm> <20160413114103.GB2270@work-vm> <20160413125053.GC2270@work-vm> <20160413205132.GG26364@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160413205132.GG26364@redhat.com>
Subject: Re: [Qemu-devel] post-copy is broken?
To: Andrea Arcangeli
Cc: "Li, Liang Z", Amit Shah, "qemu-devel@nongnu.org", "quintela@redhat.com"

* Andrea Arcangeli (aarcange@redhat.com) wrote:
> On Wed, Apr 13, 2016 at 01:50:53PM +0100, Dr. David Alan Gilbert wrote:
> > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> >
> > > +        if ( ((b + 1) % 255) == last_byte && !hit_edge) {
> > Ahem, that should be 256.
> > I'm going to bisect the kernel and see where we get to.
> > Andrea's userfaultfd self-test passes on 2.5, so it's something more
> > subtle.
> >
>
> David already tracked down 1df59b8497f47495e873c23abd6d3d290c730505
> good and 984065055e6e39f8dd812529e11922374bd39352 bad.
>
> git diff 1df59b8497f47495e873c23abd6d3d290c730505..984065055e6e39f8dd812529e11922374bd39352 fs/userfaultfd.c mm/userfaultfd.c
>
> Nothing that could break it in the diff of the relevant two files.
>
> The only other userfault related change in this commit range that
> comes to mind is in fixup_user_fault, but if that was buggy you don't
> userfault into futexes with postcopy so you couldn't notice, so the
> only other user of that is s390.
>
> The next suspect is the massive THP refcounting change that went
> upstream recently:
...
> As further debug hint, can you try to disable THP and see if that
> makes the problem go away?

Yeh, looks like it is THP.
My bisect is currently at 17ec4cd985780a7e30aa45bb8f272237c12502a4.
With that, from a fresh boot it fails; if I disable THP it works,
and if I re-enable THP back to madvise it fails.
I spotted that at my previous bisect point it failed before I'd done
the next kernel build but worked after I'd done the build (but before
I rebooted!) - so I guess after the build it couldn't find any THPs to use.

Dave

>
> Thanks,
> Andrea
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
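
For anyone repeating the THP on/off test above, it helps to record which THP mode was active for each run. The following is only a minimal sketch, assuming the usual sysfs format where the active mode is shown in brackets (for example "always [madvise] never"). The helper names thp_active_mode and thp_read_mode are made up for illustration; they are not from QEMU or the kernel.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the bracketed (active) mode out of a line
 * such as "always [madvise] never".
 * Returns 0 on success, -1 if no complete [mode] is present or the
 * mode does not fit in out. */
int thp_active_mode(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    size_t len;

    if (!open || !close)
        return -1;

    len = (size_t)(close - open - 1);
    if (len + 1 > outlen)       /* need room for the trailing NUL */
        return -1;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/* Hypothetical helper: read the first line of the given file and
 * extract the active mode from it. Returns 0 on success, -1 on any
 * I/O or parse failure. */
int thp_read_mode(const char *path, char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(line, sizeof(line), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return thp_active_mode(line, out, outlen);
}
```

Calling thp_read_mode("/sys/kernel/mm/transparent_hugepage/enabled", buf, sizeof(buf)) before each migration attempt would let the result be logged next to "always", "madvise" or "never". Note that it says nothing about whether any huge pages were actually allocated, which is the effect suspected after the kernel build.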