From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:37849)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <dgilbert@redhat.com>) id 1aqeGt-0005Xz-8i
	for qemu-devel@nongnu.org; Thu, 14 Apr 2016 06:13:12 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <dgilbert@redhat.com>) id 1aqeGq-0004dC-Ho
	for qemu-devel@nongnu.org; Thu, 14 Apr 2016 06:13:11 -0400
Received: from mx1.redhat.com ([209.132.183.28]:57392)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <dgilbert@redhat.com>) id 1aqeGq-0004d8-CJ
	for qemu-devel@nongnu.org; Thu, 14 Apr 2016 06:13:08 -0400
Date: Thu, 14 Apr 2016 11:13:03 +0100
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Message-ID: <20160414101302.GC2252@work-vm>
References: <F2CBF3009FA73547804AE4C663CAB28E0417E6B1@shsmsx102.ccr.corp.intel.com>
	<20160412175501.GB6415@work-vm>
	<F2CBF3009FA73547804AE4C663CAB28E0417EE92@shsmsx102.ccr.corp.intel.com>
	<F2CBF3009FA73547804AE4C663CAB28E0417EEE4@shsmsx102.ccr.corp.intel.com>
	<20160413080545.GA2270@work-vm> <20160413114103.GB2270@work-vm>
	<20160413125053.GC2270@work-vm> <20160413205132.GG26364@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160413205132.GG26364@redhat.com>
Subject: Re: [Qemu-devel] post-copy is broken?
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Li, Liang Z" <liang.z.li@intel.com>, Amit Shah <amit.shah@redhat.com>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, "quintela@redhat.com" <quintela@redhat.com>

* Andrea Arcangeli (aarcange@redhat.com) wrote:
> On Wed, Apr 13, 2016 at 01:50:53PM +0100, Dr. David Alan Gilbert wrote:
> > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > 
> > > +            if ( ((b + 1) % 255) == last_byte && !hit_edge) {
> > 
> > Ahem, that should be 256.
> > 
> > I'm going to bisect the kernel and see where we get to.
> > Andrea's userfaultfd self-test passes on 2.5, so it's something more
> > subtle.
> > 
> 
> David already tracked down 1df59b8497f47495e873c23abd6d3d290c730505
> good and 984065055e6e39f8dd812529e11922374bd39352 bad.
> 
> git diff 1df59b8497f47495e873c23abd6d3d290c730505..984065055e6e39f8dd812529e11922374bd39352 fs/userfaultfd.c mm/userfaultfd.c
> 
> Nothing that could break it in the diff of the relevant two files.
> 
> The only other userfault related change in this commit range that
> comes to mind is in fixup_user_fault, but if that was buggy you don't
> userfault into futexes with postcopy so you couldn't notice, so the
> only other user of that is s390.
> 
> The next suspect is the massive THP refcounting change that went
> upstream recently:

...

> As further debug hint, can you try to disable THP and see if that
> makes the problem go away?

Yeh, looks like it is THP.
My bisect is currently at 17ec4cd985780a7e30aa45bb8f272237c12502a4.
With that kernel it fails from a fresh boot; if I disable THP it works,
and if I set THP back to madvise it fails again.
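
When recording which THP mode a given test run used, it helps to read
the mode back rather than trust what was last echoed. A minimal sketch
in C, assuming the usual sysfs file
/sys/kernel/mm/transparent_hugepage/enabled, whose contents look like
"always [madvise] never" with the active mode in brackets. The helper
names thp_active_mode() and thp_read_mode() are made up for
illustration; they are not part of QEMU or the kernel.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (active) token from a line such as
 * "always [madvise] never" into out.
 * Returns 0 on success, -1 if no bracketed token fits in out.
 */
int thp_active_mode(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    size_t len;

    if (!open || !close || outlen == 0) {
        return -1;
    }
    len = (size_t)(close - open - 1);
    if (len >= outlen) {
        return -1;          /* token plus NUL would not fit */
    }
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/*
 * Read the current THP mode from sysfs (path is an assumption about
 * the running kernel; it is absent if THP is not built in).
 * Returns 0 on success, -1 on any failure.
 */
int thp_read_mode(char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");

    if (!f) {
        return -1;
    }
    if (!fgets(line, sizeof(line), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return thp_active_mode(line, out, outlen);
}
```

Calling thp_read_mode() before each migration test and logging the
result alongside the pass/fail outcome would make the never/madvise
comparison above easy to reproduce.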

I also spotted that at my previous bisect point it failed before I'd done
the next kernel build but passed after I'd done the build (and before
I rebooted!) - so I guess after the build it couldn't find any free
memory to make THPs from.

Dave

> 
> Thanks,
> Andrea
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK