From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:36439)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <kirill@shutemov.name>) id 1ar5ar-0000xi-Ag
	for qemu-devel@nongnu.org; Fri, 15 Apr 2016 11:23:38 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <kirill@shutemov.name>) id 1ar5ao-0007Bk-4G
	for qemu-devel@nongnu.org; Fri, 15 Apr 2016 11:23:37 -0400
Received: from mail-wm0-x22e.google.com ([2a00:1450:400c:c09::22e]:38663)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <kirill@shutemov.name>) id 1ar5an-0007Bc-Q7
	for qemu-devel@nongnu.org; Fri, 15 Apr 2016 11:23:34 -0400
Received: by mail-wm0-x22e.google.com with SMTP id u206so37329427wme.1
	for <qemu-devel@nongnu.org>; Fri, 15 Apr 2016 08:23:33 -0700 (PDT)
Date: Fri, 15 Apr 2016 18:23:30 +0300
From: "Kirill A. Shutemov" <kirill@shutemov.name>
Message-ID: <20160415152330.GB3376@node.shutemov.name>
References: <F2CBF3009FA73547804AE4C663CAB28E0417EE92@shsmsx102.ccr.corp.intel.com>
	<F2CBF3009FA73547804AE4C663CAB28E0417EEE4@shsmsx102.ccr.corp.intel.com>
	<20160413080545.GA2270@work-vm> <20160413114103.GB2270@work-vm>
	<20160413125053.GC2270@work-vm> <20160413205132.GG26364@redhat.com>
	<20160414123441.GF2252@work-vm> <20160414162230.GC9976@redhat.com>
	<20160415125236.GA3376@node.shutemov.name>
	<20160415134233.GG2229@work-vm>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160415134233.GG2229@work-vm>
Subject: Re: [Qemu-devel] post-copy is broken?
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>, kirill.shutemov@linux.intel.com, "Li, Liang Z" <liang.z.li@intel.com>, Amit Shah <amit.shah@redhat.com>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, "quintela@redhat.com" <quintela@redhat.com>, linux-mm@kvack.org

On Fri, Apr 15, 2016 at 02:42:33PM +0100, Dr. David Alan Gilbert wrote:
> * Kirill A. Shutemov (kirill@shutemov.name) wrote:
> > On Thu, Apr 14, 2016 at 12:22:30PM -0400, Andrea Arcangeli wrote:
> > > Adding linux-mm too,
> > > 
> > > On Thu, Apr 14, 2016 at 01:34:41PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Andrea Arcangeli (aarcange@redhat.com) wrote:
> > > > 
> > > > > The next suspect is the massive THP refcounting change that went
> > > > > upstream recently:
> > > > 
> > > > > As further debug hint, can you try to disable THP and see if that
> > > > > makes the problem go away?
> > > > 
> > > > Yep, this seems to be the problem (cc'ing in Kirill).
> > > > 
> > > > 122afea9626ab3f717b250a8dd3d5ebf57cdb56c - works (just before Kirill disables THP)
> > > > 61f5d698cc97600e813ca5cf8e449b1ea1c11492 - breaks (when THP is reenabled)
> > > > 
> > > > It's pretty reliable; as you say disabling THP makes it work again
> > > > and putting it back to THP/madvise mode makes it break.  And you need
> > > > to test on a machine with some free ram to make sure THP has a chance
> > > > to have happened.
> > > > 
> > > > I'm not sure of all of the rework that happened in that series,
> > > > but my reading of it is that splitting of THP pages gets deferred;
> > > > so I wonder if when I do the madvise to turn THP off, if it's actually
> > > > still got THP pages and thus we end up with a whole THP mapped
> > > > when I'm expecting to be userfaulting those pages.
> > > 
> > > Good thing at least I didn't make UFFDIO_COPY THP aware yet so there's
> > > less variables (as no user was interested to handle userfaults at THP
> > > granularity yet, and from userland such an improvement would be
> > > completely invisible in terms of API, so if an user starts doing that
> > > we can just optimize the kernel for it, criu restore could do that as
> > > the faults will come from disk-I/O, when network is involved THP
> > > userfaults wouldn't have a great tradeoff with regard to the increased
> > > fault latency).
> > > 
> > > I suspect there is a handle_userfault missing somewhere in connection
> > > with trans_huge_pmd splits (not anymore THP splits) that you're doing
> > > with MADV_DONTNEED to zap those pages in the destination that got
> > > redirtied in source during the last precopy stage. Or more simply
> > > MADV_DONTNEED isn't zapping all the right ptes after the trans huge
> > > pmd got split.
> > > 
> > > The fact the page isn't split shouldn't matter too much, all we care
> > > about is the pte triggers handle_userfault after MADV_DONTNEED.
> > > 
> > > The userfaultfd testcase in the kernel isn't exercising this case
> > > unfortunately, that should probably be improved too, so there is a
> > > simpler way to reproduce than running precopy before postcopy in qemu.
> > 
> > I've tested current Linus' tree and v4.5 using qemu postcopy test case for
> > both x86-64 and i386 and it never failed for me:
> > 
> > /x86_64/postcopy: first_byte = 7e last_byte = 7d hit_edge = 1 OK
> > OK
> > /i386/postcopy: first_byte = f6 last_byte = f5 hit_edge = 1 OK
> > OK
> > 
> > I've run it directly, setting relevant QTEST_QEMU_BINARY.
> 
> Interesting; it's failing reliably for me - but only with a reasonably
> freshly booted machine (so that the pages get THPd).

The same here: a freshly booted machine with 64GiB of RAM. I've checked
/proc/vmstat: huge pages were allocated.
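
For anyone wanting to repeat that check, here is a minimal sketch in
Python. The helper names (thp_counters, thp_was_allocated) are made up
for illustration, and it assumes that a nonzero thp_fault_alloc or
thp_collapse_alloc in /proc/vmstat is enough evidence that THPs were
allocated at some point since boot:

```python
def thp_counters(vmstat_text):
    """Parse /proc/vmstat-style text and return only the thp_* counters."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        # Each /proc/vmstat line has the form "<name> <value>".
        if len(parts) == 2 and parts[0].startswith("thp_"):
            counters[parts[0]] = int(parts[1])
    return counters


def thp_was_allocated(counters):
    """Assumption: a nonzero fault or collapse allocation count means
    huge pages were allocated at some point since boot."""
    return (counters.get("thp_fault_alloc", 0) > 0
            or counters.get("thp_collapse_alloc", 0) > 0)


if __name__ == "__main__":
    with open("/proc/vmstat") as f:
        counters = thp_counters(f.read())
    for name in sorted(counters):
        print(name, counters[name])
    print("THP allocated since boot:", thp_was_allocated(counters))
```

These counters are system-wide and cumulative, so they only show that
THP allocation happened somewhere, not that the qemu process itself is
backed by huge pages. The AnonHugePages lines in /proc/<pid>/smaps give
the per-process view.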

-- 
 Kirill A. Shutemov