Date: Wed, 27 Apr 2016 16:47:39 +0200
From: Andrea Arcangeli
To: "Li, Liang Z"
Cc: "Dr. David Alan Gilbert", "Kirill A. Shutemov", "kirill.shutemov@linux.intel.com", Amit Shah, "qemu-devel@nongnu.org", "quintela@redhat.com", "linux-mm@kvack.org"
Subject: Re: [Qemu-devel] post-copy is broken?
Message-ID: <20160427144739.GF10120@redhat.com>

Hello Liang,

On Mon, Apr 18, 2016 at 10:33:14AM +0000, Li, Liang Z wrote:
> If the THP is disabled, no fails.
> And your test was always passed, even when real post-copy was failed.
>
> In my env, the output of
> 'cat /sys/kernel/mm/transparent_hugepage/enabled' is:
>
> [always] ...

Can you test the fix?

https://marc.info/?l=linux-mm&m=146175869123580&w=2

This was not a breakage in userfaultfd or in postcopy. userfaultfd itself had no bugs: it is rock solid, with zero chance of generating the kind of undetected memory corruption that was happening in v4.5.
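Since the failure only reproduced with THP enabled, it helps to check which mode a host is actually in. The sysfs file quoted above lists all modes and brackets the active one, so "[always] madvise never" means THP is always on. A minimal Python sketch that parses it (the helper names `active_thp_mode` and `read_thp_mode` are hypothetical, not part of any tool):

```python
import re
from pathlib import Path

# Path taken from the quoted report above.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed (active) mode, e.g. "always" from "[always] madvise never".

    Returns None if no bracketed entry is present.
    """
    m = re.search(r"\[(\w+)\]", text)
    return m.group(1) if m else None


def read_thp_mode(path=THP_ENABLED):
    """Active THP mode on this host, or None if the sysfs file is absent."""
    try:
        return active_thp_mode(path.read_text())
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    print("THP enabled mode:", read_thp_mode())
```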
As I suspected, the same problem would have happened with any THP pmd_trans_huge split (swapping, inflating the balloon, etc.). Postcopy just makes it easier to reproduce, because just before starting the guest on the destination node it does a scattered MADV_DONTNEED on the destination QEMU's guest memory, covering the pages that were redirtied during the last precopy pass or that were never transferred (to allow THP faults in the destination QEMU during precopy).

Other reports of KVM memory corruption on v4.5 with THP enabled will also be taken care of by the above fix.

I hope I managed to fix this in time for the v4.6 final release (the current tree is v4.6-rc5-69), so that v4.5 will be the only kernel where KVM must not be used with THP enabled.

On a side note, this MADV_DONTNEED trigger reminded me that, as soon as the madvisev syscall is merged, loadvm_postcopy_ram_handle_discard should start using it to do the scattered MADV_DONTNEED zapping with a single kernel entry/exit (or a few madvisev calls, if we want to cap the temporary buffer to avoid the risk of allocating too much temporary RAM for very large guests). The same applies to virtio_balloon_handle_output.

Thanks,
Andrea
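To make the cost of the scattered MADV_DONTNEED zapping concrete, here is a minimal Python sketch, assuming Linux and Python 3.8+ (for `mmap.madvise`). Each (first_page, npages) range costs one madvise() call, i.e. one kernel entry/exit, which is exactly what a vectored madvisev would collapse into a single call. The 16-page buffer, the discard list, and the `discard_ranges` helper are hypothetical stand-ins for guest RAM and the postcopy discard messages; this is not QEMU's loadvm_postcopy_ram_handle_discard.

```python
import mmap

PAGE = mmap.PAGESIZE


def discard_ranges(mem, ranges):
    """Zap each (first_page, npages) range with MADV_DONTNEED.

    Returns the number of madvise() calls made: one kernel entry/exit
    per range, the overhead a vectored madvisev() would batch away.
    """
    calls = 0
    for first_page, npages in ranges:
        # On Linux, start must be page aligned, so work in whole pages.
        mem.madvise(mmap.MADV_DONTNEED, first_page * PAGE, npages * PAGE)
        calls += 1
    return calls


# Hypothetical 16-page "guest RAM". MAP_PRIVATE anonymous memory is used
# so that MADV_DONTNEED drops the pages and later reads see zero pages.
NPAGES = 16
guest = mmap.mmap(-1, NPAGES * PAGE, flags=mmap.MAP_PRIVATE)
guest.write(b"\xaa" * (NPAGES * PAGE))  # dirty every page

# Hypothetical scattered discard list (e.g. pages redirtied in the last pass).
discards = [(1, 1), (4, 2), (9, 1), (12, 3)]
calls = discard_ranges(guest, discards)
print("madvise calls:", calls)
```

The mapping is deliberately MAP_PRIVATE: with Python's default MAP_SHARED anonymous mapping, MADV_DONTNEED only drops the process's page-table entries, and later reads would fault the old contents back in from the shared backing.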