* [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away [not found] <200906221549.n5MFn3Qd015389@d03av02.boulder.ibm.com> @ 2009-06-22 16:12 ` Avi Kivity 2009-06-22 16:25 ` Anthony Liguori 0 siblings, 1 reply; 13+ messages in thread From: Avi Kivity @ 2009-06-22 16:12 UTC (permalink / raw) To: Anthony Liguori; +Cc: qemu-devel On 06/22/2009 06:51 PM, Anthony Liguori wrote: > From: Anthony Liguori<aliguori@us.ibm.com> > > Otherwise, after migration, we end up with a much larger RSS size then we > ought to have. > > We have the same issue on the migration source node. I don't see a simple way to solve it, though. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away 2009-06-22 16:12 ` [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away Avi Kivity @ 2009-06-22 16:25 ` Anthony Liguori 2009-06-22 16:38 ` Avi Kivity 0 siblings, 1 reply; 13+ messages in thread From: Anthony Liguori @ 2009-06-22 16:25 UTC (permalink / raw) To: Avi Kivity; +Cc: Anthony Liguori, qemu-devel Avi Kivity wrote: > On 06/22/2009 06:51 PM, Anthony Liguori wrote: >> From: Anthony Liguori<aliguori@us.ibm.com> >> >> Otherwise, after migration, we end up with a much larger RSS size >> then we >> ought to have. >> >> > > We have the same issue on the migration source node. I don't see a > simple way to solve it, though. I don't follow. In this case, the issue is: 1) Start a guest with 1024, balloon down to 128MB. RSS size is now ~128MB 2) Live migrate to a different node 3) RSS on different node jumps to ~1GB 4) Weep at all your lost memory Xen had a similar issue. This ends up biting people who overcommit their VMs via ballooning, live migration, and badness ensues. At least for us, the error is swapping but madvise also avoids the issue by never consuming that memory to begin with. Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away 2009-06-22 16:25 ` Anthony Liguori @ 2009-06-22 16:38 ` Avi Kivity 2009-06-22 16:58 ` Anthony Liguori 2009-06-22 17:03 ` Anthony Liguori 0 siblings, 2 replies; 13+ messages in thread From: Avi Kivity @ 2009-06-22 16:38 UTC (permalink / raw) To: Anthony Liguori; +Cc: Anthony Liguori, qemu-devel On 06/22/2009 07:25 PM, Anthony Liguori wrote: > Avi Kivity wrote: >> On 06/22/2009 06:51 PM, Anthony Liguori wrote: >>> From: Anthony Liguori<aliguori@us.ibm.com> >>> >>> Otherwise, after migration, we end up with a much larger RSS size >>> then we >>> ought to have. >>> >> >> We have the same issue on the migration source node. I don't see a >> simple way to solve it, though. > > I don't follow. In this case, the issue is: > > 1) Start a guest with 1024, balloon down to 128MB. RSS size is now > ~128MB > 2) Live migrate to a different node > 3) RSS on different node jumps to ~1GB 3.5) RSS on source node jumps to ~1GB, since reading the page instantiates the pte > 4) Weep at all your lost memory 4.5) And at the swapping going on in the source node > > Xen had a similar issue. This ends up biting people who overcommit > their VMs via ballooning, live migration, and badness ensues. At > least for us, the error is swapping but madvise also avoids the issue > by never consuming that memory to begin with. Right. I'd love to do madvise() on the source node as well if we fault in a page and find out it's zero, but the guest (and aio) is still running and we might drop live data. We need a madvise(MADV_DONTNEED_IFZERO), or a mincore() flag that tells us if the page exists (vs. swapped). ksm would also do this, but it is overkill for some applications. Note that the patch contains a small bug -- the kernel is allowed to ignore the advise according to the manual page, so it's better to memset() the memory before dropping it. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away 2009-06-22 16:38 ` Avi Kivity @ 2009-06-22 16:58 ` Anthony Liguori 2009-06-22 17:12 ` Avi Kivity 2009-06-22 17:03 ` Anthony Liguori 1 sibling, 1 reply; 13+ messages in thread From: Anthony Liguori @ 2009-06-22 16:58 UTC (permalink / raw) To: Avi Kivity; +Cc: qemu-devel Avi Kivity wrote: > On 06/22/2009 07:25 PM, Anthony Liguori wrote: > >> >> Xen had a similar issue. This ends up biting people who overcommit >> their VMs via ballooning, live migration, and badness ensues. At >> least for us, the error is swapping but madvise also avoids the issue >> by never consuming that memory to begin with. > > Right. I'd love to do madvise() on the source node as well if we > fault in a page and find out it's zero, but the guest (and aio) is > still running and we might drop live data. We need a > madvise(MADV_DONTNEED_IFZERO), or a mincore() flag that tells us if > the page exists (vs. swapped). ksm would also do this, but it is > overkill for some applications. > > Note that the patch contains a small bug -- the kernel is allowed to > ignore the advise according to the manual page, so it's better to > memset() the memory before dropping it. Hrm, that's not quite how I interpreted the man page. "This call does not influence the semantics of the application (except in the case of MADV_DONTNEED), but may influence its performance. The kernel is free to ignore the advice." MADV_DONTNEED is called out as changing the application semantics. Specifically, I think the kernel has to zero-fill even if it choose to ignore the advice. I limited the guard to Linux specifically because I was unsure about that behavior but it would be good to clarify if anyone knows how. -- Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away 2009-06-22 16:58 ` Anthony Liguori @ 2009-06-22 17:12 ` Avi Kivity 0 siblings, 0 replies; 13+ messages in thread From: Avi Kivity @ 2009-06-22 17:12 UTC (permalink / raw) To: Anthony Liguori; +Cc: qemu-devel On 06/22/2009 07:58 PM, Anthony Liguori wrote: >> Note that the patch contains a small bug -- the kernel is allowed to >> ignore the advise according to the manual page, so it's better to >> memset() the memory before dropping it. > > > Hrm, that's not quite how I interpreted the man page. > > "This call > does not influence the semantics of the application (except in the case > of MADV_DONTNEED), but may influence its performance. The kernel is > free to ignore the advice." > > MADV_DONTNEED is called out as changing the application semantics. > Specifically, I think the kernel has to zero-fill even if it choose to > ignore the advice. > > I limited the guard to Linux specifically because I was unsure about > that behavior but it would be good to clarify if anyone knows how. This is not posix (there is a POSIX_MADV_DONTNEED which appears to be non-destructive (and a better complement to the other advices)), so the only references are the manual page and the code. I don't see Linux ever avoiding brake brake brake the kernel certainly can ignore the advice: static long madvise_dontneed(struct vm_area_struct * vma, struct vm_area_struct ** prev, unsigned long start, unsigned long end) { *prev = vma; if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP)) return -EINVAL; if (unlikely(vma->vm_flags & VM_NONLINEAR)) { struct zap_details details = { .nonlinear_vma = vma, .last_index = ULONG_MAX, }; zap_page_range(vma, start, end - start, &details); } else zap_page_range(vma, start, end - start, NULL); return 0; } it won't do it silently, but we don't check the return code either. Let's take the safe path on this and zero the page. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away 2009-06-22 16:38 ` Avi Kivity 2009-06-22 16:58 ` Anthony Liguori @ 2009-06-22 17:03 ` Anthony Liguori 2009-06-22 17:20 ` Avi Kivity 2009-06-22 19:38 ` Paul Brook 1 sibling, 2 replies; 13+ messages in thread From: Anthony Liguori @ 2009-06-22 17:03 UTC (permalink / raw) To: Avi Kivity; +Cc: qemu-devel Avi Kivity wrote: > On 06/22/2009 07:25 PM, Anthony Liguori wrote: >> Avi Kivity wrote: >>> On 06/22/2009 06:51 PM, Anthony Liguori wrote: >>>> From: Anthony Liguori<aliguori@us.ibm.com> >>>> >>>> Otherwise, after migration, we end up with a much larger RSS size >>>> then we >>>> ought to have. >>>> >>> >>> We have the same issue on the migration source node. I don't see a >>> simple way to solve it, though. >> >> I don't follow. In this case, the issue is: >> >> 1) Start a guest with 1024, balloon down to 128MB. RSS size is now >> ~128MB >> 2) Live migrate to a different node >> 3) RSS on different node jumps to ~1GB > > 3.5) RSS on source node jumps to ~1GB, since reading the page > instantiates the pte Surely we can do better here... For TCG, we always know when memory is dirty and we can check it atomically. So we know whether a page has changed since we knew it was last zero. We basically need a ZERO_DIRTY bit. All memory initially carries this bit and ballooning also sets the bit. During live migration, we can check the dirty bit first. For KVM, we would have to enable dirty tracking always to keep ZERO_DIRTY up to date. Since write faults are going to happen anyway at start up, perhaps the cost of doing this wouldn't be so bad? > > Right. I'd love to do madvise() on the source node as well if we > fault in a page and find out it's zero, but the guest (and aio) is > still running and we might drop live data. We need a > madvise(MADV_DONTNEED_IFZERO), or a mincore() flag that tells us if > the page exists (vs. swapped). ksm would also do this, but it is > overkill for some applications. For KVM, we could just have an KVM_IOCTL_MADVISE_IF_NOT_DIRTY, but that's a bad solution. That's more or less the desired semantics though. -- Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away 2009-06-22 17:03 ` Anthony Liguori @ 2009-06-22 17:20 ` Avi Kivity 2009-06-22 17:37 ` Anthony Liguori 2009-06-22 17:44 ` Anthony Liguori 2009-06-22 19:38 ` Paul Brook 1 sibling, 2 replies; 13+ messages in thread From: Avi Kivity @ 2009-06-22 17:20 UTC (permalink / raw) To: Anthony Liguori; +Cc: qemu-devel On 06/22/2009 08:03 PM, Anthony Liguori wrote: >>> >>> 1) Start a guest with 1024, balloon down to 128MB. RSS size is now >>> ~128MB >>> 2) Live migrate to a different node >>> 3) RSS on different node jumps to ~1GB >> >> 3.5) RSS on source node jumps to ~1GB, since reading the page >> instantiates the pte > I don't follow. In this case, the issue is: > > Surely we can do better here... > > For TCG, we always know when memory is dirty and we can check it > atomically. So we know whether a page has changed since we knew it > was last zero. We basically need a ZERO_DIRTY bit. All memory > initially carries this bit and ballooning also sets the bit. During > live migration, we can check the dirty bit first. You mean, a NONZERO bit which is cleared by ballooning and set on any write. This will work naturally with the qemu dirty bytemap. > > For KVM, we would have to enable dirty tracking always to keep > ZERO_DIRTY up to date. Since write faults are going to happen anyway > at start up, perhaps the cost of doing this wouldn't be so bad? You need to do this on the source node. Unfortunately, there's no way to initialize the values racelessly when you start live migration without introducing a new ioctl. I'd like a more general solution rather than something that targets this specific problem. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away 2009-06-22 17:20 ` Avi Kivity @ 2009-06-22 17:37 ` Anthony Liguori 2009-06-22 18:01 ` Avi Kivity 2009-06-22 17:44 ` Anthony Liguori 1 sibling, 1 reply; 13+ messages in thread From: Anthony Liguori @ 2009-06-22 17:37 UTC (permalink / raw) To: Avi Kivity; +Cc: qemu-devel Avi Kivity wrote: > You mean, a NONZERO bit which is cleared by ballooning and set on any > write. This will work naturally with the qemu dirty bytemap. Yes. >> >> For KVM, we would have to enable dirty tracking always to keep >> ZERO_DIRTY up to date. Since write faults are going to happen anyway >> at start up, perhaps the cost of doing this wouldn't be so bad? > > You need to do this on the source node. Unfortunately, there's no way > to initialize the values racelessly when you start live migration > without introducing a new ioctl. I'd like a more general solution > rather than something that targets this specific problem. I'm saying, always enable dirty tracking from start-of-day with KVM. Then the QEMU dirty bitmap is always accurate. The trick is to never start resetting it until you need to do live migration. The idea being that once the dirty bits have been set, the overhead (hopefully) should be zero. -- Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away 2009-06-22 17:37 ` Anthony Liguori @ 2009-06-22 18:01 ` Avi Kivity 0 siblings, 0 replies; 13+ messages in thread From: Avi Kivity @ 2009-06-22 18:01 UTC (permalink / raw) To: Anthony Liguori; +Cc: qemu-devel On 06/22/2009 08:37 PM, Anthony Liguori wrote: > I'm saying, always enable dirty tracking from start-of-day with KVM. > Then the QEMU dirty bitmap is always accurate. The trick is to never > start resetting it until you need to do live migration. That disables large pages and incurs a performance penalty (needlessly setting bits). > The idea being that once the dirty bits have been set, the overhead > (hopefully) should be zero. > It's not zero. You also need to clear the bits somehow when you balloon. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away 2009-06-22 17:20 ` Avi Kivity 2009-06-22 17:37 ` Anthony Liguori @ 2009-06-22 17:44 ` Anthony Liguori 2009-06-22 18:04 ` Avi Kivity 1 sibling, 1 reply; 13+ messages in thread From: Anthony Liguori @ 2009-06-22 17:44 UTC (permalink / raw) To: Avi Kivity; +Cc: qemu-devel [-- Attachment #1: Type: text/plain, Size: 46 bytes --] See attached. -- Regards, Anthony Liguori [-- Attachment #2: zero-mem-madvise.patch --] [-- Type: text/x-patch, Size: 1243 bytes --] commit a70787dc8200dad03f3d3a85fdb40497f336f12b Author: Anthony Liguori <aliguori@us.ibm.com> Date: Mon Jun 22 12:39:00 2009 -0500 Make sure to zero out memory before calling madvise to increase robustness Avi pointed out that it's not entirely safe to rely on madvise zeroing out memory. So let's do it explicitly before calling madvise. Signed-off-by: Anthony Liguori <aliguori@us.ibm.com> diff --git a/vl.c b/vl.c index 60a00e1..1c077b4 100644 --- a/vl.c +++ b/vl.c @@ -3358,13 +3358,13 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id) if (flags & RAM_SAVE_FLAG_COMPRESS) { uint8_t ch = qemu_get_byte(f); -#if defined(__linux__) + memset(qemu_get_ram_ptr(addr), ch, TARGET_PAGE_SIZE); +#ifndef _WIN32 if (ch == 0 && (!kvm_enabled() || kvm_has_sync_mmu())) { madvise(qemu_get_ram_ptr(addr), TARGET_PAGE_SIZE, MADV_DONTNEED); - } else + } #endif - memset(qemu_get_ram_ptr(addr), ch, TARGET_PAGE_SIZE); } else if (flags & RAM_SAVE_FLAG_PAGE) qemu_get_buffer(f, qemu_get_ram_ptr(addr), TARGET_PAGE_SIZE); } while (!(flags & RAM_SAVE_FLAG_EOS)); ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away 2009-06-22 17:44 ` Anthony Liguori @ 2009-06-22 18:04 ` Avi Kivity 0 siblings, 0 replies; 13+ messages in thread From: Avi Kivity @ 2009-06-22 18:04 UTC (permalink / raw) To: Anthony Liguori; +Cc: qemu-devel On 06/22/2009 08:44 PM, Anthony Liguori wrote: > See attached. > Author: Anthony Liguori<aliguori@us.ibm.com> > Date: Mon Jun 22 12:39:00 2009 -0500 > > Make sure to zero out memory before calling madvise to increase robustness > > Avi pointed out that it's not entirely safe to rely on madvise zeroing out > memory. So let's do it explicitly before calling madvise. > > Signed-off-by: Anthony Liguori<aliguori@us.ibm.com> > > diff --git a/vl.c b/vl.c > index 60a00e1..1c077b4 100644 > --- a/vl.c > +++ b/vl.c > @@ -3358,13 +3358,13 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id) > > if (flags& RAM_SAVE_FLAG_COMPRESS) { > uint8_t ch = qemu_get_byte(f); > -#if defined(__linux__) > + memset(qemu_get_ram_ptr(addr), ch, TARGET_PAGE_SIZE); > +#ifndef _WIN32 > if (ch == 0&& > (!kvm_enabled() || kvm_has_sync_mmu())) { > madvise(qemu_get_ram_ptr(addr), TARGET_PAGE_SIZE, MADV_DONTNEED); > - } else > + } > #endif > - memset(qemu_get_ram_ptr(addr), ch, TARGET_PAGE_SIZE); > } else if (flags& RAM_SAVE_FLAG_PAGE) > qemu_get_buffer(f, qemu_get_ram_ptr(addr), TARGET_PAGE_SIZE); > } while (!(flags& RAM_SAVE_FLAG_EOS)); > Pretty similar to my December patch... which had another case, is it missing? http://article.gmane.org/gmane.comp.emulators.qemu/34523 -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away 2009-06-22 17:03 ` Anthony Liguori 2009-06-22 17:20 ` Avi Kivity @ 2009-06-22 19:38 ` Paul Brook 2009-06-22 19:49 ` Anthony Liguori 1 sibling, 1 reply; 13+ messages in thread From: Paul Brook @ 2009-06-22 19:38 UTC (permalink / raw) To: qemu-devel; +Cc: Anthony Liguori, Avi Kivity > >>> We have the same issue on the migration source node. I don't see a > >>> simple way to solve it, though. > >> > >> I don't follow. In this case, the issue is: > >> > >> 1) Start a guest with 1024, balloon down to 128MB. RSS size is now > >> ~128MB > >> 2) Live migrate to a different node > >> 3) RSS on different node jumps to ~1GB > > > > 3.5) RSS on source node jumps to ~1GB, since reading the page > > instantiates the pte > > Surely we can do better here... Huh, I'm surprised we have to. I was expecting linux to populate the region with CoW mappings of a single zero page, but apparently not. I guess reading a zero page without writing to it is a fairly rare case, and it's usually better to instantiate a writable page on the first access. Paul ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away 2009-06-22 19:38 ` Paul Brook @ 2009-06-22 19:49 ` Anthony Liguori 0 siblings, 0 replies; 13+ messages in thread From: Anthony Liguori @ 2009-06-22 19:49 UTC (permalink / raw) To: Paul Brook; +Cc: qemu-devel, Avi Kivity Paul Brook wrote: >>>>> We have the same issue on the migration source node. I don't see a >>>>> simple way to solve it, though. >>>>> >>>> I don't follow. In this case, the issue is: >>>> >>>> 1) Start a guest with 1024, balloon down to 128MB. RSS size is now >>>> ~128MB >>>> 2) Live migrate to a different node >>>> 3) RSS on different node jumps to ~1GB >>>> >>> 3.5) RSS on source node jumps to ~1GB, since reading the page >>> instantiates the pte >>> >> Surely we can do better here... >> > > Huh, I'm surprised we have to. I was expecting linux to populate the region > with CoW mappings of a single zero page, but apparently not. I guess reading a > zero page without writing to it is a fairly rare case, and it's usually better > to instantiate a writable page on the first access. > I did too. If you mmap(/dev/zero) you'll get that behavior but not with a normal allocation. Regards, Anthony Liguori > Paul > -- Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2009-06-22 19:49 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <200906221549.n5MFn3Qd015389@d03av02.boulder.ibm.com> 2009-06-22 16:12 ` [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away Avi Kivity 2009-06-22 16:25 ` Anthony Liguori 2009-06-22 16:38 ` Avi Kivity 2009-06-22 16:58 ` Anthony Liguori 2009-06-22 17:12 ` Avi Kivity 2009-06-22 17:03 ` Anthony Liguori 2009-06-22 17:20 ` Avi Kivity 2009-06-22 17:37 ` Anthony Liguori 2009-06-22 18:01 ` Avi Kivity 2009-06-22 17:44 ` Anthony Liguori 2009-06-22 18:04 ` Avi Kivity 2009-06-22 19:38 ` Paul Brook 2009-06-22 19:49 ` Anthony Liguori
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).