From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:33232) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1cYBs2-0000sg-Mg for qemu-devel@nongnu.org; Mon, 30 Jan 2017 08:19:48 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1cYBrx-0000us-Fx for qemu-devel@nongnu.org; Mon, 30 Jan 2017 08:19:46 -0500 Received: from mail-db5eur01on0139.outbound.protection.outlook.com ([104.47.2.139]:9833 helo=EUR01-DB5-obe.outbound.protection.outlook.com) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1cYBrw-0000tx-K9 for qemu-devel@nongnu.org; Mon, 30 Jan 2017 08:19:41 -0500 References: <20170116155113.21034-1-pbutsykin@virtuozzo.com> <20170116155113.21034-2-pbutsykin@virtuozzo.com> <20170127110119.GB3323@work-vm> From: Pavel Butsykin Message-ID: <588F3A11.4070704@virtuozzo.com> Date: Mon, 30 Jan 2017 16:05:21 +0300 MIME-Version: 1.0 In-Reply-To: <20170127110119.GB3323@work-vm> Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH 1/2] add 'discard-ram' migrate capability List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: "Dr. David Alan Gilbert" Cc: qemu-devel@nongnu.org, quintela@redhat.com, armbru@redhat.com, amit.shah@redhat.com, den@openvz.org On 27.01.2017 14:01, Dr. David Alan Gilbert wrote: > * Pavel Butsykin (pbutsykin@virtuozzo.com) wrote: >> This feature frees the migrated memory on the source during postcopy-ram >> migration. In the second step of postcopy-ram migration when the source vm >> is put on pause we can free unnecessary memory. It will allow, in particular, >> to start relaxing the memory stress on the source host in a load-balancing >> scenario. >> >> Signed-off-by: Pavel Butsykin > > Hi Pavel, > Firstly a higher-level thing, can we use a different word than 'discard' > because I already use 'discard' in postcopy to mean the request from the source > to the destination to discard pages that are redirtied. I suggest 'release-ram' > to just pick a different word that means the same thing. > Hi David, Yeah, I thought about it.. the same names can confuse here, so I agree to change the name. > Also, see patchew's build error it spotted. > >> --- >> include/migration/migration.h | 1 + >> include/migration/qemu-file.h | 3 ++- >> migration/migration.c | 9 +++++++ >> migration/qemu-file.c | 59 ++++++++++++++++++++++++++++++++++++++----- >> migration/ram.c | 24 +++++++++++++++++- >> qapi-schema.json | 5 +++- >> 6 files changed, 91 insertions(+), 10 deletions(-) >> >> diff --git a/include/migration/migration.h b/include/migration/migration.h >> index c309d23370..d7bd404365 100644 >> --- a/include/migration/migration.h >> +++ b/include/migration/migration.h >> @@ -294,6 +294,7 @@ void migrate_add_blocker(Error *reason); >> */ >> void migrate_del_blocker(Error *reason); >> >> +bool migrate_discard_ram(void); >> bool migrate_postcopy_ram(void); >> bool migrate_zero_blocks(void); >> >> diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h >> index abedd466c9..0cd648a733 100644 >> --- a/include/migration/qemu-file.h >> +++ b/include/migration/qemu-file.h >> @@ -132,7 +132,8 @@ void qemu_put_byte(QEMUFile *f, int v); >> * put_buffer without copying the buffer. >> * The buffer should be available till it is sent asynchronously. >> */ >> -void qemu_put_buffer_async(QEMUFile *f, const uint8_t *buf, size_t size); >> +void qemu_put_buffer_async(QEMUFile *f, const uint8_t *buf, size_t size, >> + bool may_free); >> bool qemu_file_mode_is_not_valid(const char *mode); >> bool qemu_file_is_writable(QEMUFile *f); >> >> diff --git a/migration/migration.c b/migration/migration.c >> index f498ab84f2..391db6f28b 100644 >> --- a/migration/migration.c >> +++ b/migration/migration.c >> @@ -1251,6 +1251,15 @@ void qmp_migrate_set_downtime(double value, Error **errp) >> qmp_migrate_set_parameters(&p, errp); >> } >> >> +bool migrate_discard_ram(void) >> +{ >> + MigrationState *s; >> + >> + s = migrate_get_current(); >> + >> + return s->enabled_capabilities[MIGRATION_CAPABILITY_DISCARD_RAM]; >> +} >> + >> bool migrate_postcopy_ram(void) >> { >> MigrationState *s; >> diff --git a/migration/qemu-file.c b/migration/qemu-file.c >> index e9fae31158..f85a0ecd9e 100644 >> --- a/migration/qemu-file.c >> +++ b/migration/qemu-file.c >> @@ -30,6 +30,7 @@ >> #include "qemu/coroutine.h" >> #include "migration/migration.h" >> #include "migration/qemu-file.h" >> +#include "sysemu/sysemu.h" >> #include "trace.h" >> >> #define IO_BUF_SIZE 32768 >> @@ -49,6 +50,7 @@ struct QEMUFile { >> int buf_size; /* 0 when writing */ >> uint8_t buf[IO_BUF_SIZE]; >> >> + DECLARE_BITMAP(may_free, MAX_IOV_SIZE); >> struct iovec iov[MAX_IOV_SIZE]; >> unsigned int iovcnt; >> >> @@ -132,6 +134,40 @@ bool qemu_file_is_writable(QEMUFile *f) >> return f->ops->writev_buffer; >> } >> >> +static void qemu_iovec_discard_ram(QEMUFile *f) >> +{ >> + struct iovec iov; >> + unsigned long idx; >> + >> + if (!migrate_discard_ram() || !runstate_check(RUN_STATE_FINISH_MIGRATE)) { >> + return; >> + } > > Can we split this out into a separate function please; so qemu_iovec_discard_ram > always does it, and then you have something that only calls it if enabled. > Of course, I think migrate_discard_ram() is not needed here, because it is already invoked in ram.c: static int ram_save_page(QEMUFile *f, PageSearchStatus *pss, ... if (send_async) { qemu_put_buffer_async( f, p, TARGET_PAGE_SIZE, migrate_discard_ram()); } else { qemu_put_buffer(f, p, TARGET_PAGE_SIZE); ... Also, it will fix 'build error'. >> + idx = find_next_bit(f->may_free, f->iovcnt, 0); >> + if (idx >= f->iovcnt) { >> + return; >> + } >> + iov = f->iov[idx]; >> + >> + while ((idx = find_next_bit(f->may_free, f->iovcnt, idx + 1)) < f->iovcnt) { >> + /* check for adjacent buffer and coalesce them */ >> + if (iov.iov_base + iov.iov_len == f->iov[idx].iov_base) { >> + iov.iov_len += f->iov[idx].iov_len; >> + continue; >> + } >> + if (qemu_madvise(iov.iov_base, iov.iov_len, QEMU_MADV_DONTNEED) < 0) { >> + error_report("migrate: madvise DONTNEED failed %p %ld: %s", >> + iov.iov_base, iov.iov_len, strerror(errno)); >> + } >> + iov = f->iov[idx]; > > Can you add some comments in here please; it took me a while to understand it; Ok, no problem. > I think what you're doing is that the madvise in the loop is called for the last > iov within a continuous range and then you fall through to deal with the last one. > Yes, contiguous memory is often mixed with descriptors, so that the release of the contiguous memory may occur more than one call qemu_madvise(). > Also, see my 'postcopy: enhance ram_discard_range for hugepages' - these madvise's > get a bit more complex with hugepage. > Thanks, it's a good improvement. >> + } >> + if (qemu_madvise(iov.iov_base, iov.iov_len, QEMU_MADV_DONTNEED) < 0) { >> + error_report("migrate: madvise DONTNEED failed %p %ld: %s", >> + iov.iov_base, iov.iov_len, strerror(errno)); >> + } >> + memset(f->may_free, 0, sizeof(f->may_free)); >> +} >> + >> /** >> * Flushes QEMUFile buffer >> * >> @@ -151,6 +187,8 @@ void qemu_fflush(QEMUFile *f) >> if (f->iovcnt > 0) { >> expect = iov_size(f->iov, f->iovcnt); >> ret = f->ops->writev_buffer(f->opaque, f->iov, f->iovcnt, f->pos); >> + >> + qemu_iovec_discard_ram(f); >> } >> >> if (ret >= 0) { >> @@ -304,13 +342,19 @@ int qemu_fclose(QEMUFile *f) >> return ret; >> } >> >> -static void add_to_iovec(QEMUFile *f, const uint8_t *buf, size_t size) >> +static void add_to_iovec(QEMUFile *f, const uint8_t *buf, size_t size, >> + bool may_free) >> { >> /* check for adjacent buffer and coalesce them */ >> if (f->iovcnt > 0 && buf == f->iov[f->iovcnt - 1].iov_base + >> - f->iov[f->iovcnt - 1].iov_len) { >> + f->iov[f->iovcnt - 1].iov_len && >> + may_free == test_bit(f->iovcnt - 1, f->may_free)) >> + { >> f->iov[f->iovcnt - 1].iov_len += size; >> } else { >> + if (may_free) { >> + set_bit(f->iovcnt, f->may_free); >> + } >> f->iov[f->iovcnt].iov_base = (uint8_t *)buf; >> f->iov[f->iovcnt++].iov_len = size; >> } >> @@ -320,14 +364,15 @@ static void add_to_iovec(QEMUFile *f, const uint8_t *buf, size_t size) >> } >> } >> >> -void qemu_put_buffer_async(QEMUFile *f, const uint8_t *buf, size_t size) >> +void qemu_put_buffer_async(QEMUFile *f, const uint8_t *buf, size_t size, >> + bool may_free) >> { >> if (f->last_error) { >> return; >> } >> >> f->bytes_xfer += size; >> - add_to_iovec(f, buf, size); >> + add_to_iovec(f, buf, size, may_free); >> } >> >> void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, size_t size) >> @@ -345,7 +390,7 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, size_t size) >> } >> memcpy(f->buf + f->buf_index, buf, l); >> f->bytes_xfer += l; >> - add_to_iovec(f, f->buf + f->buf_index, l); >> + add_to_iovec(f, f->buf + f->buf_index, l, false); >> f->buf_index += l; >> if (f->buf_index == IO_BUF_SIZE) { >> qemu_fflush(f); >> @@ -366,7 +411,7 @@ void qemu_put_byte(QEMUFile *f, int v) >> >> f->buf[f->buf_index] = v; >> f->bytes_xfer++; >> - add_to_iovec(f, f->buf + f->buf_index, 1); >> + add_to_iovec(f, f->buf + f->buf_index, 1, false); >> f->buf_index++; >> if (f->buf_index == IO_BUF_SIZE) { >> qemu_fflush(f); >> @@ -647,7 +692,7 @@ ssize_t qemu_put_compression_data(QEMUFile *f, const uint8_t *p, size_t size, >> } >> qemu_put_be32(f, blen); >> if (f->ops->writev_buffer) { >> - add_to_iovec(f, f->buf + f->buf_index, blen); >> + add_to_iovec(f, f->buf + f->buf_index, blen, false); >> } >> f->buf_index += blen; >> if (f->buf_index == IO_BUF_SIZE) { >> diff --git a/migration/ram.c b/migration/ram.c >> index a1c8089010..b0322a0b5c 100644 >> --- a/migration/ram.c >> +++ b/migration/ram.c >> @@ -27,6 +27,7 @@ >> */ >> #include "qemu/osdep.h" >> #include "qemu-common.h" >> +#include "sysemu/sysemu.h" >> #include "cpu.h" >> #include >> #include "qapi-event.h" >> @@ -713,6 +714,18 @@ static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset, >> return pages; >> } >> >> +static void ram_discard_page(uint8_t *addr, int pages) >> +{ >> + if (!migrate_discard_ram() || !runstate_check(RUN_STATE_FINISH_MIGRATE)) { >> + return; >> + } >> + >> + if (qemu_madvise(addr, pages << TARGET_PAGE_BITS, QEMU_MADV_DONTNEED) < 0) { >> + error_report("migrate: madvise DONTNEED failed %p %d: %s", >> + addr, pages << TARGET_PAGE_BITS, strerror(errno)); >> + } >> +} > > Would it make sense to pick up my 'Fold postcopy_ram_discard_range into ram_discard_range' > patch and then just call that wrapped with your test? > Again, this is another reason to change the word 'discard'. > NP. I will rebase on "Postcopy: Hugepage support" patch series. >> /** >> * ram_save_page: Send the given page to the stream >> * >> @@ -772,6 +785,7 @@ static int ram_save_page(QEMUFile *f, PageSearchStatus *pss, >> * page would be stale >> */ >> xbzrle_cache_zero_page(current_addr); >> + ram_discard_page(p, pages); >> } else if (!ram_bulk_stage && >> !migration_in_postcopy(migrate_get_current()) && >> migrate_use_xbzrle()) { >> @@ -791,9 +805,11 @@ static int ram_save_page(QEMUFile *f, PageSearchStatus *pss, >> *bytes_transferred += save_page_header(f, block, >> offset | RAM_SAVE_FLAG_PAGE); >> if (send_async) { >> - qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE); >> + qemu_put_buffer_async( >> + f, p, TARGET_PAGE_SIZE, migrate_discard_ram()); >> } else { >> qemu_put_buffer(f, p, TARGET_PAGE_SIZE); >> + ram_discard_page(p, 1); > > Does this case ever happen? Isn't the only time we hit non-async > if xbzrle is enabled, and then 'p' can be a pointer into the xbzrle > cache? (and we don't support xbzrle with postcopy). > Yes, you're right. Will fix. >> } >> *bytes_transferred += TARGET_PAGE_SIZE; >> pages = 1; >> @@ -821,6 +837,7 @@ static int do_compress_ram_page(QEMUFile *f, RAMBlock *block, >> error_report("compressed data failed!"); >> } else { >> bytes_sent += blen; >> + ram_discard_page(p, 1); >> } >> >> return bytes_sent; >> @@ -959,12 +976,17 @@ static int ram_save_compressed_page(QEMUFile *f, PageSearchStatus *pss, >> error_report("compressed data failed!"); >> } >> } >> + if (pages > 0) { >> + ram_discard_page(p, pages); >> + } >> } else { >> offset |= RAM_SAVE_FLAG_CONTINUE; >> pages = save_zero_page(f, block, offset, p, bytes_transferred); >> if (pages == -1) { >> pages = compress_page_with_multi_thread(f, block, offset, >> bytes_transferred); >> + } else { >> + ram_discard_page(p, pages); >> } >> } > > Not that we support compressed pages on postcopy yet (destination code needs > writing). > I know, but why not? Here we can make discard-ram support in advance for symmetry :) >> } >> diff --git a/qapi-schema.json b/qapi-schema.json >> index ce20f16757..f02b434765 100644 >> --- a/qapi-schema.json >> +++ b/qapi-schema.json >> @@ -588,11 +588,14 @@ >> # side, this process is called COarse-Grain LOck Stepping (COLO) for >> # Non-stop Service. (since 2.8) >> # >> +# @discard-ram: if enabled, qemu will free the migrated ram pages on the source >> +# during postcopy-ram migration. (since 2.9) >> +# >> # Since: 1.2 >> ## >> { 'enum': 'MigrationCapability', >> 'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks', >> - 'compress', 'events', 'postcopy-ram', 'x-colo'] } >> + 'compress', 'events', 'postcopy-ram', 'x-colo', 'discard-ram'] } > > Interestingly, it's not directly tied to postcopy, but I guess postcopy > is the only time you get the bulk of pages sent in FINISH_MIGRATE mode; > if enabled it would trigger for the final stage of a precopy migration. > > I suppose we have to worry about a case like: > postcopy is enabled > discard-ram is enabled > migration starts > migration goes into FINISH_MIGRATE very quickly in precopy > discard-ram throws source pages away. > migration fails (some screwup as it starts destination) > management tells source to carry on (after all it never actually entered postcopy) > -> source is using some emptied ram. Good point, I missed it. > The fix for that would be to tie it to postcopy. Thanks! Yes, actually discard-ram for MIGRATION_STATUS_ACTIVE doesn't make sense. > > Dave >> >> ## >> # @MigrationCapabilityStatus: >> -- >> 2.11.0 >> >> > -- > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK >