From: Juan Quintela
In-Reply-To: <1462257521-16075-1-git-send-email-liang.z.li@intel.com> (Liang Li's message of "Tue, 3 May 2016 14:38:41 +0800")
References: <1462257521-16075-1-git-send-email-liang.z.li@intel.com>
Reply-To: quintela@redhat.com
Date: Wed, 04 May 2016 11:11:55 +0200
Message-ID: <878tzquqzo.fsf@emacs.mitica>
Subject: Re: [Qemu-devel] [PATCH] migration: Fix multi-thread compression bug
To: Liang Li
Cc: qemu-devel@nongnu.org, amit.shah@redhat.com, dgilbert@redhat.com

Liang Li wrote:
> Recently, a bug related to multiple thread compression feature for
> live migration is reported. The destination side will be blocked
> during live migration if there are heavy workload in host and
> memory intensive workload in guest, this is most likely to happen
> when there is one decompression thread.
>
> Some parts of the decompression code are incorrect:
> 1. The main thread receives data from source side will enter a busy
> loop to wait for a free decompression thread.
> 2. A lock is needed to protect the decomp_param[idx]->start, because
> it is checked in the main thread and is updated in the decompression
> thread.
>
> Fix these two issues by following the code pattern for compression.
>
> Reported-by: Daniel P. Berrange
> Signed-off-by: Liang Li

A step in the right direction, so:

Reviewed-by: Juan Quintela

but I am still not sure that this is enough. If you have the chance,
look at the multiple-fd code that I posted; it is very similar to this.

>  struct DecompressParam {

What protects start, and what protects done?

>      bool start;
> +    bool done;
>      QemuMutex mutex;
>      QemuCond cond;
>      void *des;
> @@ -287,6 +288,8 @@ static bool quit_comp_thread;
>  static bool quit_decomp_thread;
>  static DecompressParam *decomp_param;
>  static QemuThread *decompress_threads;
> +static QemuMutex decomp_done_lock;
> +static QemuCond decomp_done_cond;
>
>  static int do_compress_ram_page(CompressParam *param);
>
> @@ -834,6 +837,7 @@ static inline void start_compression(CompressParam *param)
>
>  static inline void start_decompression(DecompressParam *param)
>  {

Here nothing protects done.

> +    param->done = false;
>      qemu_mutex_lock(&param->mutex);
>      param->start = true;
>      qemu_cond_signal(&param->cond);
> @@ -2193,19 +2197,24 @@ static void *do_data_decompress(void *opaque)
>          qemu_mutex_lock(&param->mutex);

We are looking at quit_decomp_thread, and nothing protects it.

>          while (!param->start && !quit_decomp_thread) {
>              qemu_cond_wait(&param->cond, &param->mutex);
> +        }
> +        if (!quit_decomp_thread) {
>              pagesize = TARGET_PAGE_SIZE;
> -            if (!quit_decomp_thread) {
> -                /* uncompress() will return failed in some case, especially
> -                 * when the page is dirted when doing the compression, it's
> -                 * not a problem because the dirty page will be retransferred
> -                 * and uncompress() won't break the data in other pages.
> -                 */
> -                uncompress((Bytef *)param->des, &pagesize,
> -                           (const Bytef *)param->compbuf, param->len);
> -            }
> -            param->start = false;
> +            /* uncompress() will return failed in some case, especially
> +             * when the page is dirted when doing the compression, it's
> +             * not a problem because the dirty page will be retransferred
> +             * and uncompress() won't break the data in other pages.
> +             */
> +            uncompress((Bytef *)param->des, &pagesize,
> +                       (const Bytef *)param->compbuf, param->len);

We are calling uncompress() (a slow operation) with param->mutex taken;
is there any reason why we can't just put the param->* vars in locals?
(A rough sketch of what I mean is at the end of this mail.)

>          }
> +        param->start = false;

Why are we setting start to false when we are _not_ decompressing a
page? I think this line should be inside the if.

>          qemu_mutex_unlock(&param->mutex);
> +
> +        qemu_mutex_lock(&decomp_done_lock);
> +        param->done = true;

Here param->done is protected by decomp_done_lock.

> +        qemu_cond_signal(&decomp_done_cond);
> +        qemu_mutex_unlock(&decomp_done_lock);
>      }
>
>      return NULL;
> @@ -2219,10 +2228,13 @@ void migrate_decompress_threads_create(void)
>      decompress_threads = g_new0(QemuThread, thread_count);
>      decomp_param = g_new0(DecompressParam, thread_count);
>      quit_decomp_thread = false;
> +    qemu_mutex_init(&decomp_done_lock);
> +    qemu_cond_init(&decomp_done_cond);
>      for (i = 0; i < thread_count; i++) {
>          qemu_mutex_init(&decomp_param[i].mutex);
>          qemu_cond_init(&decomp_param[i].cond);
>          decomp_param[i].compbuf = g_malloc0(compressBound(TARGET_PAGE_SIZE));
> +        decomp_param[i].done = true;
>          qemu_thread_create(decompress_threads + i, "decompress",
>                             do_data_decompress, decomp_param + i,
>                             QEMU_THREAD_JOINABLE);
> @@ -2258,9 +2270,10 @@ static void decompress_data_with_multi_threads(QEMUFile *f,
>      int idx, thread_count;
>
>      thread_count = migrate_decompress_threads();
> +    qemu_mutex_lock(&decomp_done_lock);

We took decomp_done_lock ...

>      while (true) {
>          for (idx = 0; idx < thread_count; idx++) {
> -            if (!decomp_param[idx].start) {
> +            if (decomp_param[idx].done) {

... and we can protect done with it.

>                  qemu_get_buffer(f, decomp_param[idx].compbuf, len);
>                  decomp_param[idx].des = host;
>                  decomp_param[idx].len = len;

But these should be protected by decomp_param[idx].mutex, no?

> @@ -2270,8 +2283,11 @@ static void decompress_data_with_multi_threads(QEMUFile *f,
>          }
>          if (idx < thread_count) {
>              break;
> +        } else {
> +            qemu_cond_wait(&decomp_done_cond, &decomp_done_lock);
>          }
>      }
> +    qemu_mutex_unlock(&decomp_done_lock);
>  }
>
>  /*

Thanks, Juan.
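
P.S. To illustrate the "put the param->* vars in locals" comment above:
a completely untested sketch, reusing only the names from this patch and
assuming that the main thread refills compbuf only after done has gone
back to true, could look something like this:

static void *do_data_decompress(void *opaque)
{
    DecompressParam *param = opaque;
    unsigned long pagesize;

    while (!quit_decomp_thread) {
        void *des;
        int len;

        qemu_mutex_lock(&param->mutex);
        while (!param->start && !quit_decomp_thread) {
            qemu_cond_wait(&param->cond, &param->mutex);
        }
        if (quit_decomp_thread) {
            qemu_mutex_unlock(&param->mutex);
            break;
        }
        /* Copy what uncompress() needs into locals and drop the lock. */
        des = param->des;
        len = param->len;
        param->start = false;
        qemu_mutex_unlock(&param->mutex);

        /* The slow work now runs without param->mutex held.  A dirty
         * page may fail to uncompress, but it will be retransferred
         * anyway, so the return value can be ignored here. */
        pagesize = TARGET_PAGE_SIZE;
        uncompress((Bytef *)des, &pagesize,
                   (const Bytef *)param->compbuf, (uLong)len);

        /* Tell the main thread that this slot is free again. */
        qemu_mutex_lock(&decomp_done_lock);
        param->done = true;
        qemu_cond_signal(&decomp_done_cond);
        qemu_mutex_unlock(&decomp_done_lock);
    }

    return NULL;
}

Note that quit_decomp_thread is still read outside any lock in this
sketch, so that part of my comment above still applies.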