* Requesting review about optimizing large guest start up time
@ 2020-02-11 12:08 bauerchen(陈蒙蒙)
2020-02-20 11:11 ` Paolo Bonzini
From: bauerchen(陈蒙蒙) @ 2020-02-11 12:08 UTC
To: qemu-devel; +Cc: pbonzini
From c882b155466313fcd85ac330a45a573e608b0d74 Mon Sep 17 00:00:00 2001
From: bauerchen <bauerchen@tencent.com>
Date: Tue, 11 Feb 2020 17:10:35 +0800
Subject: [PATCH] Optimize: large guest start-up in mem-prealloc
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
[desc]:
A large-memory VM starts slowly when -mem-prealloc is used, and the
current preallocation code leaves room for optimization:

1. mmap() is used to allocate each page-clearing thread's stack, and it
takes mm->mmap_sem for writing; the clearing threads that are already
running hold it for reading, so this contention makes thread creation
very slow.

2. The pages are divided among the threads poorly: if 64 threads split
160 hugepages, 63 threads clear 2 pages each while the last thread
clears 34 pages, so the whole operation waits on that one thread.

To solve the first problem, each thread first waits on a condition
variable and only starts clearing pages after all threads have been
created. For the second problem, the remainder is spread across the
threads: with 160 hugepages and 64 threads, 32 threads clear 3 pages
each and 32 threads clear 2 pages each (see the sketch below).
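For illustration only (this standalone sketch is not part of the patch,
and the variable names are mine), the old and new splits can be compared
with a few lines of C:

    #include <stdio.h>

    int main(void)
    {
        size_t numpages = 160;
        size_t nthreads = 64;

        /* Old scheme: every thread but the last clears numpages / nthreads
         * pages; the last thread clears whatever remains. */
        size_t per_thread = numpages / nthreads;                      /* 2  */
        size_t last_thread = numpages - per_thread * (nthreads - 1);  /* 34 */
        printf("old: %zu threads x %zu pages, 1 thread x %zu pages\n",
               nthreads - 1, per_thread, last_thread);

        /* New scheme: spread the remainder so counts differ by at most 1. */
        size_t avg = numpages / nthreads + 1;       /* 3 */
        size_t small = avg * nthreads - numpages;   /* 32 threads get avg - 1 */
        printf("new: %zu threads x %zu pages, %zu threads x %zu pages\n",
               nthreads - small, avg, small, avg - 1);
        return 0;
    }

This prints "old: 63 threads x 2 pages, 1 thread x 34 pages" and
"new: 32 threads x 3 pages, 32 threads x 2 pages", matching the numbers
above.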
[test]:
320G, 84-vCPU VM start time is reduced to 10s
680G, 84-vCPU VM start time is reduced to 18s
Signed-off-by: bauerchen <bauerchen@tencent.com>
Reviewed-by: Pan Rui <ruippan@tencent.com>
Reviewed-by: Ivan Ren <ivanren@tencent.com>
---
util/oslib-posix.c | 44 ++++++++++++++++++++++++++++++++++++--------
1 file changed, 36 insertions(+), 8 deletions(-)
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 5a291cc..e97369b 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -76,6 +76,10 @@ static MemsetThread *memset_thread;
 static int memset_num_threads;
 static bool memset_thread_failed;
 
+static QemuMutex page_mutex;
+static QemuCond page_cond;
+static volatile bool thread_create_flag;
+
 int qemu_get_thread_id(void)
 {
 #if defined(__linux__)
@@ -403,6 +407,14 @@ static void *do_touch_pages(void *arg)
     MemsetThread *memset_args = (MemsetThread *)arg;
     sigset_t set, oldset;
 
+    /* Wait until all threads have been created. */
+    qemu_mutex_lock(&page_mutex);
+    while (!thread_create_flag) {
+        qemu_cond_wait(&page_cond, &page_mutex);
+    }
+    qemu_mutex_unlock(&page_mutex);
+
+
     /* unblock SIGBUS */
     sigemptyset(&set);
     sigaddset(&set, SIGBUS);
@@ -448,30 +460,46 @@ static inline int get_memset_num_threads(int smp_cpus)
     return ret;
 }
 
+static void calc_page_per_thread(size_t numpages, int memset_threads, size_t *pages_per_thread)
+{
+    size_t avg = numpages / memset_threads + 1;
+    size_t last = avg * memset_threads - numpages;
+    for (int i = 0; i < memset_threads; i++) {
+        if ((size_t)(memset_threads - i) <= last) {
+            pages_per_thread[i] = avg - 1;
+        } else {
+            pages_per_thread[i] = avg;
+        }
+    }
+}
+
 static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
                             int smp_cpus)
 {
-    size_t numpages_per_thread;
-    size_t size_per_thread;
+    size_t *numpages_per_thread;
     char *addr = area;
     int i = 0;
 
     memset_thread_failed = false;
+    thread_create_flag = false;
     memset_num_threads = get_memset_num_threads(smp_cpus);
+    numpages_per_thread = g_new0(size_t, memset_num_threads);
     memset_thread = g_new0(MemsetThread, memset_num_threads);
-    numpages_per_thread = (numpages / memset_num_threads);
-    size_per_thread = (hpagesize * numpages_per_thread);
+    calc_page_per_thread(numpages, memset_num_threads, numpages_per_thread);
+
     for (i = 0; i < memset_num_threads; i++) {
         memset_thread[i].addr = addr;
-        memset_thread[i].numpages = (i == (memset_num_threads - 1)) ?
-                                    numpages : numpages_per_thread;
+        memset_thread[i].numpages = numpages_per_thread[i];
         memset_thread[i].hpagesize = hpagesize;
         qemu_thread_create(&memset_thread[i].pgthread, "touch_pages",
                            do_touch_pages, &memset_thread[i],
                            QEMU_THREAD_JOINABLE);
-        addr += size_per_thread;
-        numpages -= numpages_per_thread;
+        addr += numpages_per_thread[i] * hpagesize;
+        numpages -= numpages_per_thread[i];
     }
+    thread_create_flag = true;
+    qemu_cond_broadcast(&page_cond);
+
     for (i = 0; i < memset_num_threads; i++) {
         qemu_thread_join(&memset_thread[i].pgthread);
     }
--
1.8.3.1
* Re: Requesting review about optimizing large guest start up time
2020-02-11 12:08 Requesting review about optimizing large guest start up time bauerchen(陈蒙蒙)
@ 2020-02-20 11:11 ` Paolo Bonzini
From: Paolo Bonzini @ 2020-02-20 11:11 UTC
To: bauerchen(陈蒙蒙), qemu-devel
I queued the patch, thanks!
I simplified the computation of the number of pages per thread a bit, like this:
    numpages_per_thread = numpages / memset_num_threads;
    leftover = numpages % memset_num_threads;
    for (i = 0; i < memset_num_threads; i++) {
        memset_thread[i].addr = addr;
        memset_thread[i].numpages = numpages_per_thread + (i < leftover);
        memset_thread[i].hpagesize = hpagesize;
        ...
This removes the need for a separate array and function.
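As a quick check (a standalone sketch of mine, not the queued patch
itself), this can be verified for the 160-hugepage / 64-thread case from
the commit message:

    #include <stdio.h>

    int main(void)
    {
        size_t numpages = 160;
        int memset_num_threads = 64;
        size_t numpages_per_thread = numpages / memset_num_threads;  /* 2  */
        size_t leftover = numpages % memset_num_threads;             /* 32 */
        size_t total = 0;
        int i;

        for (i = 0; i < memset_num_threads; i++) {
            /* The first `leftover` threads clear one extra page. */
            size_t pages = numpages_per_thread + (i < (int)leftover);
            printf("thread %2d: %zu pages\n", i, pages);
            total += pages;
        }
        printf("total: %zu pages\n", total);  /* 160 */
        return 0;
    }

Threads 0-31 get 3 pages and threads 32-63 get 2 pages, the same split
as the original patch, with no extra array or helper function.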
Thanks,
Paolo