From: Chao Fan <cfan@redhat.com>
To: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: ats-kumagai@wm.jp.nec.com, zhouwj-fnst@cn.fujitsu.com,
kexec@lists.infradead.org
Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
Date: Wed, 23 Dec 2015 21:20:48 -0500 (EST)
Message-ID: <1655994487.2838319.1450923648068.JavaMail.zimbra@redhat.com>
In-Reply-To: <20151222.173225.158968237.d.hatayama@jp.fujitsu.com>
----- Original Message -----
> From: "HATAYAMA Daisuke" <d.hatayama@jp.fujitsu.com>
> To: cfan@redhat.com
> Cc: ats-kumagai@wm.jp.nec.com, zhouwj-fnst@cn.fujitsu.com, kexec@lists.infradead.org
> Sent: Tuesday, December 22, 2015 4:32:25 PM
> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
>
> Chao,
>
> From: Chao Fan <cfan@redhat.com>
> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> Date: Thu, 10 Dec 2015 05:54:28 -0500
>
> >
> >
> > ----- Original Message -----
> >> From: "Wenjian Zhou/周文剑" <zhouwj-fnst@cn.fujitsu.com>
> >> To: "Chao Fan" <cfan@redhat.com>
> >> Cc: "Atsushi Kumagai" <ats-kumagai@wm.jp.nec.com>,
> >> kexec@lists.infradead.org
> >> Sent: Thursday, December 10, 2015 6:32:32 PM
> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >>
> >> On 12/10/2015 05:58 PM, Chao Fan wrote:
> >> >
> >> >
> >> > ----- Original Message -----
> >> >> From: "Wenjian Zhou/周文剑" <zhouwj-fnst@cn.fujitsu.com>
> >> >> To: "Atsushi Kumagai" <ats-kumagai@wm.jp.nec.com>
> >> >> Cc: kexec@lists.infradead.org
> >> >> Sent: Thursday, December 10, 2015 5:36:47 PM
> >> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >> >>
> >> >> On 12/10/2015 04:14 PM, Atsushi Kumagai wrote:
> >> >>>> Hello Kumagai,
> >> >>>>
> >> >>>> On 12/04/2015 10:30 AM, Atsushi Kumagai wrote:
> >> >>>>> Hello, Zhou
> >> >>>>>
> >> >>>>>> On 12/02/2015 03:24 PM, Dave Young wrote:
> >> >>>>>>> Hi,
> >> >>>>>>>
> >> >>>>>>> On 12/02/15 at 01:29pm, "Zhou, Wenjian/周文剑" wrote:
> >> >>>>>>>> I think there is no problem if other test results are as
> >> >>>>>>>> expected.
> >> >>>>>>>>
> >> >>>>>>>> --num-threads mainly reduces the time of compressing.
> >> >>>>>>>> So for lzo, it can't do much help at most of time.
> >> >>>>>>>
> >> >>>>>>> Seems the help of --num-threads does not say it exactly:
> >> >>>>>>>
> >> >>>>>>> [--num-threads THREADNUM]:
> >> >>>>>>> Using multiple threads to read and compress data of each
> >> >>>>>>> page
> >> >>>>>>> in parallel.
> >> >>>>>>> And it will reduces time for saving DUMPFILE.
> >> >>>>>>> This feature only supports creating DUMPFILE in
> >> >>>>>>> kdump-comressed format from
> >> >>>>>>> VMCORE in kdump-compressed format or elf format.
> >> >>>>>>>
> >> >>>>>>> Lzo is also a compress method, it should be mentioned that
> >> >>>>>>> --num-threads only
> >> >>>>>>> supports zlib compressed vmcore.
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>> Sorry, it seems that something I said is not so clear.
> >> >>>>>> lzo is also supported. Since lzo compresses data at a high speed,
> >> >>>>>> the
> >> >>>>>> improving of the performance is not so obvious at most of time.
> >> >>>>>>
> >> >>>>>>> Also worth to mention about the recommended -d value for this
> >> >>>>>>> feature.
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>> Yes, I think it's worth. I forgot it.
> >> >>>>>
> >> >>>>> I saw your patch, but I think I should confirm what is the problem
> >> >>>>> first.
> >> >>>>>
> >> >>>>>> However, when "-d 31" is specified, it will be worse.
> >> >>>>>> Less than 50 buffers are used to cache the compressed page.
> >> >>>>>> And even the page has been filtered, it will also take a buffer.
> >> >>>>>> So if "-d 31" is specified, the filtered page will use a lot
> >> >>>>>> of buffers. Then the page which needs to be compressed can't
> >> >>>>>> be compressed parallel.
> >> >>>>>
> >> >>>>> Could you explain why compression will not be parallel in more
> >> >>>>> detail ?
> >> >>>>> Actually the buffers are used also for filtered pages, it sounds
> >> >>>>> inefficient.
> >> >>>>> However, I don't understand why it prevents parallel compression.
> >> >>>>>
> >> >>>>
> >> >>>> Think about this, in a huge memory, most of the page will be
> >> >>>> filtered,
> >> >>>> and
> >> >>>> we have 5 buffers.
> >> >>>>
> >> >>>> page1      page2    page3    page4    page5    page6      page7    .....
> >> >>>> [buffer1]  [2]      [3]      [4]      [5]
> >> >>>> unfiltered filtered filtered filtered filtered unfiltered filtered
> >> >>>>
> >> >>>> Since filtered page will take a buffer, when compressing page1,
> >> >>>> page6 can't be compressed at the same time.
> >> >>>> That why it will prevent parallel compression.
> >> >>>
> >> >>> Thanks for your explanation, I understand.
> >> >>> This is just an issue of the current implementation, there is no
> >> >>> reason to stand this restriction.
> >> >>>
> >> >>>>> Further, according to Chao's benchmark, there is a big performance
> >> >>>>> degradation even if the number of thread is 1. (58s vs 240s)
> >> >>>>> The current implementation seems to have some problems, we should
> >> >>>>> solve them.
> >> >>>>>
> >> >>>>
> >> >>>> If "-d 31" is specified, on the one hand we can't save time by
> >> >>>> compressing
> >> >>>> parallel, on the other hand we will introduce some extra work by
> >> >>>> adding
> >> >>>> "--num-threads". So it is obvious that it will have a performance
> >> >>>> degradation.
> >> >>>
> >> >>> Sure, there must be some overhead due to "some extra work"(e.g.
> >> >>> exclusive
> >> >>> lock),
> >> >>> but "--num-threads=1 is 4 times slower than --num-threads=0" still
> >> >>> sounds
> >> >>> too slow, the degradation is too big to be called "some extra work".
> >> >>>
> >> >>> Both --num-threads=0 and --num-threads=1 are serial processing,
> >> >>> the above "buffer fairness issue" will not be related to this
> >> >>> degradation.
> >> >>> What do you think what make this degradation ?
> >> >>>
> >> >>
> >> >> I can't get such result at this moment, so I can't do some further
> >> >> investigation
> >> >> right now. I guess it may be caused by the underlying implementation of
> >> >> pthread.
> >> >> I reviewed the test result of the patch v2 and found in different
> >> >> machines,
> >> >> the results are quite different.
> >> >
> >> > Hi Zhou Wenjian,
> >> >
> >> > I have done more tests in another machine with 128G memory, and get the
> >> > result:
> >> >
> >> > the size of vmcore is 300M in "-d 31"
> >> > makedumpfile -l --message-level 1 -d 31:
> >> > time: 8.6s page-faults: 2272
> >> >
> >> > makedumpfile -l --num-threads 1 --message-level 1 -d 31:
> >> > time: 28.1s page-faults: 2359
> >> >
> >> >
> >> > and the size of vmcore is 2.6G in "-d 0".
> >> > In this machine, I get the same result as yours:
> >> >
> >> >
> >> > makedumpfile -c --message-level 1 -d 0:
> >> > time: 597s page-faults: 2287
> >> >
> >> > makedumpfile -c --num-threads 1 --message-level 1 -d 0:
> >> > time: 602s page-faults: 2361
> >> >
> >> > makedumpfile -c --num-threads 2 --message-level 1 -d 0:
> >> > time: 337s page-faults: 2397
> >> >
> >> > makedumpfile -c --num-threads 4 --message-level 1 -d 0:
> >> > time: 175s page-faults: 2461
> >> >
> >> > makedumpfile -c --num-threads 8 --message-level 1 -d 0:
> >> > time: 103s page-faults: 2611
> >> >
> >> >
> >> > But the machine of my first test is not under my control, should I wait
> >> > for
> >> > the first machine to do more tests?
> >> > If there are still some problems in my tests, please tell me.
> >> >
> >>
> >> Thanks a lot for your test, it seems that there is nothing wrong.
> >> And I haven't got any idea about more tests...
> >>
> >> Could you provide the information of your cpu ?
> >> I will do some further investigation later.
> >>
> >
> > OK, of course, here is the information of cpu:
> >
> > # lscpu
> > Architecture: x86_64
> > CPU op-mode(s): 32-bit, 64-bit
> > Byte Order: Little Endian
> > CPU(s): 48
> > On-line CPU(s) list: 0-47
> > Thread(s) per core: 1
> > Core(s) per socket: 6
> > Socket(s): 8
> > NUMA node(s): 8
> > Vendor ID: AuthenticAMD
> > CPU family: 16
> > Model: 8
> > Model name: Six-Core AMD Opteron(tm) Processor 8439 SE
> > Stepping: 0
> > CPU MHz: 2793.040
> > BogoMIPS: 5586.22
> > Virtualization: AMD-V
> > L1d cache: 64K
> > L1i cache: 64K
> > L2 cache: 512K
> > L3 cache: 5118K
> > NUMA node0 CPU(s): 0,8,16,24,32,40
> > NUMA node1 CPU(s): 1,9,17,25,33,41
> > NUMA node2 CPU(s): 2,10,18,26,34,42
> > NUMA node3 CPU(s): 3,11,19,27,35,43
> > NUMA node4 CPU(s): 4,12,20,28,36,44
> > NUMA node5 CPU(s): 5,13,21,29,37,45
> > NUMA node6 CPU(s): 6,14,22,30,38,46
> > NUMA node7 CPU(s): 7,15,23,31,39,47
>
> This CPU assignment on NUMA nodes looks interesting. Is it possible
> that this affects performance of makedumpfile? This is just a guess.
>
> Could you check whether the performance gets improved if you run each
> thread on the same NUMA node? For example:
>
> # taskset -c 0,8,16,24 makedumpfile --num-threads 4 -c -d 0 vmcore
> vmcore-cd0
>
Hi HATAYAMA,

I think your guess is right, although your command has a small problem.
In my tests, NUMA placement did affect performance, but not by much.

The average time with all CPUs on the same NUMA node:
# taskset -c 0,8,16,24,32 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0
is 314s.

The average time with CPUs on different NUMA nodes:
# taskset -c 2,3,5,6,7 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0
is 354s.
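
To put the two averages in proportion, here is a small Python sketch (not part of makedumpfile; the function name is only illustrative) that computes the relative slowdown from the two timings above:

```python
# Average wall-clock times reported above, in seconds
# (makedumpfile --num-threads 4 -c -d 0, pinned with taskset).
same_node_s = 314    # CPUs 0,8,16,24,32: all on NUMA node0
cross_node_s = 354   # CPUs 2,3,5,6,7: spread over five NUMA nodes


def relative_slowdown(baseline_s, other_s):
    """Return how much longer other_s is than baseline_s, as a fraction."""
    return (other_s - baseline_s) / baseline_s


slowdown = relative_slowdown(same_node_s, cross_node_s)
print(f"cross-node run is {slowdown:.1%} slower than the same-node run")
```

So the cross-node run takes roughly one eighth longer, which fits "affects performance, but not by much".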

As for the command: I think that with "--num-threads 4", the CPU list
passed to "taskset -c" should contain at least 5 CPUs; otherwise the run
takes too long.

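
A minimal sketch of how such a CPU list could be built, assuming the NUMA layout from the lscpu output quoted in this thread. The helper name `pinned_command` and the "threads + 1" rule are illustrative assumptions based on the observation above, not anything makedumpfile or taskset documents:

```python
# NUMA layout copied from the quoted lscpu output:
# node N owns CPUs N, N+8, N+16, N+24, N+32, N+40.
NUMA_CPUS = {node: [node + 8 * i for i in range(6)] for node in range(8)}


def pinned_command(node, num_threads, extra_cpus=1):
    """Build a taskset command that keeps makedumpfile on one NUMA node.

    extra_cpus=1 reflects the observation that "--num-threads 4" wants
    at least 5 CPUs; it is a rule of thumb, not a documented limit.
    """
    wanted = num_threads + extra_cpus
    cpus = NUMA_CPUS[node]
    if wanted > len(cpus):
        raise ValueError(
            f"node {node} has only {len(cpus)} CPUs, need {wanted}")
    cpu_list = ",".join(str(cpu) for cpu in cpus[:wanted])
    return (f"taskset -c {cpu_list} makedumpfile "
            f"--num-threads {num_threads} -c -d 0 vmcore vmcore-cd0")


print(pinned_command(node=0, num_threads=4))
```

For node 0 and 4 threads this produces the same command line as the same-node test above. On this machine each node has 6 CPUs, so the sketch refuses anything above "--num-threads 5" on a single node.
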
Thanks,
Chao Fan
> Also, if this were cause of this performance degradation, we might
> need to extend nr_cpus= kernel option to choose NUMA nodes we use;
> though, we might already be able to do that in combination with other
> kernel features.
>
> > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> > mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall mmxext fxsr_opt
> > pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc
> > extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic
> > cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
> > hw_pstate npt lbrv svm_lock nrip_save pausefilter vmmcall
> >
> >> But I still believe it's better not to use "-l -d 31" and "--num-threads"
> >> at the same time, though it's very strange that the performance
> >> degradation is so big.
> >>
> >> --
> >> Thanks
> >> Zhou
> >>
> >> > Thanks,
> >> > Chao Fan
> >> >
> >> >
> >> >>
> >> >> It seems that I can get almost the same result of Chao from "PRIMEQUEST
> >> >> 1800E".
> >> >>
> >> >> ###################################
> >> >> - System: PRIMERGY RX300 S6
> >> >> - CPU: Intel(R) Xeon(R) CPU x5660
> >> >> - memory: 16GB
> >> >> ###################################
> >> >> ************ makedumpfile -d 7 ******************
> >> >> core-data 0 256
> >> >> threads-num
> >> >> -l
> >> >> 0 10 144
> >> >> 4 5 110
> >> >> 8 5 111
> >> >> 12 6 111
> >> >>
> >> >> ************ makedumpfile -d 31 ******************
> >> >> core-data 0 256
> >> >> threads-num
> >> >> -l
> >> >> 0 0 0
> >> >> 4 2 2
> >> >> 8 2 3
> >> >> 12 2 3
> >> >>
> >> >> ###################################
> >> >> - System: PRIMEQUEST 1800E
> >> >> - CPU: Intel(R) Xeon(R) CPU E7540
> >> >> - memory: 32GB
> >> >> ###################################
> >> >> ************ makedumpfile -d 7 ******************
> >> >> core-data 0 256
> >> >> threads-num
> >> >> -l
> >> >> 0 34 270
> >> >> 4 63 154
> >> >> 8 64 131
> >> >> 12 65 159
> >> >>
> >> >> ************ makedumpfile -d 31 ******************
> >> >> core-data 0 256
> >> >> threads-num
> >> >> -l
> >> >> 0 2 1
> >> >> 4 48 48
> >> >> 8 48 49
> >> >> 12 49 50
> >> >>
> >> >>>> I'm not so sure if it is a problem that the performance degradation
> >> >>>> is
> >> >>>> so
> >> >>>> big.
> >> >>>> But I think if in other cases, it works as expected, this won't be a
> >> >>>> problem(
> >> >>>> or a problem needs to be fixed), for the performance degradation
> >> >>>> existing
> >> >>>> in theory.
> >> >>>>
> >> >>>> Or the current implementation should be replaced by a new arithmetic.
> >> >>>> For example:
> >> >>>> We can add an array to record whether the page is filtered or not.
> >> >>>> And only the unfiltered page will take the buffer.
> >> >>>
> >> >>> We should discuss how to implement new mechanism, I'll mention this
> >> >>> later.
> >> >>>
> >> >>>> But I'm not sure if it is worth.
> >> >>>> For "-l -d 31" is fast enough, the new arithmetic also can't do much
> >> >>>> help.
> >> >>>
> >> >>> Basically the faster, the better. There is no obvious target time.
> >> >>> If there is room for improvement, we should do it.
> >> >>>
> >> >>
> >> >> Maybe we can improve the performance of "-c -d 31" in some case.
> >> >>
> >> >> BTW, we can easily get the theoretical performance by using the
> >> >> "--split".
> >> >>
> >> >> --
> >> >> Thanks
> >> >> Zhou
> >> >>
> >> >>
> >> >>
> >> >> _______________________________________________
> >> >> kexec mailing list
> >> >> kexec@lists.infradead.org
> >> >> http://lists.infradead.org/mailman/listinfo/kexec
> >> >>
> >>
> >>
> >>
> >>
> >>
> >
> --
> Thanks.
> HATAYAMA, Daisuke
>