Re: [PATCH RFC 00/11] makedumpfile: parallel processing

From: Chao Fan <cfan@redhat.com>
To: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: ats-kumagai@wm.jp.nec.com, zhouwj-fnst@cn.fujitsu.com,
	kexec@lists.infradead.org
Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
Date: Wed, 23 Dec 2015 21:20:48 -0500 (EST)
Message-ID: <1655994487.2838319.1450923648068.JavaMail.zimbra@redhat.com>
In-Reply-To: <20151222.173225.158968237.d.hatayama@jp.fujitsu.com>

----- Original Message -----
> From: "HATAYAMA Daisuke" <d.hatayama@jp.fujitsu.com>
> To: cfan@redhat.com
> Cc: ats-kumagai@wm.jp.nec.com, zhouwj-fnst@cn.fujitsu.com, kexec@lists.infradead.org
> Sent: Tuesday, December 22, 2015 4:32:25 PM
> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> 
> Chao,
> 
> From: Chao Fan <cfan@redhat.com>
> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> Date: Thu, 10 Dec 2015 05:54:28 -0500
> 
> > 
> > 
> > ----- Original Message -----
> >> From: "Wenjian Zhou/周文剑" <zhouwj-fnst@cn.fujitsu.com>
> >> To: "Chao Fan" <cfan@redhat.com>
> >> Cc: "Atsushi Kumagai" <ats-kumagai@wm.jp.nec.com>,
> >> kexec@lists.infradead.org
> >> Sent: Thursday, December 10, 2015 6:32:32 PM
> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >> 
> >> On 12/10/2015 05:58 PM, Chao Fan wrote:
> >> >
> >> >
> >> > ----- Original Message -----
> >> >> From: "Wenjian Zhou/周文剑" <zhouwj-fnst@cn.fujitsu.com>
> >> >> To: "Atsushi Kumagai" <ats-kumagai@wm.jp.nec.com>
> >> >> Cc: kexec@lists.infradead.org
> >> >> Sent: Thursday, December 10, 2015 5:36:47 PM
> >> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
> >> >>
> >> >> On 12/10/2015 04:14 PM, Atsushi Kumagai wrote:
> >> >>>> Hello Kumagai,
> >> >>>>
> >> >>>> On 12/04/2015 10:30 AM, Atsushi Kumagai wrote:
> >> >>>>> Hello, Zhou
> >> >>>>>
> >> >>>>>> On 12/02/2015 03:24 PM, Dave Young wrote:
> >> >>>>>>> Hi,
> >> >>>>>>>
> >> >>>>>>> On 12/02/15 at 01:29pm, "Zhou, Wenjian/周文剑" wrote:
> >> >>>>>>>> I think there is no problem if other test results are as
> >> >>>>>>>> expected.
> >> >>>>>>>>
> >> >>>>>>>> --num-threads mainly reduces the time spent compressing.
> >> >>>>>>>> So for lzo, it can't help much most of the time.
> >> >>>>>>>
> >> >>>>>>> Seems the help of --num-threads does not say it exactly:
> >> >>>>>>>
> >> >>>>>>>       [--num-threads THREADNUM]:
> >> >>>>>>>           Using multiple threads to read and compress data of each
> >> >>>>>>>           page in parallel.
> >> >>>>>>>           And it will reduce time for saving DUMPFILE.
> >> >>>>>>>           This feature only supports creating DUMPFILE in
> >> >>>>>>>           kdump-compressed format from
> >> >>>>>>>           VMCORE in kdump-compressed format or elf format.
> >> >>>>>>>
> >> >>>>>>> Lzo is also a compression method; it should be mentioned that
> >> >>>>>>> --num-threads only supports zlib-compressed vmcore.
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>> Sorry, it seems that something I said was not clear.
> >> >>>>>> lzo is also supported. Since lzo compresses data at high speed,
> >> >>>>>> the performance improvement is not so obvious most of the time.
> >> >>>>>>
> >> >>>>>>> It is also worth mentioning the recommended -d value for this
> >> >>>>>>> feature.
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>> Yes, I think it's worth mentioning. I forgot it.
> >> >>>>>
> >> >>>>> I saw your patch, but I think I should confirm what is the problem
> >> >>>>> first.
> >> >>>>>
> >> >>>>>> However, when "-d 31" is specified, it will be worse.
> >> >>>>>> Fewer than 50 buffers are used to cache the compressed pages.
> >> >>>>>> And even if a page has been filtered, it still takes a buffer.
> >> >>>>>> So if "-d 31" is specified, the filtered pages will use a lot
> >> >>>>>> of buffers. Then the pages which need to be compressed can't
> >> >>>>>> be compressed in parallel.
> >> >>>>>
> >> >>>>> Could you explain in more detail why compression will not be
> >> >>>>> parallel?
> >> >>>>> Actually the buffers are used also for filtered pages, it sounds
> >> >>>>> inefficient.
> >> >>>>> However, I don't understand why it prevents parallel compression.
> >> >>>>>
> >> >>>>
> >> >>>> Think about this: in a huge memory, most of the pages will be
> >> >>>> filtered, and we have 5 buffers.
> >> >>>>
> >> >>>> page1       page2      page3     page4     page5      page6       page7     .....
> >> >>>> [buffer1]   [2]        [3]       [4]       [5]
> >> >>>> unfiltered  filtered   filtered  filtered  filtered   unfiltered  filtered
> >> >>>>
> >> >>>> Since a filtered page also takes a buffer, when compressing page1,
> >> >>>> page6 can't be compressed at the same time.
> >> >>>> That is why it prevents parallel compression.
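
The quoted 5-buffer example can be checked with a toy model. This is only
a sketch of the idea, not makedumpfile's code: the function name
max_parallel and the skip_filtered switch are made up for illustration,
and the model assumes buffers are handed out to pages strictly in order.

```python
def max_parallel(pages, num_buffers, skip_filtered):
    """Count unfiltered pages that get a buffer in the first window.

    pages: list of booleans, True meaning the page is unfiltered
           (it must be compressed), False meaning it is filtered out.
    skip_filtered: False models the behaviour described in the thread
           (every page takes a buffer); True models the proposed change
           (only unfiltered pages take a buffer).
    """
    in_flight = 0  # unfiltered pages that currently hold a buffer
    used = 0       # buffers handed out so far
    for unfiltered in pages:
        if used == num_buffers:
            break  # no free buffer left for later pages
        if unfiltered:
            in_flight += 1
            used += 1
        elif not skip_filtered:
            used += 1  # a filtered page still occupies a buffer
    return in_flight


# Layout from the diagram: page1 and page6 unfiltered, the rest filtered.
pages = [True, False, False, False, False, True, False]

print("every page takes a buffer:", max_parallel(pages, 5, skip_filtered=False))
print("only unfiltered pages do: ", max_parallel(pages, 5, skip_filtered=True))
```

In this model page6 only gets a buffer together with page1 when filtered
pages are skipped, which matches the explanation above.
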
> >> >>>
> >> >>> Thanks for your explanation, I understand.
> >> >>> This is just an issue of the current implementation; there is no
> >> >>> reason to accept this restriction.
> >> >>>
> >> >>>>> Further, according to Chao's benchmark, there is a big performance
> >> >>>>> degradation even if the number of threads is 1. (58s vs 240s)
> >> >>>>> The current implementation seems to have some problems, we should
> >> >>>>> solve them.
> >> >>>>>
> >> >>>>
> >> >>>> If "-d 31" is specified, on the one hand we can't save time by
> >> >>>> compressing in parallel, on the other hand we introduce some
> >> >>>> extra work by adding "--num-threads". So it is obvious that there
> >> >>>> will be a performance degradation.
> >> >>>
> >> >>> Sure, there must be some overhead due to "some extra work" (e.g.
> >> >>> exclusive lock), but "--num-threads=1 is 4 times slower than
> >> >>> --num-threads=0" still sounds too slow; the degradation is too big
> >> >>> to be called "some extra work".
> >> >>>
> >> >>> Both --num-threads=0 and --num-threads=1 are serial processing,
> >> >>> so the above "buffer fairness issue" will not be related to this
> >> >>> degradation.
> >> >>> What do you think causes this degradation?
> >> >>>
> >> >>
> >> >> I can't reproduce such a result at the moment, so I can't
> >> >> investigate further right now. I guess it may be caused by the
> >> >> underlying implementation of pthread.
> >> >> I reviewed the test results of patch v2 and found that the results
> >> >> are quite different on different machines.
> >> >
> >> > Hi Zhou Wenjian,
> >> >
> >> > I have done more tests on another machine with 128G memory, and got
> >> > these results:
> >> >
> >> > the size of vmcore is 300M with "-d 31"
> >> > makedumpfile -l --message-level 1 -d 31:
> >> > time: 8.6s      page-faults: 2272
> >> >
> >> > makedumpfile -l --num-threads 1 --message-level 1 -d 31:
> >> > time: 28.1s     page-faults: 2359
> >> >
> >> >
> >> > and the size of vmcore is 2.6G with "-d 0".
> >> > On this machine, I get the same result as yours:
> >> >
> >> >
> >> > makedumpfile -c --message-level 1 -d 0:
> >> > time: 597s      page-faults: 2287
> >> >
> >> > makedumpfile -c --num-threads 1 --message-level 1 -d 0:
> >> > time: 602s      page-faults: 2361
> >> >
> >> > makedumpfile -c --num-threads 2 --message-level 1 -d 0:
> >> > time: 337s      page-faults: 2397
> >> >
> >> > makedumpfile -c --num-threads 4 --message-level 1 -d 0:
> >> > time: 175s      page-faults: 2461
> >> >
> >> > makedumpfile -c --num-threads 8 --message-level 1 -d 0:
> >> > time: 103s      page-faults: 2611
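
For reference, the "-c -d 0" timings quoted above can be turned into
speedup and per-thread efficiency figures. This is plain arithmetic on the
quoted numbers, nothing more; the helper names speedup and efficiency are
only for illustration.

```python
# Times in seconds, copied from the "-c -d 0" runs quoted above.
baseline = 597.0                                   # without --num-threads
timings = {1: 602.0, 2: 337.0, 4: 175.0, 8: 103.0}  # --num-threads -> time


def speedup(threads):
    """Baseline time divided by the time with the given thread count."""
    return baseline / timings[threads]


def efficiency(threads):
    """Speedup divided by the number of threads (1.0 is ideal scaling)."""
    return speedup(threads) / threads


for n in sorted(timings):
    print(f"--num-threads {n}: speedup {speedup(n):.2f}x, "
          f"efficiency {efficiency(n):.0%}")
```

With these numbers --num-threads 1 is slightly slower than the baseline,
and the 8-thread run is a bit under 6 times faster.
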
> >> >
> >> >
> >> > But the machine from my first test is not under my control. Should
> >> > I wait for the first machine to do more tests?
> >> > If there are still some problems in my tests, please tell me.
> >> >
> >> 
> >> Thanks a lot for your tests, it seems that there is nothing wrong.
> >> And I don't have any ideas for more tests yet...
> >> 
> >> Could you provide the information about your CPU?
> >> I will do some further investigation later.
> >> 
> > 
> > OK, of course, here is the information of cpu:
> > 
> > # lscpu
> > Architecture:          x86_64
> > CPU op-mode(s):        32-bit, 64-bit
> > Byte Order:            Little Endian
> > CPU(s):                48
> > On-line CPU(s) list:   0-47
> > Thread(s) per core:    1
> > Core(s) per socket:    6
> > Socket(s):             8
> > NUMA node(s):          8
> > Vendor ID:             AuthenticAMD
> > CPU family:            16
> > Model:                 8
> > Model name:            Six-Core AMD Opteron(tm) Processor 8439 SE
> > Stepping:              0
> > CPU MHz:               2793.040
> > BogoMIPS:              5586.22
> > Virtualization:        AMD-V
> > L1d cache:             64K
> > L1i cache:             64K
> > L2 cache:              512K
> > L3 cache:              5118K
> > NUMA node0 CPU(s):     0,8,16,24,32,40
> > NUMA node1 CPU(s):     1,9,17,25,33,41
> > NUMA node2 CPU(s):     2,10,18,26,34,42
> > NUMA node3 CPU(s):     3,11,19,27,35,43
> > NUMA node4 CPU(s):     4,12,20,28,36,44
> > NUMA node5 CPU(s):     5,13,21,29,37,45
> > NUMA node6 CPU(s):     6,14,22,30,38,46
> > NUMA node7 CPU(s):     7,15,23,31,39,47
> 
> This CPU assignment on NUMA nodes looks interesting. Is it possible
> that this affects performance of makedumpfile? This is just a guess.
> 
> Could you check whether the performance gets improved if you run each
> thread on the same NUMA node? For example:
> 
>   # taskset -c 0,8,16,24 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0
> 
Hi HATAYAMA,

I think your guess is right, but your command has a small problem.

From my tests, NUMA placement did affect the performance, but not by much.
The average time with all CPUs on the same NUMA node:
# taskset -c 0,8,16,24,32 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0
is 314s
The average time with CPUs on different NUMA nodes:
# taskset -c 2,3,5,6,7 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0
is 354s
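
To put the two averages side by side, here is the relative difference as
a small calculation. It only uses the two times above; the helper name
slowdown_percent is just for illustration.

```python
same_node = 314.0   # seconds, taskset -c 0,8,16,24,32 (all on NUMA node0)
cross_node = 354.0  # seconds, taskset -c 2,3,5,6,7 (five different nodes)


def slowdown_percent(fast, slow):
    """How much longer the slow run takes, as a percentage of the fast run."""
    return (slow - fast) / fast * 100.0


print(f"cross-node run is {slowdown_percent(same_node, cross_node):.1f}% slower")
```
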

But I think if you want to use "--num-threads 4", the --cpu-list given
to "taskset -c" should contain at least 5 CPUs, otherwise the run takes
far too long.
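
As a sketch of how such a CPU list could be built from the lscpu output
quoted earlier: the snippet below picks num_threads + 1 CPUs from one NUMA
node. The "+ 1" only encodes my observation above that 4 threads need at
least 5 CPUs; it is an assumption, not something taken from the
makedumpfile source. The helper names are made up, and the parser only
handles comma-separated lists like the ones in this thread, not ranges
such as "0-5".

```python
import re

# Two of the NUMA lines from the lscpu output quoted in this thread.
LSCPU_NUMA = """\
NUMA node0 CPU(s):     0,8,16,24,32,40
NUMA node1 CPU(s):     1,9,17,25,33,41
"""


def parse_numa_nodes(text):
    """Map NUMA node number -> list of CPU ids (comma lists only)."""
    nodes = {}
    for line in text.splitlines():
        m = re.match(r"NUMA node(\d+) CPU\(s\):\s+([\d,]+)", line)
        if m:
            nodes[int(m.group(1))] = [int(c) for c in m.group(2).split(",")]
    return nodes


def taskset_cpu_list(nodes, node, num_threads):
    """Return a 'taskset -c' list of num_threads + 1 CPUs from one node."""
    needed = num_threads + 1  # assumption: one extra CPU besides the threads
    cpus = nodes[node]
    if len(cpus) < needed:
        raise ValueError(f"node{node} has only {len(cpus)} CPUs, need {needed}")
    return ",".join(str(c) for c in cpus[:needed])


nodes = parse_numa_nodes(LSCPU_NUMA)
print("taskset -c", taskset_cpu_list(nodes, 0, 4),
      "makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0")
```

For node0 and 4 threads this gives the same list as in my same-node test
above.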

Thanks,
Chao Fan

> Also, if this were the cause of this performance degradation, we might
> need to extend the nr_cpus= kernel option to choose the NUMA nodes we
> use; though, we might already be able to do that in combination with
> other kernel features.
> 
> > Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> > mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall mmxext fxsr_opt
> > pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc
> > extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic
> > cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
> > hw_pstate npt lbrv svm_lock nrip_save pausefilter vmmcall
> > 
> >> But I still believe it's better not to use "-l -d 31" and "--num-threads"
> >> at the same time, though it's very strange that the performance
> >> degradation is so big.
> >> 
> >> --
> >> Thanks
> >> Zhou
> >> 
> >> > Thanks,
> >> > Chao Fan
> >> >
> >> >
> >> >>
> >> >> It seems that I can get almost the same result as Chao on
> >> >> "PRIMEQUEST 1800E".
> >> >>
> >> >> ###################################
> >> >> - System: PRIMERGY RX300 S6
> >> >> - CPU: Intel(R) Xeon(R) CPU x5660
> >> >> - memory: 16GB
> >> >> ###################################
> >> >> ************ makedumpfile -d 7 ******************
> >> >>                   core-data       0       256
> >> >>           threads-num
> >> >> -l
> >> >>           0                       10      144
> >> >>           4                       5       110
> >> >>           8                       5       111
> >> >>           12                      6       111
> >> >>
> >> >> ************ makedumpfile -d 31 ******************
> >> >>                   core-data       0       256
> >> >>           threads-num
> >> >> -l
> >> >>           0                       0       0
> >> >>           4                       2       2
> >> >>           8                       2       3
> >> >>           12                      2       3
> >> >>
> >> >> ###################################
> >> >> - System: PRIMEQUEST 1800E
> >> >> - CPU: Intel(R) Xeon(R) CPU E7540
> >> >> - memory: 32GB
> >> >> ###################################
> >> >> ************ makedumpfile -d 7 ******************
> >> >>                   core-data        0       256
> >> >>           threads-num
> >> >> -l
> >> >>           0                        34      270
> >> >>           4                        63      154
> >> >>           8                        64      131
> >> >>           12                       65      159
> >> >>
> >> >> ************ makedumpfile -d 31 ******************
> >> >>                   core-data        0       256
> >> >>           threads-num
> >> >> -l
> >> >>           0                        2       1
> >> >>           4                        48      48
> >> >>           8                        48      49
> >> >>           12                       49      50
> >> >>
> >> >>>> I'm not so sure whether it is a problem that the performance
> >> >>>> degradation is so big.
> >> >>>> But I think if it works as expected in other cases, this won't be
> >> >>>> a problem (or a problem that needs to be fixed), since the
> >> >>>> performance degradation exists in theory.
> >> >>>>
> >> >>>> Or the current implementation should be replaced by a new algorithm.
> >> >>>> For example:
> >> >>>> We can add an array to record whether each page is filtered or not.
> >> >>>> And only the unfiltered pages will take a buffer.
> >> >>>
> >> >>> We should discuss how to implement a new mechanism, I'll mention
> >> >>> this later.
> >> >>>
> >> >>>> But I'm not sure if it is worth it.
> >> >>>> Since "-l -d 31" is already fast enough, the new algorithm can't
> >> >>>> help much either.
> >> >>>
> >> >>> Basically the faster, the better. There is no obvious target time.
> >> >>> If there is room for improvement, we should do it.
> >> >>>
> >> >>
> >> >> Maybe we can improve the performance of "-c -d 31" in some cases.
> >> >>
> >> >> BTW, we can easily get the theoretical performance by using
> >> >> "--split".
> >> >>
> >> >> --
> >> >> Thanks
> >> >> Zhou
> >> >>
> >> >>
> >> >>
> >> >> _______________________________________________
> >> >> kexec mailing list
> >> >> kexec@lists.infradead.org
> >> >> http://lists.infradead.org/mailman/listinfo/kexec
> >> >>
> >> 
> >> 
> >> 
> >> 
> >> 
> > 
> --
> Thanks.
> HATAYAMA, Daisuke
>
