Re: [PATCH RFC 00/11] makedumpfile: parallel processing

From: "Zhou, Wenjian/周文剑" <zhouwj-fnst@cn.fujitsu.com>
To: Atsushi Kumagai <ats-kumagai@wm.jp.nec.com>
Cc: "kexec@lists.infradead.org" <kexec@lists.infradead.org>
Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
Date: Mon, 14 Dec 2015 16:59:10 +0800
Message-ID: <566E84DE.2020804@cn.fujitsu.com>
In-Reply-To: <0910DD04CBD6DE4193FCF86B9C00BE9701E0D0DA@BPXM01GP.gisp.nec.co.jp>

On 12/14/2015 04:26 PM, Atsushi Kumagai wrote:
>>>> Think about this: with huge memory, most of the pages will be filtered, and
>>>> we have 5 buffers.
>>>>
>>>> page1       page2      page3     page4     page5      page6       page7 .....
>>>> [buffer1]   [2]        [3]       [4]       [5]
>>>> unfiltered  filtered   filtered  filtered  filtered   unfiltered  filtered
>>>>
>>>> Since a filtered page also takes a buffer, page6 can't be compressed
>>>> while page1 is being compressed.
>>>> That is why it prevents parallel compression.
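
To make the effect concrete, here is a minimal Python sketch. It is not makedumpfile code: it only assumes that buffers are handed out to consecutive pages, as in the picture above, and the names `NUM_BUFFERS` and `unfiltered_in_window` are hypothetical. It counts how many pages that really need compressing hold a buffer at the same time.

```python
NUM_BUFFERS = 5  # assumption: same number of buffers as in the picture


def unfiltered_in_window(filtered, start, num_buffers=NUM_BUFFERS):
    """Count unfiltered pages among the pages that currently hold a buffer.

    filtered[i] is True when page i is filtered (excluded from the dump).
    In this model every page, filtered or not, takes one buffer.
    """
    window = filtered[start:start + num_buffers]
    return sum(1 for is_filtered in window if not is_filtered)


# page1 .. page7 from the picture: only page1 and page6 are unfiltered.
filtered = [False, True, True, True, True, False, True]

# Buffers 1..5 hold page1..page5, so only page1 can be compressed.
print(unfiltered_in_window(filtered, 0))  # 1
```

With only one unfiltered page holding a buffer, additional compression threads have nothing to work on until the window moves past the filtered pages.
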
>>>
>>> Thanks for your explanation, I understand.
>>> This is just an issue of the current implementation; there is no
>>> reason to keep this restriction.
>>>
>>>>> Further, according to Chao's benchmark, there is a big performance
>>>>> degradation even if the number of threads is 1. (58s vs 240s)
>>>>> The current implementation seems to have some problems, we should
>>>>> solve them.
>>>>>
>>>>
>>>> If "-d 31" is specified, on the one hand we can't save time by compressing
>>>> in parallel, and on the other hand we introduce some extra work by adding
>>>> "--num-threads". So a performance degradation is to be expected.
>>>
>>> Sure, there must be some overhead due to "some extra work"(e.g. exclusive lock),
>>> but "--num-threads=1 is 4 times slower than --num-threads=0" still sounds
>>> too slow, the degradation is too big to be called "some extra work".
>>>
>>> Both --num-threads=0 and --num-threads=1 are serial processing, so
>>> the above "buffer fairness issue" cannot be related to this degradation.
>>> What do you think causes this degradation?
>>>
>>
>> I can't reproduce such a result at the moment, so I can't investigate further
>> right now. I guess it may be caused by the underlying implementation of pthread.
>> I reviewed the test results of patch v2 and found that the results differ
>> considerably between machines.
>
> Unfortunately, I can't reproduce such a big degradation either.
> According to Chao's verification, this issue seems different from
> the "too many page fault issue" that we solved.
> I have no idea, but at least I want to confirm whether this issue
> is avoidable or not.
>
>> It seems that I can get almost the same result as Chao on the "PRIMEQUEST 1800E".
>>
>> ###################################
>> - System: PRIMERGY RX300 S6
>> - CPU: Intel(R) Xeon(R) CPU x5660
>> - memory: 16GB
>> ###################################
>> ************ makedumpfile -d 7 ******************
>>                  core-data       0       256
>>          threads-num
>> -l
>>          0                       10      144
>>          4                       5       110
>>          8                       5       111
>>          12                      6       111
>>
>> ************ makedumpfile -d 31 ******************
>>                  core-data       0       256
>>          threads-num
>> -l
>>          0                       0       0
>>          4                       2       2
>>          8                       2       3
>>          12                      2       3
>>
>> ###################################
>> - System: PRIMEQUEST 1800E
>> - CPU: Intel(R) Xeon(R) CPU E7540
>> - memory: 32GB
>> ###################################
>> ************ makedumpfile -d 7 ******************
>>                  core-data        0       256
>>          threads-num
>> -l
>>          0                        34      270
>>          4                        63      154
>>          8                        64      131
>>          12                       65      159
>>
>> ************ makedumpfile -d 31 ******************
>>                  core-data        0       256
>>          threads-num
>> -l
>>          0                        2       1
>>          4                        48      48
>>          8                        48      49
>>          12                       49      50
>>
>>>> I'm not sure whether such a big performance degradation is a problem.
>>>> But I think that if it works as expected in the other cases, this isn't a
>>>> problem (or at least not one that needs to be fixed), since the performance
>>>> degradation is expected in theory.
>>>>
>>>> Or the current implementation should be replaced by a new algorithm.
>>>> For example:
>>>> We can add an array to record whether each page is filtered or not,
>>>> so that only unfiltered pages take a buffer.
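
The array idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not a proposal for the actual C code: `filtered` stands for the suggested array, pages are numbered from 0, and `assign_buffers` is a hypothetical helper name.

```python
def assign_buffers(filtered, num_buffers):
    """Return the indexes of the pages that get a buffer.

    filtered[i] is True when page i is filtered. Filtered pages are only
    recorded in the array and skipped, so every buffer holds a page that
    really needs to be compressed.
    """
    holders = []
    for page, is_filtered in enumerate(filtered):
        if is_filtered:
            continue  # filtered page: takes no buffer
        holders.append(page)
        if len(holders) == num_buffers:
            break  # all buffers are in use
    return holders


# Pages 0, 5 and 7 are unfiltered; everything else is filtered.
filtered = [False, True, True, True, True, False, True, False]

print(assign_buffers(filtered, 5))  # [0, 5, 7]
```

In this model the unfiltered pages can all be compressed at the same time even though filtered pages lie between them.
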
>>>
>>> We should discuss how to implement a new mechanism; I'll mention this later.
>>>
>>>> But I'm not sure it is worth it.
>>>> Since "-l -d 31" is already fast enough, the new algorithm can't help much either.
>>>
>>> Basically the faster, the better. There is no obvious target time.
>>> If there is room for improvement, we should do it.
>>>
>>
>> Maybe we can improve the performance of "-c -d 31" in some cases.
>
> Yes, the buffer is used for -c, -l and -p, not only for -l.
> It would be useful to improve that.
>
>> BTW, we can easily get the theoretical performance by using the "--split".
>
> Are you sure ? You persuaded me in the thread below:
>
>    http://lists.infradead.org/pipermail/kexec/2015-June/013881.html
>
> --num-threads is orthogonal to --split, so it's better to use both
> options, since they try to solve different bottlenecks.
> That's why I decided to merge your multi thread feature.
>
> However, what you said sounds like --split is a superset of --num-threads.
> Don't you need the multi-thread feature?
>

I only meant the performance.
There is no doubt that we will use multiple threads with --split in the future.

But as we all know, threads and processes have some characteristics in common.
And in makedumpfile, if we use "--split core1 core2 core3 core4" and
"--num-threads 4" separately, the time spent should not differ much.

Since the logic of "--split" is simpler, if we can't improve the performance
of "-l -d 31" with "--split", we don't have much chance of doing it with
"--num-threads" either.

That is all I meant.
Of course, --split is not a superset of --num-threads.

-- 
Thanks
Zhou
