Kexec Archive on lore.kernel.org
From: "Zhou, Wenjian/周文剑" <zhouwj-fnst@cn.fujitsu.com>
To: Atsushi Kumagai <ats-kumagai@wm.jp.nec.com>
Cc: "kexec@lists.infradead.org" <kexec@lists.infradead.org>
Subject: Re: [PATCH v3 00/10] makedumpfile: parallel processing
Date: Wed, 5 Aug 2015 10:46:10 +0800	[thread overview]
Message-ID: <55C178F2.80406@cn.fujitsu.com> (raw)
In-Reply-To: <55BB4178.3040602@cn.fujitsu.com>

ping...

-- 
Thanks
Zhou Wenjian

On 07/31/2015 05:35 PM, "Zhou, Wenjian/周文剑" wrote:
> On 07/31/2015 04:27 PM, Atsushi Kumagai wrote:
>>> On 07/23/2015 02:20 PM, Atsushi Kumagai wrote:
>>>>> Hello Kumagai,
>>>>>
>>>>> PATCH v3 has improved the performance.
>>>>> The performance degradation in PATCH v2 was mainly caused by the page
>>>>> faults produced by compress2().
>>>>>
>>>>> I wrote some code to test the performance of compress2(). Standalone,
>>>>> it takes almost the same time and produces the same number of page
>>>>> faults as executing compress2() in a thread.
>>>>>
>>>>> To reduce the page faults, I had to add the following to
>>>>> kdump_thread_function_cyclic():
>>>>>
>>>>> +    /*
>>>>> +     * lock memory to reduce page_faults by compress2()
>>>>> +     */
>>>>> +    void *temp = malloc(1);
>>>>> +    memset(temp, 0, 1);
>>>>> +    mlockall(MCL_CURRENT);
>>>>> +    free(temp);
>>>>> +
>>>>>
>>>>> With this change, the threaded and non-threaded runs have almost the
>>>>> same performance.
>>>>
>>>> Hmm... I can't get good results with this patch; many page faults still
>>>> occur. I guess mlock changes when the page faults occur, but not the
>>>> total number of page faults.
>>>> Could you explain why compress2() causes many page faults only in a
>>>> thread? Then I may understand why this patch is meaningful.
>>>>
>>>
>>> Actually, it also causes many page faults even outside a thread, if
>>> info->bitmap2 is not freed in makedumpfile.
>>>
>>> I wrote some code to test the performance of compress2().
>>>
>>> <cut>
>>> buf = malloc(PAGE_SIZE);
>>> bufout = malloc(SIZE_OUT);
>>> memset(buf, 1, PAGE_SIZE / 2);
>>> while (1)
>>>      compress2(bufout, &size_out, buf, PAGE_SIZE, Z_BEST_SPEED);
>>> <cut>
>>>
>>> The code looks roughly like this.
>>> It causes many page faults.
>>>
>>> But if the code is changed to the following, it behaves much better.
>>>
>>> <cut>
>>> temp = malloc(TEMP_SIZE);
>>> memset(temp, 0, TEMP_SIZE);
>>> free(temp);
>>>
>>> buf = malloc(PAGE_SIZE);
>>> bufout = malloc(SIZE_OUT);
>>> memset(buf, 1, PAGE_SIZE / 2);
>>> while (1)
>>>      compress2(bufout, &size_out, buf, PAGE_SIZE, Z_BEST_SPEED);
>>> <cut>
>>>
>>> TEMP_SIZE must be large enough
>>> (larger than 135097 works on my machine).
>>>
>>>
>>> In a thread, the following code reduces the page faults.
>>>
>>> <cut>
>>> temp = malloc(1);
>>> memset(temp, 0, 1);
>>> mlockall(MCL_CURRENT);
>>> free(temp);
>>>
>>> buf = malloc(PAGE_SIZE);
>>> bufout = malloc(SIZE_OUT);
>>> memset(buf, 1, PAGE_SIZE / 2);
>>> while (1)
>>>      compress2(bufout, &size_out, buf, PAGE_SIZE, Z_BEST_SPEED);
>>> <cut>
>>>
>>> I don't yet know why.
>>
>> I assume that we are facing the known issue of glibc:
>>
>>    https://sourceware.org/ml/libc-alpha/2015-03/msg00270.html
>>
>> According to the thread above, a per-thread arena is grown and trimmed
>> more readily than the main arena.
>> Actually, compress2() calls malloc() and free() internally each time it
>> is called, so every compression pass causes page faults.
>> Moreover, I confirmed that many madvise(MADV_DONTNEED) are invoked only
>> when compress2() is called in thread.
>>
>> OTOH, in the lzo case, a temporary working buffer is allocated on the
>> caller side, which reduces the number of malloc()/free() pairs.
>> (But I'm not sure why snappy doesn't hit this issue; its compression
>> buffer may be smaller than the trim threshold.)
>>
>> Anyway, it's basically hard for zlib to avoid this issue on the
>> application side, so it seems we have to accept the performance
>> degradation it causes. Unfortunately, the main target of this
>> multi-thread feature is zlib, as you measured, so we should resolve
>> this issue somehow.
>>
>> Nevertheless, even now we can get some benefit from parallel processing,
>> so let's start discussing the implementation of the parallel processing
>> feature to get this patch accepted. I have some comments:
>>
>>    - read_pfn_parallel() doesn't use the cache feature (cache.c); is
>>      that intentional?
>>
>
> Yes. Since each page is read only once here, the cache feature seems
> unnecessary.
>
>>    - --num-buffers is now tunable, but neither the man page nor your
>>      benchmark mentions the benefit of this parameter.
>>
>
> The default value of num-buffers is 50. Originally the value had a great
> influence on performance, but since we changed the logic in the 2nd
> version of the patch set, more buffers bring little improvement (1000
> buffers may give about 1%).
> I'm considering whether the option should be removed; what do you think?
>
> BTW, the mlockall() code added in the 3rd version works well on several
> machines here. Should I keep it?
> With it, madvise(MADV_DONTNEED) fails inside compress2() and the
> performance is as expected on these machines.
>
>

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Thread overview: 18+ messages
2015-07-21  6:29 [PATCH v3 00/10] makedumpfile: parallel processing Zhou Wenjian
2015-07-21  6:29 ` [PATCH v3 01/10] Add readpage_kdump_compressed_parallel Zhou Wenjian
2015-07-21  6:29 ` [PATCH v3 02/10] Add mappage_elf_parallel Zhou Wenjian
2015-07-21  6:29 ` [PATCH v3 03/10] Add readpage_elf_parallel Zhou Wenjian
2015-07-21  6:29 ` [PATCH v3 04/10] Add read_pfn_parallel Zhou Wenjian
2015-07-21  6:29 ` [PATCH v3 05/10] Add function to initial bitmap for parallel use Zhou Wenjian
2015-07-21  6:29 ` [PATCH v3 06/10] Add filter_data_buffer_parallel Zhou Wenjian
2015-07-21  6:29 ` [PATCH v3 07/10] Add write_kdump_pages_parallel to allow parallel process Zhou Wenjian
2015-07-21  6:29 ` [PATCH v3 08/10] Initial and free data used for " Zhou Wenjian
2015-07-21  6:29 ` [PATCH v3 09/10] Make makedumpfile available to read and compress pages parallelly Zhou Wenjian
2015-07-21  6:29 ` [PATCH v3 10/10] Add usage and manual about multiple threads process Zhou Wenjian
2015-07-21  7:10 ` [PATCH v3 00/10] makedumpfile: parallel processing "Zhou, Wenjian/周文剑"
2015-07-23  6:20   ` Atsushi Kumagai
2015-07-23  6:39     ` "Zhou, Wenjian/周文剑"
2015-07-31  8:27       ` Atsushi Kumagai
2015-07-31  9:35         ` "Zhou, Wenjian/周文剑"
2015-08-05  2:46           ` "Zhou, Wenjian/周文剑" [this message]
2015-08-06  2:46             ` Atsushi Kumagai
