* [PATCH] makedumpfile: --split: assign fair I/O workloads for each process
@ 2014-03-18 3:09 HATAYAMA Daisuke
2014-03-25 1:14 ` Atsushi Kumagai
0 siblings, 1 reply; 5+ messages in thread
From: HATAYAMA Daisuke @ 2014-03-18 3:09 UTC (permalink / raw)
To: kumagai-atsushi; +Cc: kexec
From: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
When the --split option is specified, each process should be assigned
a fair I/O workload to maximize the benefit of parallel processing.

However, the current implementation of setup_splitting() in cyclic
mode ignores filtering entirely: it divides the PFN range evenly, so
the I/O workload of each process can easily become biased.

This patch addresses the issue by implementing fair I/O workload
assignment as setup_splitting_cyclic().

Note: If --split is specified in cyclic mode, we do filtering three
times: in get_dumpable_pages_cyclic(), in setup_splitting_cyclic() and
in writeout_dumpfile(). According to past benchmarks, filtering takes
about 10 minutes on systems with huge memory, so it might be necessary
to optimize filtering or setup_filtering_cyclic().

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
---
makedumpfile.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 41 insertions(+), 7 deletions(-)
diff --git a/makedumpfile.c b/makedumpfile.c
index 0bd8b55..d310891 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -7885,26 +7885,60 @@ out:
 	return ret;
 }
 
+static int setup_splitting_cyclic(void)
+{
+	int i;
+	unsigned long long j, pfn_per_dumpfile;
+	unsigned long long start_pfn, end_pfn;
+
+	pfn_per_dumpfile = info->num_dumpable / info->num_dumpfile;
+	start_pfn = end_pfn = 0;
+
+	for (i = 0; i < info->num_dumpfile - 1; i++) {
+		struct cycle cycle;
+
+		start_pfn = end_pfn;
+		j = pfn_per_dumpfile;
+
+		for_each_cycle(start_pfn, info->max_mapnr, &cycle) {
+			if (!exclude_unnecessary_pages_cyclic(&cycle))
+				return FALSE;
+			while (j && end_pfn < cycle.end_pfn) {
+				if (is_dumpable_cyclic(info->partial_bitmap2,
+						       end_pfn, &cycle))
+					j--;
+				end_pfn++;
+			}
+			if (!j)
+				break;
+		}
+
+		SPLITTING_START_PFN(i) = start_pfn;
+		SPLITTING_END_PFN(i) = end_pfn;
+	}
+
+	SPLITTING_START_PFN(info->num_dumpfile - 1) = end_pfn;
+	SPLITTING_END_PFN(info->num_dumpfile - 1) = info->max_mapnr;
+
+	return TRUE;
+}
+
 int
 setup_splitting(void)
 {
 	int i;
 	unsigned long long j, pfn_per_dumpfile;
 	unsigned long long start_pfn, end_pfn;
-	unsigned long long num_dumpable = get_num_dumpable();
 	struct dump_bitmap bitmap2;
 
 	if (info->num_dumpfile <= 1)
 		return FALSE;
 
 	if (info->flag_cyclic) {
-		for (i = 0; i < info->num_dumpfile; i++) {
-			SPLITTING_START_PFN(i) = divideup(info->max_mapnr, info->num_dumpfile) * i;
-			SPLITTING_END_PFN(i) = divideup(info->max_mapnr, info->num_dumpfile) * (i + 1);
-		}
-		if (SPLITTING_END_PFN(i-1) > info->max_mapnr)
-			SPLITTING_END_PFN(i-1) = info->max_mapnr;
+		return setup_splitting_cyclic();
 	} else {
+		unsigned long long num_dumpable = get_num_dumpable();
+
 		initialize_2nd_bitmap(&bitmap2);
 
 		pfn_per_dumpfile = num_dumpable / info->num_dumpfile;
--
1.8.5.3
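
The counting loop in setup_splitting_cyclic() can be illustrated in isolation. The following is a minimal sketch, not makedumpfile code: it assumes the whole dumpable map fits in a plain byte array (the patch instead regenerates the bitmap cycle by cycle), and `struct pfn_range`, `split_by_dumpable()` and `count_dumpable()` are hypothetical names introduced only for this illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for one output file's PFN range [start, end). */
struct pfn_range {
	unsigned long long start;
	unsigned long long end;
};

/*
 * Split [0, max_pfn) into nfiles ranges that each hold roughly the same
 * number of dumpable pages. dumpable[pfn] != 0 means the page survives
 * filtering. Assumption: the whole map is available in memory at once.
 */
static void split_by_dumpable(const unsigned char *dumpable,
			      unsigned long long max_pfn,
			      struct pfn_range *out, int nfiles)
{
	unsigned long long total = 0, per_file, remaining, pfn;
	int i;

	assert(nfiles >= 1);

	/* First pass: count all dumpable pages. */
	for (pfn = 0; pfn < max_pfn; pfn++)
		if (dumpable[pfn])
			total++;

	per_file = total / nfiles;
	pfn = 0;

	/* Second pass: advance until each file has its share. */
	for (i = 0; i < nfiles - 1; i++) {
		out[i].start = pfn;
		remaining = per_file;
		while (remaining && pfn < max_pfn) {
			if (dumpable[pfn])
				remaining--;
			pfn++;
		}
		out[i].end = pfn;
	}

	/* The last file takes the rest, including the division remainder. */
	out[nfiles - 1].start = pfn;
	out[nfiles - 1].end = max_pfn;
}

/* Count the dumpable pages inside one range. */
static unsigned long long count_dumpable(const unsigned char *dumpable,
					 struct pfn_range r)
{
	unsigned long long n = 0, pfn;

	for (pfn = r.start; pfn < r.end; pfn++)
		if (dumpable[pfn])
			n++;
	return n;
}
```

With the 16-page map {1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1} and two files, an even PFN split would give the files 6 and 2 dumpable pages, while this sketch yields the ranges [0, 4) and [4, 16) with 4 dumpable pages each.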
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
* RE: [PATCH] makedumpfile: --split: assign fair I/O workloads for each process
2014-03-18 3:09 [PATCH] makedumpfile: --split: assign fair I/O workloads for each process HATAYAMA Daisuke
@ 2014-03-25 1:14 ` Atsushi Kumagai
2014-03-25 5:52 ` "Hatayama, Daisuke/畑山 大輔"
0 siblings, 1 reply; 5+ messages in thread
From: Atsushi Kumagai @ 2014-03-25 1:14 UTC (permalink / raw)
To: d.hatayama@jp.fujitsu.com; +Cc: kexec@lists.infradead.org
>From: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
>
>When the --split option is specified, each process should be assigned
>a fair I/O workload to maximize the benefit of parallel processing.
>
>However, the current implementation of setup_splitting() in cyclic
>mode ignores filtering entirely: it divides the PFN range evenly, so
>the I/O workload of each process can easily become biased.
>
>This patch addresses the issue by implementing fair I/O workload
>assignment as setup_splitting_cyclic().
>
>Note: If --split is specified in cyclic mode, we do filtering three
>times: in get_dumpable_pages_cyclic(), in setup_splitting_cyclic() and
>in writeout_dumpfile(). According to past benchmarks, filtering takes
>about 10 minutes on systems with huge memory, so it might be necessary
>to optimize filtering or setup_filtering_cyclic().

Sorry, I lost the result of that benchmark, could you give me the URL?
I'd like to confirm that the advantage of fair I/O will exceed the
10 minutes disadvantage.
Thanks
Atsushi Kumagai
* Re: [PATCH] makedumpfile: --split: assign fair I/O workloads for each process
2014-03-25 1:14 ` Atsushi Kumagai
@ 2014-03-25 5:52 ` "Hatayama, Daisuke/畑山 大輔"
2014-03-26 5:38 ` HATAYAMA Daisuke
0 siblings, 1 reply; 5+ messages in thread
From: "Hatayama, Daisuke/畑山 大輔" @ 2014-03-25 5:52 UTC (permalink / raw)
To: Atsushi Kumagai; +Cc: kexec@lists.infradead.org
(2014/03/25 10:14), Atsushi Kumagai wrote:
> Sorry, I lost the result of that benchmark, could you give me the URL?
> I'd like to confirm that the advantage of fair I/O will exceed the
> 10 minutes disadvantage.
>
Here are two benchmarks by Jingbai Ma and myself.
http://lists.infradead.org/pipermail/kexec/2013-March/008515.html
http://lists.infradead.org/pipermail/kexec/2013-March/008517.html
Note that Jingbai Ma's results are the sum of get_dumpable_cyclic() and writeout_dumpfile(), so they look about twice as large as mine, but the two actually show almost the same performance.

In summary, each result shows about 40 seconds per 1 TiB, so most systems are not affected very much. On 12 TiB of memory, which is the current maximum memory size of a Fujitsu system, we need 480 seconds == 8 minutes more. But this cost is stable in the sense that the time never suddenly becomes long in some rare worst case, so I am optimistic in this sense.
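
The estimate above is simple linear scaling. As a small sketch under that assumption (filtering time proportional to memory size, at the roughly 40 seconds per TiB taken from the benchmarks cited in this message), with `filter_overhead_seconds()` being a hypothetical helper that is not part of makedumpfile:

```c
#include <assert.h>
#include <limits.h>

/*
 * Extra time, in seconds, spent on additional filtering passes.
 * mem_tib:      physical memory size in TiB
 * sec_per_tib:  measured filtering cost per TiB for one pass
 * extra_passes: number of additional passes over memory
 */
static unsigned long filter_overhead_seconds(unsigned long mem_tib,
					     unsigned long sec_per_tib,
					     unsigned long extra_passes)
{
	/* Guard against overflow of the intermediate products. */
	assert(sec_per_tib == 0 || mem_tib <= ULONG_MAX / sec_per_tib);
	assert(extra_passes == 0 ||
	       mem_tib * sec_per_tib <= ULONG_MAX / extra_passes);

	return mem_tib * sec_per_tib * extra_passes;
}
```

For the 12 TiB case and one extra pass, `filter_overhead_seconds(12, 40, 1)` gives the 480 seconds quoted above.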
The other ideas to deal with the issue are:
- Parallelize the counting-up process. But it might be difficult to parallelize the 2nd pass, which seems to be inherently serial processing.
- Instead of doing the 2nd pass, make a process that terminates early join a process that is still running. But it might be cumbersome to implement this without using pthread.
Thanks.
HATAYAMA, Daisuke
* Re: [PATCH] makedumpfile: --split: assign fair I/O workloads for each process
2014-03-25 5:52 ` "Hatayama, Daisuke/畑山 大輔"
@ 2014-03-26 5:38 ` HATAYAMA Daisuke
2014-03-27 5:18 ` Atsushi Kumagai
0 siblings, 1 reply; 5+ messages in thread
From: HATAYAMA Daisuke @ 2014-03-26 5:38 UTC (permalink / raw)
To: kumagai-atsushi; +Cc: kexec
From: "Hatayama, Daisuke/畑山 大輔" <d.hatayama@jp.fujitsu.com>
Subject: Re: [PATCH] makedumpfile: --split: assign fair I/O workloads for each process
Date: Tue, 25 Mar 2014 14:52:36 +0900
> Here are two benchmarks by Jingbai Ma and myself.
>
> http://lists.infradead.org/pipermail/kexec/2013-March/008515.html
> http://lists.infradead.org/pipermail/kexec/2013-March/008517.html
>
> Note that Jingbai Ma's results are the sum of get_dumpable_cyclic() and writeout_dumpfile(), so they look about twice as large as mine, but the two actually show almost the same performance.
>
> In summary, each result shows about 40 seconds per 1 TiB, so most systems are not affected very much. On 12 TiB of memory, which is the current maximum memory size of a Fujitsu system, we need 480 seconds == 8 minutes more. But this cost is stable in the sense that the time never suddenly becomes long in some rare worst case, so I am optimistic in this sense.
>
> The other ideas to deal with the issue are:
>
> - Parallelize the counting-up process. But it might be difficult to parallelize the 2nd pass, which seems to be inherently serial processing.
>
> - Instead of doing the 2nd pass, make a process that terminates early join a process that is still running. But it might be cumbersome to implement this without using pthread.
>
I noticed that it's possible to create a table of dumpable pages with
a relatively small amount of memory by managing memory in blocks. This
is just a kind of page table management.

For example, define a block as a 1 GiB-aligned region and assume a
system with 64 TiB of physical memory (which is the current maximum on
x86_64).
Then,
64 TiB / 1 GiB = 64 Ki blocks
Each entry of the table we consider here is 8 bytes and holds the
number of dumpable pages in one 1 GiB block. So, the total size of the
table is:
8 B * 64 Ki blocks = 512 KiB
Counting up the dumpable pages in each 1 GiB block can be done in a
single pass; get_dumpable_cyclic() makes one pass too.

Then, it's possible to assign the amount of I/O to each process fairly
enough. The difference is at most 1 GiB. If the disk speed is
100 MiB/sec, 1 GiB corresponds to only about 10 seconds.
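
To make the block-table idea concrete, here is a minimal sketch under stated assumptions, not a proposed implementation: the table is a plain array of 64-bit counters that has already been filled in by a single counting pass, and `split_by_block_table()` is a hypothetical name. It hands out whole blocks greedily, so each process except the last exceeds its target by less than one block's worth of pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assign whole blocks to nprocs processes so that each gets roughly
 * total / nprocs dumpable pages.
 *
 * counts[b]      number of dumpable pages in block b (assumed prefilled)
 * first_block[]  must have nprocs + 1 entries; process p is given blocks
 *                [first_block[p], first_block[p + 1])
 */
static void split_by_block_table(const uint64_t *counts, size_t nblocks,
				 size_t *first_block, size_t nprocs)
{
	uint64_t total = 0, target, acc;
	size_t b = 0, p, i;

	assert(nprocs >= 1);

	for (i = 0; i < nblocks; i++)
		total += counts[i];
	target = total / nprocs;

	for (p = 0; p + 1 < nprocs; p++) {
		first_block[p] = b;
		acc = 0;
		/* Take blocks until this process reaches its target. */
		while (b < nblocks && acc < target) {
			acc += counts[b];
			b++;
		}
	}

	/* The last process takes every remaining block. */
	first_block[nprocs - 1] = b;
	first_block[nprocs] = nblocks;
}
```

For the per-block counts {10, 0, 0, 10, 5, 5, 10, 0} and two processes, the target is 20 pages each, and the sketch assigns blocks [0, 4) to the first process and [4, 8) to the second.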
If you think 512 KiB is not small enough, we could increase the block
size a little more. With 8 GiB blocks, the table size is only 64 KiB,
and 8 GiB of data corresponds to about 80 seconds on typical disks.
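
The trade-off between table size and worst-case imbalance can be checked with a few lines of arithmetic. This is only a sketch: `table_bytes()` and `skew_seconds()` are hypothetical helpers, one 8-byte counter per block is assumed as described above, and the 100 MiB/sec disk speed is the figure used in this message rather than a measurement.

```c
#include <assert.h>
#include <stdint.h>

#define KIB (1ULL << 10)
#define MIB (1ULL << 20)
#define GIB (1ULL << 30)
#define TIB (1ULL << 40)

/* Bytes needed for a table with one 8-byte counter per block. */
static uint64_t table_bytes(uint64_t mem_bytes, uint64_t block_bytes)
{
	assert(block_bytes > 0);
	/* Round up so that a partial trailing block still gets an entry. */
	return ((mem_bytes + block_bytes - 1) / block_bytes) * sizeof(uint64_t);
}

/* Time, in seconds, to write one block's worth of imbalance to disk. */
static double skew_seconds(uint64_t block_bytes, uint64_t disk_bytes_per_sec)
{
	assert(disk_bytes_per_sec > 0);
	return (double)block_bytes / (double)disk_bytes_per_sec;
}
```

For 64 TiB of memory, `table_bytes()` returns 512 KiB with 1 GiB blocks and 64 KiB with 8 GiB blocks, and `skew_seconds()` at 100 MiB/sec gives roughly 10 and 82 seconds respectively, in line with the rounded figures above.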
What do you think of this?
Thanks.
HATAYAMA, Daisuke
* RE: [PATCH] makedumpfile: --split: assign fair I/O workloads for each process
2014-03-26 5:38 ` HATAYAMA Daisuke
@ 2014-03-27 5:18 ` Atsushi Kumagai
0 siblings, 0 replies; 5+ messages in thread
From: Atsushi Kumagai @ 2014-03-27 5:18 UTC (permalink / raw)
To: d.hatayama@jp.fujitsu.com; +Cc: kexec@lists.infradead.org
>I noticed that it's possible to create a table of dumpable pages with
>a relatively small amount of memory by managing memory in blocks. This
>is just a kind of page table management.
>
>For example, define a block as a 1 GiB-aligned region and assume a
>system with 64 TiB of physical memory (which is the current maximum on
>x86_64).
>
>Then,
>
> 64 TiB / 1 GiB = 64 Ki blocks
>
>Each entry of the table we consider here is 8 bytes and holds the
>number of dumpable pages in one 1 GiB block. So, the total size of the
>table is:
>
> 8 B * 64 Ki blocks = 512 KiB
>
>Counting up the dumpable pages in each 1 GiB block can be done in a
>single pass; get_dumpable_cyclic() makes one pass too.
>
>Then, it's possible to assign the amount of I/O to each process fairly
>enough. The difference is at most 1 GiB. If the disk speed is
>100 MiB/sec, 1 GiB corresponds to only about 10 seconds.
>
>If you think 512 KiB is not small enough, we could increase the block
>size a little more. With 8 GiB blocks, the table size is only 64 KiB,
>and 8 GiB of data corresponds to about 80 seconds on typical disks.
>
>What do you think of this?

Good, I prefer this to the first one.
Even the first one can't achieve complete fairness because of zero
pages, so we should accept the 1 GiB difference as well.
I think 512 KiB will not cause a problem in practice, so please go
ahead with this idea.
Thanks
Atsushi Kumagai