* makedumpfile memory usage grows with system memory size
From: Don Zickus <dzickus@redhat.com>
Date: 2012-03-28 21:22 UTC
To: oomichi; Cc: kexec
(34+ messages in thread)

Hello Ken'ichi-san,

I was talking to Vivek about kdump memory requirements and he mentioned
that they vary based on how much system memory is installed.

I was interested in knowing why that was, and again he mentioned that
makedumpfile needs lots of memory if it is running on a large machine
(for example, one with 1TB of system memory).

Looking through the makedumpfile README and using what Vivek remembered
of makedumpfile, we gathered that as the number of pages grows,
makedumpfile has to temporarily store more information in memory. The
possible reason was to calculate the size of the file before it was
copied to its final destination?

I was curious whether that is true and, if it is, whether it would be
possible to process memory in chunks instead of all at once.

The idea is that a machine with 4GB of memory should consume the same
amount of kdump runtime memory as a 1TB system.

Just trying to research ways to keep the memory requirements consistent
across all memory sizes.

Thanks,
Don

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
* Re: makedumpfile memory usage grows with system memory size
From: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Date: 2012-03-29 8:09 UTC
To: Don Zickus; Cc: kexec

Hi Don-san,

On Wed, 28 Mar 2012 17:22:04 -0400 Don Zickus <dzickus@redhat.com> wrote:
>
> I was talking to Vivek about kdump memory requirements and he mentioned
> that they vary based on how much system memory is used.
>
> I was interested in knowing why that was and again he mentioned that
> makedumpfile needed lots of memory if it was running on a large machine
> (for example 1TB of system memory).
>
> Looking through the makedumpfile README and using what Vivek remembered
> of makedumpfile, we gathered that as the number of pages grows, the more
> makedumpfile has to temporarily store the information in memory. The
> possible reason was to calculate the size of the file before it was
> copied to its final destination?

On RHEL, makedumpfile uses the 2nd kernel's system memory for a bitmap.
The bitmap records whether each page of the 1st kernel is excluded or
not, so the bitmap size depends on the 1st kernel's system memory.
makedumpfile creates the file /tmp/kdump_bitmapXXXXXX as the bitmap, and
on RHEL that file resides in the 2nd kernel's memory, because RHEL does
not mount a root filesystem while the 2nd kernel is running.

> I was curious if that was true and if it was, would it be possible to
> only process memory in chunks instead of all at once.
>
> The idea is that a machine with 4Gigs of memory should consume the same
> amount of kdump runtime memory as a 1TB memory system.
>
> Just trying to research ways to keep the memory requirements consistent
> across all memory ranges.

I think the above purpose is good, but I don't have any idea for
reducing the bitmap size.

Also, I am now out of makedumpfile development. Kumagai-san is the
makedumpfile maintainer now, and he will help you.

Thanks
Ken'ichi Ohmichi
* Re: makedumpfile memory usage grows with system memory size
From: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Date: 2012-03-29 12:56 UTC
To: dzickus; Cc: oomichi, kexec

Hello Don,

I'm missing your mail somehow, so I'm replying to Oomichi-san's mail...

From: "Ken'ichi Ohmichi" <oomichi@mxs.nes.nec.co.jp>
Date: Thu, 29 Mar 2012 17:09:18 +0900
> On Wed, 28 Mar 2012 17:22:04 -0400
> Don Zickus <dzickus@redhat.com> wrote:
>> I was curious if that was true and if it was, would it be possible to
>> only process memory in chunks instead of all at once.
>>
>> The idea is that a machine with 4Gigs of memory should consume the same
>> amount of kdump runtime memory as a 1TB memory system.
>>
>> Just trying to research ways to keep the memory requirements consistent
>> across all memory ranges.

I think this is possible in constant memory space by creating bitmaps and
writing pages for a fixed amount of memory at a time. That is, if
choosing 4GB, process the [0, 4GB) range, then [4GB, 8GB), then
[8GB, 12GB), and so on, in order. The key is to restrict the target
memory range of the filtering.

Thanks.
HATAYAMA, Daisuke
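[Editor's note: the chunked iteration described above can be sketched
roughly as follows. This is a hypothetical illustration, not makedumpfile
code; `CHUNK_PAGES`, `count_chunks`, and the commented-out
`filter_and_write` are invented names.]

```c
#include <assert.h>
#include <stdint.h>

/* 4 GiB worth of 4 KiB page frames per cycle (an assumed chunk size). */
#define CHUNK_PAGES ((4ULL << 30) / 4096)

/* Walk [0, max_pfn) in fixed-size chunks; each cycle would filter and
 * write only its own range, so the working bitmap stays constant-sized
 * no matter how large the machine is. Returns the number of cycles. */
uint64_t count_chunks(uint64_t max_pfn)
{
	uint64_t nchunks = 0;
	uint64_t start, end;

	for (start = 0; start < max_pfn; start += CHUNK_PAGES) {
		end = start + CHUNK_PAGES;
		if (end > max_pfn)
			end = max_pfn;
		/* filter_and_write(start, end);  -- per-range filtering
		 * and writeout would happen here */
		nchunks++;
	}
	return nchunks;
}
```

Under these assumptions a 4GB machine needs one cycle and a 1TB machine
needs 256, but the per-cycle memory footprint is the same for both.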
* Re: makedumpfile memory usage grows with system memory size
From: Don Zickus <dzickus@redhat.com>
Date: 2012-03-29 13:25 UTC
To: HATAYAMA Daisuke; Cc: oomichi, kexec

Hello Daisuke,

On Thu, Mar 29, 2012 at 09:56:46PM +0900, HATAYAMA Daisuke wrote:

[..]
> I think this is possible in constant memory space by creating bitmaps
> and writing pages for a fixed amount of memory at a time. That is, if
> choosing 4GB, process the [0, 4GB) range, then [4GB, 8GB), then
> [8GB, 12GB), and so on, in order. The key is to restrict the target
> memory range of the filtering.

Yes, that was what I was thinking. I am glad to hear it is possible.
Is there some place in the code where I can help try out that idea? I
would also be curious whether there is a 'time' impact on how long it
takes to process this (for example, would it add a couple of
milliseconds of overhead, or seconds?).

Thanks,
Don
* Re: makedumpfile memory usage grows with system memory size
From: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Date: 2012-03-30 0:51 UTC
To: dzickus; Cc: oomichi, kexec

From: Don Zickus <dzickus@redhat.com>
Date: Thu, 29 Mar 2012 09:25:33 -0400

[..]
> Yes, that was what I was thinking. I am glad to hear that is possible.
> Is there some place in the code that I can help try out that idea? I
> would also be curious if there is a 'time' impact on how long it takes
> to process this (for example, would it add a couple of milliseconds
> overhead or seconds overhead).

The related part is this path in create_dumpfile():

	if (!create_dump_bitmap())
		return FALSE;

	if (info->flag_split) {
		if ((status = writeout_multiple_dumpfiles()) == FALSE)
			return FALSE;
	} else {
		if ((status = writeout_dumpfile()) == FALSE)
			return FALSE;
	}

Now this part does the task for the whole of memory at once, so it first
needs to be generalized so that it does the processing per range of
memory. If the processing takes three cycles, it looks as follows,
pictorially, in the kdump-compressed format:

	+------------------------------------------+
	| main header (struct disk_dump_header)    |
	+------------------------------------------+
	| sub header (struct kdump_sub_header)     |
	+------------------------------------------+
	|                                          | <-- 1st cycle
	| 1st-bitmap                               | <-- 2nd cycle
	|                                          | <-- 3rd cycle
	+------------------------------------------+
	|                                          | <-- 1st cycle
	| 2nd-bitmap                               | <-- 2nd cycle
	|                                          | <-- 3rd cycle
	+------------------------------------------+
	|                                          | <-- 1st cycle
	| page header                              | <-- 2nd cycle
	|                                          | <-- 3rd cycle
	+------------------------------------------+
	|                                          |
	| page data                                | <-- 1st cycle
	|                                          |
	| page data                                | <-- 2nd cycle
	|                                          |
	| page data                                | <-- 3rd cycle
	|                                          |
	+------------------------------------------+

where all the portions except for page data are of fixed length, so I
drew only page data differently.

For the processing of writing pages per range of memory, it's useful to
reuse the code for --split's splitting feature, which splits a single
dumpfile into multiple dumpfiles and already has a data structure
holding the start and end page frame numbers of the corresponding dumped
memory. For example, see this part in write_kdump_pages():

	if (info->flag_split) {
		start_pfn = info->split_start_pfn;
		end_pfn   = info->split_end_pfn;
	} else {
		start_pfn = 0;
		end_pfn   = info->max_mapnr;
	}

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {

For the processing of creating and referencing bitmaps per range of
memory, there are no functions that do that; the existing ones,
create_bitmap() and is_dumpable(), work on the whole of memory only.
Also, creating the bitmap depends on the source dumpfile format. Trying
the ELF to kdump-compressed format case first seems most handy (or, if
the use case is on the 2nd kernel only, is that case enough?).

For the performance impact, I don't know exactly, but I guess the
iterated filtering processing is the most significant cost. I don't know
the exact data structure for each kind of memory, but if there are any
that need a linear-order scan to look up the data for a given page frame
number, some special handling would be necessary so as not to reduce
performance.

Thanks.
HATAYAMA, Daisuke
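[Editor's note: a chunk-local bitmap along the lines discussed above
might look like this. It is a sketch under assumed names; makedumpfile's
real create_bitmap()/is_dumpable() pair indexes the whole of memory
instead.]

```c
#include <assert.h>
#include <stdint.h>

/* One bit per pfn inside [start_pfn, start_pfn + CHUNK_PFNS) only,
 * instead of one bit per pfn of the whole machine. 2^20 pfns (4 GiB of
 * 4 KiB pages) need just 128 KiB regardless of total memory size. */
#define CHUNK_PFNS (1ULL << 20)

struct chunk_bitmap {
	uint64_t start_pfn;
	unsigned char bits[CHUNK_PFNS / 8];
};

void chunk_set_dumpable(struct chunk_bitmap *b, uint64_t pfn)
{
	uint64_t off = pfn - b->start_pfn;  /* index relative to chunk */
	b->bits[off >> 3] |= (unsigned char)(1 << (off & 7));
}

int chunk_is_dumpable(const struct chunk_bitmap *b, uint64_t pfn)
{
	uint64_t off = pfn - b->start_pfn;
	return (b->bits[off >> 3] >> (off & 7)) & 1;
}
```

The point of the design is that the array size is bounded by the chunk
size, not by max_mapnr, so memory use stays constant as machines grow.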
* Re: makedumpfile memory usage grows with system memory size
From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Date: 2012-04-02 7:46 UTC
To: d.hatayama; Cc: dzickus, oomichi, kexec

Hello Hatayama-san,

On Fri, 30 Mar 2012 09:51:43 +0900, HATAYAMA Daisuke
<d.hatayama@jp.fujitsu.com> wrote:
> For the processing of writing pages per range of memory, it's useful
> to reuse the code for --split's splitting feature, which splits a
> single dumpfile into multiple dumpfiles and already has a data
> structure holding the start and end page frame numbers of the
> corresponding dumped memory. For example, see the part in
> write_kdump_pages().
[..]
> For the processing of creating and referencing bitmaps per range of
> memory, there are no functions that do that; the existing ones,
> create_bitmap() and is_dumpable(), work on the whole of memory only.
> Also, creating the bitmap depends on the source dumpfile format.
>
> For the performance impact, I don't know exactly, but I guess the
> iterated filtering processing is the most significant cost.

Thank you for your idea.

I think this is an important issue, and I have no idea other than
iterating the filtering processes for each memory range.

But as you said, we should consider the issue related to performance.
For example, makedumpfile must parse free_list repeatedly to distinguish
whether each pfn is a free page or not, because each range may lie
inside the same zone. That will be overhead.

Thanks
Atsushi Kumagai
* Re: makedumpfile memory usage grows with system memory size
From: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Date: 2012-04-05 6:52 UTC
To: kumagai-atsushi; Cc: dzickus, oomichi, kexec

From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Date: Mon, 2 Apr 2012 16:46:51 +0900
> On Fri, 30 Mar 2012 09:51:43 +0900, HATAYAMA Daisuke wrote:
>> For the performance impact, I don't know exactly, but I guess the
>> iterated filtering processing is the most significant cost.
>
> Thank you for your idea.
>
> I think this is an important issue and I have no idea except iterating
> filtering processes for each memory range.
>
> But as you said, we should consider the issue related to performance.
> For example, makedumpfile must parse free_list repeatedly to
> distinguish whether each pfn is a free page or not, because each range
> may be inside the same zone. It will be overhead.

Hello Kumagai-san,

I looked into the contents of free_list and confirmed that even buddies
of the same order are not ordered linearly. Below is the output of a
makedumpfile I customized to print the buddy data:

# ./makedumpfile --message-level 32 -c -d 31 /media/127.0.0.1-2012-04-04-20:31:58/vmcore vmcore-cd31
NR_ZONE: 0
order: 10 migrate_type: 2 pfn: 3072
order: 10 migrate_type: 2 pfn: 2048
order: 10 migrate_type: 2 pfn: 1024
order: 9 migrate_type: 3 pfn: 512
order: 8 migrate_type: 0 pfn: 256
order: 6 migrate_type: 0 pfn: 64
order: 5 migrate_type: 0 pfn: 32
order: 4 migrate_type: 0 pfn: 128
order: 4 migrate_type: 0 pfn: 16
order: 2 migrate_type: 0 pfn: 144
order: 1 migrate_type: 0 pfn: 148
NR_ZONE: 1
order: 10 migrate_type: 2 pfn: 226304
order: 10 migrate_type: 2 pfn: 225280
order: 10 migrate_type: 2 pfn: 486400
order: 10 migrate_type: 2 pfn: 485376
order: 10 migrate_type: 2 pfn: 484352
order: 10 migrate_type: 2 pfn: 483328
order: 10 migrate_type: 2 pfn: 482304
order: 10 migrate_type: 2 pfn: 481280
<snip>

So we cannot simply walk free_list in increasing pfn order for a given
range of memory, suspend the walk, and save the state for the next walk.
It's therefore necessary to create a table that allows constant-time
access. But that table needs to be created in memory, and on the 2nd
kernel we cannot assume any backing store in general: consider scp, for
example.

I think the basic ideas would be the usual small-memory programming
efforts, such as:

* Create only the part of the bitmap corresponding to the range of
  memory currently being processed, repeating the table creation each
  time a new range is started.
  => It is difficult to avoid looking up the whole free_list every time,
  but this is the only idea I have come up with that always keeps the
  consumed memory stably constant.

* Hold the table as a memory mapping rather than a bitmap, switching
  back to a bitmap if its size grows larger than the bitmap's.
  => Bad performance in very fragmented cases, and constructing the
  memory mapping requires O(n^2), so doing it multiple times would cost
  a lot.

* Compress the parts of the bitmap other than the one currently being
  processed.
  => Bad performance when compression doesn't work well, or when
  compression is done too many times.

But before that, I also want to consider the possibility of increasing
the reserved memory for the 2nd kernel. In the discussion of the 512MB
reservation regression last month, Vivek explained that 512MB is the
current maximum value and is enough for at most a 6TB system:

https://lkml.org/lkml/2012/3/13/372

But on such a machine, where makedumpfile performance is affected, there
seems to be room to reserve another 512MB of memory. Also, Yinghai said,
following Vivek, that system memory sizes will keep growing in the
coming years.

Notes:

* 1 bit in the bitmap represents 1 page frame. On x86, 1 byte covers
  32kB of memory, so 1TB of memory requires 32MB. A dump includes two
  bitmaps, so 64MB is needed in total.

* The bad performance concerns free pages only. Cache, cache-private,
  user, and zero pages are processed per range of memory with good
  performance.

Thanks.
HATAYAMA, Daisuke
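[Editor's note: the sizing in the note above works out as follows. The
function names are invented for this arithmetic sketch.]

```c
#include <assert.h>
#include <stdint.h>

/* One bit per 4 KiB page frame: one byte of bitmap covers 32 KiB of
 * memory. The kdump-compressed format carries two bitmaps (1st-bitmap
 * and 2nd-bitmap), hence the doubling. */
uint64_t bitmap_bytes(uint64_t mem_bytes)
{
	return mem_bytes / 4096 / 8;
}

uint64_t total_bitmap_bytes(uint64_t mem_bytes)
{
	return 2 * bitmap_bytes(mem_bytes);
}
```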
* Re: makedumpfile memory usage grows with system memory size
From: Vivek Goyal <vgoyal@redhat.com>
Date: 2012-04-05 14:34 UTC
To: HATAYAMA Daisuke; Cc: dzickus, oomichi, kumagai-atsushi, kexec

On Thu, Apr 05, 2012 at 03:52:11PM +0900, HATAYAMA Daisuke wrote:

[..]
> * The bad performance concerns free pages only. Cache, cache-private,
>   user, and zero pages are processed per range of memory with good
>   performance.

Hi Daisuke-san,

I am wondering why we can't walk through the memmap array and look into
struct page to figure out whether a page is free or not. It looks like
in the past we used to have the PG_buddy flag, and the same information
can possibly be retrieved by looking at the page->_count field.

So I am just curious why we walk through the free pages list to figure
out free pages instead of looking at "struct page".

Thanks
Vivek
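[Editor's note: the two detection schemes under discussion can be mocked
up as follows. The struct layout, field values, and function are
illustrative assumptions, not makedumpfile's dump-parsing code, which
reads these fields through its own accessors.]

```c
#include <assert.h>

/* Illustrative mock of the struct page fields involved. */
struct mock_page {
	unsigned long flags;
	int _mapcount;
};

#define PG_buddy 19                     /* flag bit on older kernels   */
#define PAGE_BUDDY_MAPCOUNT_VALUE (-2)  /* buddy marker after PG_buddy
                                           was removed (2.6.38+)       */

int page_is_free(const struct mock_page *p, int kernel_has_pg_buddy)
{
	if (kernel_has_pg_buddy)
		return (p->flags >> PG_buddy) & 1;
	return p->_mapcount == PAGE_BUDDY_MAPCOUNT_VALUE;
}
```

Either way, the walk is a per-page test against memmap rather than a
traversal of the zone free lists.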
* Re: makedumpfile memory usage grows with system memory size
From: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Date: 2012-04-06 1:12 UTC
To: vgoyal; Cc: dzickus, oomichi, kumagai-atsushi, kexec

From: Vivek Goyal <vgoyal@redhat.com>
Date: Thu, 5 Apr 2012 10:34:39 -0400
> I am wondering why we can't walk through the memmap array and look into
> struct page to figure out whether a page is free or not. It looks like
> in the past we used to have the PG_buddy flag, and the same information
> can possibly be retrieved by looking at the page->_count field.
>
> So I am just curious why we walk through the free pages list to figure
> out free pages instead of looking at "struct page".

Hello Vivek,

Thanks. To be honest, I have only just begun reading around here and
learned of PG_buddy just now. I did a small check of this on 2.6.18 with
the patch at the bottom of this mail, and the free pages found from
free_list and those found by the PG_buddy check coincide.

As Vivek says, more recent kernels changed things around PG_buddy, and
the commit below says we should check _mapcount; I have yet to check
this.

    Author: Andrea Arcangeli <aarcange@redhat.com>
    Date:   Thu Jan 13 15:47:00 2011 -0800

        thp: remove PG_buddy

        PG_buddy can be converted to _mapcount == -2. So the
        PG_compound_lock can be added to page->flags without overflowing
        (because of the sparse section bits increasing) with
        CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also has to move the
        memory hotplug code from _mapcount to lru.next to avoid any risk
        of clashes. We can't use lru.next for PG_buddy removal, but
        memory hotplug can use lru.next even more easily than the
        mapcount instead.

        Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
        Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
        Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    $ git describe 5f24ce5fd34c3ca1b3d10d30da754732da64d5c0
    v2.6.37-7012-g5f24ce5

So now we can walk the memmap array for free pages too, like for the
other kinds of memory. The question I have now is why the current
implementation was chosen. Is there any difference between the two ways?

Subject: [PATCH] Add free pages message
---
 makedumpfile.c | 9 +++++++++
 makedumpfile.h | 1 +
 print_info.h   | 2 +-
 3 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index c843567..bd770b1 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -3198,6 +3198,9 @@ reset_bitmap_of_free_pages(unsigned long node_zones)
 			retcd = ANALYSIS_FAILED;
 			return FALSE;
 		}
+
+		FREEPAGE_MSG("order: %d migrate_type: %d pfn: %llu\n", order, migrate_type, start_pfn);
+
 		for (i = 0; i < (1<<order); i++) {
 			pfn = start_pfn + i;
 			clear_bit_on_2nd_bitmap_for_kernel(pfn);
@@ -3399,6 +3402,7 @@ _exclude_free_page(void)
 		}
 		if (!spanned_pages)
 			continue;
+		FREEPAGE_MSG("NR_ZONE: %d\n", i);
 		if (!reset_bitmap_of_free_pages(zone))
 			return FALSE;
 	}
@@ -3688,6 +3692,11 @@ __exclude_unnecessary_pages(unsigned long mem_map,
 		_count  = UINT(pcache + OFFSET(page._count));
 		mapping = ULONG(pcache + OFFSET(page.mapping));

+		if ((info->dump_level & DL_EXCLUDE_FREE)
+		    && (flags & (1UL << PG_buddy))) {
+			FREEPAGE_MSG("PG_flag: flags: %#016lx pfn %llu\n", flags, pfn);
+		}
+
 		/*
 		 * Exclude the cache page without the private page.
 		 */
diff --git a/makedumpfile.h b/makedumpfile.h
index ed1e9de..1faef47 100644
--- a/makedumpfile.h
+++ b/makedumpfile.h
@@ -67,6 +67,7 @@ int get_mem_type(void);
 #define PG_lru_ORIGINAL		(5)
 #define PG_private_ORIGINAL	(11)	/* Has something at ->private */
 #define PG_swapcache_ORIGINAL	(15)	/* Swap page: swp_entry_t in private */
+#define PG_buddy		(19)

 #define PAGE_MAPPING_ANON	(1)

diff --git a/print_info.h b/print_info.h
index 94968ca..44415d3 100644
--- a/print_info.h
+++ b/print_info.h
@@ -42,7 +42,7 @@ void print_execution_time(char *step_name, struct timeval *tv_start);
  * Message Level
  */
 #define MIN_MSG_LEVEL		(0)
-#define MAX_MSG_LEVEL		(31)
+#define MAX_MSG_LEVEL		(31+0x20)
 #define DEFAULT_MSG_LEVEL	(7)	/* Print the progress indicator,
					   the common message, the error message */
 #define ML_PRINT_PROGRESS	(0x001)	/* Print the progress indicator */
--
1.7.4.4

Thanks,
HATAYAMA, Daisuke
* Re: makedumpfile memory usage grows with system memory size
From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Date: 2012-04-06 8:59 UTC
To: d.hatayama; Cc: dzickus, oomichi, kexec, vgoyal

Hello Hatayama-san,

On Fri, 06 Apr 2012 10:12:12 +0900, HATAYAMA Daisuke
<d.hatayama@jp.fujitsu.com> wrote:
> Thanks. To be honest, I have only just begun reading around here and
> learned of PG_buddy just now. I did a small check of this on 2.6.18
> with the patch at the bottom of this mail, and the free pages found
> from free_list and those found by the PG_buddy check coincide.
>
> As Vivek says, more recent kernels changed things around PG_buddy, and
> the commit says we should check _mapcount; I have yet to check this.
[..]
> So now we can walk the memmap array for free pages too, like for the
> other kinds of memory. The question I have now is why the current
> implementation was chosen. Is there any difference between the two
> ways?

We just referred to the implementation of disk_dump.

Now, I'm checking the validity of using the _count field to figure out
free pages. I would like to use _count rather than PG_buddy because I
would like to avoid changing behavior based on kernel versions as long
as possible.

Thanks
Atsushi Kumagai
* Re: makedumpfile memory usage grows with system memory size
From: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Date: 2012-04-06 9:29 UTC
To: kumagai-atsushi; Cc: dzickus, oomichi, kexec, vgoyal

From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Date: Fri, 6 Apr 2012 17:59:24 +0900

[..]
> We just referred to the implementation of disk_dump.
>
> Now, I'm checking the validity of using the _count field to figure out
> free pages. I would like to use _count rather than PG_buddy because I
> would like to avoid changing behavior based on kernel versions as long
> as possible.

I agree. On the other hand, there is one more thing to consider: the
order of a buddy is kept in the private member of the page descriptor,
and there is currently no information about the private member in
VMCOREINFO. If we choose this method and delete the current one, it
becomes necessary to have a vmlinux file available for old kernels.

Thanks.
HATAYAMA, Daisuke
* Re: makedumpfile memory usage grows with system memory size
From: Vivek Goyal <vgoyal@redhat.com>
Date: 2012-04-09 18:57 UTC
To: HATAYAMA Daisuke; Cc: dzickus, oomichi, kumagai-atsushi, kexec

On Fri, Apr 06, 2012 at 06:29:40PM +0900, HATAYAMA Daisuke wrote:

[..]
> I agree. On the other hand, there is one more thing to consider: the
> order of a buddy is kept in the private member of the page descriptor,
> and there is currently no information about the private member in
> VMCOREINFO. If we choose this method and delete the current one, it
> becomes necessary to have a vmlinux file available for old kernels.

What information do you need to access the "private" member of
"struct page"? The offset? Can't we extend VMCOREINFO to export this
info too?

Thanks
Vivek
* Re: makedumpfile memory usage grows with system memory size
From: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Date: 2012-04-09 23:58 UTC
To: vgoyal; Cc: dzickus, oomichi, kumagai-atsushi, kexec

From: Vivek Goyal <vgoyal@redhat.com>
Date: Mon, 9 Apr 2012 14:57:28 -0400
> What information do you need to access the "private" member of
> "struct page"? The offset? Can't we extend VMCOREINFO to export this
> info too?

Yes, I mean the offset of the private member in the page structure; the
member contains the order of the buddy. Extending VMCOREINFO is easy,
but we cannot do that for old kernels, for which vmlinux is needed
separately.

This might be the same as what Kumagai-san means when he says he doesn't
want to change behaviour across kernel versions.

Thanks.
HATAYAMA, Daisuke
* Re: makedumpfile memory usage grows with system memory size 2012-04-09 23:58 ` HATAYAMA Daisuke @ 2012-04-10 12:52 ` Vivek Goyal 2012-04-12 3:40 ` Atsushi Kumagai 0 siblings, 1 reply; 34+ messages in thread From: Vivek Goyal @ 2012-04-10 12:52 UTC (permalink / raw) To: HATAYAMA Daisuke; +Cc: dzickus, oomichi, kumagai-atsushi, kexec On Tue, Apr 10, 2012 at 08:58:24AM +0900, HATAYAMA Daisuke wrote: > From: Vivek Goyal <vgoyal@redhat.com> > Subject: Re: makedumpfile memory usage grows with system memory size > Date: Mon, 9 Apr 2012 14:57:28 -0400 > > > On Fri, Apr 06, 2012 at 06:29:40PM +0900, HATAYAMA Daisuke wrote: > > > > [..] > >> I agree. On the other hand, there is one more thing to consider. The > >> value of order is in private member of the page descripter. Now > >> there's no information for private member in VMCOREINFO. If we choose > >> this method and delete the current one, it's necessary to prepare > >> vmlinux file for old kernels. > > > > What information do you need to access "private" member of "struct page". > > offset? Can't we extend VMCOREINFO to export this info too? > > > > Yes, I mean offset of private member in page structure. The member > contains order of the buddy. Extending VMCOREINFO is easy, but we > cannot do that for old kernels, for which vmlinux is needed > separately. > > This might be the same as what Kumagai-san says he doesn' want to > change behaviour on kernel versions. We can retain both the mechanisms. For newer kernels which export page->private offset, we can walk through memmap array and prepare a chunk of bitmap and discard it. For older kernels we can continue to walk through free pages list and prepare big bitmap in userspace. It is desirable to keep mechanism same across kernel versions, but change is unavoidable as things evolve in newer kernels. So at max we can provide backward compatibility with old kernels. 
Thanks Vivek _______________________________________________ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec ^ permalink raw reply [flat|nested] 34+ messages in thread
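[Editorial note: Vivek's "prepare a chunk of bitmap and discard it" idea above can be sketched as follows. This is a hypothetical, simplified model with toy types and sizes, not makedumpfile's real data structures: the page range is walked in fixed-size cycles so the working bitmap stays constant regardless of total system memory.]

```c
#include <stddef.h>
#include <string.h>
#include <assert.h>

#define CHUNK_PAGES  4096u                /* page frames handled per cycle */
#define CHUNK_BYTES  (CHUNK_PAGES / 8)    /* bitmap bytes, constant: 512   */

struct toy_page { int is_free; };         /* stand-in for struct page      */

static unsigned char chunk_bitmap[CHUNK_BYTES];   /* reused each cycle     */

/* Filter page frames [start, end): mark non-free pages as dumpable, one
 * fixed-size chunk at a time. Returns the number of dumpable pages. Peak
 * bitmap memory is CHUNK_BYTES no matter how large the range is. */
static size_t filter_in_cycles(const struct toy_page *memmap,
                               size_t start, size_t end)
{
    size_t kept = 0;

    for (size_t base = start; base < end; base += CHUNK_PAGES) {
        size_t n = (end - base < CHUNK_PAGES) ? end - base : CHUNK_PAGES;

        memset(chunk_bitmap, 0, sizeof(chunk_bitmap));
        for (size_t i = 0; i < n; i++) {
            if (!memmap[base + i].is_free) {      /* keep non-free pages */
                chunk_bitmap[i / 8] |= (unsigned char)(1u << (i % 8));
                kept++;
            }
        }
        /* here the finished chunk would be flushed to the dump file and
         * the buffer reused for the next cycle */
    }
    return kept;
}
```

This is exactly why the linear memmap walk matters: a free_list walk returns page frames in arbitrary order, so it cannot fill and discard one contiguous chunk at a time the way this loop does.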
* Re: makedumpfile memory usage grows with system memory size 2012-04-10 12:52 ` Vivek Goyal @ 2012-04-12 3:40 ` Atsushi Kumagai 2012-04-12 7:47 ` HATAYAMA Daisuke 0 siblings, 1 reply; 34+ messages in thread From: Atsushi Kumagai @ 2012-04-12 3:40 UTC (permalink / raw) To: vgoyal; +Cc: dzickus, oomichi, d.hatayama, kexec Hello, On Tue, 10 Apr 2012 08:52:05 -0400 Vivek Goyal <vgoyal@redhat.com> wrote: > On Tue, Apr 10, 2012 at 08:58:24AM +0900, HATAYAMA Daisuke wrote: > > From: Vivek Goyal <vgoyal@redhat.com> > > Subject: Re: makedumpfile memory usage grows with system memory size > > Date: Mon, 9 Apr 2012 14:57:28 -0400 > > > > > On Fri, Apr 06, 2012 at 06:29:40PM +0900, HATAYAMA Daisuke wrote: > > > > > > [..] > > >> I agree. On the other hand, there is one more thing to consider. The > > >> value of order is in private member of the page descripter. Now > > >> there's no information for private member in VMCOREINFO. If we choose > > >> this method and delete the current one, it's necessary to prepare > > >> vmlinux file for old kernels. > > > > > > What information do you need to access "private" member of "struct page". > > > offset? Can't we extend VMCOREINFO to export this info too? > > > > > > > Yes, I mean offset of private member in page structure. The member > > contains order of the buddy. Extending VMCOREINFO is easy, but we > > cannot do that for old kernels, for which vmlinux is needed > > separately. > > > > This might be the same as what Kumagai-san says he doesn' want to > > change behaviour on kernel versions. > > We can retain both the mechanisms. For newer kernels which export > page->private offset, we can walk through memmap array and prepare a > chunk of bitmap and discard it. For older kernels we can continue to walk > through free pages list and prepare big bitmap in userspace. > > It is desirable to keep mechanism same across kernel versions, but > change is unavoidable as things evolve in newer kernels. 
> So at max we can provide backward compatibility with old kernels.

I said I want to avoid changing behavior based on kernel versions, but it seems difficult, as Vivek said. So, I will accept the change if it is necessary.

Now, I will make two prototypes to consider the method to figure out free pages:

- a prototype based on _count
- a prototype based on PG_buddy (or _mapcount)

If the prototypes work fine, then we can select the method.

Thanks Atsushi Kumagai _______________________________________________ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec ^ permalink raw reply [flat|nested] 34+ messages in thread
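[Editorial note: the two prototypes Kumagai-san names can be sketched as per-page predicates. The concrete numbers here are assumptions to verify against each kernel: the PG_buddy bit value (18 or 19 depending on version) is quoted from later in this thread, and PAGE_BUDDY_MAPCOUNT_VALUE (-128) is mainline's marker for buddy pages in kernels that removed the PG_buddy flag.]

```c
#include <assert.h>

#define PG_BUDDY_BIT                18UL    /* assumed: 2.6.27 -- 2.6.36 range */
#define PAGE_BUDDY_MAPCOUNT_VALUE   (-128)  /* assumed mainline marker value   */

/* Prototype "PG_buddy": test the flag bit in page->flags. */
static int is_free_by_pg_buddy(unsigned long flags)
{
    return (int)((flags >> PG_BUDDY_BIT) & 1UL);
}

/* Prototype "_mapcount": compare the raw counter against the buddy marker.
 * (A plain "_count == 0" test is weaker: a page can transiently have
 * _count == 0 while still in use, which matches the misfire reported
 * later in this thread.) */
static int is_free_by_mapcount(int mapcount_raw)
{
    return mapcount_raw == PAGE_BUDDY_MAPCOUNT_VALUE;
}
```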
* Re: makedumpfile memory usage grows with system memory size 2012-04-12 3:40 ` Atsushi Kumagai @ 2012-04-12 7:47 ` HATAYAMA Daisuke [not found] ` <20120427164649.9932a33f.kumagai-atsushi@mxc.nes.nec.co.jp> 0 siblings, 1 reply; 34+ messages in thread From: HATAYAMA Daisuke @ 2012-04-12 7:47 UTC (permalink / raw) To: kumagai-atsushi; +Cc: dzickus, oomichi, kexec, vgoyal From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Subject: Re: makedumpfile memory usage grows with system memory size Date: Thu, 12 Apr 2012 12:40:40 +0900 > Hello, > > On Tue, 10 Apr 2012 08:52:05 -0400 > Vivek Goyal <vgoyal@redhat.com> wrote: > >> On Tue, Apr 10, 2012 at 08:58:24AM +0900, HATAYAMA Daisuke wrote: >> > From: Vivek Goyal <vgoyal@redhat.com> >> > Subject: Re: makedumpfile memory usage grows with system memory size >> > Date: Mon, 9 Apr 2012 14:57:28 -0400 >> > >> > > On Fri, Apr 06, 2012 at 06:29:40PM +0900, HATAYAMA Daisuke wrote: >> > > >> > > [..] >> > >> I agree. On the other hand, there is one more thing to consider. The >> > >> value of order is in private member of the page descripter. Now >> > >> there's no information for private member in VMCOREINFO. If we choose >> > >> this method and delete the current one, it's necessary to prepare >> > >> vmlinux file for old kernels. >> > > >> > > What information do you need to access "private" member of "struct page". >> > > offset? Can't we extend VMCOREINFO to export this info too? >> > > >> > >> > Yes, I mean offset of private member in page structure. The member >> > contains order of the buddy. Extending VMCOREINFO is easy, but we >> > cannot do that for old kernels, for which vmlinux is needed >> > separately. >> > >> > This might be the same as what Kumagai-san says he doesn' want to >> > change behaviour on kernel versions. >> >> We can retain both the mechanisms. For newer kernels which export >> page->private offset, we can walk through memmap array and prepare a >> chunk of bitmap and discard it. 
For older kernels we can continue to walk >> through free pages list and prepare big bitmap in userspace. >> >> It is desirable to keep mechanism same across kernel versions, but >> change is unavoidable as things evolve in newer kernels. So at max >> we can provide backward compatibility with old kernels. > > I said I want to avoid changing behavior based on kernel versions, > but it seems difficult as Vivek said. So, I will accept the changing > if it is necessary. > > Now, I will make two prototypes to consider the method to figure out > free pages. > > - a prototype based on _count > - a prototype based on PG_buddy (or _mapcount) > > If prototypes work fine, then we can select the method. I think the first one would work well, and it is more accurate as a definition of a free page. Although this might not be problematic in practice, the new method that walks over the page descriptors can lead to a different result from the previous one that looks up free_list: looking at __free_pages(), it first decreases page->_count and then adds the page to free_list, and looking at __alloc_pages(), it first retrieves a page from free_list and then sets page->_count to 1. Thanks. HATAYAMA, Daisuke _______________________________________________ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <20120427164649.9932a33f.kumagai-atsushi@mxc.nes.nec.co.jp>]
* Re: makedumpfile memory usage grows with system memory size [not found] ` <20120427164649.9932a33f.kumagai-atsushi@mxc.nes.nec.co.jp> @ 2012-04-27 12:52 ` Don Zickus 2012-05-11 1:19 ` Atsushi Kumagai 2012-04-27 13:33 ` Vivek Goyal 2012-05-14 5:44 ` HATAYAMA Daisuke 2 siblings, 1 reply; 34+ messages in thread From: Don Zickus @ 2012-04-27 12:52 UTC (permalink / raw) To: Atsushi Kumagai; +Cc: oomichi, d.hatayama, kexec, vgoyal On Fri, Apr 27, 2012 at 04:46:49PM +0900, Atsushi Kumagai wrote: > Hello, > > On Thu, 12 Apr 2012 16:47:14 +0900 (JST) > HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote: > > [..] > > > I said I want to avoid changing behavior based on kernel versions, > > > but it seems difficult as Vivek said. So, I will accept the changing > > > if it is necessary. > > > > > > Now, I will make two prototypes to consider the method to figure out > > > free pages. > > > > > > - a prototype based on _count > > > - a prototype based on PG_buddy (or _mapcount) > > > > > > If prototypes work fine, then we can select the method. > > > > I think the first one would work well and it's more accurate in > > meaning of free page. > > > > Although this might be not problematic in practice, new method that > > walks on page tables can lead to different result from the previous > > one that looks up free_list: looking at __free_pages(), it first > > decreases page->_count and then add the page to free_list, and looking > > at __alloc_pages(), it first retrieves a page from free_list and then > > set page->_count to 1. > > I tested the prototype based on _count and the other based on _mapcount. > So, the former didn't work as expected while the latter worked fine. > (The former excluded some used pages as free pages.) > > As a next step, I measured performance of the prototype based on _mapcount, > please see below. Thanks for this work. 
I assume this work just switches the free page referencing and does not attempt to try and cut down on the memory usage (I guess that would be the next step if using mapcount is acceptable)? > > > Performance Comparison: > > Explanation: > - The new method supports 2.6.39 and later, and it needs vmlinux. > > - Now, the prototype doesn't support PG_buddy because the value of PG_buddy > is different depending on kernel configuration and it isn't stored into > VMCOREINFO. However, I'll extend get_length_of_free_pages() for PG_buddy > when the value of PG_buddy is stored into VMCOREINFO. > > - The prototype has dump_level "32" to use new method, but I don't think > to extend dump_level for official version. > > How to measure: > I measured execution times with vmcore of 5GB in below cases with > attached patches. > > - dump_level 16: exclude only free pages with the current method > - dump_level 31: exclude all excludable pages with the current method > - dump_level 32: exclude only free pages with the new method > - dump_level 47: exclude all excludable pages with the new method > > Result: > ------------------------------------------------------------------------ > dump_level size [Bytes] total time d_all_time d_new_time > ------------------------------------------------------------------------ > 16 431864384 28.6s 4.19s 0s > 31 111808568 14.5s 0.9s 0s > 32 431864384 41.2s 16.8s 0.05s > 47 111808568 31.5s 16.6s 0.05s > ------------------------------------------------------------------------ > > Discussion: > I think the new method can be used instead of the current method in many cases. > (However, the result of dump_level 31 looks too fast, I'm researching why > the case can execute so fast.) > > I would like to get your opinion. I am curious. Looking through your patches, it seems d_all_time's increase in time should be from the new method because the if-statement is setup to only accept the new method. 
Therefore I was expecting d_new_time for the new method when added to d_all_time for the current method would come close to d_all_time for the new method. IOW I would have expected the extra 10-12 seconds from the new method to be found in d_new_time. However, I do not see that. d_new_time hardly increases at all. So what is accounting for the increase in d_all_time for the new method? Thanks, Don _______________________________________________ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: makedumpfile memory usage grows with system memory size 2012-04-27 12:52 ` Don Zickus @ 2012-05-11 1:19 ` Atsushi Kumagai 2012-05-11 13:26 ` Don Zickus 0 siblings, 1 reply; 34+ messages in thread From: Atsushi Kumagai @ 2012-05-11 1:19 UTC (permalink / raw) To: kexec; +Cc: dzickus, oomichi, d.hatayama, vgoyal Hello, On Fri, 27 Apr 2012 08:52:14 -0400 Don Zickus <dzickus@redhat.com> wrote: [..] > > I tested the prototype based on _count and the other based on _mapcount. > > So, the former didn't work as expected while the latter worked fine. > > (The former excluded some used pages as free pages.) > > > > As a next step, I measured performance of the prototype based on _mapcount, > > please see below. > > Thanks for this work. I assume this work just switches the free page > referencing and does not attempt to try and cut down on the memory usage > (I guess that would be the next step if using mapcount is acceptable)? Thank you for your reply, Don, Vivek. As Don said, I tried to change the method used to exclude free pages and planned to resolve the memory consumption issue after that, because parsing the free list repeatedly may cause a performance issue. However, I now think that bounding memory consumption to a fixed size is more important than resolving a performance issue on large systems. So I would like to change the plan as follows:

1. Implement "iterating filtering processing" to bound memory consumption to a fixed size. At this stage, makedumpfile will parse the free list repeatedly even though it may cause a performance issue.
2. Take care of the performance issue after the 1st step.

Thanks Atsushi Kumagai > > > > > > > Performance Comparison: > > > > Explanation: > > - The new method supports 2.6.39 and later, and it needs vmlinux. > > > > - Now, the prototype doesn't support PG_buddy because the value of PG_buddy > > is different depending on kernel configuration and it isn't stored into > > VMCOREINFO.
However, I'll extend get_length_of_free_pages() for PG_buddy > > when the value of PG_buddy is stored into VMCOREINFO. > > > > - The prototype has dump_level "32" to use new method, but I don't think > > to extend dump_level for official version. > > > > How to measure: > > I measured execution times with vmcore of 5GB in below cases with > > attached patches. > > > > - dump_level 16: exclude only free pages with the current method > > - dump_level 31: exclude all excludable pages with the current method > > - dump_level 32: exclude only free pages with the new method > > - dump_level 47: exclude all excludable pages with the new method > > > > Result: > > ------------------------------------------------------------------------ > > dump_level size [Bytes] total time d_all_time d_new_time > > ------------------------------------------------------------------------ > > 16 431864384 28.6s 4.19s 0s > > 31 111808568 14.5s 0.9s 0s > > 32 431864384 41.2s 16.8s 0.05s > > 47 111808568 31.5s 16.6s 0.05s > > ------------------------------------------------------------------------ > > > > Discussion: > > I think the new method can be used instead of the current method in many cases. > > (However, the result of dump_level 31 looks too fast, I'm researching why > > the case can execute so fast.) > > > > I would like to get your opinion. > > I am curious. Looking through your patches, it seems d_all_time's > increase in time should be from the new method because the if-statement is > setup to only accept the new method. Therefore I was expecting d_new_time > for the new method when added to d_all_time for the current method would > come close to d_all_time for the new method. IOW I would have expected > the extra 10-12 seconds from the new method to be found in d_new_time. > > However, I do not see that. d_new_time hardly increases at all. So what > is accounting for the increase in d_all_time for the new method? 
> > Thanks, > Don > > _______________________________________________ > kexec mailing list > kexec@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/kexec _______________________________________________ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: makedumpfile memory usage grows with system memory size 2012-05-11 1:19 ` Atsushi Kumagai @ 2012-05-11 13:26 ` Don Zickus 2012-05-15 5:57 ` Atsushi Kumagai 0 siblings, 1 reply; 34+ messages in thread From: Don Zickus @ 2012-05-11 13:26 UTC (permalink / raw) To: Atsushi Kumagai; +Cc: oomichi, d.hatayama, kexec, vgoyal On Fri, May 11, 2012 at 10:19:52AM +0900, Atsushi Kumagai wrote: > Hello, > > On Fri, 27 Apr 2012 08:52:14 -0400 > Don Zickus <dzickus@redhat.com> wrote: > > [..] > > > I tested the prototype based on _count and the other based on _mapcount. > > > So, the former didn't work as expected while the latter worked fine. > > > (The former excluded some used pages as free pages.) > > > > > > As a next step, I measured performance of the prototype based on _mapcount, > > > please see below. > > > > Thanks for this work. I assume this work just switches the free page > > referencing and does not attempt to try and cut down on the memory usage > > (I guess that would be the next step if using mapcount is acceptable)? > > Thank you for your reply, Don, Vivek. > > As Don said, I tried to change the method to exclude free pages and > planed to resolve the memory consumption issue after it, because > parsing free list repeatedly may cause a performance issue. > > However, I'm thinking that to fix the size of memory consumption is more > important than to resolve a performance issue for large system. > > So I'm afraid that I would like to change the plan as: > > 1. Implement "iterating filtering processing" to fix the size of memory > consumption. At this stage, makedumpfile will parse free list repeatedly > even though it may cause a performance issue. > > 2. Take care of the performance issue after the 1st step. Hello Atsushi-san, Hmm. The problem with the free list is that the addresses are in random order, hence the reason to parse it repeatedly, correct? 
I figured, now that you have a solution to parse the addresses in a linear way (the changes you made a couple of weeks ago), you would just continue with that. With that complete, we can look at the performance issues and solve them then. But it is up to you. You are willing to do the work, so I will defer to your judgement on how best to proceed. :-) Cheers, Don > > > Thanks > Atsushi Kumagai > > > > > > > > > > > > Performance Comparison: > > > > > > Explanation: > > > - The new method supports 2.6.39 and later, and it needs vmlinux. > > > > > > - Now, the prototype doesn't support PG_buddy because the value of PG_buddy > > > is different depending on kernel configuration and it isn't stored into > > > VMCOREINFO. However, I'll extend get_length_of_free_pages() for PG_buddy > > > when the value of PG_buddy is stored into VMCOREINFO. > > > > > > - The prototype has dump_level "32" to use new method, but I don't think > > > to extend dump_level for official version. > > > > > > How to measure: > > > I measured execution times with vmcore of 5GB in below cases with > > > attached patches. > > > > > > - dump_level 16: exclude only free pages with the current method > > > - dump_level 31: exclude all excludable pages with the current method > > > - dump_level 32: exclude only free pages with the new method > > > - dump_level 47: exclude all excludable pages with the new method > > > > > > Result: > > > ------------------------------------------------------------------------ > > > dump_level size [Bytes] total time d_all_time d_new_time > > > ------------------------------------------------------------------------ > > > 16 431864384 28.6s 4.19s 0s > > > 31 111808568 14.5s 0.9s 0s > > > 32 431864384 41.2s 16.8s 0.05s > > > 47 111808568 31.5s 16.6s 0.05s > > > ------------------------------------------------------------------------ > > > > > > Discussion: > > > I think the new method can be used instead of the current method in many cases. 
> > > (However, the result of dump_level 31 looks too fast, I'm researching why > > > the case can execute so fast.) > > > > > > I would like to get your opinion. > > > > I am curious. Looking through your patches, it seems d_all_time's > > increase in time should be from the new method because the if-statement is > > setup to only accept the new method. Therefore I was expecting d_new_time > > for the new method when added to d_all_time for the current method would > > come close to d_all_time for the new method. IOW I would have expected > > the extra 10-12 seconds from the new method to be found in d_new_time. > > > > However, I do not see that. d_new_time hardly increases at all. So what > > is accounting for the increase in d_all_time for the new method? > > > > Thanks, > > Don > > > > _______________________________________________ > > kexec mailing list > > kexec@lists.infradead.org > > http://lists.infradead.org/mailman/listinfo/kexec _______________________________________________ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: makedumpfile memory usage grows with system memory size 2012-05-11 13:26 ` Don Zickus @ 2012-05-15 5:57 ` Atsushi Kumagai 2012-05-15 12:35 ` Don Zickus 0 siblings, 1 reply; 34+ messages in thread From: Atsushi Kumagai @ 2012-05-15 5:57 UTC (permalink / raw) To: dzickus; +Cc: oomichi, d.hatayama, kexec, vgoyal Hello Don, On Fri, 11 May 2012 09:26:01 -0400 Don Zickus <dzickus@redhat.com> wrote: > > Thank you for your reply, Don, Vivek. > > > > As Don said, I tried to change the method to exclude free pages and > > planed to resolve the memory consumption issue after it, because > > parsing free list repeatedly may cause a performance issue. > > > > However, I'm thinking that to fix the size of memory consumption is more > > important than to resolve a performance issue for large system. > > > > So I'm afraid that I would like to change the plan as: > > > > 1. Implement "iterating filtering processing" to fix the size of memory > > consumption. At this stage, makedumpfile will parse free list repeatedly > > even though it may cause a performance issue. > > > > 2. Take care of the performance issue after the 1st step. > > Hello Atsushi-san, > > Hmm. The problem with the free list is that the addresses are in random > order, hence the reason to parse it repeatedly, correct? Yes. > I figured, now that you have a solution to parse the addresses in a linear > way (the changes you made a couple of weeks ago), you would just continue > with that. With that complete, we can look at the performance issues and > solve them then. > > But it is up to you. You are willing to do the work, so I will defer to > your judgement on how best to proceed. :-) What I wanted to tell you was I want to resolve the memory consumption issue as soon as possible. In other words, I think the method to exclude free pages is not so important. I'll continue to work with the method which is easy to implement. 
Thanks Atsushi Kumagai _______________________________________________ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: makedumpfile memory usage grows with system memory size 2012-05-15 5:57 ` Atsushi Kumagai @ 2012-05-15 12:35 ` Don Zickus 0 siblings, 0 replies; 34+ messages in thread From: Don Zickus @ 2012-05-15 12:35 UTC (permalink / raw) To: Atsushi Kumagai; +Cc: oomichi, d.hatayama, kexec, vgoyal On Tue, May 15, 2012 at 02:57:05PM +0900, Atsushi Kumagai wrote: > Hello Don, > > On Fri, 11 May 2012 09:26:01 -0400 > Don Zickus <dzickus@redhat.com> wrote: > > > > Thank you for your reply, Don, Vivek. > > > > > > As Don said, I tried to change the method to exclude free pages and > > > planed to resolve the memory consumption issue after it, because > > > parsing free list repeatedly may cause a performance issue. > > > > > > However, I'm thinking that to fix the size of memory consumption is more > > > important than to resolve a performance issue for large system. > > > > > > So I'm afraid that I would like to change the plan as: > > > > > > 1. Implement "iterating filtering processing" to fix the size of memory > > > consumption. At this stage, makedumpfile will parse free list repeatedly > > > even though it may cause a performance issue. > > > > > > 2. Take care of the performance issue after the 1st step. > > > > Hello Atsushi-san, > > > > Hmm. The problem with the free list is that the addresses are in random > > order, hence the reason to parse it repeatedly, correct? > > Yes. > > > I figured, now that you have a solution to parse the addresses in a linear > > way (the changes you made a couple of weeks ago), you would just continue > > with that. With that complete, we can look at the performance issues and > > solve them then. > > > > But it is up to you. You are willing to do the work, so I will defer to > > your judgement on how best to proceed. :-) > > What I wanted to tell you was I want to resolve the memory consumption issue > as soon as possible. In other words, I think the method to exclude free pages > is not so important. 
> I'll continue to work with the method which is easy to implement. Ok. I look forward to your results. Thanks for your effort. Cheers, Don _______________________________________________ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: makedumpfile memory usage grows with system memory size [not found] ` <20120427164649.9932a33f.kumagai-atsushi@mxc.nes.nec.co.jp> 2012-04-27 12:52 ` Don Zickus @ 2012-04-27 13:33 ` Vivek Goyal 2012-05-14 5:44 ` HATAYAMA Daisuke 2 siblings, 0 replies; 34+ messages in thread From: Vivek Goyal @ 2012-04-27 13:33 UTC (permalink / raw) To: Atsushi Kumagai; +Cc: dzickus, oomichi, d.hatayama, kexec On Fri, Apr 27, 2012 at 04:46:49PM +0900, Atsushi Kumagai wrote: > Hello, > > On Thu, 12 Apr 2012 16:47:14 +0900 (JST) > HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote: > > [..] > > > I said I want to avoid changing behavior based on kernel versions, > > > but it seems difficult as Vivek said. So, I will accept the changing > > > if it is necessary. > > > > > > Now, I will make two prototypes to consider the method to figure out > > > free pages. > > > > > > - a prototype based on _count > > > - a prototype based on PG_buddy (or _mapcount) > > > > > > If prototypes work fine, then we can select the method. > > > > I think the first one would work well and it's more accurate in > > meaning of free page. > > > > Although this might be not problematic in practice, new method that > > walks on page tables can lead to different result from the previous > > one that looks up free_list: looking at __free_pages(), it first > > decreases page->_count and then add the page to free_list, and looking > > at __alloc_pages(), it first retrieves a page from free_list and then > > set page->_count to 1. > > I tested the prototype based on _count and the other based on _mapcount. > So, the former didn't work as expected while the latter worked fine. > (The former excluded some used pages as free pages.) > > As a next step, I measured performance of the prototype based on _mapcount, > please see below. > > > Performance Comparison: > > Explanation: > - The new method supports 2.6.39 and later, and it needs vmlinux. 
> > - Now, the prototype doesn't support PG_buddy because the value of PG_buddy > is different depending on kernel configuration and it isn't stored into > VMCOREINFO. However, I'll extend get_length_of_free_pages() for PG_buddy > when the value of PG_buddy is stored into VMCOREINFO. > > - The prototype has dump_level "32" to use new method, but I don't think > to extend dump_level for official version. Thanks for your work. Yes, introducing a new dump_level for the new filtering method would not be appropriate. If it is found that going through struct pages and parsing _mapcount is not too bad from a performance point of view, then makedumpfile should just switch its default on newer kernels. Or, I am assuming that we will introduce a new option to makedumpfile anyway to select whether we want fixed-memory-usage filtering or not (assuming there is a significant performance penalty on large machines, 1TB or more). With that option we can do free page filtering using struct page; otherwise we can continue to go through the free pages list. Anyway, I think it is too early to discuss various user-visible options. Thanks Vivek _______________________________________________ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec ^ permalink raw reply [flat|nested] 34+ messages in thread
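[Editorial note: the proposal above — keep the old free_list walk on old kernels, switch the default on capable ones, with an option to opt out — amounts to a small dispatch function. A hedged sketch; KVER is a hypothetical helper mirroring the kernel's KERNEL_VERSION() encoding, and the 2.6.39 cutoff is the one Kumagai-san's prototype notes report for the new method.]

```c
#include <assert.h>

/* Hypothetical version encoding, mirroring the kernel's KERNEL_VERSION(). */
#define KVER(a, b, c) (((a) << 16) + ((b) << 8) + (c))

enum filter_method {
    FILTER_FREE_LIST,     /* old: walk free_list, big userspace bitmap   */
    FILTER_MEMMAP_WALK    /* new: walk struct page entries, fixed memory */
};

static enum filter_method pick_filter_method(unsigned int kver,
                                             int want_fixed_memory)
{
    /* The memmap walk needs kernel support (2.6.39+ per the prototype);
     * below that, fall back to the free_list walk unconditionally. */
    if (kver < KVER(2, 6, 39))
        return FILTER_FREE_LIST;
    /* On capable kernels a user option could still keep the old
     * behaviour when raw speed matters more than a fixed footprint. */
    return want_fixed_memory ? FILTER_MEMMAP_WALK : FILTER_FREE_LIST;
}
```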
* Re: makedumpfile memory usage grows with system memory size [not found] ` <20120427164649.9932a33f.kumagai-atsushi@mxc.nes.nec.co.jp> 2012-04-27 12:52 ` Don Zickus 2012-04-27 13:33 ` Vivek Goyal @ 2012-05-14 5:44 ` HATAYAMA Daisuke 2012-05-16 8:02 ` Atsushi Kumagai 2 siblings, 1 reply; 34+ messages in thread From: HATAYAMA Daisuke @ 2012-05-14 5:44 UTC (permalink / raw) To: kumagai-atsushi; +Cc: dzickus, oomichi, kexec, vgoyal From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Subject: Re: makedumpfile memory usage grows with system memory size Date: Fri, 27 Apr 2012 16:46:49 +0900 > - Now, the prototype doesn't support PG_buddy because the value of PG_buddy > is different depending on kernel configuration and it isn't stored into > VMCOREINFO. However, I'll extend get_length_of_free_pages() for PG_buddy > when the value of PG_buddy is stored into VMCOREINFO. Hello Kumagai-san, I'm now investigating how to perform filtering of free pages without kernel debuginfo. For this, I've investigated which of PG_buddy and _mapcount to use per kernel version. My current conclusion is that it's reasonable to do it as shown in the following table.

| kernel version   | Use PG_buddy? or _mapcount?                                  |
|------------------+--------------------------------------------------------------|
| 2.6.15 -- 2.6.16 | offsetof(page,_mapcount) := sizeof(ulong) + sizeof(atomic_t) |
| 2.6.17 -- 2.6.26 | PG_buddy := 19                                               |
| 2.6.27 -- 2.6.36 | PG_buddy := 18                                               |
| 2.6.37 and later | offsetof(page,_mapcount) := under investigation              |

In summary: PG_buddy was first introduced in 2.6.17 as bit 19, to fix a race bug leading to lru list corruptions, and from 2.6.17 to 2.6.26 it was defined with the macro preprocessor. In 2.6.27 enum pageflags was introduced for ease of page flags maintenance, and its value changed to 18. In 2.6.37 it was removed, and it no longer exists in later kernel versions. My quick feeling is that solving the dependency on PG_buddy is simpler than solving the dependency on _mapcount for 2.6.17 through 2.6.36.
In 2.6.15 and 2.6.16, PG_buddy had not yet been introduced, so we need to rely on _mapcount. Solving the _mapcount dependency in general, across all supported kernel versions, is very complex, but on these two versions only, the definition of struct page begins with the following layout. I think it's not too complex to hardcode the offset of _mapcount for these two kernel versions only: that is, sizeof(unsigned long) + sizeof(atomic_t), where atomic_t is in fact struct { volatile int counter } on all platforms.

struct page {
        unsigned long flags;    /* Atomic flags, some possibly
                                 * updated asynchronously */
        atomic_t _count;        /* Usage count, see below. */
        atomic_t _mapcount;     /* Count of ptes mapped in mms,
        ...

In the period when PG_buddy was defined as an enumeration value, its value depends on CONFIG_PAGEFLAGS_EXTENDED. In commit e20b8cca760ed2a6abcfe37ef56f2306790db648, PG_head and PG_tail were introduced, and they are positioned before PG_buddy if CONFIG_PAGEFLAGS_EXTENDED is set; then the PG_buddy value becomes 19. However, its users are mips, um and xtensa only:

$ git grep "CONFIG_PAGEFLAGS_EXTENDED"
arch/mips/configs/db1300_defconfig:CONFIG_PAGEFLAGS_EXTENDED=y
arch/um/defconfig:CONFIG_PAGEFLAGS_EXTENDED=y
arch/xtensa/configs/iss_defconfig:CONFIG_PAGEFLAGS_EXTENDED=y
arch/xtensa/configs/s6105_defconfig:CONFIG_PAGEFLAGS_EXTENDED=y
include/linux/page-flags.h:#ifdef CONFIG_PAGEFLAGS_EXTENDED
include/linux/page-flags.h:#ifdef CONFIG_PAGEFLAGS_EXTENDED
mm/memory-failure.c:#ifdef CONFIG_PAGEFLAGS_EXTENDED
mm/page_alloc.c:#ifdef CONFIG_PAGEFLAGS_EXTENDED

and makedumpfile doesn't support any of these platforms now, so we don't need to consider this case further. On 2.6.37 and later kernels, we must use _mapcount. I'm now looking into how to get the offset of _mapcount in each kernel version without kernel debug information. But the page structure has changed considerably in recent kernels, so I guess hardcoding the offsets gets more complicated.
Anyway, I think it is better to add the _mapcount information to VMCOREINFO
upstream as soon as possible.

Thanks.
HATAYAMA, Daisuke

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
* Re: makedumpfile memory usage grows with system memory size 2012-05-14 5:44 ` HATAYAMA Daisuke @ 2012-05-16 8:02 ` Atsushi Kumagai 2012-05-17 0:21 ` HATAYAMA Daisuke 0 siblings, 1 reply; 34+ messages in thread From: Atsushi Kumagai @ 2012-05-16 8:02 UTC (permalink / raw) To: d.hatayama; +Cc: dzickus, oomichi, kexec, vgoyal Hello HATAYAMA-san, On Mon, 14 May 2012 14:44:28 +0900 (JST) HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote: > From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> > Subject: Re: makedumpfile memory usage grows with system memory size > Date: Fri, 27 Apr 2012 16:46:49 +0900 > > > - Now, the prototype doesn't support PG_buddy because the value of PG_buddy > > is different depending on kernel configuration and it isn't stored into > > VMCOREINFO. However, I'll extend get_length_of_free_pages() for PG_buddy > > when the value of PG_buddy is stored into VMCOREINFO. > > Hello Kumagai san, > > I'm now investigating how to perform filtering free pages without > kernel debuginfo. For this, I've investigated which of PG_buddy and > _mapcount to use in kernel versions. In the current conclusion, it's > reasonable to do that as shown in the following table. > > | kernel version | Use PG_buddy? or _mapcount? | > |------------------+----------------------------------------------------------| > | 2.6.15 -- 2.6.16 | offsetof(page,_mapcount):=sizeof(ulong)+sizeof(atomic_t) | > | 2.6.17 -- 2.6.26 | PG_buddy := 19 | > | 2.6.27 -- 2.6.36 | PG_buddy := 18 | > | 2.6.37 and later | offsetof(page,_mapcount):= under investigation | | Thank you for your investigation, it's very helpful ! > In summary: PG_buddy was first introduced at 2.6.17 as 19 to fix some > race bug leading to lru list corruptions, and from 2.6.17 to 2.6.26, > it had been defined using macro preprocessor. At 2.6.27 enum pageflags > was introduced for ease of page flags maintainance and its value > changed to 18. At 2.6.37, it was removed, and it no longer exists in > later kernel versions. 
> > My quick feeling is that solving dependency of PG_buddy is simler than > that of _mapcount from 2.6.17 to 2.6.36. > > From 2.6.15 to 2.6.16, PG_buddy has not been introduced so we need to > rely on _mapcount. It's very complex to solve _mapcount dependency in > general on all supported kernel versions, but only on both kernel > versions, definition of struct page begins with the following > layout. I think it's not so much complex to hardcode offset of > _mapcount for these two kernel versions only: that is, sizeof(unsigned > long) + sizeof(atomic_t) which is in fact struct { volatile int > counter } on all platforms. > > struct page { > unsigned long flags; /* Atomic flags, some possibly > * updated asynchronously */ > atomic_t _count; /* Usage count, see below. */ > atomic_t _mapcount; /* Count of ptes mapped in mms, > ... > > In the period of PG_buddy is defined as enumeration value, PG_buddy > value depends on CONFIG_PAGEFLAGS_EXTENDED. At commit > e20b8cca760ed2a6abcfe37ef56f2306790db648, PG_head and PG_tail were > introduced and they are positioned before PG_buddy if > CONFIG_PAGEFLAGS_EXTENDED is set; then PG_buddy value becomes > 19. However, its users are mips, um and xtensa only as: > > $ git grep "CONFIG_PAGEFLAGS_EXTENDED" > arch/mips/configs/db1300_defconfig:CONFIG_PAGEFLAGS_EXTENDED=y > arch/um/defconfig:CONFIG_PAGEFLAGS_EXTENDED=y > arch/xtensa/configs/iss_defconfig:CONFIG_PAGEFLAGS_EXTENDED=y > arch/xtensa/configs/s6105_defconfig:CONFIG_PAGEFLAGS_EXTENDED=y > include/linux/page-flags.h:#ifdef CONFIG_PAGEFLAGS_EXTENDED > include/linux/page-flags.h:#ifdef CONFIG_PAGEFLAGS_EXTENDED > mm/memory-failure.c:#ifdef CONFIG_PAGEFLAGS_EXTENDED > mm/page_alloc.c:#ifdef CONFIG_PAGEFLAGS_EXTENDED > > and makedumpfile doesn't support any of these platforms now. So we > don't need to consider this case more. > > On 2.6.37 and the later kernels, we must use _mapcount. 
> I'm now looking into how to get offset of _mapcount in each kernel version
> without kernel debug information. But page structure has changed
> considerably on recent kernels so I guess the way hardcoding them gets
> more complicated.
>
> Anyway, I think it better to add _mapcount information to VMCOREINFO
> on upstream as soon as possible.

I think using _mapcount is the better way. But we haven't definitely decided
to use _mapcount, and even if we do decide to use it, there are still
problems with using it. For example, the upstream kernel (v3.4-rc7) has
_mapcount in a union, so we need information to judge whether the found data
is _mapcount or not. More investigation is needed, and I think it's too
early to send the request to the upstream kernel.

I plan to finish the work to reduce memory consumption by the end of June,
and I will continue to discuss performance issues. Therefore, the request
will be delayed until July or August.

Thanks
Atsushi Kumagai

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
* Re: makedumpfile memory usage grows with system memory size
2012-05-16 8:02 ` Atsushi Kumagai
@ 2012-05-17 0:21 ` HATAYAMA Daisuke
0 siblings, 0 replies; 34+ messages in thread
From: HATAYAMA Daisuke @ 2012-05-17 0:21 UTC (permalink / raw)
To: kumagai-atsushi; +Cc: dzickus, oomichi, kexec, vgoyal

From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Subject: Re: makedumpfile memory usage grows with system memory size
Date: Wed, 16 May 2012 17:02:30 +0900

> On Mon, 14 May 2012 14:44:28 +0900 (JST)
> HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:
> > Anyway, I think it better to add _mapcount information to VMCOREINFO
> > on upstream as soon as possible.
>
> I think it's better way to use _mapcount.
> But we don't certainly decide to use _mapcount and even if we decide to use it,
> we still have problems to use it.
> For example, the upstream kernel(v3.4-rc7) has _mapcount in union, we need
> a information to judge whether the found data is _mapcount or not.
> So, more investigation is needed and I think it's too early to send the request
> to upstream kernel.

Taking a quick look at the other part of the union to which _mapcount
belongs---inuse, objects, frozen---these fields appear to be used by the
SLUB allocator. A page with PG_slab set appears to use that other part of
the union rather than _mapcount. This means that to decide whether
_mapcount can be used, it's necessary to first investigate how the SLUB
allocator works.

> I plan to finish working to reduce memory consumption by the end of June,
> and I will continue to discuss performance issues.
> Therefore, the request will be delayed until July or August.

I'll wait for your patch on memory consumption and will give feedback on
it. I'll also look into the possibility of filtering free memory in
constant space on recent kernels.

Thanks.
HATAYAMA, Daisuke

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
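The union problem discussed in this exchange amounts to a gate on PG_slab: only interpret the word at _mapcount's offset as a mapcount when the page is not a slab page. A minimal sketch, assuming a PG_slab bit position of 7 purely for illustration (not read from a real kernel):

```c
#include <assert.h>

/* Assumed bit position of PG_slab for this illustration only. */
#define PG_SLAB 7

/* Returns 1 if the raw word at _mapcount's offset may be trusted as a
 * mapcount, 0 if it likely holds SLUB bookkeeping (inuse/objects/frozen)
 * instead, since SLUB reuses that union for slab pages. */
int mapcount_usable(unsigned long page_flags)
{
    return !((page_flags >> PG_SLAB) & 1UL);
}
```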
* Re: makedumpfile memory usage grows with system memory size
2012-04-06 1:12 ` HATAYAMA Daisuke
2012-04-06 8:59 ` Atsushi Kumagai
@ 2012-04-09 19:00 ` Vivek Goyal
1 sibling, 0 replies; 34+ messages in thread
From: Vivek Goyal @ 2012-04-09 19:00 UTC (permalink / raw)
To: HATAYAMA Daisuke; +Cc: dzickus, oomichi, kumagai-atsushi, kexec

On Fri, Apr 06, 2012 at 10:12:12AM +0900, HATAYAMA Daisuke wrote:

[..]
> So now we can walk on the memmap array also for free pages like other
> kinds of memory. The question I have now is why the current
> implementation was chosen. Is there any difference between two ways?

I don't know, but I am guessing that going through the buddy allocator data
structures may be faster when there is tons of memory in the system and the
number of free pages is small.

Thanks
Vivek

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
* Re: makedumpfile memory usage grows with system memory size 2012-03-29 8:09 ` Ken'ichi Ohmichi 2012-03-29 12:56 ` HATAYAMA Daisuke @ 2012-03-29 13:05 ` Don Zickus 2012-03-30 9:43 ` Atsushi Kumagai 2012-04-02 17:15 ` Michael Holzheu 2 siblings, 1 reply; 34+ messages in thread From: Don Zickus @ 2012-03-29 13:05 UTC (permalink / raw) To: Ken'ichi Ohmichi; +Cc: kexec Hi Ken'ichi-san, On Thu, Mar 29, 2012 at 05:09:18PM +0900, Ken'ichi Ohmichi wrote: > > Hi Don-san, > > On Wed, 28 Mar 2012 17:22:04 -0400 > Don Zickus <dzickus@redhat.com> wrote: > > > > I was talking to Vivek about kdump memory requirements and he mentioned > > that they vary based on how much system memory is used. > > > > I was interested in knowing why that was and again he mentioned that > > makedumpfile needed lots of memory if it was running on a large machine > > (for example 1TB of system memory). > > > > Looking through the makedumpfile README and using what Vivek remembered of > > makedumpfile, we gathered that as the number of pages grows, the more > > makedumpfile has to temporarily store the information in memory. The > > possible reason was to calculate the size of the file before it was copied > > to its final destination? > > makedumpfile uses the system memory of 2nd-kernel for a bitmap if RHEL. > The bitmap represents each page of 1st-kernel is excluded or not. > So the bitmap size depends on 1st-kernel's system memory. > > makedumpfile creates a file /tmp/kdump_bitmapXXXXXX as the bitmap, > and the file is created on 2nd-kernel's memory if RHEL, because > RHEL does not mount a root filesystem when 2nd-kernel is running. Ok. > > > > I was curious if that was true and if it was, would it be possible to only > > process memory in chunks instead of all at once. > > > > The idea is that a machine with 4Gigs of memory should consume the same > > the amount of kdump runtime memory as a 1TB memory system. 
> > > > Just trying to research ways to keep the memory requirements consistent > > across all memory ranges. > > I think the above purpose is good, and I don't have any idea for reducing > the bitmap size. And now I am out of makedumpfile development. > Kumagai-san is the makedumpfile maintainer now, and he will help you. Thanks for the feedback, I'll wait for Kumagai-san's response then. Cheers, Don _______________________________________________ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: makedumpfile memory usage grows with system memory size 2012-03-29 13:05 ` Don Zickus @ 2012-03-30 9:43 ` Atsushi Kumagai 2012-03-30 13:19 ` Don Zickus 0 siblings, 1 reply; 34+ messages in thread From: Atsushi Kumagai @ 2012-03-30 9:43 UTC (permalink / raw) To: dzickus; +Cc: oomichi, kexec Hello Don, > > On Thu, 29 Mar 2012 09:05:14 -0400 > > Don Zickus <dzickus@redhat.com> wrote: > > > > > Hi Ken'ichi-san, > > > > > > On Thu, Mar 29, 2012 at 05:09:18PM +0900, Ken'ichi Ohmichi wrote: > > > > > > > > Hi Don-san, > > > > > > > > On Wed, 28 Mar 2012 17:22:04 -0400 > > > > Don Zickus <dzickus@redhat.com> wrote: > > > > > > > > > > I was talking to Vivek about kdump memory requirements and he mentioned > > > > > that they vary based on how much system memory is used. > > > > > > > > > > I was interested in knowing why that was and again he mentioned that > > > > > makedumpfile needed lots of memory if it was running on a large machine > > > > > (for example 1TB of system memory). > > > > > > > > > > Looking through the makedumpfile README and using what Vivek remembered of > > > > > makedumpfile, we gathered that as the number of pages grows, the more > > > > > makedumpfile has to temporarily store the information in memory. The > > > > > possible reason was to calculate the size of the file before it was copied > > > > > to its final destination? > > > > > > > > makedumpfile uses the system memory of 2nd-kernel for a bitmap if RHEL. > > > > The bitmap represents each page of 1st-kernel is excluded or not. > > > > So the bitmap size depends on 1st-kernel's system memory. > > > > > > > > makedumpfile creates a file /tmp/kdump_bitmapXXXXXX as the bitmap, > > > > and the file is created on 2nd-kernel's memory if RHEL, because > > > > RHEL does not mount a root filesystem when 2nd-kernel is running. > > > > > > Ok. Does setting TMPDIR solve your problem ? Please refer to the man page. 
ENVIRONMENT VARIABLES
       TMPDIR This environment variable is for a temporary memory bitmap
              file. If your machine has a lots of memory and you use tmpfs
              on /tmp, makedumpfile can fail for a little memory in the 2nd
              kernel because makedumpfile makes a very large temporary
              memory bitmap file in this case. To avoid this failure, you
              can set a TMPDIR environment variable. If you do not set a
              TMPDIR environment variable, makedumpfile uses /tmp directory
              for a temporary bitmap file as a default.

On the other hand, I'm considering the enhancement suggested by
Hatayama-san now.

Thanks
Atsushi Kumagai

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
* Re: makedumpfile memory usage grows with system memory size 2012-03-30 9:43 ` Atsushi Kumagai @ 2012-03-30 13:19 ` Don Zickus 0 siblings, 0 replies; 34+ messages in thread From: Don Zickus @ 2012-03-30 13:19 UTC (permalink / raw) To: Atsushi Kumagai; +Cc: oomichi, kexec On Fri, Mar 30, 2012 at 06:43:34PM +0900, Atsushi Kumagai wrote: > Hello Don, > Does setting TMPDIR solve your problem ? Please refer to the man page. > > > ENVIRONMENT VARIABLES > TMPDIR This environment variable is for a temporary memory bitmap > file. If your machine has a lots of memory and you use tmpfs > on /tmp, makedumpfile can fail for a little memory in the 2nd > kernel because makedumpfile makes a very large temporary memory > bitmap file in this case. To avoid this failure, you can set a > TMPDIR environment variable. If you do not set a TMPDIR envi- > ronment variable, makedumpfile uses /tmp directory for a tempo- > rary bitmap file as a default. I do not think it will because we run the second kernel inside the initramfs and do not mount any extra disks. So the only location available for the temporary memory bitmap would be memory either tmpfs or something else. Regardless the file ends up in memory. > > > On the other hand, I'm considering the enhancement suggested by Hatayama-san now. His idea looks interesting if it works. Thanks. Cheers, Don _______________________________________________ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: makedumpfile memory usage grows with system memory size 2012-03-29 8:09 ` Ken'ichi Ohmichi 2012-03-29 12:56 ` HATAYAMA Daisuke 2012-03-29 13:05 ` Don Zickus @ 2012-04-02 17:15 ` Michael Holzheu 2012-04-06 8:09 ` Atsushi Kumagai 2 siblings, 1 reply; 34+ messages in thread From: Michael Holzheu @ 2012-04-02 17:15 UTC (permalink / raw) To: Ken'ichi Ohmichi; +Cc: Don Zickus, kexec Hello Ken'ichi, On Thu, 2012-03-29 at 17:09 +0900, Ken'ichi Ohmichi wrote: > On Wed, 28 Mar 2012 17:22:04 -0400 > makedumpfile uses the system memory of 2nd-kernel for a bitmap if RHEL. > The bitmap represents each page of 1st-kernel is excluded or not. > So the bitmap size depends on 1st-kernel's system memory. Does this mean that makedumpfile's memory demand linearly grows with 1 bit per page of 1-st kernel's memory? Is that the exact factor, if /tmp is in memory? Or is there any other memory allocation that is not constant regarding the 1-st kernel memory size? Michael _______________________________________________ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: makedumpfile memory usage grows with system memory size
2012-04-02 17:15 ` Michael Holzheu
@ 2012-04-06 8:09 ` Atsushi Kumagai
2012-04-11 8:04 ` Michael Holzheu
0 siblings, 1 reply; 34+ messages in thread
From: Atsushi Kumagai @ 2012-04-06 8:09 UTC (permalink / raw)
To: holzheu; +Cc: dzickus, oomichi, kexec

Hello Michael,

On Mon, 02 Apr 2012 19:15:33 +0200
Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:

> Hello Ken'ichi,
>
> On Thu, 2012-03-29 at 17:09 +0900, Ken'ichi Ohmichi wrote:
> > On Wed, 28 Mar 2012 17:22:04 -0400
> > makedumpfile uses the system memory of 2nd-kernel for a bitmap if RHEL.
> > The bitmap represents each page of 1st-kernel is excluded or not.
> > So the bitmap size depends on 1st-kernel's system memory.
>
> Does this mean that makedumpfile's memory demand linearly grows with 1
> bit per page of 1-st kernel's memory?

Yes, you are right. (Precisely, 2 bits per page.)

> Is that the exact factor, if /tmp is in memory? Or is there any other
> memory allocation that is not constant regarding the 1-st kernel memory
> size?

The bitmap file is the main cause of memory consumption if the 2nd kernel
uses only an initramfs. There are other parts where the size of allocated
memory varies with the 1st kernel's memory size, but they don't have a big
influence.

Thanks
Atsushi Kumagai

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
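The 2-bits-per-page figure above can be turned into a rough size estimate. A minimal sketch, assuming 4 KiB pages and ignoring the smaller variable allocations Kumagai-san mentions:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of the temporary bitmap file makedumpfile needs:
 * 2 bits per page of 1st-kernel memory, per the figure in this mail. */
uint64_t bitmap_bytes(uint64_t mem_bytes, uint64_t page_size)
{
    uint64_t pages = mem_bytes / page_size;
    return pages * 2 / 8;   /* 2 bits per page, 8 bits per byte */
}
```

With 4 KiB pages this gives 256 KiB for a 4 GiB machine but 64 MiB for a 1 TiB machine, which is exactly the linear growth Michael asked about.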
* Re: makedumpfile memory usage grows with system memory size
2012-04-06 8:09 ` Atsushi Kumagai
@ 2012-04-11 8:04 ` Michael Holzheu
2012-04-12 8:49 ` Atsushi Kumagai
0 siblings, 1 reply; 34+ messages in thread
From: Michael Holzheu @ 2012-04-11 8:04 UTC (permalink / raw)
To: Atsushi Kumagai; +Cc: dzickus, oomichi, kexec

Hello Kumagai,

On Fri, 2012-04-06 at 17:09 +0900, Atsushi Kumagai wrote:
> Hello Michael,
>
> On Mon, 02 Apr 2012 19:15:33 +0200
> Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:
>
> > Hello Ken'ichi,
> >
> > On Thu, 2012-03-29 at 17:09 +0900, Ken'ichi Ohmichi wrote:
> > > On Wed, 28 Mar 2012 17:22:04 -0400
> > > makedumpfile uses the system memory of 2nd-kernel for a bitmap if RHEL.
> > > The bitmap represents each page of 1st-kernel is excluded or not.
> > > So the bitmap size depends on 1st-kernel's system memory.
> >
> > Does this mean that makedumpfile's memory demand linearly grows with 1
> > bit per page of 1-st kernel's memory?
>
> Yes, you are right. (Precisely, 2 bit per page.)
>
> > Is that the exact factor, if /tmp is in memory? Or is there any other
> > memory allocation that is not constant regarding the 1-st kernel memory
> > size?
>
> bitmap file is main cause of memory consuming if 2nd kernel uses initramfs
> only. There are other parts where the size of allocated memory varies based
> on 1-st kernel memory size, but they don't have big influence.

Thanks for the explanation.

I ask because I want to exactly calculate the required size for the
crashkernel parameter. On s390 the kdump kernel memory consumption is fixed
and not dependent on the 1st kernel memory size. So based on your
explanation, I will use:

crashkernel=<base size> + <variable size>

where

<variable size> = <pages of 1st kernel> * (2 + x) / 8

and where "x" is the variable makedumpfile memory allocation that is on top
of the bitmap allocation. What would be a good value for "x"?

Michael

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
* Re: makedumpfile memory usage grows with system memory size
2012-04-11 8:04 ` Michael Holzheu
@ 2012-04-12 8:49 ` Atsushi Kumagai
0 siblings, 0 replies; 34+ messages in thread
From: Atsushi Kumagai @ 2012-04-12 8:49 UTC (permalink / raw)
To: holzheu; +Cc: dzickus, oomichi, kexec

Hello Michael,

On Wed, 11 Apr 2012 10:04:03 +0200
Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:

> > bitmap file is main cause of memory consuming if 2nd kernel uses initramfs
> > only. There are other parts where the size of allocated memory varies based
> > on 1-st kernel memory size, but they don't have big influence.
>
> Thanks for the explanation.
>
> I ask because I want to exactly calculate the required size for the
> crashkernel parameter. On s390 the kdump kernel memory consumption is
> fix and not dependent on the 1st kernel memory size. So based on your
> explanation I will use:
>
> crashkernel=<base size> + <variable size>
>
> where
>
> <variable size> = <pages of 1st kernel> * (2 + x) / 8
>
> where "x" is the variable makedumpfile memory allocation that is on top
> of the bitmap allocation. What would be a good value for "x"?

I'm sorry that I don't have the exact number, but even the second biggest
memory allocation would require under 1/100 of the bitmap size, so I think
0.01 is usually enough for "x".

Thanks
Atsushi Kumagai

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
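Michael's formula with the x = 0.01 suggested here can be sketched as follows. The 4 KiB page size in the example is an assumption; the function name is mine, and the result is a planning estimate, not a value makedumpfile computes.

```c
#include <assert.h>
#include <stdint.h>

/* <variable size> = <pages of 1st kernel> * (2 + x) / 8 bytes,
 * with x = 0.01 as suggested above.  Computed in integer milli-bits
 * to avoid floating point: (2 + 0.01) bits = 2010 milli-bits. */
uint64_t crashkernel_variable_bytes(uint64_t mem_bytes, uint64_t page_size)
{
    uint64_t pages = mem_bytes / page_size;
    return pages * 2010 / (8 * 1000);
}
```

For a 1 TiB machine with 4 KiB pages this comes out just above the 64 MiB bitmap alone, matching the claim that everything else stays under 1/100 of the bitmap size.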
* Re: makedumpfile memory usage grows with system memory size
@ 2012-04-02 6:53 tachibana
0 siblings, 0 replies; 34+ messages in thread
From: tachibana @ 2012-04-02 6:53 UTC (permalink / raw)
To: Don Zickus; +Cc: kexec

Hi Don,

On 2012/03/30 09:19:16 -0400, Don Zickus <dzickus@redhat.com> wrote:
> On Fri, Mar 30, 2012 at 06:43:34PM +0900, Atsushi Kumagai wrote:
> > Hello Don,
> > Does setting TMPDIR solve your problem ? Please refer to the man page.
> >
> > ENVIRONMENT VARIABLES
> >        TMPDIR This environment variable is for a temporary memory bitmap
> >               file. If your machine has a lots of memory and you use tmpfs
> >               on /tmp, makedumpfile can fail for a little memory in the 2nd
> >               kernel because makedumpfile makes a very large temporary
> >               memory bitmap file in this case. To avoid this failure, you
> >               can set a TMPDIR environment variable. If you do not set a
> >               TMPDIR environment variable, makedumpfile uses /tmp directory
> >               for a temporary bitmap file as a default.
>
> I do not think it will because we run the second kernel inside the
> initramfs and do not mount any extra disks. So the only location available
> for the temporary memory bitmap would be memory either tmpfs or something
> else. Regardless the file ends up in memory.

If the file system for the dump file is on the local system, wouldn't it be
effective to specify a directory in that same file system as TMPDIR?

Thanks
tachibana

> > On the other hand, I'm considering the enhancement suggested by Hatayama-san now.
>
> His idea looks interesting if it works. Thanks.
>
> Cheers,
> Don
>
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
Thread overview: 34+ messages
2012-03-28 21:22 makedumpfile memory usage grows with system memory size Don Zickus
2012-03-29 8:09 ` Ken'ichi Ohmichi
2012-03-29 12:56 ` HATAYAMA Daisuke
2012-03-29 13:25 ` Don Zickus
2012-03-30 0:51 ` HATAYAMA Daisuke
2012-04-02 7:46 ` Atsushi Kumagai
2012-04-05 6:52 ` HATAYAMA Daisuke
2012-04-05 14:34 ` Vivek Goyal
2012-04-06 1:12 ` HATAYAMA Daisuke
2012-04-06 8:59 ` Atsushi Kumagai
2012-04-06 9:29 ` HATAYAMA Daisuke
2012-04-09 18:57 ` Vivek Goyal
2012-04-09 23:58 ` HATAYAMA Daisuke
2012-04-10 12:52 ` Vivek Goyal
2012-04-12 3:40 ` Atsushi Kumagai
2012-04-12 7:47 ` HATAYAMA Daisuke
[not found] ` <20120427164649.9932a33f.kumagai-atsushi@mxc.nes.nec.co.jp>
2012-04-27 12:52 ` Don Zickus
2012-05-11 1:19 ` Atsushi Kumagai
2012-05-11 13:26 ` Don Zickus
2012-05-15 5:57 ` Atsushi Kumagai
2012-05-15 12:35 ` Don Zickus
2012-04-27 13:33 ` Vivek Goyal
2012-05-14 5:44 ` HATAYAMA Daisuke
2012-05-16 8:02 ` Atsushi Kumagai
2012-05-17 0:21 ` HATAYAMA Daisuke
2012-04-09 19:00 ` Vivek Goyal
2012-03-29 13:05 ` Don Zickus
2012-03-30 9:43 ` Atsushi Kumagai
2012-03-30 13:19 ` Don Zickus
2012-04-02 17:15 ` Michael Holzheu
2012-04-06 8:09 ` Atsushi Kumagai
2012-04-11 8:04 ` Michael Holzheu
2012-04-12 8:49 ` Atsushi Kumagai
2012-04-02 6:53 tachibana