public inbox for kexec@lists.infradead.org
* 32TB kdump
@ 2013-06-21 14:17 Cliff Wickman
  2013-06-27 21:17 ` Vivek Goyal
  0 siblings, 1 reply; 12+ messages in thread
From: Cliff Wickman @ 2013-06-21 14:17 UTC (permalink / raw)
  To: kexec


I have been testing recent kernels and kexec-tools for kdump of large
memories, and found good results.

--------------------------------
UV2000  memory: 32TB  crashkernel=2G@4G
command line  /usr/bin/makedumpfile --non-cyclic -c --message-level 23 -d 31 \
   --map-size 4096 -x /boot/vmlinux-3.10.0-rc5-linus-cpw+ /proc/vmcore \
   /tmp/cpw/dumpfile

page scanning  570 sec.
copying data  5795 sec. (72G)
(The data copy ran out of disk space at 23%, so the time and size above are
 extrapolated.)

--------------------------------
UV1000  memory: 8.85TB  crashkernel=1G@5G
command line  /usr/bin/makedumpfile --non-cyclic -c --message-level 23 -d 31 \
   --map-size 4096 -x /boot/vmlinux-3.9.6-cpw-medusa /proc/vmcore \
   /tmp/cpw/dumpfile

page scanning  175 sec.
copying data  2085 sec. (15G)
(The data copy ran out of disk space at 60%, so the time and size above are
 extrapolated.)

Notes/observations:
- These systems were idle, so this is the capture of basically system
  memory only.
- Both stable 3.9.6 and 3.10.0-rc5 worked.
- Use of crashkernel=1G,high was usually problematic.  I assume some problem
  with a conflict with something else using high memory.  I always use
  the form like 1G@5G, finding memory by examining /proc/iomem.
- Time for copying data is dominated by data compression.  Writing 15G of
  compressed data to /dev/null takes about 35min.  Writing the same data
  but uncompressed (140G) to /dev/null takes about 6min.
  So a good workaround for a very large system might be to dump uncompressed
  to an SSD.
  The multi-threading of the crash kernel would produce a big gain.
- Use of mmap on /proc/vmcore increased page scanning speed from 4.4 minutes
  to 3 minutes.  It also increased data copying speed (unexpectedly) from
  38min. to 35min.
  So I think it is worthwhile to push Hatayama's 9-patch set into the kernel.
- I applied a 5-patch set from Takao Indoh to fix reset_devices handling of
  PCI devices.
  And I applied 3 kernel hacks of my own:
    - making a "Crash kernel low" section in /proc/iomem
    - make crashkernel avoid some things in pci_swiotlb_detect_override(),
      pci_swiotlb_detect_4gb() and register_mem_sect_under_node()
    - doing a crashkernel return from cpu_up()
  I don't understand why these should be necessary for my kernels but are
  not reported as problems elsewhere. I'm still investigating and will discuss
  those patches separately.
- my makedumpfile is an mmap-using version, with about 10 patches applied. I'll
  check which of those are not in the common version and discuss separately.
- my kexec is version 2.0.4 with 3 patches applied. I'll check which of those
  are not in the common version and discuss separately.
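
The compression-vs-disk comparison above can be reproduced on any box with
a quick pipe test (an illustrative sketch using generic tools and trivially
compressible input, not makedumpfile itself; it only demonstrates the
method of isolating compression cost):

```shell
# Pipe a fixed amount of data into a null sink, with and without
# compression; the time difference approximates pure compression cost.
time dd if=/dev/zero bs=1M count=256 2>/dev/null | cat > /dev/null
time dd if=/dev/zero bs=1M count=256 2>/dev/null | gzip -1 > /dev/null
```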

-Cliff

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

* Re: 32TB kdump
  2013-06-21 14:17 32TB kdump Cliff Wickman
@ 2013-06-27 21:17 ` Vivek Goyal
  2013-06-28 21:56   ` Cliff Wickman
  2013-07-01  0:55   ` HATAYAMA Daisuke
  0 siblings, 2 replies; 12+ messages in thread
From: Vivek Goyal @ 2013-06-27 21:17 UTC (permalink / raw)
  To: Cliff Wickman; +Cc: kexec

On Fri, Jun 21, 2013 at 09:17:14AM -0500, Cliff Wickman wrote:
> 
> I have been testing recent kernel and kexec-tools for doing kdump of large
> memories, and found good results.
> 
> --------------------------------
> UV2000  memory: 32TB  crashkernel=2G@4G

> command line  /usr/bin/makedumpfile --non-cyclic -c --message-level 23 -d 31 \
>    --map-size 4096 -x /boot/vmlinux-3.10.0-rc5-linus-cpw+ /proc/vmcore \
>    /tmp/cpw/dumpfile

Is --cyclic mode significantly slower for the above configuration? Cyclic
mode already uses 80% of available memory (I guess we are a little
conservative and could bump it to 90-95% of available memory). That
should mean that, by default, cyclic mode should be as fast as non-cyclic
mode.

An added benefit is that even if one reserves less memory, cyclic mode
will at least be able to save the dump (at the cost of some time).

> 
> page scanning  570 sec.
> copying data  5795 sec. (72G)
> (The data copy ran out of disk space at 23%, so the time and size above are
>  extrapolated.)

That's almost 110 minutes, approximately 2 hours to dump. I think that is
still a lot. How many people can afford to keep a machine dumping for
2 hours? They would rather bring the services back online.

So more work is needed in the scalability area. Page scanning seems to
have been not too bad; copying the data has taken the majority of the
time. Is that because of a slow disk?

BTW, in non-cyclic mode, 32TB of physical memory will require 2G just for
the bitmaps (2 bits per 4K page).  And then you require some memory for
other stuff (around 128MB). I am not sure how it worked for you just by
reserving 2G of RAM.
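
The 2G figure follows directly from the arithmetic (a quick sketch of the
calculation stated above):

```python
# Non-cyclic makedumpfile keeps bitmaps costing 2 bits per 4K page.
PAGE_SIZE = 4096
BITS_PER_PAGE = 2

def bitmap_bytes(mem_bytes: int) -> int:
    """Bitmap memory needed to cover mem_bytes of physical memory."""
    return mem_bytes // PAGE_SIZE * BITS_PER_PAGE // 8

TiB = 1 << 40
GiB = 1 << 30
print(bitmap_bytes(32 * TiB) / GiB)  # -> 2.0 (GiB of bitmap for 32TB RAM)
```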

> 
> --------------------------------
> UV1000  memory: 8.85TB  crashkernel=1G@5G
> command line  /usr/bin/makedumpfile --non-cyclic -c --message-level 23 -d 31 \
>    --map-size 4096 -x /boot/vmlinux-3.9.6-cpw-medusa /proc/vmcore \
>    /tmp/cpw/dumpfile
> 
> page scanning  175 sec.
> copying data  2085 sec. (15G)
> (The data copy ran out of disk space at 60%, so the time and size above are
>  extrapolated.)
> 
> Notes/observations:
> - These systems were idle, so this is the capture of basically system
>   memory only.
> - Both stable 3.9.6 and 3.10.0-rc5 worked.
> - Use of crashkernel=1G,high was usually problematic.  I assume some problem
>   with a conflict with something else using high memory.  I always use
>   the form like 1G@5G, finding memory by examining /proc/iomem.

Hmm..., do you think you need to reserve some low mem too for swiotlb
(in case you are not using an iommu)?

> - Time for copying data is dominated by data compression.  Writing 15G of
>   compressed data to /dev/null takes about 35min.  Writing the same data
>   but uncompressed (140G) to /dev/null takes about 6min.

Try using snappy or lzo for faster compression.
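
For example (assuming your makedumpfile was built with LZO and snappy
support; -l and -p select those compressors in place of -c's zlib):

```
makedumpfile -l -d 31 -x vmlinux /proc/vmcore dumpfile    # LZO
makedumpfile -p -d 31 -x vmlinux /proc/vmcore dumpfile    # snappy
```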

>   So a good workaround for a very large system might be to dump uncompressed
>   to an SSD.

Interesting.

>   The multi-threading of the crash kernel would produce a big gain.

Hatayama was once working on patches to bring up multiple cpus in the
second kernel. Not sure what happened to those patches.

> - Use of mmap on /proc/vmcore increased page scanning speed from 4.4 minutes
>   to 3 minutes.  It also increased data copying speed (unexpectedly) from
>   38min. to 35min.

Hmm.., so on large memory systems, mmap() will not help a lot? In those
systems dump times are dominated by disk speed and compression time.

So far I was thinking that ioremap() per page was the big issue, and you
had also once done the analysis that passing a page list to the kernel
made things significantly faster.

So on 32TB machines, if it is taking 2 hours to save the dump and mmap()
shortens it by only a few minutes, it really is not a significant win.

>   So I think it is worthwhile to push Hatayama's 9-patch set into the kernel.

I think his patches are in the -mm tree and should show up in the next
kernel release. But it really does not sound like much in the overall
scheme of things.

> - I applied a 5-patch set from Takao Indoh to fix reset_devices handling of
>   PCI devices.
>   And I applied 3 kernel hacks of my own:
>     - making a "Crash kernel low" section in /proc/iomem

And you did it because crashkernel=2G,high crashkernel=XM,low did not
work for you?

>     - make crashkernel avoid some things in pci_swiotlb_detect_override(),
>       pci_swiotlb_detect_4gb() and register_mem_sect_under_node()
>     - doing a crashkernel return from cpu_up()
>   I don't understand why these should be necessary for my kernels but are
>   not reported as problems elsewhere. I'm still investigating and will discuss
>   those patches separately.

Nobody might have tested it yet on such large machines, and these
problems might be present for everyone.

So it would be great if you could fix these in the upstream kernel.

Thanks
Vivek

* Re: 32TB kdump
  2013-06-27 21:17 ` Vivek Goyal
@ 2013-06-28 21:56   ` Cliff Wickman
  2013-07-01  0:42     ` HATAYAMA Daisuke
                       ` (2 more replies)
  2013-07-01  0:55   ` HATAYAMA Daisuke
  1 sibling, 3 replies; 12+ messages in thread
From: Cliff Wickman @ 2013-06-28 21:56 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: kexec

On Thu, Jun 27, 2013 at 05:17:25PM -0400, Vivek Goyal wrote:
> On Fri, Jun 21, 2013 at 09:17:14AM -0500, Cliff Wickman wrote:
> > 
> > I have been testing recent kernel and kexec-tools for doing kdump of large
> > memories, and found good results.
> > 
> > --------------------------------
> > UV2000  memory: 32TB  crashkernel=2G@4G
> 
> > command line  /usr/bin/makedumpfile --non-cyclic -c --message-level 23 -d 31 \
> >    --map-size 4096 -x /boot/vmlinux-3.10.0-rc5-linus-cpw+ /proc/vmcore \
> >    /tmp/cpw/dumpfile
> 
> Is --cyclic mode significantly slower for above configuration? Now cyclic
> mode already uses 80% of available memory (I guess we are little
> conservative and could bump it to 90 - 95% of available memory). That
> should mean that by default cyclic mode should be as fast as non-cyclic
> mode.
> 
> Added benefit is that even if one reserves less memory, cyclic mode
> will at least be able to save dump (at the cost of some time).

Cyclic mode is not significantly slower.  On an idle 2TB machine it can
scan pages in 60 seconds and then copy in 402.  Using non-cyclic, the scan
is 35 seconds and the copy about 395 -- but with crashkernel=512M
makedumpfile then runs out of memory and the crash kernel panics, so
the 30-or-so seconds saved are definitely not worth it.
I am able to dump an idle 2TB system in about 500 seconds in cyclic mode
with crashkernel=384M.
> 
> > 
> > page scanning  570 sec.
> > copying data  5795 sec. (72G)
> > (The data copy ran out of disk space at 23%, so the time and size above are
> >  extrapolated.)
> 
> That's almost 110 mins. Approximately 2 hrs to dump. I think it is still
> a lot. How many people can afford to keep a machine dumping for 2hrs. They
> > would rather bring the services back online.

It is a long time, agreed.  But it is a vast improvement over the hours
and hours (maybe 12 or more) it would have taken just to scan pages
before the fix for the per-page ioremap().
A 32T machine is probably a research engine rather than a server, and 2hrs
might be pretty acceptable to track down a system bug that's blocking some
important application.
 
> So more work needed in scalability area. And page scanning seems to have
> been not too bad. Copying data has taken majority of time. Is it because
> of slow disk.

I think compression is the bottleneck.

On an idle 2TB machine: (times in seconds)
                                copy time
uncompressed, to /dev/null       61
uncompressed, to file           336    (probably 37G, extrapolated; disk full)
compressed, to /dev/null        387
compressed, to file             402    (file 3.7G)

uncompressed disk time   336 - 61 = 275
compressed disk time    402 - 387 =  15
compress time            387 - 61 = 326
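
Spelling out the decomposition above (same measurements, in seconds):

```python
# Measured copy times on the idle 2TB machine (seconds).
to_null_raw, to_file_raw = 61, 336     # uncompressed
to_null_zlib, to_file_zlib = 387, 402  # compressed

# Writing to /dev/null removes disk cost, so the deltas isolate each part.
disk_time_raw = to_file_raw - to_null_raw      # writing ~37G: 275
disk_time_zlib = to_file_zlib - to_null_zlib   # writing 3.7G: 15
compress_time = to_null_zlib - to_null_raw     # pure compression: 326

print(disk_time_raw, disk_time_zlib, compress_time)  # -> 275 15 326
```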

> BTW, in non-cyclic mode, 32TB physical memory will require 2G just for
> bitmap (2bits per 4K page).  And then you require some memory for
> other stuff (around 128MB). I am not sure how did it work for you just
> by reserving 2G of RAM.

Could it be that this bitmap is being kept only partially in memory,
with the non-current parts in a file?
 
> > 
> > --------------------------------
> > UV1000  memory: 8.85TB  crashkernel=1G@5G
> > command line  /usr/bin/makedumpfile --non-cyclic -c --message-level 23 -d 31 \
> >    --map-size 4096 -x /boot/vmlinux-3.9.6-cpw-medusa /proc/vmcore \
> >    /tmp/cpw/dumpfile
> > 
> > page scanning  175 sec.
> > copying data  2085 sec. (15G)
> > (The data copy ran out of disk space at 60%, so the time and size above are
> >  extrapolated.)
> > 
> > Notes/observations:
> > - These systems were idle, so this is the capture of basically system
> >   memory only.
> > - Both stable 3.9.6 and 3.10.0-rc5 worked.
> > - Use of crashkernel=1G,high was usually problematic.  I assume some problem
> >   with a conflict with something else using high memory.  I always use
> >   the form like 1G@5G, finding memory by examining /proc/iomem.
> 
> Hmm..., do you think you need to reserve some low mem too for swiotlb. (In
> case you are not using iommu).

It is reserving 72M in low mem for swiotlb + 8M, but this seems not to be
enough.
I did not realize that I could specify crashkernel=xxx,high and
crashkernel=xxx,low together until you mentioned it below.  This seems
to solve my crashkernel=1G,high problem.  I need to specify
crashkernel=128M,low on some systems or else my crash kernel panics on
not finding enough low memory.
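
For the record, the combination described above would look like this on
the kernel command line (the sizes here are per-machine examples, not a
recommendation):

```
crashkernel=1G,high crashkernel=128M,low
```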
 
> > - Time for copying data is dominated by data compression.  Writing 15G of
> >   compressed data to /dev/null takes about 35min.  Writing the same data
> >   but uncompressed (140G) to /dev/null takes about 6min.
> 
> Try using snappy or lzo for faster compression.

I don't have liblzo2 or snappy-c.h; I must need to install some packages
on our build server.
Would you expect compression to be several times faster with those?
 
> >   So a good workaround for a very large system might be to dump uncompressed
> >   to an SSD.
> 
> Interesting.
> 
> >   The multi-threading of the crash kernel would produce a big gain.
> 
> Hatayama once was working on patches to bring up multiple cpus in second
> kernel. Not sure what happened to those patches.

I hope he pursues that. It is 'the' big performance issue remaining, I think.

 
> > - Use of mmap on /proc/vmcore increased page scanning speed from 4.4 minutes
> >   to 3 minutes.  It also increased data copying speed (unexpectedly) from
> >   38min. to 35min.
> 
> Hmm.., so on large memory systems, mmap() will not help a lot? In those
> > systems dump times are dominated by disk speed and compression time.
> 
> So far I was thinking that ioremap() per page was big issue and you
> also once had done the analysis that passing page list to kernel made
> things significantly faster.
> 
> So on 32TB machines if it is taking 2hrs to save dump and mmap() shortens
> it by only few minutes, it really is not significant win.
> 
> >   So I think it is worthwhile to push Hatayama's 9-patch set into the kernel.
> 
> I think his patches are in --mm tree and should show up in next kernel
> > release. But it really does not sound much in overall scheme of things.

Agreed.  Not a big speedup compared to multithreading the crash kernel.
 
> > - I applied a 5-patch set from Takao Indoh to fix reset_devices handling of
> >   PCI devices.
> >   And I applied 3 kernel hacks of my own:
> >     - making a "Crash kernel low" section in /proc/iomem
> 
> And you did it because crashkernel=2G,high crashkernel=XM,low did not
> work for you?
>
> >     - make crashkernel avoid some things in pci_swiotlb_detect_override(),
> >       pci_swiotlb_detect_4gb() and register_mem_sect_under_node()
> >     - doing a crashkernel return from cpu_up()
> >   I don't understand why these should be necessary for my kernels but are
> >   not reported as problems elsewhere. I'm still investigating and will discuss
> >   those patches separately.
> 
> Nobody might have tested it yet on such large machines and these problems
> might be present for everyone.
> 
> So would be great if you could fix these in upstream kernel.

In further testing I find that none of these kernel patches are needed if
I'm using the current kexec command and I don't try to bring the crash
kernel up to multiuser mode.
So the current kexec command works well for me, as does the 3.10 kernel.

I have a small wish list for makedumpfile. Nothing major, but I'll post
those later.


Could you give me an estimate of when kexec
 (as in git://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git)
and makedumpfile
 (as in git://git.code.sf.net/p/makedumpfile/code  mmap branch)
will be released?  We would like to advise the distros about what level
of those things we require.

-Cliff
-- 
Cliff Wickman
SGI
cpw@sgi.com
(651) 683-3824

* Re: 32TB kdump
  2013-06-28 21:56   ` Cliff Wickman
@ 2013-07-01  0:42     ` HATAYAMA Daisuke
  2013-07-01  2:57     ` Atsushi Kumagai
  2013-07-01 16:12     ` Vivek Goyal
  2 siblings, 0 replies; 12+ messages in thread
From: HATAYAMA Daisuke @ 2013-07-01  0:42 UTC (permalink / raw)
  To: Cliff Wickman; +Cc: kexec, Vivek Goyal

(2013/06/29 6:56), Cliff Wickman wrote:
> On Thu, Jun 27, 2013 at 05:17:25PM -0400, Vivek Goyal wrote:
>> On Fri, Jun 21, 2013 at 09:17:14AM -0500, Cliff Wickman wrote:
>>>
>>> I have been testing recent kernel and kexec-tools for doing kdump of large
>>> memories, and found good results.
>>>
>>> --------------------------------
>>> UV2000  memory: 32TB  crashkernel=2G@4G
>>
>>> command line  /usr/bin/makedumpfile --non-cyclic -c --message-level 23 -d 31 \
>>>     --map-size 4096 -x /boot/vmlinux-3.10.0-rc5-linus-cpw+ /proc/vmcore \
>>>     /tmp/cpw/dumpfile
>>
>> Is --cyclic mode significantly slower for above configuration? Now cyclic
>> mode already uses 80% of available memory (I guess we are little
>> conservative and could bump it to 90 - 95% of available memory). That
>> should mean that by default cyclic mode should be as fast as non-cyclic
>> mode.
>>
>> Added benefit is that even if one reserves less memory, cyclic mode
>> will at least be able to save dump (at the cost of some time).
>
> Cyclic mode is not significantly slower.  On an idle 2TB machine it can
> scan pages in 60 seconds then copy in 402.  Using non-cyclic the scan is
> 35 seconds and the copy about 395 -- but with crashkernel=512M
> makedumpfile then runs out of memory and the crash kernel panics, so
> the 30-or-so seconds saved are definitely not worth it.
> I am able to dump an idle 2TB system in about 500 seconds in cyclic mode
> and crashkernel=384M.
>>
>>>
>>> page scanning  570 sec.
>>> copying data  5795 sec. (72G)
>>> (The data copy ran out of disk space at 23%, so the time and size above are
>>>   extrapolated.)
>>
>> That's almost 110 mins. Approximately 2 hrs to dump. I think it is still
>> a lot. How many people can afford to keep a machine dumping for 2hrs. They
>> would rather bring the services back online.
>
> It is a long time, agreed.  But a vast improvement over the hours and
> hours (maybe 12 or more) it would have taken just to scan pages before the
> fix of ioremap() per page.

What does this mean?

> A 32T machine is probably a research engine rather than a server, and 2hrs
> might be pretty acceptable to track down a system bug that's blocking some
> important application.
>

Yes, this is true. It's of course impossible to stop a system that is
running in production, but it's possible to stop one that is still in the
development phase, for the purpose of bug fixing.

>> So more work needed in scalability area. And page scanning seems to have
>> been not too bad. Copying data has taken majority of time. Is it because
>> of slow disk.
>
> I think compression is the bottleneck.
>
> On an idle 2TB machine: (times in seconds)
>                                  copy time
> uncompressed, to /dev/null      61
> uncompressed, to file           336    (probably 37G, I extrapolate, disk full)
> compressed, to /dev/null        387
> compressed, to file             402    (file 3.7G)
>
> uncompressed disk time  336-61  275
> compressed disk time    402-387  15
> compress time           387-61  326
>
>> BTW, in non-cyclic mode, 32TB physical memory will require 2G just for
>> bitmap (2bits per 4K page).  And then you require some memory for
>> other stuff (around 128MB). I am not sure how did it work for you just
>> by reserving 2G of RAM.
>
> Could it be that this bitmap is being kept only partially in memory,
> with the non-current parts in a file?
>

makedumpfile runs on a ramdisk in the kdump 2nd kernel, so although
makedumpfile writes the non-current part to a temporary file (I assume the
temporary file is not using tmpfs), it's still in memory.

>>>
>>> --------------------------------
>>> UV1000  memory: 8.85TB  crashkernel=1G@5G
>>> command line  /usr/bin/makedumpfile --non-cyclic -c --message-level 23 -d 31 \
>>>     --map-size 4096 -x /boot/vmlinux-3.9.6-cpw-medusa /proc/vmcore \
>>>     /tmp/cpw/dumpfile
>>>
>>> page scanning  175 sec.
>>> copying data  2085 sec. (15G)
>>> (The data copy ran out of disk space at 60%, so the time and size above are
>>>   extrapolated.)
>>>
>>> Notes/observations:
>>> - These systems were idle, so this is the capture of basically system
>>>    memory only.
>>> - Both stable 3.9.6 and 3.10.0-rc5 worked.
>>> - Use of crashkernel=1G,high was usually problematic.  I assume some problem
>>>    with a conflict with something else using high memory.  I always use
>>>    the form like 1G@5G, finding memory by examining /proc/iomem.
>>
>> Hmm..., do you think you need to reserve some low mem too for swiotlb. (In
>> case you are not using iommu).
>
> It is reserving 72M in low mem for swiotlb + 8M.  But this seems not
> enough.
> I did not realize that I could specify crashkernel=xxx,high and
> crashkernel=xxx,low together, until you mentioned it below.  This seems
> to solve my crashkernel=1G,high problem.  I need to specify
> crashkernel=128M,low on some systems or else my crash kernel panics on
> not finding enough low memory.
>
>>> - Time for copying data is dominated by data compression.  Writing 15G of
>>>    compressed data to /dev/null takes about 35min.  Writing the same data
>>>    but uncompressed (140G) to /dev/null takes about 6min.
>>
>> Try using snappy or lzo for faster compression.
>
> I don't have liblzo2 or snappy-c.h
> I must need to install some packages on our build server.
> Would you expect multiple times faster compression with those?
>

My benchmark showed 180-200 MiB/sec for snappy versus 50-70 MiB/sec for
zlib on my i7-860, and the other Xeon processors showed similar results.

>>>    So a good workaround for a very large system might be to dump uncompressed
>>>    to an SSD.
>>
>> Interesting.
>>

Just as above, compression is slower than an SSD. Current makedumpfile
uses a 4KiB block size for compression (it uses the page frame size as the
compression block size on each architecture), but that is not best for
compression speed. If you want to use an SSD, it's better to optimize the
compression block size.

In my small benchmark, I saw over 1 GiB/sec by simply increasing the block
size, but compression took longer per block, and I suspect that reduces
the average rate of I/O requests.
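
The block-size effect can be sketched with Python's zlib standing in for
makedumpfile's compressor (illustrative only; absolute numbers are
machine- and data-dependent):

```python
import time
import zlib

def compress_in_blocks(data: bytes, block_size: int):
    """Compress data block by block; return (MiB/s, total compressed size)."""
    start = time.perf_counter()
    out = 0
    for off in range(0, len(data), block_size):
        out += len(zlib.compress(data[off:off + block_size]))
    secs = time.perf_counter() - start
    return len(data) / secs / 2**20, out

# Compressible sample data standing in for typical page contents.
data = (b"kernel page contents\n" * 500000)[:8 << 20]  # 8 MiB

for bs in (4096, 1 << 20):  # page-sized blocks vs. 1 MiB blocks
    mib_per_s, size = compress_in_blocks(data, bs)
    print(f"block={bs:>8}: {mib_per_s:6.0f} MiB/s, {size} bytes compressed")
```

Larger blocks also compress better, since each block pays a fixed header
and dictionary-warmup cost.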

>>>    The multi-threading of the crash kernel would produce a big gain.
>>
>> Hatayama once was working on patches to bring up multiple cpus in second
>> kernel. Not sure what happened to those patches.
>
> I hope he pursues that. It is 'the' big performance issue remaining, I think.
>
>

Yes, there's progress. I'll post the next version soon.

>>> - Use of mmap on /proc/vmcore increased page scanning speed from 4.4 minutes
>>>    to 3 minutes.  It also increased data copying speed (unexpectedly) from
>>>    38min. to 35min.
>>
>> Hmm.., so on large memory systems, mmap() will not help a lot? In those
>> systems dump times are dominated by disk speed and compression time.
>>
>> So far I was thinking that ioremap() per page was big issue and you
>> also once had done the analysis that passing page list to kernel made
>> things significantly faster.
>>
>> So on 32TB machines if it is taking 2hrs to save dump and mmap() shortens
>> it by only few minutes, it really is not significant win.
>>
>>>    So I think it is worthwhile to push Hatayama's 9-patch set into the kernel.
>>
>> I think his patches are in --mm tree and should show up in next kernel
>> release. But it really does not sound much in overall scheme of things.
>
> Agreed.  Not a big speedup compared to multithreading the crash kernel.
>
>>> - I applied a 5-patch set from Takao Indoh to fix reset_devices handling of
>>>    PCI devices.
>>>    And I applied 3 kernel hacks of my own:
>>>      - making a "Crash kernel low" section in /proc/iomem
>>
>> And you did it because crashkernel=2G,high crashkernel=XM,low did not
>> work for you?
>>
>>>      - make crashkernel avoid some things in pci_swiotlb_detect_override(),
>>>        pci_swiotlb_detect_4gb() and register_mem_sect_under_node()
>>>      - doing a crashkernel return from cpu_up()
>>>    I don't understand why these should be necessary for my kernels but are
>>>    not reported as problems elsewhere. I'm still investigating and will discuss
>>>    those patches separately.
>>
>> Nobody might have tested it yet on such large machines and these problems
>> might be present for everyone.
>>
>> So would be great if you could fix these in upstream kernel.
>
> In further testing I find that none of these kernel patches are needed if
> I'm using the current kexec command and if I don't try to bring the crash
> kernel up to multiuser mode.
> So the current kexec command also works well for me, as well as the 3.10
> kernel.
>
> I have a small wish list for makedumpfile. Nothing major, but I'll post
> those later.
>

It's best if they are given as a patch set.

>
> Could you give me an estimate when the kexec
>   (as in git://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git)
> and makedumpfile
>   (as in git://git.code.sf.net/p/makedumpfile/code  mmap branch)
> will be released?  We would like to advise the distro's about what level
> of those things we require.
>
> -Cliff
>


-- 
Thanks.
HATAYAMA, Daisuke


* Re: 32TB kdump
  2013-06-27 21:17 ` Vivek Goyal
  2013-06-28 21:56   ` Cliff Wickman
@ 2013-07-01  0:55   ` HATAYAMA Daisuke
  2013-07-01 16:06     ` Vivek Goyal
  1 sibling, 1 reply; 12+ messages in thread
From: HATAYAMA Daisuke @ 2013-07-01  0:55 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: kexec, Cliff Wickman

(2013/06/28 6:17), Vivek Goyal wrote:
> On Fri, Jun 21, 2013 at 09:17:14AM -0500, Cliff Wickman wrote:

>
> Try using snappy or lzo for faster compression.
>
>>    So a good workaround for a very large system might be to dump uncompressed
>>    to an SSD.
>
> Interesting.
>
>>    The multi-threading of the crash kernel would produce a big gain.
>
> Hatayama once was working on patches to bring up multiple cpus in second
> kernel. Not sure what happened to those patches.
>
>> - Use of mmap on /proc/vmcore increased page scanning speed from 4.4 minutes
>>    to 3 minutes.  It also increased data copying speed (unexpectedly) from
>>    38min. to 35min.
>
> Hmm.., so on large memory systems, mmap() will not help a lot? In those
> systems dump times are dominated by disk speed and compression time.
>
> So far I was thinking that ioremap() per page was big issue and you
> also once had done the analysis that passing page list to kernel made
> things significantly faster.
>
> So on 32TB machines if it is taking 2hrs to save dump and mmap() shortens
> it by only few minutes, it really is not significant win.
>

Sorry, I explained this earlier on this ML.

Some patches have been applied to makedumpfile to improve the filtering
speed. Two changes that were useful for the improvement are the one
implementing an 8-slot cache of physical pages, to reduce the number of
/proc/vmcore accesses for paging (just like a TLB), and the one that
cleans up makedumpfile's filtering path.

The performance degradation from ioremap() is now hidden on a single cpu,
but it would show up again on multiple cpus. Sorry, but I have yet to do a
benchmark showing this cleanly in numerical terms.
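
The 8-slot cache idea can be sketched roughly as follows (illustrative
Python, not makedumpfile's actual code; read_page stands in for an
expensive pread of /proc/vmcore):

```python
from collections import OrderedDict

class PageCache:
    """Tiny LRU cache of page-sized reads, like an 8-slot TLB."""

    def __init__(self, read_page, slots=8):
        self.read_page = read_page  # backend: pfn -> page bytes
        self.slots = slots
        self.pages = OrderedDict()  # pfn -> bytes, in LRU order
        self.misses = 0

    def get(self, pfn):
        if pfn in self.pages:
            self.pages.move_to_end(pfn)  # mark most recently used
            return self.pages[pfn]
        self.misses += 1
        page = self.read_page(pfn)       # expensive backend access
        self.pages[pfn] = page
        if len(self.pages) > self.slots:
            self.pages.popitem(last=False)  # evict least recently used
        return page
```

Page-table walks touch the same few directory pages over and over, so
even eight slots absorb most of the repeat reads.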

-- 
Thanks.
HATAYAMA, Daisuke


* Re: 32TB kdump
  2013-06-28 21:56   ` Cliff Wickman
  2013-07-01  0:42     ` HATAYAMA Daisuke
@ 2013-07-01  2:57     ` Atsushi Kumagai
  2013-07-01 16:12     ` Vivek Goyal
  2 siblings, 0 replies; 12+ messages in thread
From: Atsushi Kumagai @ 2013-07-01  2:57 UTC (permalink / raw)
  To: cpw; +Cc: kexec, vgoyal

On Fri, 28 Jun 2013 16:56:31 -0500
Cliff Wickman <cpw@sgi.com> wrote:

> I have a small wish list for makedumpfile. Nothing major, but I'll post
> those later.
> 
> 
> Could you give me an estimate when the kexec
>  (as in git://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git)
> and makedumpfile 
>  (as in git://git.code.sf.net/p/makedumpfile/code  mmap branch)
> will be released?  We would like to advise the distro's about what level
> of those things we require.

The current *devel* branch will be released as makedumpfile-1.5.4
this week. v1.5.4 will support mmap.


Thanks
Atsushi Kumagai

* Re: 32TB kdump
  2013-07-01  0:55   ` HATAYAMA Daisuke
@ 2013-07-01 16:06     ` Vivek Goyal
       [not found]       ` <51D3D15D.5090600@jp.fujitsu.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2013-07-01 16:06 UTC (permalink / raw)
  To: HATAYAMA Daisuke; +Cc: kexec, Cliff Wickman

On Mon, Jul 01, 2013 at 09:55:53AM +0900, HATAYAMA Daisuke wrote:
> (2013/06/28 6:17), Vivek Goyal wrote:
> >On Fri, Jun 21, 2013 at 09:17:14AM -0500, Cliff Wickman wrote:
> 
> >
> >Try using snappy or lzo for faster compression.
> >
> >>   So a good workaround for a very large system might be to dump uncompressed
> >>   to an SSD.
> >
> >Interesting.
> >
> >>   The multi-threading of the crash kernel would produce a big gain.
> >
> >Hatayama once was working on patches to bring up multiple cpus in second
> >kernel. Not sure what happened to those patches.
> >
> >>- Use of mmap on /proc/vmcore increased page scanning speed from 4.4 minutes
> >>   to 3 minutes.  It also increased data copying speed (unexpectedly) from
> >>   38min. to 35min.
> >
> >Hmm.., so on large memory systems, mmap() will not help a lot? In those
> >systems dump times are dominated by disk speed and compression time.
> >
> >So far I was thinking that ioremap() per page was big issue and you
> >also once had done the analysis that passing page list to kernel made
> >things significantly faster.
> >
> >So on 32TB machines if it is taking 2hrs to save dump and mmap() shortens
> >it by only few minutes, it really is not significant win.
> >
> 
> Sorry, I've explained this earlier in this ML.
> 
> Some patches have been applied on makedumpfile to improve the filtering speed.
> Two changes that were useful for the improvement are the one implementing
> a 8-slot cache for physical page for the purpose of reducing the number of
> /proc/vmcore access for paging (just as TLB), and the one that cleanups
> makedumpfile's filtering path.

So the biggest performance improvement came from implementing some kind
of TLB cache in makedumpfile?

> 
> Performance degradation by ioremap() is now being hidden on a single cpu, but
> it would again occur on multiple cpus. Sorry, but I have yet to do benchmark
> showing the fact cleanly as numeral values.

IIUC, are you saying that the ioremap() overhead per page is no longer very
significant on a single cpu system (after the above makedumpfile changes), and
that is the reason using mmap() does not show a very significant
improvement in the overall scheme of things? And these overheads will become
more important when multiple cpus are brought up in the kdump environment.

Please correct me if I am wrong, I just want to understand it better. So
most of our performance problems w.r.t. scanning got solved by
makedumpfile changes, and the mmap() changes bring us only a little bit of
improvement in the overall scheme of things on large machines?

Vivek

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 32TB kdump
  2013-06-28 21:56   ` Cliff Wickman
  2013-07-01  0:42     ` HATAYAMA Daisuke
  2013-07-01  2:57     ` Atsushi Kumagai
@ 2013-07-01 16:12     ` Vivek Goyal
  2 siblings, 0 replies; 12+ messages in thread
From: Vivek Goyal @ 2013-07-01 16:12 UTC (permalink / raw)
  To: Cliff Wickman; +Cc: kexec

On Fri, Jun 28, 2013 at 04:56:31PM -0500, Cliff Wickman wrote:

[..]
> > > page scanning  570 sec.
> > > copying data  5795 sec. (72G)
> > > (The data copy ran out of disk space at 23%, so the time and size above are
> > >  extrapolated.)
> > 
> > That's almost 110 mins. Approximately 2 hrs to dump. I think it is still
> > a lot. How many people can afford to keep a machine dumping for 2hrs? They
> > would rather bring the services back online.
> 
> It is a long time, agreed.  But a vast improvement over the hours and
> hours (maybe 12 or more) it would have taken just to scan pages before the
> fix of ioremap() per page.

Which ioremap() fix are you referring to? I thought using mmap() was the
fix for the per-page ioremap() issue, and that's not showing significant
improvements. It looks like you are referring to some other makedumpfile
changes which I am not aware of.

> A 32T machine is probably a research engine rather than a server, and 2hrs
> might be pretty acceptable to track down a system bug that's blocking some
> important application.
>  
> > So more work is needed in the scalability area. And page scanning seems
> > to have been not too bad. Copying data has taken the majority of the time.
> > Is it because of a slow disk?
> 
> I think compression is the bottleneck.
> 
> On an idle 2TB machine: (times in seconds)
>                                 copy time
> uncompressed, to /dev/null      61
> uncompressed, to file           336    (probably 37G, I extrapolate, disk full)
> compressed, to /dev/null        387
> compressed, to file             402    (file 3.7G)
> 
> uncompressed disk time  336-61  275
> compressed disk time    402-387  15
> compress time           387-61  326
> 

Ok, so now compression is the biggest bottleneck on large machines.
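[Editorial note: the faster compressors suggested earlier in the thread correspond
to makedumpfile's compression flags; a sketch, with an illustrative output path:]

```shell
# -c selects zlib (what the runs above used); -l selects LZO and -p selects
# snappy, trading some compression ratio for much less CPU time.  The latter
# two require makedumpfile built with USELZO=on / USESNAPPY=on.
makedumpfile -l --message-level 23 -d 31 /proc/vmcore /var/crash/dumpfile
```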

[..]
> > > - Use of crashkernel=1G,high was usually problematic.  I assume some problem
> > >   with a conflict with something else using high memory.  I always use
> > >   the form like 1G@5G, finding memory by examining /proc/iomem.
> > 
> > Hmm..., do you think you need to reserve some low mem too for swiotlb. (In
> > case you are not using iommu).
> 
> It is reserving 72M in low mem for swiotlb + 8M.  But this seems not
> enough.
> I did not realize that I could specify crashkernel=xxx,high and
> crashkernel=xxx,low together, until you mentioned it below.  This seems
> to solve my crashkernel=1G,high problem.  I need to specify
> crashkernel=128M,low on some systems or else my crash kernel panics on
> not finding enough low memory.

Is it possible to dive deeper and figure out why you need more low
memory? We might need some fixing in the upstream kernel. Otherwise, how
would a user know how much low memory to reserve?
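[Editorial note: the combined reservation Cliff describes looks like the sketch
below; the sizes are illustrative, and the right low-memory amount is exactly the
open question here.]

```shell
# Kernel command line: reserve the bulk of crash-kernel memory above 4G plus
# an explicit low chunk for swiotlb/DMA buffers, instead of hand-picking an
# address from /proc/iomem:
#   crashkernel=1G,high crashkernel=128M,low
# After rebooting, verify what was actually reserved:
grep -i "crash kernel" /proc/iomem
```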

Thanks
Vivek


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 32TB kdump
       [not found]       ` <51D3D15D.5090600@jp.fujitsu.com>
@ 2013-07-03 13:03         ` Vivek Goyal
  2013-07-04  2:03           ` HATAYAMA Daisuke
  0 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2013-07-03 13:03 UTC (permalink / raw)
  To: HATAYAMA Daisuke; +Cc: kexec, Cliff Wickman

On Wed, Jul 03, 2013 at 04:23:09PM +0900, HATAYAMA Daisuke wrote:
> (2013/07/02 1:06), Vivek Goyal wrote:
> > On Mon, Jul 01, 2013 at 09:55:53AM +0900, HATAYAMA Daisuke wrote:
> >> (2013/06/28 6:17), Vivek Goyal wrote:
> >>> On Fri, Jun 21, 2013 at 09:17:14AM -0500, Cliff Wickman wrote:
> >>
> >>>
> >>> Try using snappy or lzo for faster compression.
> >>>
> >>>>    So a good workaround for a very large system might be to dump uncompressed
> >>>>    to an SSD.
> >>>
> >>> Interesting.
> >>>
> >>>>    The multi-threading of the crash kernel would produce a big gain.
> >>>
> >>> Hatayama once was working on patches to bring up multiple cpus in second
> >>> kernel. Not sure what happened to those patches.
> >>>
> >>>> - Use of mmap on /proc/vmcore increased page scanning speed from 4.4 minutes
> >>>>    to 3 minutes.  It also increased data copying speed (unexpectedly) from
> >>>>    38min. to 35min.
> >>>
> >>> Hmm.., so on large memory systems, mmap() will not help a lot? In those
> >>> systems dump times are dominated by disk speed and compression time.
> >>>
> >>> So far I was thinking that ioremap() per page was big issue and you
> >>> also once had done the analysis that passing page list to kernel made
> >>> things significantly faster.
> >>>
> >>> So on 32TB machines if it is taking 2hrs to save dump and mmap() shortens
> >>> it by only few minutes, it really is not significant win.
> >>>
> >>
> >> Sorry, I've explained this earlier in this ML.
> >>
> >> Some patches have been applied to makedumpfile to improve the filtering speed.
> >> Two changes that were useful for the improvement are the one implementing
> >> an 8-slot cache of physical pages to reduce the number of /proc/vmcore
> >> accesses needed for paging (much like a TLB), and the one that cleans up
> >> makedumpfile's filtering path.
> >
> > So biggest performance improvement came from implementing some kind of
> > TLB cache in makedumpfile?
> >
> 
> Yes, for filtering. We need to do paging during filtering since mem_map[] is
> mapped in the VMEMMAP region (depending on kernel configuration, of course).
> The TLB-like cache works very well. OTOH, copying of pages is done page by
> page; we don't need to do paging there at all.
> 
> >>
> >> Performance degradation by ioremap() is now hidden on a single cpu, but
> >> it would again occur on multiple cpus. Sorry, but I have yet to do a benchmark
> >> showing this cleanly in numerical values.
> >
> > IIUC, are you saying that the ioremap() overhead per page is no longer very
> > significant on a single cpu system (after the above makedumpfile changes), and
> > that is the reason using mmap() does not show a very significant
> > improvement in the overall scheme of things? And these overheads will become
> > more important when multiple cpus are brought up in the kdump environment.
> >
> > Please correct me if I am wrong, I just want to understand it better. So
> > most of our performance problems w.r.t. scanning got solved by
> > makedumpfile changes, and the mmap() changes bring us only a little bit of
> > improvement in the overall scheme of things on large machines?
> >
> > Vivek
> >
> 
> Filtering performance has been improved by makedumpfile-specific changes,
> and the remaining filtering overheads are small enough to ignore. But for huge
> crash dumps, we still need to be concerned about the performance difference
> between mmap() and read(), even without I/O to actual disks.
> 
> Please see the following simple benchmark where I tried to compare mmap() and
> read() with ioremap() in multiple cpu configuration, writing I/O into /dev/null.
> I also profiled them using perf record/report to understand the current ioremap()
> overheads accurately.
> 
> From the result, mmap() is better than read() since:
> 
>  1. In case of read (ioremap), single thread takes about 180 seconds for filtering
>     and copying 32 GiB memory. In case of mmap, about 25 seconds.
> 
>     Therefore, I guess read (ioremap) would take about 96 minutes and mmap would
>     take about 14 minutes on 1 TiB memory.
> 
>     This is significant since there are situations where we cannot reduce crash
>     dump size by filtering and we have to collect a huge crash dump. For example,
> 
>     - Debugging qemu/KVM system, i.e., get host machine's crash dump and analyze
>       guest machines' image included in the host machine's crash dump. To do so,
>       we cannot filter user-space memory from the host machine's crash dump.
> 
>     - Debugging an application on a High Availability system such as a cluster,
>       where the active node triggers kdump when the main application of the HA
>       system crashes, in order to switch over to the inactive node as soon as
>       possible by skipping the application's shutdown. We debug the application
>       after retrieving its image as a user-space process dump from the generated
>       crash dump. To do so, we cannot filter user-space memory from the active
>       node's crash dump.
> 
>     - In general, how much data size is reduced by filtering depends on memory
>       situation at the time of crash. We need to be concerned about the worst case;
>       at least bad case.
> 
>   2. For scalability over multiple cpus and disks. The improvement ratio is 3.39
>     for mmap() and 2.70 for read() on 4 cpus. I guess part of this degradation
>     comes from TLB purges and the IPI interrupts that invoke the TLB flush
>     callback on each CPU. The number of the interrupts depends on mapping size.
>     For mmap(), it's possible to support large 1GiB pages for remap_pfn_range()
>     and then we can map the whole range of vmcore with one call of mmap().
> 
> Benchmark:
> 
> * Environment
> | System           | RX600 S6                                                        |
> | CPU              | Intel(R) Xeon(R) CPU E7- 4820  @ 2.00GHz x 4                    |
> | Memory           | 32 GiB                                                          |
> | Kernel           | 3.10.0-rc2 + mmap() patch set                                   |
> | makedumpfile (*) | devel branch of git: http://git.code.sf.net/p/makedumpfile/code |
> 
> (*) I customized makedumpfile so that, for ease of benchmarking, I can
> tell makedumpfile to use read() explicitly via a -W option.
> 
> The mmap size of the version of makedumpfile is 4 MiB.
> 
> * How to measure
> ** Configuration
> - Use 4 CPUS on the kdump 2nd kernel by specifying nr_cpus=4 to the
>   kdump kernel parameter.
> - Trigger kdump by taskset -c 0 sh -c "echo c > /proc/sysrq-trigger"
>   - NOTE: The system hangs at boot of the 2nd kernel if the crash happens on
>     a CPU other than CPU0. To avoid this situation, you need to
>     specify CPU0 explicitly.
> 
> ** Benchmark Script
> #+BEGIN_QUOTE
> #! /bin/sh
> 
> VMCORE="$1"
> DUMPFILE="$2"
> 
> DUMPFILEDIR="$(dirname $DUMPFILE)"
> 
> MAKEDUMPFILE=/home/hat/repos/code/makedumpfile
> PERF=/var/crash/perf
> 
> drop_cache() {
>     sync
>     echo 3 > /proc/sys/vm/drop_caches
> }
> 
> upcpu() {
>     local CPU="cpu$1"
>     echo 1 >/sys/devices/system/cpu/$CPU/online
> }
> 
> downcpu() {
>     local CPU="cpu$1"
>     echo 0 >/sys/devices/system/cpu/$CPU/online
> }
> 
> downcpu 1; downcpu 2; downcpu 3
> 
> # number of CPUs: 1
> drop_cache
> $PERF record -g -o $DUMPFILEDIR/perf.data.mmap1 $MAKEDUMPFILE -f --message-level 31 $VMCORE /dev/null >>$DUMPFILEDIR/msg.txt 2>&1
> drop_cache
> $PERF record -g -o $DUMPFILEDIR/perf.data.read1 $MAKEDUMPFILE -W -f --message-level 31 $VMCORE /dev/null >>$DUMPFILEDIR/msg.txt 2>&1
> 
> # number of CPUs: 2
> upcpu 1
> drop_cache
> $PERF record -g -o $DUMPFILEDIR/perf.data.mmap2 $MAKEDUMPFILE -f --message-level 31 --split $VMCORE /dev/null /dev/null >>$DUMPFILEDIR/msg.txt 2>&1
> drop_cache
> $PERF record -g -o $DUMPFILEDIR/perf.data.read2 $MAKEDUMPFILE -W -f --message-level 31 --split $VMCORE /dev/null /dev/null >>$DUMPFILEDIR/msg.txt 2>&1
> 
> # number of CPUs: 4
> upcpu 2; upcpu 3
> drop_cache
> $PERF record -g -o $DUMPFILEDIR/perf.data.mmap4 $MAKEDUMPFILE -f --message-level 31 --split $VMCORE /dev/null /dev/null /dev/null /dev/null >>$DUMPFILEDIR/msg.txt 2>&1
> drop_cache
> $PERF record -g -o $DUMPFILEDIR/perf.data.read4 $MAKEDUMPFILE -W -f --message-level 31 --split $VMCORE /dev/null /dev/null /dev/null /dev/null >>$DUMPFILEDIR/msg.txt 2>&1
> 
> exit 0
> #+END_QUOTE
> 
> * benchmark result (Copy time)
> ** mmap
> | #threads | thr 1 [s] | thr 2 [s] | thr 3 [s] | thr 4 [s] | avg [s] | per [MB/s] | ratio |
> |----------+-----------+-----------+-----------+-----------+---------+------------+-------|
> |        1 |     25.10 |         - | -         | -         |   25.10 |    1305.50 |  1.00 |
> |        2 |     11.88 |     14.25 | -         | -         |  13.065 |    2508.08 |  1.92 |
> |        4 |      5.66 |      7.92 | 7.99      | 8.06      |  7.4075 |    4423.62 |  3.39 |
> 
> ** read (ioremap)
> | #threads | thr 1 [s] | thr 2 [s] | thr 3 [s] | thr 4 [s] | avg [s] | per [MB/s] | ratio |
> |----------+-----------+-----------+-----------+-----------+---------+------------+-------|
> |        1 |    149.39 |         - | -         | -         |  149.39 |     219.35 |  1.00 |
> |        2 |     89.24 |    104.33 | -         | -         |   96.79 |     338.55 |  1.54 |
> |        4 |     41.74 |     59.59 | 59.60     | 60.03     |   55.24 |     593.19 |  2.70 |

Hi Hatayama,

Thanks for testing and providing these results. The table above gives a pretty
good idea of mmap() vs read() interface performance.

So mmap() does help significantly. I did not know that makedumpfile could
handle more than 1 cpu and divide work among threads to exploit
parallelism. Good to see that in the case of 4 cpus, mmap() speeds up by
a factor of 3.39.

So it looks like dump times on large machines will now be dominated by
compression time and the time taken to write to disk. Cliff Wickman's numbers
seem to suggest that compression time is much more than the time it takes
to write the dump to disk (idle system on a 2TB machine).

I have taken following snippet from his other mail.

*****************************************************
On an idle 2TB machine: (times in seconds)
                                copy time
uncompressed, to /dev/null      61
uncompressed, to file           336    (probably 37G, I extrapolate, disk full)
compressed, to /dev/null        387
compressed, to file             402    (file 3.7G)

uncompressed disk time  336-61  275
compressed disk time    402-387  15
compress time           387-61  326
*****************************************************

It took around 400 seconds to capture the compressed dump on disk, and of
that, 326 seconds were consumed by compression alone. Around 80% of the total
dump time in this case was attributed to compression.

So this sounds like the next big fish to go after. Using lzo and
snappy might help a bit. But I think bringing up more cpus in second
kernel should help too.
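[Editorial note: the parallelism referred to above is makedumpfile's --split mode,
as used in the benchmark script earlier in the thread; a sketch with illustrative
file names:]

```shell
# One writer per cpu brought up in the kdump kernel; each process handles a
# slice of /proc/vmcore and writes its own partial dump file.
makedumpfile -c -d 31 --split /proc/vmcore dump.1 dump.2 dump.3 dump.4

# Later, merge the partial dumps back into a single file for analysis
# (the output file comes last).
makedumpfile --reassemble dump.1 dump.2 dump.3 dump.4 vmcore.full
```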

What's the status of your patches to bring up multiple cpus in kdump
kernel. Are you planning to push these upstream?

Thanks
Vivek


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 32TB kdump
  2013-07-03 13:03         ` Vivek Goyal
@ 2013-07-04  2:03           ` HATAYAMA Daisuke
  2013-07-05 15:21             ` Not booting BSP in kdump kernel (Was: Re: 32TB kdump) Vivek Goyal
  0 siblings, 1 reply; 12+ messages in thread
From: HATAYAMA Daisuke @ 2013-07-04  2:03 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: kexec, Cliff Wickman

(2013/07/03 22:03), Vivek Goyal wrote:
> On Wed, Jul 03, 2013 at 04:23:09PM +0900, HATAYAMA Daisuke wrote:
>> (2013/07/02 1:06), Vivek Goyal wrote:
>>> On Mon, Jul 01, 2013 at 09:55:53AM +0900, HATAYAMA Daisuke wrote:
>>>> (2013/06/28 6:17), Vivek Goyal wrote:
>>>>> On Fri, Jun 21, 2013 at 09:17:14AM -0500, Cliff Wickman wrote:
>>>>
>>>>>
>>>>> Try using snappy or lzo for faster compression.
>>>>>
>>>>>>     So a good workaround for a very large system might be to dump uncompressed
>>>>>>     to an SSD.
>>>>>
>>>>> Interesting.
>>>>>
>>>>>>     The multi-threading of the crash kernel would produce a big gain.
>>>>>
>>>>> Hatayama once was working on patches to bring up multiple cpus in second
>>>>> kernel. Not sure what happened to those patches.
>>>>>
>>>>>> - Use of mmap on /proc/vmcore increased page scanning speed from 4.4 minutes
>>>>>>     to 3 minutes.  It also increased data copying speed (unexpectedly) from
>>>>>>     38min. to 35min.
>>>>>
>>>>> Hmm.., so on large memory systems, mmap() will not help a lot? In those
> >>>>> systems dump times are dominated by disk speed and compression time.
>>>>>
>>>>> So far I was thinking that ioremap() per page was big issue and you
>>>>> also once had done the analysis that passing page list to kernel made
>>>>> things significantly faster.
>>>>>
>>>>> So on 32TB machines if it is taking 2hrs to save dump and mmap() shortens
>>>>> it by only few minutes, it really is not significant win.
>>>>>
>>>>
>>>> Sorry, I've explained this earlier in this ML.
>>>>
> >>>> Some patches have been applied to makedumpfile to improve the filtering speed.
> >>>> Two changes that were useful for the improvement are the one implementing
> >>>> an 8-slot cache of physical pages to reduce the number of /proc/vmcore
> >>>> accesses needed for paging (much like a TLB), and the one that cleans up
> >>>> makedumpfile's filtering path.
>>>
>>> So biggest performance improvement came from implementing some kind of
>>> TLB cache in makedumpfile?
>>>
>>
>> Yes, for filtering. We need to do paging on filtering since mem_map[] is
>> mapped in VMEMMAP region (of course depending on kernel configuration).
>> The TLB like cache works very well. OTOH, copying pages are done in pages.
>> We don't need to do paging at all.
>>
>>>>
> >>>> Performance degradation by ioremap() is now hidden on a single cpu, but
> >>>> it would again occur on multiple cpus. Sorry, but I have yet to do a benchmark
> >>>> showing this cleanly in numerical values.
>>>
> >>> IIUC, are you saying that the ioremap() overhead per page is no longer very
> >>> significant on a single cpu system (after the above makedumpfile changes), and
> >>> that is the reason using mmap() does not show a very significant
> >>> improvement in the overall scheme of things? And these overheads will become
> >>> more important when multiple cpus are brought up in the kdump environment.
> >>>
> >>> Please correct me if I am wrong, I just want to understand it better. So
> >>> most of our performance problems w.r.t. scanning got solved by
> >>> makedumpfile changes, and the mmap() changes bring us only a little bit of
> >>> improvement in the overall scheme of things on large machines?
>>>
>>> Vivek
>>>
>>
> >> Filtering performance has been improved by makedumpfile-specific changes,
> >> and the remaining filtering overheads are small enough to ignore. But for huge
> >> crash dumps, we still need to be concerned about the performance difference
> >> between mmap() and read(), even without I/O to actual disks.
>>
>> Please see the following simple benchmark where I tried to compare mmap() and
>> read() with ioremap() in multiple cpu configuration, writing I/O into /dev/null.
>> I also profiled them using perf record/report to understand the current ioremap()
>> overheads accurately.
>>
> >> From the result, mmap() is better than read() since:
>>
>>   1. In case of read (ioremap), single thread takes about 180 seconds for filtering
>>      and copying 32 GiB memory. In case of mmap, about 25 seconds.
>>
>>      Therefore, I guess read (ioremap) would take about 96 minutes and mmap would
>>      take about 14 minutes on 1 TiB memory.
>>
>>      This is significant since there's situations where we cannot reduce crash dump
>>      size by filtering and we have to collect huge crash dump. For example,
>>
>>      - Debugging qemu/KVM system, i.e., get host machine's crash dump and analyze
>>        guest machines' image included in the host machine's crash dump. To do so,
>>        we cannot filter user-space memory from the host machine's crash dump.
>>
>>      - Debugging application of High Availability system such as cluster, where
>>        when crash happens, active node at the time triggers kdump when the main
>>        application of the HA system crashes in order to switch into inactive node
>>        as soon as possible by skipping the application's shutdown. We debug
>>        application's debug after retrieving the application's image as user-space
>>        process dump from generated crash dump. To do so, we cannot filter user-space
>>        memory from the active node's crash dump.
>>
>>      - In general, how much data size is reduced by filtering depends on memory
>>        situation at the time of crash. We need to be concerned about the worst case;
>>        at least bad case.
>>
>>    2. For scalability of multiple cpus and disks. Improvement ratio shows 3.39 for
>>      mmap() and 2.70 for read() on 4 cpus. I guess part of these degradation comes
>>      from TLB purge and IPI interupts to call TLB flush call back function
>>      on each CPU. The number of the interrupts depends on mapping size. For mmap(),
>>      it's possible to support large 1GiB pages for remap_pfn_range() and then
>>      we can map a whole range of vmcore by one call of mmap().
>>
>> Benchmark:
>>
>> * Environment
>> | System           | RX600 S6                                                        |
>> | CPU              | Intel(R) Xeon(R) CPU E7- 4820  @ 2.00GHz x 4                    |
>> | Memory           | 32 GiB                                                          |
>> | Kernel           | 3.10.0-rc2 + mmap() patch set                                   |
>> | makedumpfile (*) | devel branch of git: http://git.code.sf.net/p/makedumpfile/code |
>>
> >> (*) I customized makedumpfile so that, for ease of benchmarking, I can
> >> tell makedumpfile to use read() explicitly via a -W option.
>>
>> The mmap size of the version of makedumpfile is 4 MiB.
>>
>> * How to measure
>> ** Configuration
>> - Use 4 CPUS on the kdump 2nd kernel by specifying nr_cpus=4 to the
>>    kdump kernel parameter.
>> - Trigger kdump by taskset -c 0 sh -c "echo c > /proc/sysrq-trigger"
>>    - NOTE: System hangs at the boot of 2nd kernel if crash happens on
>>      the CPU except for the CPU0. To avoid this situation, you need to
>>      specify CPU0 explicitly.
>>
>> ** Benchmark Script
>> #+BEGIN_QUOTE
>> #! /bin/sh
>>
>> VMCORE="$1"
>> DUMPFILE="$2"
>>
>> DUMPFILEDIR="$(dirname $DUMPFILE)"
>>
>> MAKEDUMPFILE=/home/hat/repos/code/makedumpfile
>> PERF=/var/crash/perf
>>
>> drop_cache() {
>>      sync
>>      echo 3 > /proc/sys/vm/drop_caches
>> }
>>
>> upcpu() {
>>      local CPU="cpu$1"
>>      echo 1 >/sys/devices/system/cpu/$CPU/online
>> }
>>
>> downcpu() {
>>      local CPU="cpu$1"
>>      echo 0 >/sys/devices/system/cpu/$CPU/online
>> }
>>
>> downcpu 1; downcpu 2; downcpu 3
>>
>> # number of CPUs: 1
>> drop_cache
>> $PERF record -g -o $DUMPFILEDIR/perf.data.mmap1 $MAKEDUMPFILE -f --message-level 31 $VMCORE /dev/null >>$DUMPFILEDIR/msg.txt 2>&1
>> drop_cache
>> $PERF record -g -o $DUMPFILEDIR/perf.data.read1 $MAKEDUMPFILE -W -f --message-level 31 $VMCORE /dev/null >>$DUMPFILEDIR/msg.txt 2>&1
>>
>> # number of CPUs: 2
>> upcpu 1
>> drop_cache
>> $PERF record -g -o $DUMPFILEDIR/perf.data.mmap2 $MAKEDUMPFILE -f --message-level 31 --split $VMCORE /dev/null /dev/null >>$DUMPFILEDIR/msg.txt 2>&1
>> drop_cache
>> $PERF record -g -o $DUMPFILEDIR/perf.data.read2 $MAKEDUMPFILE -W -f --message-level 31 --split $VMCORE /dev/null /dev/null >>$DUMPFILEDIR/msg.txt 2>&1
>>
>> # number of CPUs: 4
>> upcpu 2; upcpu 3
>> drop_cache
>> $PERF record -g -o $DUMPFILEDIR/perf.data.mmap4 $MAKEDUMPFILE -f --message-level 31 --split $VMCORE /dev/null /dev/null /dev/null /dev/null >>$DUMPFILEDIR/msg.txt 2>&1
>> drop_cache
>> $PERF record -g -o $DUMPFILEDIR/perf.data.read4 $MAKEDUMPFILE -W -f --message-level 31 --split $VMCORE /dev/null /dev/null /dev/null /dev/null >>$DUMPFILEDIR/msg.txt 2>&1
>>
>> exit 0
>> #+END_QUOTE
>>
>> * benchmark result (Copy time)
>> ** mmap
>> | #threads | thr 1 [s] | thr 2 [s] | thr 3 [s] | thr 4 [s] | avg [s] | per [MB/s] | ratio |
>> |----------+-----------+-----------+-----------+-----------+---------+------------+-------|
>> |        1 |     25.10 |         - | -         | -         |   25.10 |    1305.50 |  1.00 |
>> |        2 |     11.88 |     14.25 | -         | -         |  13.065 |    2508.08 |  1.92 |
>> |        4 |      5.66 |      7.92 | 7.99      | 8.06      |  7.4075 |    4423.62 |  3.39 |
>>
>> ** read (ioremap)
>> | #threads | thr 1 [s] | thr 2 [s] | thr 3 [s] | thr 4 [s] | avg [s] | per [MB/s] | ratio |
>> |----------+-----------+-----------+-----------+-----------+---------+------------+-------|
>> |        1 |    149.39 |         - | -         | -         |  149.39 |     219.35 |  1.00 |
>> |        2 |     89.24 |    104.33 | -         | -         |   96.79 |     338.55 |  1.54 |
>> |        4 |     41.74 |     59.59 | 59.60     | 60.03     |   55.24 |     593.19 |  2.70 |
>
> Hi Hatayama,
>
> Thanks for testing and providing these results. The table above gives a pretty
> good idea of mmap() vs read() interface performance.
>
> So mmap() does help significantly. I did not know that makedumpfile could
> handle more than 1 cpu and divide work among threads to exploit
> parallelism. Good to see that in the case of 4 cpus, mmap() speeds up by
> a factor of 3.39.
>
> So it looks like dump times on large machines will now be dominated by
> compression time and the time taken to write to disk. Cliff Wickman's numbers
> seem to suggest that compression time is much more than the time it takes
> to write the dump to disk (idle system on a 2TB machine).
>
> I have taken following snippet from his other mail.
>
> *****************************************************
> On an idle 2TB machine: (times in seconds)
>                                 copy time
> uncompressed, to /dev/null      61
> uncompressed, to file           336    (probably 37G, I extrapolate, disk full)
> compressed, to /dev/null        387
> compressed, to file             402    (file 3.7G)
>
> uncompressed disk time  336-61  275
> compressed disk time    402-387  15
> compress time           387-61  326
> *****************************************************
>
> It took around 400 seconds to capture the compressed dump on disk, and of
> that, 326 seconds were consumed by compression alone. Around 80% of the total
> dump time in this case was attributed to compression.
>
> So this sounds like the next big fish to go after. Using lzo and
> snappy might help a bit. But I think bringing up more cpus in second
> kernel should help too.
>
> What's the status of your patches to bring up multiple cpus in kdump
> kernel. Are you planning to push these upstream?
>

My next patch is the same as in this diff:

http://lkml.indiana.edu/hypermail/linux/kernel/1210.2/00014.html

Now there's a need to compare the idea with Eric's suggestion of unsetting the BSP flag on
the boot cpu at boot time of the 1st kernel, which is much simpler than my idea. However,
HPA pointed out that Eric's idea could affect some sorts of firmware.
I'm investigating that now.

The candidate I've found so far is ACPI firmware. The ACPI specification describes, in the FADT
part, that SMI_CMD and other SMI-related commands need to be issued on the boot processor, i.e.
the cpu with the BSP flag; see Table 5-34 in ACPI spec 5.0. I associate this with the cpu hotplug
restriction that the boot processor cannot be physically removed. Also, there's a comment in
__acpi_os_execute():

         /*
          * On some machines, a software-initiated SMI causes corruption unless
          * the SMI runs on CPU 0.  An SMI can be initiated by any AML, but
          * typically it's done in GPE-related methods that are run via
          * workqueues, so we can avoid the known corruption cases by always
          * queueing on CPU 0.
          */
         ret = queue_work_on(0, queue, &dpc->work);

But I don't know whether the SMI even requires the BSP flag to be set or not. I need
suggestions from experts in this field.

I'll post the patch next week and CC some experts, in order to fill the patch
description with a concrete description of what kinds of firmware are affected by
unsetting the BSP flag.

-- 
Thanks.
HATAYAMA, Daisuke



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Not booting BSP in kdump kernel (Was: Re: 32TB kdump)
  2013-07-04  2:03           ` HATAYAMA Daisuke
@ 2013-07-05 15:21             ` Vivek Goyal
  2013-07-08  9:23               ` HATAYAMA Daisuke
  0 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2013-07-05 15:21 UTC (permalink / raw)
  To: HATAYAMA Daisuke; +Cc: kexec, Cliff Wickman, Eric W. Biederman

On Thu, Jul 04, 2013 at 11:03:46AM +0900, HATAYAMA Daisuke wrote:

[..]
> >It took around 400 seconds to capture compressed dump on disk and out of
> >that 326 seconds were consumed by compression only. Around 80% of total
> >dump time in this case was attributed to compression.
> >
> >So this sounds like the next big fish to go after. Using lzo and
> >snappy might help a bit. But I think bringing up more cpus in second
> >kernel should help too.
> >
> >What's the status of your patches to bring up multiple cpus in kdump
> >kernel. Are you planning to push these upstream?
> >
> 
> My next patch is the same as in diff:
> 
> http://lkml.indiana.edu/hypermail/linux/kernel/1210.2/00014.html
> 
> Now there's a need to compare the idea with Eric's suggestion of unsetting the BSP flag on
> the boot cpu at boot time of the 1st kernel, which is much simpler than my idea. However,
> HPA pointed out that Eric's idea could affect some sorts of firmware.
> I'm investigating that now.
> 
> The candidate I've found so far is ACPI firmware. The ACPI specification describes, in the FADT
> part, that SMI_CMD and other SMI-related commands need to be issued on the boot processor, i.e.
> the cpu with the BSP flag; see Table 5-34 in ACPI spec 5.0. I associate this with the cpu hotplug
> restriction that the boot processor cannot be physically removed. Also, there's a comment in
> __acpi_os_execute():
> 
>         /*
>          * On some machines, a software-initiated SMI causes corruption unless
>          * the SMI runs on CPU 0.  An SMI can be initiated by any AML, but
>          * typically it's done in GPE-related methods that are run via
>          * workqueues, so we can avoid the known corruption cases by always
>          * queueing on CPU 0.
>          */
>         ret = queue_work_on(0, queue, &dpc->work);
> 
> But I don't know whether the SMI even requires the BSP flag to be set or not. I need
> suggestions from experts in this field.
> 
> I'll post the patch next week and CC some experts, in order to fill the patch
> description with a concrete description of what kinds of firmware are affected by
> unsetting the BSP flag.

Ok. BTW, did clearing the BSP flag work for you experimentally? hpa mentioned
that it did not for Fenghua.

To me, looking into the ACPI/MP tables to figure out which cpu is the BSP and
not bringing up that cpu sounds simple, and one does not have to worry
about dealing with the side effects of clearing the BSP bit, if any.

CCing Eric to figure out whether he is particular about the clearing-BSP-bit
solution or is willing to accept the solution of booting N-1 cpus
in the kdump kernel by not bringing up the BSP.

Thanks
Vivek



* Re: Not booting BSP in kdump kernel (Was: Re: 32TB kdump)
  2013-07-05 15:21             ` Not booting BSP in kdump kernel (Was: Re: 32TB kdump) Vivek Goyal
@ 2013-07-08  9:23               ` HATAYAMA Daisuke
  0 siblings, 0 replies; 12+ messages in thread
From: HATAYAMA Daisuke @ 2013-07-08  9:23 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: kexec, Cliff Wickman, Eric W. Biederman

(2013/07/06 0:21), Vivek Goyal wrote:
> On Thu, Jul 04, 2013 at 11:03:46AM +0900, HATAYAMA Daisuke wrote:
>
> [..]
>>> It took around 400 seconds to capture compressed dump on disk and out of
>>> that 326 seconds were consumed by compression only. Around 80% of total
>>> dump time in this case was attributed to compression.
>>>
>>> So this sounds like the next big fish to go after. Using lzo and
>>> snappy might help a bit. But I think bringing up more cpus in second
>>> kernel should help too.
>>>
>>> What's the status of your patches to bring up multiple cpus in kdump
>>> kernel. Are you planning to push these upstream?
>>>
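The compression-dominated profile described above is easy to reproduce in miniature. A rough sketch, with zlib standing in for makedumpfile's zlib-based `-c` compression and a plain byte copy standing in for an uncompressed write (the helper and the data are illustrative only, not makedumpfile code):

```python
import time
import zlib

def timed_write(data, compress_level=None):
    """Return (bytes produced, seconds spent) for one simulated page write:
    either a raw copy, or zlib compression at the given level."""
    t0 = time.perf_counter()
    out = zlib.compress(data, compress_level) if compress_level else bytes(data)
    return len(out), time.perf_counter() - t0

# 8 MiB of moderately compressible data standing in for captured pages
data = bytes(range(256)) * (8 * 1024 * 4)

raw_len, raw_t = timed_write(data)        # uncompressed path
zip_len, zip_t = timed_write(data, 6)     # zlib default level, as in -c
# Expect zip_len well below raw_len, but zip_t far above raw_t: the CPU,
# not the disk, sets the pace -- which is why lzo/snappy or more cpus in
# the second kernel would help.
```

The single-threaded compression cost scales with memory size, which is what makes it the dominant term on multi-TB captures.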
>>
>> My next patch is the same as the one in this diff:
>>
>> http://lkml.indiana.edu/hypermail/linux/kernel/1210.2/00014.html
>>
>> Now there's a need to compare the idea with Eric's suggestion of unsetting the BSP flag on the
>> boot cpu at boot time of the 1st kernel, which is much simpler than my idea. However,
>> HPA pointed out that Eric's idea could affect some sorts of firmware.
>> I'm investigating that now.
>>
>> The candidate I've found so far is ACPI firmware. The ACPI specification describes, in the FADT
>> part, that SMI_CMD and other SMI-related commands need to be issued on the boot processor, i.e. the
>> cpu with the BSP flag set; see Table 5-34 in the ACPI 5.0 spec. I associate this with the cpu-hotplug
>> restriction that the boot processor cannot be physically removed. Also, there's a comment in __acpi_os_execute():
>>
>>          /*
>>           * On some machines, a software-initiated SMI causes corruption unless
>>           * the SMI runs on CPU 0.  An SMI can be initiated by any AML, but
>>           * typically it's done in GPE-related methods that are run via
>>           * workqueues, so we can avoid the known corruption cases by always
>>           * queueing on CPU 0.
>>           */
>>          ret = queue_work_on(0, queue, &dpc->work);
>>
>> But I don't know whether the SMI even requires the BSP flag to be set. I need suggestions from
>> experts in this field.
>>
>> I'll post the patch next week and cc some experts in order to fill the patch
>> description with a concrete description of what kinds of firmware are affected by unsetting
>> the BSP flag.
>
> Ok. BTW, did clearing BSP flag work for you experimentally? Hpa mentioned
> that it did not for Fenghua.
>

My experiments are still not sufficient. Unsetting the BSP flag alone didn't affect system behavior;
it's necessary to trigger some firmware whose operation depends on the BSP flag being set. Previously
I tried hibernation and suspend because I suspected they depend on the BSP flag, but apparently they worked.

Luckily, this week I can use a system with the ACPI hot-plugging feature, on which I can trigger
the SMI_CMD port as described in the ACPI specification. I plan to test the logic of unsetting the
BSP flag on that system.

> To me, looking into the ACPI/MP tables to figure out which cpu is the BSP and not
> bringing up that cpu sounds simple, and one does not have to worry
> about dealing with side effects of clearing the BSP bit, if any.
>
> CCing Eric to figure out whether he is particular about the clearing-BSP-bit
> solution or is willing to accept the solution of booting N-1 cpus
> in the kdump kernel by not bringing up the BSP.
>
> Thanks
> Vivek
>


-- 
Thanks.
HATAYAMA, Daisuke




end of thread, other threads:[~2013-07-08  9:24 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-21 14:17 32TB kdump Cliff Wickman
2013-06-27 21:17 ` Vivek Goyal
2013-06-28 21:56   ` Cliff Wickman
2013-07-01  0:42     ` HATAYAMA Daisuke
2013-07-01  2:57     ` Atsushi Kumagai
2013-07-01 16:12     ` Vivek Goyal
2013-07-01  0:55   ` HATAYAMA Daisuke
2013-07-01 16:06     ` Vivek Goyal
     [not found]       ` <51D3D15D.5090600@jp.fujitsu.com>
2013-07-03 13:03         ` Vivek Goyal
2013-07-04  2:03           ` HATAYAMA Daisuke
2013-07-05 15:21             ` Not booting BSP in kdump kernel (Was: Re: 32TB kdump) Vivek Goyal
2013-07-08  9:23               ` HATAYAMA Daisuke
