From: Donald Dutile <ddutile@redhat.com>
To: David Hildenbrand <david@redhat.com>, Jiri Bohac <jbohac@suse.cz>,
Baoquan He <bhe@redhat.com>, Vivek Goyal <vgoyal@redhat.com>,
Dave Young <dyoung@redhat.com>,
kexec@lists.infradead.org
Cc: Philipp Rudo <prudo@redhat.com>, Pingfan Liu <piliu@redhat.com>,
Tao Liu <ltao@redhat.com>,
linux-kernel@vger.kernel.org,
David Hildenbrand <dhildenb@redhat.com>,
Michal Hocko <mhocko@suse.cz>
Subject: Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
Date: Mon, 3 Mar 2025 09:17:51 -0500 [thread overview]
Message-ID: <427fec88-2a74-471e-aeb6-a108ca8c4336@redhat.com> (raw)
In-Reply-To: <04904e86-5b5f-4aa1-a120-428dac119189@redhat.com>
On 3/3/25 3:25 AM, David Hildenbrand wrote:
> On 20.02.25 17:48, Jiri Bohac wrote:
>> Hi,
>>
>> this series implements a way to reserve additional crash kernel
>> memory using CMA.
>>
>> Link to the v1 discussion:
>> https://lore.kernel.org/lkml/ZWD_fAPqEWkFlEkM@dwarf.suse.cz/
>> See below for the changes since v1 and how concerns from the
>> discussion have been addressed.
>>
>> Currently, all the memory for the crash kernel is not usable by
>> the 1st (production) kernel. It is also unmapped so that it can't
>> be corrupted by the fault that will eventually trigger the crash.
>> This makes sense for the memory actually used by the kexec-loaded
>> crash kernel image and initrd and the data prepared during the
>> load (vmcoreinfo, ...). However, the reserved space needs to be
>> much larger than that to provide enough run-time memory for the
>> crash kernel and the kdump userspace. Estimating the amount of
>> memory to reserve is difficult. Being too careful makes kdump
>> likely to end in OOM, being too generous takes even more memory
>> from the production system. Also, the reservation only allows
>> reserving a single contiguous block (or two with the "low"
>> suffix). I've seen systems where this fails because the physical
>> memory is fragmented.
>>
>> By reserving additional crashkernel memory from CMA, the main
>> crashkernel reservation can be just large enough to fit the
>> kernel and initrd image, minimizing the memory taken away from
>> the production system. Most of the run-time memory for the crash
>> kernel will be memory previously available to userspace in the
>> production system. As this memory is no longer wasted, the
>> reservation can be done with a generous margin, making kdump more
>> reliable. Kernel memory that we need to preserve for dumping is
>> never allocated from CMA. User data is typically not dumped by
>> makedumpfile. When dumping of user data is intended this new CMA
>> reservation cannot be used.
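For readers following along, a made-up example of what such a combined reservation might look like on the kernel command line (sizes are illustrative; the exact ",cma" suffix syntax is defined by patch 1 of this series):

```shell
# Small guaranteed reservation, just enough for the crash kernel
# image and initrd, plus a generous CMA-backed reservation that
# remains usable by movable allocations in the production kernel:
crashkernel=128M crashkernel=512M,cma
```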
>
>
> Hi,
>
> I'll note that your comment about "user space" is currently the case, but will likely not hold in the long run. The assumption you are making is that only user-space memory will be allocated from MIGRATE_CMA, which is not necessarily the case. Any movable allocation will end up in there.
>
> Besides LRU folios (user space memory and the pagecache), we already support migration of some kernel allocations using the non-lru migration framework. Such allocations (which use __GFP_MOVABLE, see __SetPageMovable()) currently only include
> * memory balloon: pages we never want to dump either way
> * zsmalloc (->zpool): only used by zswap (-> compressed LRU pages)
> * z3fold (->zpool): only used by zswap (-> compressed LRU pages)
>
> Just imagine if we support migration of other kernel allocations, such as user page tables. The dump would be missing important information.
>
IOMMUFD is a near-term candidate for user page tables, with multi-stage IOMMU support going through upstream review atm.
Just saying that David's case will be the norm in high-end VMs with performance-enhanced, guest-driven IOMMU support (for GPUs).
> Once that happens, it will become a lot harder to judge whether CMA can be used or not. At least, the kernel could bail out/warn for these kernel configs.
>
I don't think the aforementioned work is currently focused on using CMA, but given CMA's performance benefits, it won't take long for that to be the next perf-improvement step taken.
>>
>> There are five patches in this series:
>>
>> The first adds a new ",cma" suffix to the recently introduced generic
>> crashkernel parsing code. parse_crashkernel() takes one more
>> argument to store the cma reservation size.
>>
>> The second patch implements reserve_crashkernel_cma() which
>> performs the reservation. If the requested size is not available
>> in a single range, multiple smaller ranges will be reserved.
>>
>> The third patch updates Documentation/, explicitly mentioning the
>> potential DMA corruption of the CMA-reserved memory.
>>
>> The fourth patch adds a short delay before booting the kdump
>> kernel, allowing pending DMA transfers to finish.
>
>
> What does "short" mean? At least in theory, long-term pinning is forbidden for MIGRATE_CMA, so we should not have such pages mapped into an iommu where DMA can happily keep going on for quite a while.
>
Hmmm, in the case I mentioned above, should there be a kexec hook in multi-stage IOMMU support for the hypervisor/VMM to invalidate/shut off stage-2 mappings asap (a multi-microsecond process), so that DMA from VMs is terminated quickly? Is that already done today (due to 'simple', single-stage device assignment in a VM)?
> But that assumes that our old kernel is not buggy, and doesn't end up mapping these pages into an IOMMU where DMA will just continue. I recall that DRM might currently be a problem, described here [1].
>
> If kdump starts not working as expected in case our old kernel is buggy, doesn't that partially destroy the purpose of kdump (-> debug bugs in the old kernel)?
>
>
> [1] https://lore.kernel.org/all/Z6MV_Y9WRdlBYeRs@phenom.ffwll.local/T/#u
>