From: Donald Dutile <ddutile@redhat.com>
To: David Hildenbrand <david@redhat.com>, Jiri Bohac <jbohac@suse.cz>,
Baoquan He <bhe@redhat.com>, Vivek Goyal <vgoyal@redhat.com>,
Dave Young <dyoung@redhat.com>,
kexec@lists.infradead.org
Cc: Philipp Rudo <prudo@redhat.com>, Pingfan Liu <piliu@redhat.com>,
Tao Liu <ltao@redhat.com>,
linux-kernel@vger.kernel.org,
David Hildenbrand <dhildenb@redhat.com>,
Michal Hocko <mhocko@suse.cz>
Subject: Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
Date: Mon, 3 Mar 2025 09:17:51 -0500 [thread overview]
Message-ID: <427fec88-2a74-471e-aeb6-a108ca8c4336@redhat.com> (raw)
In-Reply-To: <04904e86-5b5f-4aa1-a120-428dac119189@redhat.com>
On 3/3/25 3:25 AM, David Hildenbrand wrote:
> On 20.02.25 17:48, Jiri Bohac wrote:
>> Hi,
>>
>> this series implements a way to reserve additional crash kernel
>> memory using CMA.
>>
>> Link to the v1 discussion:
>> https://lore.kernel.org/lkml/ZWD_fAPqEWkFlEkM@dwarf.suse.cz/
>> See below for the changes since v1 and how concerns from the
>> discussion have been addressed.
>>
>> Currently, all the memory for the crash kernel is not usable by
>> the 1st (production) kernel. It is also unmapped so that it can't
>> be corrupted by the fault that will eventually trigger the crash.
>> This makes sense for the memory actually used by the kexec-loaded
>> crash kernel image and initrd and the data prepared during the
>> load (vmcoreinfo, ...). However, the reserved space needs to be
>> much larger than that to provide enough run-time memory for the
>> crash kernel and the kdump userspace. Estimating the amount of
>> memory to reserve is difficult. Being too careful makes kdump
>> likely to end in OOM, being too generous takes even more memory
>> from the production system. Also, the reservation only allows
>> reserving a single contiguous block (or two with the "low"
>> suffix). I've seen systems where this fails because the physical
>> memory is fragmented.
>>
>> By reserving additional crashkernel memory from CMA, the main
>> crashkernel reservation can be just large enough to fit the
>> kernel and initrd image, minimizing the memory taken away from
>> the production system. Most of the run-time memory for the crash
>> kernel will be memory previously available to userspace in the
>> production system. As this memory is no longer wasted, the
>> reservation can be done with a generous margin, making kdump more
>> reliable. Kernel memory that we need to preserve for dumping is
>> never allocated from CMA. User data is typically not dumped by
>> makedumpfile. When dumping of user data is intended this new CMA
>> reservation cannot be used.
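For readers following along, a made-up example of what such a combined reservation might look like on the kernel command line (sizes are illustrative; the exact ",cma" suffix syntax is defined by patch 1 of this series):

```shell
# Small guaranteed reservation, just enough for the crash kernel
# image and initrd, plus a generous CMA-backed reservation that
# remains usable by movable allocations in the production kernel:
crashkernel=128M crashkernel=512M,cma
```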
>
>
> Hi,
>
> I'll note that your comment about "user space" is currently the case, but will likely not hold in the long run. The assumption you are making is that only user-space memory will be allocated from MIGRATE_CMA, which is not necessarily the case. Any movable allocation will end up in there.
>
> Besides LRU folios (user space memory and the pagecache), we already support migration of some kernel allocations using the non-lru migration framework. Such allocations (which use __GFP_MOVABLE, see __SetPageMovable()) currently only include
> * memory balloon: pages we never want to dump either way
> * zsmalloc (->zpool): only used by zswap (-> compressed LRU pages)
> * z3fold (->zpool): only used by zswap (-> compressed LRU pages)
>
> Just imagine if we support migration of other kernel allocations, such as user page tables. The dump would be missing important information.
>
IOMMUFD is a near-term candidate for user page tables, with multi-stage IOMMU support going through upstream review atm.
Just saying that David's case will be the norm in high-end VMs with performance-enhanced, guest-driven IOMMU support (for GPUs).
> Once that happens, it will become a lot harder to judge whether CMA can be used or not. At least, the kernel could bail out/warn for these kernel configs.
>
I don't think the aforementioned work is currently focused on using CMA, but given CMA's performance benefits, it won't take long for that to be the next perf-improvement step taken.
>>
>> There are five patches in this series:
>>
>> The first adds a new ",cma" suffix to the recently introduced generic
>> crashkernel parsing code. parse_crashkernel() takes one more
>> argument to store the cma reservation size.
>>
>> The second patch implements reserve_crashkernel_cma() which
>> performs the reservation. If the requested size is not available
>> in a single range, multiple smaller ranges will be reserved.
>>
>> The third patch updates Documentation/, explicitly mentioning the
>> potential DMA corruption of the CMA-reserved memory.
>>
>> The fourth patch adds a short delay before booting the kdump
>> kernel, allowing pending DMA transfers to finish.
>
>
> What does "short" mean? At least in theory, long-term pinning is forbidden for MIGRATE_CMA, so we should not have such pages mapped into an iommu where DMA can happily keep going on for quite a while.
>
Hmmm, in the case I mentioned above, should there be a kexec hook in multi-stage IOMMU support for the hypervisor/VMM to invalidate/shut off stage-2 mappings asap (a multi-microsecond process), so that DMA from VMs is terminated quickly? Is that already done today (due to 'simple', single-stage device assignment in a VM)?
> But that assumes that our old kernel is not buggy, and doesn't end up mapping these pages into an IOMMU where DMA will just continue. I recall that DRM might currently be a problem, described here [1].
>
> If kdump starts not working as expected in case our old kernel is buggy, doesn't that partially destroy the purpose of kdump (-> debug bugs in the old kernel)?
>
>
> [1] https://lore.kernel.org/all/Z6MV_Y9WRdlBYeRs@phenom.ffwll.local/T/#u
>