Re: [PATCH] kexec_core: Accept unaccepted kexec destination addresses

From: Baoquan He <bhe@redhat.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	kexec@lists.infradead.org, linux-kernel@vger.kernel.org,
	linux-coco@lists.linux.dev, x86@kernel.org,
	rick.p.edgecombe@intel.com, kirill.shutemov@linux.intel.com
Subject: Re: [PATCH] kexec_core: Accept unaccepted kexec destination addresses
Date: Tue, 3 Dec 2024 18:30:36 +0800
Message-ID: <Z07dzP6ZdW3sNahj@MiWiFi-R3L-srv>
In-Reply-To: <Z07YJlxK9/piXLhK@yzhao56-desk.sh.intel.com>

On 12/03/24 at 06:06pm, Yan Zhao wrote:
> On Mon, Dec 02, 2024 at 10:17:16PM +0800, Baoquan He wrote:
> > On 11/29/24 at 01:52pm, Yan Zhao wrote:
> > > On Thu, Nov 28, 2024 at 11:19:20PM +0800, Baoquan He wrote:
> > > > On 11/27/24 at 06:01pm, Yan Zhao wrote:
> > > > > On Tue, Nov 26, 2024 at 07:38:05PM +0800, Baoquan He wrote:
> > > > > > On 10/24/24 at 08:15am, Yan Zhao wrote:
> > > > > > > On Wed, Oct 23, 2024 at 10:44:11AM -0500, Eric W. Biederman wrote:
> > > > > > > > "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> > > > > > > > 
> > > > > > > > > Waiting minutes to get VM booted to shell is not feasible for most
> > > > > > > > > deployments. Lazy is sane default to me.
> > > > > > > > 
> > > > > > > > Huh?
> > > > > > > > 
> > > > > > > > Unless my guesses about what is happening are wrong lazy is hiding
> > > > > > > > a serious implementation deficiency.  From all hardware I have seen
> > > > > > > > taking minutes is absolutely ridiculous.
> > > > > > > > 
> > > > > > > > Does writing to all of memory at full speed take minutes?  How can such
> > > > > > > > a system be functional?
> > > > > > > > 
> > > > > > > > If you don't actually have to write to the pages and it is just some
> > > > > > > > accounting function it is even more ridiculous.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > I had previously thought that accept_memory was the firmware call.
> > > > > > > > Now that I see that it is just a wrapper for some hardware specific
> > > > > > > > calls I am even more perplexed.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Quite honestly what this looks like to me is that someone failed to
> > > > > > > > enable write-combining or write-back caching when writing to memory
> > > > > > > > when initializing the protected memory.  With the result that everything
> > > > > > > > > is moving dog slow, and people are introducing complexity left and right
> > > > > > > > > to avoid that bad implementation.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > Can someone please explain to me why this accept_memory stuff has to be
> > > > > > > > > slow, why it has to take minutes to do its job?
> > > > > > > > This kexec patch is a fix for a kexec failure in a guest (TD).
> > > > > > > 
> > > > > > > > For a Linux guest, accept_memory() happens before the guest accesses a page.
> > > > > > > > If the guest is a TD, it will
> > > > > > > > (1) trigger the host to allocate a host physical page to back the accessed
> > > > > > > >     guest page, which might be slow, with waiting and sleeping involved,
> > > > > > > >     depending on the memory pressure on the host, and
> > > > > > > > (2) initialize the protected page.
> > > > > > > 
> > > > > > > > Actually, most guest memory is never accessed by the guest during its life
> > > > > > > > cycle. accept_memory() may cause the host to commit a never-to-be-used page,
> > > > > > > > and that host physical page cannot even be swapped out.
> > > > > > 
> > > > > > > So this sounds to me more like a business requirement on cloud platforms,
> > > > > > > e.g. if a customer books a guest instance with 60G of memory but only
> > > > > > > ever uses 20G at most, then the remaining 40G can be saved to reduce
> > > > > > > pressure on the host.
> > > > > Yes.
> > > > 
> > > > That's very interesting, thanks for confirming.
> > > > 
> > > > > 
> > > > > > I could be shallow, just a wild guess.
> > > > > > If my guess is right, at least those cloud service providers must like this
> > > > > > accept_memory feature very much.
> > > > > > 
> > > > > > > 
> > > > > > > That's why we need a lazy accept, which does not accept_memory() until after a
> > > > > > > page is allocated by the kernel (in alloc_page(s)).
> > > > > > 
> > > > > > By the way, I have two questions, maybe very shallow.
> > > > > > 
> > > > > > > 1) Why can't we just pick already-accepted memory to hold the kexec
> > > > > > > kernel/initrd/bootparam/purgatory?
> > > > > 
> > > > > Currently, the first kernel only accepts memory during the memory allocation in
> > > > > a lazy accept mode. Besides reducing boot time, it's also good for memory
> > > > > over-commitment as you mentioned above.
> > > > > 
> > > > > My understanding of why the memory for the kernel/initrd/bootparam/purgatory is
> > > > > not allocated from the first kernel is that this memory usually needs to be
> > > > > physically contiguous. Since this memory will not be used by the first kernel,
> > > > > looking up from free RAM has a lower chance of failure compared to allocating it
> > > > 
> > > > > Well, there could be a misunderstanding here. The final load position of
> > > > > kernel/initrd/bootparam/purgatory is not searched from free RAM, it's
> > > Oh, by free RAM, I mean system RAM that is marked as
> > > IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY, but not marked as
> > > IORESOURCE_SYSRAM_DRIVER_MANAGED.
> > > 
> > > 
> > > > > just from RAM on x86. That means it may already have been allocated and be in
> > > > > use by another component of the 1st kernel. Unlike kdump, the 2nd kernel of
> > > Yes, it's entirely possible that the destination address being searched out has
> > > already been allocated and is in use by the 1st kernel. e.g. for
> > > KEXEC_TYPE_DEFAULT, the source page for each segment is allocated from the 1st
> > > kernel, and it is allowed to have the same address as its corresponding
> > > destination address.
> > > 
> > > However, it's not guaranteed that the destination address must have been
> > > allocated by the 1st kernel.
> > > 
> > > > > kexec reboot doesn't care about the 1st kernel's memory usage. We will copy
> > > > > them from the intermediate position to the designated location when jumping.
> > > Right. If it's not guaranteed that the destination address has been accepted
> > > before this copying, the copying could trigger an error due to accessing an
> > > unaccepted page, which could be fatal for a linux TDX guest.
> > 
> > > Oh, I just said the opposite. I meant we could search according to the
> > > current unaccepted->bitmap to make sure the destination area has definitely
> > > been accepted. That would be best if doable, though I know it's not
> > > easy.
> Well, this sounds like introducing a new constraint in addition to the current
> checking of !kimage_is_destination_range() in locate_mem_hole_top_down() or
> locate_mem_hole_bottom_up(). (powerpc also has a different implementation).
> 
> This could make the success unpredictable, depending on how many pages have
> been accepted by the 1st kernel and the layout of the accepted pages (e.g.,
> whether they are physically contiguous). The 1st kernel would also have no
> reliable way to ensure success except by accepting all the guest pages.

Yeah, when I finished reading the accept_memory code, this was the first idea
that came to mind. If it could be done, it would be ideal. But when I tried to
draft the change, it required a lot of code changes and added a great deal of
complexity, so I gave up.

Maybe this can be added to the cover letter too, to describe this possible
path we explored.
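
For illustration only, the bitmap-constrained search discussed above could be
modelled roughly as below. This is a self-contained userspace sketch, not
kernel code and not the draft mentioned here: the 64-bit bitmap, the 2 MiB
unit size, and the names range_is_accepted() and
find_accepted_hole_top_down() are all assumptions made up for the example.
It only shows the extra constraint: a candidate hole is rejected unless every
unit it touches is already accepted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define UNIT_SHIFT 21                    /* assumed: one bitmap bit per 2 MiB */
#define UNIT_SIZE  (1ULL << UNIT_SHIFT)
#define NR_UNITS   64                    /* toy model: 64 units, 128 MiB */

/* Hypothetical stand-in for the unaccepted bitmap: bit set == unaccepted. */
static uint64_t unaccepted_bitmap;

/* True only if every unit overlapping [start, start + size) is accepted. */
static bool range_is_accepted(uint64_t start, uint64_t size)
{
	uint64_t first, last, u;

	if (size == 0)
		return false;

	first = start >> UNIT_SHIFT;
	last = (start + size - 1) >> UNIT_SHIFT;

	for (u = first; u <= last; u++) {
		/* Units outside the modelled range count as unaccepted. */
		if (u >= NR_UNITS || ((unaccepted_bitmap >> u) & 1))
			return false;
	}
	return true;
}

/*
 * Walk down from the top of [lo, hi) in steps of 'align' and return the
 * first aligned start whose whole range is accepted, or UINT64_MAX if
 * there is none. 'align' must be a power of two.
 */
static uint64_t find_accepted_hole_top_down(uint64_t lo, uint64_t hi,
					    uint64_t size, uint64_t align)
{
	uint64_t start;

	assert(align && (align & (align - 1)) == 0);

	if (size == 0 || hi < lo || hi - lo < size)
		return UINT64_MAX;

	start = (hi - size) & ~(align - 1);
	for (;;) {
		if (start < lo)
			return UINT64_MAX;
		if (range_is_accepted(start, size))
			return start;
		if (start < align)
			return UINT64_MAX;
		start -= align;
	}
}
```

Even in this toy form the concern raised above is visible: whether the search
succeeds depends entirely on how many units happen to be accepted and whether
they are physically contiguous.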