From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 4 Apr 2025 12:54:25 +0300
From: Mike Rapoport <rppt@kernel.org>
To: Jason Gunthorpe
Cc: Pratyush Yadav, Changyuan Lyu, linux-kernel@vger.kernel.org,
	graf@amazon.com, akpm@linux-foundation.org, luto@kernel.org,
	anthony.yznaga@oracle.com, arnd@arndb.de, ashish.kalra@amd.com,
	benh@kernel.crashing.org, bp@alien8.de, catalin.marinas@arm.com,
	dave.hansen@linux.intel.com, dwmw2@infradead.org, ebiederm@xmission.com,
	mingo@redhat.com, jgowans@amazon.com, corbet@lwn.net, krzk@kernel.org,
	mark.rutland@arm.com, pbonzini@redhat.com, pasha.tatashin@soleen.com,
	hpa@zytor.com, peterz@infradead.org, robh+dt@kernel.org, robh@kernel.org,
	saravanak@google.com, skinsburskii@linux.microsoft.com,
	rostedt@goodmis.org, tglx@linutronix.de, thomas.lendacky@amd.com,
	usama.arif@bytedance.com, will@kernel.org, devicetree@vger.kernel.org,
	kexec@lists.infradead.org, linux-arm-kernel@lists.infradead.org,
	linux-doc@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org
Subject: Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
References: <20250320015551.2157511-1-changyuanl@google.com>
	<20250320015551.2157511-10-changyuanl@google.com>
	<20250403114209.GE342109@nvidia.com>
	<20250403142438.GF342109@nvidia.com>
In-Reply-To: <20250403142438.GF342109@nvidia.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hi Jason,

On Thu, Apr 03, 2025 at 11:24:38AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 03, 2025 at 04:58:27PM +0300, Mike Rapoport wrote:
> > On Thu, Apr 03, 2025 at 08:42:09AM -0300, Jason Gunthorpe wrote:
> > > On Wed, Apr 02, 2025 at 07:16:27PM +0000, Pratyush Yadav wrote:
> > > > > +int kho_preserve_phys(phys_addr_t phys, size_t size)
> > > > > +{
> > > > > +	unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size);
> > > > > +	unsigned int order = ilog2(end_pfn - pfn);
> > > >
> > > > This caught my eye when playing around with the code. It does not put
> > > > any limit on the order, so it can exceed NR_PAGE_ORDERS. Also, when
> > > > initializing the page after KHO, we pass the order directly to
> > > > prep_compound_page() without sanity checking it. The next kernel might
> > > > not support all the orders the current one supports. Perhaps something
> > > > to fix?
> > >
> > > IMHO we should delete the phys functions until we get a user of them
> >
> > The only user of memory tracker in this series uses kho_preserve_phys()
>
> But it really shouldn't. The reserved memory is a completely different
> mechanism than buddy allocator preservation. It doesn't even call
> kho_restore_phys() on those pages, it just feeds the ranges directly to:
>
> +	reserved_mem_add(*p_start, size, name);
>
> The bitmaps should be understood as preserving memory from the buddy
> allocator only.
>
> IMHO it should not call kho_preserve_phys() at all.

Do you mean that for preserving large physical ranges we need something
entirely different? Then we don't need the bitmaps at this point, as we
don't have any users for kho_preserve_folio(), and we should not worry
ourselves with orders and restoration of high order folios until then ;-)

Now more seriously, I considered the bitmaps and sparse xarrays a good
initial implementation of memory preservation that can handle both
physical ranges now and folios later when we'll need them. It might not be
the optimal solution in the long run, but we don't have enough data right
now to do any optimizations for real. Preserving huge amounts of distinct
order-0 pages does not seem to me a representative test case at all.

The xarrays + bitmaps do have the limitation that we cannot store any
information about the folio except its order, and since we anyway need
something else to preserve physical ranges, I suggest starting with
preserving ranges and then adding optimizations for the folio case.
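To make the overflow concern quoted above concrete, here is a quick
back-of-the-envelope model (a Python sketch; the 4 KiB page size and a
buddy limit of order 10 are assumptions matching a default x86-64 config,
not values taken from this thread):

```python
# Model of the order computed by kho_preserve_phys() for a large range.
# Kernel ilog2(n) is floor(log2(n)).

PAGE_SHIFT = 12        # assumed 4 KiB pages
MAX_PAGE_ORDER = 10    # assumed default x86-64 buddy allocator limit

def ilog2(n):
    return n.bit_length() - 1

def kho_order(phys, size):
    pfn = phys >> PAGE_SHIFT
    end_pfn = (phys + size) >> PAGE_SHIFT
    return ilog2(end_pfn - pfn)

# Preserving a 1 GiB physically contiguous range yields order 18, well
# above the assumed buddy limit, with no sanity check on either side.
order = kho_order(0x100000000, 1 << 30)
print(order, order > MAX_PAGE_ORDER)  # -> 18 True
```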
As I've mentioned earlier, a maple tree is perfect for tracking ranges: it
is simpler than other alternatives, it allows storing information about a
range, and it provides easy and efficient coalescing of adjacent ranges
with matching properties.

The maple tree based memory tracker is less memory efficient than a bitmap
if we count how much data is required to preserve gigabytes of distinct
order-0 pages, but I don't think this is the right thing to measure, at
least until we have some real data about how KHO is used.

Here's something that implements preservation of ranges (compile tested
only); adding folios with their orders and maybe other information would
be quite easy.

/*
 * Keep track of memory that is to be preserved across KHO.
 *
 * For simplicity use a maple tree that conveniently stores ranges and
 * allows adding BITS_PER_XA_VALUE bits of metadata to each range
 */
struct kho_mem_track {
	struct maple_tree ranges;
};

static struct kho_mem_track kho_mem_track;

typedef unsigned long kho_range_desc_t;

static int __kho_preserve(struct kho_mem_track *tracker, unsigned long addr,
			  size_t size, kho_range_desc_t info)
{
	struct maple_tree *ranges = &tracker->ranges;
	MA_STATE(mas, ranges, addr - 1, addr + size + 1);
	unsigned long lower, upper;
	void *area = NULL;

	lower = addr;
	upper = addr + size - 1;

	might_sleep();

	/* coalesce with an adjacent preceding range, if any */
	area = mas_walk(&mas);
	if (area && mas.last == addr - 1)
		lower = mas.index;

	/* coalesce with an adjacent following range, if any */
	area = mas_next(&mas, ULONG_MAX);
	if (area && mas.index == addr + size)
		upper = mas.last;

	mas_set_range(&mas, lower, upper);
	return mas_store_gfp(&mas, xa_mk_value(info), GFP_KERNEL);
}

/**
 * kho_preserve_phys - preserve a physically contiguous range across KHO.
 * @phys: physical address of the range
 * @size: size of the range
 *
 * Records that the entire range from @phys to @phys + @size is preserved
 * across KHO.
 *
 * Return: 0 on success, error code on failure
 */
int kho_preserve_phys(phys_addr_t phys, size_t size)
{
	return __kho_preserve(&kho_mem_track, phys, size, 0);
}
EXPORT_SYMBOL_GPL(kho_preserve_phys);

#define KHOSER_PTR(type) union { phys_addr_t phys; type ptr; }
#define KHOSER_STORE_PTR(dest, val)			\
	({						\
		typecheck(typeof((dest).ptr), val);	\
		(dest).phys = virt_to_phys(val);	\
	})
#define KHOSER_LOAD_PTR(src) \
	((src).phys ? (typeof((src).ptr))(phys_to_virt((src).phys)) : NULL)

struct khoser_mem_range {
	phys_addr_t start;
	phys_addr_t size;
	unsigned long data;
};

struct khoser_mem_chunk_hdr {
	KHOSER_PTR(struct khoser_mem_chunk *) next;
	unsigned long num_ranges;
};

#define KHOSER_RANGES_SIZE					\
	((PAGE_SIZE - sizeof(struct khoser_mem_chunk_hdr)) /	\
	 sizeof(struct khoser_mem_range))

struct khoser_mem_chunk {
	struct khoser_mem_chunk_hdr hdr;
	struct khoser_mem_range ranges[KHOSER_RANGES_SIZE];
};

static int new_chunk(struct khoser_mem_chunk **cur_chunk)
{
	struct khoser_mem_chunk *chunk;

	chunk = kzalloc(sizeof(*chunk), GFP_KERNEL);
	if (!chunk)
		return -ENOMEM;
	if (*cur_chunk)
		KHOSER_STORE_PTR((*cur_chunk)->hdr.next, chunk);
	*cur_chunk = chunk;
	return 0;
}

/*
 * Record all the ranges in a linked list of pages for the next kernel to
 * process. Each chunk holds an array of ranges. The maple_tree is used to
 * store them in a tree while building up the data structure, but the KHO
 * successor kernel only needs to process them once in order.
 *
 * All of this memory is normal kmalloc() memory and is not marked for
 * preservation. The successor kernel will remain isolated to the scratch
 * space until it completes processing this list. Once processed all the
 * memory storing these ranges will be marked as free.
 */
static int kho_mem_serialize(phys_addr_t *fdt_value)
{
	struct kho_mem_track *tracker = &kho_mem_track;
	struct maple_tree *ranges = &tracker->ranges;
	struct khoser_mem_chunk *first_chunk = NULL;
	struct khoser_mem_chunk *chunk = NULL;
	MA_STATE(mas, ranges, 0, ULONG_MAX);
	void *entry;
	int err;

	mas_for_each(&mas, entry, ULONG_MAX) {
		size_t size = mas.last - mas.index + 1;
		struct khoser_mem_range *range;

		/* allocate the first chunk, and a new one whenever the
		 * current chunk is full */
		if (!chunk ||
		    chunk->hdr.num_ranges == ARRAY_SIZE(chunk->ranges)) {
			err = new_chunk(&chunk);
			if (err)
				goto err_free;
			if (!first_chunk)
				first_chunk = chunk;
		}

		range = &chunk->ranges[chunk->hdr.num_ranges];
		range->start = mas.index;
		range->size = size;
		range->data = xa_to_value(entry);
		chunk->hdr.num_ranges++;
	}

	*fdt_value = virt_to_phys(first_chunk);
	return 0;

err_free:
	chunk = first_chunk;
	while (chunk) {
		struct khoser_mem_chunk *tmp = chunk;

		chunk = KHOSER_LOAD_PTR(chunk->hdr.next);
		kfree(tmp);
	}
	return err;
}

static void __init deserialize_range(struct khoser_mem_range *range)
{
	memblock_reserved_mark_noinit(range->start, range->size);
	memblock_reserve(range->start, range->size);
}

static void __init kho_mem_deserialize(void)
{
	const void *fdt = kho_get_fdt();
	struct khoser_mem_chunk *chunk;
	const phys_addr_t *mem;
	int len, node;

	if (!fdt)
		return;

	node = fdt_path_offset(fdt, "/preserved-memory");
	if (node < 0) {
		pr_err("no preserved-memory node: %d\n", node);
		return;
	}

	mem = fdt_getprop(fdt, node, "metadata", &len);
	if (!mem || len != sizeof(*mem)) {
		pr_err("failed to get preserved memory ranges\n");
		return;
	}

	chunk = phys_to_virt(*mem);
	while (chunk) {
		unsigned int i;

		memblock_reserve(virt_to_phys(chunk), sizeof(*chunk));
		for (i = 0; i != chunk->hdr.num_ranges; i++)
			deserialize_range(&chunk->ranges[i]);
		chunk = KHOSER_LOAD_PTR(chunk->hdr.next);
	}
}

-- 
Sincerely yours,
Mike.