From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C1F1D30CD8A; Wed, 27 Aug 2025 15:04:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1756307047; cv=none; b=WPnhZI4mzufOfzaFrdZrSLv6Bcb85//TJL3EMKgtLk6jNPx8Dngahy5em2CvzgmJwdobSjT06//VGQ6/qcyy0A93/KpfTldMJeTo8moGpWmPyMZHEOshT3Aajd+9GmHwuf+ls6a7Ph9QN7F9xvupnonlpqpA+CcPpZZIcMN8fbI= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1756307047; c=relaxed/simple; bh=uo4Ob9kTPdoJmgAnEx6L4XzAms4E4Sy8+Nf0B+A98EM=; h=From:To:Cc:Subject:In-Reply-To:References:Date:Message-ID: MIME-Version:Content-Type; b=Mg6vFhFo10IO/gB7J3DpGtimFfYnefUgTd5WO5dhsdsoCyQ5XqDECxEmCabRC+PneFWg00HqDJsSxzYor/olwVKCCe5vK50uqvyPwzdV7yeEsCRITy1XuD9ZRYhDIfhd9ZFQMUBdMxIHnNIH+4jYu6RM+qf4UUfEUw4YffGP+pI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=GFB7Oh1e; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="GFB7Oh1e" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 0D109C4CEEB; Wed, 27 Aug 2025 15:03:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1756307047; bh=uo4Ob9kTPdoJmgAnEx6L4XzAms4E4Sy8+Nf0B+A98EM=; h=From:To:Cc:Subject:In-Reply-To:References:Date:From; b=GFB7Oh1eo+TReZoN3bwTBKHsq/yKs0Ec6wezEhYD6I52VzGWuajG/VUdqJlHo9GBa sP5uM0oIsz73s+GR3PHPfQM4FtpF3efam0U/5zWK0ivK0fj6LbZC5fFyPNQgTeW63w E7NL4+4i6tYIG42apLWejr2ppn7Fn5cZUKPnTLphMA4LKIcV/vG6hX/sMNARHSiFsV jQkO6X8RnOY4b/EE++uh5xQ5m0rgUa3Pqfc7MVaJ/KmZBiq0S0kt05UaAncWvIBfdx M34cS7vYly+aHvBfVvWPLH1rZ41QgDItwMsH34hBP87mp3hJpu4GzWzammLxdZZMf/ wwiiTIEjNHQnQ== From: Pratyush Yadav To: Jason Gunthorpe Cc: Pasha Tatashin , pratyush@kernel.org, jasonmiu@google.com, graf@amazon.com, changyuanl@google.com, rppt@kernel.org, dmatlack@google.com, rientjes@google.com, corbet@lwn.net, rdunlap@infradead.org, ilpo.jarvinen@linux.intel.com, kanie@linux.alibaba.com, ojeda@kernel.org, aliceryhl@google.com, masahiroy@kernel.org, akpm@linux-foundation.org, tj@kernel.org, yoann.congal@smile.fr, mmaurer@google.com, roman.gushchin@linux.dev, chenridong@huawei.com, axboe@kernel.dk, mark.rutland@arm.com, jannh@google.com, vincent.guittot@linaro.org, hannes@cmpxchg.org, dan.j.williams@intel.com, david@redhat.com, joel.granados@kernel.org, rostedt@goodmis.org, anna.schumaker@oracle.com, song@kernel.org, zhangguopeng@kylinos.cn, linux@weissschuh.net, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, gregkh@linuxfoundation.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, rafael@kernel.org, dakr@kernel.org, bartosz.golaszewski@linaro.org, cw00.choi@samsung.com, myungjoo.ham@samsung.com, yesanishhere@gmail.com, Jonathan.Cameron@huawei.com, quic_zijuhu@quicinc.com, aleksander.lobakin@intel.com, ira.weiny@intel.com, andriy.shevchenko@linux.intel.com, leon@kernel.org, lukas@wunner.de, bhelgaas@google.com, wagi@kernel.org, djeffery@redhat.com, stuart.w.hayes@gmail.com, lennart@poettering.net, brauner@kernel.org, linux-api@vger.kernel.org, linux-fsdevel@vger.kernel.org, saeedm@nvidia.com, ajayachandra@nvidia.com, parav@nvidia.com, leonro@nvidia.com, witu@nvidia.com Subject: Re: [PATCH v3 29/30] luo: allow preserving memfd In-Reply-To: <20250826162019.GD2130239@nvidia.com> References: <20250807014442.3829950-1-pasha.tatashin@soleen.com> <20250807014442.3829950-30-pasha.tatashin@soleen.com> <20250826162019.GD2130239@nvidia.com> Date: Wed, 27 Aug 2025 17:03:55 +0200 Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Precedence: bulk X-Mailing-List: linux-doc@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain Hi Jason, Thanks for the review. On Tue, Aug 26 2025, Jason Gunthorpe wrote: > On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote: > >> + /* >> + * Most of the space should be taken by preserved folios. So take its >> + * size, plus a page for other properties. >> + */ >> + fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE); >> + if (!fdt) { >> + err = -ENOMEM; >> + goto err_unpin; >> + } > > This doesn't seem to have any versioning scheme, it really should.. It does. See the "compatible" property. static const char memfd_luo_compatible[] = "memfd-v1"; static struct liveupdate_file_handler memfd_luo_handler = { .ops = &memfd_luo_file_ops, .compatible = memfd_luo_compatible, }; This goes into the LUO FDT: static int luo_files_to_fdt(struct xarray *files_xa_out) [...] xa_for_each(files_xa_out, token, h) { [...] ret = fdt_property_string(luo_file_fdt_out, "compatible", h->fh->compatible); So this function only gets called for the version 1. > >> + err = fdt_property_placeholder(fdt, "folios", preserved_size, >> + (void **)&preserved_folios); >> + if (err) { >> + pr_err("Failed to reserve folios property in FDT: %s\n", >> + fdt_strerror(err)); >> + err = -ENOMEM; >> + goto err_free_fdt; >> + } > > Yuk. > > This really wants some luo helper > > 'luo alloc array' > 'luo restore array' > 'luo free array' > > Which would get a linearized list of pages in the vmap to hold the > array and then allocate some structure to record the page list and > return back the u64 of the phys_addr of the top of the structure to > store in whatever. > > Getting fdt to allocate the array inside the fds is just not going to > work for anything of size. Yep, I agree. This version already runs into size limits of around 1 GiB due to the FDT being limited to MAX_PAGE_ORDER, since that is the largest contiguous piece of memory folio_alloc() can give us. On top, FDT is only limited to 32 bits. While very large, it isn't unreasonable to expect metadata exceeding that for some use cases (4 GiB is only 0.4% of 1 TiB and there are systems a lot larger than that around). I think we need something a luo_xarray data structure that users like memfd (and later hugetlb and guest_memfd and maybe others) can build to make serialization easier. It will cover both contiguous arrays and arrays with some holes in them. I did it this way mainly to keep things simple and get things out. But Pasha already mentioned he is running into this limit for some tests, so I think I will experiment around with a serialized xarray design. > >> + for (; i < nr_pfolios; i++) { >> + const struct memfd_luo_preserved_folio *pfolio = &pfolios[i]; >> + phys_addr_t phys; >> + u64 index; >> + int flags; >> + >> + if (!pfolio->foliodesc) >> + continue; >> + >> + phys = PFN_PHYS(PRESERVED_FOLIO_PFN(pfolio->foliodesc)); >> + folio = kho_restore_folio(phys); >> + if (!folio) { >> + pr_err("Unable to restore folio at physical address: %llx\n", >> + phys); >> + goto put_file; >> + } >> + index = pfolio->index; >> + flags = PRESERVED_FOLIO_FLAGS(pfolio->foliodesc); >> + >> + /* Set up the folio for insertion. */ >> + /* >> + * TODO: Should find a way to unify this and >> + * shmem_alloc_and_add_folio(). >> + */ >> + __folio_set_locked(folio); >> + __folio_set_swapbacked(folio); >> >> + ret = mem_cgroup_charge(folio, NULL, mapping_gfp_mask(mapping)); >> + if (ret) { >> + pr_err("shmem: failed to charge folio index %d: %d\n", >> + i, ret); >> + goto unlock_folio; >> + } > > [..] > >> + folio_add_lru(folio); >> + folio_unlock(folio); >> + folio_put(folio); >> + } > > Probably some consolidation will be needed to make this less > duplicated.. Maybe. I do have that as a TODO item, but I took a quick look today and I am not sure if it will make things simple enough. There are a few places that add a folio to the shmem page cache, and all of them have subtle differences and consolidating them all might be tricky. Let me give it a shot... > > But overall I think just using the memfd_luo_preserved_folio as the > serialization is entirely file, I don't think this needs anything more > complicated. > > What it does need is an alternative to the FDT with versioning. As I explained above, the versioning is already there. Beyond that, why do you think a raw C struct is better than FDT? It is just another way of expressing the same information. FDT is a bit more cumbersome to write and read, but comes at the benefit of more introspect-ability. > > Which seems to me to be entirely fine as: > > struct memfd_luo_v0 { > __aligned_u64 size; > __aligned_u64 pos; > __aligned_u64 folios; > }; > > struct memfd_luo_v0 memfd_luo_v0 = {.size = size, pos = file->f_pos, folios = folios}; > luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0); > > Which also shows the actual data needing to be serialized comes from > more than one struct and has to be marshaled in code, somehow, to a > single struct. > > Then I imagine a fairly simple forwards/backwards story. If something > new is needed that is non-optional, lets say you compress the folios > list to optimize holes: > > struct memfd_luo_v1 { > __aligned_u64 size; > __aligned_u64 pos; > __aligned_u64 folios_list_with_holes; > }; > > Obviously a v0 kernel cannot parse this, but in this case a v1 aware > kernel could optionally duplicate and write out the v0 format as well: > > luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0); > luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1); I think what you describe here is essentially how LUO works currently, just that the mechanisms are a bit different. For example, instead of the subsystem calling luo_store_object(), the LUO core calls back into the subsystem at the appropriate time to let it populate the object. See memfd_luo_prepare() and the data argument. The version is decided by the compatible string with which the handler was registered. Since LUO knows when to start serializing what, I think this flow of calling into the subsystem and letting it fill in an object that LUO tracks and hands over makes a lot of sense. > > Then the rule is fairly simple, when the sucessor kernel goes to > deserialize it asks luo for the versions it supports: > > if (luo_restore_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1)) > restore_v1(&memfd_luo_v1) > else if (luo_restore_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0)) > restore_v0(&memfd_luo_v0) > else > luo_failure("Do not understand this"); Similarly, on restore side, the new kernel can register handlers of all the versions it can deal with, and LUO core takes care of calling into the right callback. See memfd_luo_retrieve() for example. If we now have a v2, the new kernel can simply define a new handler for v2 and add a new memfd_luo_retrieve_v2(). > > luo core just manages this list of versioned data per serialized > object. There is only one version per object. This also holds true. -- Regards, Pratyush Yadav