References: <20250515182322.117840-1-pasha.tatashin@soleen.com>
 <20250515182322.117840-11-pasha.tatashin@soleen.com>
 <20250624-akzeptabel-angreifbar-9095f4717ca4@brauner>
 <20250625-akrobatisch-libellen-352997eb08ef@brauner>
In-Reply-To: <20250625-akrobatisch-libellen-352997eb08ef@brauner>
From: "pasha.tatashin"
Date: Wed, 25 Jun 2025 12:58:49 -0400
Subject: Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
To: Christian Brauner
Cc: pratyush@kernel.org, jasonmiu@google.com, graf@amazon.com,
 changyuanl@google.com, rppt@kernel.org, dmatlack@google.com,
 rientjes@google.com, corbet@lwn.net, rdunlap@infradead.org,
 ilpo.jarvinen@linux.intel.com, kanie@linux.alibaba.com, ojeda@kernel.org,
 aliceryhl@google.com, masahiroy@kernel.org, akpm@linux-foundation.org,
 tj@kernel.org, yoann.congal@smile.fr, mmaurer@google.com,
 roman.gushchin@linux.dev, chenridong@huawei.com, axboe@kernel.dk,
 mark.rutland@arm.com, jannh@google.com, vincent.guittot@linaro.org,
 hannes@cmpxchg.org, dan.j.williams@intel.com, david@redhat.com,
 joel.granados@kernel.org, rostedt@goodmis.org, anna.schumaker@oracle.com,
 song@kernel.org, zhangguopeng@kylinos.cn, linux@weissschuh.net,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-mm@kvack.org, gregkh@linuxfoundation.org, tglx@linutronix.de,
 mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
 x86@kernel.org, hpa@zytor.com, rafael@kernel.org, dakr@kernel.org,
 bartosz.golaszewski@linaro.org, cw00.choi@samsung.com,
 myungjoo.ham@samsung.com, yesanishhere@gmail.com,
 Jonathan.Cameron@huawei.com, quic_zijuhu@quicinc.com,
 aleksander.lobakin@intel.com, ira.weiny@intel.com,
 andriy.shevchenko@linux.intel.com, leon@kernel.org, lukas@wunner.de,
 bhelgaas@google.com, wagi@kernel.org, djeffery@redhat.com,
 stuart.w.hayes@gmail.com, ptyadav@amazon.de

On Wed, Jun 25, 2025 at 5:36 AM Christian Brauner wrote:
>
> > > I'm not sure why people are so in love with character device based apis.
> > > It's terrible. It glues everything to devtmpfs which isn't namespacable
> > > in any way.
> > > It's terrible to delegate and extremely restrictive in terms
> > > of extensibility if you need additional device entries (aka the loop
> > > driver folly).
> > >
> > > One stupid question: I probably have asked this before and just swapped
> > > out that I a) asked this already and b) received an explanation. But why
> > > isn't this a singleton simple in-memory filesystem with a flat
> > > hierarchy?
> >
> > Hi Christian,
> >
> > Thank you for the detailed feedback and for raising this important
>
> I don't know about detailed but no problem.
>
> > design question. I appreciate the points you've made about the
> > benefits of a filesystem-based API.
> >
> > I have thought thoroughly about this and explored various alternatives
> > before settling on the ioctl-based interface. This design isn't a
> > sudden decision but is based on ongoing conversations that have been
> > happening for over two years at LPC, as well as incorporating direct
> > feedback I received on LUOv1 at LSF/MM.
>
> Well, Mike mentioned that ultimately you want to interface this with
> systemd? And we certainly have never been privy to any of these
> uapi design conversations. Which is usually not a good sign...
> >
> > The choice for an ioctl-based character device was ultimately driven
> > by the specific lifecycle and dependency management requirements of
> > the live update process. While a filesystem API offers great
> > advantages in visibility and hierarchy, filesystems are not typically
> > designed to be state machines with the complex lifecycle, dependency,
> > and ownership tracking that LUO needs to manage.
> >
> > Let me elaborate on the key aspects that led to the current design:
> >
> > 1. session based lifecycle management: The preservation of an FD is
> > tied to the open instance of /dev/liveupdate.
> > If a userspace agent
> > opens /dev/liveupdate, registers several FDs for preservation, and
> > then crashes or exits before the prepare phase is triggered, all FDs
> > it registered are automatically unregistered. This "session-scoped"
> > behavior is crucial to prevent leaking preserved resources into the
> > next kernel if the controlling process fails. This is naturally
> > handled by the open() and release() file operations on a character
> > device. It's not immediately obvious how a similar automatic,
> > session-based cleanup would be implemented with a singleton
> > filesystem.
>
> fwiw
>
> fd_context = fsopen("kexecfs")
> fd_context = fsconfig(FSCONFIG_CMD_CREATE, ...)
> fd_mnt = fsmount(fd_context, ...)

How is this kexecfs mount going to be restored into the container
view? Will we need to preserve fd_context in some global(?)
preservation way, i.e. in a root? Or is there a different way to
recreate fd_context upon reboot?

> This gets you a private kexecfs instance that's never visible anywhere
> in the filesystem hierarchy. When the fd is closed everything gets auto
> cleaned up by the kernel. No need to umount or anything.

Yes, this is a very good property of using a file system.

> > 2. state machine: LUO is fundamentally a state machine (NORMAL ->
> > PREPARED -> FROZEN -> UPDATED -> NORMAL). As part of this, it provides
> > a crucial guarantee: any resource that was successfully preserved but
> > not explicitly reclaimed by userspace in the new kernel by the time
> > the FINISH event is triggered will be automatically cleaned up and its
> > memory released. This prevents leaks of unreclaimed resources and is
> > managed by the orchestrator, which is a concept that doesn't map
> > cleanly onto standard VFS semantics.
>
> I'm not following this. See above. And also any umount can trivially
> just destroy whatever resource is still left in the filesystem.

LUO provides more than just resource preservation; it orchestrates the
serialization.
While LUO can support various scenarios, let's use virtual machines as
an example. The process involves distinct phases:

Before suspending a VM, the Virtual Machine Monitor may take actions to
quiesce the guest's activity. For example, it might temporarily prevent
guest reboots to avoid new DMA mappings or PCI device resets. We refer
to this preparatory, limited-functionality period as the "brownout."

Following the brownout, LUO is transitioned into the PREPARED state.
This allows device states and other resources that require significant
time to serialize to be processed while the VMs are still running. For
most guests, this preparation period is unnoticeable.

Blackout: Once preparation is complete, the VMs are fully suspended in
memory, and the "blackout" period begins. The goal is to perform the
minimal required shutdown sequence and execute
reboot(LINUX_REBOOT_CMD_KEXEC) as quickly as possible. During this
shutdown, the VMM process itself might or might not be terminated. With
the FS approach, the VMM would have to stay alive for its resources to
be preserved; with liveupdated, it can be terminated, and the session
held by liveupdated would carry the state into the kernel shutdown.

Restoration and Finish: After the new kernel boots, a userspace agent
like liveupdated would manage the preserved resources. It restores and
returns these resources to their respective VMMs or containers upon
request. Once all workloads have resumed, LUO is notified via the FINISH
event. LUO then cleans up any post-live-update state and transitions
the system back to the NORMAL state.

> >
> > 3. dependency tracking: Unlike normal files, preserved resources for
> > live update have strong, often complex interdependencies. For example,
> > a kvmfd might depend on a guestmemfd; an iommufd can depend on vfiofd,
> > eventfd, memfd, and kvmfd. LUO's current design provides explicit
> > callback points (prepare, freeze) where these dependencies can be
> > validated and tracked by the participating subsystems.
> > If a dependency
> > is not met when we are about to freeze, we can fail the entire
> > operation and return an error to userspace. The cancel callback
> > further allows this complex dependency graph to be unwound safely. A
> > filesystem interface based on linkat() or unlink() doesn't inherently
> > provide these critical, ordered points for dependency verification and
> > rollback.
> >
> > While I agree that a filesystem offers superior introspection and
> > integration with standard tools, building this complex, stateful
> > orchestration logic on top of VFS seemed to be forcing a square peg
> > into a round hole. The ioctl interface, while more opaque, provides a
> > direct and explicit way to command the state machine and manage these
> > complex lifecycle and dependency rules.
>
> I'm not going to argue that you have to switch to this kexecfs idea
> but...
>
> You're using a character device that's tied to devtmpfs. In other words,
> you're already using a filesystem interface. Literally the whole code
> here is built on top of filesystem APIs. So this argument is just very
> wrong imho. If you can build it on top of a character device using VFS
> interfaces you can do it as a minimal filesystem.
>
> You're free to define the filesystem interface any way you like it. We
> have a ton of examples there. All your ioctls would just be tied to the
> filesystem instance instead of the /dev/somethingsomething character
> device. The state machine could just be implemented the same way.
>
> One of my points is that with an fs interface you can have easy state
> serialization on a per-service level. IOW, you have a bunch of virtual
> machines running as services or some networking services or whatever.
> You could just bind-mount an instance of kexecfs into the service and
> the service can persist state into the instance and easily recover it
> after kexec.
>
> But anyway, you seem to be set on the ioctl() interface, fine.

I am not against your proposal; it should be discussed, perhaps at the
hypervisor live update bi-weekly meeting [1].

[1] https://lore.kernel.org/all/ee353d62-2e4c-b69c-39e6-1d273bfb01a0@google.com/
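
To make the ordering argument from point 2 concrete, here is a minimal
userspace sketch in C of the NORMAL -> PREPARED -> FROZEN -> UPDATED ->
NORMAL cycle. It is a toy model, not the LUO code: the enum, struct, and
function names are hypothetical, the REBOOT event merely stands in for
the kexec into the new kernel, and the set of states from which cancel
is accepted is an assumption.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model, not the LUO code: states as listed in the email. */
enum luo_state { LUO_NORMAL, LUO_PREPARED, LUO_FROZEN, LUO_UPDATED };

/* Hypothetical event names; REBOOT stands in for the kexec. */
enum luo_event { LUO_EV_PREPARE, LUO_EV_FREEZE, LUO_EV_REBOOT,
                 LUO_EV_FINISH, LUO_EV_CANCEL };

struct luo_model {
    enum luo_state state;
    int preserved;  /* preserved resources not yet reclaimed by userspace */
};

/* Apply one event: 0 on success, -EINVAL if it arrives out of order. */
static int luo_model_event(struct luo_model *m, enum luo_event ev)
{
    switch (ev) {
    case LUO_EV_PREPARE:
        if (m->state != LUO_NORMAL)
            return -EINVAL;
        m->state = LUO_PREPARED;
        return 0;
    case LUO_EV_FREEZE:
        if (m->state != LUO_PREPARED)
            return -EINVAL;
        m->state = LUO_FROZEN;
        return 0;
    case LUO_EV_REBOOT:
        if (m->state != LUO_FROZEN)
            return -EINVAL;
        m->state = LUO_UPDATED;
        return 0;
    case LUO_EV_FINISH:
        if (m->state != LUO_UPDATED)
            return -EINVAL;
        m->preserved = 0;   /* anything still unreclaimed is dropped */
        m->state = LUO_NORMAL;
        return 0;
    case LUO_EV_CANCEL:
        /* Assumption: cancel is only accepted before the reboot. */
        if (m->state != LUO_PREPARED && m->state != LUO_FROZEN)
            return -EINVAL;
        m->state = LUO_NORMAL;
        return 0;
    }
    return -EINVAL;
}

/* Walk one full cycle and check every step. */
static void luo_model_walkthrough(void)
{
    struct luo_model m = { LUO_NORMAL, 2 };

    assert(luo_model_event(&m, LUO_EV_FREEZE) == -EINVAL); /* no prepare yet */
    assert(luo_model_event(&m, LUO_EV_PREPARE) == 0);
    assert(luo_model_event(&m, LUO_EV_FREEZE) == 0);
    assert(luo_model_event(&m, LUO_EV_REBOOT) == 0);
    m.preserved -= 1;   /* userspace reclaims one of the two resources */
    assert(luo_model_event(&m, LUO_EV_FINISH) == 0);
    assert(m.state == LUO_NORMAL && m.preserved == 0);
}
```

Calling luo_model_walkthrough() from a main() exercises the whole cycle.
Nothing in the model depends on whether events arrive through an ioctl
on a character device or through a filesystem instance.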