From mboxrd@z Thu Jan 1 00:00:00 1970
References: <20250515182322.117840-1-pasha.tatashin@soleen.com>
 <20250515182322.117840-11-pasha.tatashin@soleen.com>
 <20250624-akzeptabel-angreifbar-9095f4717ca4@brauner>
 <20250625-akrobatisch-libellen-352997eb08ef@brauner>
From: David Matlack
Date: Thu, 17 Jul 2025 09:17:17 -0700
Subject: Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
To: Pratyush Yadav
Cc: Christian Brauner, Pasha Tatashin, jasonmiu@google.com,
 graf@amazon.com, changyuanl@google.com, rppt@kernel.org,
 rientjes@google.com, corbet@lwn.net, rdunlap@infradead.org,
 ilpo.jarvinen@linux.intel.com, kanie@linux.alibaba.com, ojeda@kernel.org,
 aliceryhl@google.com, masahiroy@kernel.org, akpm@linux-foundation.org,
 tj@kernel.org, yoann.congal@smile.fr, mmaurer@google.com,
 roman.gushchin@linux.dev, chenridong@huawei.com, axboe@kernel.dk,
 mark.rutland@arm.com, jannh@google.com, vincent.guittot@linaro.org,
 hannes@cmpxchg.org, dan.j.williams@intel.com, david@redhat.com,
 joel.granados@kernel.org, rostedt@goodmis.org, anna.schumaker@oracle.com,
 song@kernel.org, zhangguopeng@kylinos.cn, linux@weissschuh.net,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-mm@kvack.org, gregkh@linuxfoundation.org, tglx@linutronix.de,
 mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
 x86@kernel.org, hpa@zytor.com, rafael@kernel.org, dakr@kernel.org,
 bartosz.golaszewski@linaro.org, cw00.choi@samsung.com,
 myungjoo.ham@samsung.com, yesanishhere@gmail.com,
 Jonathan.Cameron@huawei.com, quic_zijuhu@quicinc.com,
 aleksander.lobakin@intel.com, ira.weiny@intel.com,
 andriy.shevchenko@linux.intel.com, leon@kernel.org, lukas@wunner.de,
 bhelgaas@google.com, wagi@kernel.org, djeffery@redhat.com,
 stuart.w.hayes@gmail.com
Content-Type: text/plain; charset="UTF-8"

On Mon, Jul 14, 2025 at 7:56 AM Pratyush Yadav wrote:
> On Thu, Jun 26 2025, David Matlack wrote:
> > On Thu, Jun 26, 2025 at 8:42 AM Pratyush Yadav wrote:
> >> On Wed, Jun 25 2025, David Matlack wrote:
> >> > On Wed, Jun 25, 2025 at 2:36 AM Christian Brauner wrote:
> >> >> >
> >> >> > While I agree that a filesystem offers superior introspection and
> >> >> > integration with standard tools, building this complex, stateful
> >> >> > orchestration logic on top of VFS seemed to be forcing a square peg
> >> >> > into a round hole. The ioctl interface, while more opaque, provides a
> >> >> > direct and explicit way to command the state machine and manage these
> >> >> > complex lifecycle and dependency rules.
> >> >>
> >> >> I'm not going to argue that you have to switch to this kexecfs idea
> >> >> but...
> >> >>
> >> >> You're using a character device that's tied to devtmpfs. In other words,
> >> >> you're already using a filesystem interface. Literally the whole code
> >> >> here is built on top of filesystem APIs. So this argument is just very
> >> >> wrong imho. If you can build it on top of a character device using VFS
> >> >> interfaces you can do it as a minimal filesystem.
> >> >>
> >> >> You're free to define the filesystem interface any way you like it. We
> >> >> have a ton of examples there. All your ioctls would just be tied to the
> >> >> filesystem instance instead of the /dev/somethingsomething character
> >> >> device. The state machine could just be implemented the same way.
> >> >>
> >> >> One of my points is that with an fs interface you can have easy state
> >> >> serialization on a per-service level. IOW, you have a bunch of virtual
> >> >> machines running as services or some networking services or whatever.
> >> >> You could just bind-mount an instance of kexecfs into the service and
> >> >> the service can persist state into the instance and easily recover it
> >> >> after kexec.
> >> >
> >> > This approach sounds worth exploring more. It would avoid the need for
> >> > a centralized daemon to mediate the preservation and restoration of
> >> > all file descriptors.
> >>
> >> One of the jobs of the centralized daemon is to decide the _policy_ of
> >> who gets to preserve things and more importantly, make sure the right
> >> party unpreserves the right FDs after a kexec. I don't see how this
> >> interface fixes this problem. You would still need a way to identify
> >> which kexecfs instance belongs to who and enforce that. The kernel
> >> probably shouldn't be the one doing this kind of policy so you still
> >> need some userspace component to make those decisions.
> >
> > The main benefit I see of kexecfs is that it avoids needing to send
> > all FDs over UDS to/from liveupdated and therefore the need for
> > dynamic cross-process communication (e.g. RPCs).
> >
> > Instead, something just needs to set up a kexecfs for each VM when it
> > is created, and give the same kexecfs back to each VM after kexec.
> > Then VMs are free to save/restore any FDs in that kexecfs without
> > cross-process communication or transferring file descriptors.
>
> Isn't giving back the right kexecfs instance to the right VMM the main
> problem? After a kexec, you need a way to make that policy decision. You
> would need a userspace agent to do that.
>
> I think what you are suggesting does make a lot of sense -- the agent
> should be handing out sessions instead of FDs, which would make FD
> save/restore simpler for applications. But that can be done using the
> ioctl interface as well. Each time you open() the /dev/liveupdate, you
> get a new session. Instead of file FDs like memfd or iommufd, we can
> have the agent hand out these session FDs and anything that was saved
> using this session would be ready for restoring.
>
> My main point is that this can be done with the current interface as
> well as kexecfs. I think there is very much a reason for considering
> kexecfs (like not being dependent on devtmpfs), but I don't think this
> is necessarily the main one.

The main problem I'd like solved is the requirement that all FDs be
preserved and restored in the context of a central daemon, since I
think this will inevitably cause problems for KVM. I agree with you
that this problem can also be solved in other ways, such as session
FDs (good idea!).
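
To make the session-FD idea concrete, here is a minimal Python sketch,
assuming the agent hands each VM a single session FD over a Unix domain
socket (SCM_RIGHTS). A temporary file stands in for the FD that
open("/dev/liveupdate") would return, and hand_out_fd(), receive_fd(),
demo_handoff() and the "vm-a" label are illustrative names only, not
part of the LUO series. Requires Python 3.9+ on a Unix system.

```python
import os
import socket
import tempfile


def hand_out_fd(agent_sock, fd, label):
    """Agent side: send one open FD plus a short label over a Unix socket."""
    socket.send_fds(agent_sock, [label.encode()], [fd])


def receive_fd(client_sock):
    """Client side: receive the label and a new descriptor for the same file."""
    msg, fds, _flags, _addr = socket.recv_fds(client_sock, 1024, 1)
    return msg.decode(), fds[0]


def demo_handoff():
    agent_sock, vm_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    # Stand-in for a session FD. A real agent would obtain it from
    # open("/dev/liveupdate"); that device is an assumption here.
    with tempfile.TemporaryFile() as session:
        session.write(b"state saved under session for vm-a")
        session.flush()
        hand_out_fd(agent_sock, session.fileno(), "vm-a")
        label, fd = receive_fd(vm_sock)
    # The agent's copy is closed now; the received descriptor stays valid.
    os.lseek(fd, 0, os.SEEK_SET)
    data = os.read(fd, 64)
    os.close(fd)
    agent_sock.close()
    vm_sock.close()
    return label, data


if __name__ == "__main__":
    print(demo_handoff())
```

With this shape, only one FD crosses the socket per VM, and everything
saved under that session travels with it.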

>
> >
> > Policy can be enforced by controlling access to kexecfs mounts. This
> > naturally fits into the standard architecture of running untrusted VMs
> > (e.g. using chroots and containers to enforce security and isolation).
>
> How? After a kexec, how do you tell which process can get which kexecfs
> mount/instance? If any of them can get any, then we lose all sorts of
> policy enforcement.

I was imagining that whatever process/daemon creates the kexecfs
instances before kexec is also responsible for reassociating them with
the right processes after kexec.

If you are asking how that association would be done mechanically, I
was imagining it would be through a combination of filesystem
permissions, mounts, and chroots. For example, the kexecfs instance for
VM A would be mounted in VM A's chroot. VM A would then only have
access to its own kexecfs instance.

> >> > I'm not sure that we can get rid of the machine-wide state machine
> >> > though, as there is some kernel state that will necessarily cross
> >> > these kexecfs domains (e.g. IOMMU driver state). So we still might
> >> > need /dev/liveupdate for that.
> >>
> >> Generally speaking, I think both VFS-based and IOCTL-based interfaces
> >> are more or less equally expressive/powerful. Most of the ioctl
> >> operations can be translated to a VFS operation and vice versa.
> >>
> >> For example, the fsopen() call is similar to open("/dev/liveupdate") --
> >> both would create a live update session which auto closes when the FD is
> >> closed or FS unmounted. Similarly, each ioctl can be replaced with a
> >> file in the FS. For example, LIVEUPDATE_IOCTL_FD_PRESERVE can be
> >> replaced with a fd_preserve file where you write() the FD number.
> >> LIVEUPDATE_IOCTL_GET_STATE or LIVEUPDATE_IOCTL_PREPARE, etc. can be
> >> replaced by a "state" file where you can read() or write() the state.
> >>
> >> I think the main benefit of the VFS-based interface is ease of use.
> >> There already exist a bunch of utilities and libraries that we can use to
> >> interact with files. When we have ioctls, we would need to write
> >> everything ourselves. For example, instead of
> >> LIVEUPDATE_IOCTL_GET_STATE, you can do "cat state", which is a bit
> >> easier to do.
> >>
> >> As for downsides, I think we might end up with a bit more boilerplate
> >> code, but beyond that I am not sure.
> >
> > I agree we can more or less get to the same end state with either
> > approach. And also, I don't think we have to do one or the other. I
> > think kexecfs is something that we can build on top of this series.
> > For example, kexecfs would be a new kernel subsystem that registers
> > with LUO.
>
> Yeah, fair point. Though I'd rather we agree on one and go with that.
> Having two interfaces for the same thing isn't the best.

Agreed, it would be better to have a single way to preserve FDs rather
than two (LUO ioctl and kexecfs).
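
The ioctl-to-file mapping discussed in the quoted text can be sketched
as a toy model. This is plain userspace Python, not kernel code: the
Session class, the helper functions, and the "normal"/"prepared" state
names are assumptions made for illustration; only the ioctl names and
the fd_preserve and "state" file names come from the thread.

```python
class Session:
    """Toy in-memory model of one live update session."""

    def __init__(self):
        self.state = "normal"       # hypothetical state name
        self.preserved = set()

    def preserve_fd(self, fd):
        self.preserved.add(int(fd))

    def prepare(self):
        self.state = "prepared"     # hypothetical state name


def ioctl_style(session, cmd, arg=None):
    """ioctl-like front end: one command name per operation."""
    if cmd == "LIVEUPDATE_IOCTL_FD_PRESERVE":
        session.preserve_fd(arg)
        return None
    if cmd == "LIVEUPDATE_IOCTL_PREPARE":
        session.prepare()
        return None
    if cmd == "LIVEUPDATE_IOCTL_GET_STATE":
        return session.state
    raise ValueError(f"unknown command: {cmd}")


def file_write(session, name, data):
    """File-like front end: write() to a named file triggers the operation."""
    if name == "fd_preserve":
        session.preserve_fd(data.strip())
    elif name == "state":
        if data.strip() != "prepared":
            raise ValueError("unsupported state transition in this sketch")
        session.prepare()
    else:
        raise FileNotFoundError(name)


def file_read(session, name):
    """File-like front end: read() of "state" reports the current state."""
    if name == "state":
        return session.state + "\n"
    raise FileNotFoundError(name)


def drive_both():
    a, b = Session(), Session()
    # Same two operations, once through each front end.
    ioctl_style(a, "LIVEUPDATE_IOCTL_FD_PRESERVE", 7)
    ioctl_style(a, "LIVEUPDATE_IOCTL_PREPARE")
    file_write(b, "fd_preserve", "7\n")
    file_write(b, "state", "prepared\n")
    return a, b
```

Both front ends dispatch to the same Session methods, which is the sense
in which the two interfaces are equally expressive; the difference is
that the file form can be driven with ordinary tools.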