From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 9BD5FFD45F1 for ; Thu, 26 Feb 2026 00:28:44 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender:List-Subscribe:List-Help :List-Post:List-Archive:List-Unsubscribe:List-Id:In-Reply-To:Content-Type: MIME-Version:References:Message-ID:Subject:Cc:To:From:Date:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=yFgIHLNWG5nn2bKXAGosicvw/0Yu+qAVnLqqJ5fJf90=; b=xlzcOC2ZrIpUkACTDl3CgVZXay AmqE2PYaSOo2uqyAxKAdtsOsqZhE/k1b4HybrdgDHCzagd6splJuRmWgd43SrgUHgYJr7tT8xEK9F Gg4YcZy70n9U2xiopYfeRQl55hvIZKLplsVNelfEvy8l1mvfcf/C+L4WUKiJW2mxM1JTLoVT1zksM szQTCyfwwTziJzo6ANlyAJb9CNKEej3c7sDy0Ky2j8zyiv4NosJmox2CgpSysUE+YtidIrQq02EDC wrjhgF6OGVdJbp+mQHuXpEYF0gFq8tndb4KIRq2RJdvEfxPlAJkD/O79bITZ1X0Lh0Kka5xcUUgvs FJX2bYMA==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.98.2 #2 (Red Hat Linux)) id 1vvPFA-0000000596s-1Osf; Thu, 26 Feb 2026 00:28:40 +0000 Received: from mail-pf1-x444.google.com ([2607:f8b0:4864:20::444]) by bombadil.infradead.org with esmtps (Exim 4.98.2 #2 (Red Hat Linux)) id 1vvPF7-0000000596O-17yJ for kexec@lists.infradead.org; Thu, 26 Feb 2026 00:28:38 +0000 Received: by mail-pf1-x444.google.com with SMTP id d2e1a72fcca58-824ac5d28f9so378451b3a.0 for ; Wed, 25 Feb 2026 16:28:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1772065716; x=1772670516; darn=lists.infradead.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=yFgIHLNWG5nn2bKXAGosicvw/0Yu+qAVnLqqJ5fJf90=; b=YSoIfi4x7jUqnaRY2vaJXOOPJQ4MvPX6PJ/jeVug7O+Fv7M3vN7r8jIfVV8+JqceU9 +YOd1P37OOgNpEfzKBRBLg+WDoyEmUsZbPh6L5mM9an7G7eufyvNTXNxfBDMzxTbJJ8f wscWT5Wiky2xbd8BS5sqrVXPOe6vKg3y3ihZ9sCiAfUSQUwhHpJEZyQpzsrVaqGpgePx Tl6RMGldH3ir+y0xss0RC5FAbR4hTkmlYpqlKxCtqFRuffZlOrmDgm7+FcUPP+VilQVE gDLjGKNsUAuT+uf8ma/qvK/eFlljoR3B3aol6PPxFcAf7lEoKvOnvbg/8Vv31nUa3v08 N/KQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1772065716; x=1772670516; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-gg:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=yFgIHLNWG5nn2bKXAGosicvw/0Yu+qAVnLqqJ5fJf90=; b=SoSuIe2EVhixv/dQruYxBj1NT0eipb7Tmjjd+PgO/1Vd/8S1mlkstx7sgIURPqcnes bOGrfFDN61HTYmx3XTd5MC00lhYqhR+BltemuDpcujbsbupEmnDaqtpWrVXw9JziI0+1 W9dGQLoQNXxkgVVHp7AMzNtHZic43hfSA9bOOx08SAcWrp/2VSqwfE0Y95hpfykJ7qV3 qb3Pj9U4gTGYyiEBQIK30k2Vj94cXX1kSNPHwRDgYi8EeQpEmTi7dyH6cJISCGMQQEY+ jaSW6lOIPhcNnP3dlihW9TmEOWNrtwgvNi6dnegdwJ9Th75gVHd1zxgM+e4e1w/eCABq FcOA== X-Forwarded-Encrypted: i=1; AJvYcCUqRKDR6dZJCLHfZG7RgTkcL7H0bZMDGAz5SNXa3r9ypQMl7FqgPcsTQh88VBoXdzbo12FNWw==@lists.infradead.org X-Gm-Message-State: AOJu0YxniF3sW0wLSi8v3mdUyCjFfRChzVuZW+CKEeF4ephoSFr1Aoz2 XNcZgLd3gLirI99ekaRuqXHETEjRsAwd4SDBC148lHnsk2R2M63NXQpDo2GgrmfLnw== X-Gm-Gg: ATEYQzwv3tpHOaqHuoZBBHeyDhSjhWl27tg8dH4l57dBtnqAFAy4AIi5td5RtRzW+Fr BcMs0ZuEelMuyzqUscTFXB0fFvUHIXt2PanMLutxA+tVriu0jMSsnMxVvJDVe5FQIj5Y28hKXIw juh/GBw1uQpn8Cj/Ne76tL+5m0Ww+RIjusD6QCGrVVTYinrr8SpZ+prFuGkXBXD0dfq8IyKZFUW 9a+wZKgH890HCCyo9aZDiRppkY+9geKUL4X0F1UsS/8O1334Vgi5R3gXL4+w/GiZs6oZjXYDwyp rKBFVgXuEcXZu9MkJ7r81i0NcS7X/UpEw25/0gkysyalVaXG3WXYTSRm14pvBqqZ+VE7vFb75Gd ggi3YpWt/6S+SB5WhPl1edxqR/M8H9vLrY66dkPVaSozCt8qjj1kwtA3AGGxWF9wwpYmLAYZ84l FhgzVBOL/ZjwMlNPUrlek5obUT1oKOwbkXJmaGZP/Ta6z0TMvBPXv76WQp5nPo9g== X-Received: by 2002:a05:6a00:1d8e:b0:823:1cf4:bf38 with SMTP id d2e1a72fcca58-8273366fd35mr1707567b3a.12.1772065715499; Wed, 25 Feb 2026 16:28:35 -0800 (PST) Received: from google.com (239.23.105.34.bc.googleusercontent.com. [34.105.23.239]) by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-8273a0299d9sm467054b3a.51.2026.02.25.16.28.31 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 25 Feb 2026 16:28:32 -0800 (PST) Date: Thu, 26 Feb 2026 00:28:28 +0000 From: David Matlack To: Bjorn Helgaas Cc: Alex Williamson , Adithya Jayachandran , Alexander Graf , Alex Mastro , Alistair Popple , Andrew Morton , Ankit Agrawal , Bjorn Helgaas , Chris Li , David Rientjes , Jacob Pan , Jason Gunthorpe , Jason Gunthorpe , Jonathan Corbet , Josh Hilke , Kevin Tian , kexec@lists.infradead.org, kvm@vger.kernel.org, Leon Romanovsky , Leon Romanovsky , linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org, linux-pci@vger.kernel.org, Lukas Wunner , =?utf-8?Q?Micha=C5=82?= Winiarski , Mike Rapoport , Parav Pandit , Pasha Tatashin , Pranjal Shrivastava , Pratyush Yadav , Raghavendra Rao Ananta , Rodrigo Vivi , Saeed Mahameed , Samiullah Khawaja , Shuah Khan , Thomas =?iso-8859-1?Q?Hellstr=F6m?= , Tomita Moeko , Vipin Sharma , Vivek Kasireddy , William Tu , Yi Liu , Zhu Yanjun Subject: Re: [PATCH v2 02/22] PCI: Add API to track PCI devices preserved across Live Update Message-ID: References: <20260129212510.967611-3-dmatlack@google.com> <20260225224651.GA3711085@bhelgaas> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20260225224651.GA3711085@bhelgaas> X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20260225_162837_370736_76FDD464 X-CRM114-Status: GOOD ( 70.86 ) X-BeenThere: kexec@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "kexec" Errors-To: kexec-bounces+kexec=archiver.kernel.org@lists.infradead.org On 2026-02-25 04:46 PM, Bjorn Helgaas wrote: > On Thu, Jan 29, 2026 at 09:24:49PM +0000, David Matlack wrote: > > Add an API to enable the PCI subsystem to track all devices that are > > preserved across a Live Update, including both incoming devices (passed > > from the previous kernel) and outgoing devices (passed to the next > > kernel). > > I'd probably describe this in the "outgoing ... incoming" order since > that's the order of operation. Will do. > > Use PCI segment number and BDF to keep track of devices across Live > > Update. This means the kernel must keep both identifiers constant across > > a Live Update for any preserved device. VFs are not supported for now, > > since that requires preserving SR-IOV state on the device to ensure the > > same number of VFs appear after kexec and with the same BDFs. > > > > Drivers that preserve devices across Live Update can now register their > > struct liveupdate_file_handler with the PCI subsystem so that the PCI > > subsystem can allocate and manage File-Lifecycle-Bound (FLB) global data > > to track the list of incoming and outgoing preserved devices. > > In what sense is this "global"? I assume it doesn't mean global > scope; does it mean something that persists across the kexec? Global in the sense that there is only one struct pci_ser in the system. There is no global variable for it though. It is created pci_flb_preserve() when the first device is preserved (and destroyed after the last device is unpreserved/cancelled). So it's lifecycle is bound to the number of devices. It is maintained by LUO (which does not know about devices, it just knows about files being preserved). For the PCI FLB, each file happens to correspond to a device (these are vfio-pci device files), but LUO doesn't need to know or care. > > > pci_liveupdate_register_fh(driver_fh) > > pci_liveupdate_unregister_fh(driver_fh) > > > > Drivers can notify the PCI subsystem whenever a device is preserved and > > unpreserved with the following APIs: > > > > pci_liveupdate_outgoing_preserve(pci_dev) > > pci_liveupdate_outgoing_unpreserve(pci_dev) > > IIUC this is basically asking the PCI core to preserve the generic PCI > parts of a device, i.e., the PCI core and the driver collaborate to > preserve it? > > I know what "preserve" means: "please maintain the original state." > The sort of made-up "unpreserve" sounds nicely symmetric, but I don't > think it means "destroy the original state." It feels like more of a > "cancel" of the original request to preserve something. Maybe not > worth any change, but I suspect this operation is going to be part of > the user interface for the administrator, and the description of > "unpreserve" will probably include something like "cancel." Yeah it is a cancel of the original request. I chose the name to match the callback op that this function implements (liveupdate_flb_ops.unpreserve). > > After a Live Update, the PCI subsystem fetches its FLB global data > > from the previous kernel from the Live Update Orchestrator (LUO) during > > device initialization to determine which devices were preserved. > > > > Drivers can check if a device was preserved before userspace retrieves > > the file for it via pci_dev->liveupdate_incoming. > > I see how drivers need to know whether their device has been preserved > and needs to be adopted, but what does userspace have to do with this? > > I don't know what the file is, but it doesn't sound related to the PCI > core, so maybe it should be mentioned elsewhere? > > I assume the PCI core preserves a device based only on the indication > from the previous kernel. At the Live Update Orchestrator level the PCI device preservation is driven by userspace preserving files via ioctls. So that framing leaked into this commit message. I will rephrase this to be more in the context of devices that the PCI code actually cares about. > > +/* > > + * Copyright (c) 2025, Google LLC. > > 2026 Will do. > > +static int pci_flb_preserve(struct liveupdate_flb_op_args *args) > > +{ > > + struct pci_dev *dev = NULL; > > + int max_nr_devices = 0; > > + struct pci_ser *ser; > > + unsigned long size; > > + > > + for_each_pci_dev(dev) > > + max_nr_devices++; > > How is this protected against hotplug? Pranjal raised this as well. Here was my reply: . Yes, it's possible to run out space to preserve devices if devices are . hot-plugged and then preserved. But I think it's better to defer . handling such a use-case exists (unless you see an obvious simple . solution). So far I am not seeing preserving hot-plugged devices . across Live Update as a high priority use-case to support. I am going to add a comment here in the next revision to clarify that. I will also add a comment clarifying why this code doesn't bother to account for VFs created after this call (preserving VFs are explicitly disallowed to be preserved in this patch since they require additional support). > > > + size = struct_size_t(struct pci_ser, devices, max_nr_devices); > > + > > + ser = kho_alloc_preserve(size); > > + if (IS_ERR(ser)) > > + return PTR_ERR(ser); > > + > > + ser->max_nr_devices = max_nr_devices; > > + > > + args->obj = ser; > > + args->data = virt_to_phys(ser); > > + return 0; > > +} > > + > > +static void pci_flb_unpreserve(struct liveupdate_flb_op_args *args) > > +{ > > + struct pci_ser *ser = args->obj; > > + > > + WARN_ON_ONCE(ser->nr_devices); > > What is this warning telling the user? Is there something wrong? > What would I do to fix things if I saw this, and how would I know what > to do? > > Is this telling us that we're undoing a "preserve" operation, which is > supposed to involve "unpreserve" on each device that has previously > been marked for preservation, but something (what?) forgot to > unpreserve one of the devices? > > Seems like maybe a debugging aid, but probably not something an > indication that the administrator did something wrong? This WARN_ON (and the others you flagged) are all "this should never happen" scenarios. For example if this triggers it means a bug somewhere in the kernel (likely in LUO), or some sort of corruption. Would you prefer I omit such WARNs or should I just a better accompanying log message or comment? > > > + kho_unpreserve_free(ser); > > +} > > + > > +static int pci_flb_retrieve(struct liveupdate_flb_op_args *args) > > +{ > > + args->obj = phys_to_virt(args->data); > > + return 0; > > +} > > + > > +static void pci_flb_finish(struct liveupdate_flb_op_args *args) > > +{ > > + kho_restore_free(args->obj); > > +} > > + > > +static struct liveupdate_flb_ops pci_liveupdate_flb_ops = { > > + .preserve = pci_flb_preserve, > > + .unpreserve = pci_flb_unpreserve, > > + .retrieve = pci_flb_retrieve, > > + .finish = pci_flb_finish, > > + .owner = THIS_MODULE, > > +}; > > + > > +static struct liveupdate_flb pci_liveupdate_flb = { > > + .ops = &pci_liveupdate_flb_ops, > > + .compatible = PCI_LUO_FLB_COMPATIBLE, > > I don't see anything in this series that checks this. Maybe omit it > until it's used? LUO checks this when it receives preserved data from the previous kernel and needs to hand it back to the users of that data. This prevents passing binary data between 2 kernels that don't speak the same ABI. > > +int pci_liveupdate_outgoing_preserve(struct pci_dev *dev) > > +{ > > + struct pci_dev_ser new = INIT_PCI_DEV_SER(dev); > > + struct pci_ser *ser; > > + int i, ret; > > + > > + /* Preserving VFs is not supported yet. */ > > + if (dev->is_virtfn) > > + return -EINVAL; > > + > > + guard(mutex)(&pci_flb_outgoing_lock); > > + > > + if (dev->liveupdate_outgoing) > > Is there a real need to keep "dev->liveupdate_outgoing", as opposed to > just searching ser->devices[]? It looks like the only use is to keep > from adding a device to ser->devices[] twice. Yes that's the only use-case in this series. I think there will be more usecases in the future for checking if a device is outgoing (for example in pci_device_shutdown() to avoid pci_clear_master()). I can defer adding this bit for now if you prefer. > > > + return -EBUSY; > > + > > + ret = liveupdate_flb_get_outgoing(&pci_liveupdate_flb, (void **)&ser); > > I guess this "get" must correspond with the kho_alloc_preserve() in > pci_flb_preserve()? It's sort of annoying that there's nothing in the > function names to connect them because it's hard to find the source. Yeah, the LUO FLB flow took me a little while to wrap my head around. liveupdate_flb_get_outgoing() actually just returns the data pci_flb_preserve() already allocated. It does not trigger pci_flb_preserve(). pci_flb_preserve() is triggered when the first file is preserved, for files that have registered their file handler with PCI (pci_liveupdate_register_fh()). So what happens in practice is: 0. VFIO calls pci_liveupdate_register_fh(). This is what connects VFIO files with pci_liveupdate_flb. 1. Userpace preserves a vfio-pci file using LIVEUPDATE_SESSION_PRESERVE_FD. a. If this is the first file to be preserved associated with pci_liveupdate_flb, then LUO will call pci_flb_preserve(). This allocates struct pci_ser. b. LUO calls VFIO preserve callback to preserve the VFIO device file. c. VFIO's preserve callback calls pci_liveupdate_outgoing_preserve() to notify PCI of which PCI device is getting preserved. c. pci_liveupdate_outgoing_preserve() calls liveupdate_flb_get_outgoing() to get the data allocated in (a). Note that this data may have been allocated a long time ago since (a) only happens when the first device is preserved. (And note I am saying "first" but I really mean when the count goes from 0 to 1.) It's kind of convoluted but it has some properties that I think make it worth it: - It works for any kind of file (not just devices). - It allows other drivers to participate in PCI preservation by registering thier own file handler via pci_liveupdate_register_fh() (so it's not VFIO-specific). - The lifecycle is managed by LUO (create on first file preserve, destroy on last file unpreserve/cancel). So it avoids PCI subsystem having to maintain any global variables for struct pci_ser. > > +void pci_liveupdate_outgoing_unpreserve(struct pci_dev *dev) > > +{ > > + struct pci_ser *ser; > > + int ret; > > + > > + guard(mutex)(&pci_flb_outgoing_lock); > > + > > + ret = liveupdate_flb_get_outgoing(&pci_liveupdate_flb, (void **)&ser); > > + if (WARN_ON_ONCE(ret)) > > I'm a little dubious about all the WARN_ON_ONCE() calls here. They > seem like debugging aids that we would want to remove eventually, so > maybe they should just be out-of-tree to begin with. This is another "should never happen" guard. > > + return; > > + > > + WARN_ON_ONCE(pci_ser_delete(ser, dev)); > > I don't like putting code with side effects inside WARN_ON() because > the important code gets hidden. Ack, I'll separate out the WARN. > > + dev->liveupdate_outgoing = false; > > +} > > +EXPORT_SYMBOL_GPL(pci_liveupdate_outgoing_unpreserve); > > + > > +u32 pci_liveupdate_incoming_nr_devices(void) > > +{ > > + struct pci_ser *ser; > > + int ret; > > + > > + ret = liveupdate_flb_get_incoming(&pci_liveupdate_flb, (void **)&ser); > > + if (ret) > > + return 0; > > + > > + return ser->nr_devices; > > +} > > +EXPORT_SYMBOL_GPL(pci_liveupdate_incoming_nr_devices); > > Currently only called from drivers/pci/probe.c; does this really need > to be declared in include/linux/pci.h and exported? We can do that > later if/when needed. Put in drivers/pci/pci.h if only needed in > drivers/pci/. Will do. > > +void pci_liveupdate_setup_device(struct pci_dev *dev) > > +{ > > + struct pci_ser *ser; > > + int ret; > > + > > + ret = liveupdate_flb_get_incoming(&pci_liveupdate_flb, (void **)&ser); > > + if (ret) > > + return; > > + > > + dev->liveupdate_incoming = !!pci_ser_find(ser, dev); > > I think this would be easier to read as: > > if (pci_ser_find(ser, dev)) > dev->liveupdate_incoming = true; Will do. > > > +} > > +EXPORT_SYMBOL_GPL(pci_liveupdate_setup_device); > > Currently only called from pci_setup_device(); does this really need > to be declared in include/linux/pci.h and exported? We can do that > later if/when needed. Ack, I'll take this out of include/. > > +void pci_liveupdate_incoming_finish(struct pci_dev *dev) > > +{ > > + dev->liveupdate_incoming = false; > > What is this useful for? Does the PCI core need to *do* anything > after a driver finishes its own adoption of a preserved device? I > assume everything the PCI core does, i.e., rebuilding its pci_dev and > related things based on KHO data, must be done before the driver sees > the device. VFIO needs to know this before userspace retrieves the file associated the preserved device from LUO. See patch 08: https://lore.kernel.org/kvm/20260129212510.967611-9-dmatlack@google.com/ > > diff --git a/include/linux/kho/abi/pci.h b/include/linux/kho/abi/pci.h > > new file mode 100644 > > index 000000000000..6577767f8da6 > > --- /dev/null > > +++ b/include/linux/kho/abi/pci.h > > Seems unusual to make an abi/ subdirectory when that's the only thing > in kho/, but I assume you have plans for more. > > There are several *-abi.h files in include/uapi/, but this abi/ > directory is the ABI-ish name in include/linux (except for the > include/soc/tegra/bpmp-abi.h oddity). This directory hosts all the ABI structures that are passed from kernel-to-kernel across a Live Update. $ ls include/linux/kho/abi kexec_handover.h luo.h memblock.h memfd.h pci.h vfio_pci.h I agree the directory structure doesn't need to be quite so deep though. > > +/** > > + * struct pci_dev_ser - Serialized state about a single PCI device. > > + * > > + * @domain: The device's PCI domain number (segment). > > + * @bdf: The device's PCI bus, device, and function number. > > + */ > > +struct pci_dev_ser { > > + u16 domain; > > ACPI _SEG is limited to 16 bits, but I think all the Linux interfaces > use "int" and VMD creates domains starting with 0x10000 to avoid > colliding with ACPI segment numbers. I think we should probably use > u32 here (and for the Linux interfaces, but that's another patch). Will do. Thank you for the review!