From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id F0489C3ABAA for ; Mon, 5 May 2025 14:39:11 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender:List-Subscribe:List-Help :List-Post:List-Archive:List-Unsubscribe:List-Id:Content-Transfer-Encoding: Content-Type:In-Reply-To:From:References:Cc:To:Subject:MIME-Version:Date: Message-ID:Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From: Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=1NHcDk4r4MC4EBq9o/mXd6iYMNAbw2nwb2cXbhASKss=; b=28LLtgeECwb9Xmmx+RUtt0PaK1 t+yKffQ5605yBJEPwuIjID3B7LDOVZO0VevLdMrhMRUbr5JrGid5qAYNuCqzLSDq47NicJmN4YrD7 FacaUtN7wjC7f8eoCwWGIevyf2/xxEh74bPp3vg6smidkDhDtXwKETQ3TzsIfRcJ1q8voDI2N2ujw N9WWX9UwRRZq5E81FMnwa2+ZID2tysz64FKJbwdB+feWdT+dLePzHiQiwML1vAOH43SHlUmAuyZjJ KcwHdRtuk/GV8yNdX4iouoz5RUTHnuxj4w44y3H0HnwJIohvnRUcNRaMWMe9Rs7s0KSLU+5GHJiCL m/aajtkQ==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.98.2 #2 (Red Hat Linux)) id 1uBwyM-00000007epg-26Tc; Mon, 05 May 2025 14:39:10 +0000 Received: from mail-lj1-x234.google.com ([2a00:1450:4864:20::234]) by bombadil.infradead.org with esmtps (Exim 4.98.2 #2 (Red Hat Linux)) id 1uBwwT-00000007eZf-2bY1 for kexec@lists.infradead.org; Mon, 05 May 2025 14:37:14 +0000 Received: by mail-lj1-x234.google.com with SMTP id 38308e7fff4ca-30d8cb711e2so3421701fa.0 for ; Mon, 05 May 2025 07:37:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1746455831; x=1747060631; darn=lists.infradead.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=1NHcDk4r4MC4EBq9o/mXd6iYMNAbw2nwb2cXbhASKss=; b=eF+zH76AxIiqZlCXuR3dSsXpafDgGTGxwWdJg3iU5McHGkFoSm6/Ac4P/erUrUoEjn xHEIHbM3+rt39JMfDS2Tp4TuqBpmlbOA2wztBtZtz+fyOGMR5F7t7EkaqSTDpQxboi/o ZRgxz0lxkfbQ2f421MyhKfOz/sOpWh3YcmqIJEmbP7ZfOZc929kGs4NMQS/g6URICqH6 MIVNRzfYWb+xyH7VJXh5sjM4A8IDpFMcK/zjcbfgGVvDB7OcVbC7eYe86/3jgzLSq/NZ N/zeVFrPTUmby+8IRCvaXmPhgFd6N8G9IJIYpVfrM42H6bxpH2pJ4xGor9yVfGk9Beot N10g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1746455831; x=1747060631; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=1NHcDk4r4MC4EBq9o/mXd6iYMNAbw2nwb2cXbhASKss=; b=KLBY0ZgocTVnq2l82oCNKJg+oxZXbvJcjU8cWZC9VOUWNd9meDOQPtf0sRcOLSjGYi MXejuq+q7oTXRzLRaUxO5m7B4DhdFO1RsmhblZ4ksVR/SULcvE+gxqan01ZGYWK5kgMN htSjOXzBY70uus79TYG6nlV8E2jGuLDvW5V7omd96GcfXWLuyh7TbvA5iIbM1bRslgcQ gDwcZNtGp7aiOff5iljiDZLS2aeHuP7KWyn7hQUixN1zmo3kTclBKdgDMvUDz44G6H/6 DisZreqp/tywqNBhTjIMv/P1l5faC8nucG7BYFTKDMZbfHlzEL469DlJF/LIi/kaw7xn yNHA== X-Forwarded-Encrypted: i=1; AJvYcCWTjrR1uuag0OkCvpTJqYKRuadqAO63YMTPshFI7+m6crFS9AjoVtvwBo4+d4AeTjsF3stdNA==@lists.infradead.org X-Gm-Message-State: AOJu0Yy8Kj485c0ItADfbOfHOtNfU0q7xyD4dbN5AOnpRnDWS6Kozc+r TaaclsfqBq5v0p7PTCjMgKdkbao/D0U8TO+0nALq9k7xk9Ax9YWn X-Gm-Gg: ASbGncuBOGXoyDJsxebBVC9Ng+lH9+vSCGInfSbrQcjo0QCdU+gUKG/gDljfmhEThBF CUqL9B/aod+L2suPbBFY+NXZMqrFJF426DnnXWaqvDekAjpzcluJrv6HeDhfiOBBYLw/yLoLakl EScGiTMO9TiGK0M8WiPcnLY6lBZX7AJ0tYCplJ2K88MmEBzfNCyt62iRPPYSNtmLssC2YD45jQH fxVswH1ji2oN0M7WfaBT0UxkiOYZAzan8l93odItOyK7JEYGJH+3QLH2M9SA8iPnV468m82m8Ay /ILOGzB1GwdH6oDQ/Jrx9JcRlQEB8vn3kJwSmUURFJDsCVH0gfiE05qgfw== X-Google-Smtp-Source: AGHT+IGEtTrk5xxPxYLHFQLp2mHEJ3luWELdgPGmJBEEinUSHAR9t94mHQ/40kEYFTJUGu32Rm+2eQ== X-Received: by 2002:a2e:be8d:0:b0:30b:b852:2f5d with SMTP id 38308e7fff4ca-320c63b8d9amr12492721fa.11.1746455830940; Mon, 05 May 2025 07:37:10 -0700 (PDT) Received: from [10.214.35.248] ([80.93.240.68]) by smtp.gmail.com with ESMTPSA id 38308e7fff4ca-3202ad908b9sm16303971fa.108.2025.05.05.07.37.09 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 05 May 2025 07:37:10 -0700 (PDT) Message-ID: Date: Mon, 5 May 2025 16:35:57 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec To: Chris Li Cc: linux-kernel@vger.kernel.org, Alexander Graf , James Gowans , Mike Rapoport , Andrew Morton , linux-mm@kvack.org, Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H . Peter Anvin" , Eric Biederman , kexec@lists.infradead.org, Pratyush Yadav , Jason Gunthorpe , Pasha Tatashin , David Rientjes References: <20250310120318.2124-1-arbn@yandex-team.com> Content-Language: en-US From: Andrey Ryabinin In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20250505_073713_665470_2766A149 X-CRM114-Status: GOOD ( 69.82 ) X-BeenThere: kexec@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "kexec" Errors-To: kexec-bounces+kexec=archiver.kernel.org@lists.infradead.org On 4/29/25 1:01 AM, Chris Li wrote: > Hi Andrey, > > I am working on the PCI portion of the live update and looking at > using KSTATE as an alternative to the FDT. Here are some high level > feedbacks. > Hi, thanks a lot. > On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin wrote: >> >> Main changes from v1 [1]: >> - Get rid of abusing crashkernel and implent proper way to pass memory to new kernel >> - Lots of misc cleanups/refactorings. >> >> kstate (kernel state) is a mechanism to describe internal some part of the >> kernel state, save it into the memory and restore the state after kexec >> in the new kernel. >> >> The end goal here and the main use case for this is to be able to >> update host kernel under VMs with VFIO pass-through devices running >> on that host. Since we are pretty far from that end goal yet, this >> only establishes some basic infrastructure to describe and migrate complex >> in-kernel states. >> >> The idea behind KSTATE resembles QEMU's migration framework [1], which >> solves quite similar problem - migrate state of VM/emulated devices >> across different versions of QEMU. >> >> This is an altenative to Kexec Hand Over (KHO [3]). >> >> So, why not KHO? >> > > KHO does more than just serializing/unserializing. It also has scratch > areas etc to allow safely performing early allocation without stepping > on the preserved memory. I see KSTATE as an alternative to libFDT as > ways of serializing the preserved memory. Not a replacement for KHO. > > With that, it would be great to see a KSTATE build on top of the > current version of KHO. The V6 version of KHO uses a recursive FDT > object. I see recursive FDT can map to the C struct description > similar to the KSTATE field description nicely. However, that will > require KSTATE to make some considerable changes to embrace the KHO > v6. For example, the KSTATE uses one contiguous stream buffer and KHO > V6 uses many recursive physical address object pointers for different > objects. Maybe a KSTATE V3? > Yep, I'll take a look into combinig KSTATE with KHO. .... > >> a_kho_restore() >> { >> ... >> a.i = fdt_getprop(fdt, offset, "i", &len); >> if (!a.i || len != sizeof(a.i)) >> goto err >> *a.p_ulong = fdt_getprop.... >> } >> >> Each driver/subsystem has to solve this problem in their own way. >> Also if we use fdt properties for individual fields, that might be wastefull >> in terms of used memory, as these properties use strings as keys. > > Right, I need to write a lot of boilerplate code to do the per > property save/restore. I am not worried too much about memory usage. A > lot of string keys are not much longer than 8 bytes. The memory saving > convert to binary index is not huge. I actually would suggest adding > the string version of the field name to the description table, so that > we can dump the state in KSTATE just like the YAML FDT output for > debugging purposes. It is a very useful feature of FDT to dump the > current saving state into a human readable form. KSTATE can have the > same feature added. > kstate_field already have string with name of the field: #define KSTATE_BASE_TYPE(_f, _state, _type) { \ .name = (__stringify(_f)), \ Currently it's not used in code, but it's there for debug purposes >> While with KSTATE solves the same problem in more elegant way, with this: >> struct kstate_description a_state = { >> .name = "a_struct", >> .version_id = 1, >> .id = KSTATE_TEST_ID, >> .state_list = LIST_HEAD_INIT(test_state.state_list), >> .fields = (const struct kstate_field[]) { >> KSTATE_BASE_TYPE(i, struct a, int), >> KSTATE_BASE_TYPE(s, struct a, char [10]), >> KSTATE_POINTER(p_ulong, struct a), >> KSTATE_PAGE(page, struct a), >> KSTATE_END_OF_LIST() >> }, >> }; >> >> >> { >> static unsigned long ulong >> static struct a a_data = { .p_ulong = &ulong }; >> >> kstate_register(&test_state, &a_data); >> } >> >> The driver needs only to have a proper 'kstate_description' and call kstate_register() >> to save/restore a_data. >> Basically 'struct kstate_description' provides instructions how to save/restore 'struct a'. >> And kstate_register() does all this save/restore stuff under the hood. > > It seems the KSTATE uses one contiguous stream and the object has to > be loaded in the order it was saved. For the PCI code, the PCI device > scanning and probing might cause the device load out of the order of > saving. (The PCI probing is actually the reverse order of saving). > This kstate_register() might pose restrictions on the restore order. > PCI will need to look up and find the device state based on the PCI > device ID. Other subsystems will likely have the requirement to look > up their own saved state as well. > I also see KSTATE can be extended to support that. > Absolutely agreed. I think we need to decouple restore and register, ie remove restore_misgrate_state() from kstate_register(). Add instance_id argument to kstate_register(), so the PCI code could do: kstate_register(&pci_state, pdev, PCI_DEVID(pdev->bus->number, pdev->devfn)); And on probing stage (probably in pci_device_add()) call kstate_restore(&pci_state, dev, PCI_DEVID(bus->number, dev->devfn)) which would locate state for the device if any and restore it. >> - Another bonus point - kstate can preserve migratable memory, which is required >> to preserve guest memory >> >> >> So now to the part how this works. >> >> State of kernel data (usually it's some struct) is described by the >> 'struct kstate_description' containing the array of individual >> fields descpriptions - 'struct kstate_field'. Each field >> has set of bits in ->flags which instructs how to save/restore >> a certain field of the struct. E.g.: >> - KS_BASE_TYPE flag tells that field can be just copied by value, >> >> - KS_POINTER means that the struct member is a pointer to the actual >> data, so it needs to be dereference before saving/restoring data >> to/from kstate data steam. >> >> - KS_STRUCT - contains another struct, field->ksd must point to >> another 'struct kstate_dscription' > > The field can't have both bits set for KS_BASE_TYPE and KS_STRUCT > type, right? Some of these flag combinations do not make sense. This > part might need more careful planning to keep it simple. Maybe some of > the flags bits should be enum. > Yes, this needs more thought. Mutually exclusive flags could be moved in separate enum field. Some may be not needed at all. e.g. instead of KS_STRUCT we could just check if (field->ksd != NULL) >> >> - KS_CUSTOM - Some non-trivial field that requires custom kstate_field->save() >> ->restore() callbacks to save/restore data. >> >> - KS_ARRAY_OF_POINTER - array of pointers, the size of array determined by the >> field->count() callback >> - KS_ADDRESS - field is a pointer to either vmemmap area (struct page) or >> linear address. Store offset > > I think we want to describe different stream types. > For example the most simple stream container is just a contiguous > buffer with start address and size. > The more complex one might be and size then an array of page pointers, > all those pointers add to up the new buffer which describe an saved > KSTATE that is larger than 4K and spread into an array of pages. Those > pages don't need to be contiguous. Such a page array buffer stores the > KSTATE entry described by a separate description table. > Agreed, I had similar thoughts. But this complicates code, so I started with something simple. >> >> - KS_END - special flag indicating the end of migration stream data. >> >> kstate_register() call accepts kstate_description along with an instance >> of an object and registers it in the global 'states' list. >> >> During kexec reboot phase we go through the list of 'kstate_description's >> and each instance of kstate_description forms the 'struct kstate_entry' >> which save into the kstate's data stream. >> >> The 'kstate_entry' contains information like ID of kstate_description, version >> of it, size of migration data and the data itself. The ->data is formed in >> accordance to the kstate_field's of the corresponding kstate_description. > > The version for the kstate_description might not be enough. The > version works if there is a linear history. Here we are likely to have > different vendors add their own extension to the device state saving. I think vendors can just declare separate kstate_description with different ID. The only problem with this, is that kstate_description.id is integer, so it would be a problem to allocate those without conflicts. Perhaps change the id to string? So vendors can just add vendor prefix to ID. > I suggest instead we save the old kernel's kstate_description table > (once per description table as a recursive object) alongside the > object physical address as well. The new kernel has their new version > of the description table. It can compare between the old and new > description tables and find out what fields need to be upgraded or > downgraded. The new kernel will use the old kstate_description table > to decode the previous kernel's saved object. Hmm.. I'm not sure, there is a lot to think about. This might make changes in kstate_description painful, e.g. if I want to rearrange some ->flags for whatever reason. So how to deal with changes in kstate_description itself? How do we save links to methods in kstate_field (->restore/->save/->count), and what if we'll need to change function prototypes of these methods ? > I think that way it is> more flexible to support adding one or more features and not tight to > which version has what feature. It can also make sure the new kernel > can always dump the old KSTATE into YAML. > > That way we might be able to simplify the subsection and the > depreciation flags. The new kernel doesn't need to carry the history > of changes made to the old description table. > >> After the reboot, when the kstate_register() called it parses migration >> stream, finds the appropriate 'kstate_entry' and restores the contents of >> the object in accordance with kstate_description and ->fields. > > Again this restoring can happen in a different order when the PCI > device scanning and probing order. The restoration might not happen in > one single call chain. Material for V3? Agreed. > I am happy to work with you to get KSTATE working with the existing KHO effort. > Thanks for useful feedback, appreciated.