Date: Mon, 20 Feb 2023 11:59:26 +0000
From: Jonathan Cameron
To: Viacheslav A.Dubeyko
CC: Adam Manzanares, "lsf-pc@lists.linux-foundation.org", "linux-mm@kvack.org", "linux-cxl@vger.kernel.org", Dan Williams, "Cong Wang", Viacheslav Dubeyko
Subject: Re: [External] [LSF/MM/BPF TOPIC] CXL Fabric Manager (FM) architecture
Message-ID: <20230220115926.000016c1@Huawei.com>
In-Reply-To: <2EA73B59-7E5B-4FF6-9830-6C4C24FDDB6C@bytedance.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.

On Fri, 17 Feb 2023 10:31:15 -0800
"Viacheslav A.Dubeyko" wrote:

> > On Feb 10, 2023, at 4:32 AM, Jonathan Cameron wrote:
> >
> > On Thu, 9 Feb 2023 14:04:13 -0800
> > "Viacheslav A.Dubeyko" wrote:
> >
> >>> On Feb 9, 2023, at 3:05 AM, Jonathan Cameron wrote:
> >>>
> >>> On Wed, 8 Feb 2023 10:03:57 -0800
> >>> "Viacheslav A.Dubeyko" wrote:
> >>>
> >>>>> On Feb 8, 2023, at 8:38 AM, Adam Manzanares wrote:
> >>>>>
> >>>>> On Thu, Feb 02, 2023 at 09:54:02AM +0000, Jonathan Cameron wrote:
> >>>>>> On Wed, 1 Feb 2023 12:04:56 -0800
> >>>>>> "Viacheslav A.Dubeyko" wrote:
> >>>>>>
> >>>>>>>>
> >>>>
> >>>>>>>
> >>>>>>> Most probably, we will have multiple FM implementations in firmware.
> >>>>>>> Yes, FM on host could be important for debug and to verify correctness
> >>>>>>> of firmware-based implementations. But FM daemon on host could be important
> >>>>>>> to receive notifications and react somehow on these events. Also, journalling
> >>>>>>> of events/messages/events could be important responsibility of FM daemon
> >>>>>>> on host.
> >>>>>>
> >>>>>> I agree with an FM daemon somewhere (potentially running on the BMC type chip
> >>>>>> that also has the lower level FM-API access). I think it is somewhat
> >>>>>> separate from the rest of this on basis it may well just be talking redfish
> >>>>>> to the FM and there are lots of tools for that sort of handling already.
> >>>>>>
> >>>>>
> >>>>> I would be interested in participating in a BOF about this topic. I wonder what
> >>>>> happens when we have multiple switches with multiple FMs each on a separate BMC.
> >>>>> In this case, does it make more sense to have an owner of the global FM state
> >>>>> be a user space application. Is this the job of the orchestrator?
> >>>
> >>> This partly comes down to terminology. Ultimately there is an FM that is
> >>> responsible for the whole fabric (could be distributed software) and that
> >>> in turn will talk to the various BMCs that then talk to the switches.
> >>>
> >>> Depending on the setup it may not be necessary for any entity to see the
> >>> whole fabric.
> >>>
> >>> Interesting point in general though. I think it boils down to getting
> >>> layering in any software correct and that is easier done from outset.
> >>>
> >>> I don't know whether the redfish stuff is flexible enough to cover this, but
> >>> if it is, I'd envision the actual FM talking redfish to a bunch of sub-FMs
> >>> and in turn presenting redfish to the orchestrator.
> >>>
> >>> Any of these components might run on separate machines, or in firmware on
> >>> some device, or indeed all run on one server that is acting as the FM and
> >>> a node in the orchestrator layer.
> >>>
> >>>>>
> >>>>> The BMC based FM seems to have scalability issues, but will we hit them in
> >>>>> practice any time soon.
> >>>
> >>> Who knows ;) If anyone builds the large scale fabric stuff in CXL 3.0 then
> >>> we definitely will in the medium term.
> >>>
> >>>>
> >>>> I had discussion recently and it looks like there are interesting points:
> >>>> (1) If we have multiple CXL switches (especially with complex hierarchy), then it is
> >>>> very compute-intensive activity. So, potentially, FM on firmware side might not be
> >>>> capable of digesting and executing all responsibilities without potential performance
> >>>> degradation.
> >>>
> >>> There is firmware and there is firmware ;) It's not uncommon for BMCs to be
> >>> significant devices in their own right and run Linux or other heavy weight OSes.
> >>>
> >>>> (2) However, if we have FM on host side, then there are security concerns because
> >>>> FM sees everything and all details of multiple hosts and subsystems.
> >>>
> >>> Agreed. Other than testing I wouldn't expect the FM to run on a 'host', but in
> >>> at least some implementations it will be running on a capable Linux machine.
> >>> In large fabrics that may be very capable indeed (basically a server dedicated to
> >>> this role).
> >>>
> >>>> (3) Technically speaking, there is one potential capability: a user-space FM daemon
> >>>> can run on the host side as well as on the CXL switch side. I mean here that if we implement
> >>>> user-space FM daemon, then it could be used to execute FM functionality on CXL
> >>>> switch side (maybe????). :)
> >>>
> >>> Sure, anything could run anywhere. We should draw up some 'reference' architectures
> >>> though to guide discussion down the line. Mind you I think there are a lot of
> >>> steps along the way and starting point should be a simple PoC where all the FM
> >>> stuff is in linux userspace (other than comms). That's easy enough to do.
> >>> If I get a quiet week or so I'll hammer out what we need on emulation side to
> >>> start playing with this.
> >>>
> >>> Jonathan
> >>>
> >>>>
> >>>>>>>>> - Manage surprise removal of devices
> >>>>>>>>
> >>>>>>>> Likewise, beyond reporting I wouldn't expect the FM daemon to have any idea
> >>>>>>>> what to do in the way of managing this. Scream loudly?
> >>>>>>>>
> >>>>>>>
> >>>>>>> Maybe, it could require application(s) notification. Let’s imagine that application
> >>>>>>> uses some resources from removed device. Maybe, FM can manage kernel-space
> >>>>>>> metadata correction and help to manage application requests to non-existent
> >>>>>>> entities.
> >>>>>>
> >>>>>> Notifications for the host are likely to come via inband means - so type3 driver
> >>>>>> handling rather than related to FM. As far as the host is concerned this is the
> >>>>>> same as case where there is no FM and someone ripped a device out.
> >>>>>>
> >>>>>> There might indeed be meta data to manage, but doubt it will have anything to
> >>>>>> do with kernel.
> >>>>>>
> >>>>>
> >>>>> I've also had similar thoughts, I think the OS responds to notifications that
> >>>>> are generated in-band after changes to the state of the FM are made through
> >>>>> OOB means.
> >>>>>
> >>>>> I envision the host sends REDFISH requests to a switch BMC that has an FM
> >>>>> implementation. Once the changes are implemented by the FM it would show up
> >>>>> as changes to the PCIe hierarchy on a host, which is capable of responding to
> >>>>> such changes.
> >>>>>
> >>>>
> >>>> I think I do not completely follow your point. :) First of all, I assume that if host
> >>>> sends REDFISH request, then it will expect confirmation of request execution.
> >>>> It means for me that host needs to receive some packet that informs that request
> >>>> executed successfully or failed. It means that some subsystem or application requested
> >>>> this change and only after receiving the confirmation requested capabilities can be used.
> >>>> And if FM is on CXL switch side, then how will FM show up the changes? It sounds to me
> >>>> that some FM subsystem should be on the host side to receive confirmation/notification
> >>>> and to execute the real changes in PCIe hierarchy. Am I missing something here?
> >>>
> >>> Another terminology issue I think.
> >>> FM from CXL side of things is an abstract thing
> >>> (potentially highly layered / distributed) that acts on instructions from an
> >>> orchestrator (also potentially highly distributed, one implementation is hosts
> >>> can be the orchestrator) and configures the fabric.
> >>> The downstream APIs to the switches and EPs are all in FM-API (CXL spec).
> >>> Upstream, probably all Redfish. What happens in between is impdef (though
> >>> obviously mapping to Redfish or FM-API as applicable may make it more
> >>> reusable and flexible).
> >>>
> >>> I think some diagrams of what is where will help.
> >>> I think we need (note I've always kept the controller hosts as normal hosts as well
> >>> as that includes the case where it never uses the Fabric - so BMC type cases as
> >>> a subset without needing to double the number of diagrams).
> >>>
> >>> 1) Diagram of single host with the FM as one 'thing' on that host - direct interfaces
> >>> to a single switch - interface options include switch CCI MB, mctp over PCI VDM,
> >>> mctp over say i2c.
> >>>
> >>> 2) Diagram of same as above, with a multiple head device all connected to one host.
> >>>
> >>> 3) Diagram of 1 (maybe with MHD below switches), but now with multiple hosts,
> >>> one of which is responsible for fabric management (FM in that manager host
> >>> and orchestrator) - agents on other hosts able to send requests for services to that host.
> >>>
> >>> 4) Diagram of 3, but now with multiple switches, each with separate controlling host.
> >>> Some other hosts that don't have any fabric control.
> >>> Distributed FM across the controlling hosts.
> >>>
> >>> 5) Diagram of 4 but with layered FM and separate Orchestrator. Hosts all talk to the
> >>> orchestrator, that then talks to the FM.
> >>>
> >>> 6) 4, but push some management entities down into switches (from architecture point of
> >>> view this is no different from layered case with a separate BMC per switch - there
> >>> is still either a distributed FM or a layered FM, which the orchestrator talks to.)
> >>>
> >>> Can mess with the exact distribution of who does what across the various layers.
> >>>
> >>> I can sketch this lot up (and that will probably make some gaps in these cases apparent)
> >>> but will take a little while, hence text descriptions in the meantime.
> >>>
> >>> I come back to my personal view though - which is don't worry too much at this early
> >>> stage, beyond making sure we have some layering in code so that we can distribute
> >>> it across a distributed or layered architecture later!
> >>>
> >>
> >> I had slightly more simplified image in my mind. :) We definitely need to have diagrams
> >> to clarify the vision. But which collaboration tool could we use to work publicly on diagrams?
> >> Any suggestion?
> >
> > Ascii art :) To have a broad discussion it needs to be mailing list and that
> > is effectively only option.
> >
>
> I tried to prepare some diagram based on ascii art. :) It looks pretty terrible in email:
>
> ----------------------------         ------------------
> |  ---------       ------  |         |                |
> |  | Agent | <---> | FM |  |         |                |
> |  ---------       ------  |<------->|   CXL switch   |
> |          Host            |         |                |
> |                          |         |                |
> ----------------------------         —————————

Other than the wrong line type on the right, it looks fine to me ;)

>
> I think we need to use some online resource, anyway. We are discussing with Adam what we
> can do here.
>
> You introduced Orchestrator entity. I realized that I do not completely follow the responsibility
> of this subsystem. Do you imply some central point of management of multiple FM instances?

Absolutely - whether its role is actually separate from the FM or not is an
implementation detail, but the assumption is that someone is placing the VMs
etc. that are using the CXL memory, and only that entity will have the
knowledge of what memory to assign to which host to provide that memory to
the VMs.

> Something like a router that has knowledge base and can redirect the request to proper FM
> instance. Am I correct?

More than that. The orchestrator would get a 'give me a VM with X normal DRAM
and Y CXL DRAM' request; it would figure out where to put that VM across a set
of systems and issue the commands to the relevant FMs to 'make it so'. So
that's the entity that would query all the FMs to understand what resources it
is managing and then tell them what to do (possibly via multiple layers of
abstraction and sub-orchestrators etc).

> It sounds to me that orchestrator needs to implement some
> sub-API of FM. Or, maybe, it needs to parse REDFISH packets, for example, and only
> redirects the packets.

I'd expect individual hosts to mostly do what they are told to do, or maybe
ask nicely for more resources for a particular VM or application. The hosts
shouldn't be responsible for allocating those resources, but should just be
told where they are. That stuff might be in redfish or similar, but it's way
above the level of anything CXL specific.

Jonathan

>
> Thanks,
> Slava.
>
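
To make the orchestrator / FM split described above a bit more concrete, here
is a minimal sketch in Python. Everything in it is an assumption made for
illustration: the class names (FabricManager, Orchestrator), the methods
(free_capacity, assign, place_vm), the first-fit placement policy and the GiB
accounting are all hypothetical and are not taken from the CXL FM-API or from
Redfish. A real FM would speak FM-API downstream and probably Redfish upstream;
here each FM is reduced to a pool of CXL memory reachable by a set of hosts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class FabricManager:
    """Hypothetical FM owning one pool of CXL memory reachable by some hosts."""
    name: str
    pool_gib: int                 # unassigned CXL memory in this FM's pool
    hosts: Set[str]               # hosts this FM can bind memory to
    commands: List[Tuple[str, int]] = field(default_factory=list)

    def free_capacity(self) -> int:
        # What the orchestrator would learn by querying this FM.
        return self.pool_gib

    def assign(self, host: str, gib: int) -> None:
        # The 'make it so' step: bind part of the pool to one host.
        if host not in self.hosts or gib > self.pool_gib:
            raise ValueError(f"{self.name} cannot give {gib} GiB to {host}")
        self.pool_gib -= gib
        self.commands.append((host, gib))


@dataclass
class Orchestrator:
    """Hypothetical placement entity; hosts never allocate for themselves."""
    fms: List[FabricManager]
    host_dram_gib: Dict[str, int]  # free local DRAM per host

    def place_vm(self, dram_gib: int, cxl_gib: int) -> Optional[str]:
        # First fit: query each FM, then look for a host below it that has
        # enough local DRAM. The policy is an assumption, not from the thread.
        for fm in self.fms:
            if fm.free_capacity() < cxl_gib:
                continue
            for host in sorted(fm.hosts):
                if self.host_dram_gib.get(host, 0) >= dram_gib:
                    fm.assign(host, cxl_gib)
                    self.host_dram_gib[host] -= dram_gib
                    return host
        return None  # no host / FM combination can satisfy the request
```

With two FMs, say one with a 64 GiB pool serving host1 and host2 and another
with a 256 GiB pool serving host3, place_vm(32, 128) skips the first FM because
its pool is too small and lands the VM on host3. The hosts only ever see the
result of assign(), which matches the point above that they are told where
their resources are rather than allocating them.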