Date: Sat, 24 Jan 2026 10:09:49 -0800
From: Drew Fustini
To: Reinette Chatre
Cc: Dave Martin, linux-kernel@vger.kernel.org, Babu Moger, Fenghua Yu,
    Tony Luck, James Morse, "Chen, Yu C", Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, Dave Hansen, "H. Peter Anvin", Jonathan Corbet,
    x86@kernel.org
Subject: Re: [RFC] fs/resctrl: Generic schema description
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Dec 16, 2025 at 02:26:23PM -0800, Reinette Chatre wrote:
> Hi Babu and Fenghua,
>
> Could you please consider how the new AMD and MPAM features [2] may benefit
> from the new interfaces proposed here? More below ...
>
> On 10/24/25 4:12 AM, Dave Martin wrote:
> > Hi all,
> >
> > Going forward, a single resctrl resource (such as memory bandwidth) is
> > likely to require multiple schemata, either because we want to add new
> > schemata that provide finer control, or because the hardware has
> > multiple controls, covering different aspects of resource allocation.
> >
> > The fit between MPAM's memory bandwidth controls and the resctrl MB
> > schema is already awkward, and later Intel RDT features such as Region
> > Aware Memory Bandwidth Allocation are already pushing past what the MB
> > schema can describe. Both of these can involve multiple control
> > values and finer resolution than the 100 steps offered by the current
> > "MB" schema.
> >
> > The previous discussion went off in a few different directions [1], so
> > I want to focus back onto defining an extended schema description that
> > aims to cover the use cases that we know about or anticipate today, and
> > allows for future extension as needed.
> >
> > (A separate discussion is needed on how new schemata interact with
> > previously-defined schemata (such as the MB percentage schema). I
> > suggest we pause that discussion for now, in the interests of getting
> > the schema description nailed down.)
> >
> >
> > Following on from the previous mail thread, I've tried to refine and
> > flesh out the proposal for schema descriptions a bit, as follows.
> >
> > Proposal:
> >
> >  * Split resource names and schema names in resctrlfs.
> >
> >    Resources will be named for the unique, existing schema for each
> >    resource.
> >
> >    The existing schema will keep its name (the same as the resource
> >    name), and new schemata defined for a resource will include that
> >    name as a prefix (at least, by default).
> >
> >    So, for example, we will have an MB resource with a schema called
> >    MB (the schema that we have already). But we may go on to define
> >    additional schemata for the MB resource, with names such as MB_MAX,
> >    etc.
> >
> >  * Stop adding new schema description information in the top-level
> >    info// directory in resctrlfs.
> >
> >    For backwards compatibility, we can keep the existing property
> >    files under the resource info directory to describe the previously
> >    defined resource, but we seem to need something richer going
> >    forward.
> >
> >  * Add a hierarchy to list all the schemata for each resource, along
> >    with their properties. So far, the proposal looks like this,
> >    taking the MB resource as an example:
> >
> >    info/
> >     └─ MB/
> >         └─ resource_schemata/
> >             ├─ MB/
> >             ├─ MB_MIN/
> >             ├─ MB_MAX/
> >             ┆
> >
> >    Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
> >    In this proposal, these are just dummy schema names for
> >    illustration purposes. The important thing is that they all
> >    control aspects of the "MB" resource, and that there can be more
> >    than one of them.
> >
> >    It may be appropriate to have a nested hierarchy, where some
> >    schemata are presented as children of other schemata if they
> >    affect the same hardware controls. For now, let's put this issue
> >    on one side, and consider what properties should be advertised for
> >    each schema.
> >
> >  * Current properties that I think we might want are:
> >
> >    info/
> >     └─ SOME_RESOURCE/
> >         └─ resource_schemata/
> >             ├─ SOME_SCHEMA/
> >             ┆   ├─ type
> >                 ├─ min
> >                 ├─ max
> >                 ├─ tolerance
> >                 ├─ resolution
> >                 ├─ scale
> >                 └─ unit
> >
> >    (I've tweaked the properties a bit since previous postings.
> >    "type" replaces "map"; "scale" is now the unit multiplier;
> >    "resolution" is now a scaling divisor -- details below.)
> >
> >    I assume that we expose the properties in individual files, but we
> >    could also combine them into a single description file per schema,
> >    per resource or (possibly) a single global file.
> >    (I don't have a strong view on the best option.)
> >
> >
> >    Either way, the following set of properties may be a reasonable
> >    place to start:
> >
> >
> >    type: the schema type, followed by optional flag specifiers:
> >
> >     - "scalar": a single-valued numeric control
> >
> >       A mandatory flag indicates how the control value written to
> >       the schemata file is converted to an amount of resource for
> >       hardware regulation.
> >
> >       The flag "linear" indicates a linear mapping.
> >
> >       In this case, the amount of resource E that is actually
> >       allocated is derived from the control value C written to the
> >       schemata file as follows:
> >
> >         E = C * scale * unit / resolution
> >
> >       Other flag values could be defined later, if we encounter
> >       hardware with non-linear controls.
> >
> >     - "bitmap": a bitmap control
> >
> >       The optional flag "sparse" is present if the control accepts
> >       sparse bitmaps.
> >
> >       In this case, E = bitmap_weight(C) * scale * unit / resolution.
> >
> >       As before, each bit controls access to a specific chunk of
> >       resource in the hardware, such as a group of cache lines. All
> >       chunks are equally sized.
> >
> >       (Different CTRL_MON groups may still contend within the
> >       allocation E, when they have bits in common between their
> >       bitmaps.)
> >
> >    min:
> >
> >     - For a scalar schema, the minimum value that can be written to
> >       the control when writing the schemata file.
> >
> >     - For a bitmap schema, a bitmap of the minimum weight that the
> >       schema accepts: if an empty bitmap is accepted, this can be 0.
> >       Otherwise, if bitmaps with a single bit set are acceptable,
> >       this can just have the lowest-order bit set.
> >
> >       Most commonly, the value will probably be "1".
> >
> >       For bitmap schemata, we might report this in hex. In the
> >       interest of generic parsing, we could include a "0x" prefix if
> >       so.
> >
> >    max:
> >
> >     - For a scalar schema, the maximum value that can be written to
> >       the control when writing the schemata file.
> >
> >     - For a bitmap schema, the mask with all bits set.
> >
> >       Possibly reported in hex for bitmap schemata (as for "min").
> >
> >    tolerance:
> >
> >       (See below for discussion on this.)
> >
> >     - "0": the control is exact
> >
> >     - "1": the effective control value is within ±1 of the control
> >       value written to the schemata file. (Similarly, positive "n" ->
> >       ±n.)
> >
> >       A negative value could be used to indicate that the tolerance
> >       is unknown. (Possibly we could also just omit the property,
> >       though it seems better to warn userspace explicitly if we
> >       don't know.)
> >
> >       Tests might make use of this parameter in order to determine
> >       how picky to be about exact measurement results.
> >
> >    resolution:
> >
> >     - For a proportional scalar schema: the number of divisions that
> >       the whole resource is divided into. (See below for
> >       "proportional scalar schema".)
> >
> >       Typically, this will be the same as the "max" value.
> >
> >     - For an absolute scalar schema: the divisor applied to the
> >       control value.
> >
> >     - For a bitmap schema: the size of the bitmap in bits.
> >
> >    scale:
> >
> >     - For a scalar schema: the scale-up multiplier applied to
> >       "unit".
> >
> >     - For a bitmap schema: probably "1".
> >
> >    unit:
> >
> >     - The base unit of the quantity measured by the control value.
> >
> >       The special unit "all" denotes a proportional schema. In this
> >       case, the resource is a finite, physical thing such as a cache
> >       or maxed-out data throughput of a memory controller. The
> >       entire physical resource is available for allocation, and the
> >       control value indicates what proportion of it is allocated.
> >
> >       Bitmap schemata will probably all be proportional and use the
> >       unit "all". (This applies to cache bitmaps, at least.)
> >
> >       Absolute schemata will require specification of the base unit
> >       here, say, "MBps". The "scale" parameter can be used to avoid
> >       proliferation of unit strings:
> >
> >       For example, {scale=1000, unit="MBps"} would be equivalent to
> >       {scale=1, unit="GBps"}.
> >
> >
> > Note on the "tolerance" parameter:
> >
> >   This is a new addition. On the MPAM side, the hardware has a choice
> >   about how to interpret the control value in some edge-case situations.
> >   We may not reasonably be able to probe for this, so it may be useful
> >   to warn software that there is an uncertainty margin.
> >
> >   We might also be able to use the "tolerance" parameter to accommodate
> >   the rounding behaviour of the existing "MB" schema (otherwise, we
> >   might want a special "type" for this schema, if it doesn't comply
> >   closely enough).
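As an illustration only, the property semantics quoted above could be
sketched in userspace Python as follows. The function names and the
example values are hypothetical and not part of the proposal; the result
is expressed in multiples of "unit" (so a fraction of the whole resource
when unit is "all"). Clamping the tolerance window to [min, max] is an
assumption of this sketch.

```python
from fractions import Fraction


def effective_amount(schema_type, control, scale, resolution):
    """Return E / unit, following E = C * scale * unit / resolution.

    For a "bitmap" schema, C is replaced by the number of set bits,
    i.e. bitmap_weight(C) in the quoted description.
    """
    if schema_type == "scalar":
        weight = control
    elif schema_type == "bitmap":
        weight = bin(control).count("1")
    else:
        raise ValueError(f"unknown schema type: {schema_type!r}")
    return Fraction(weight * scale, resolution)


def control_window(control, tolerance, lo, hi):
    """Range of effective control values implied by "tolerance".

    Returns None when the tolerance is negative (unknown). Clamping to
    [lo, hi] is an assumption made here, not something the proposal states.
    """
    if tolerance < 0:
        return None
    return max(lo, control - tolerance), min(hi, control + tolerance)


# Hypothetical proportional scalar schema: 50 of 100 steps of "all".
half = effective_amount("scalar", 50, scale=1, resolution=100)

# Hypothetical 12-bit cache bitmap with four bits set.
third = effective_amount("bitmap", 0xF0, scale=1, resolution=12)

# Hypothetical absolute schema: C=3 with {scale=1000, unit="MBps"}.
mbps = effective_amount("scalar", 3, scale=1000, resolution=1)
```

Exact fractions are used so that a test comparing against measured
results can decide separately how much slack "tolerance" allows.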
> >
> >
> > If we want to deploy resctrl under virtualisation, resctrl on the host
> > could dynamically affect the actual amount of resource that is
> > available for allocation inside a VM.
> >
> > Whether or not we ever want to do that, it might be useful to have a
> > way to warn software that the effective control values hitting the
> > hardware may not be entirely predictable.
> >
> > Thoughts?
> >
> > Cheers
> > ---Dave
> >
> One thing I was pondering is that resctrl currently uses L3 interchangeably
> as a scope and a resource but if instead that is separated then it should be
> easier to support interactions with a resource at a different scope.
>
> I am concerned that, for example, support for Global Memory Bandwidth Allocation
> (GMBA) is planned to be done with a new resource. resctrl already has a
> "memory bandwidth allocation" resource and introducing a new resource to essentially
> manage the same resource, but at a different scope, sounds like a risk of fragmentation
> and duplication to me.
>
> What if the "resource control" instead gains a new property, for example, "scope" that
> essentially communicates to user space what a domain ID in the schemata file means?
>
> It is not clear to me what a "domain ID" of GMBA means so I will use the MPAM CPU-less
> MBM as an example that I expect will build on SMBA that supports CXL.mem. Consider an interface
> like below:
>
> info
> └── SMBA
>     └── resource_schemata
>         ├── SMBA
>         │   ├── max
>         │   ├── min
>         │   ├── resolution
>         │   ├── scale
>         │   ├── scope <== contains "L3"
>         │   ├── tolerance
>         │   ├── type
>         │   └── unit
>         └── SMBA_NODE
>             ├── max
>             ├── min
>             ├── resolution
>             ├── scale
>             ├── scope <== contains "NODE"
>             ├── tolerance
>             ├── type
>             └── unit
>
> With an interface like above there is a single resource and allocating it at a different
> scope is just another control. This correlates to how other parts of resctrl are managed.
> For example, it can become explicit that the monitor groups' mon_data directory contains
> sub-directories organized by scope. For example:
>
> mon_data
> ├── mon_L3_00 <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_L3_01 <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_NODE_00 <== monitoring data at scope NODE
> │   └── mbm_total_bytes
> └── mon_NODE_01 <== monitoring data at scope NODE
>     └── mbm_total_bytes
>
> What do you think?

I think that the ability to have different scopes for a resource would
work well for QoS on RISC-V. The CBQRI spec [1] defines bandwidth
controller operations, and those controllers can be anywhere in the
system. I've been having trouble deciding what to do about a
CBQRI-enabled memory controller, as all bandwidth monitoring is
currently assumed to be at L3 scope. Therefore, my RFC series [2] that
adds resctrl support for RISC-V does not support bandwidth monitoring,
but I think the scope concept could make it work.

Thanks,
Drew

[1] https://github.com/riscv-non-isa/riscv-cbqri/releases/tag/v1.0
[2] https://lore.kernel.org/all/20260119-ssqosid-cbqri-v1-0-aa2a75153832@kernel.org/
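As a rough illustration of the scope idea (not taken from the RFC
series), a userspace tool could split the mon_data directory names from
the example above into a scope and a domain ID. The regular expression
and the function names are hypothetical, and the sketch only handles
names of the form mon_<scope>_<id> as shown in the quoted tree.

```python
import re

# Matches names such as "mon_L3_00" or "mon_NODE_01" from the example tree.
MON_DIR = re.compile(r"^mon_(?P<scope>[A-Za-z0-9]+)_(?P<domain>\d+)$")


def parse_mon_dir(name):
    """Split a mon_data sub-directory name into (scope, domain_id)."""
    match = MON_DIR.match(name)
    if match is None:
        raise ValueError(f"unexpected mon_data entry: {name!r}")
    return match.group("scope"), int(match.group("domain"))


def group_by_scope(names):
    """Map each scope to the sorted list of domain IDs seen for it."""
    scopes = {}
    for name in names:
        scope, domain = parse_mon_dir(name)
        scopes.setdefault(scope, []).append(domain)
    return {scope: sorted(ids) for scope, ids in scopes.items()}


example = ["mon_L3_00", "mon_L3_01", "mon_NODE_00", "mon_NODE_01"]
by_scope = group_by_scope(example)
```

With a layout like this, a memory-controller-level bandwidth monitor
would simply show up under a non-L3 scope rather than needing a
separate resource.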