Message-ID: <6d257a70-d27f-4741-8fa5-fa765fa10643@redhat.com>
Date: Fri, 12 Dec 2025 14:27:08 +1000
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 16/38] arm_mpam: resctrl: Add support for 'MB' resource
To: James Morse, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org
Cc: D Scott Phillips OS, carl@os.amperecomputing.com, lcherian@marvell.com, bobo.shaobowang@huawei.com, tan.shaopeng@fujitsu.com, baolin.wang@linux.alibaba.com, Jamie Iles, Xin Hao, peternewman@google.com, dfustini@baylibre.com, amitsinght@marvell.com, David Hildenbrand, Dave Martin, Koba Ko, Shanker Donthineni, fenghuay@nvidia.com, baisheng.gao@unisoc.com, Jonathan Cameron, Ben Horgan, rohit.mathew@arm.com, reinette.chatre@intel.com, Punit Agrawal, Zeng Heng
References: <20251205215901.17772-1-james.morse@arm.com> <20251205215901.17772-17-james.morse@arm.com>
From: Gavin Shan
In-Reply-To: <20251205215901.17772-17-james.morse@arm.com>

Hi James and Ben,

On 12/6/25 7:58 AM, James Morse wrote:
> resctrl supports 'MB', as a percentage throttling of traffic somewhere > after the L3. This is the control that mba_sc uses, so ideally the > class chosen should be as close as possible to the counters used for > mba_local. > > MB's percentage control should be backed either with the fixed point > fraction MBW_MAX. The bandwidth portion bitmaps is not used as its > tricky to pick which bits to use to avoid contention, and may be > possible to expose this as something other than a percentage in the > future.
> > CC: Zeng Heng > Co-developed-by: Dave Martin > Signed-off-by: Dave Martin > Signed-off-by: James Morse > > --- > drivers/resctrl/mpam_resctrl.c | 212 ++++++++++++++++++++++++++++++++- > 1 file changed, 211 insertions(+), 1 deletion(-) > > diff --git a/drivers/resctrl/mpam_resctrl.c b/drivers/resctrl/mpam_resctrl.c > index 55576d0caf12..b9f3f00d8cad 100644 > --- a/drivers/resctrl/mpam_resctrl.c > +++ b/drivers/resctrl/mpam_resctrl.c > @@ -247,6 +247,33 @@ static bool cache_has_usable_cpor(struct mpam_class *class) > return (class->props.cpbm_wd <= 32); > } > > +static bool mba_class_use_mbw_max(struct mpam_props *cprops) > +{ > + return (mpam_has_feature(mpam_feat_mbw_max, cprops) && > + cprops->bwa_wd); > +} > + > +static bool class_has_usable_mba(struct mpam_props *cprops) > +{ > + return mba_class_use_mbw_max(cprops); > +} > + > +/* > + * Calculate the worst-case percentage change from each implemented step > + * in the control. > + */ > +static u32 get_mba_granularity(struct mpam_props *cprops) > +{ > + if (!mba_class_use_mbw_max(cprops)) > + return 0; > + > + /* > + * bwa_wd is the number of bits implemented in the 0.xxx > + * fixed point fraction. 1 bit is 50%, 2 is 25% etc. 
> + */ > + return DIV_ROUND_UP(MAX_MBA_BW, 1 << cprops->bwa_wd); > +} > + > /* > * Each fixed-point hardware value architecturally represents a range > * of values: the full range 0% - 100% is split contiguously into > @@ -287,6 +314,96 @@ static u16 percent_to_mbw_max(u8 pc, struct mpam_props *cprops) > return val; > } > > +static u32 get_mba_min(struct mpam_props *cprops) > +{ > + u32 val = 0; > + > + if (mba_class_use_mbw_max(cprops)) > + val = mbw_max_to_percent(val, cprops); > + else > + WARN_ON_ONCE(1); > + > + return val; > +} > + > +/* Find the L3 cache that has affinity with this CPU */ > +static int find_l3_equivalent_bitmask(int cpu, cpumask_var_t tmp_cpumask) > +{ > + u32 cache_id = get_cpu_cacheinfo_id(cpu, 3); > + > + lockdep_assert_cpus_held(); > + > + return mpam_get_cpumask_from_cache_id(cache_id, 3, tmp_cpumask); > +} > + > +/* > + * topology_matches_l3() - Is the provided class the same shape as L3 > + * @victim: The class we'd like to pretend is L3. > + * > + * resctrl expects all the world's a Xeon, and all counters are on the > + * L3. We play fast and loose with this, mapping counters on other > + * classes - provided the CPU->domain mapping is the same kind of shape. > + * > + * Using cacheinfo directly would make this work even if resctrl can't > + * use the L3 - but cacheinfo can't tell us anything about offline CPUs. > + * Using the L3 resctrl domain list also depends on CPUs being online. > + * Using the mpam_class we picked for L3 so we can use its domain list > + * assumes that there are MPAM controls on the L3. > + * Instead, this path eventually uses the mpam_get_cpumask_from_cache_id() > + * helper which can tell us about offline CPUs ... but getting the cache_id > + * to start with relies on at least one CPU per L3 cache being online at > + * boot. > + * > + * Walk the victim component list and compare the affinity mask with the > + * corresponding L3. 
The topology matches if each victim:component's affinity > + * mask is the same as the CPU's corresponding L3's. These lists/masks are > + * computed from firmware tables so don't change at runtime. > + */ > +static bool topology_matches_l3(struct mpam_class *victim) > +{ > + int cpu, err; > + struct mpam_component *victim_iter; > + cpumask_var_t __free(free_cpumask_var) tmp_cpumask; > + > + if (!alloc_cpumask_var(&tmp_cpumask, GFP_KERNEL)) > + return false; > + > + guard(srcu)(&mpam_srcu); > + list_for_each_entry_srcu(victim_iter, &victim->components, class_list, > + srcu_read_lock_held(&mpam_srcu)) { > + if (cpumask_empty(&victim_iter->affinity)) { > + pr_debug("class %u has CPU-less component %u - can't match L3!\n", > + victim->level, victim_iter->comp_id); > + return false; > + } > + > + cpu = cpumask_any(&victim_iter->affinity); > + if (WARN_ON_ONCE(cpu >= nr_cpu_ids)) > + return false; > + > + cpumask_clear(tmp_cpumask); > + err = find_l3_equivalent_bitmask(cpu, tmp_cpumask); > + if (err) { > + pr_debug("Failed to find L3's equivalent component to class %u component %u\n", > + victim->level, victim_iter->comp_id); > + return false; > + } > + > + /* Any differing bits in the affinity mask? */ > + if (!cpumask_equal(tmp_cpumask, &victim_iter->affinity)) { > + pr_debug("class %u component %u has Mismatched CPU mask with L3 equivalent\n" > + "L3:%*pbl != victim:%*pbl\n", > + victim->level, victim_iter->comp_id, > + cpumask_pr_args(tmp_cpumask), > + cpumask_pr_args(&victim_iter->affinity)); > + > + return false; > + } > + } > + > + return true; > +} > + > /* Test whether we can export MPAM_CLASS_CACHE:{2,3}? 
*/ > static void mpam_resctrl_pick_caches(void) > { > @@ -330,10 +447,63 @@ static void mpam_resctrl_pick_caches(void) > } > } > > +static void mpam_resctrl_pick_mba(void) > +{ > + struct mpam_class *class, *candidate_class = NULL; > + struct mpam_resctrl_res *res; > + > + lockdep_assert_cpus_held(); > + > + guard(srcu)(&mpam_srcu); > + list_for_each_entry_srcu(class, &mpam_classes, classes_list, > + srcu_read_lock_held(&mpam_srcu)) { > + struct mpam_props *cprops = &class->props; > + > + if (class->level < 3) { > + pr_debug("class %u is before L3\n", class->level); > + continue; > + } > + > + if (!class_has_usable_mba(cprops)) { > + pr_debug("class %u has no bandwidth control\n", > + class->level); > + continue; > + } > + > + if (!cpumask_equal(&class->affinity, cpu_possible_mask)) { > + pr_debug("class %u has missing CPUs\n", class->level); > + continue; > + } > + > + if (!topology_matches_l3(class)) { > + pr_debug("class %u topology doesn't match L3\n", > + class->level); > + continue; > + } > + > + /* > + * mba_sc reads the mbm_local counter, and waggles the MBA > + * controls. mbm_local is implicitly part of the L3, pick a > + * resource to be MBA that as close as possible to the L3. > + */ > + if (!candidate_class || class->level < candidate_class->level) > + candidate_class = class; > + } > + > + if (candidate_class) { > + pr_debug("selected class %u to back MBA\n", > + candidate_class->level); > + res = &mpam_resctrl_controls[RDT_RESOURCE_MBA]; > + res->class = candidate_class; > + exposed_alloc_capable = true; > + } > +} > + > static int mpam_resctrl_control_init(struct mpam_resctrl_res *res, > enum resctrl_res_level type) > { > struct mpam_class *class = res->class; > + struct mpam_props *cprops = &class->props; > struct rdt_resource *r = &res->resctrl_res; > > switch (res->resctrl_res.rid) { > @@ -362,6 +532,20 @@ static int mpam_resctrl_control_init(struct mpam_resctrl_res *res, > * 'all the bits' is the correct answer here. 
> */ > r->cache.shareable_bits = resctrl_get_default_ctrl(r); > + break; > + case RDT_RESOURCE_MBA: > + r->alloc_capable = true; > + r->schema_fmt = RESCTRL_SCHEMA_RANGE; > + r->ctrl_scope = RESCTRL_L3_CACHE; > + > + r->membw.delay_linear = true; > + r->membw.throttle_mode = THREAD_THROTTLE_UNDEFINED; > + r->membw.min_bw = get_mba_min(cprops); > + r->membw.max_bw = MAX_MBA_BW; > + r->membw.bw_gran = get_mba_granularity(cprops); > + > + r->name = "MB"; > + > break; > default: > break; > @@ -377,7 +561,17 @@ static int mpam_resctrl_pick_domain_id(int cpu, struct mpam_component *comp) > if (class->type == MPAM_CLASS_CACHE) > return comp->comp_id; > > - /* TODO: repaint domain ids to match the L3 domain ids */ > + if (topology_matches_l3(class)) { > + /* Use the corresponding L3 component ID as the domain ID */ > + int id = get_cpu_cacheinfo_id(cpu, 3); > + > + /* Implies topology_matches_l3() made a mistake */ > + if (WARN_ON_ONCE(id == -1)) > + return comp->comp_id; > + > + return id; > + } > + > /* > * Otherwise, expose the ID used by the firmware table code. 
> */ > @@ -419,6 +613,12 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d, > case RDT_RESOURCE_L3: > configured_by = mpam_feat_cpor_part; > break; > + case RDT_RESOURCE_MBA: > + if (mpam_has_feature(mpam_feat_mbw_max, cprops)) { > + configured_by = mpam_feat_mbw_max; > + break; > + } > + fallthrough; > default: > return resctrl_get_default_ctrl(r); > } > @@ -430,6 +630,8 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d, > switch (configured_by) { > case mpam_feat_cpor_part: > return cfg->cpbm; > + case mpam_feat_mbw_max: > + return mbw_max_to_percent(cfg->mbw_max, cprops); > default: > return resctrl_get_default_ctrl(r); > } > @@ -474,6 +676,13 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d, > cfg.cpbm = cfg_val; > mpam_set_feature(mpam_feat_cpor_part, &cfg); > break; > + case RDT_RESOURCE_MBA: > + if (mpam_has_feature(mpam_feat_mbw_max, cprops)) { > + cfg.mbw_max = percent_to_mbw_max(cfg_val, cprops); > + mpam_set_feature(mpam_feat_mbw_max, &cfg); > + break; > + } > + fallthrough;

I think mpam_feat_mbw_min probably needs to be cleared in '&cfg', whose
content is copied from the component's configuration. mpam_feat_mbw_min
may already be set in '&cfg', in which case struct mpam_config::mbw_min
won't be updated by the subsequent call to mpam_extend_config(). As a
result, the MPAMCFG_MBW_MIN register isn't updated correctly.
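
To make the suspected failure mode concrete, here is a toy model in plain C. It is only a sketch under assumptions: struct toy_cfg, the TOY_FEAT_* bits, toy_extend_config() and toy_update_max() are hypothetical stand-ins, not the driver's code, and the "half of the maximum" rule is a placeholder. The only behaviour it tries to capture is the one described above: a minimum is derived only when the copied config does not already claim to carry one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's mpam_feat_* bits. */
#define TOY_FEAT_MBW_MAX (1u << 0)
#define TOY_FEAT_MBW_MIN (1u << 1)

struct toy_cfg {
	uint32_t features;	/* which of the fields below are valid */
	uint16_t mbw_max;
	uint16_t mbw_min;
};

/*
 * Assumed behaviour, inferred from the report: a minimum is only derived
 * when the config does not already claim to carry one.
 */
static void toy_extend_config(struct toy_cfg *cfg)
{
	assert(cfg != NULL);

	if ((cfg->features & TOY_FEAT_MBW_MAX) &&
	    !(cfg->features & TOY_FEAT_MBW_MIN)) {
		cfg->mbw_min = cfg->mbw_max / 2;	/* placeholder rule, not the driver's */
		cfg->features |= TOY_FEAT_MBW_MIN;
	}
}

/* Returns the mbw_min that would be programmed after changing the maximum. */
static uint16_t toy_update_max(const struct toy_cfg *component,
			       uint16_t new_max, bool clear_min)
{
	struct toy_cfg cfg;

	assert(component != NULL);

	cfg = *component;	/* starts as a copy, stale feature bits included */
	cfg.mbw_max = new_max;
	cfg.features |= TOY_FEAT_MBW_MAX;

	if (clear_min)		/* the proposed extra step */
		cfg.features &= ~TOY_FEAT_MBW_MIN;

	toy_extend_config(&cfg);
	return cfg.mbw_min;
}
```

For a component that already has TOY_FEAT_MBW_MIN set with mbw_min 0x7800, toy_update_max(&comp, 0x1000, false) returns the stale 0x7800, while toy_update_max(&comp, 0x1000, true) returns the freshly derived 0x0800.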

On NVIDIA's Grace Hopper machine, I got:

    host$ mount none -tresctrl /sys/fs/resctrl/
    host$ mkdir -p /sys/fs/resctrl/all
    host$ mkdir -p /sys/fs/resctrl/test
    host$ cat /proc/dump_feat_regs
    MPAMF_IDR 0000008057010027
    MAPMF_MBW_IDR 00000c07
    host$ echo "MB:1=98" > /sys/fs/resctrl/test/schemata
    host$ cat /proc/dump_cfg_regs
    MPAMCFG_PART_SEL 00000002
    MPAMCFG_MBW_MAX 0000f9ff
    MPAMCFG_MBW_MIN 0000f000
    host$ echo "MB:1=2" > /sys/fs/resctrl/test/schemata
    host$ cat /proc/dump_cfg_regs
    MPAMCFG_PART_SEL 00000002
    MPAMCFG_MBW_MAX 000005ff
    MPAMCFG_MBW_MIN 0000f000

With 'mpam_clear_feature(mpam_feat_mbw_min, &cfg);' applied here, the
register is updated correctly. It also makes my (soft) MBW limiting
tests happy.

    host$ echo "MB:1=98" > /sys/fs/resctrl/test/schemata
    host$ cat /proc/dump_cfg_regs
    MPAMCFG_PART_SEL 00000002
    MPAMCFG_MBW_MAX 0000f9ff
    MPAMCFG_MBW_MIN 0000ea00
    host$ echo "MB:1=2" > /sys/fs/resctrl/test/schemata
    host$ cat /proc/dump_cfg_regs
    MPAMCFG_PART_SEL 00000002
    MPAMCFG_MBW_MAX 000005ff
    MPAMCFG_MBW_MIN 00000200

Thanks,
Gavin

> default: > return -EINVAL; > } > @@ -743,6 +952,7 @@ int mpam_resctrl_setup(void) > > /* Find some classes to use for controls */ > mpam_resctrl_pick_caches(); > + mpam_resctrl_pick_mba(); > > /* Initialise the resctrl structures from the classes */ > for (i = 0; i < RDT_NUM_RESOURCES; i++) {
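
For readers decoding the register dumps in the reply above, the following C sketch turns the raw values into implemented fraction steps. It rests on assumptions that are not stated in this message: that BWA_WD lives in MPAMF_MBW_IDR bits [5:0], and that MPAMCFG_MBW_MAX and MPAMCFG_MBW_MIN hold a 16-bit fixed-point fraction whose implemented bits are the most significant ones (which matches the patch comment about bwa_wd being the number of bits in the 0.xxx fraction). The helper names are made up for illustration and are not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: BWA_WD is MPAMF_MBW_IDR bits [5:0]. */
static unsigned int mbw_idr_bwa_wd(uint32_t mbw_idr)
{
	return mbw_idr & 0x3f;
}

/*
 * Keep only the implemented (most significant) bwa_wd bits of the
 * 16-bit fixed-point field, i.e. the numerator of steps / 2^bwa_wd.
 */
static unsigned int mbw_implemented_bits(uint16_t reg, unsigned int bwa_wd)
{
	assert(bwa_wd >= 1 && bwa_wd <= 16);

	return reg >> (16 - bwa_wd);
}

/* The same fraction expressed in tenths of a percent, rounded down. */
static unsigned int mbw_tenths_of_percent(uint16_t reg, unsigned int bwa_wd)
{
	unsigned int steps = mbw_implemented_bits(reg, bwa_wd);

	return (steps * 1000u) >> bwa_wd;
}
```

Under those assumptions, 0x00000c07 gives a BWA_WD of 7, so each step is 1/128. After "MB:1=2" without the fix, MBW_MAX 0x05ff decodes to 2 steps (about 1.5%) while MBW_MIN 0xf000 still decodes to 120 steps (about 93.7%), so the minimum sits far above the maximum. With the fix, MBW_MIN 0x0200 decodes to 1 step.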