From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 15FAEC2B9F4 for ; Mon, 14 Jun 2021 21:54:50 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id CB88261209 for ; Mon, 14 Jun 2021 21:54:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229836AbhFNV4w (ORCPT ); Mon, 14 Jun 2021 17:56:52 -0400 Received: from mga11.intel.com ([192.55.52.93]:58470 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229734AbhFNV4v (ORCPT ); Mon, 14 Jun 2021 17:56:51 -0400 IronPort-SDR: fnabyfdd/ptNrdTpVJmJ/OdhFDppmTukpiU8qV484FRhFQaG1rGtGCe9/oqaDqzmqrvv56F0pJ rQ03ohnhbl4A== X-IronPort-AV: E=McAfee;i="6200,9189,10015"; a="202860332" X-IronPort-AV: E=Sophos;i="5.83,273,1616482800"; d="scan'208";a="202860332" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Jun 2021 14:54:05 -0700 IronPort-SDR: ybaGFAH0EOUaTEtqzddXqJMMYYi/rChdBO1Oljr/v7BXr/r5djph9a7yFg51cgm9uIubgO/qoG 43NNfXs5InyQ== X-IronPort-AV: E=Sophos;i="5.83,273,1616482800"; d="scan'208";a="636923886" Received: from smothe-mobl.amr.corp.intel.com (HELO intel.com) ([10.252.143.124]) by fmsmga006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Jun 2021 14:54:04 -0700 Date: Mon, 14 Jun 2021 14:54:02 -0700 From: Ben Widawsky To: Dan Williams Cc: linux-cxl@vger.kernel.org, Alison Schofield , Ira Weiny , Jonathan Cameron , Vishal Verma Subject: Re: [RFC PATCH 0/4] Region Creation Message-ID: <20210614215402.mxcwdv4wno6krm7w@intel.com> References: <20210610185725.897541-1-ben.widawsky@intel.com> <20210614161159.cqq64nbm5whzpud7@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-cxl@vger.kernel.org On 21-06-14 14:04:32, Dan Williams wrote: > On Mon, Jun 14, 2021 at 9:12 AM Ben Widawsky wrote: > > > > On 21-06-11 17:44:02, Dan Williams wrote: > > > On Thu, Jun 10, 2021 at 11:58 AM Ben Widawsky wrote: > > > > > > > > CXL interleave sets and non-interleave sets are described via regions. A region > > > > is specified in the CXL 2.0 specification and the purpose is to create a > > > > standardized way to preserve the region across reboots. > > > > > > > > Introduced here is the basic mechanism to create and configure and delete a CXL > > > > region. Configuring a region simply means giving it a size, offset within the > > > > CFMWS window, UUID, and a target list. Enabling/activating a region, which > > > > ultimately means programming the HDM decoders in the chain, is left for later > > > > work. > > > > > > > > The patches are only minimally tested so far in QEMU emulation and so x1 > > > > interleave is all that's supported. > > > > > > > > Here is a sample topology (also in patch #4) > > > > > > I'm just going to react to the attributes before looking at the > > > implementation to make sure we're level set. > > > > > > > > > > > decoder1.0 > > > > ├── create_region > > > > ├── delete_region > > > > ├── devtype > > > > ├── locked > > > > ├── region1.0:0 > > > > │ ├── offset > > > > > > Is this the region's offset relative to the next available free space > > > in the parent decoder range? If this is output only I think it's ok, > > > but I think the address space allocation decision belongs to the > > > region driver at activation time. I.e. userspace does not have much of > > > a chance at specifying this relative all the other dynamic operations > > > that can be happening in the decoder. > > > > > > > This was my mistake. Offset will be determined by the driver and I intend for > > this to be read-only. > > > > > > │ ├── size > > > > │ ├── subsystem -> ../../../../../../../bus/cxl > > > > │ ├── target0 > > > > │ ├── uevent > > > > │ ├── uuid > > > > │ └── verify > > > > > > I don't understand the role of a standalone @verify attribute, there > > > is verification that can happen per attribute write, and there is > > > final verification that can happen at region bind time. Either way > > > anything verify would check is duplicated somewhere else, and the > > > verification per attribute update is more precise. For example writes > > > to @size can check for free space in parent decoder and fail if > > > unavailable. Writes to targetX can fail if the memdev is not connected > > > to this decoder's port topology, or the memdev is out of decoder > > > resources. The final region bind will fail if mid-level switches are > > > lacking decoder resources, or would require changing a decoder > > > configuration that is pinned active. > > > > I strongly believe verification per attribute write will get too fragile. I'm > > afraid it's going to require writing attributes in a specific order so that we > > can do said verification in a sane way. We can skip that and just check it all > > on bind. Most of that logic is what would be contained in verify(), so why not > > expose it for userspace that may want to test out various configs without > > actually trying to bind? > > Because there's no harm in actually trying to bind. A verify attribute > is at best redundant, or I am otherwise not understanding the proposed > use case? > That's the use case. Though I don't consider it redundant. All bind() can return is errnos + what you mention below (and following LWN link). > > Also, I like having ABI that helps userspace get details on the configuration > > failure reason. You mention in the other reply, TRACE_EVENT. I suppose userspace > > could use tracepoints, or scrape dmesg for this same info. Maybe it's 6 one way, > > a half dozen the other. I'd be interested to know if there are other examples of > > tracepoints being used by userspace in a way like this and what the experience > > is like. > > > > To summarize, I think we need an atomic way to do verification (which obviously > > happens at bind()), and I think we need UAPI to get the configuration error. > > I expect higher order configuration error reporting and non-atomic > pre-verification to come from user tooling. But isn't that just duplicating code that we have to have in the kernel anyway? > As for what the kernel can do at runtime in the absence of user tooling, or in > the development of more aware tooling has been debated in the past [1]. In > this case the entire decoder resource topology is visible in userspace, an3d > while userspace can't atomically predict what will happen, it also does not > need to because the admin should not be racing resource querying and resource > consumption if they want to get a reliable answer. The reason I recommended > TRACE_EVENT() rather than dev_dbg() is due to being able to filter event > messages by cpu, pid, tid, uid... etc. Another approach I have seen upstream > is to emit extra variables with a KOBJ_CHANGE event, but that is more about > event reporting than extra information about provisioning failure. Interesting. Thanks for the link, it looks like it never landed. I think trace makes a good deal of sense considering all the options. I'm not convinced the interface is "at best redundant". I'll just drop verify(). I have no further arguments in favor and you don't sound convinced of the original ones. > > [1]: https://lwn.net/Articles/657341/