Message-ID: <7b7a15fb1f59acc60393eb01cefddf4dc1f32c00.camel@linux.intel.com>
Subject: Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
From: Thomas Hellström
To: Jason Gunthorpe, Yonatan Maman, kherbst@redhat.com, lyude@redhat.com,
 dakr@redhat.com, airlied@gmail.com, simona@ffwll.ch, leon@kernel.org,
 jglisse@redhat.com, akpm@linux-foundation.org, GalShalom@nvidia.com,
 dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org,
 linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
 linux-mm@kvack.org, linux-tegra@vger.kernel.org
Date: Tue, 04 Feb 2025 10:32:32 +0100
In-Reply-To: <20250203150805.GC2296753@ziepe.ca>
References: <20250128151610.GC1524382@ziepe.ca>
 <20250128172123.GD1524382@ziepe.ca>
 <20250129134757.GA2120662@ziepe.ca>
 <20250130132317.GG2120662@ziepe.ca>
 <20250130174217.GA2296753@ziepe.ca>
 <20250203150805.GC2296753@ziepe.ca>
Organization: Intel Sweden AB, Registration Number: 556189-6027

On Mon, 2025-02-03 at 11:08 -0400, Jason Gunthorpe wrote:
> On Fri, Jan 31, 2025 at 05:59:26PM +0100, Simona Vetter wrote:
>
> > So one aspect where I don't like the pgmap->owner approach much is that
> > it's a big thing to get right, and it feels a bit to me that we don't yet
> > know the right questions.
>
> Well, I would say it isn't really complete yet. No driver has yet
> attempted to use a private interconnect with this scheme. Probably it
> needs more work.
>
> > A bit related is that we'll have to do some driver-specific migration
> > after hmm_range_fault anyway for allocation policies. With coherent
> > interconnect that'd be up to numactl, but for driver private it's all up
> > to the driver. And once we have that, we can also migrate memory around
> > that's misplaced for functional and not just performance reasons.
>
> Are you sure? This doesn't seem to be what any hmm_range_fault() user
> should be doing. hmm_range_fault() is to help mirror the page table
> to a secondary, that is all. Migration policy shouldn't be part of it,
> just mirroring doesn't necessarily mean any access was performed, for
> instance.
>
> And mirroring doesn't track any access done by non-faulting cases
> either.
>
> > The plan I discussed with Thomas a while back at least for gpus was to
> > have that as a drm_devpagemap library,
>
> I would not be happy to see this. Please improve pagemap directly if
> you think you need more things.

These are mainly helpers to migrate and populate a range of cpu memory
space (struct mm_struct) with GPU device_private memory, migrate to
system on gpu memory shortage and implement the migrate_to_vram pagemap
op, tied to gpu device memory allocations, so I don't think there is
anything we should be exposing at the dev_pagemap level at this point?

>
> > which would have a common owner (or
> > maybe per driver or so as Thomas suggested).
>
> Neither really match the expected design here. The owner should be
> entirely based on reachability. Devices that cannot reach each other
> directly should have different owners.

Actually what I'm putting together is a small helper to allocate and
assign an "owner" based on devices that are previously registered to a
"registry".
The caller has to indicate, using a callback function for each
struct device pair, whether there is a fast interconnect available, and
this is expected to be done at pagemap creation time, so I think this
aligns with the above. Initially a "registry" (which is a list of
device-owner pairs) will be driver-local, but could easily have a wider
scope.

This means we handle access control, unplug checks and similar at
migration time, typically before hmm_range_fault(), and the role of
hmm_range_fault() will be to hand over pfns whose backing memory is
directly accessible to the device, else migrate to system.

Device unplug would then be handled by refusing migrations to the
device (gpu drivers would probably use drm_dev_enter()), and then
evicting all device memory after a drm_dev_unplug(). This of course
relies on eviction being more or less failsafe.

/Thomas

>
> > But upfront speccing all this out doesn't seem like a good idea to me,
> > because I honestly don't know what we all need.
>
> This is why it is currently just void *owner :)

Again, with the above I think we are good for now, but having
experimented a lot with the callback, I'm still not convinced by the
performance argument, for the following reasons.

1) Existing users would never use the callback. They can still rely on
   the owner check; only if that fails do we check for callback existence.
2) By simply caching the result from the last checked dev_pagemap, most
   callback calls could typically be eliminated.
3) As mentioned before, a callback call would typically always be
   followed by either migration to ram or a page-table update. Compared
   to these, the callback overhead would IMO be unnoticeable.
4) pcie_p2p is already planning a dev_pagemap callback?

/Thomas

>
> Jason
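
For illustration only, the registry helper and the cached callback check
discussed above could look roughly like the userspace sketch below. None
of this is kernel code or the actual helper: struct dev, struct pagemap,
struct accessor, same_fabric(), registry_assign_owner() and can_access()
are all hypothetical names, and the sketch assumes the "fast
interconnect" relation is transitive, so a new device simply joins the
owner of the first registered device it can reach.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEVS 8

/* Hypothetical stand-in for struct device; not the kernel type. */
struct dev {
	int id;
	int fabric;	/* assumption: same fabric id == fast interconnect */
};

/* Caller-supplied predicate, evaluated per device pair. */
typedef bool (*has_interconnect_fn)(const struct dev *a, const struct dev *b);

/* Example predicate used only for this sketch. */
bool same_fabric(const struct dev *a, const struct dev *b)
{
	return a->fabric == b->fabric;
}

/* The "registry": a list of device-owner pairs. */
struct owner_registry {
	struct { const struct dev *dev; void *owner; } entries[MAX_DEVS];
	size_t count;
	has_interconnect_fn connected;
};

/* Register a device and return its owner token (pagemap creation time). */
void *registry_assign_owner(struct owner_registry *reg, const struct dev *dev)
{
	void *owner = NULL;
	size_t i;

	assert(reg->count < MAX_DEVS);

	/* Reuse the owner of the first registered device we can reach. */
	for (i = 0; i < reg->count; i++) {
		if (reg->connected(reg->entries[i].dev, dev)) {
			owner = reg->entries[i].owner;
			break;
		}
	}
	/* Otherwise mint a fresh token; the slot address is unique. */
	if (!owner)
		owner = &reg->entries[reg->count];

	reg->entries[reg->count].dev = dev;
	reg->entries[reg->count].owner = owner;
	reg->count++;
	return owner;
}

/* Hypothetical stand-in for a dev_pagemap: owner token plus backing device. */
struct pagemap {
	void *owner;
	const struct dev *dev;
};

struct accessor {
	void *owner;
	const struct dev *dev;
	has_interconnect_fn cb;		/* NULL for users relying on owner only */
	const struct pagemap *last;	/* last pagemap checked via cb */
	bool last_ok;			/* cached cb result for 'last' */
	unsigned int cb_calls;		/* instrumentation for the sketch */
};

/* Owner check first; fall back to the callback, caching the last result. */
bool can_access(struct accessor *acc, const struct pagemap *pm)
{
	if (pm->owner == acc->owner)
		return true;
	if (!acc->cb)
		return false;
	if (acc->last != pm) {
		acc->last = pm;
		acc->last_ok = acc->cb(acc->dev, pm->dev);
		acc->cb_calls++;
	}
	return acc->last_ok;
}
```

With three devices where only the first two share a fabric, the first
two end up with the same owner token and the third gets its own. A
repeated lookup against the same foreign pagemap then invokes the
callback only once, which is the effect points 1) and 2) rely on.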