Subject: Re: Separating xe_vma- and page-table state
From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
To: "Zeng, Oak", "Brost, Matthew"
Cc: intel-xe@lists.freedesktop.org
Date: Thu, 14 Mar 2024 09:52:23 +0100
Message-ID: <5165406f368cc023a5d0fd9879e33b8ac01d8aa7.camel@linux.intel.com>
List-Id: Intel Xe graphics driver

On Wed, 2024-03-13 at 17:06 +0000, Zeng, Oak wrote:
> Hi Thomas,
>
> For simplicity of the discussion, let's forget about BO vm_bind,
> forget about memory attributes for a moment... Only consider the
> system allocator.
> So with the scheme below, we have a gigantic xe_vma in the
> background holding some immutable state, never split. And we have
> mutable page-table state which is created during GPU access and
> destroyed during CPU munmap/invalidation, dynamically.
>
> For the mutable page-table state, you would maintain another RB-tree
> so you can search it, as I did in the POC; the tree is in xe_svm. For
> the BO driver you don't need this extra tree, you just need the
> xe_vma tree, as xe_vma has a 1:1 mapping with page-table state for
> the BO driver...
>
> I saw this scheme can be aligned with my POC....
>
> Mapping this scheme to the userptr "free without vm_unbind" thing, I
> can see that when the user frees, we can destroy the page-table state
> during the mmu notifier callback, while keeping the xe_vma. Is this
> also how you look at it?
>
> Need to say, the "free without vm_unbind" thing should only affect
> our decision temporarily: once the system allocator is ready, UMD
> wouldn't need the userptr vm_bind anymore, so the problem will be
> solved more cleanly with the system allocator - UMD just removes the
> vm_bind and things work with the system allocator. I guess what the
> user really needs is a system allocator, but we didn't have one at
> the time, so userptr technology was used. Long term, the system
> allocator should eventually replace userptr.

I mostly agree on the above, I think.

>
> One thing I can't picture clearly is, how hard is it to change the
> current xekmd to separate xe_vma into mutable and immutable state?

It's not that hard at all, it's mostly a matter of changing the
xe_pt.c interfaces. An obstacle, though, is that we don't want to do
this before Matt's big vm_bind refactoring is reviewed and in place.

>
> Is the split scheme, with xe_vma maintaining both mutable and
> immutable state, simpler? It doesn't have the xe_svm concept. No
> xe_svm_range / page-table state, a single RB tree per gpuvm, no need
> to re-construct xe_vma... Depending on how we want to solve the
> multiple-device problem, the xe_svm concept can come back, though...

For the ordinary VMA types we have today, userptr / BO / NULL, it's
neither simpler nor more complex IMO, but it makes the code clearer
and hopefully easier to maintain.

For hmmptr / SVM it's too early to answer. Here it really depends on
whether 1) we do a 1:1 mapping between xe_vma and svm_range, or
2) we do a 1:N mapping of xe_vma and svm_range. Probably both
approaches have their benefits, so I'd tend to favour Matt's
suggestion that we start off with 1), make it work, and then do a POC
with 2) to see what it looks like.

Comments, suggestions?

/Thomas

>
> Oak
>
> > -----Original Message-----
> > From: Thomas Hellström
> > Sent: Wednesday, March 13, 2024 6:56 AM
> > To: Brost, Matthew; Zeng, Oak
> > Cc: intel-xe@lists.freedesktop.org
> > Subject: Re: Separating xe_vma- and page-table state
> >
> > On Wed, 2024-03-13 at 01:27 +0000, Matthew Brost wrote:
> > > On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > > > Hi Thomas,
> > >
> >
> > ....
> >
> > > Thomas:
> > >
> > > I like the idea of VMAs in the PT code functions being marked as
> > > const and having the xe_pt_state as non-const. It makes ownership
> > > very clear.
> > >
> > > Not sure how that will fit into [1], as that series passes around
> > > a "struct xe_vm_ops", which is a list of "struct xe_vma_op". It
> > > does this to make "struct xe_vm_ops" a single atomic operation.
> > > The VMAs are extracted from either the GPUVM base operation or
> > > "struct xe_vma_op". Maybe these can be const? I'll look into
> > > that, but this might not work out in practice.
> > >
> > > Also agree, unsure how a 1:N xe_vma <-> xe_pt_state relationship
> > > fits in with hmmptrs. Could you explain your thinking here?
> >
> > There is a need for hmmptrs to be sparse. When we fault we create a
> > chunk of PTEs that we populate. This chunk could potentially be
> > large, covering the whole CPU vma, or it could be limited to, say,
> > 2MiB and aligned to allow for large page-table entries. In Oak's
> > POC these chunks are called "svm ranges".
> >
> > So the question arises, how do we map that to the current vma
> > management and page-table code? There are basically two ways:
> >
> > 1) Split VMAs so they are either fully populated or unpopulated;
> > each svm_range becomes an xe_vma.
> > 2) Create an xe_pt_range / xe_pt_state or similar with a 1:1
> > mapping with the svm_range and a 1:N mapping with xe_vmas.
> >
> > Initially my thinking was that 1) would be the simplest approach
> > with the code we have today. I raised that briefly with Sima and he
> > answered "And why would we want to do that?", and the answer at
> > hand was of course that the page-table code works with vmas. Or
> > rather, that we mix vma state (the hmmptr range / attributes) and
> > page-table state (the regions of the hmmptr that are actually
> > populated), so it would be a consequence of our current
> > implementation (limitations).
> >
> > With the suggestion to separate vma state and pt state, the xe_svm
> > ranges map to pt state and are managed per hmmptr vma. The vmas
> > would then be split mainly as a result of UMD mapping something
> > else (a BO) on top, or UMD giving new memory attributes for a range
> > (madvise-type operations).
> >
> > /Thomas