Subject: Re: Separating xe_vma- and page-table state
From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
To: Matthew Brost
Cc: "Zeng, Oak", "intel-xe@lists.freedesktop.org"
Date: Thu, 14 Mar 2024 09:57:12 +0100
List-Id: Intel Xe graphics driver

Hi, Matt,

On Wed, 2024-03-13 at 19:43 +0000, Matthew Brost wrote:
> On Wed, Mar 13, 2024 at 11:56:12AM +0100, Thomas Hellström wrote:
> > On Wed, 2024-03-13 at 01:27 +0000, Matthew Brost wrote:
> > > On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > > > Hi Thomas,
> > >
> > >
> >
> > ....
> >
> > > Thomas:
> > >
> > > I like the idea of the VMAs in the PT code functions being marked as
> > > const and having the xe_pt_state as non-const. It makes ownership
> > > very clear.
> > >
> > > Not sure how that will fit into [1], as that series passes around
> > > a "struct xe_vm_ops", which is a list of "struct xe_vma_op". It does
> > > this to make "struct xe_vm_ops" a single atomic operation. The VMAs
> > > are extracted from either the GPUVM base operation or "struct
> > > xe_vma_op". Maybe these can be const? I'll look into that, but this
> > > might not work out in practice.
> > >
> > > Agreed, also unsure how the 1:N xe_vma <-> xe_pt_state relationship
> > > fits in with hmmptrs. Could you explain your thinking here?
> >
> > There is a need for hmmptrs to be sparse. When we fault, we create a
> > chunk of PTEs that we populate. This chunk could potentially be large,
> > covering the whole CPU vma, or it could be limited to, say, 2MiB and
> > aligned to allow for large page-table entries. In Oak's POC these
> > chunks are called "svm ranges".
> >
> > So the question arises: how do we map that to the current vma
> > management and page-table code? There are basically two ways:
> >
> > 1) Split VMAs so they are either fully populated or unpopulated; each
> > svm_range becomes an xe_vma.
> > 2) Create an xe_pt_range / xe_pt_state (or whatever) with a 1:1
> > mapping with the svm_range and a 1:N mapping with xe_vmas.
> >
> > Initially my thinking was that 1) would be the simplest approach with
> > the code we have today. I raised that briefly with Sima and he
> > answered "And why would we want to do that?", and the answer at hand
> > was ofc that the page-table code worked with vmas.
> > Or rather that we mix vma state (the hmmptr range / attributes) and
> > page-table state (the regions of the hmmptr that are actually
> > populated), so it would be a consequence of our current
> > implementation (limitations).
> >
> > With the suggestion to separate vma state and pt state, the xe_svm
> > ranges map to pt state and are managed per hmmptr vma. The vmas
> > would then be split mainly as a result of the UMD mapping something
> > else (a bo) on top, or the UMD giving new memory attributes for a
> > range (madvise-type operations).
> >
>
> Thanks for the explanation.
>
> My 2 cents.
>
> Do an initial implementation based on 1) - split VMAs so they are
> either fully populated or unpopulated; each svm_range becomes an
> xe_vma.
>
> This is what I did in a quick PoC [A] and got a basic system allocator
> working in ~500 loc. Is it fully baked? No, but it's also not all that
> far off.
>
> We get the uAPI in place, agree on all the semantics, get IGTs in
> place, and basic UMD support. We walk before we run...
>
> Then transform the code to 2). The real benefit of 2) is going to be
> the memory footprint of all the state and the fragmentation of the
> GPUVM RB tree that comes with splitting VMAs. I'll state one obvious
> benefit: xe_pt_state < xe_vma in size.
>
> I'll give a quick example of the fragmentation issue too.
>
> User binds 0-2^46 - 1 xe_vma
> Fault - 3 xe_vma (split into 2 unpopulated xe_vma, 1 populated
> xe_vma)
> Free - 3 xe_vma (replace 1 populated xe_vma with 1 unpopulated
> xe_vma)
>
> Do this a couple of million times and we will have tons of
> fragmentation and unnecessary xe_vmas floating around. So pursuing 2)
> is probably worthwhile.
>
> Is 2) going to be quite a bit more complicated to implement? Let me
> give a fairly simple example.
>
> User binds 0-2^46 - 1 xe_vma
> Tons of faults occur - 1 xe_vma, tons of xe_pt_states
> User binds a BO - 3 xe_vma, need to split the xe_pt_states between
> the 2 new xe_vmas, possibly unmap some xe_pt_states, yikes...
>
> While 2) is more complicated, I don't think it is a complete rewrite
> from 1) either. [A] is built on top of [B], which introduces the
> concept of PT OPs (in addition to the existing concepts of GPUVM op /
> xe_vma_op). With a bit more separation I think getting 2) to work
> won't be all that difficult, and the upper layers (IOCTLs, migrate
> code, fault handler, etc.) should be largely unchanged from 1) -> 2).
>
> With a staging plan of 1) -> 2) I think this is doable. Thoughts?
>
> [A] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commits/system_allocator_poc/?ref_type=heads
> [B] https://patchwork.freedesktop.org/series/125608/
>
> Matt

Without looking at the code above, yes, I think this sounds like a good
plan. Oak, it would be good to hear your comments on this as well.

I think, given 1), implementing 2) as a POC wouldn't be too hard.

Then we also need to figure out how both approaches tie into the
possibility of making GPUVM helpers out of this...

/Thomas

>
> > /Thomas