From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 24 Jul 2025 09:47:32 +0300
From: Leon Romanovsky
To: Matthew Brost, Mika Penttilä
Cc: Francois Dugast, airlied@gmail.com, akpm@linux-foundation.org,
	apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com,
	dakr@kernel.org, david@redhat.com, donettom@linux.ibm.com,
	jane.chu@oracle.com, jglisse@redhat.com, kherbst@redhat.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, lyude@redhat.com,
	peterx@redhat.com, ryan.roberts@arm.com, shuah@kernel.org,
	simona@ffwll.ch, wangkefeng.wang@huawei.com, willy@infradead.org,
	ziy@nvidia.com, Balbir Singh, jgg@nvidia.com
Subject: Re: [PATCH] mm/hmm: Do not fault in device private pages owned by the caller
Message-ID: <20250724064732.GQ402218@unreal>
References: <9ae3e014-c7d0-4d58-af0e-925bcd9e4cfd@nvidia.com>
	<20250722193445.1588348-1-francois.dugast@intel.com>
	<023ab16d-f3af-487e-a7ce-929bf7b2fe3e@nvidia.com>
	<368fa1c1-fccc-445d-bd22-0053fd2db29c@redhat.com>
In-Reply-To: <368fa1c1-fccc-445d-bd22-0053fd2db29c@redhat.com>

On Thu, Jul 24, 2025 at 09:04:36AM +0300, Mika Penttilä wrote:
>
> On 7/24/25 08:57, Matthew Brost wrote:
> > On Thu, Jul 24, 2025 at 08:46:11AM +0300, Mika Penttilä wrote:
> >> On 7/24/25 08:02, Matthew Brost wrote:
> >>> On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote:
> >>>> On 7/23/25 05:34, Francois Dugast wrote:
> >>>>> When the PMD swap entry is device private and owned by the caller,
> >>>>> skip the range faulting and instead just set the correct HMM PFNs.
> >>>>> This is similar to the logic for PTEs in hmm_vma_handle_pte().
> >>>>>
> >>>>> For now, each hmm_pfns[i] entry is populated as it is currently done
> >>>>> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> >>>>> optimization could be to make use of the order and skip populating
> >>>>> subsequent PFNs.
> >>>> I think we should test and remove these now.
> >>>>
> >>> +Jason, Leon – perhaps either of you can provide insight into why
> >>> hmm_vma_handle_pmd() fully populates the HMM PFNs when a higher-order
> >>> page is found.
> >>>
> >>> If we can be assured that changing this won't break other parts of the
> >>> kernel, I agree it should be removed. A snippet of documentation should
> >>> also be added indicating that when higher-order PFNs are found,
> >>> subsequent PFNs within the range will remain unpopulated. I can verify
> >>> that GPU SVM works just fine without these PFNs being populated.
> >> afaics the device can also consume the range as smaller pages, and some
> >> HMM users depend on that.
> >>
> > Sure, but I think that should be fixed in the device code. If a
> > large-order PFN is found, the subsequent PFNs can clearly be inferred.
> > It's a micro-optimization here, but devices or callers capable of
> > handling this properly shouldn't force a hacky, less optimal behavior on
> > core code. If anything relies on the current behavior, we should fix it
> > and ensure correctness.
>
> Yes, sure, the device code can be changed, but I meant to say we can't
> just delete those lines without breaking existing users.

Mika is right. The RDMA subsystem and the HMM users there need to be
updated. We have a special flag (IB_ACCESS_HUGETLB) that prepares the
whole RDMA stack to handle large-order PFNs. If this flag is not
provided, we need to fall back to the basic device page size (4k), and
for that we expect a fully populated PFN list.

Thanks
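[Editor's note: below is a rough, self-contained userspace sketch of the
caller-side inference discussed above, not kernel code. If only the first
hmm_pfns[] entry of a higher-order page is populated, a consumer that maps
only at 4k granularity could expand the remaining entries itself, since the
subsequent PFNs are contiguous. The flag/order encoding here (PFN_VALID,
PFN_WRITE, ORDER_SHIFT, expand_to_4k) is made up for illustration, loosely
modeled on include/linux/hmm.h and hmm_pfn_to_map_order(), and is not the
kernel's real layout.]

/*
 * Userspace sketch: expand a sparsely populated PFN array (only the first
 * entry of each higher-order page filled in) into one entry per 4k page.
 * Assumes a 64-bit unsigned long. Constants are illustrative stand-ins.
 */
#include <stdio.h>
#include <stdlib.h>

#define PFN_VALID   (1UL << 63)   /* stand-in for HMM_PFN_VALID */
#define PFN_WRITE   (1UL << 62)   /* stand-in for HMM_PFN_WRITE */
#define ORDER_SHIFT 56            /* stand-in for the order field */
#define ORDER_MASK  0x1FUL
#define FLAG_BITS   (PFN_VALID | PFN_WRITE | (ORDER_MASK << ORDER_SHIFT))

static unsigned int entry_order(unsigned long e)
{
	return (e >> ORDER_SHIFT) & ORDER_MASK;
}

static void expand_to_4k(unsigned long *pfns, unsigned long npages)
{
	unsigned long i = 0;

	while (i < npages) {
		unsigned long e = pfns[i];
		unsigned long flags = e & (PFN_VALID | PFN_WRITE);
		unsigned long pfn = e & ~FLAG_BITS;
		unsigned long n = 1UL << entry_order(e);
		unsigned long j;

		if (i + n > npages)
			n = npages - i;
		/* Subsequent 4k PFNs are contiguous, so fill them in. */
		for (j = 1; j < n; j++)
			pfns[i + j] = flags | (pfn + j);
		i += n;
	}
}

int main(void)
{
	/* One 2M (order-9) device page covering 512 x 4k entries. */
	unsigned long npages = 512;
	unsigned long *pfns = calloc(npages, sizeof(*pfns));

	if (!pfns)
		return 1;
	pfns[0] = PFN_VALID | PFN_WRITE | (9UL << ORDER_SHIFT) | 0x100000UL;
	expand_to_4k(pfns, npages);
	printf("pfns[1]   = 0x%lx\n", pfns[1]);
	printf("pfns[511] = 0x%lx\n", pfns[511]);
	free(pfns);
	return 0;
}

Whether this expansion lives in the consumer (driver/RDMA code) or stays in
hmm_vma_handle_pmd() is exactly the trade-off being debated in this thread.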