Date: Sun, 21 Dec 2025 14:19:15 +0200
From: Leon Romanovsky
To: Hou Tao
Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org,
	linux-nvme@lists.infradead.org, Bjorn Helgaas, Logan Gunthorpe,
	Alistair Popple, Greg Kroah-Hartman, Tejun Heo, "Rafael J. Wysocki",
	Danilo Krummrich, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	houtao1@huawei.com
Subject: Re: [PATCH 00/13] Enable compound page for p2pdma memory
Message-ID: <20251221121915.GJ13030@unreal>
References: <20251220040446.274991-1-houtao@huaweicloud.com>
In-Reply-To: <20251220040446.274991-1-houtao@huaweicloud.com>

On Sat, Dec 20, 2025 at 12:04:33PM +0800, Hou Tao wrote:
> From: Hou Tao
>
> Hi,
>
> device-dax already supports compound pages. This not only reduces the
> overhead of struct page significantly, it also improves the performance
> of get_user_pages() when a 2MB or 1GB page size is used. We are
> experimenting with using p2pdma to directly transfer the contents of an
> NVMe SSD into an NPU.
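As a quick back-of-envelope illustration of why larger folios help
get_user_pages() (my own sketch, not a figure from the patch set): the number
of per-page units GUP has to walk and pin shrinks by the folio order.

```python
# Illustrative only: how many page-sized units get_user_pages() must
# process to pin a 1 GiB region, for the three page sizes mentioned
# in the cover letter.
GiB = 1 << 30

def pages_to_pin(region_bytes, page_bytes):
    """Number of page-sized units covering the region."""
    return region_bytes // page_bytes

base = pages_to_pin(GiB, 4 << 10)   # 4 KiB base pages
pmd  = pages_to_pin(GiB, 2 << 20)   # 2 MiB compound pages
pud  = pages_to_pin(GiB, 1 << 30)   # 1 GiB compound pages

print(base, pmd, pud)  # 262144 512 1
```

So a 2MB folio already cuts the per-page work by a factor of 512; the actual
speedup of course depends on the GUP fast path and refcounting details.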
I’ll admit my understanding here is limited, and lately everything tends
to look like a DMABUF problem to me. Could you explain why DMABUF support
is not being used for this use case?

Thanks

> The size of the NPU HBM is 32GB or larger, and there are at most 8 NPUs
> in the host. When using base pages, the memory overhead is about 4GB for
> 128GB of HBM, and mapping 32GB of HBM into userspace takes about 0.8
> seconds. Since the ZONE_DEVICE memory type already supports compound
> pages, this series enables compound page support for p2pdma memory as
> well. After applying the patch set, when using 1GB pages, the memory
> overhead is about 2MB and the mmap costs about 0.04 ms.
>
> The main difference between the compound page support in device-dax and
> p2pdma is that p2pdma inserts the pages into the user VMA during mmap
> instead of at page fault time. The main reason is simplicity. The patch
> set is structured as follows:
>
> Patch #1~#2: tiny bug fixes for p2pdma.
> Patch #3~#5: add callback support in kernfs and sysfs, including
> pagesize, may_split and get_unmapped_area. These callbacks are necessary
> to support compound pages when mmapping a sysfs binary file.
> Patch #6~#7: create compound pages for p2pdma memory in the kernel.
> Patch #8~#10: support the mapping of compound pages in userspace.
> Patch #11~#12: support compound pages for the NVMe CMB.
> Patch #13: enable compound page support for p2pdma memory.
>
> Please see the individual patches for more details. Comments and
> suggestions are always welcome.
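To put the quoted overhead numbers in context, here is a rough sketch of the
per-page metadata cost when every 4 KiB base page gets its own struct page
entry. This assumes the usual 64-byte struct page; the exact figure depends
on the kernel configuration, which is presumably why the cover letter's
numbers are somewhat larger.

```python
# Rough vmemmap (struct page array) size for HBM sizes from the cover
# letter, assuming one 64-byte struct page per 4 KiB base page.
# Illustrative only; the real cost depends on the kernel config.
STRUCT_PAGE = 64        # bytes per struct page (typical on x86-64)
BASE_PAGE = 4 << 10     # 4 KiB

def vmemmap_bytes(mem_bytes):
    """struct page array size when every 4 KiB page has its own entry."""
    return mem_bytes // BASE_PAGE * STRUCT_PAGE

print(vmemmap_bytes(32 << 30) >> 20)    # 32 GiB HBM  -> 512 (MiB)
print(vmemmap_bytes(128 << 30) >> 30)   # 128 GiB HBM -> 2 (GiB)
```

With 1 GiB compound pages the cover letter reports roughly 2MB for 32GB, i.e.
the bulk of the per-base-page vmemmap entries can be deduplicated away, which
is where the three-orders-of-magnitude saving comes from.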
>
> Hou Tao (13):
>   PCI/P2PDMA: Release the per-cpu ref of pgmap when vm_insert_page()
>     fails
>   PCI/P2PDMA: Fix the warning condition in p2pmem_alloc_mmap()
>   kernfs: add support for get_unmapped_area callback
>   kernfs: add support for may_split and pagesize callbacks
>   sysfs: support get_unmapped_area callback for binary file
>   PCI/P2PDMA: add align parameter for pci_p2pdma_add_resource()
>   PCI/P2PDMA: create compound page for aligned p2pdma memory
>   mm/huge_memory: add helpers to insert huge page during mmap
>   PCI/P2PDMA: support get_unmapped_area to return aligned vaddr
>   PCI/P2PDMA: support compound page in p2pmem_alloc_mmap()
>   PCI/P2PDMA: add helper pci_p2pdma_max_pagemap_align()
>   nvme-pci: introduce cmb_devmap_align module parameter
>   PCI/P2PDMA: enable compound page support for p2pdma memory
>
>  drivers/accel/habanalabs/common/hldio.c |   3 +-
>  drivers/nvme/host/pci.c                 |  10 +-
>  drivers/pci/p2pdma.c                    | 140 ++++++++++++++++++++++--
>  fs/kernfs/file.c                        |  79 +++++++++++++
>  fs/sysfs/file.c                         |  15 +++
>  include/linux/huge_mm.h                 |   4 +
>  include/linux/kernfs.h                  |   3 +
>  include/linux/pci-p2pdma.h              |  30 ++++-
>  include/linux/sysfs.h                   |   4 +
>  mm/huge_memory.c                        |  66 +++++++++++
>  10 files changed, 339 insertions(+), 15 deletions(-)
>
> --
> 2.29.2
>
>