Date: Fri, 12 Jun 2026 14:50:18 +0000
From: Pranjal Shrivastava
To: Jason Gunthorpe
Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
 kvm@vger.kernel.org, Bjorn Helgaas, Logan Gunthorpe, Alex Williamson,
 Kevin Tian, Ankit Agrawal, Matt Evans, Vivek Kasireddy, Leon Romanovsky,
 Shivaji Kant, Samiullah Khawaja
Subject: Re: [RFC PATCH 0/5] vfio/pci: Support ZONE_DEVICE-backed P2P Registration
References: <20260610151853.3608948-1-praan@google.com>
 <20260610162848.GO2764304@ziepe.ca>
 <20260611221447.GH1066031@ziepe.ca>
In-Reply-To: <20260611221447.GH1066031@ziepe.ca>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jun 11, 2026 at 07:14:47PM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 11, 2026 at 02:40:17PM +0000, Pranjal Shrivastava wrote:
> > On Wed, Jun 10, 2026 at 01:28:48PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 10, 2026 at 03:18:48PM +0000, Pranjal Shrivastava wrote:
> > >
> > > > Users utilize the standard sysfs p2pmem/allocate interface for managing
> > > > memory slices once a BAR is registered.
> > >
> > > I'm shocked someone wants to use API, what are you expecting to do
> > > with it??
> >
> > Our primary use-case is PCIe BAR (DDR / HBM) -> NFS via P2PDMA while the
> > PCIe device is managed by a user-space driver based on vfio-pci.
> > While kernel drivers (e.g.drm) can register BARs with ZONE_DEVICE
> > natively to enable this, VFIO currently lacks an equivalent mechanism.
>
> I mean the weird sysfs mmap API. It is only useful if the device is
> basically pure memory with no functionality. You can't even learn what
> MMIO offset the returned allocation gives so it is almost completely
> useless.
>
> nvme could use it because CMB is pure memory and you reference it by
> its MMIO address, but that doesn't apply to VFIO..
>

Ack, I agree: the sysfs allocation interface doesn't provide offset-level
control. I'll pivot entirely to the DMABUF approach.

> > > > An alternative implementation has been explored which integrates with the
> > > > ongoing VFIO DMABUF-mmap refactor [1]. In that approach, rather than
> > > > registering a BAR as a system-wide P2P provider, VFIO optionally
> > > > allocates ZONE_DEVICE pages only for specifically exported DMABUFs via a
> > > > new VFIO_DMA_BUF_FLAG_ALLOC_STRUCT_PAGES flag.
> > >
> > > That's probably more sensible but you can't have a DMABUF mmap
> > > actually install non-special memory. The native vfio mmap still can,
> > > but not mmap on the dmabuf fd. That's still workable, just keep in
> > > mind.
> >
> > Ack. I guess, we could have a separate mmap path in case of BARs that are
> > struct page backed which doesn't go through the dmabuf exporter.
>
> The dmabuf export is perfectly fine, you just have to think very
> carefully about the mmap path.
>
> I suppose if you build the proper revocation fence for zone device
> pages as part of the vfio implementation it would be OK for dmabuf
> mmap to expose them as well since it would have the right lifecycle
> model.
>

Ack, I'll move forward with adding a flag to request a ZONE_DEVICE-backed
DMABUF export (the 'Alternative Approach' mentioned in the cover letter).
And yes, I agree that the mmap path needs to be handled carefully, with
the correct lifecycle in mind.

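To make the opt-in idea a bit more concrete, below is a rough userspace
model of how the flag check could look. It is only a sketch under stated
assumptions, not the planned patch: every sketch_* / SKETCH_* name, the
simplified request struct, the bit value, and the page-alignment rule are
hypothetical stand-ins and do not reflect the real layout in <linux/vfio.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one exported slice of a BAR. */
struct sketch_dma_range {
	uint64_t offset;
	uint64_t length;
};

/* Simplified, hypothetical request; NOT the real uAPI struct layout. */
struct sketch_dma_buf_req {
	uint32_t flags;
	uint32_t nr_ranges;
	const struct sketch_dma_range *ranges;
};

/* Hypothetical bit value; the thread does not define an encoding. */
#define SKETCH_DMA_BUF_FLAG_ZONE_DEVICE	(1u << 0)
#define SKETCH_DMA_BUF_VALID_FLAGS	SKETCH_DMA_BUF_FLAG_ZONE_DEVICE

/* Assumed page size, for illustration only. */
#define SKETCH_PAGE_SIZE		4096u

/*
 * Returns 0 if the request is acceptable and -1 otherwise. On success,
 * *zone_device reports whether struct-page backing was requested.
 */
int sketch_validate(const struct sketch_dma_buf_req *req, bool *zone_device)
{
	size_t i;

	assert(req != NULL && zone_device != NULL);

	/* Reject unknown bits so they stay available for later extensions. */
	if (req->flags & ~SKETCH_DMA_BUF_VALID_FLAGS)
		return -1;

	*zone_device = (req->flags & SKETCH_DMA_BUF_FLAG_ZONE_DEVICE) != 0;
	if (!*zone_device)
		return 0;	/* plain export: no extra checks in this sketch */

	/* Assumption: struct-page backing needs whole, non-empty pages. */
	for (i = 0; i < req->nr_ranges; i++) {
		const struct sketch_dma_range *r = &req->ranges[i];

		if (r->length == 0 ||
		    ((r->offset | r->length) & (SKETCH_PAGE_SIZE - 1)))
			return -1;
	}
	return 0;
}
```

In this model a caller that leaves the flags at zero gets the existing
behaviour unchanged, and only an export that sets the flag pays for the
stricter checks, which is the point of making ZONE_DEVICE backing
per-export rather than per-BAR.
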
> That's the tricky thing with zone_device, you have to be careful to
> wait for all the page references to be put back at all the right
> times.

Yeah, that's going to be tricky. I'm wondering whether we could use a zap
model there. If the device is gone or going through a reset, could we
handle the refcounts accordingly?

>
> Come to think of it, since the sysfs API cannot do that in the way
> VFIO wants I actually think you can't use it..

Ack. Baking this into the VFIO DMABUF allows us to enforce the right
lifecycle.

My plan for RFC v2 is to add a flag like VFIO_DMA_BUF_FLAG_ZONE_DEVICE to
struct vfio_device_feature_dma_buf, which allows the caller to opt in to
ZONE_DEVICE backing specifically for that export.

Does this opt-in flag sound like a reasonable uAPI, or do you see any
concerns with this direction? Otherwise, as you noted, the lifecycle and
the mmap path remain the main problems to solve.

Thanks,
Praan