From: Mina Almasry
Date: Mon, 7 Jul 2025 14:55:11 -0700
Subject: Re: [RFC net-next 1/4] net: Allow non parent devices to be used for ZC DMA
To: Dragos Tatulea
Cc: Parav Pandit, Jakub Kicinski, asml.silence@gmail.com, Andrew Lunn,
 David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman, Saeed Mahameed,
 Tariq Toukan, Cosmin Ratiu, netdev@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20250702172433.1738947-1-dtatulea@nvidia.com>
 <20250702172433.1738947-2-dtatulea@nvidia.com>
 <20250702113208.5adafe79@kernel.org>
 <20250702135329.76dbd878@kernel.org>
 <22kf5wtxym5x3zllar7ek3onkav6nfzclf7w2lzifhebjme4jb@h4qycdqmwern>

On Mon, Jul 7, 2025 at 2:35 PM Dragos Tatulea wrote:
>
> On Mon, Jul 07, 2025 at 11:44:19AM -0700, Mina Almasry wrote:
> > On Fri, Jul 4, 2025 at 6:11 AM Dragos Tatulea wrote:
> > >
> > > On Thu, Jul 03, 2025 at 01:58:50PM +0200, Parav Pandit wrote:
> > > >
> > > > > From: Jakub Kicinski
> > > > > Sent: 03 July 2025 02:23 AM
> > > >
> > > > [...]
> > > > > Maybe someone with closer understanding can chime in. If the kind of
> > > > > subfunctions you describe are expected, and there's a generic way of
> > > > > recognizing them -- automatically going to parent of parent would indeed be
> > > > > cleaner and less error prone, as you suggest.
> > > >
> > > > I am not sure when the parent-of-parent assumption would fail, but it can be
> > > > a good start.
> > > >
> > > > If the 8-byte netdev extension to store dma_dev is a concern,
> > > > perhaps a netdev IFF_DMA_DEV_PARENT flag could be an elegant way to refer
> > > > to parent->parent? So that there is no guesswork in the devmem layer.
> > > >
> > > > That said, my understanding of devmem is limited, so I could be mistaken here.
> > > >
> > > > In the long term, the devmem infrastructure likely needs to be
> > > > modernized to support queue-level DMA mapping.
> > > > This is useful because drivers like mlx5 already support
> > > > socket-direct netdevs that span two PCI devices.
> > > >
> > > > Currently, devmem is limited to a single PCI device per netdev.
> > > > While the buffer pool could be per device, the actual DMA
> > > > mapping might need to be deferred until buffer posting
> > > > time to support such multi-device scenarios.
> > > >
> > > > In an offline discussion, Dragos mentioned that io_uring already
> > > > operates at the queue level; maybe some ideas can be picked up
> > > > from io_uring?
> > >
> > > The problem for devmem is that the device-based API is already set in
> > > stone, so I'm not sure how we can change this. Maybe Mina can chime in.
> > >
> >
> > I think what's being discussed here is pretty straightforward and
> > doesn't need UAPI changes, right?
> > Or were you referring to another API?
> >
> I was referring to the fact that devmem takes one big buffer, maps it
> for a single device (in net_devmem_bind_dmabuf()) and then assigns it to
> queues in net_devmem_bind_dmabuf_to_queue(). As the single buffer is
> part of the API, I don't see how the mapping could be done in a per-queue
> way.
>

Oh, I see. devmem does support mapping a single buffer to multiple
queues in a single netlink API call, but there is nothing stopping the
user from mapping N buffers to N queues in N netlink API calls.

> > > To sum the conversation up, there are 2 imperfect and overlapping
> > > solutions:
> > >
> > > 1) For the common case of having a single PCI device per netdev, going one
> > >    parent up if the parent device is not DMA capable would be a good
> > >    starting point.
> > >
> > > 2) For multi-PF netdev [0], a per-queue get_dma_dev() op would be ideal
> > >    as it provides the right PF device for the given queue.
> >
> > Agreed, these are the 2 options.
> >
> > > io_uring could use this but devmem can't. Devmem could use #1, but the
> > > driver has to detect and block the multi-PF case.
> > >
> >
> > Why? AFAICT both io_uring and devmem are in the exact same boat right
> > now, and your patchset seems to show that? Both use dev->dev.parent as
> > the mapping device, and AFAIU you want to use dev->dev.parent->parent
> > or something like that?
> >
> Right. My patches show that. But the issue raised by Parav is different:
> different queues can belong to different DMA devices from different
> PFs in the case of multi-PF netdev.
>
> io_uring can do it because it maps individual buffers to individual
> queues. So it would be trivial to get the DMA device of each queue through
> a new queue op.
>

Right, devmem doesn't stop you from mapping individual buffers to
individual queues. It just also supports mapping the same buffer to
multiple queues.

AFAIR, io_uring also supports mapping a single buffer to multiple
queues, but I could easily be very wrong about that. It's just a vague
recollection from reviewing the iozcrx.c implementation a while back.

In your case, I think, if the user is trying to map a single buffer to
multiple queues, and those queues have different dma-devices, then you
have to error out. I don't see how to sanely handle that without
adding a lot of code. The user would have to fall back onto mapping a
single buffer to a single queue (or multiple queues that share the
same dma-device).

> > Also AFAIU the driver won't need to block the multi-PF case, it's
> > actually core that would need to handle that. For example, if devmem
> > wants to bind a dmabuf to 4 queues, but queues 0 & 1 use one dma device
> > and queues 2 & 3 use another dma device, then core doesn't know what
> > to do, because it can't map the dmabuf to both devices at once. The
> > restriction would be at bind time that all the queues being bound
> > have the same dma device. Core would need to check that and return an
> > error if the devices diverge. I imagine all of this is the same for
> > io_uring, unless I'm missing something.
> >
> Agreed. Currently I didn't see an API for multi-PF netdev to expose
> this information, so my thinking defaulted to "let's block it from the
> driver side".
>

Agreed.

> > > I think we need both. Either that or a netdev op with an optional queue
> > > parameter. Any thoughts?
> > >
> >
> > At the moment, from your description of the problem, I would lean to
> > going with Jakub's approach and handling the common case via #1.
> > If more use cases require a very custom dma device to be passed, we
> > can always move to #2 later, but FWIW I don't see a reason to come up
> > with a super future-proof, complicated solution right now. I'm
> > happy to hear disagreements, though.
>
> But we also don't want to start off on the wrong foot when we know of
> both issues right now. And I think we can wrap it up nicely in a single
> function, similarly to how the current patch does it.
>

FWIW I don't have a strong preference. I'm fine with the simple
solution for now and I'm fine with the slightly more complicated
future-proof solution.

--
Thanks,
Mina
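
A minimal, untested sketch of what option #1 discussed above (walking up
from dev->dev.parent until a DMA-capable device is found) could look like.
netdev_get_dma_dev() is a made-up name, and checking dma_mask is only a
stand-in for whatever "DMA capable" predicate ends up being used:

#include <linux/device.h>
#include <linux/netdevice.h>

/* Resolve the device to DMA-map against for a given netdev.  For most
 * devices this is simply dev->dev.parent; for devices whose immediate
 * parent cannot do DMA (e.g. subfunctions backed by an auxiliary device),
 * keep walking up until a DMA-capable ancestor is found.
 */
static struct device *netdev_get_dma_dev(const struct net_device *dev)
{
        struct device *d = dev->dev.parent;

        while (d && !d->dma_mask)
                d = d->parent;

        return d;
}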
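
And a rough sketch of the bind-time restriction for the multi-PF case,
assuming a per-queue hook along the lines of option #2.
netdev_queue_get_dma_dev() does not exist today; it is named here purely
for illustration:

#include <linux/device.h>
#include <linux/errno.h>
#include <linux/netdevice.h>

/* Hypothetical per-queue hook (option #2); not an existing API. */
struct device *netdev_queue_get_dma_dev(struct net_device *dev, u32 rxq_idx);

/* Core-side check at bind time: a single dma-buf can only be bound to a
 * set of rx queues if they all resolve to the same DMA device; otherwise
 * core has no single device to map the buffer for.
 */
static int net_devmem_check_dma_dev(struct net_device *dev,
                                    const u32 *rxq_idx, unsigned int nq)
{
        struct device *dma_dev = NULL;
        unsigned int i;

        for (i = 0; i < nq; i++) {
                struct device *d = netdev_queue_get_dma_dev(dev, rxq_idx[i]);

                if (!d)
                        return -EOPNOTSUPP;
                if (dma_dev && d != dma_dev)
                        return -EINVAL; /* queues span different PFs */
                dma_dev = d;
        }

        return 0;
}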