From: Yunsheng Lin <linyunsheng@huawei.com>
To: Alexander Lobakin <aleksander.lobakin@intel.com>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>,
Jesper Dangaard Brouer <hawk@kernel.org>,
Larysa Zaremba <larysa.zaremba@intel.com>,
netdev@vger.kernel.org, Alexander Duyck <alexanderduyck@fb.com>,
Ilias Apalodimas <ilias.apalodimas@linaro.org>,
linux-kernel@vger.kernel.org, Eric Dumazet <edumazet@google.com>,
Michal Kubiak <michal.kubiak@intel.com>,
intel-wired-lan@lists.osuosl.org,
Yunsheng Lin <yunshenglin0825@gmail.com>,
David Christensen <drc@linux.vnet.ibm.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
"David S. Miller" <davem@davemloft.net>
Subject: Re: [Intel-wired-lan] [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)
Date: Wed, 12 Jul 2023 19:13:01 +0800 [thread overview]
Message-ID: <cf7dafbe-decc-623c-6322-d14dd8daee95@huawei.com> (raw)
In-Reply-To: <89dc48ab-0800-b12f-7124-cecc709364d7@intel.com>
On 2023/7/12 0:37, Alexander Lobakin wrote:
> From: Yunsheng Lin <linyunsheng@huawei.com>
> Date: Tue, 11 Jul 2023 19:39:28 +0800
>
>> On 2023/7/10 21:25, Alexander Lobakin wrote:
>>> From: Yunsheng Lin <yunshenglin0825@gmail.com>
>>> Date: Sun, 9 Jul 2023 13:16:33 +0800
>>>
>>>> On 2023/7/7 0:28, Alexander Lobakin wrote:
>>>>> From: Yunsheng Lin <linyunsheng@huawei.com>
>>>>> Date: Thu, 6 Jul 2023 20:47:28 +0800
>>>>>
>>>>>> On 2023/7/5 23:55, Alexander Lobakin wrote:
>>>>>>
>>>>>>> +/**
>>>>>>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>>>>>>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>>>>>>> + * @size: size of the PP, usually simply Rx queue len
>>>>>>> + *
>>>>>>> + * Returns &page_pool on success, casted -errno on failure.
>>>>>>> + */
>>>>>>> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
>>>>>>> + u32 size)
>>>>>>> +{
>>>>>>> + struct page_pool_params pp = {
>>>>>>> + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>>>>>> + .order = LIBIE_RX_PAGE_ORDER,
>>>>>>> + .pool_size = size,
>>>>>>> + .nid = NUMA_NO_NODE,
>>>>>>> + .dev = napi->dev->dev.parent,
>>>>>>> + .napi = napi,
>>>>>>> + .dma_dir = DMA_FROM_DEVICE,
>>>>>>> + .offset = LIBIE_SKB_HEADROOM,
>>>>>>
>>>>>> I think it worth mentioning that the '.offset' is not really accurate
>>>>>> when the page is split, as we do not really know what is the offset of
>>>>>> the frag of a page except for the first frag.
>>>>>
>>>>> Yeah, this is read as "offset from the start of the page or frag to the
>>>>> actual frame start, i.e. its Ethernet header" or "this is just
>>>>> xdp->data - xdp->data_hard_start".
>>>>
>>>> So the problem seems to be if most of drivers have a similar reading as
>>>> libie does here, as .offset seems to have a clear semantics which is used
>>>> to skip dma sync operation for buffer range that is not touched by the
>>>> dma operation. Even if it happens to have the same value of "offset from
>>>> the start of the page or frag to the actual frame start", I am not sure
>>>> it is future-proofing to reuse it.
>>>
>>> Not sure I'm following :s
>>
>> It would be better to avoid accessing the internal data of the page pool
>> directly as much as possible, as that may be changed to different meaning
>> or removed if the implememtation is changed.
>>
>> If it is common enough that most drivers are using it the same way, adding
>> a helper for that would be great.
>
> How comes page_pool_params is internal if it's defined purely by the
> driver and then exists read-only :D I even got warned in the adjacent
> thread that the Page Pool core code shouldn't change it anyhow.
Personally I am not one hundred percent convinced that page_pool_params
will not be changed considering the discussion about improving/replacing
the page pool to support P2P case.
>
>>
>>>
>>>>
>>>> When page frag is added, I didn't really give much thought about that as
>>>> we use it in a cache coherent system.
>>>> It seems we might need to extend or update that semantics if we really want
>>>> to skip dma sync operation for all the buffer ranges that are not touched
>>>> by the dma operation for page split case.
>>>> Or Skipping dma sync operation for all untouched ranges might not be worth
>>>> the effort, because it might need a per frag dma sync operation, which is
>>>> more costly than a batched per page dma sync operation. If it is true, page
>>>> pool already support that currently as my understanding, because the dma
>>>> sync operation is only done when the last frag is released/freed.
>>>>
>>>>>
>>>>>>
>>>>>>> + };
>>>>>>> + size_t truesize;
>>>>>>> +
>>>>>>> + pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
>>>>
>>>> As mentioned above, if we depend on the last released/freed frag to do the
>>>> dma sync, the pp.max_len might need to cover all the frag.
>>>
>>> ^^^^^^^^^^^^
>>>
>>> You mean the whole page or...?
>>
>> If we don't care about the accurate dma syncing, "cover all the frag" means
>> the whole page here, as page pool doesn't have enough info to do accurate
>> dma sync for now.
>>
>>> I think it's not the driver's duty to track all this. We always set
>>> .offset to `data - data_hard_start` and .max_len to the maximum
>>> HW-writeable length for one frame. We don't know whether PP will give us
>>> a whole page or just a piece. DMA sync for device is performed in the PP
>>> core code as well. Driver just creates a PP and don't care about the
>>> internals.
>>
>> There problem is that when page_pool_put_page() is called with a split
>> page, the page pool does not know which frag is freeing too.
>>
>> setting 'maximum HW-writeable length for one frame' only sync the first
>> frag of a page as below:
>
> Maybe Page Pool should synchronize DMA even when !last_frag then?
> Setting .max_len to anything bigger than the maximum frame size you're
> planning to receive is counter-intuitive.
This is simplest way to support dma sync for page frag case, the question
is if the batching of the dma sync for all frag of a page can even out the
additional cost of dma sync for dma untouched range of a page.
> All three xdp_buff, xdp_frame and skb always have all info needed to
> determine which piece of the page we're recycling, it should be possible
> to do with no complications. Hypothetical forcing drivers to do DMA
> syncs on their own when they use frags is counter-intuitive as well,
> Page Pool should be able to handle this itself.
>
> Alternatively, Page Pool may do as follows:
>
> 1. !last_frag -- do nothing, same as today.
> 2. last_frag -- sync, but not [offset, offset + max_len), but
> [offset, PAGE_SIZE).
When a frag is free, we don't know if it is the last frag or not for
now yet.
>
> This would also cover non-HW-writeable pieces like 2th-nth frame's
> headroom and each frame's skb_shared_info, but it's the only alternative
> to syncing each frag separately.
> Yes, it's almost the same as to set .max_len to %PAGE_SIZE, but as I
> said, it feels weird to set .max_len to 4k when you allocate 2k frags.
> You don't know anyway how much of a page will be used.
In that that case, we may need to make it more generic for the case when
a page is spilt into more than two frags, especially for system with 64K
page size.
_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
next prev parent reply other threads:[~2023-07-12 15:20 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-05 15:55 [Intel-wired-lan] [PATCH RFC net-next v4 0/9] net: intel: start The Great Code Dedup + Page Pool for iavf Alexander Lobakin
2023-07-05 15:55 ` [Intel-wired-lan] [PATCH RFC net-next v4 1/9] net: intel: introduce Intel Ethernet common library Alexander Lobakin
2023-07-14 14:17 ` Przemek Kitszel
2023-07-05 15:55 ` [Intel-wired-lan] [PATCH RFC net-next v4 2/9] iavf: kill "legacy-rx" for good Alexander Lobakin
2023-07-14 14:17 ` Przemek Kitszel
2023-07-05 15:55 ` [Intel-wired-lan] [PATCH RFC net-next v4 3/9] iavf: drop page splitting and recycling Alexander Lobakin
2023-07-06 14:47 ` Alexander Duyck
2023-07-06 16:45 ` Alexander Lobakin
2023-07-06 17:06 ` Alexander Duyck
2023-07-10 13:13 ` Alexander Lobakin
2023-07-05 15:55 ` [Intel-wired-lan] [PATCH RFC net-next v4 4/9] net: page_pool: add DMA-sync-for-CPU inline helpers Alexander Lobakin
2023-07-05 15:55 ` [Intel-wired-lan] [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool) Alexander Lobakin
2023-07-06 12:47 ` Yunsheng Lin
2023-07-06 16:28 ` Alexander Lobakin
2023-07-09 5:16 ` Yunsheng Lin
2023-07-10 13:25 ` Alexander Lobakin
2023-07-11 11:39 ` Yunsheng Lin
2023-07-11 16:37 ` Alexander Lobakin
2023-07-12 11:13 ` Yunsheng Lin [this message]
2023-07-05 15:55 ` [Intel-wired-lan] [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool Alexander Lobakin
2023-07-06 12:47 ` Yunsheng Lin
2023-07-06 16:38 ` Alexander Lobakin
2023-07-09 5:16 ` Yunsheng Lin
2023-07-10 13:34 ` Alexander Lobakin
2023-07-11 11:47 ` Yunsheng Lin
2023-07-18 13:56 ` Alexander Lobakin
2023-07-06 15:26 ` Alexander Duyck
2023-07-06 16:56 ` Alexander Lobakin
2023-07-06 17:28 ` Alexander Duyck
2023-07-10 13:18 ` Alexander Lobakin
2023-07-05 15:55 ` [Intel-wired-lan] [PATCH RFC net-next v4 7/9] libie: add common queue stats Alexander Lobakin
2023-07-05 15:55 ` [Intel-wired-lan] [PATCH RFC net-next v4 8/9] libie: add per-queue Page Pool stats Alexander Lobakin
2023-07-05 15:55 ` [Intel-wired-lan] [PATCH RFC net-next v4 9/9] iavf: switch queue stats to libie Alexander Lobakin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cf7dafbe-decc-623c-6322-d14dd8daee95@huawei.com \
--to=linyunsheng@huawei.com \
--cc=aleksander.lobakin@intel.com \
--cc=alexanderduyck@fb.com \
--cc=davem@davemloft.net \
--cc=drc@linux.vnet.ibm.com \
--cc=edumazet@google.com \
--cc=hawk@kernel.org \
--cc=ilias.apalodimas@linaro.org \
--cc=intel-wired-lan@lists.osuosl.org \
--cc=kuba@kernel.org \
--cc=larysa.zaremba@intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=michal.kubiak@intel.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=pmenzel@molgen.mpg.de \
--cc=yunshenglin0825@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox