netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Yunsheng Lin <yunshenglin0825@gmail.com>
To: Alexander H Duyck <alexander.duyck@gmail.com>,
	Yunsheng Lin <linyunsheng@huawei.com>,
	davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Lorenzo Bianconi <lorenzo@kernel.org>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	Eric Dumazet <edumazet@google.com>
Subject: Re: [PATCH net-next v2 1/3] page_pool: unify frag page and non-frag page handling
Date: Sat, 3 Jun 2023 20:59:13 +0800	[thread overview]
Message-ID: <21f2fe04-8674-cc73-7a6c-cb0765c84dba@gmail.com> (raw)
In-Reply-To: <160fea176559526b38a4080cc0f52666634a7a4f.camel@gmail.com>

On 2023/6/3 0:37, Alexander H Duyck wrote:
...

>>
>> Please let me know if the above makes sense, or if misunderstood your
>> concern here.
> 
> So my main concern is that what this is doing is masking things so that
> the veth and virtio_net drivers can essentially lie about the truesize
> of the memory they are using in order to improve their performance by
> misleading the socket layer about how much memory it is actually
> holding onto.
> 
> We have historically had an issue with reporting the truesize of
> fragments, but generally the underestimation was kept to something less
> than 100% because pages were generally split by at least half. Where it
> would get messy is if a misbehaviing socket held onto packets for an
> exceedingly long time.
> 
> What this patch set is doing is enabling explicit lying about the
> truesize, and it compounds that by allowing for mixing small
> allocations w/ large ones.
> 
>>>
>>> The problem is there are some architectures where we just cannot
>>> support having pp_frag_count due to the DMA size. So it makes sense to
>>> leave those with just basic page pool instead of trying to fake that it
>>> is a fragmented page.
>>
>> It kind of depend on how you veiw it, this patch view it as only supporting
>> one frag when we can't support having pp_frag_count, so I would not call it
>> faking.
> 
> So the big thing that make it "faking" is the truesize underestimation
> that will occur with these frames.

Let's discuss truesize issue in patch 2 instead of here.
Personally, I still believe that if the driver can compute the
truesize correctly by manipulating the page->pp_frag_count and
frag offset directly, the page pool can do that too.

> 
>>
>>>
>>>> ---

...

>>>
>>> What is the point of this line? It doesn't make much sense to me. Are
>>> you just trying to force an optiimization? You would be better off just
>>> taking the BUILD_BUG_ON contents and feeding them into an if statement
>>> below since the statement will compile out anyway.
>>
>> if the "if statement" you said refers to the below, then yes.
>>
>>>> +		if (!__builtin_constant_p(nr))
>>>> +			atomic_long_set(&page->pp_frag_count, 1);
>>
>> But it is a *BUILD*_BUG_ON(), isn't it compiled out anywhere we put it?
>>
>> Will move it down anyway to avoid confusion.
> 
> Actually now that I look at this more it is even more confusing. The
> whole point of this function was that we were supposed to be getting
> pp_frag_count to 0. However you are setting it to 1.
> 
> This is seriously flawed. If we are going to treat non-fragmented pages
> as mono-frags then that is what we should do. We should be pulling this
> acounting into all of the page pool freeing paths, not trying to force
> the value up to 1 for the non-fragmented case.

I am not sure I understand what do you mean by 'non-fragmented ',
'mono-frags', 'page pool freeing paths' and 'non-fragmented case'
here. maybe describe it more detailed with something like the
pseudocode?

> 
>>>
>>> It seems like what you would want here is:
>>> 	BUG_ON(!PAGE_POOL_DMA_USE_PP_FRAG_COUNT);
>>>
>>> Otherwise you are potentially writing to a variable that shouldn't
>>> exist.
>>
>> Not if the driver use the page_pool_alloc_frag() API instead of manipulating
>> the page->pp_frag_count directly using the page_pool_defrag_page() like mlx5.
>> The mlx5 call the page_pool_create() with with PP_FLAG_PAGE_FRAG set, and
>> it does not seems to have a failback for PAGE_POOL_DMA_USE_PP_FRAG_COUNT
>> case, and we may need to keep PP_FLAG_PAGE_FRAG for it. That's why we need
>> to keep the driver from implementation detail(pp_frag_count handling specifically)
>> of the frag support unless we have a very good reason.
>>
> 
> Getting the truesize is that "very good reason". The fact is the
> drivers were doing this long before page pool came around. Trying to
> pull that away from them is the wrong way to go in my opinion.

If the truesize is really the concern here, I think it make more
sense to enforce it in the page pool instead of each driver doing
their trick, so I also think we can do better here to handle
pp_frag_count in the page pool instead of driver handling it, so
let's continue the truesize disscussion in patch 2 to see if we
can come up with something better there.

> 
>>>>  	/* If nr == pp_frag_count then we have cleared all remaining
>>>>  	 * references to the page. No need to actually overwrite it, instead
>>>>  	 * we can leave this to be overwritten by the calling function.
>>>> @@ -311,19 +321,36 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>>>>  	 * especially when dealing with a page that may be partitioned
>>>>  	 * into only 2 or 3 pieces.
>>>>  	 */
>>>> -	if (atomic_long_read(&page->pp_frag_count) == nr)
>>>> +	if (atomic_long_read(&page->pp_frag_count) == nr) {
>>>> +		/* As we have ensured nr is always one for constant case
>>>> +		 * using the BUILD_BUG_ON() as above, only need to handle
>>>> +		 * the non-constant case here for frag count draining.
>>>> +		 */
>>>> +		if (!__builtin_constant_p(nr))
>>>> +			atomic_long_set(&page->pp_frag_count, 1);
>>>> +
>>>>  		return 0;
>>>> +	}
>>>>  
> 
> The optimization here was already the comparison since we didn't have
> to do anything if pp_frag_count == nr. The whole point of pp_frag_count
> going to 0 is that is considered non-fragmented in that case and ready
> to be freed. By resetting it to 1 you are implying that there is still
> one *other* user that is holding a fragment so the page cannot be
> freed.
> 
> We weren't bothering with writing the value since the page is in the
> free path and this value is going to be unused until the page is
> reallocated anyway.

I am not sure what you meant above.
But I will describe what is this patch trying to do again:
When PP_FLAG_PAGE_FRAG is set and that flag is per page pool, not per
page, so page_pool_alloc_pages() is not allowed to be called as the
page->pp_frag_count is not setup correctly for the case.

So in order to allow calling page_pool_alloc_pages(), as best as I
can think of, either we need a per page flag/bit to decide whether
to do something like dec_and_test for page->pp_frag_count in
page_pool_is_last_frag(), or we unify the page->pp_frag_count handling
in page_pool_is_last_frag() so that we don't need a per page flag/bit.

This patch utilizes the optimization you mentioned above to unify the
page->pp_frag_count handling.

> 
>>>>  	ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>  	WARN_ON(ret < 0);
>>>> +
>>>> +	/* Reset frag count back to 1, this should be the rare case when
>>>> +	 * two users call page_pool_defrag_page() currently.
>>>> +	 */
>>>> +	if (!ret)
>>>> +		atomic_long_set(&page->pp_frag_count, 1);
>>>> +
>>>>  	return ret;
>>>>  }
>>>>

...

>> As above, it is about unifying handling for frag and non-frag page in
>> page_pool_is_last_frag(). please let me know if there is any better way
>> to do it without adding statements here.
> 
> I get what you are trying to get at but I feel like the implementation
> is going to cause more problems than it helps. The problem is it is
> going to hurt base page pool performance and it just makes the
> fragmented pages that much more confusing to deal with.

For base page pool performance, as I mentioned before:
It remove PP_FLAG_PAGE_FRAG checking and only add the cost of
page_pool_fragment_page() in page_pool_set_pp_info(), which I
think it is negligible as we are already dirtying the same cache
line in page_pool_set_pp_info().

For the confusing, sometimes it is about personal taste, so I am
not going to argue with it:) But it would be good to provide a
non-confusing way to do that with minimal overhead. I feel like
you have provided it in the begin, but I am not able to understand
it yet.

> 
> My advice as a first step would be to look at first solving how to
> enable the PP_FLAG_PAGE_FRAG mode when you have
> PAGE_POOL_DMA_USE_PP_FRAG_COUNT as true. That should be creating mono-
> frags as we are calling them, and we should have a way to get the
> truesize for those so we know when we are consuming significant amount
> of memory.

Does the way to get the truesize in the below RFC make sense to you?
https://patchwork.kernel.org/project/netdevbpf/patch/20230516124801.2465-4-linyunsheng@huawei.com/

> 
> Once that is solved then we can look at what it would take to apply
> mono-frags to the standard page pool case. Ideally we would need to
> find a way to do it with minimal overhead.
> 
> 

  reply	other threads:[~2023-06-03 12:59 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-29  9:28 [PATCH net-next v2 0/3] support non-frag page for page_pool_alloc_frag() Yunsheng Lin
2023-05-29  9:28 ` [PATCH net-next v2 1/3] page_pool: unify frag page and non-frag page handling Yunsheng Lin
2023-05-30 15:07   ` Alexander H Duyck
2023-05-31 11:55     ` Yunsheng Lin
2023-06-02 16:37       ` Alexander H Duyck
2023-06-03 12:59         ` Yunsheng Lin [this message]
2023-06-05 14:58           ` Alexander Duyck
2023-06-06 12:41             ` Yunsheng Lin
2023-06-06 15:33               ` Alexander Duyck
2023-06-07 12:46                 ` Yunsheng Lin
2023-06-07 15:05                   ` Alexander Duyck
2023-05-29  9:28 ` [PATCH net-next v2 2/3] page_pool: support non-frag page for page_pool_alloc_frag() Yunsheng Lin
2023-05-30 15:07   ` Alexander H Duyck
2023-05-31 12:19     ` Yunsheng Lin
2023-06-01 12:21       ` Yunsheng Lin
2023-06-01 18:14       ` Alexander Duyck
2023-06-02 12:23         ` Yunsheng Lin
2023-06-02 15:57           ` Alexander Duyck
2023-06-03  4:20             ` Yunsheng Lin
2023-06-05 15:08               ` Alexander Duyck
2023-05-29  9:28 ` [PATCH net-next v2 3/3] page_pool: remove PP_FLAG_PAGE_FRAG flag Yunsheng Lin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=21f2fe04-8674-cc73-7a6c-cb0765c84dba@gmail.com \
    --to=yunshenglin0825@gmail.com \
    --cc=alexander.duyck@gmail.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=ilias.apalodimas@linaro.org \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linyunsheng@huawei.com \
    --cc=lorenzo@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).