From: NeilBrown <neilb@suse.de>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: releasing result pages in svc_xprt_release()
Date: Tue, 02 Feb 2021 10:27:25 +1100
Message-ID: <87wnvre5cy.fsf@notabene.neil.brown.name>
In-Reply-To: <700333BB-0928-4772-ADFB-77D92CCE169C@oracle.com>
On Mon, Feb 01 2021, Chuck Lever wrote:
>> On Jan 31, 2021, at 6:45 PM, NeilBrown <neilb@suse.de> wrote:
>>
>> On Fri, Jan 29 2021, Chuck Lever wrote:
>>>
>>> What's your opinion?
>>
>> To form a coherent opinion, I would need to know what that problem is.
>> I certainly accept that there could be performance problems in releasing
>> and re-allocating pages which might be resolved by batching, or by copying,
>> or by better tracking. But without knowing what hot-spot you want to
>> cool down, I cannot think about how that fits into the big picture.
>> So: what exactly is the problem that you see?
>
> The problem is that each 1MB NFS READ, for example, hands 257 pages
> back to the page allocator, then allocates another 257 pages. One page
> is not terrible, but 510+ allocator calls for every RPC begins to get
> costly.
>
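To make that arithmetic concrete, here is a sketch assuming 4KB pages (the helper names are made up for illustration):

```c
#include <assert.h>

/* Page accounting for a 1MB NFS READ, assuming PAGE_SIZE == 4096:
 * 256 pages of payload plus one page for the RPC header = 257 pages,
 * each freed and then re-allocated, i.e. 514 allocator calls per RPC. */
enum { PAGE_SIZE_BYTES = 4096 };

static int pages_per_read(int payload_bytes)
{
	return payload_bytes / PAGE_SIZE_BYTES + 1;  /* +1 for the header page */
}

static int allocator_calls_per_rpc(int payload_bytes)
{
	return 2 * pages_per_read(payload_bytes);    /* one free + one alloc each */
}
```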
> Also, remember that both allocating and freeing a page means taking an
> irqsave spin lock -- that will serialize all other page allocations,
> including allocations done by other nfsd threads.
>
> So I'd like to lower page allocator contention and the rate at which
> IRQs are disabled and enabled when the NFS server becomes busy, as it
> might with several 25 GbE NICs, for instance.
>
> Holding onto the same pages means we can make better use of TLB
> entries -- fewer TLB flushes is always a good thing.
>
> I know that the network folks at Red Hat have been staring hard at
> reducing memory allocation in the stack for several years. I recall
> that Matthew Wilcox recently made similar improvements to the block
> layer.
>
> With the advent of 100GbE and Optane-like durable storage, the balance
> of memory allocation cost to device latency has shifted so that
> superfluous memory allocation is noticeable.
>
>
> At first I thought of creating a page allocator API that could grab
> or free an array of pages while taking the allocator locks once. But
> now I wonder if there are opportunities to reduce the amount of page
> allocator traffic.
Thanks.  This helps me a lot.

I wonder if there is some low-hanging fruit here.

If I read the code correctly (which is not certain, but what I see does
seem to agree with vague memories of how it all works), we currently do
a lot of wasted alloc/frees for zero-copy reads.

We allocate lots of pages and store the pointers in ->rq_respages
(i.e. ->rq_pages).  Then nfsd_splice_actor frees many of those pages and
replaces the pointers with pointers to page-cache pages.  Then we release
those page-cache pages.  We do need to have allocated the original pages,
but we don't need to free them -- they could be kept for the next request.

We could add a new array for storing those pages, have nfsd_splice_actor
move them to that array, and have svc_alloc_arg() take pages back from
that store rather than re-allocating them.  Or maybe something even more
sophisticated where we only move them out of the store when we actually
need them.

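Roughly what I have in mind, sketched in userspace (the page_store,
store_put and store_get names are invented, not existing kernel symbols,
and malloc() stands in for a struct page):

```c
#include <stddef.h>
#include <stdlib.h>

/* A per-thread "page store": a small fixed-size stack of page pointers
 * that lets spliced-out pages be reused instead of freed. */
#define STORE_MAX 64

struct page_store {
	void *pages[STORE_MAX];
	int   count;
};

/* nfsd_splice_actor would call this instead of put_page(): keep the page. */
static int store_put(struct page_store *s, void *page)
{
	if (s->count >= STORE_MAX)
		return 0;		/* store full: caller frees as before */
	s->pages[s->count++] = page;
	return 1;
}

/* svc_alloc_arg() would call this before falling back to alloc_page(). */
static void *store_get(struct page_store *s)
{
	if (s->count == 0)
		return NULL;		/* store empty: caller allocates as before */
	return s->pages[--s->count];
}
```

The fallback paths mean correctness never depends on the store; it is
purely an optimisation.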
Having the RDMA layer return pages when it is finished with them might
help.  You might even be able to use atomics (cmpxchg) to handle the
contention.  But I'm not convinced it would be worth it.

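Something like this, modelled with C11 atomics standing in for the
kernel's cmpxchg() (all names here are invented for illustration):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* A one-slot lock-free hand-off: an RDMA completion handler could
 * try_put() a finished page; an nfsd thread try_get()s it back. */
static _Atomic(void *) slot = NULL;

static int try_put(void *page)
{
	void *expected = NULL;
	/* Succeeds only if the slot was empty; otherwise the caller
	 * falls back to the ordinary free path. */
	return atomic_compare_exchange_strong(&slot, &expected, page);
}

static void *try_get(void)
{
	/* Atomically take whatever is in the slot, leaving it empty. */
	return atomic_exchange(&slot, NULL);
}
```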
I *really* like your idea of a batch-API for page-alloc and page-free.
This would likely be useful for other users, and it would be worth
writing more code to get peak performance -- things such as per-cpu
queues of returned pages and so forth (which presumably already exist).

I cannot be sure that the batch-API would be better than a focused API
just for RDMA -> NFSD.  But my guess is that it would be at least nearly
as good, and would likely get a lot more eyes on the code.
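A userspace model of why the batch API wins (alloc_pages_batch and
free_pages_batch are hypothetical names, and a counter stands in for the
allocator's irqsave lock):

```c
#include <stdlib.h>

/* Take the "allocator lock" once per call instead of once per page.
 * The lock is simulated by a counter so the saving is visible. */
static int lock_roundtrips;

static int alloc_pages_batch(void **pages, int n)
{
	lock_roundtrips++;		/* one lock/irqsave per batch ...   */
	for (int i = 0; i < n; i++) {
		pages[i] = malloc(4096); /* ... instead of one per page    */
		if (!pages[i])
			return i;	/* partial success: i pages filled  */
	}
	return n;
}

static void free_pages_batch(void **pages, int n)
{
	lock_roundtrips++;
	for (int i = 0; i < n; i++)
		free(pages[i]);
}
```

For the 1MB READ case above, that is 2 lock round-trips per RPC instead
of roughly 514.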
NeilBrown
Thread overview: 11+ messages
2021-01-29 16:19 releasing result pages in svc_xprt_release() Chuck Lever
2021-01-29 22:43 ` NeilBrown
2021-01-29 23:06 ` Chuck Lever
2021-01-31 23:45 ` NeilBrown
2021-02-01 0:19 ` Chuck Lever
2021-02-01 23:27 ` NeilBrown [this message]
2021-02-05 20:20 ` Chuck Lever
2021-02-05 21:13 ` J. Bruce Fields
2021-02-06 17:59 ` Chuck Lever
2021-02-07 23:59 ` NeilBrown
2021-02-07 23:42 ` NeilBrown