From: Devesh Sharma <devesh.sharma@broadcom.com>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-rdma@vger.kernel.org,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH v1 02/10] svcrdma: Make svc_rdma_get_frmr() not sleep
Date: Sat, 6 Feb 2016 21:26:32 +0530
Message-ID: <CANjDDBj_BK3H3Gc+Vzai9v0Uhk5HT+yHjvCLYssLZunvmGOcFQ@mail.gmail.com>
In-Reply-To: <5F13C308-89C7-4F98-ADB1-D43AA0A1EB50@oracle.com>
On Fri, Feb 5, 2016 at 11:43 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>> On Feb 5, 2016, at 12:49 PM, Devesh Sharma <devesh.sharma@broadcom.com> wrote:
>>
>> On Fri, Feb 5, 2016 at 9:59 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>
>>>> On Feb 5, 2016, at 5:15 AM, Devesh Sharma <devesh.sharma@broadcom.com> wrote:
>>>>
>>>> As I understand it, you are pre-allocating because alloc_mr() can
>>>> sleep, so you separate that step out, while the map-frmr-wqe path is
>>>> a non-blocking context by definition. Are we on the same page?
>>>
>>> alloc_mr() can sleep, and so can kmalloc(GFP_KERNEL).
>>> Currently svc_rdma_get_frmr() is called in a worker thread,
>>> so it's in a process context and can sleep, but if we were
>>> to move the RDMA Read logic into a softIRQ context (say,
>>> invoked during Receive processing), it would be a problem
>>> to sleep.
>>
>> Got it.
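To spell that out for the archive: the pattern being discussed is to do
all of the GFP_KERNEL allocation up front in process context, and to
make the hot-path get a pure list operation that never sleeps. A
minimal sketch, with hypothetical "foo" names rather than the actual
svcrdma symbols:

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo {
	struct list_head list;
};

struct foo_pool {
	spinlock_t lock;		/* assumed initialized elsewhere */
	struct list_head free;		/* assumed initialized elsewhere */
};

/* Process context only: kmalloc(GFP_KERNEL) may sleep. */
static int foo_pool_fill(struct foo_pool *pool, unsigned int n)
{
	while (n--) {
		struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);

		if (!f)
			return -ENOMEM;
		spin_lock_bh(&pool->lock);
		list_add(&f->list, &pool->free);
		spin_unlock_bh(&pool->lock);
	}
	return 0;
}

/* Safe in contexts that cannot sleep: no allocation on this path. */
static struct foo *foo_get(struct foo_pool *pool)
{
	struct foo *f = NULL;

	spin_lock_bh(&pool->lock);
	if (!list_empty(&pool->free)) {
		f = list_first_entry(&pool->free, struct foo, list);
		list_del_init(&f->list);
	}
	spin_unlock_bh(&pool->lock);
	return f;	/* NULL means the reserve is exhausted */
}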
>>
>>>
>>> Another goal is to get rid of memory allocation operations
>>> that can fail in low memory situations in areas of the code
>>> where an error is hard to handle.
>>
>>>
>>> So the patch description is misleading and incorrect.
>>
>> Got it
>>
>>>
>>> However, the problem with pre-allocating frmrs is that it
>>> is not possible for the server to determine how many will
>>> be needed at accept time. The maximum number depends on the
>>> largest possible ULP payload size, the page list depth of
>>> the server's device, and the maximum chunk size the client
>>> is using (it could be as small as 4KB for some registration
>>> modes). That would mean allocating thousands per connection
>>> up front to cover every contingency.
>>
>> Can't it decide on the fly? Start with 'n' smaller frmrs, where 'n'
>> is derived from some static calculation. Then, if the connection
>> demands a bigger size, increase the depth of all 'n' frmrs and reuse
>> them for the life of the connection. If it demands more frmrs, do
>> nothing and try to manage with 'n'. Does that make sense? I am sure
>> I am missing a few points here.
>
> Currently, all svc_rdma_frmrs are the same. They are allocated
> on-demand then cached after an RDMA Read operation is complete.
Yes, that is what the current code does.
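To put rough numbers on the "thousands per connection" point above:
assuming a 4KB page size, RPCSVC_MAXPAGES is on the order of 258, so
with a device fast-reg page list depth of 256 and, say, 32 credits,
the pre-allocation in this patch works out to
(1 + 258/256) * 32 = 2 * 32 = 64 frmrs per connection (integer
division). Covering the worst case of one 4KB page per chunk would
instead take roughly 258 * 32 = 8256 frmrs up front.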
>
>
>>> Perhaps a better solution would be to understand how to
>>> deal with an frmr allocation failure in rdma_read_chunk_frmr().
>>> I think just dropping the connection would be reasonable.
>>> If ib_post_recv() fails, for example, this is exactly what
>>> the server does.
>>
>> Dropping the connection seems logical; however, I think it would be
>> a harsh decision for a hypothetical device that has fewer IB/RoCE
>> resources, e.g. a VM in an SR-IOV enabled environment where many VFs
>> are enabled.
>>
>> As suggested earlier in this mail, just continuing the connection
>> with whatever resources it has would let a low-resource device still
>> be used. If the system recovers from memory pressure, the
>> allocations would eventually succeed. The drawback is a performance
>> penalty.
>
> When an frmr allocation fails, the RDMA Read to pull the
> RPC data over can't be done, so there's no recourse but for
> the server to discard that RPC request. NFSv4.x requires a
> connection drop when an RPC is lost. The client will retransmit
> that RPC.
>
> Dropping the connection frees all resources for that connection.
> Clients will typically reconnect pretty quickly when they have
> more work to do. The server will drop the connection repeatedly
> until enough resources are available to handle the client's
> workload.
OK, so effectively the user will see "failed to reconnect" messages in
syslog until the reconnect succeeds. I think if the syslog messages
display a valid reason, then we are covered here.
>
> Eventually we could add a retry mechanism to the RDMA Read
> logic. Wait a bit and try the allocation again. That's probably
> not worth the effort; this really should be very rare.
Makes sense, I agree with you here.
>
> If a device is ultimately resource-starved, that's a different
> problem. The server may never be able to allocate enough MRs
> even if the client tries again later.
Yes
>
> That kind of issue should be obvious at accept time. The server
> should adjust the connection's credit limit downward to fit the
> device's available resource budget.
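Agreed. A minimal sketch of that adjustment, with made-up names for
illustration rather than the actual svcxprt_rdma fields:

#include <linux/kernel.h>

/* Clamp the advertised credit limit so that the worst-case number
 * of frmrs per credit still fits the device's MR budget.
 */
static unsigned int fit_credits(unsigned int want_credits,
				unsigned int dev_max_mrs,
				unsigned int frmrs_per_credit)
{
	unsigned int fit = dev_max_mrs / frmrs_per_credit;

	return min(want_credits, max(fit, 1U));
}

Something like this in the accept path, applied before the
pre-allocation, would cover the resource-starved case.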
>
> Alternately, if the device is not an iWARP device, RDMA Reads
> can use physical addressing, and no frmrs are needed.
>
>
OK, let's wait for Christoph's patches to the NFS server. Until then,
I agree with your decision on this patch.
>>> Christoph says he is about to modify or replace the NFS
>>> server's RDMA Read code paths, however.
>>>
>>> I'll drop this one.
>>>
>>>
>>>> On Wed, Feb 3, 2016 at 9:21 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>> Want to call it in a context that cannot sleep. So pre-alloc
>>>>> the memory and the MRs.
>>>>>
>>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>>> ---
>>>>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 44 ++++++++++++++++++++++++++----
>>>>> 1 file changed, 38 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>>> index 5763825..02eee12 100644
>>>>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>>> @@ -949,7 +949,7 @@ static struct svc_rdma_fastreg_mr *rdma_alloc_frmr(struct svcxprt_rdma *xprt)
>>>>> return ERR_PTR(-ENOMEM);
>>>>> }
>>>>>
>>>>> -static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
>>>>> +static void svc_rdma_destroy_frmrs(struct svcxprt_rdma *xprt)
>>>>> {
>>>>> struct svc_rdma_fastreg_mr *frmr;
>>>>>
>>>>> @@ -963,6 +963,37 @@ static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
>>>>> }
>>>>> }
>>>>>
>>>>> +static bool svc_rdma_prealloc_frmrs(struct svcxprt_rdma *xprt)
>>>>> +{
>>>>> + struct ib_device *dev = xprt->sc_cm_id->device;
>>>>> + unsigned int i;
>>>>> +
>>>>> + /* Pre-allocate based on the maximum amount of payload
>>>>> + * the server's HCA can handle per RDMA Read, to keep
>>>>> + * the number of MRs per connection in check.
>>>>> + *
>>>>> + * If a client sends small Read chunks (eg. it may be
>>>>> + * using physical registration), more RDMA Reads per
>>>>> + * NFS WRITE will be needed. svc_rdma_get_frmr() dips
>>>>> + * into its reserve in that case. Better would be for
>>>>> + * the server to reduce the connection's credit limit.
>>>>> + */
>>>>> + i = 1 + RPCSVC_MAXPAGES / dev->attrs.max_fast_reg_page_list_len;
>>>>> + i *= xprt->sc_max_requests;
>>>>> +
>>>>> + while (i--) {
>>>>> + struct svc_rdma_fastreg_mr *frmr;
>>>>> +
>>>>> + frmr = rdma_alloc_frmr(xprt);
>>>>> + if (!frmr) {
>>>>> + dprintk("svcrdma: No memory for frmrs\n");
>>>>> + return false;
>>>>> + }
>>>>> + list_add(&frmr->frmr_list, &xprt->sc_frmr_q);
>>>>> + }
>>>>> + return true;
>>>>> +}
>>>>> +
>>>>> struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
>>>>> {
>>>>> struct svc_rdma_fastreg_mr *frmr = NULL;
>>>>> @@ -975,10 +1006,9 @@ struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
>>>>> frmr->sg_nents = 0;
>>>>> }
>>>>> spin_unlock_bh(&rdma->sc_frmr_q_lock);
>>>>> - if (frmr)
>>>>> - return frmr;
>>>>> -
>>>>> - return rdma_alloc_frmr(rdma);
>>>>> + if (!frmr)
>>>>> + return ERR_PTR(-ENOMEM);
>>>>> + return frmr;
>>>>> }
>>>>>
>>>>> void svc_rdma_put_frmr(struct svcxprt_rdma *rdma,
>>>>> @@ -1149,6 +1179,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>>>>> dev->attrs.max_fast_reg_page_list_len;
>>>>> newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_FAST_REG;
>>>>> newxprt->sc_reader = rdma_read_chunk_frmr;
>>>>> + if (!svc_rdma_prealloc_frmrs(newxprt))
>>>>> + goto errout;
>>>>> }
>>>>>
>>>>> /*
>>>>> @@ -1310,7 +1342,7 @@ static void __svc_rdma_free(struct work_struct *work)
>>>>> xprt->xpt_bc_xprt = NULL;
>>>>> }
>>>>>
>>>>> - rdma_dealloc_frmr_q(rdma);
>>>>> + svc_rdma_destroy_frmrs(rdma);
>>>>> svc_rdma_destroy_ctxts(rdma);
>>>>> svc_rdma_destroy_maps(rdma);
>>>>>
>>>
>>> --
>>> Chuck Lever
>>>
>>>
>>>
>>>
>
> --
> Chuck Lever
>
>
>
>