Re: NFS over RDMA benchmark

From: Tom Tucker <tom@opengridcomputing.com>
To: Tom Talpey <tom@talpey.com>
Cc: Wendy Cheng <s.wendy.cheng@gmail.com>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	Yan Burman <yanb@mellanox.com>,
	"Atchley, Scott" <atchleyes@ornl.gov>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	Or Gerlitz <ogerlitz@mellanox.com>
Subject: Re: NFS over RDMA benchmark
Date: Thu, 25 Apr 2013 16:17:06 -0500
Message-ID: <51799D52.1040903@opengridcomputing.com>
In-Reply-To: <51798C51.50209@talpey.com>

On 4/25/13 3:04 PM, Tom Talpey wrote:
> On 4/25/2013 1:18 PM, Wendy Cheng wrote:
>> On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey <tom@talpey.com> wrote:
>>>> On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng <s.wendy.cheng@gmail.com>
>>>> wrote:
>>>>>
>>>> So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1
>>>> tar ball) ... Here is a random thought (not related to the rb tree
>>>> comment).....
>>>>
>>>> The inflight packet count seems to be controlled by
>>>> xprt_rdma_slot_table_entries that is currently hard-coded as
>>>> RPCRDMA_DEF_SLOT_TABLE (32) (?).  I'm wondering whether it could help
>>>> with the bandwidth number if we pump it up, say 64 instead ? Not sure
>>>> whether FMR pool size needs to get adjusted accordingly though.
>>>
>>> 1)
>>>
>>> The client slot count is not hard-coded, it can easily be changed by
>>> writing a value to /proc and initiating a new mount. But I doubt that
>>> increasing the slot table will improve performance much, unless this is
>>> a small-random-read, spindle-limited workload.
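
A minimal sketch of the "write a value to /proc" step, assuming the entry
is /proc/sys/sunrpc/rdma_slot_table_entries (the name is given further down
in this thread from memory, so treat the path as an assumption). The helper
names are made up for illustration. Writing needs root, and only mounts
created after the write pick up the new value.

```python
from pathlib import Path

# Assumed location; the thread recalls it from memory ("IIRC").
SLOT_TABLE_SYSCTL = Path("/proc/sys/sunrpc/rdma_slot_table_entries")


def read_slot_entries(path=SLOT_TABLE_SYSCTL):
    """Return the current slot table size as an int."""
    return int(Path(path).read_text().strip())


def write_slot_entries(value, path=SLOT_TABLE_SYSCTL):
    """Write a new slot table size (needs root on a real system).

    Existing mounts keep their old value; remount to apply the change.
    """
    if value < 1:
        raise ValueError("slot count must be positive")
    Path(path).write_text(f"{value}\n")
```

For example, write_slot_entries(64) followed by a fresh mount would try the
value suggested above.
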
>>
>> Hi Tom !
>>
>> It was a shot in the dark :) ... as our test bed has not been set up
>> yet. However, since I'll be working on (very) slow clients, increasing
>> this buffer is still interesting (to me). I don't see where it is
>> controlled by a /proc value (?) - but that is not a concern at this
>
> The entries show up in /proc/sys/sunrpc (IIRC). The one you're looking
> for is called rdma_slot_table_entries.
>
>> moment as /proc entry is easy to add. More questions on the server
>> though (see below) ...
>>
>>>
>>> 2)
>>>
>>> The observation appears to be that the bandwidth is server CPU limited.
>>> Increasing the load offered by the client probably won't move the needle,
>>> until that's addressed.
>>>
>>
>> Could you give more hints on which part of the path is CPU limited ?
>
> Sorry, I can't. The profile showing 25% of the 16-core, 2-socket server
> spinning on locks is a smoking, flaming gun though. Maybe Tom Tucker
> has some ideas on the srv rdma code, but it could also be in the sunrpc
> or infiniband driver layers, can't really tell without the call stacks.

The Mellanox driver uses red-black trees extensively for resource
management, e.g. QP ID, CQ ID, etc. When completions come in from the
HW, I believe these are used to find the associated software data
structures. It is certainly possible that these trees get hot on lookup
when we're pushing a lot of data. I'm surprised, however, to see
rb_insert_color there because I'm not aware of anywhere that resources
are being inserted into and/or removed from a red-black tree in the data path.
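
To illustrate why rb_insert_color stands out: a lookup only walks the tree,
while an insert is followed by rebalancing (rotations and recoloring), which
is the work rb_insert_color does in the kernel. The sketch below is a
left-leaning red-black tree chosen for brevity, not the kernel's
lib/rbtree.c, and the rebalance_ops counter is added purely for
illustration.

```python
RED, BLACK = True, False


class Node:
    def __init__(self, key, val):
        self.key, self.val = key, val
        self.left = self.right = None
        self.color = RED  # new nodes start red


class Tree:
    def __init__(self):
        self.root = None
        self.rebalance_ops = 0  # rotations + color flips

    def lookup(self, key):
        """Read-only walk; never modifies the tree."""
        n = self.root
        while n is not None:
            if key == n.key:
                return n.val
            n = n.left if key < n.key else n.right
        return None

    def insert(self, key, val):
        self.root = self._insert(self.root, key, val)
        self.root.color = BLACK

    def _is_red(self, n):
        return n is not None and n.color == RED

    def _rotate_left(self, h):
        x = h.right
        h.right, x.left = x.left, h
        x.color, h.color = h.color, RED
        self.rebalance_ops += 1
        return x

    def _rotate_right(self, h):
        x = h.left
        h.left, x.right = x.right, h
        x.color, h.color = h.color, RED
        self.rebalance_ops += 1
        return x

    def _flip(self, h):
        h.color = RED
        h.left.color = h.right.color = BLACK
        self.rebalance_ops += 1

    def _insert(self, h, key, val):
        if h is None:
            return Node(key, val)
        if key < h.key:
            h.left = self._insert(h.left, key, val)
        elif key > h.key:
            h.right = self._insert(h.right, key, val)
        else:
            h.val = val
        # Rebalance on the way back up; lookups never reach this code.
        if self._is_red(h.right) and not self._is_red(h.left):
            h = self._rotate_left(h)
        if self._is_red(h.left) and self._is_red(h.left.left):
            h = self._rotate_right(h)
        if self._is_red(h.left) and self._is_red(h.right):
            self._flip(h)
        return h
```

Any number of lookup() calls leaves rebalance_ops unchanged, so rebalancing
samples in a profile point at inserts happening somewhere, not at
lookup-only traffic.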

They are also used by IPoIB and the IB CM; however, connections should not
be coming and going unless we've got other problems. IPoIB is only used by
the IB transport for connection setup, and my impression is that this
trace is for the IB transport.

I don't believe that red-black trees are used by either the client or 
server transports directly. Note that the rb_lock in the client is for
buffers and not, as the name might imply, for a red-black tree.

I think the key here is to discover what lock is being waited on. Are we
certain that it's a lock on a red-black tree, and if so, which one?
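
One way to narrow that down once call stacks are available, as a rough
sketch: collapse the profile into "folded" stack lines
("frame1;frame2;...;leaf count", root first, as stack-collapsing tools
emit) and sum the samples by whichever function called into the spinlock
symbol. The function name lock_callers, the "_raw_spin_lock" prefix and
every frame in the sample are assumptions for illustration; none of this
data comes from the profile discussed here.

```python
from collections import Counter


def lock_callers(folded_lines, lock_prefix="_raw_spin_lock"):
    """Sum samples per function that called into a spinlock symbol.

    Each input line is "frame1;frame2;...;leaf count", root first.
    Returns (caller, samples) pairs, heaviest first.
    """
    callers = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        frames = stack.split(";")
        for i, frame in enumerate(frames):
            if frame.startswith(lock_prefix) and i > 0:
                callers[frames[i - 1]] += int(count)
                break
    return callers.most_common()


# Made-up sample input; the hypothetical_* frames are placeholders.
sample = [
    "nfsd;svc_process;hypothetical_rdma_recv;_raw_spin_lock_bh 120",
    "nfsd;svc_process;hypothetical_rdma_recv;_raw_spin_lock_bh 5",
    "nfsd;svc_process;hypothetical_rdma_send;_raw_spin_lock_bh 30",
    "swapper;hypothetical_cq_handler;hypothetical_qp_lookup;_raw_spin_lock_irqsave 75",
    "nfsd;svc_process;memcpy 500",
]

for caller, samples in lock_callers(sample):
    print(caller, samples)
```

The caller at the top of that list is the place to look for the contended
lock; whether it sits under a red-black tree at all should then be visible
from the surrounding frames.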

Tom
>
>> Is there a known Linux-based filesystem that is reasonably tuned for
>> NFS-RDMA ? Any specific filesystem features that would work well with
>> NFS-RDMA ? I'm wondering, when disk+FS are added into the
>> configuration, how much advantage NFS-RDMA would have compared
>> with a plain TCP/IP transport, say IPoIB on CM ?
>
> NFS-RDMA is not really filesystem dependent, but certainly there are
> considerations for filesystems to support NFS, and of course the goal in
> general is performance. NFS-RDMA is a network transport, applicable to
> both client and server. Filesystem choice is a server consideration.
>
> I don't have a simple answer to your question about how much better
> NFS-RDMA is over other transports. Architecturally, a lot. In practice,
> there are many, many variables. Have you seen RFC 5532, which I co-wrote
> with the late Chet Juszczak? You may find it's still quite relevant.
> http://tools.ietf.org/html/rfc5532
