NFS over RDMA benchmark

* NFS over RDMA benchmark
@ 2013-04-17 14:36 ` Yan Burman
  0 siblings, 0 replies; 82+ messages in thread
From: Yan Burman @ 2013-04-17 14:36 UTC (permalink / raw)
  To: J. Bruce Fields, Tom Tucker
  Cc: linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org

Hi.

I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me.
My setup consists of 2 servers, each with 16 cores, 32 GB of memory, and a Mellanox ConnectX3 QDR card over PCI-e gen3.
These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime.
I am running kernel 3.5.7.

When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
When I run fio over an RDMA-mounted NFS share, I get 260-2200 MB/sec for the same block sizes (4-512K). Running over IPoIB-CM, I get 200-980 MB/sec.
I got to these results after the following optimizations:
1. Setting IRQ affinity to the CPUs that are part of the NUMA node the card is on
2. Increasing /proc/sys/sunrpc/svc_rdma/max_outbound_read_requests and /proc/sys/sunrpc/svc_rdma/max_requests to 256 on the server
3. Increasing RPCNFSDCOUNT to 32 on the server
4. FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128 --ioengine=libaio --size=100000k --prioclass=1 --prio=0 --cpumask=255 --loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap --group_reporting --exitall --buffered=0
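
The block-size sweep described above could be scripted roughly as follows. This is only a sketch: the `--name` and `--filename` options and the mount path `/mnt/nfs/testfile` are assumptions that do not appear in the original message, the power-of-two steps between 4k and 512k are a guess at how the range was covered, and only a subset of the listed fio arguments is reproduced. The helper `cpu_mask` shows how `--cpumask=255` corresponds to CPUs 0-7.

```python
import shlex


def cpu_mask(cpus):
    """Return the integer bitmask with one bit set per CPU index."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def fio_args(block_size, cpus, filename):
    """Build a fio argument list for one random-read run.

    filename and the job name are hypothetical; the remaining options
    are taken from the argument list in the message above.
    """
    return [
        "fio",
        "--name=randread",            # assumed job name
        f"--filename={filename}",     # assumed path on the NFS mount
        "--rw=randread",
        f"--bs={block_size}",
        "--numjobs=2",
        "--iodepth=128",
        "--ioengine=libaio",
        "--size=100000k",
        f"--cpumask={cpu_mask(cpus)}",
        "--loops=25",
        "--direct=1",
        "--group_reporting",
    ]


# Assumed sweep: powers of two from 4k to 512k.
BLOCK_SIZES = [f"{2 ** n}k" for n in range(2, 10)]

if __name__ == "__main__":
    for bs in BLOCK_SIZES:
        # Print the commands instead of running them, so the sketch is safe to try.
        print(shlex.join(fio_args(bs, range(8), "/mnt/nfs/testfile")))
```

Printing the command lines rather than executing them keeps the sketch harmless on a machine without fio or an NFS mount; piping the output to a shell would run the sweep.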

Please advise what can be done to improve bandwidth.
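
As a quick check on the "about half" figure, the best-case fio numbers can be divided by the ib_send_bw range quoted above. The sketch assumes 1 GB = 1000 MB, which the message does not state.

```python
def fraction_of_wire(throughput_mb_s, wire_gb_s):
    """Return throughput as a fraction of the measured wire speed.

    Assumes decimal units (1 GB = 1000 MB).
    """
    return throughput_mb_s / (wire_gb_s * 1000.0)


if __name__ == "__main__":
    # Peak figures from the message: 2200 MB/s (NFS/RDMA), 980 MB/s (IPoIB-CM),
    # against 4.3-4.5 GB/s from ib_send_bw.
    for label, peak in (("NFS over RDMA", 2200), ("IPoIB-CM", 980)):
        low = fraction_of_wire(peak, 4.5)
        high = fraction_of_wire(peak, 4.3)
        print(f"{label}: {low:.0%} to {high:.0%} of ib_send_bw")
```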

BTW, I also tried the latest net-next tree (3.9-rc5), and when both server and client are 3.9, the client gets an I/O error when trying to access a file on the NFS mount.
When the server is 3.9 and the client is 3.5.7, I managed to get through all randread tests with fio, but when I got to randwrite, the server crashed.

Thanks in advance
Yan

* Re: NFS over RDMA benchmark
@ 2013-04-17 17:15   ` Wendy Cheng
  0 siblings, 0 replies; 82+ messages in thread
From: Wendy Cheng @ 2013-04-17 17:15 UTC (permalink / raw)
  To: Yan Burman
  Cc: J. Bruce Fields, Tom Tucker, linux-rdma@vger.kernel.org,
	linux-nfs@vger.kernel.org

On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <yanb@mellanox.com> wrote:
> Hi.
>
> I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me.
> My setup consists of 2 servers each with 16 cores, 32Gb of memory, and Mellanox ConnectX3 QDR card over PCI-e gen3.
> These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime.
> I am running kernel 3.5.7.
>
> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.

Remember there are always gaps between wire speed (which ib_send_bw
measures) and real world applications.

That being said, does your server use the default export option (sync)?
Exporting the share with the "async" option can bring you closer to wire
speed. However, the practice (async) is generally not recommended in a
real production system, as it can cause data integrity issues, e.g.
you are more likely to lose data when the boxes crash.

-- Wendy

* Re: NFS over RDMA benchmark
@ 2013-04-17 17:32     ` Atchley, Scott
  0 siblings, 0 replies; 82+ messages in thread
From: Atchley, Scott @ 2013-04-17 17:32 UTC (permalink / raw)
  To: Wendy Cheng
  Cc: Yan Burman, J. Bruce Fields, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org

On Apr 17, 2013, at 1:15 PM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:

> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <yanb@mellanox.com> wrote:
>> Hi.
>> 
>> I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me.
>> My setup consists of 2 servers each with 16 cores, 32Gb of memory, and Mellanox ConnectX3 QDR card over PCI-e gen3.
>> These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime.
>> I am running kernel 3.5.7.
>> 
>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.

Yan,

Are you trying to optimize single client performance or server performance with multiple clients?


> Remember there are always gaps between wire speed (that ib_send_bw
> measures) and real world applications.
> 
> That being said, does your server use default export (sync) option ?
> Export the share with "async" option can bring you closer to wire
> speed. However, the practice (async) is generally not recommended in a
> real production system - as it can cause data integrity issues, e.g.
> you have more chances to lose data when the boxes crash.
> 
> -- Wendy


Wendy,

It has been a few years since I looked at RPCRDMA, but I seem to remember that RPCs were limited to 32KB, which means that you have to pipeline them to get line rate. In addition to requiring pipelining, the argument from the authors was that the goal was to maximize server performance and not single client performance.
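
The pipelining point can be made concrete with a bandwidth-delay estimate: the number of 32 KB RPCs that must be in flight is roughly line rate times per-RPC latency, divided by the RPC size. The sketch below is only an illustration; the 32 KB limit is the recollection stated above, and the 50 microsecond per-RPC latency is a hypothetical value, not something measured in this thread.

```python
import math

RPC_PAYLOAD_BYTES = 32 * 1024  # 32 KB per RPC, as recalled in the message above


def rpcs_in_flight(line_rate_bytes_s, rpc_latency_s, rpc_bytes=RPC_PAYLOAD_BYTES):
    """Minimum number of concurrent RPCs needed to keep the link full."""
    return math.ceil(line_rate_bytes_s * rpc_latency_s / rpc_bytes)


def throughput_bytes_s(depth, rpc_latency_s, line_rate_bytes_s,
                       rpc_bytes=RPC_PAYLOAD_BYTES):
    """Throughput reachable with `depth` RPCs in flight, capped at line rate."""
    return min(depth * rpc_bytes / rpc_latency_s, line_rate_bytes_s)


if __name__ == "__main__":
    line_rate = 4.4e9   # bytes/s, midpoint of the ib_send_bw range in the thread
    latency = 50e-6     # seconds per RPC round trip: hypothetical value
    print("RPCs in flight needed:", rpcs_in_flight(line_rate, latency))
    for depth in (1, 2, 4, 8):
        mb_s = throughput_bytes_s(depth, latency, line_rate) / 1e6
        print(f"depth {depth}: {mb_s:.0f} MB/s")
```

Under these assumed numbers a single outstanding RPC reaches only a fraction of line rate, which is the reason pipelining is required; the real latency would have to be measured on the setup in question.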

Scott

* Re: NFS over RDMA benchmark
@ 2013-04-17 18:06       ` Wendy Cheng
  0 siblings, 0 replies; 82+ messages in thread
From: Wendy Cheng @ 2013-04-17 18:06 UTC (permalink / raw)
  To: Atchley, Scott
  Cc: Yan Burman, J. Bruce Fields, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org

On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott <atchleyes@ornl.gov> wrote:
> On Apr 17, 2013, at 1:15 PM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
>
>> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <yanb@mellanox.com> wrote:
>>> Hi.
>>>
>>> I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me.
>>> My setup consists of 2 servers each with 16 cores, 32Gb of memory, and Mellanox ConnectX3 QDR card over PCI-e gen3.
>>> These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime.
>>> I am running kernel 3.5.7.
>>>
>>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
>>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
>
> Yan,
>
> Are you trying to optimize single client performance or server performance with multiple clients?
>
>
>> Remember there are always gaps between wire speed (that ib_send_bw
>> measures) and real world applications.
>>
>> That being said, does your server use default export (sync) option ?
>> Export the share with "async" option can bring you closer to wire
>> speed. However, the practice (async) is generally not recommended in a
>> real production system - as it can cause data integrity issues, e.g.
>> you have more chances to lose data when the boxes crash.
>>
>> -- Wendy
>
>
> Wendy,
>
> It has a been a few years since I looked at RPCRDMA, but I seem to remember that RPCs were limited to 32KB which means that you have to pipeline them to get linerate. In addition to requiring pipelining, the argument from the authors was that the goal was to maximize server performance and not single client performance.
>
> Scott
>

That (client count) brings up a good point ...

FIO is really not a good benchmark for NFS. Does anyone have SPECsfs
numbers on NFS over RDMA to share?

-- Wendy

* RE: NFS over RDMA benchmark
@ 2013-04-18 12:47         ` Yan Burman
  0 siblings, 0 replies; 82+ messages in thread
From: Yan Burman @ 2013-04-18 12:47 UTC (permalink / raw)
  To: Wendy Cheng, Atchley, Scott
  Cc: J. Bruce Fields, Tom Tucker, linux-rdma@vger.kernel.org,
	linux-nfs@vger.kernel.org, Or Gerlitz



> -----Original Message-----
> From: Wendy Cheng [mailto:s.wendy.cheng@gmail.com]
> Sent: Wednesday, April 17, 2013 21:06
> To: Atchley, Scott
> Cc: Yan Burman; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org;
> linux-nfs@vger.kernel.org
> Subject: Re: NFS over RDMA benchmark
> 
> On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott <atchleyes@ornl.gov>
> wrote:
> > On Apr 17, 2013, at 1:15 PM, Wendy Cheng <s.wendy.cheng@gmail.com>
> wrote:
> >
> >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <yanb@mellanox.com>
> wrote:
> >>> Hi.
> >>>
> >>> I've been trying to do some benchmarks for NFS over RDMA and I seem to
> only get about half of the bandwidth that the HW can give me.
> >>> My setup consists of 2 servers each with 16 cores, 32Gb of memory, and
> Mellanox ConnectX3 QDR card over PCI-e gen3.
> >>> These servers are connected to a QDR IB switch. The backing storage on
> the server is tmpfs mounted with noatime.
> >>> I am running kernel 3.5.7.
> >>>
> >>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
> >>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
> same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
> >
> > Yan,
> >
> > Are you trying to optimize single client performance or server performance
> with multiple clients?
> >

I am trying to get maximum performance from a single server. I used 2 processes in the fio test; more than 2 did not show any performance boost.
I tried running fio from 2 different PCs on 2 different files, but the sum of the two is more or less the same as running from a single client PC.

What I did see is that the server is working a lot harder than the clients and, more than that, it has 1 core (CPU5) at 100% in softirq tasklet processing:
cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15
          HI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
       TIMER:     418767      46596      43515      44547      50099      34815      40634      40337      39551      93442      73733      42631      42509      41592      40351      61793
      NET_TX:      28719        309       1421       1294       1730       1243        832        937         11         44         41         20         26         19         15         29
      NET_RX:     612070         19         22         21          6        235          3          2          9          6         17         16         20         13         16         10
       BLOCK:       5941          0          0          0          0          0          0          0        519        259       1238        272        253        174        215       2618
BLOCK_IOPOLL:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
     TASKLET:         28          1          1          1          1    1540653          1          1         29          1          1          1          1          1          1          2
       SCHED:     364965      26547      16807      18403      22919       8678      14358      14091      16981      64903      47141      18517      19179      18036      17037      38261
     HRTIMER:         13          0          1          1          0          0          0          0          0          0          0          0          1          1          0          1
         RCU:     945823     841546     715281     892762     823564      42663     863063     841622     333577     389013     393501     239103     221524     258159     313426     234030
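
The imbalance in the table above can be extracted programmatically. The sketch below parses text in the /proc/softirqs layout and reports which CPU handled the largest share of a given softirq; the function names are illustrative, and the sample reproduces only the TASKLET and NET_RX rows from the output above. The counters are cumulative since boot, so comparing two snapshots taken some seconds apart would give a better picture of the current load than a single reading.

```python
SAMPLE = """\
                    CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15
      NET_RX:     612070         19         22         21          6        235          3          2          9          6         17         16         20         13         16         10
     TASKLET:         28          1          1          1          1    1540653          1          1         29          1          1          1          1          1          1          2
"""


def parse_softirqs(text):
    """Parse /proc/softirqs-style text into {softirq: {cpu: count}}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cpus = lines[0].split()            # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        name, _, rest = ln.partition(":")
        counts = [int(x) for x in rest.split()]
        table[name.strip()] = dict(zip(cpus, counts))
    return table


def hottest_cpu(table, softirq):
    """Return (cpu, share) for the CPU with the highest count of `softirq`."""
    per_cpu = table[softirq]
    cpu = max(per_cpu, key=per_cpu.get)
    total = sum(per_cpu.values())
    share = per_cpu[cpu] / total if total else 0.0
    return cpu, share


if __name__ == "__main__":
    table = parse_softirqs(SAMPLE)
    for name in table:
        cpu, share = hottest_cpu(table, name)
        print(f"{name}: {cpu} handled {share:.2%} of events")
```
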
> >
> >> Remember there are always gaps between wire speed (that ib_send_bw
> >> measures) and real world applications.

I realize that, but I don't expect the difference to be more than a factor of two.

> >>
> >> That being said, does your server use default export (sync) option ?
> >> Export the share with "async" option can bring you closer to wire
> >> speed. However, the practice (async) is generally not recommended in
> >> a real production system - as it can cause data integrity issues, e.g.
> >> you have more chances to lose data when the boxes crash.

I am running with the async export option, but that should not matter too much, since my backing storage is tmpfs mounted with noatime.

> >>
> >> -- Wendy
> >
> >
> > Wendy,
> >
> > It has a been a few years since I looked at RPCRDMA, but I seem to
> remember that RPCs were limited to 32KB which means that you have to
> pipeline them to get linerate. In addition to requiring pipelining, the
> argument from the authors was that the goal was to maximize server
> performance and not single client performance.
> >

What I see is that performance increases almost linearly up to a block size of 256K and falls a little at a block size of 512K.

> > Scott
> >
> 
> That (client count) brings up a good point ...
> 
> FIO is really not a good benchmark for NFS. Does anyone have SPECsfs
> numbers on NFS over RDMA to share ?
> 
> -- Wendy

What do you suggest for benchmarking NFS?

Yan

* Re: NFS over RDMA benchmark
@ 2013-04-18 16:16           ` Wendy Cheng
  0 siblings, 0 replies; 82+ messages in thread
From: Wendy Cheng @ 2013-04-18 16:16 UTC (permalink / raw)
  To: Yan Burman
  Cc: Atchley, Scott, J. Bruce Fields, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On Thu, Apr 18, 2013 at 5:47 AM, Yan Burman <yanb@mellanox.com> wrote:
>
>
> What do you suggest for benchmarking NFS?
>

I believe SPECsfs has been widely used by NFS (server) vendors to
position their product lines. Its workload was based on a real life
NFS deployment. I think it is geared more toward an office type of workload
(large client/user count with smaller file sizes, e.g. software
development with build, compile, etc).

BTW, we're experimenting with a similar project and would be interested to
know your findings.

-- Wendy

* Re: NFS over RDMA benchmark
@ 2013-04-23 21:06           ` J. Bruce Fields
  0 siblings, 0 replies; 82+ messages in thread
From: J. Bruce Fields @ 2013-04-23 21:06 UTC (permalink / raw)
  To: Yan Burman
  Cc: Wendy Cheng, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
> 
> 
> > -----Original Message-----
> > From: Wendy Cheng [mailto:s.wendy.cheng@gmail.com]
> > Sent: Wednesday, April 17, 2013 21:06
> > To: Atchley, Scott
> > Cc: Yan Burman; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org;
> > linux-nfs@vger.kernel.org
> > Subject: Re: NFS over RDMA benchmark
> > 
> > On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott <atchleyes@ornl.gov> wrote:
> > > On Apr 17, 2013, at 1:15 PM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
> > >
> > >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <yanb@mellanox.com> wrote:
> > >>> Hi.
> > >>>
> > >>> I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me.
> > >>> My setup consists of 2 servers each with 16 cores, 32Gb of memory, and Mellanox ConnectX3 QDR card over PCI-e gen3.
> > >>> These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime.
> > >>> I am running kernel 3.5.7.
> > >>>
> > >>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
> > >>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
> > >
> > > Yan,
> > >
> > > Are you trying to optimize single client performance or server performance with multiple clients?
> > >
> 
> I am trying to get maximum performance from a single server - I used 2 processes in fio test - more than 2 did not show any performance boost.
> I tried running fio from 2 different PCs on 2 different files, but the sum of the two is more or less the same as running from single client PC.
> 
> What I did see is that server is sweating a lot more than the clients and more than that, it has 1 core (CPU5) in 100% softirq tasklet:
> cat /proc/softirqs

Would any profiling help figure out which code it's spending time in?
(E.g. something as simple as "perf top" might have useful output.)

--b.

>                     CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15
>           HI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
>        TIMER:     418767      46596      43515      44547      50099      34815      40634      40337      39551      93442      73733      42631      42509      41592      40351      61793
>       NET_TX:      28719        309       1421       1294       1730       1243        832        937         11         44         41         20         26         19         15         29
>       NET_RX:     612070         19         22         21          6        235          3          2          9          6         17         16         20         13         16         10
>        BLOCK:       5941          0          0          0          0          0          0          0        519        259       1238        272        253        174        215       2618
> BLOCK_IOPOLL:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
>      TASKLET:         28          1          1          1          1    1540653          1          1         29          1          1          1          1          1          1          2
>        SCHED:     364965      26547      16807      18403      22919       8678      14358      14091      16981      64903      47141      18517      19179      18036      17037      38261
>      HRTIMER:         13          0          1          1          0          0          0          0          0          0          0          0          1          1          0          1
>          RCU:     945823     841546     715281     892762     823564      42663     863063     841622     333577     389013     393501     239103     221524     258159     313426     234030
> > >
> > >> Remember there are always gaps between wire speed (that ib_send_bw
> > >> measures) and real world applications.
> 
> I realize that, but I don't expect the difference to be more than twice.
> 
> > >>
> > >> That being said, does your server use default export (sync) option ?
> > >> Export the share with "async" option can bring you closer to wire
> > >> speed. However, the practice (async) is generally not recommended in
> > >> a real production system - as it can cause data integrity issues, e.g.
> > >> you have more chances to lose data when the boxes crash.
> 
> I am running with async export option, but that should not matter too much, since my backing storage is tmpfs mounted with noatime.
> 
> > >>
> > >> -- Wendy
> > >
> > >
> > > Wendy,
> > >
> > > It has been a few years since I looked at RPCRDMA, but I seem to remember that RPCs were limited to 32KB which means that you have to pipeline them to get linerate. In addition to requiring pipelining, the argument from the authors was that the goal was to maximize server performance and not single client performance.
> > >
> 
> What I see is that performance increases almost linearly up to block size 256K and falls a little at block size 512K
> 
> > > Scott
> > >
> > 
> > That (client count) brings up a good point ...
> > 
> > FIO is really not a good benchmark for NFS. Does anyone have SPECsfs
> > numbers on NFS over RDMA to share ?
> > 
> > -- Wendy
> 
> What do you suggest for benchmarking NFS?
> 
> Yan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: NFS over RDMA benchmark
@ 2013-04-24 12:35             ` Yan Burman
  0 siblings, 0 replies; 82+ messages in thread
From: Yan Burman @ 2013-04-24 12:35 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Wendy Cheng, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz



> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields@fieldses.org]
> Sent: Wednesday, April 24, 2013 00:06
> To: Yan Burman
> Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
> linux-nfs@vger.kernel.org; Or Gerlitz
> Subject: Re: NFS over RDMA benchmark
> 
> On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
> >
> >
> > > -----Original Message-----
> > > From: Wendy Cheng [mailto:s.wendy.cheng@gmail.com]
> > > Sent: Wednesday, April 17, 2013 21:06
> > > To: Atchley, Scott
> > > Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
> > > linux-rdma@vger.kernel.org; linux-nfs@vger.kernel.org
> > > Subject: Re: NFS over RDMA benchmark
> > >
> > > On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott <atchleyes@ornl.gov> wrote:
> > > > On Apr 17, 2013, at 1:15 PM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
> > > >
> > > >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <yanb@mellanox.com> wrote:
> > > >>> Hi.
> > > >>>
> > > >>> I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me.
> > > >>> My setup consists of 2 servers each with 16 cores, 32Gb of memory, and Mellanox ConnectX3 QDR card over PCI-e gen3.
> > > >>> These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime.
> > > >>> I am running kernel 3.5.7.
> > > >>>
> > > >>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
> > > >>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
> > > >
> > > > Yan,
> > > >
> > > > Are you trying to optimize single client performance or server performance with multiple clients?
> > > >
> >
> > I am trying to get maximum performance from a single server - I used 2 processes in fio test - more than 2 did not show any performance boost.
> > I tried running fio from 2 different PCs on 2 different files, but the sum of the two is more or less the same as running from single client PC.
> >
> > What I did see is that server is sweating a lot more than the clients and more than that, it has 1 core (CPU5) in 100% softirq tasklet:
> > cat /proc/softirqs
>
> Would any profiling help figure out which code it's spending time in?
> (E.g. something as simple as "perf top" might have useful output.)
>


Perf top for the CPU with high tasklet count gives:

             samples  pcnt         RIP        function                    DSO
             _______ _____ ________________ ___________________________ ___________________________________________________________________

             2787.00 24.1% ffffffff81062a00 mutex_spin_on_owner         /root/vmlinux
              978.00  8.4% ffffffff810297f0 clflush_cache_range         /root/vmlinux
              445.00  3.8% ffffffff812ea440 __domain_mapping            /root/vmlinux
              441.00  3.8% 0000000000018c30 svc_recv                    /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
              344.00  3.0% ffffffff813a1bc0 _raw_spin_lock_bh           /root/vmlinux
              333.00  2.9% ffffffff813a19e0 _raw_spin_lock_irqsave      /root/vmlinux
              288.00  2.5% ffffffff813a07d0 __schedule                  /root/vmlinux
              249.00  2.1% ffffffff811a87e0 rb_prev                     /root/vmlinux
              242.00  2.1% ffffffff813a19b0 _raw_spin_lock              /root/vmlinux
              184.00  1.6% 0000000000002e90 svc_rdma_sendto             /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
              177.00  1.5% ffffffff810ac820 get_page_from_freelist      /root/vmlinux
              174.00  1.5% ffffffff812e6da0 alloc_iova                  /root/vmlinux
              165.00  1.4% ffffffff810b1390 put_page                    /root/vmlinux
              148.00  1.3% 0000000000014760 sunrpc_cache_lookup         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
              128.00  1.1% 0000000000017f20 svc_xprt_enqueue            /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
              126.00  1.1% ffffffff8139f820 __mutex_lock_slowpath       /root/vmlinux
              108.00  0.9% ffffffff811a81d0 rb_insert_color             /root/vmlinux
              107.00  0.9% 0000000000004690 svc_rdma_recvfrom           /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
              102.00  0.9% 0000000000002640 send_reply                  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
               99.00  0.9% ffffffff810e6490 kmem_cache_alloc            /root/vmlinux
               96.00  0.8% ffffffff810e5840 __slab_alloc                /root/vmlinux
               91.00  0.8% 0000000000006d30 mlx4_ib_post_send           /lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
               88.00  0.8% 0000000000000dd0 svc_rdma_get_context        /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
               86.00  0.7% ffffffff813a1a10 _raw_spin_lock_irq          /root/vmlinux
               86.00  0.7% 0000000000001530 svc_rdma_send               /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
               85.00  0.7% ffffffff81060a80 prepare_creds               /root/vmlinux
               83.00  0.7% ffffffff810a5790 find_get_pages_contig       /root/vmlinux
               79.00  0.7% ffffffff810e4620 __slab_free                 /root/vmlinux
               79.00  0.7% ffffffff813a1a40 _raw_spin_unlock_irqrestore /root/vmlinux
               77.00  0.7% ffffffff81065610 finish_task_switch          /root/vmlinux
               76.00  0.7% ffffffff812e9270 pfn_to_dma_pte              /root/vmlinux
               75.00  0.6% ffffffff810976d0 __call_rcu                  /root/vmlinux
               73.00  0.6% ffffffff811a2fa0 _atomic_dec_and_lock        /root/vmlinux
               73.00  0.6% 00000000000002e0 svc_rdma_has_wspace         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
               67.00  0.6% ffffffff813a1a70 _raw_read_lock              /root/vmlinux
               65.00  0.6% 000000000000f590 svcauth_unix_set_client     /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
               63.00  0.5% 00000000000180e0 svc_reserve                 /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
               60.00  0.5% 00000000000064d0 stamp_send_wqe              /lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
               57.00  0.5% ffffffff810ac110 free_hot_cold_page          /root/vmlinux
               57.00  0.5% ffffffff811ae540 memcpy                      /root/vmlinux
               56.00  0.5% ffffffff810ad1a0 __alloc_pages_nodemask      /root/vmlinux
               55.00  0.5% ffffffff81118200 splice_to_pipe              /root/vmlinux
               53.00  0.5% ffffffff810e3bc0 get_partial_node            /root/vmlinux
               49.00  0.4% ffffffff812eb840 __intel_map_single          /root/vmlinux
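
For anyone who wants to summarise a listing like the one above, here is a minimal sketch (not part of the original report) that sums the pcnt column per DSO and for a chosen set of symbols. The helper names parse_perf_top, percent_by_dso and percent_matching are hypothetical, and the row layout is assumed to match the "perf top" text format shown in this message; the embedded sample is only the first few rows of the table.

```python
import re
from collections import defaultdict

# One data row: samples, percent, 16-digit hex address, symbol, DSO path.
# Header and underline rows do not match and are skipped.
ROW = re.compile(
    r"^\s*([\d.]+)\s+([\d.]+)%\s+([0-9a-f]{16})\s+(\S+)\s+(\S+)\s*$"
)


def parse_perf_top(text):
    """Return a list of (samples, percent, symbol, dso_basename) tuples."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            samples, pcnt, _addr, symbol, dso = m.groups()
            rows.append(
                (float(samples), float(pcnt), symbol, dso.rsplit("/", 1)[-1])
            )
    return rows


def percent_by_dso(rows):
    """Sum the percent column for each DSO (vmlinux, sunrpc.ko, ...)."""
    totals = defaultdict(float)
    for _samples, pcnt, _symbol, dso in rows:
        totals[dso] += pcnt
    return dict(totals)


def percent_matching(rows, names):
    """Sum the percent column for the symbols listed in `names`."""
    return sum(pcnt for _s, pcnt, symbol, _d in rows if symbol in names)


# First rows of the table above, copied verbatim.
SAMPLE = """\
 2787.00 24.1% ffffffff81062a00 mutex_spin_on_owner         /root/vmlinux
  978.00  8.4% ffffffff810297f0 clflush_cache_range         /root/vmlinux
  445.00  3.8% ffffffff812ea440 __domain_mapping            /root/vmlinux
  441.00  3.8% 0000000000018c30 svc_recv                    /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
  184.00  1.6% 0000000000002e90 svc_rdma_sendto             /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
"""

if __name__ == "__main__":
    rows = parse_perf_top(SAMPLE)
    by_dso = percent_by_dso(rows)
    for dso, pcnt in sorted(by_dso.items(), key=lambda kv: -kv[1]):
        print(f"{dso:12s} {pcnt:5.1f}%")
```

Run over the full table, this kind of grouping makes it easier to compare the time spent spinning on a mutex (mutex_spin_on_owner) with the time spent in the sunrpc and svcrdma modules themselves.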


> --b.
> 
> >                     CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15
> >           HI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
> >        TIMER:     418767      46596      43515      44547      50099      34815      40634      40337      39551      93442      73733      42631      42509      41592      40351      61793
> >       NET_TX:      28719        309       1421       1294       1730       1243        832        937         11         44         41         20         26         19         15         29
> >       NET_RX:     612070         19         22         21          6        235          3          2          9          6         17         16         20         13         16         10
> >        BLOCK:       5941          0          0          0          0          0          0          0        519        259       1238        272        253        174        215       2618
> > BLOCK_IOPOLL:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
> >      TASKLET:         28          1          1          1          1    1540653          1          1         29          1          1          1          1          1          1          2
> >        SCHED:     364965      26547      16807      18403      22919       8678      14358      14091      16981      64903      47141      18517      19179      18036      17037      38261
> >      HRTIMER:         13          0          1          1          0          0          0          0          0          0          0          0          1          1          0          1
> >          RCU:     945823     841546     715281     892762     823564      42663     863063     841622     333577     389013     393501     239103     221524     258159     313426     234030
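
The imbalance in the quoted /proc/softirqs table (nearly all TASKLET work on CPU5) can be checked mechanically. The following is a rough sketch, not something from the original thread: parse_softirqs and hottest_cpu are hypothetical helper names, and the input is assumed to be the unquoted, unwrapped output of "cat /proc/softirqs". The sample embeds only the header and two rows from the table.

```python
def parse_softirqs(text):
    """Parse /proc/softirqs-style text into (cpu_names, {softirq: counts})."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        name, _, rest = ln.partition(":")
        table[name.strip()] = [int(x) for x in rest.split()]
    return cpus, table


def hottest_cpu(cpus, counts):
    """Return (cpu_name, share_of_total) for the CPU with the highest count."""
    total = sum(counts)
    idx = max(range(len(counts)), key=counts.__getitem__)
    share = counts[idx] / total if total else 0.0
    return cpus[idx], share


# Header plus the NET_RX and TASKLET rows from the table in this thread.
SAMPLE = """\
                    CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15
      NET_RX:     612070         19         22         21          6        235          3          2          9          6         17         16         20         13         16         10
     TASKLET:         28          1          1          1          1    1540653          1          1         29          1          1          1          1          1          1          2
"""

if __name__ == "__main__":
    # On a live Linux system, SAMPLE could be replaced by the contents
    # of /proc/softirqs.
    cpus, table = parse_softirqs(SAMPLE)
    for name, counts in table.items():
        cpu, share = hottest_cpu(cpus, counts)
        print(f"{name:>12s}: {cpu} handles {share:.1%} of the total")
```

Since the counters are cumulative since boot, taking two snapshots during a benchmark run and comparing the differences would give a cleaner picture than a single reading.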

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-24 15:05               ` J. Bruce Fields
  0 siblings, 0 replies; 82+ messages in thread
From: J. Bruce Fields @ 2013-04-24 15:05 UTC (permalink / raw)
  To: Yan Burman
  Cc: Wendy Cheng, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
> 
> 
> > -----Original Message-----
> > From: J. Bruce Fields [mailto:bfields@fieldses.org]
> > Sent: Wednesday, April 24, 2013 00:06
> > To: Yan Burman
> > Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
> > linux-nfs@vger.kernel.org; Or Gerlitz
> > Subject: Re: NFS over RDMA benchmark
> > 
> > On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Wendy Cheng [mailto:s.wendy.cheng@gmail.com]
> > > > Sent: Wednesday, April 17, 2013 21:06
> > > > To: Atchley, Scott
> > > > Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
> > > > linux-rdma@vger.kernel.org; linux-nfs@vger.kernel.org
> > > > Subject: Re: NFS over RDMA benchmark
> > > >
> > > > On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott <atchleyes@ornl.gov> wrote:
> > > > > On Apr 17, 2013, at 1:15 PM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
> > > > >
> > > > >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <yanb@mellanox.com> wrote:
> > > > >>> Hi.
> > > > >>>
> > > > >>> I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me.
> > > > >>> My setup consists of 2 servers each with 16 cores, 32Gb of memory, and Mellanox ConnectX3 QDR card over PCI-e gen3.
> > > > >>> These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime.
> > > > >>> I am running kernel 3.5.7.
> > > > >>>
> > > > >>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
> > > > >>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
> > > > >
> > > > > Yan,
> > > > >
> > > > > Are you trying to optimize single client performance or server performance with multiple clients?
> > > > >
> > >
> > > I am trying to get maximum performance from a single server - I used 2 processes in fio test - more than 2 did not show any performance boost.
> > > I tried running fio from 2 different PCs on 2 different files, but the sum of the two is more or less the same as running from single client PC.
> > >
> > > What I did see is that server is sweating a lot more than the clients and more than that, it has 1 core (CPU5) in 100% softirq tasklet:
> > > cat /proc/softirqs
> > 
> > Would any profiling help figure out which code it's spending time in?
> > (E.g. something simple as "perf top" might have useful output.)
> > 
> 
> 
> Perf top for the CPU with high tasklet count gives:
> 
>              samples  pcnt         RIP        function                    DSO
>              _______ _____ ________________ ___________________________ ___________________________________________________________________
> 
>              2787.00 24.1% ffffffff81062a00 mutex_spin_on_owner         /root/vmlinux

I guess that means lots of contention on some mutex?  If only we knew
which one.... perf should also be able to collect stack statistics, I
forget how.

--b.

>               978.00  8.4% ffffffff810297f0 clflush_cache_range         /root/vmlinux
>               445.00  3.8% ffffffff812ea440 __domain_mapping            /root/vmlinux
>               441.00  3.8% 0000000000018c30 svc_recv                    /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>               344.00  3.0% ffffffff813a1bc0 _raw_spin_lock_bh           /root/vmlinux
>               333.00  2.9% ffffffff813a19e0 _raw_spin_lock_irqsave      /root/vmlinux
>               288.00  2.5% ffffffff813a07d0 __schedule                  /root/vmlinux
>               249.00  2.1% ffffffff811a87e0 rb_prev                     /root/vmlinux
>               242.00  2.1% ffffffff813a19b0 _raw_spin_lock              /root/vmlinux
>               184.00  1.6% 0000000000002e90 svc_rdma_sendto             /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>               177.00  1.5% ffffffff810ac820 get_page_from_freelist      /root/vmlinux
>               174.00  1.5% ffffffff812e6da0 alloc_iova                  /root/vmlinux
>               165.00  1.4% ffffffff810b1390 put_page                    /root/vmlinux
>               148.00  1.3% 0000000000014760 sunrpc_cache_lookup         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>               128.00  1.1% 0000000000017f20 svc_xprt_enqueue            /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>               126.00  1.1% ffffffff8139f820 __mutex_lock_slowpath       /root/vmlinux
>               108.00  0.9% ffffffff811a81d0 rb_insert_color             /root/vmlinux
>               107.00  0.9% 0000000000004690 svc_rdma_recvfrom           /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>               102.00  0.9% 0000000000002640 send_reply                  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>                99.00  0.9% ffffffff810e6490 kmem_cache_alloc            /root/vmlinux
>                96.00  0.8% ffffffff810e5840 __slab_alloc                /root/vmlinux
>                91.00  0.8% 0000000000006d30 mlx4_ib_post_send           /lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
>                88.00  0.8% 0000000000000dd0 svc_rdma_get_context        /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>                86.00  0.7% ffffffff813a1a10 _raw_spin_lock_irq          /root/vmlinux
>                86.00  0.7% 0000000000001530 svc_rdma_send               /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>                85.00  0.7% ffffffff81060a80 prepare_creds               /root/vmlinux
>                83.00  0.7% ffffffff810a5790 find_get_pages_contig       /root/vmlinux
>                79.00  0.7% ffffffff810e4620 __slab_free                 /root/vmlinux
>                79.00  0.7% ffffffff813a1a40 _raw_spin_unlock_irqrestore /root/vmlinux
>                77.00  0.7% ffffffff81065610 finish_task_switch          /root/vmlinux
>                76.00  0.7% ffffffff812e9270 pfn_to_dma_pte              /root/vmlinux
>                75.00  0.6% ffffffff810976d0 __call_rcu                  /root/vmlinux
>                73.00  0.6% ffffffff811a2fa0 _atomic_dec_and_lock        /root/vmlinux
>                73.00  0.6% 00000000000002e0 svc_rdma_has_wspace         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>                67.00  0.6% ffffffff813a1a70 _raw_read_lock              /root/vmlinux
>                65.00  0.6% 000000000000f590 svcauth_unix_set_client     /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>                63.00  0.5% 00000000000180e0 svc_reserve                 /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>                60.00  0.5% 00000000000064d0 stamp_send_wqe              /lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
>                57.00  0.5% ffffffff810ac110 free_hot_cold_page          /root/vmlinux
>                57.00  0.5% ffffffff811ae540 memcpy                      /root/vmlinux
>                56.00  0.5% ffffffff810ad1a0 __alloc_pages_nodemask      /root/vmlinux
>                55.00  0.5% ffffffff81118200 splice_to_pipe              /root/vmlinux
>                53.00  0.5% ffffffff810e3bc0 get_partial_node            /root/vmlinux
>                49.00  0.4% ffffffff812eb840 __intel_map_single          /root/vmlinux
> 
> 
> > --b.
> > 
> > >                     CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15
> > >           HI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
> > >        TIMER:     418767      46596      43515      44547      50099      34815      40634      40337      39551      93442      73733      42631      42509      41592      40351      61793
> > >       NET_TX:      28719        309       1421       1294       1730       1243        832        937         11         44         41         20         26         19         15         29
> > >       NET_RX:     612070         19         22         21          6        235          3          2          9          6         17         16         20         13         16         10
> > >        BLOCK:       5941          0          0          0          0          0          0          0        519        259       1238        272        253        174        215       2618
> > > BLOCK_IOPOLL:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
> > >      TASKLET:         28          1          1          1          1    1540653          1          1         29          1          1          1          1          1          1          2
> > >        SCHED:     364965      26547      16807      18403      22919       8678      14358      14091      16981      64903      47141      18517      19179      18036      17037      38261
> > >      HRTIMER:         13          0          1          1          0          0          0          0          0          0          0          0          1          1          0          1
> > >          RCU:     945823     841546     715281     892762     823564      42663     863063     841622     333577     389013     393501     239103     221524     258159     313426     234030
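
For anyone who wants to pick the hot CPU out of /proc/softirqs without
reading the columns by eye, here is a minimal sketch. The helper name
busiest_cpu and its optional file argument are made up for illustration;
it assumes the usual layout of a CPU header line followed by one
"NAME: count count ..." row per softirq type. Note the counters are
cumulative since boot, so this shows where the counts have piled up, not
the current rate.

```shell
# busiest_cpu NAME [FILE]
# Print the CPU column with the highest count for softirq NAME
# (e.g. TASKLET), reading FILE or /proc/softirqs by default.
# Hypothetical helper, not part of any existing tool.
busiest_cpu() {
    awk -v name="$1:" '
        # First line is the header: remember the CPU label of each column.
        NR == 1 { for (i = 1; i <= NF; i++) cpu[i] = $i; next }
        # Matching row: field 1 is the label, fields 2..NF are the counts.
        $1 == name {
            best = 2
            for (i = 3; i <= NF; i++)
                if ($i + 0 > $best + 0) best = i
            print cpu[best - 1], $best
        }
    ' "${2:-/proc/softirqs}"
}
```

On the numbers quoted above, "busiest_cpu TASKLET" would report CPU5 with
1540653. To see a rate rather than a total, one could take two snapshots
a few seconds apart and compare them.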

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-24 15:26                 ` J. Bruce Fields
  0 siblings, 0 replies; 82+ messages in thread
From: J. Bruce Fields @ 2013-04-24 15:26 UTC (permalink / raw)
  To: Yan Burman
  Cc: Wendy Cheng, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
> On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: J. Bruce Fields [mailto:bfields@fieldses.org]
> > > Sent: Wednesday, April 24, 2013 00:06
> > > To: Yan Burman
> > > Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
> > > linux-nfs@vger.kernel.org; Or Gerlitz
> > > Subject: Re: NFS over RDMA benchmark
> > > 
> > > On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Wendy Cheng [mailto:s.wendy.cheng@gmail.com]
> > > > > Sent: Wednesday, April 17, 2013 21:06
> > > > > To: Atchley, Scott
> > > > > Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
> > > > > linux-rdma@vger.kernel.org; linux-nfs@vger.kernel.org
> > > > > Subject: Re: NFS over RDMA benchmark
> > > > >
> > > > > On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott <atchleyes@ornl.gov> wrote:
> > > > > > On Apr 17, 2013, at 1:15 PM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
> > > > > >
> > > > > >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <yanb@mellanox.com> wrote:
> > > > > >>> Hi.
> > > > > >>>
> > > > > >>> I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me.
> > > > > >>> My setup consists of 2 servers each with 16 cores, 32Gb of memory, and Mellanox ConnectX3 QDR card over PCI-e gen3.
> > > > > >>> These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime.
> > > > > >>> I am running kernel 3.5.7.
> > > > > >>>
> > > > > >>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
> > > > > >>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
> > > > > >
> > > > > > Yan,
> > > > > >
> > > > > > Are you trying to optimize single client performance or server performance with multiple clients?
> > > > > >
> > > >
> > > > I am trying to get maximum performance from a single server - I used 2 processes in fio test - more than 2 did not show any performance boost.
> > > > I tried running fio from 2 different PCs on 2 different files, but the sum of the two is more or less the same as running from single client PC.
> > > >
> > > > What I did see is that server is sweating a lot more than the clients and more than that, it has 1 core (CPU5) in 100% softirq tasklet:
> > > > cat /proc/softirqs
> > > 
> > > Would any profiling help figure out which code it's spending time in?
> > > (E.g. something simple as "perf top" might have useful output.)
> > > 
> > 
> > 
> > Perf top for the CPU with high tasklet count gives:
> > 
> >              samples  pcnt         RIP        function                    DSO
> >              _______ _____ ________________ ___________________________ ___________________________________________________________________
> > 
> >              2787.00 24.1% ffffffff81062a00 mutex_spin_on_owner         /root/vmlinux
> 
> I guess that means lots of contention on some mutex?  If only we knew
> which one.... perf should also be able to collect stack statistics, I
> forget how.

Googling around....  I think we want:

	perf record -a --call-graph
	(give it a chance to collect some samples, then ^C)
	perf report --call-graph --stdio

--b.
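
A narrower variant of the commands above, as a rough sketch: it samples
only the one busy CPU (CPU5 in this thread) for a fixed time and then
prints the call-graph report. The function name profile_cpu and the RUN
switch are made up for illustration; the perf options used (-C, -g, -o,
-i, --stdio) are standard ones. As far as I know, newer perf versions
expect a mode argument after --call-graph, which is why plain -g is used
here; treat that as an assumption and check "perf record --help" on your
build.

```shell
# profile_cpu CPU [SECONDS]
# Record call graphs on a single CPU for SECONDS (default 10), then
# print the report. Without RUN=1 it only echoes the record command,
# so it can be inspected first. Hypothetical helper; needs root and a
# perf binary matching the running kernel when actually run.
profile_cpu() {
    cpu="$1"
    secs="${2:-10}"
    # Build the record command as the positional parameters.
    set -- perf record -C "$cpu" -g -o perf.data -- sleep "$secs"
    if [ "${RUN:-0}" = "1" ]; then
        "$@" && perf report -i perf.data --stdio
    else
        echo "$*"
    fi
}
```

For example, "RUN=1 profile_cpu 5 10" on the server while fio is running
should show which callers end up in mutex_spin_on_owner, assuming the
kernel was built with frame pointers or usable unwind information.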

> 
> --b.
> 
> >               978.00  8.4% ffffffff810297f0 clflush_cache_range         /root/vmlinux
> >               445.00  3.8% ffffffff812ea440 __domain_mapping            /root/vmlinux
> >               441.00  3.8% 0000000000018c30 svc_recv                    /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
> >               344.00  3.0% ffffffff813a1bc0 _raw_spin_lock_bh           /root/vmlinux
> >               333.00  2.9% ffffffff813a19e0 _raw_spin_lock_irqsave      /root/vmlinux
> >               288.00  2.5% ffffffff813a07d0 __schedule                  /root/vmlinux
> >               249.00  2.1% ffffffff811a87e0 rb_prev                     /root/vmlinux
> >               242.00  2.1% ffffffff813a19b0 _raw_spin_lock              /root/vmlinux
> >               184.00  1.6% 0000000000002e90 svc_rdma_sendto             /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
> >               177.00  1.5% ffffffff810ac820 get_page_from_freelist      /root/vmlinux
> >               174.00  1.5% ffffffff812e6da0 alloc_iova                  /root/vmlinux
> >               165.00  1.4% ffffffff810b1390 put_page                    /root/vmlinux
> >               148.00  1.3% 0000000000014760 sunrpc_cache_lookup         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
> >               128.00  1.1% 0000000000017f20 svc_xprt_enqueue            /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
> >               126.00  1.1% ffffffff8139f820 __mutex_lock_slowpath       /root/vmlinux
> >               108.00  0.9% ffffffff811a81d0 rb_insert_color             /root/vmlinux
> >               107.00  0.9% 0000000000004690 svc_rdma_recvfrom           /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
> >               102.00  0.9% 0000000000002640 send_reply                  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
> >                99.00  0.9% ffffffff810e6490 kmem_cache_alloc            /root/vmlinux
> >                96.00  0.8% ffffffff810e5840 __slab_alloc                /root/vmlinux
> >                91.00  0.8% 0000000000006d30 mlx4_ib_post_send           /lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
> >                88.00  0.8% 0000000000000dd0 svc_rdma_get_context        /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
> >                86.00  0.7% ffffffff813a1a10 _raw_spin_lock_irq          /root/vmlinux
> >                86.00  0.7% 0000000000001530 svc_rdma_send               /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
> >                85.00  0.7% ffffffff81060a80 prepare_creds               /root/vmlinux
> >                83.00  0.7% ffffffff810a5790 find_get_pages_contig       /root/vmlinux
> >                79.00  0.7% ffffffff810e4620 __slab_free                 /root/vmlinux
> >                79.00  0.7% ffffffff813a1a40 _raw_spin_unlock_irqrestore /root/vmlinux
> >                77.00  0.7% ffffffff81065610 finish_task_switch          /root/vmlinux
> >                76.00  0.7% ffffffff812e9270 pfn_to_dma_pte              /root/vmlinux
> >                75.00  0.6% ffffffff810976d0 __call_rcu                  /root/vmlinux
> >                73.00  0.6% ffffffff811a2fa0 _atomic_dec_and_lock        /root/vmlinux
> >                73.00  0.6% 00000000000002e0 svc_rdma_has_wspace         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
> >                67.00  0.6% ffffffff813a1a70 _raw_read_lock              /root/vmlinux
> >                65.00  0.6% 000000000000f590 svcauth_unix_set_client     /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
> >                63.00  0.5% 00000000000180e0 svc_reserve                 /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
> >                60.00  0.5% 00000000000064d0 stamp_send_wqe              /lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
> >                57.00  0.5% ffffffff810ac110 free_hot_cold_page          /root/vmlinux
> >                57.00  0.5% ffffffff811ae540 memcpy                      /root/vmlinux
> >                56.00  0.5% ffffffff810ad1a0 __alloc_pages_nodemask      /root/vmlinux
> >                55.00  0.5% ffffffff81118200 splice_to_pipe              /root/vmlinux
> >                53.00  0.5% ffffffff810e3bc0 get_partial_node            /root/vmlinux
> >                49.00  0.4% ffffffff812eb840 __intel_map_single          /root/vmlinux
> > 
> > 
> > > --b.
> > > 
> > > >                     CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15
> > > >           HI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
> > > >        TIMER:     418767      46596      43515      44547      50099      34815      40634      40337      39551      93442      73733      42631      42509      41592      40351      61793
> > > >       NET_TX:      28719        309       1421       1294       1730       1243        832        937         11         44         41         20         26         19         15         29
> > > >       NET_RX:     612070         19         22         21          6        235          3          2          9          6         17         16         20         13         16         10
> > > >        BLOCK:       5941          0          0          0          0          0          0          0        519        259       1238        272        253        174        215       2618
> > > > BLOCK_IOPOLL:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
> > > >      TASKLET:         28          1          1          1          1    1540653          1          1         29          1          1          1          1          1          1          2
> > > >        SCHED:     364965      26547      16807      18403      22919       8678      14358      14091      16981      64903      47141      18517      19179      18036      17037      38261
> > > >      HRTIMER:         13          0          1          1          0          0          0          0          0          0          0          0          1          1          0          1
> > > >          RCU:     945823     841546     715281     892762     823564      42663     863063     841622     333577     389013     393501     239103     221524     258159     313426     234030

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-24 15:26                 ` J. Bruce Fields
  0 siblings, 0 replies; 82+ messages in thread
From: J. Bruce Fields @ 2013-04-24 15:26 UTC (permalink / raw)
  To: Yan Burman
  Cc: Wendy Cheng, Atchley, Scott, Tom Tucker,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Or Gerlitz

On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
> On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: J. Bruce Fields [mailto:bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org]
> > > Sent: Wednesday, April 24, 2013 00:06
> > > To: Yan Burman
> > > Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org;
> > > linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Or Gerlitz
> > > Subject: Re: NFS over RDMA benchmark
> > > 
> > > On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Wendy Cheng [mailto:s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
> > > > > Sent: Wednesday, April 17, 2013 21:06
> > > > > To: Atchley, Scott
> > > > > Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
> > > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > > Subject: Re: NFS over RDMA benchmark
> > > > >
> > > > > On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott
> > > > > <atchleyes-1Heg1YXhbW8@public.gmane.org>
> > > > > wrote:
> > > > > > On Apr 17, 2013, at 1:15 PM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > > > > wrote:
> > > > > >
> > > > > >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <yanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > > > > wrote:
> > > > > >>> Hi.
> > > > > >>>
> > > > > > >>> I've been trying to do some benchmarks for NFS over RDMA and I
> > > > > > >>> seem to only get about half of the bandwidth that the HW can give me.
> > > > > > >>> My setup consists of 2 servers each with 16 cores, 32Gb of
> > > > > > >>> memory, and Mellanox ConnectX3 QDR card over PCI-e gen3.
> > > > > > >>> These servers are connected to a QDR IB switch. The backing
> > > > > > >>> storage on the server is tmpfs mounted with noatime.
> > > > > > >>> I am running kernel 3.5.7.
> > > > > > >>>
> > > > > > >>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
> > > > > > >>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for
> > > > > > >>> the same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
> > > > > >
> > > > > > Yan,
> > > > > >
> > > > > > > Are you trying to optimize single client performance or server
> > > > > > > performance with multiple clients?
> > > > > >
> > > >
> > > > I am trying to get maximum performance from a single server - I used 2
> > > > processes in fio test - more than 2 did not show any performance boost.
> > > > I tried running fio from 2 different PCs on 2 different files, but the sum of
> > > > the two is more or less the same as running from single client PC.
> > > > >
> > > > What I did see is that server is sweating a lot more than the clients and
> > > > more than that, it has 1 core (CPU5) in 100% softirq tasklet:
> > > > cat /proc/softirqs
> > > 
> > > Would any profiling help figure out which code it's spending time in?
> > > (E.g. something simple as "perf top" might have useful output.)
> > > 
> > 
> > 
> > Perf top for the CPU with high tasklet count gives:
> > 
> >              samples  pcnt         RIP        function                    DSO
> >              _______ _____ ________________ ___________________________ ___________________________________________________________________
> > 
> >              2787.00 24.1% ffffffff81062a00 mutex_spin_on_owner         /root/vmlinux
> 
> I guess that means lots of contention on some mutex?  If only we knew
> which one.... perf should also be able to collect stack statistics, I
> forget how.

Googling around....  I think we want:

	perf record -a --call-graph
	(give it a chance to collect some samples, then ^C)
	perf report --call-graph --stdio

--b.
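
Until call graphs are available, one rough way to see how much of the
quoted perf top listing is lock-related is to sum the percentages by
symbol name. This is only a sketch: the seven rows are copied from the
listing quoted in this message, and parse_rows() and share() are
illustrative helper names, not part of perf.

```python
import re

# A few rows copied from the perf top listing quoted in this thread
# (columns: samples, pcnt, RIP, function, DSO).
PERF_TOP = """\
2787.00 24.1% ffffffff81062a00 mutex_spin_on_owner         /root/vmlinux
 978.00  8.4% ffffffff810297f0 clflush_cache_range         /root/vmlinux
 344.00  3.0% ffffffff813a1bc0 _raw_spin_lock_bh           /root/vmlinux
 333.00  2.9% ffffffff813a19e0 _raw_spin_lock_irqsave      /root/vmlinux
 242.00  2.1% ffffffff813a19b0 _raw_spin_lock              /root/vmlinux
 126.00  1.1% ffffffff8139f820 __mutex_lock_slowpath       /root/vmlinux
  86.00  0.7% ffffffff813a1a10 _raw_spin_lock_irq          /root/vmlinux
"""


def parse_rows(text):
    """Return (function, samples, percent) tuples for well-formed rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip headers, separators and blank lines
        samples, pcnt, _rip, func, _dso = fields
        rows.append((func, float(samples), float(pcnt.rstrip("%"))))
    return rows


def share(rows, pattern):
    """Sum the percent column over functions whose name matches pattern."""
    return round(sum(p for f, _s, p in rows if re.search(pattern, f)), 1)


rows = parse_rows(PERF_TOP)
mutex_pct = share(rows, r"mutex")      # mutex spinning and slow path
spin_pct = share(rows, r"spin_lock")   # the _raw_spin_lock* variants
print(f"mutex-related: {mutex_pct}%  spinlock-related: {spin_pct}%")
```

This only says that locking dominates the profile on that CPU. It does
not say which mutex is contended, which is what the call graphs are for.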

> 
> --b.
> 
> >               978.00  8.4% ffffffff810297f0 clflush_cache_range         /root/vmlinux
> >               445.00  3.8% ffffffff812ea440 __domain_mapping            /root/vmlinux
> >               441.00  3.8% 0000000000018c30 svc_recv                    /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
> >               344.00  3.0% ffffffff813a1bc0 _raw_spin_lock_bh           /root/vmlinux
> >               333.00  2.9% ffffffff813a19e0 _raw_spin_lock_irqsave      /root/vmlinux
> >               288.00  2.5% ffffffff813a07d0 __schedule                  /root/vmlinux
> >               249.00  2.1% ffffffff811a87e0 rb_prev                     /root/vmlinux
> >               242.00  2.1% ffffffff813a19b0 _raw_spin_lock              /root/vmlinux
> >               184.00  1.6% 0000000000002e90 svc_rdma_sendto             /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
> >               177.00  1.5% ffffffff810ac820 get_page_from_freelist      /root/vmlinux
> >               174.00  1.5% ffffffff812e6da0 alloc_iova                  /root/vmlinux
> >               165.00  1.4% ffffffff810b1390 put_page                    /root/vmlinux
> >               148.00  1.3% 0000000000014760 sunrpc_cache_lookup         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
> >               128.00  1.1% 0000000000017f20 svc_xprt_enqueue            /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
> >               126.00  1.1% ffffffff8139f820 __mutex_lock_slowpath       /root/vmlinux
> >               108.00  0.9% ffffffff811a81d0 rb_insert_color             /root/vmlinux
> >               107.00  0.9% 0000000000004690 svc_rdma_recvfrom           /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
> >               102.00  0.9% 0000000000002640 send_reply                  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
> >                99.00  0.9% ffffffff810e6490 kmem_cache_alloc            /root/vmlinux
> >                96.00  0.8% ffffffff810e5840 __slab_alloc                /root/vmlinux
> >                91.00  0.8% 0000000000006d30 mlx4_ib_post_send           /lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
> >                88.00  0.8% 0000000000000dd0 svc_rdma_get_context        /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
> >                86.00  0.7% ffffffff813a1a10 _raw_spin_lock_irq          /root/vmlinux
> >                86.00  0.7% 0000000000001530 svc_rdma_send               /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
> >                85.00  0.7% ffffffff81060a80 prepare_creds               /root/vmlinux
> >                83.00  0.7% ffffffff810a5790 find_get_pages_contig       /root/vmlinux
> >                79.00  0.7% ffffffff810e4620 __slab_free                 /root/vmlinux
> >                79.00  0.7% ffffffff813a1a40 _raw_spin_unlock_irqrestore /root/vmlinux
> >                77.00  0.7% ffffffff81065610 finish_task_switch          /root/vmlinux
> >                76.00  0.7% ffffffff812e9270 pfn_to_dma_pte              /root/vmlinux
> >                75.00  0.6% ffffffff810976d0 __call_rcu                  /root/vmlinux
> >                73.00  0.6% ffffffff811a2fa0 _atomic_dec_and_lock        /root/vmlinux
> >                73.00  0.6% 00000000000002e0 svc_rdma_has_wspace         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
> >                67.00  0.6% ffffffff813a1a70 _raw_read_lock              /root/vmlinux
> >                65.00  0.6% 000000000000f590 svcauth_unix_set_client     /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
> >                63.00  0.5% 00000000000180e0 svc_reserve                 /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
> >                60.00  0.5% 00000000000064d0 stamp_send_wqe              /lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
> >                57.00  0.5% ffffffff810ac110 free_hot_cold_page          /root/vmlinux
> >                57.00  0.5% ffffffff811ae540 memcpy                      /root/vmlinux
> >                56.00  0.5% ffffffff810ad1a0 __alloc_pages_nodemask      /root/vmlinux
> >                55.00  0.5% ffffffff81118200 splice_to_pipe              /root/vmlinux
> >                53.00  0.5% ffffffff810e3bc0 get_partial_node            /root/vmlinux
> >                49.00  0.4% ffffffff812eb840 __intel_map_single          /root/vmlinux
> > 
> > 
> > > --b.
> > > 
> > > >                     CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15
> > > >           HI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
> > > >        TIMER:     418767      46596      43515      44547      50099      34815      40634      40337      39551      93442      73733      42631      42509      41592      40351      61793
> > > >       NET_TX:      28719        309       1421       1294       1730       1243        832        937         11         44         41         20         26         19         15         29
> > > >       NET_RX:     612070         19         22         21          6        235          3          2          9          6         17         16         20         13         16         10
> > > >        BLOCK:       5941          0          0          0          0          0          0          0        519        259       1238        272        253        174        215       2618
> > > > BLOCK_IOPOLL:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
> > > >      TASKLET:         28          1          1          1          1    1540653          1          1         29          1          1          1          1          1          1          2
> > > >        SCHED:     364965      26547      16807      18403      22919       8678      14358      14091      16981      64903      47141      18517      19179      18036      17037      38261
> > > >      HRTIMER:         13          0          1          1          0          0          0          0          0          0          0          0          1          1          0          1
> > > >          RCU:     945823     841546     715281     892762     823564      42663     863063     841622     333577     389013     393501     239103     221524     258159     313426     234030
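
The per-CPU imbalance in the quoted /proc/softirqs output can be checked
mechanically. This is a minimal sketch, assuming each softirq row has
been unwrapped onto a single line. The two rows are copied from the
output quoted in this message, and hottest_cpu() is a made-up helper
name.

```python
# Two rows copied from the /proc/softirqs output quoted in this thread,
# one value per CPU (CPU0 .. CPU15).
SOFTIRQS = """\
NET_RX: 612070 19 22 21 6 235 3 2 9 6 17 16 20 13 16 10
TASKLET: 28 1 1 1 1 1540653 1 1 29 1 1 1 1 1 1 2
"""


def hottest_cpu(text):
    """Map each softirq name to (busiest CPU index, its share of the row)."""
    result = {}
    for line in text.splitlines():
        name, _, counts = line.partition(":")
        values = [int(v) for v in counts.split()]
        total = sum(values)
        if total == 0:
            continue  # empty line or a softirq that never fired
        cpu = max(range(len(values)), key=values.__getitem__)
        result[name.strip()] = (cpu, values[cpu] / total)
    return result


hot = hottest_cpu(SOFTIRQS)
for name, (cpu, fraction) in hot.items():
    print(f"{name}: CPU{cpu} handled {fraction:.2%} of the events")
```

On these two rows, nearly all TASKLET work lands on CPU5 and nearly all
NET_RX work on CPU0, which matches the single saturated core described
in the report.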

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-24 16:27                   ` Wendy Cheng
  0 siblings, 0 replies; 82+ messages in thread
From: Wendy Cheng @ 2013-04-24 16:27 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Yan Burman, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
>> On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
>> >
>> >
>> >
>> > Perf top for the CPU with high tasklet count gives:
>> >
>> >              samples  pcnt         RIP        function                    DSO
>> >              _______ _____ ________________ ___________________________ ___________________________________________________________________
>> >
>> >              2787.00 24.1% ffffffff81062a00 mutex_spin_on_owner         /root/vmlinux
>>
>> I guess that means lots of contention on some mutex?  If only we knew
>> which one.... perf should also be able to collect stack statistics, I
>> forget how.
>
> Googling around....  I think we want:
>
>         perf record -a --call-graph
>         (give it a chance to collect some samples, then ^C)
>         perf report --call-graph --stdio
>

I have not looked at the NFS RDMA (or 3.x kernel) source yet. But see
that "rb_prev" up in the #8 spot? Do we have a red-black tree somewhere
in these paths? Trees like that require extensive locking.

-- Wendy

.
>>
>> >               978.00  8.4% ffffffff810297f0 clflush_cache_range         /root/vmlinux
>> >               445.00  3.8% ffffffff812ea440 __domain_mapping            /root/vmlinux
>> >               441.00  3.8% 0000000000018c30 svc_recv                    /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>> >               344.00  3.0% ffffffff813a1bc0 _raw_spin_lock_bh           /root/vmlinux
>> >               333.00  2.9% ffffffff813a19e0 _raw_spin_lock_irqsave      /root/vmlinux
>> >               288.00  2.5% ffffffff813a07d0 __schedule                  /root/vmlinux
>> >               249.00  2.1% ffffffff811a87e0 rb_prev                     /root/vmlinux
>> >               242.00  2.1% ffffffff813a19b0 _raw_spin_lock              /root/vmlinux
>> >               184.00  1.6% 0000000000002e90 svc_rdma_sendto             /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>> >               177.00  1.5% ffffffff810ac820 get_page_from_freelist      /root/vmlinux
>> >               174.00  1.5% ffffffff812e6da0 alloc_iova                  /root/vmlinux
>> >               165.00  1.4% ffffffff810b1390 put_page                    /root/vmlinux
>> >               148.00  1.3% 0000000000014760 sunrpc_cache_lookup         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>> >               128.00  1.1% 0000000000017f20 svc_xprt_enqueue            /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>> >               126.00  1.1% ffffffff8139f820 __mutex_lock_slowpath       /root/vmlinux
>> >               108.00  0.9% ffffffff811a81d0 rb_insert_color             /root/vmlinux
>> >               107.00  0.9% 0000000000004690 svc_rdma_recvfrom           /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>> >               102.00  0.9% 0000000000002640 send_reply                  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>> >                99.00  0.9% ffffffff810e6490 kmem_cache_alloc            /root/vmlinux
>> >                96.00  0.8% ffffffff810e5840 __slab_alloc                /root/vmlinux
>> >                91.00  0.8% 0000000000006d30 mlx4_ib_post_send           /lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
>> >                88.00  0.8% 0000000000000dd0 svc_rdma_get_context        /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>> >                86.00  0.7% ffffffff813a1a10 _raw_spin_lock_irq          /root/vmlinux
>> >                86.00  0.7% 0000000000001530 svc_rdma_send               /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>> >                85.00  0.7% ffffffff81060a80 prepare_creds               /root/vmlinux
>> >                83.00  0.7% ffffffff810a5790 find_get_pages_contig       /root/vmlinux
>> >                79.00  0.7% ffffffff810e4620 __slab_free                 /root/vmlinux
>> >                79.00  0.7% ffffffff813a1a40 _raw_spin_unlock_irqrestore /root/vmlinux
>> >                77.00  0.7% ffffffff81065610 finish_task_switch          /root/vmlinux
>> >                76.00  0.7% ffffffff812e9270 pfn_to_dma_pte              /root/vmlinux
>> >                75.00  0.6% ffffffff810976d0 __call_rcu                  /root/vmlinux
>> >                73.00  0.6% ffffffff811a2fa0 _atomic_dec_and_lock        /root/vmlinux
>> >                73.00  0.6% 00000000000002e0 svc_rdma_has_wspace         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>> >                67.00  0.6% ffffffff813a1a70 _raw_read_lock              /root/vmlinux
>> >                65.00  0.6% 000000000000f590 svcauth_unix_set_client     /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>> >                63.00  0.5% 00000000000180e0 svc_reserve                 /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>> >                60.00  0.5% 00000000000064d0 stamp_send_wqe              /lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
>> >                57.00  0.5% ffffffff810ac110 free_hot_cold_page          /root/vmlinux
>> >                57.00  0.5% ffffffff811ae540 memcpy                      /root/vmlinux
>> >                56.00  0.5% ffffffff810ad1a0 __alloc_pages_nodemask      /root/vmlinux
>> >                55.00  0.5% ffffffff81118200 splice_to_pipe              /root/vmlinux
>> >                53.00  0.5% ffffffff810e3bc0 get_partial_node            /root/vmlinux
>> >                49.00  0.4% ffffffff812eb840 __intel_map_single          /root/vmlinux
>> >
>> >
>> > > --b.
>> > >
>> > > >                     CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15
>> > > >           HI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
>> > > >        TIMER:     418767      46596      43515      44547      50099      34815      40634      40337      39551      93442      73733      42631      42509      41592      40351      61793
>> > > >       NET_TX:      28719        309       1421       1294       1730       1243        832        937         11         44         41         20         26         19         15         29
>> > > >       NET_RX:     612070         19         22         21          6        235          3          2          9          6         17         16         20         13         16         10
>> > > >        BLOCK:       5941          0          0          0          0          0          0          0        519        259       1238        272        253        174        215       2618
>> > > > BLOCK_IOPOLL:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
>> > > >      TASKLET:         28          1          1          1          1    1540653          1          1         29          1          1          1          1          1          1          2
>> > > >        SCHED:     364965      26547      16807      18403      22919       8678      14358      14091      16981      64903      47141      18517      19179      18036      17037      38261
>> > > >      HRTIMER:         13          0          1          1          0          0          0          0          0          0          0          0          1          1          0          1
>> > > >          RCU:     945823     841546     715281     892762     823564      42663     863063     841622     333577     389013     393501     239103     221524     258159     313426     234030

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-24 18:04                     ` Wendy Cheng
  0 siblings, 0 replies; 82+ messages in thread
From: Wendy Cheng @ 2013-04-24 18:04 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Yan Burman, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
> On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
>> On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
>>> On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
>>> >
>>> >
>>> >
>>> > Perf top for the CPU with high tasklet count gives:
>>> >
>>> >              samples  pcnt         RIP        function                    DSO
>>> >              _______ _____ ________________ ___________________________ ___________________________________________________________________
>>> >
>>> >              2787.00 24.1% ffffffff81062a00 mutex_spin_on_owner         /root/vmlinux
>>>
>>> I guess that means lots of contention on some mutex?  If only we knew
>>> which one.... perf should also be able to collect stack statistics, I
>>> forget how.
>>
>> Googling around....  I think we want:
>>
>>         perf record -a --call-graph
>>         (give it a chance to collect some samples, then ^C)
>>         perf report --call-graph --stdio
>>
>
> I have not looked at the NFS RDMA (or 3.x kernel) source yet. But see
> that "rb_prev" up in the #8 spot? Do we have a red-black tree somewhere
> in these paths? Trees like that require extensive locking.
>

So I did a quick read of the sunrpc/xprtrdma source (based on the OFA
1.5.4.1 tarball). Here is a random thought (not related to the rb tree
comment).

The in-flight request count seems to be controlled by
xprt_rdma_slot_table_entries, which is currently hard-coded to
RPCRDMA_DEF_SLOT_TABLE (32) (?). I'm wondering whether bumping it up,
say to 64, would help the bandwidth number. I'm not sure whether the
FMR pool size would need to be adjusted accordingly, though.

In short, if anyone has a benchmark setup handy, bumping up the slot
table size as follows might be interesting:

--- ofa_kernel-1.5.4.1.orig/include/linux/sunrpc/xprtrdma.h  2013-03-21 09:19:36.233006570 -0700
+++ ofa_kernel-1.5.4.1/include/linux/sunrpc/xprtrdma.h  2013-04-24 10:52:20.934781304 -0700
@@ -59,7 +59,7 @@
  * a single chunk type per message is supported currently.
  */
 #define RPCRDMA_MIN_SLOT_TABLE (2U)
-#define RPCRDMA_DEF_SLOT_TABLE (32U)
+#define RPCRDMA_DEF_SLOT_TABLE (64U)
 #define RPCRDMA_MAX_SLOT_TABLE (256U)

 #define RPCRDMA_DEF_INLINE  (1024)     /* default inline max */

-- Wendy
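
For a feel of when a larger slot table could matter, Little's law gives
an upper bound on throughput from the number of requests in flight. The
sketch below is back-of-the-envelope arithmetic only: the 50 us per-RPC
latency is a hypothetical figure, not something measured in this thread,
and the function names are illustrative. The 4k block size and the
4.3 GB/sec ib_send_bw figure come from the original post.

```python
import math


def throughput_ceiling(slots, bytes_per_rpc, rpc_latency_s):
    """Upper bound on bytes/sec with `slots` RPCs in flight (Little's law)."""
    return slots * bytes_per_rpc / rpc_latency_s


def slots_needed(target_bytes_per_s, bytes_per_rpc, rpc_latency_s):
    """Smallest slot count whose ceiling reaches the target bandwidth."""
    return math.ceil(target_bytes_per_s * rpc_latency_s / bytes_per_rpc)


ASSUMED_LATENCY_S = 50e-6  # hypothetical per-RPC round trip, NOT measured here
BLOCK = 4096               # the 4k fio block size from the original post

for slots in (32, 64):
    gb = throughput_ceiling(slots, BLOCK, ASSUMED_LATENCY_S) / 1e9
    print(f"{slots} slots: at most {gb:.2f} GB/s")

print("slots needed for 4.3 GB/s:",
      slots_needed(4.3e9, BLOCK, ASSUMED_LATENCY_S))
```

The bound only helps if latency stays flat as more requests are queued.
If the server is already CPU-bound, as the softirq numbers earlier in
the thread suggest, extra slots may just raise latency instead.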

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-24 18:04                     ` Wendy Cheng
  0 siblings, 0 replies; 82+ messages in thread
From: Wendy Cheng @ 2013-04-24 18:04 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Yan Burman, Atchley, Scott, Tom Tucker,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Or Gerlitz

On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> wrote:
>> On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
>>> On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
>>> >
>>> >
>>> >
>>> > Perf top for the CPU with high tasklet count gives:
>>> >
>>> >              samples  pcnt         RIP        function                    DSO
>>> >              _______ _____ ________________ ___________________________ ___________________________________________________________________
>>> >
>>> >              2787.00 24.1% ffffffff81062a00 mutex_spin_on_owner         /root/vmlinux
>>>
>>> I guess that means lots of contention on some mutex?  If only we knew
>>> which one.... perf should also be able to collect stack statistics, I
>>> forget how.
>>
>> Googling around....  I think we want:
>>
>>         perf record -a --call-graph
>>         (give it a chance to collect some samples, then ^C)
>>         perf report --call-graph --stdio
>>
>
> I have not looked at NFS RDMA (and 3.x kernel) source yet. But see
> that "rb_prev" up in the #7 spot ? Do we have Red Black tree somewhere
> in the paths ? Trees like that requires extensive lockings.
>

So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1
tar ball) ... Here is a random thought (not related to the rb tree
comment).....

The inflight packet count seems to be controlled by
xprt_rdma_slot_table_entries that is currently hard-coded as
RPCRDMA_DEF_SLOT_TABLE (32) (?).  I'm wondering whether it could help
with the bandwidth number if we pump it up, say 64 instead ? Not sure
whether FMR pool size needs to get adjusted accordingly though.

In short, if anyone has benchmark setup handy, bumping up the slot
table size as the following might be interesting:

--- ofa_kernel-1.5.4.1.orig/include/linux/sunrpc/xprtrdma.h
2013-03-21 09:19:36.233006570 -0700
+++ ofa_kernel-1.5.4.1/include/linux/sunrpc/xprtrdma.h  2013-04-24
10:52:20.934781304 -0700
@@ -59,7 +59,7 @@
  * a single chunk type per message is supported currently.
  */
 #define RPCRDMA_MIN_SLOT_TABLE (2U)
-#define RPCRDMA_DEF_SLOT_TABLE (32U)
+#define RPCRDMA_DEF_SLOT_TABLE (64U)
 #define RPCRDMA_MAX_SLOT_TABLE (256U)

 #define RPCRDMA_DEF_INLINE  (1024)     /* default inline max */

-- Wendy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-24 18:26                       ` Tom Talpey
  0 siblings, 0 replies; 82+ messages in thread
From: Tom Talpey @ 2013-04-24 18:26 UTC (permalink / raw)
  To: Wendy Cheng
  Cc: J. Bruce Fields, Yan Burman, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On 4/24/2013 2:04 PM, Wendy Cheng wrote:
> On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
>> On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>> On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
>>>> On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
>>>>>
>>>>>
>>>>>
>>>>> Perf top for the CPU with high tasklet count gives:
>>>>>
>>>>>               samples  pcnt         RIP        function                    DSO
>>>>>               _______ _____ ________________ ___________________________ ___________________________________________________________________
>>>>>
>>>>>               2787.00 24.1% ffffffff81062a00 mutex_spin_on_owner         /root/vmlinux
>>>>
>>>> I guess that means lots of contention on some mutex?  If only we knew
>>>> which one.... perf should also be able to collect stack statistics, I
>>>> forget how.
>>>
>>> Googling around....  I think we want:
>>>
>>>          perf record -a --call-graph
>>>          (give it a chance to collect some samples, then ^C)
>>>          perf report --call-graph --stdio
>>>
>>
>> I have not looked at NFS RDMA (and 3.x kernel) source yet. But see
>> that "rb_prev" up in the #7 spot ? Do we have Red Black tree somewhere
>> in the paths ? Trees like that requires extensive lockings.
>>
>
> So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1
> tar ball) ... Here is a random thought (not related to the rb tree
> comment).....
>
> The inflight packet count seems to be controlled by
> xprt_rdma_slot_table_entries that is currently hard-coded as
> RPCRDMA_DEF_SLOT_TABLE (32) (?).  I'm wondering whether it could help
> with the bandwidth number if we pump it up, say 64 instead ? Not sure
> whether FMR pool size needs to get adjusted accordingly though.

1)

The client slot count is not hard-coded, it can easily be changed by
writing a value to /proc and initiating a new mount. But I doubt that
increasing the slot table will improve performance much, unless this is
a small-random-read, and spindle-limited workload.

2)

The observation appears to be that the bandwidth is server CPU limited.
Increasing the load offered by the client probably won't move the needle,
until that's addressed.
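
A rough way to see why the slot count matters mostly for high-latency
workloads: with N slots, at most N RPCs are in flight, so throughput is
bounded by N * I/O size / per-RPC latency. The sketch below just does that
arithmetic; the function name is made up for illustration and the latency
figures are assumptions, not measurements from this thread.

```shell
#!/bin/sh
# Upper bound on throughput when at most `slots` RPCs are in flight:
#   bytes/sec <= slots * io_bytes / latency
# The latency_us values used below are illustrative assumptions only.
max_bw_bytes() {
    slots="$1"; io_bytes="$2"; latency_us="$3"
    echo $(( slots * io_bytes * 1000000 / latency_us ))
}

# 4K reads, assuming 100 us per RPC (fast, memory-backed server)
max_bw_bytes 32 4096 100
max_bw_bytes 64 4096 100

# 4K reads, assuming 5 ms per RPC (spindle-limited server)
max_bw_bytes 32 4096 5000
max_bw_bytes 64 4096 5000
```

Doubling the slots doubles the ceiling in both cases, but that only helps
if the ceiling is what limits the run. If the server is CPU-bound, raising
it should not change the measured number.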


>
> In short, if anyone has benchmark setup handy, bumping up the slot
> table size as the following might be interesting:
>
> --- ofa_kernel-1.5.4.1.orig/include/linux/sunrpc/xprtrdma.h  2013-03-21 09:19:36.233006570 -0700
> +++ ofa_kernel-1.5.4.1/include/linux/sunrpc/xprtrdma.h  2013-04-24 10:52:20.934781304 -0700
> @@ -59,7 +59,7 @@
>    * a single chunk type per message is supported currently.
>    */
>   #define RPCRDMA_MIN_SLOT_TABLE (2U)
> -#define RPCRDMA_DEF_SLOT_TABLE (32U)
> +#define RPCRDMA_DEF_SLOT_TABLE (64U)
>   #define RPCRDMA_MAX_SLOT_TABLE (256U)
>
>   #define RPCRDMA_DEF_INLINE  (1024)     /* default inline max */
>
> -- Wendy
>
>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-25 17:18                         ` Wendy Cheng
  0 siblings, 0 replies; 82+ messages in thread
From: Wendy Cheng @ 2013-04-25 17:18 UTC (permalink / raw)
  To: Tom Talpey
  Cc: J. Bruce Fields, Yan Burman, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey <tom@talpey.com> wrote:
>> On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng <s.wendy.cheng@gmail.com>
>> wrote:
>>>
>> So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1
>> tar ball) ... Here is a random thought (not related to the rb tree
>> comment).....
>>
>> The inflight packet count seems to be controlled by
>> xprt_rdma_slot_table_entries that is currently hard-coded as
>> RPCRDMA_DEF_SLOT_TABLE (32) (?).  I'm wondering whether it could help
>> with the bandwidth number if we pump it up, say 64 instead ? Not sure
>> whether FMR pool size needs to get adjusted accordingly though.
>
> 1)
>
> The client slot count is not hard-coded, it can easily be changed by
> writing a value to /proc and initiating a new mount. But I doubt that
> increasing the slot table will improve performance much, unless this is
> a small-random-read, and spindle-limited workload.

Hi Tom !

It was a shot in the dark :)  .. as our test bed has not been set up
yet. However, since I'll be working on (very) slow clients, increasing
this buffer is still interesting (to me). I don't see where it is
controlled by a /proc value (?) - but that is not a concern at this
moment as a /proc entry is easy to add. More questions on the server
though (see below) ...

>
> 2)
>
> The observation appears to be that the bandwidth is server CPU limited.
> Increasing the load offered by the client probably won't move the needle,
> until that's addressed.
>

Could you give more hints on which part of the path is CPU limited ?
Is there a known Linux-based filesystem that is reasonably tuned for
NFS-RDMA ? Would any specific filesystem features work well with
NFS-RDMA ? I'm wondering, when disk+FS are added into the
configuration, how much advantage NFS-RDMA would have when compared
with a plain TCP/IP transport, say IPoIB in connected mode (CM) ?

-- Wendy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-25 19:01                           ` Phil Pishioneri
  0 siblings, 0 replies; 82+ messages in thread
From: Phil Pishioneri @ 2013-04-25 19:01 UTC (permalink / raw)
  To: Wendy Cheng
  Cc: Tom Talpey, J. Bruce Fields, Yan Burman, Atchley, Scott,
	Tom Tucker, linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org,
	Or Gerlitz

On 4/25/13 1:18 PM, Wendy Cheng wrote:
> On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey <tom@talpey.com> wrote:
>> 1)
>>
>> The client slot count is not hard-coded, it can easily be changed by
>> writing a value to /proc and initiating a new mount. But I doubt that
>> increasing the slot table will improve performance much, unless this is
>> a small-random-read, and spindle-limited workload.
> It was a shot in the dark :)  .. as our test bed has not been set up
> yet. However, since I'll be working on (very) slow clients, increasing
> this buffer is still interesting (to me). I don't see where it is
> controlled by a /proc value (?) - but that is not a concern at this
> moment as /proc entry is easy to add. More questions on the server
> though (see below) ...

Might there be confusion between the RDMA slot table and the TCP/UDP 
ones (which have proc entries under /proc/sys/sunrpc)?

-Phil

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-25 20:14                             ` Tom Talpey
  0 siblings, 0 replies; 82+ messages in thread
From: Tom Talpey @ 2013-04-25 20:14 UTC (permalink / raw)
  To: Phil Pishioneri
  Cc: Wendy Cheng, J. Bruce Fields, Yan Burman, Atchley, Scott,
	Tom Tucker, linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org,
	Or Gerlitz

On 4/25/2013 3:01 PM, Phil Pishioneri wrote:
> On 4/25/13 1:18 PM, Wendy Cheng wrote:
>> On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey <tom@talpey.com> wrote:
>>> 1)
>>>
>>> The client slot count is not hard-coded, it can easily be changed by
>>> writing a value to /proc and initiating a new mount. But I doubt that
>>> increasing the slot table will improve performance much, unless this is
>>> a small-random-read, and spindle-limited workload.
>> It was a shot in the dark :)  .. as our test bed has not been set up
>> yet. However, since I'll be working on (very) slow clients, increasing
>> this buffer is still interesting (to me). I don't see where it is
>> controlled by a /proc value (?) - but that is not a concern at this
>> moment as /proc entry is easy to add. More questions on the server
>> though (see below) ...
>
> Might there be confusion between the RDMA slot table and the TCP/UDP
> ones (which have proc entries under /proc/sys/sunrpc)?
>

No, the xprtrdma.ko creates similar slot table controls when it loads.
See the names below, prefixed with "rdma":

> tmt@Home:~$ ls /proc/sys/sunrpc
> max_resvport  nfsd_debug  nlm_debug  tcp_fin_timeout             tcp_slot_table_entries  udp_slot_table_entries
> min_resvport  nfs_debug   rpc_debug  tcp_max_slot_table_entries  transports
> tmt@Home:~$ sudo insmod xprtrdma
> tmt@Home:~$ ls /proc/sys/sunrpc
> max_resvport  nlm_debug                  rdma_memreg_strategy     tcp_fin_timeout             udp_slot_table_entries
> min_resvport  rdma_inline_write_padding  rdma_pad_optimize        tcp_max_slot_table_entries
> nfsd_debug    rdma_max_inline_read       rdma_slot_table_entries  tcp_slot_table_entries
> nfs_debug     rdma_max_inline_write      rpc_debug                transports
> tmt@Home:~$
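
For anyone wanting to try a larger slot table without rebuilding, here is a
minimal sketch of the sequence described above: write the new value to
rdma_slot_table_entries, then mount again. The helper name, the
SUNRPC_DIR override, and the server, export and mount point in the comments
are placeholders, not taken from anyone's setup.

```shell
#!/bin/sh
# Sketch: raise the client RPC/RDMA slot table, then remount.
# SUNRPC_DIR can be overridden for a dry run; it defaults to the real sysctl dir.
SUNRPC_DIR="${SUNRPC_DIR:-/proc/sys/sunrpc}"

set_rdma_slots() {
    slots="$1"
    ctl="$SUNRPC_DIR/rdma_slot_table_entries"
    if [ ! -e "$ctl" ]; then
        echo "missing $ctl (is xprtrdma loaded?)" >&2
        return 1
    fi
    # Write the new value and read it back to confirm it was accepted.
    echo "$slots" > "$ctl" && cat "$ctl"
}

# Typical use, as root on the client. The value only applies to mounts made
# afterwards, so an existing mount has to be redone:
#   modprobe xprtrdma
#   set_rdma_slots 64
#   umount /mnt/nfs
#   mount -o rdma,port=20049 server-ib:/export /mnt/nfs
```

The header quoted earlier in the thread lists 2 and 256 as the minimum and
maximum slot table sizes, so values outside that range will presumably be
rejected.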



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-25 20:04                           ` Tom Talpey
  0 siblings, 0 replies; 82+ messages in thread
From: Tom Talpey @ 2013-04-25 20:04 UTC (permalink / raw)
  To: Wendy Cheng
  Cc: J. Bruce Fields, Yan Burman, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On 4/25/2013 1:18 PM, Wendy Cheng wrote:
> On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey <tom@talpey.com> wrote:
>>> On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng <s.wendy.cheng@gmail.com>
>>> wrote:
>>>>
>>> So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1
>>> tar ball) ... Here is a random thought (not related to the rb tree
>>> comment).....
>>>
>>> The inflight packet count seems to be controlled by
>>> xprt_rdma_slot_table_entries that is currently hard-coded as
>>> RPCRDMA_DEF_SLOT_TABLE (32) (?).  I'm wondering whether it could help
>>> with the bandwidth number if we pump it up, say 64 instead ? Not sure
>>> whether FMR pool size needs to get adjusted accordingly though.
>>
>> 1)
>>
>> The client slot count is not hard-coded, it can easily be changed by
>> writing a value to /proc and initiating a new mount. But I doubt that
>> increasing the slot table will improve performance much, unless this is
>> a small-random-read, and spindle-limited workload.
>
> Hi Tom !
>
> It was a shot in the dark :)  .. as our test bed has not been set up
> yet. However, since I'll be working on (very) slow clients, increasing
> this buffer is still interesting (to me). I don't see where it is
> controlled by a /proc value (?) - but that is not a concern at this

The entries show up in /proc/sys/sunrpc (IIRC). The one you're looking
for is called rdma_slot_table_entries.

> moment as /proc entry is easy to add. More questions on the server
> though (see below) ...
>
>>
>> 2)
>>
>> The observation appears to be that the bandwidth is server CPU limited.
>> Increasing the load offered by the client probably won't move the needle,
>> until that's addressed.
>>
>
> Could you give more hints on which part of the path is CPU limited ?

Sorry, I can't. The profile showing 25% of the 16-core, 2-socket server
spinning on locks is a smoking, flaming gun though. Maybe Tom Tucker
has some ideas on the srv rdma code, but it could also be in the sunrpc
or infiniband driver layers, can't really tell without the call stacks.

> Is there a known Linux-based filesystem that is reasonably tuned for
> NFS-RDMA ? Any specific filesystem features would work well with
> NFS-RDMA ? I'm wondering when disk+FS are added into the
> configuration, how much advantages would NFS-RDMA get when compared
> with a plain TCP/IP, say IPOIB on CM , transport ?

NFS-RDMA is not really filesystem dependent, but certainly there are
considerations for filesystems to support NFS, and of course the goal in
general is performance. NFS-RDMA is a network transport, applicable to
both client and server. Filesystem choice is a server consideration.

I don't have a simple answer to your question about how much better
NFS-RDMA is over other transports. Architecturally, a lot. In practice,
there are many, many variables. Have you seen RFC 5532, which I co-wrote
with the late Chet Juszczak? You may find it's still quite relevant.
http://tools.ietf.org/html/rfc5532

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-25 21:17                             ` Tom Tucker
  0 siblings, 0 replies; 82+ messages in thread
From: Tom Tucker @ 2013-04-25 21:17 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Wendy Cheng, J. Bruce Fields, Yan Burman, Atchley, Scott,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On 4/25/13 3:04 PM, Tom Talpey wrote:
> On 4/25/2013 1:18 PM, Wendy Cheng wrote:
>> On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey <tom@talpey.com> wrote:
>>>> On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng <s.wendy.cheng@gmail.com>
>>>> wrote:
>>>>>
>>>> So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1
>>>> tar ball) ... Here is a random thought (not related to the rb tree
>>>> comment).....
>>>>
>>>> The inflight packet count seems to be controlled by
>>>> xprt_rdma_slot_table_entries that is currently hard-coded as
>>>> RPCRDMA_DEF_SLOT_TABLE (32) (?).  I'm wondering whether it could help
>>>> with the bandwidth number if we pump it up, say 64 instead ? Not sure
>>>> whether FMR pool size needs to get adjusted accordingly though.
>>>
>>> 1)
>>>
>>> The client slot count is not hard-coded, it can easily be changed by
>>> writing a value to /proc and initiating a new mount. But I doubt that
>>> increasing the slot table will improve performance much, unless this is
>>> a small-random-read, and spindle-limited workload.
>>
>> Hi Tom !
>>
>> It was a shot in the dark :)  .. as our test bed has not been set up
>> yet. However, since I'll be working on (very) slow clients, increasing
>> this buffer is still interesting (to me). I don't see where it is
>> controlled by a /proc value (?) - but that is not a concern at this
>
> The entries show up in /proc/sys/sunrpc (IIRC). The one you're looking
> for is called rdma_slot_table_entries.
>
>> moment as /proc entry is easy to add. More questions on the server
>> though (see below) ...
>>
>>>
>>> 2)
>>>
>>> The observation appears to be that the bandwidth is server CPU limited.
>>> Increasing the load offered by the client probably won't move the needle,
>>> until that's addressed.
>>>
>>
>> Could you give more hints on which part of the path is CPU limited ?
>
> Sorry, I don't. The profile showing 25% of the 16-core, 2-socket server
> spinning on locks is a smoking, flaming gun though. Maybe Tom Tucker
> has some ideas on the srv rdma code, but it could also be in the sunrpc
> or infiniband driver layers, can't really tell without the call stacks.

The Mellanox driver uses red-black trees extensively for resource
management, e.g. QP ID, CQ ID, etc. When completions come in from the
HW, I believe these are used to find the associated software data
structures. It is certainly possible that these trees get hot on lookup
when we're pushing a lot of data. I'm surprised, however, to see
rb_insert_color there, because I'm not aware of anywhere that resources
are being inserted into and/or removed from a red-black tree in the data
path.

They are also used by IPoIB and the IB CM; however, connections should
not be coming and going unless we've got other problems. IPoIB is only
used by the IB transport for connection setup, and my impression is that
this trace is for the IB transport.

I don't believe that red-black trees are used by either the client or
server transports directly. Note that the rb_lock in the client is for
buffers, not, as the name might imply, for a red-black tree.

I think the key here is to discover what lock is being waited on. Are we 
certain that it's a lock on a red-black tree and if so, which one?
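
One way to get at that, along the lines of the perf commands suggested
earlier in the thread, is a system-wide profile with call graphs taken on
the server while the fio run is active. This is only a sketch: the function
name, the PERF and DURATION variables and the default output file are made
up here, and the exact spelling of the call-graph option may differ between
perf versions.

```shell
#!/bin/sh
# Sketch: record call stacks on all CPUs, then print a text report so the
# callers of mutex_spin_on_owner can be read off.
# PERF and DURATION are overridable; both are conveniences of this sketch.
PERF="${PERF:-perf}"
DURATION="${DURATION:-30}"

profile_callers() {
    out="${1:-perf.data}"
    # -a: all CPUs, -g: record call graphs; sample for DURATION seconds
    "$PERF" record -a -g -o "$out" -- sleep "$DURATION" || return 1
    # Plain-text report; look for mutex_spin_on_owner and the chain above it
    "$PERF" report -i "$out" --stdio
}

# Example (run on the server during the benchmark):
#   profile_callers /tmp/nfsrdma.data > /tmp/nfsrdma-report.txt
#   grep -n mutex_spin_on_owner /tmp/nfsrdma-report.txt
```

The call chains under mutex_spin_on_owner should say whether the mutex
belongs to the server transport, sunrpc, or the IB driver.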

Tom
>
>> Is there a known Linux-based filesystem that is reasonably tuned for
>> NFS-RDMA ? Any specific filesystem features would work well with
>> NFS-RDMA ? I'm wondering when disk+FS are added into the
>> configuration, how much advantages would NFS-RDMA get when compared
>> with a plain TCP/IP, say IPOIB on CM , transport ?
>
> NFS-RDMA is not really filesystem dependent, but certainly there are
> considerations for filesystems to support NFS, and of course the goal in
> general is performance. NFS-RDMA is a network transport, applicable to
> both client and server. Filesystem choice is a server consideration.
>
> I don't have a simple answer to your question about how much better
> NFS-RDMA is over other transports. Architecturally, a lot. In practice,
> there are many, many variables. Have you seen RFC5532, that I cowrote
> with the late Chet Juszczak? You may find it's still quite relevant.
> http://tools.ietf.org/html/rfc5532


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-25 21:58                               ` Wendy Cheng
  0 siblings, 0 replies; 82+ messages in thread
From: Wendy Cheng @ 2013-04-25 21:58 UTC (permalink / raw)
  To: Tom Tucker
  Cc: Tom Talpey, J. Bruce Fields, Yan Burman, Atchley, Scott,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On Thu, Apr 25, 2013 at 2:17 PM, Tom Tucker <tom@opengridcomputing.com> wrote:
> The Mellanox driver uses red-black trees extensively for resource
> management, e.g. QP ID, CQ ID, etc... When completions come in from the HW,
> these are used to find the associated software data structures I believe. It
> is certainly possible that these trees get hot on lookup when we're pushing
> a lot of data. I'm surprised, however, to see rb_insert_color there because
> I'm not aware of anywhere that resources are being inserted into and/or
> removed from a red-black tree in the data path.
>

I think they (rb calls) are from base kernel, not from any NFS and/or
IB module (e.g. RPC, MLX, etc). See the right column ? .... it says
"/root/vmlinux". Just a guess - I don't know much about this "perf"
command.

 -- Wendy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-25 22:26                                 ` Wendy Cheng
  0 siblings, 0 replies; 82+ messages in thread
From: Wendy Cheng @ 2013-04-25 22:26 UTC (permalink / raw)
  To: Tom Tucker
  Cc: Tom Talpey, J. Bruce Fields, Yan Burman, Atchley, Scott,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On Thu, Apr 25, 2013 at 2:58 PM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
> On Thu, Apr 25, 2013 at 2:17 PM, Tom Tucker <tom@opengridcomputing.com> wrote:
>> The Mellanox driver uses red-black trees extensively for resource
>> management, e.g. QP ID, CQ ID, etc... When completions come in from the HW,
>> these are used to find the associated software data structures I believe. It
>> is certainly possible that these trees get hot on lookup when we're pushing
>> a lot of data. I'm surprised, however, to see rb_insert_color there because
>> I'm not aware of anywhere that resources are being inserted into and/or
>> removed from a red-black tree in the data path.
>>
>
> I think they (rb calls) are from base kernel, not from any NFS and/or
> IB module (e.g. RPC, MLX, etc). See the right column ? .... it says
> "/root/vmlinux". Just a guess - I don't know much about this "perf"
> command.
>


Oops .. take my words back ! I confused Linux's RB tree w/ BSD's.
BSD's is a set of macros inside a header file while Linux's
implementation is a base kernel library. So every KMOD is a suspect
here :)

-- Wendy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: NFS over RDMA benchmark
@ 2013-04-28  6:28                   ` Yan Burman
  0 siblings, 0 replies; 82+ messages in thread
From: Yan Burman @ 2013-04-28  6:28 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Wendy Cheng, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz



> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields@fieldses.org]
> Sent: Wednesday, April 24, 2013 18:27
> To: Yan Burman
> Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
> linux-nfs@vger.kernel.org; Or Gerlitz
> Subject: Re: NFS over RDMA benchmark
> 
> On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
> > On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: J. Bruce Fields [mailto:bfields@fieldses.org]
> > > > Sent: Wednesday, April 24, 2013 00:06
> > > > To: Yan Burman
> > > > Cc: Wendy Cheng; Atchley, Scott; Tom Tucker;
> > > > linux-rdma@vger.kernel.org; linux-nfs@vger.kernel.org; Or Gerlitz
> > > > Subject: Re: NFS over RDMA benchmark
> > > >
> > > > On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Wendy Cheng [mailto:s.wendy.cheng@gmail.com]
> > > > > > Sent: Wednesday, April 17, 2013 21:06
> > > > > > To: Atchley, Scott
> > > > > > Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
> > > > > > linux-rdma@vger.kernel.org; linux-nfs@vger.kernel.org
> > > > > > Subject: Re: NFS over RDMA benchmark
> > > > > >
> > > > > > On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott
> > > > > > <atchleyes@ornl.gov>
> > > > > > wrote:
> > > > > > > On Apr 17, 2013, at 1:15 PM, Wendy Cheng
> > > > > > > <s.wendy.cheng@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
> > > > > > >> <yanb@mellanox.com>
> > > > > > wrote:
> > > > > > >>> Hi.
> > > > > > >>>
> > > > > > > >>> I've been trying to do some benchmarks for NFS over RDMA
> > > > > > > >>> and I seem to only get about half of the bandwidth that
> > > > > > > >>> the HW can give me.
> > > > > > > >>> My setup consists of 2 servers each with 16 cores, 32Gb of
> > > > > > > >>> memory, and Mellanox ConnectX3 QDR card over PCI-e gen3.
> > > > > > > >>> These servers are connected to a QDR IB switch. The
> > > > > > > >>> backing storage on the server is tmpfs mounted with noatime.
> > > > > > > >>> I am running kernel 3.5.7.
> > > > > > > >>>
> > > > > > > >>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block
> > > > > > > >>> sizes 4-512K.
> > > > > > > >>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec
> > > > > > > >>> for the same block sizes (4-512K). running over IPoIB-CM,
> > > > > > > >>> I get 200-980MB/sec.
> > > > > > >
> > > > > > > Yan,
> > > > > > >
> > > > > > > Are you trying to optimize single client performance or
> > > > > > > server performance
> > > > > > with multiple clients?
> > > > > > >
> > > > >
> > > > > I am trying to get maximum performance from a single server - I
> > > > > used 2
> > > > processes in fio test - more than 2 did not show any performance boost.
> > > > > I tried running fio from 2 different PCs on 2 different files,
> > > > > but the sum of
> > > > the two is more or less the same as running from single client PC.
> > > > >
> > > > > What I did see is that server is sweating a lot more than the
> > > > > clients and
> > > > more than that, it has 1 core (CPU5) in 100% softirq tasklet:
> > > > > cat /proc/softirqs
> > > >
> > > > Would any profiling help figure out which code it's spending time in?
> > > > (E.g. something simple as "perf top" might have useful output.)
> > > >
> > >
> > >
> > > Perf top for the CPU with high tasklet count gives:
> > >
> > >              samples  pcnt         RIP        function                    DSO
> > >              _______ _____ ________________ ___________________________ _____________
> > >
> > >              2787.00 24.1% ffffffff81062a00 mutex_spin_on_owner         /root/vmlinux
> >
> > I guess that means lots of contention on some mutex?  If only we knew
> > which one.... perf should also be able to collect stack statistics, I
> > forget how.
> 
> Googling around....  I think we want:
> 
> 	perf record -a --call-graph
> 	(give it a chance to collect some samples, then ^C)
> 	perf report --call-graph --stdio
> 

Sorry it took me a while to get perf to show the call trace (did not enable frame pointers in kernel and struggled with perf options...), but what I get is:
    36.18%          nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
                    |
                    --- mutex_spin_on_owner
                       |
                       |--99.99%-- __mutex_lock_slowpath
                       |          mutex_lock
                       |          |
                       |          |--85.30%-- generic_file_aio_write
                       |          |          do_sync_readv_writev
                       |          |          do_readv_writev
                       |          |          vfs_writev
                       |          |          nfsd_vfs_write
                       |          |          nfsd_write
                       |          |          nfsd3_proc_write
                       |          |          nfsd_dispatch
                       |          |          svc_process_common
                       |          |          svc_process
                       |          |          nfsd
                       |          |          kthread
                       |          |          kernel_thread_helper
                       |          |
                       |           --14.70%-- svc_send
                       |                     svc_process
                       |                     nfsd
                       |                     kthread
                       |                     kernel_thread_helper
                        --0.01%-- [...]

     9.63%          nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
                    |
                    --- _raw_spin_lock_irqsave
                       |
                       |--43.97%-- alloc_iova
                       |          intel_alloc_iova
                       |          __intel_map_single
                       |          intel_map_page
                       |          |
                       |          |--60.47%-- svc_rdma_sendto
                       |          |          svc_send
                       |          |          svc_process
                       |          |          nfsd
                       |          |          kthread
                       |          |          kernel_thread_helper
                       |          |
                       |          |--30.10%-- rdma_read_xdr
                       |          |          svc_rdma_recvfrom
                       |          |          svc_recv
                       |          |          nfsd
                       |          |          kthread
                       |          |          kernel_thread_helper
                       |          |
                       |          |--6.69%-- svc_rdma_post_recv
                       |          |          send_reply
                       |          |          svc_rdma_sendto
                       |          |          svc_send
                       |          |          svc_process
                       |          |          nfsd
                       |          |          kthread
                       |          |          kernel_thread_helper
                       |          |
                       |           --2.74%-- send_reply
                       |                     svc_rdma_sendto
                       |                     svc_send
                       |                     svc_process
                       |                     nfsd
                       |                     kthread
                       |                     kernel_thread_helper
                       |
                       |--37.52%-- __free_iova
                       |          flush_unmaps
                       |          add_unmap
                       |          intel_unmap_page
                       |          |
                       |          |--97.18%-- svc_rdma_put_frmr
                       |          |          sq_cq_reap
                       |          |          dto_tasklet_func
                       |          |          tasklet_action
                       |          |          __do_softirq
                       |          |          call_softirq
                       |          |          do_softirq
                       |          |          |
                       |          |          |--97.40%-- irq_exit
                       |          |          |          |
                       |          |          |          |--99.85%-- do_IRQ
                       |          |          |          |          ret_from_intr
                       |          |          |          |          |
                       |          |          |          |          |--40.74%-- generic_file_buffered_write
                       |          |          |          |          |          __generic_file_aio_write
                       |          |          |          |          |          generic_file_aio_write
                       |          |          |          |          |          do_sync_readv_writev
                       |          |          |          |          |          do_readv_writev
                       |          |          |          |          |          vfs_writev
                       |          |          |          |          |          nfsd_vfs_write
                       |          |          |          |          |          nfsd_write
                       |          |          |          |          |          nfsd3_proc_write
                       |          |          |          |          |          nfsd_dispatch
                       |          |          |          |          |          svc_process_common
                       |          |          |          |          |          svc_process
                       |          |          |          |          |          nfsd
                       |          |          |          |          |          kthread
                       |          |          |          |          |          kernel_thread_helper
                       |          |          |          |          |
                       |          |          |          |          |--25.21%-- __mutex_lock_slowpath
                       |          |          |          |          |          mutex_lock
                       |          |          |          |          |          |
                       |          |          |          |          |          |--94.84%-- generic_file_aio_write
                       |          |          |          |          |          |          do_sync_readv_writev
                       |          |          |          |          |          |          do_readv_writev
                       |          |          |          |          |          |          vfs_writev
                       |          |          |          |          |          |          nfsd_vfs_write
                       |          |          |          |          |          |          nfsd_write
                       |          |          |          |          |          |          nfsd3_proc_write
                       |          |          |          |          |          |          nfsd_dispatch
                       |          |          |          |          |          |          svc_process_common
                       |          |          |          |          |          |          svc_process
                       |          |          |          |          |          |          nfsd
                       |          |          |          |          |          |          kthread
                       |          |          |          |          |          |          kernel_thread_helper
                       |          |          |          |          |          |

The entire trace is almost 1MB, so send me an off-list message if you want it.

Yan
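
A note on reading the trace above: the nested percentages look relative to their parent entry (the 99.99% / 0.01% split under mutex_spin_on_owner suggests perf's default relative call-graph mode), so the share of all samples for a given path is the product of the percentages along it. The sketch below does only that arithmetic, assuming that reading is correct. The helper name absolute_share is hypothetical, and the figures are copied from the trace.

```python
def absolute_share(*relative_percents):
    """Multiply nested relative percentages into one absolute percentage.

    Assumes each value is a percentage of its parent entry, as in the
    relative call-graph output quoted above.
    """
    share = 100.0
    for pct in relative_percents:
        share *= pct / 100.0
    return share


# Paths and percentages copied from the perf report output above.
paths = {
    "mutex_spin_on_owner <- mutex_lock <- generic_file_aio_write": (36.18, 99.99, 85.30),
    "mutex_spin_on_owner <- mutex_lock <- svc_send": (36.18, 99.99, 14.70),
    "_raw_spin_lock_irqsave <- alloc_iova": (9.63, 43.97),
    "_raw_spin_lock_irqsave <- __free_iova": (9.63, 37.52),
}

if __name__ == "__main__":
    for name, percents in paths.items():
        print(f"{absolute_share(*percents):6.2f}%  {name}")
```

Under that assumption, roughly 31% of all samples are nfsd threads spinning on a mutex taken from generic_file_aio_write, about 5% come from the mutex taken in svc_send, and the IOMMU iova alloc/free spinlock accounts for a further 8% or so. As far as I know, generic_file_aio_write in kernels of this vintage takes the inode's i_mutex around the write, which would make the per-file i_mutex the likely contended lock. That is an inference from the call chain, not something the trace states.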


^ permalink raw reply	[flat|nested] 82+ messages in thread

                       |          |          |          |          |          nfsd_vfs_write
                       |          |          |          |          |          nfsd_write
                       |          |          |          |          |          nfsd3_proc_write
                       |          |          |          |          |          nfsd_dispatch
                       |          |          |          |          |          svc_process_common
                       |          |          |          |          |          svc_process
                       |          |          |          |          |          nfsd
                       |          |          |          |          |          kthread
                       |          |          |          |          |          kernel_thread_helper
                       |          |          |          |          |
                       |          |          |          |          |--25.21%-- __mutex_lock_slowpath
                       |          |          |          |          |          mutex_lock
                       |          |          |          |          |          |
                       |          |          |          |          |          |--94.84%-- generic_file_aio_write
                       |          |          |          |          |          |          do_sync_readv_writev
                       |          |          |          |          |          |          do_readv_writev
                       |          |          |          |          |          |          vfs_writev
                       |          |          |          |          |          |          nfsd_vfs_write
                       |          |          |          |          |          |          nfsd_write
                       |          |          |          |          |          |          nfsd3_proc_write
                       |          |          |          |          |          |          nfsd_dispatch
                       |          |          |          |          |          |          svc_process_common
                       |          |          |          |          |          |          svc_process
                       |          |          |          |          |          |          nfsd
                       |          |          |          |          |          |          kthread
                       |          |          |          |          |          |          kernel_thread_helper
                       |          |          |          |          |          |

The entire trace is almost 1MB, so send me an off-list message if you want it.

Yan

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-28 14:42                     ` J. Bruce Fields
  0 siblings, 0 replies; 82+ messages in thread
From: J. Bruce Fields @ 2013-04-28 14:42 UTC (permalink / raw)
  To: Yan Burman
  Cc: Wendy Cheng, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On Sun, Apr 28, 2013 at 06:28:16AM +0000, Yan Burman wrote:
> > > > > > > >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <yanb@mellanox.com>
> > > > > > > >>> I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me.
> > > > > > > >>> My setup consists of 2 servers each with 16 cores, 32Gb of memory, and Mellanox ConnectX3 QDR card over PCI-e gen3.
> > > > > > > >>> These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime.
> > > > > > > >>> I am running kernel 3.5.7.
> > > > > > > >>>
> > > > > > > >>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
> > > > > > > >>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
...
> > > > > > I am trying to get maximum performance from a single server - I used 2 processes in fio test - more than 2 did not show any performance boost.
> > > > > > I tried running fio from 2 different PCs on 2 different files, but the sum of the two is more or less the same as running from single client PC.
> > > > > >
> > > > > > What I did see is that server is sweating a lot more than the clients and more than that, it has 1 core (CPU5) in 100% softirq tasklet:
> > > > > > cat /proc/softirqs
...
> > > > Perf top for the CPU with high tasklet count gives:
> > > >
> > > >              samples  pcnt         RIP        function                    DSO
...
> > > >              2787.00 24.1% ffffffff81062a00 mutex_spin_on_owner         /root/vmlinux
...
> > Googling around....  I think we want:
> > 
> > 	perf record -a --call-graph
> > 	(give it a chance to collect some samples, then ^C)
> > 	perf report --call-graph --stdio
> > 
> 
> Sorry it took me a while to get perf to show the call trace (did not enable frame pointers in kernel and struggled with perf options...), but what I get is:
>     36.18%          nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
>                     |
>                     --- mutex_spin_on_owner
>                        |
>                        |--99.99%-- __mutex_lock_slowpath
>                        |          mutex_lock
>                        |          |
>                        |          |--85.30%-- generic_file_aio_write

That's the inode i_mutex.

>                        |          |          do_sync_readv_writev
>                        |          |          do_readv_writev
>                        |          |          vfs_writev
>                        |          |          nfsd_vfs_write
>                        |          |          nfsd_write
>                        |          |          nfsd3_proc_write
>                        |          |          nfsd_dispatch
>                        |          |          svc_process_common
>                        |          |          svc_process
>                        |          |          nfsd
>                        |          |          kthread
>                        |          |          kernel_thread_helper
>                        |          |
>                        |           --14.70%-- svc_send

That's the xpt_mutex (ensuring rpc replies aren't interleaved).

>                        |                     svc_process
>                        |                     nfsd
>                        |                     kthread
>                        |                     kernel_thread_helper
>                         --0.01%-- [...]
> 
>      9.63%          nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
>                     |
>                     --- _raw_spin_lock_irqsave
>                        |
>                        |--43.97%-- alloc_iova

And that (and __free_iova below) looks like iova_rbtree_lock.

--b.

>                        |          intel_alloc_iova
>                        |          __intel_map_single
>                        |          intel_map_page
>                        |          |
>                        |          |--60.47%-- svc_rdma_sendto
>                        |          |          svc_send
>                        |          |          svc_process
>                        |          |          nfsd
>                        |          |          kthread
>                        |          |          kernel_thread_helper
>                        |          |
>                        |          |--30.10%-- rdma_read_xdr
>                        |          |          svc_rdma_recvfrom
>                        |          |          svc_recv
>                        |          |          nfsd
>                        |          |          kthread
>                        |          |          kernel_thread_helper
>                        |          |
>                        |          |--6.69%-- svc_rdma_post_recv
>                        |          |          send_reply
>                        |          |          svc_rdma_sendto
>                        |          |          svc_send
>                        |          |          svc_process
>                        |          |          nfsd
>                        |          |          kthread
>                        |          |          kernel_thread_helper
>                        |          |
>                        |           --2.74%-- send_reply
>                        |                     svc_rdma_sendto
>                        |                     svc_send
>                        |                     svc_process
>                        |                     nfsd
>                        |                     kthread
>                        |                     kernel_thread_helper
>                        |
>                        |--37.52%-- __free_iova
>                        |          flush_unmaps
>                        |          add_unmap
>                        |          intel_unmap_page
>                        |          |
>                        |          |--97.18%-- svc_rdma_put_frmr
>                        |          |          sq_cq_reap
>                        |          |          dto_tasklet_func
>                        |          |          tasklet_action
>                        |          |          __do_softirq
>                        |          |          call_softirq
>                        |          |          do_softirq
>                        |          |          |
>                        |          |          |--97.40%-- irq_exit
>                        |          |          |          |
>                        |          |          |          |--99.85%-- do_IRQ
>                        |          |          |          |          ret_from_intr
>                        |          |          |          |          |
>                        |          |          |          |          |--40.74%-- generic_file_buffered_write
>                        |          |          |          |          |          __generic_file_aio_write
>                        |          |          |          |          |          generic_file_aio_write
>                        |          |          |          |          |          do_sync_readv_writev
>                        |          |          |          |          |          do_readv_writev
>                        |          |          |          |          |          vfs_writev
>                        |          |          |          |          |          nfsd_vfs_write
>                        |          |          |          |          |          nfsd_write
>                        |          |          |          |          |          nfsd3_proc_write
>                        |          |          |          |          |          nfsd_dispatch
>                        |          |          |          |          |          svc_process_common
>                        |          |          |          |          |          svc_process
>                        |          |          |          |          |          nfsd
>                        |          |          |          |          |          kthread
>                        |          |          |          |          |          kernel_thread_helper
>                        |          |          |          |          |
>                        |          |          |          |          |--25.21%-- __mutex_lock_slowpath
>                        |          |          |          |          |          mutex_lock
>                        |          |          |          |          |          |
>                        |          |          |          |          |          |--94.84%-- generic_file_aio_write
>                        |          |          |          |          |          |          do_sync_readv_writev
>                        |          |          |          |          |          |          do_readv_writev
>                        |          |          |          |          |          |          vfs_writev
>                        |          |          |          |          |          |          nfsd_vfs_write
>                        |          |          |          |          |          |          nfsd_write
>                        |          |          |          |          |          |          nfsd3_proc_write
>                        |          |          |          |          |          |          nfsd_dispatch
>                        |          |          |          |          |          |          svc_process_common
>                        |          |          |          |          |          |          svc_process
>                        |          |          |          |          |          |          nfsd
>                        |          |          |          |          |          |          kthread
>                        |          |          |          |          |          |          kernel_thread_helper
>                        |          |          |          |          |          |
> 
> The entire trace is almost 1MB, so send me an off-list message if you want it.
> 
> Yan
> 
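
As a rough cross-check of the annotations above, the nested percentages in the trace can be multiplied out into shares of all samples. This is only a sketch, assuming perf printed each percentage relative to its parent entry (its "fractal" call-graph display, which may not be what was used here). The helper name absolute_share is made up for illustration; the numbers are copied from the quoted trace.

```python
# Illustrative only: combine the relative percentages printed by
# `perf report --call-graph --stdio` into shares of all samples.
# Assumption: each percentage is relative to its parent entry.

def absolute_share(*relative_percents):
    """Multiply a chain of relative percentages into one absolute percentage."""
    share = 100.0
    for pct in relative_percents:
        share *= pct / 100.0
    return share

# mutex_spin_on_owner is 36.18% of samples; 99.99% of that goes through
# __mutex_lock_slowpath, which splits 85.30% / 14.70% between its callers.
i_mutex = absolute_share(36.18, 99.99, 85.30)    # via generic_file_aio_write
xpt_mutex = absolute_share(36.18, 99.99, 14.70)  # via svc_send

# _raw_spin_lock_irqsave is 9.63% of samples, split between the IOVA paths.
alloc_iova = absolute_share(9.63, 43.97)
free_iova = absolute_share(9.63, 37.52)

shares = {
    "i_mutex": i_mutex,
    "xpt_mutex": xpt_mutex,
    "alloc_iova": alloc_iova,
    "__free_iova": free_iova,
}
for name, share in shares.items():
    print(f"{name:<12} {share:5.2f}% of samples")
```

Under that assumption, i_mutex contention comes to roughly 30.9% of all samples, xpt_mutex to about 5.3%, and the two IOVA paths to about 4.2% and 3.6%.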

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-29  5:34                       ` Wendy Cheng
  0 siblings, 0 replies; 82+ messages in thread
From: Wendy Cheng @ 2013-04-29  5:34 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Yan Burman, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields <bfields@fieldses.org> wrote:

>> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman

>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
>> same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
> ...

[snip]

>>     36.18%          nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
>
> That's the inode i_mutex.
>
>>     14.70%-- svc_send
>
> That's the xpt_mutex (ensuring rpc replies aren't interleaved).
>
>>
>>      9.63%          nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
>>
>
> And that (and __free_iova below) looks like iova_rbtree_lock.
>
>

Let's revisit your command:

"FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128
--ioengine=libaio --size=100000k --prioclass=1 --prio=0 --cpumask=255
--loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1
--norandommap --group_reporting --exitall --buffered=0"

* inode's i_mutex:
If increasing the process/file count didn't help, maybe increasing "iodepth"
(say, to 512?) could offset the i_mutex overhead a little?

* xpt_mutex:
(no idea)

* iova_rbtree_lock:
DMA mapping fragmentation? I have not studied whether NFS-RDMA
routines such as "svc_rdma_sendto()" could do better, but maybe
sequential IO (instead of "randread") could help? Could a bigger block
size (instead of 4K) help?

-- Wendy
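
The suggestions above amount to a sweep over access pattern, block size, and queue depth. Here is a minimal sketch that only builds the fio command lines and does not run them. The job names and the file path are hypothetical, and only a subset of the options from the quoted command is carried over.

```python
import itertools

# Subset of the options from the fio command quoted in this thread.
BASE_ARGS = [
    "--ioengine=libaio",
    "--direct=1",
    "--numjobs=2",
    "--size=100000k",
    "--loops=25",
    "--group_reporting",
]

def fio_commands(filename, block_sizes, iodepths, modes):
    """Yield one fio argv list per (mode, block size, iodepth) combination."""
    for rw, bs, depth in itertools.product(modes, block_sizes, iodepths):
        yield [
            "fio",
            f"--name=nfs-rdma-{rw}-{bs}-qd{depth}",  # hypothetical job name
            f"--filename={filename}",
            f"--rw={rw}",
            f"--bs={bs}",
            f"--iodepth={depth}",
            *BASE_ARGS,
        ]

commands = list(
    fio_commands(
        "/mnt/nfs-rdma/testfile",  # hypothetical path on the NFS/RDMA mount
        block_sizes=["4k", "128k", "256k", "512k"],
        iodepths=[128, 512],
        modes=["randread", "read"],
    )
)

for argv in commands:
    print(" ".join(argv))
```

Keeping every other option fixed while one parameter changes makes it easier to attribute a bandwidth difference to that parameter alone.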

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: NFS over RDMA benchmark
@ 2013-04-29 12:16                         ` Yan Burman
  0 siblings, 0 replies; 82+ messages in thread
From: Yan Burman @ 2013-04-29 12:16 UTC (permalink / raw)
  To: Wendy Cheng, J. Bruce Fields
  Cc: Atchley, Scott, Tom Tucker, linux-rdma@vger.kernel.org,
	linux-nfs@vger.kernel.org, Or Gerlitz



> -----Original Message-----
> From: Wendy Cheng [mailto:s.wendy.cheng@gmail.com]
> Sent: Monday, April 29, 2013 08:35
> To: J. Bruce Fields
> Cc: Yan Burman; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
> linux-nfs@vger.kernel.org; Or Gerlitz
> Subject: Re: NFS over RDMA benchmark
> 
> On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
> 
> >> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
> >> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
> >> same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
> > ...
> 
> [snip]
> 
> >>     36.18%          nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
> >
> > That's the inode i_mutex.
> >
> >>     14.70%-- svc_send
> >
> > That's the xpt_mutex (ensuring rpc replies aren't interleaved).
> >
> >>
> >>      9.63%          nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
> >>
> >
> > And that (and __free_iova below) looks like iova_rbtree_lock.
> >
> >
> 
> Let's revisit your command:
> 
> "FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128 --
> ioengine=libaio --size=100000k --prioclass=1 --prio=0 --cpumask=255
> --loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --
> norandommap --group_reporting --exitall --buffered=0"
> 

I tried block sizes from 4K to 512K.
4K does not give 2.2GB/sec; the best bandwidth is achieved at around 128-256K block size.
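
A block-size sweep like the one described can be generated mechanically. The sketch below only builds the fio command lines, reusing the flags quoted in this thread; the mount point, the job names, and the choice of power-of-two sizes between 4K and 512K are assumptions made for illustration, not details taken from the original runs.

```python
import shlex

# Flags copied from the fio invocation quoted in this thread.
BASE_FLAGS = [
    "--rw=randread", "--numjobs=2", "--iodepth=128", "--ioengine=libaio",
    "--size=100000k", "--prioclass=1", "--prio=0", "--cpumask=255",
    "--loops=25", "--direct=1", "--invalidate=1", "--fsync_on_close=1",
    "--randrepeat=1", "--norandommap", "--group_reporting", "--exitall",
    "--buffered=0",
]

# Assumption: powers of two between the stated endpoints, 4K and 512K.
BLOCK_SIZES_KB = [4, 8, 16, 32, 64, 128, 256, 512]


def fio_commands(directory="/mnt/nfs-rdma", job_prefix="nfsrdma"):
    """Return one fio argv list per block size.

    The directory and job prefix are hypothetical placeholders.
    """
    commands = []
    for kb in BLOCK_SIZES_KB:
        commands.append(
            ["fio", f"--name={job_prefix}-{kb}k",
             f"--directory={directory}", f"--bs={kb}k"] + BASE_FLAGS
        )
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for argv in fio_commands():
        print(shlex.join(argv))
```

Printing rather than executing keeps the sketch usable on a machine without fio or the NFS mount.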

> * inode's i_mutex:
> If increasing process/file count didn't help, maybe increase "iodepth"
> (say 512 ?) could offset the i_mutex overhead a little bit ?
> 

I tried with different iodepth parameters, but found no improvement above iodepth 128.

> * xpt_mutex:
> (no idea)
> 
> * iova_rbtree_lock
> DMA mapping fragmentation ? I have not studied whether NFS-RDMA
> routines such as "svc_rdma_sendto()" could do better but maybe sequential
> IO (instead of "randread") could help ? Bigger block size (instead of 4K) can
> help ?
> 

I am trying to simulate a real load (more or less), which is why I use randread. In any case, sequential read does not give better performance.
That is probably because the backing storage is tmpfs...

Yan


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-29 13:05                           ` Tom Tucker
  0 siblings, 0 replies; 82+ messages in thread
From: Tom Tucker @ 2013-04-29 13:05 UTC (permalink / raw)
  To: Yan Burman
  Cc: Wendy Cheng, J. Bruce Fields, Atchley, Scott,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On 4/29/13 7:16 AM, Yan Burman wrote:
>
>> -----Original Message-----
>> From: Wendy Cheng [mailto:s.wendy.cheng@gmail.com]
>> Sent: Monday, April 29, 2013 08:35
>> To: J. Bruce Fields
>> Cc: Yan Burman; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
>> linux-nfs@vger.kernel.org; Or Gerlitz
>> Subject: Re: NFS over RDMA benchmark
>>
>> On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>
>>>> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
>>>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
>>>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
>>>> same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
>>> ...
>> [snip]
>>
>>>>      36.18%          nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
>>> That's the inode i_mutex.
>>>
>>>>      14.70%-- svc_send
>>> That's the xpt_mutex (ensuring rpc replies aren't interleaved).
>>>
>>>>       9.63%          nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
>>>>
>>> And that (and __free_iova below) looks like iova_rbtree_lock.
>>>
>>>
>> Let's revisit your command:
>>
>> "FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128 --
>> ioengine=libaio --size=100000k --prioclass=1 --prio=0 --cpumask=255
>> --loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --
>> norandommap --group_reporting --exitall --buffered=0"
>>
> I tried block sizes from 4-512K.
> 4K does not give 2.2GB bandwidth - optimal bandwidth is achieved around 128-256K block size
>
>> * inode's i_mutex:
>> If increasing process/file count didn't help, maybe increase "iodepth"
>> (say 512 ?) could offset the i_mutex overhead a little bit ?
>>
> I tried with different iodepth parameters, but found no improvement above iodepth 128.
>
>> * xpt_mutex:
>> (no idea)
>>
>> * iova_rbtree_lock
>> DMA mapping fragmentation ? I have not studied whether NFS-RDMA
>> routines such as "svc_rdma_sendto()" could do better but maybe sequential
>> IO (instead of "randread") could help ? Bigger block size (instead of 4K) can
>> help ?
>>

I think the biggest issue is that max_payload for TCP is 2MB but only 
256k for RDMA.

> I am trying to simulate real load (more or less), that is the reason I use randread. Anyhow, read does not result in better performance.
> It's probably because backing storage is tmpfs...
>
> Yan
>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-29 13:07                             ` Tom Tucker
  0 siblings, 0 replies; 82+ messages in thread
From: Tom Tucker @ 2013-04-29 13:07 UTC (permalink / raw)
  To: Yan Burman
  Cc: Wendy Cheng, J. Bruce Fields, Atchley, Scott,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On 4/29/13 8:05 AM, Tom Tucker wrote:
> On 4/29/13 7:16 AM, Yan Burman wrote:
>>
>>> -----Original Message-----
>>> From: Wendy Cheng [mailto:s.wendy.cheng@gmail.com]
>>> Sent: Monday, April 29, 2013 08:35
>>> To: J. Bruce Fields
>>> Cc: Yan Burman; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
>>> linux-nfs@vger.kernel.org; Or Gerlitz
>>> Subject: Re: NFS over RDMA benchmark
>>>
>>> On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields 
>>> <bfields@fieldses.org> wrote:
>>>
>>>>> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
>>>>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
>>>>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
>>>>> same block sizes (4-512K). running over IPoIB-CM, I get 
>>>>> 200-980MB/sec.
>>>> ...
>>> [snip]
>>>
>>>>>      36.18%          nfsd [kernel.kallsyms]   [k] mutex_spin_on_owner
>>>> That's the inode i_mutex.
>>>>
>>>>>      14.70%-- svc_send
>>>> That's the xpt_mutex (ensuring rpc replies aren't interleaved).
>>>>
>>>>>       9.63%          nfsd [kernel.kallsyms]   [k] 
>>>>> _raw_spin_lock_irqsave
>>>>>
>>>> And that (and __free_iova below) looks like iova_rbtree_lock.
>>>>
>>>>
>>> Let's revisit your command:
>>>
>>> "FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128 --
>>> ioengine=libaio --size=100000k --prioclass=1 --prio=0 --cpumask=255
>>> --loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 
>>> --randrepeat=1 --
>>> norandommap --group_reporting --exitall --buffered=0"
>>>
>> I tried block sizes from 4-512K.
>> 4K does not give 2.2GB bandwidth - optimal bandwidth is achieved 
>> around 128-256K block size
>>
>>> * inode's i_mutex:
>>> If increasing process/file count didn't help, maybe increase "iodepth"
>>> (say 512 ?) could offset the i_mutex overhead a little bit ?
>>>
>> I tried with different iodepth parameters, but found no improvement 
>> above iodepth 128.
>>
>>> * xpt_mutex:
>>> (no idea)
>>>
>>> * iova_rbtree_lock
>>> DMA mapping fragmentation ? I have not studied whether NFS-RDMA
>>> routines such as "svc_rdma_sendto()" could do better but maybe 
>>> sequential
>>> IO (instead of "randread") could help ? Bigger block size (instead 
>>> of 4K) can
>>> help ?
>>>
>
> I think the biggest issue is that max_payload for TCP is 2MB but only 
> 256k for RDMA.

Sorry, I mean 1MB...
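
To make the payload limit concrete, here is a small sketch that counts how many RPCs one I/O needs under each limit. It assumes the client always issues requests at the largest size the transport allows; the limits are the 1MB and 256k figures given above, and the function name is illustrative only.

```python
KIB = 1024

# Per-RPC payload limits as stated in this thread (after the 1MB correction).
MAX_PAYLOAD = {"tcp": 1024 * KIB, "rdma": 256 * KIB}


def rpcs_per_io(io_bytes, max_payload_bytes):
    """Number of RPCs needed to move one I/O of io_bytes (ceiling division)."""
    return -(-io_bytes // max_payload_bytes)


if __name__ == "__main__":
    for kb in (4, 64, 256, 512):
        io = kb * KIB
        print(f"{kb:>4}K  tcp={rpcs_per_io(io, MAX_PAYLOAD['tcp'])}  "
              f"rdma={rpcs_per_io(io, MAX_PAYLOAD['rdma'])}")
```

Under that assumption the two transports differ only for I/O sizes above 256K, which in the tested 4-512K range means the 512K case.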

>
>> I am trying to simulate real load (more or less), that is the reason 
>> I use randread. Anyhow, read does not result in better performance.
>> It's probably because backing storage is tmpfs...
>>
>> Yan
>>
>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: NFS over RDMA benchmark
@ 2013-04-30  5:09                       ` Yan Burman
  0 siblings, 0 replies; 82+ messages in thread
From: Yan Burman @ 2013-04-30  5:09 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Wendy Cheng, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz



> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields@fieldses.org]
> Sent: Sunday, April 28, 2013 17:43
> To: Yan Burman
> Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
> linux-nfs@vger.kernel.org; Or Gerlitz
> Subject: Re: NFS over RDMA benchmark
> 
> On Sun, Apr 28, 2013 at 06:28:16AM +0000, Yan Burman wrote:
> > > > > > > > >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
> > > > > > > > >> <yanb@mellanox.com>
> > > > > > > > >>> I've been trying to do some benchmarks for NFS over
> > > > > > > > >>> RDMA and I seem to
> > > > > > > > only get about half of the bandwidth that the HW can give me.
> > > > > > > > >>> My setup consists of 2 servers each with 16 cores,
> > > > > > > > >>> 32Gb of memory, and
> > > > > > > > Mellanox ConnectX3 QDR card over PCI-e gen3.
> > > > > > > > >>> These servers are connected to a QDR IB switch. The
> > > > > > > > >>> backing storage on
> > > > > > > > the server is tmpfs mounted with noatime.
> > > > > > > > >>> I am running kernel 3.5.7.
> > > > > > > > >>>
> > > > > > > > >>> When running ib_send_bw, I get 4.3-4.5 GB/sec for
> > > > > > > > >>> block sizes 4-
> > > 512K.
> > > > > > > > >>> When I run fio over rdma mounted nfs, I get
> > > > > > > > >>> 260-2200MB/sec for the
> > > > > > > > same block sizes (4-512K). running over IPoIB-CM, I get
> > > > > > > > 200-
> > > 980MB/sec.
> ...
> > > > > > > I am trying to get maximum performance from a single server
> > > > > > > - I used 2
> > > > > > processes in fio test - more than 2 did not show any performance
> boost.
> > > > > > > I tried running fio from 2 different PCs on 2 different
> > > > > > > files, but the sum of
> > > > > > the two is more or less the same as running from single client PC.
> > > > > > >

I finally got up to 4.1GB/sec bandwidth with RDMA (IPoIB-CM bandwidth is also much higher now).
For some reason, performance dropped significantly when I had the Intel IOMMU enabled.
I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
Next I will look into why I am running at only 40Gbit/s instead of 56Gbit/s, but that is a separate, unrelated problem (I suspect a cable issue).

This is still strange: ib_send_bw reached 4.5GB/sec with the Intel IOMMU enabled, so why did the IOMMU affect only the NFS code?
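
For anyone repeating the comparison, one way to record which IOMMU setting a host booted with is to look at the kernel command line. The sketch below is a minimal parser, assuming the setting was passed through the `intel_iommu=` or `iommu=` boot parameters; an empty result does not prove the IOMMU is off, since the kernel build configuration can enable it by default. The function name is illustrative.

```python
def iommu_params(cmdline):
    """Return the intel_iommu= and iommu= settings found in a kernel command line."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in ("intel_iommu", "iommu"):
            found[key] = value
    return found


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print(iommu_params(f.read()))
    except OSError:
        # Not on Linux, or /proc is unavailable.
        print({})
```

Logging this next to each benchmark result makes it easier to tell later which runs had the IOMMU in the data path.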

Yan


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-30 13:05                         ` Tom Talpey
  0 siblings, 0 replies; 82+ messages in thread
From: Tom Talpey @ 2013-04-30 13:05 UTC (permalink / raw)
  To: Yan Burman
  Cc: J. Bruce Fields, Wendy Cheng, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On 4/30/2013 1:09 AM, Yan Burman wrote:
>
>
>> -----Original Message-----
>> From: J. Bruce Fields [mailto:bfields@fieldses.org]
>> Sent: Sunday, April 28, 2013 17:43
>> To: Yan Burman
>> Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
>> linux-nfs@vger.kernel.org; Or Gerlitz
>> Subject: Re: NFS over RDMA benchmark
>>
>> On Sun, Apr 28, 2013 at 06:28:16AM +0000, Yan Burman wrote:
>>>>>>>>>>> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
>>>>>>>>>>> <yanb@mellanox.com>
>>>>>>>>>>>> I've been trying to do some benchmarks for NFS over
>>>>>>>>>>>> RDMA and I seem to
>>>>>>>>> only get about half of the bandwidth that the HW can give me.
>>>>>>>>>>>> My setup consists of 2 servers each with 16 cores,
>>>>>>>>>>>> 32Gb of memory, and
>>>>>>>>> Mellanox ConnectX3 QDR card over PCI-e gen3.
>>>>>>>>>>>> These servers are connected to a QDR IB switch. The
>>>>>>>>>>>> backing storage on
>>>>>>>>> the server is tmpfs mounted with noatime.
>>>>>>>>>>>> I am running kernel 3.5.7.
>>>>>>>>>>>>
>>>>>>>>>>>> When running ib_send_bw, I get 4.3-4.5 GB/sec for
>>>>>>>>>>>> block sizes 4-
>>>> 512K.
>>>>>>>>>>>> When I run fio over rdma mounted nfs, I get
>>>>>>>>>>>> 260-2200MB/sec for the
>>>>>>>>> same block sizes (4-512K). running over IPoIB-CM, I get
>>>>>>>>> 200-
>>>> 980MB/sec.
>> ...
>>>>>>>> I am trying to get maximum performance from a single server
>>>>>>>> - I used 2
>>>>>>> processes in fio test - more than 2 did not show any performance
>> boost.
>>>>>>>> I tried running fio from 2 different PCs on 2 different
>>>>>>>> files, but the sum of
>>>>>>> the two is more or less the same as running from single client PC.
>>>>>>>>
>
> I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is also way higher now).
> For some reason when I had intel IOMMU enabled, the performance dropped significantly.
> I now get up to ~95K IOPS and 4.1GB/sec bandwidth.

Excellent, but is that 95K IOPS a typo? At 4KB, that's less than 400MBps.
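
The arithmetic behind that question, as a short sketch; it assumes 4KB means 4096 bytes and reports decimal megabytes per second, and the function name is illustrative.

```python
def bandwidth_mb_per_s(iops, block_bytes):
    """Bandwidth in decimal megabytes per second for a given IOPS and block size."""
    return iops * block_bytes / 1e6


if __name__ == "__main__":
    # 95K IOPS at 4096-byte blocks is roughly 389 MB/s, well under 4.1GB/sec.
    print(bandwidth_mb_per_s(95_000, 4 * 1024))
```

So the two figures cannot both come from the same 4KB run.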

What is the client CPU percentage you see under this workload, and
how different are the NFS/RDMA and NFS/IPoIB overheads?

> Now I will take care of the issue that I am running only at 40Gbit/s instead of 56Gbit/s, but that is another unrelated problem (I suspect I have a cable issue).
>
> This is still strange, since ib_send_bw with intel iommu enabled did get up to 4.5GB/sec, so why did intel iommu affect only nfs code?

You'll need to do more profiling to track that down. I would suspect
that ib_send_bw is using some sort of direct hardware access, bypassing
the IOMMU management and possibly performing no dynamic memory registration.

The NFS/RDMA code goes via the standard kernel DMA API, and correctly
registers/deregisters memory on a per-i/o basis in order to provide
storage data integrity. Perhaps there are overheads in the IOMMU
management which can be addressed.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: NFS over RDMA benchmark
@ 2013-04-30 14:23                           ` Yan Burman
  0 siblings, 0 replies; 82+ messages in thread
From: Yan Burman @ 2013-04-30 14:23 UTC (permalink / raw)
  To: Tom Talpey
  Cc: J. Bruce Fields, Wendy Cheng, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz



> -----Original Message-----
> From: Tom Talpey [mailto:tom@talpey.com]
> Sent: Tuesday, April 30, 2013 16:05
> To: Yan Burman
> Cc: J. Bruce Fields; Wendy Cheng; Atchley, Scott; Tom Tucker; linux-
> rdma@vger.kernel.org; linux-nfs@vger.kernel.org; Or Gerlitz
> Subject: Re: NFS over RDMA benchmark
> 
> On 4/30/2013 1:09 AM, Yan Burman wrote:
> >
> >
> >> -----Original Message-----
> >> From: J. Bruce Fields [mailto:bfields@fieldses.org]
> >> Sent: Sunday, April 28, 2013 17:43
> >> To: Yan Burman
> >> Cc: Wendy Cheng; Atchley, Scott; Tom Tucker;
> >> linux-rdma@vger.kernel.org; linux-nfs@vger.kernel.org; Or Gerlitz
> >> Subject: Re: NFS over RDMA benchmark
> >>
> >> On Sun, Apr 28, 2013 at 06:28:16AM +0000, Yan Burman wrote:
> >>>>>>>>>>> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
> >>>>>>>>>>> <yanb@mellanox.com>
> >>>>>>>>>>>> I've been trying to do some benchmarks for NFS over RDMA
> >>>>>>>>>>>> and I seem to
> >>>>>>>>> only get about half of the bandwidth that the HW can give me.
> >>>>>>>>>>>> My setup consists of 2 servers each with 16 cores, 32Gb of
> >>>>>>>>>>>> memory, and
> >>>>>>>>> Mellanox ConnectX3 QDR card over PCI-e gen3.
> >>>>>>>>>>>> These servers are connected to a QDR IB switch. The backing
> >>>>>>>>>>>> storage on
> >>>>>>>>> the server is tmpfs mounted with noatime.
> >>>>>>>>>>>> I am running kernel 3.5.7.
> >>>>>>>>>>>>
> >>>>>>>>>>>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block
> >>>>>>>>>>>> sizes 4-
> >>>> 512K.
> >>>>>>>>>>>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec
> >>>>>>>>>>>> for the
> >>>>>>>>> same block sizes (4-512K). running over IPoIB-CM, I get
> >>>>>>>>> 200-
> >>>> 980MB/sec.
> >> ...
> >>>>>>>> I am trying to get maximum performance from a single server
> >>>>>>>> - I used 2
> >>>>>>> processes in fio test - more than 2 did not show any performance
> >> boost.
> >>>>>>>> I tried running fio from 2 different PCs on 2 different files,
> >>>>>>>> but the sum of
> >>>>>>> the two is more or less the same as running from single client PC.
> >>>>>>>>
> >
> > I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is
> also way higher now).
> > For some reason when I had intel IOMMU enabled, the performance
> dropped significantly.
> > I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
> 
> Excellent, but is that 95K IOPS a typo? At 4KB, that's less than 400MBps.
> 

That is not a typo. I get 95K IOPS with randrw test with block size 4K.
I get 4.1GBps with block size 256K randread test.
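
As a rough sanity check on how these two figures relate, here is a minimal sketch that converts between IOPS and bandwidth. The helper name is hypothetical, and it assumes "4K" and "256K" mean 4 KiB and 256 KiB while MB/GB are decimal units.

```python
# Hypothetical helper: convert an IOPS figure and a block size into bandwidth.
def bandwidth_mb_per_s(iops, block_size_bytes):
    """Return throughput in decimal MB/s (1 MB = 10**6 bytes)."""
    return iops * block_size_bytes / 1e6

# ~95K IOPS at 4 KiB blocks (the randrw figure reported above)
small = bandwidth_mb_per_s(95_000, 4 * 1024)

# IOPS needed to sustain 4.1 GB/s at 256 KiB blocks (the randread figure)
large_iops = 4.1e9 / (256 * 1024)

print(f"{small:.0f} MB/s at 4K blocks; {large_iops:.0f} IOPS at 256K blocks")
```

Under those assumptions the 4K run moves a little under 400 MB/s, and the 256K run needs only on the order of 15-16K operations per second, so the two results stress different parts of the stack.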

> What is the client CPU percentage you see under this workload, and how
> different are the NFS/RDMA and NFS/IPoIB overheads?

NFS/RDMA has about 20-30% more CPU usage than NFS/IPoIB, but RDMA has almost twice the bandwidth of IPoIB.
Overall, CPU usage gets up to about 20% for randread and 50% for randwrite.

> 
> > Now I will take care of the issue that I am running only at 40Gbit/s instead
> of 56Gbit/s, but that is another unrelated problem (I suspect I have a cable
> issue).
> >
> > This is still strange, since ib_send_bw with intel iommu enabled did get up
> to 4.5GB/sec, so why did intel iommu affect only nfs code?
> 
> You'll need to do more profiling to track that down. I would suspect that
> ib_send_bw is using some sort of direct hardware access, bypassing the
> IOMMU management and possibly performing no dynamic memory
> registration.
> 
> The NFS/RDMA code goes via the standard kernel DMA API, and correctly
> registers/deregisters memory on a per-i/o basis in order to provide storage
> data integrity. Perhaps there are overheads in the IOMMU management
> which can be addressed.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-30 14:44                             ` Tom Talpey
  0 siblings, 0 replies; 82+ messages in thread
From: Tom Talpey @ 2013-04-30 14:44 UTC (permalink / raw)
  To: Yan Burman
  Cc: J. Bruce Fields, Wendy Cheng, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On 4/30/2013 10:23 AM, Yan Burman wrote:
>> -----Original Message-----
>> From: Tom Talpey [mailto:tom@talpey.com]
>>>> On Sun, Apr 28, 2013 at 06:28:16AM +0000, Yan Burman wrote:
>>> I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is
>> also way higher now).
>>> For some reason when I had intel IOMMU enabled, the performance
>> dropped significantly.
>>> I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
>>
>> Excellent, but is that 95K IOPS a typo? At 4KB, that's less than 400MBps.
>>
>
> That is not a typo. I get 95K IOPS with randrw test with block size 4K.
> I get 4.1GBps with block size 256K randread test.

Well, then I suggest you focus on whether you are satisfied with a
high bandwidth goal or a high IOPS goal. They are two very different
things, and clearly there are still significant issues to track down
in the server.

>> What is the client CPU percentage you see under this workload, and how
>> different are the NFS/RDMA and NFS/IPoIB overheads?
>
> NFS/RDMA has about 20-30% more CPU usage than NFS/IPoIB, but RDMA has almost twice the bandwidth of IPoIB.

So, for 125% of the CPU, RDMA is delivering 200% of the bandwidth.
A common reporting approach is to calculate cycles per Byte (roughly,
CPU/MB/sec), and you'll find this can be a great tool for comparison
when overhead is a consideration.
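
To make that comparison concrete, here is a minimal sketch of the calculation. The function names are hypothetical. The 2.6 GHz clock is an assumed value, since the thread does not give the CPU frequency, and the sketch also assumes the reported 20% is total utilization across all 16 cores, which the thread does not confirm.

```python
def relative_cost_per_byte(cpu_ratio, bandwidth_ratio):
    """CPU cost per byte of transport A relative to transport B."""
    return cpu_ratio / bandwidth_ratio

def cycles_per_byte(cpu_fraction, n_cores, clock_hz, bytes_per_s):
    """Approximate CPU cycles spent per byte transferred."""
    return cpu_fraction * n_cores * clock_hz / bytes_per_s

# RDMA vs IPoIB: ~125% of the CPU for ~200% of the bandwidth
relative = relative_cost_per_byte(1.25, 2.0)

# randread: ~20% CPU on 16 cores at an ASSUMED 2.6 GHz, moving 4.1 GB/s
cpb = cycles_per_byte(0.20, 16, 2.6e9, 4.1e9)

print(f"RDMA cost per byte relative to IPoIB: {relative:.3f}")
print(f"approx. cycles per byte (assumed clock): {cpb:.2f}")
```

A ratio below 1.0 means RDMA spends less CPU per byte than IPoIB even though its absolute CPU usage is higher.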

> Overall, CPU usage gets up to about 20% for randread and 50% for randwrite.

This is *client* CPU? Writes require the server to take additional
overhead to make RDMA Read requests, but the client side is doing
practically the same thing for the read vs write path. Again, you
may want to profile more deeply to track that difference down.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-30 14:20                         ` Tom Talpey
  0 siblings, 0 replies; 82+ messages in thread
From: Tom Talpey @ 2013-04-30 14:20 UTC (permalink / raw)
  To: Yan Burman
  Cc: J. Bruce Fields, Wendy Cheng, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On 4/30/2013 1:09 AM, Yan Burman wrote:
> I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
>...
>  ib_send_bw with intel iommu enabled did get up to 4.5GB/sec

BTW, you may want to verify that these are the same GB. Many
benchmarks say KB/MB/GB when they really mean KiB/MiB/GiB.

At GB/GiB, the difference is about 7.5%, very close to the
difference between 4.1 and 4.5.

Just a thought.
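
For reference, a small sketch of the unit arithmetic, assuming the only two candidates are decimal GB (10**9 bytes) and binary GiB (2**30 bytes):

```python
GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gibibyte

# How much larger a GiB is than a GB
unit_gap = GIB / GB - 1

# Gap between the two reported figures, taken at face value
measured_gap = 4.5 / 4.1 - 1

# What 4.1 would be in GB/s if it had actually been measured in GiB/s
rescaled = 4.1 * GIB / GB

print(f"GiB vs GB: {unit_gap:.1%}")
print(f"4.5 vs 4.1: {measured_gap:.1%}")
print(f"4.1 GiB/s = {rescaled:.2f} GB/s")
```

By this arithmetic a unit mismatch could account for most of the gap between the two numbers, though not quite all of it.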

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: NFS over RDMA benchmark
@ 2013-04-30 14:38                           ` Yan Burman
  0 siblings, 0 replies; 82+ messages in thread
From: Yan Burman @ 2013-04-30 14:38 UTC (permalink / raw)
  To: Tom Talpey
  Cc: J. Bruce Fields, Wendy Cheng, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz



> -----Original Message-----
> From: Tom Talpey [mailto:tom@talpey.com]
> Sent: Tuesday, April 30, 2013 17:20
> To: Yan Burman
> Cc: J. Bruce Fields; Wendy Cheng; Atchley, Scott; Tom Tucker; linux-
> rdma@vger.kernel.org; linux-nfs@vger.kernel.org; Or Gerlitz
> Subject: Re: NFS over RDMA benchmark
> 
> On 4/30/2013 1:09 AM, Yan Burman wrote:
> > I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
> >...
> >  ib_send_bw with intel iommu enabled did get up to 4.5GB/sec
> 
> BTW, you may want to verify that these are the same GB. Many benchmarks
> say KB/MB/GB when they really mean KiB/MiB/GiB.
> 
> At GB/GiB, the difference is about 7.5%, very close to the difference between
> 4.1 and 4.5.
> 
> Just a thought.

The question is not why there is 400MBps difference between ib_send_bw and NFSoRDMA.
The question is why with IOMMU ib_send_bw got to the same bandwidth as without it while NFSoRDMA got half.

From some googling, it seems that when IOMMU is enabled, dma mapping functions get a lot more expensive.
Perhaps that is the reason for the performance drop.

Yan


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-30 18:58                             ` Tom Tucker
  0 siblings, 0 replies; 82+ messages in thread
From: Tom Tucker @ 2013-04-30 18:58 UTC (permalink / raw)
  To: Yan Burman
  Cc: Tom Talpey, J. Bruce Fields, Wendy Cheng, Atchley, Scott,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On 4/30/13 9:38 AM, Yan Burman wrote:
>
>> -----Original Message-----
>> From: Tom Talpey [mailto:tom@talpey.com]
>> Sent: Tuesday, April 30, 2013 17:20
>> To: Yan Burman
>> Cc: J. Bruce Fields; Wendy Cheng; Atchley, Scott; Tom Tucker; linux-
>> rdma@vger.kernel.org; linux-nfs@vger.kernel.org; Or Gerlitz
>> Subject: Re: NFS over RDMA benchmark
>>
>> On 4/30/2013 1:09 AM, Yan Burman wrote:
>>> I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
>>> ...
>>>   ib_send_bw with intel iommu enabled did get up to 4.5GB/sec
>> BTW, you may want to verify that these are the same GB. Many benchmarks
>> say KB/MB/GB when they really mean KiB/MiB/GiB.
>>
>> At GB/GiB, the difference is about 7.5%, very close to the difference between
>> 4.1 and 4.5.
>>
>> Just a thought.
> The question is not why there is 400MBps difference between ib_send_bw and NFSoRDMA.
> The question is why with IOMMU ib_send_bw got to the same bandwidth as without it while NFSoRDMA got half.
NFSRDMA is constantly registering and unregistering memory when you use 
FRMR mode. By contrast IPoIB has a descriptor ring that is set up once 
and re-used. I suspect this is the difference maker. Have you tried 
running the server in ALL_PHYSICAL mode, i.e. where it uses a DMA_MR for 
all of memory?

Tom
> From some googling, it seems that when IOMMU is enabled, dma mapping functions get a lot more expensive.
> Perhaps that is the reason for the performance drop.
>
> Yan


^ permalink raw reply	[flat|nested] 82+ messages in thread

[parent not found: <CALsNU1MsjH5=p4Wtj2aJ5+odC7y7-5oTGhrzOL-=15pXaYYUZw@mail.gmail.com>]

[parent not found: <CABgxfbFhZTBO81WC5BcRRfQB_YBjE4N=sfS+G9eAzaFHYC_dWw@mail.gmail.com>]

* Re: NFS over RDMA benchmark
@ 2013-06-20 14:56                                   ` Or Gerlitz
  0 siblings, 0 replies; 82+ messages in thread
From: Or Gerlitz @ 2013-06-20 14:56 UTC (permalink / raw)
  To: Wendy Cheng
  Cc: Devesh Sharma, Tom Tucker, Yan Burman, Tom Talpey,
	J. Bruce Fields, Atchley, Scott, linux-rdma@vger.kernel.org,
	linux-nfs@vger.kernel.org

On 19/06/2013 18:47, Wendy Cheng wrote:
> what kind of HW I would need to run it ?

The mlx4 driver supports memory windows as of kernel 3.9

Or.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-30 16:24                         ` Wendy Cheng
  0 siblings, 0 replies; 82+ messages in thread
From: Wendy Cheng @ 2013-04-30 16:24 UTC (permalink / raw)
  To: Yan Burman
  Cc: J. Bruce Fields, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On Mon, Apr 29, 2013 at 10:09 PM, Yan Burman <yanb@mellanox.com> wrote:
>
> I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is also way higher now).
> For some reason when I had intel IOMMU enabled, the performance dropped significantly.
> I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
> Now I will take care of the issue that I am running only at 40Gbit/s instead of 56Gbit/s, but that is another unrelated problem (I suspect I have a cable issue).
>
> This is still strange, since ib_send_bw with intel iommu enabled did get up to 4.5GB/sec, so why did intel iommu affect only nfs code?
>

That's very exciting! The sad part is that IOMMU has to be turned off.

I think ib_send_bw uses a single buffer so the DMA mapping search
overhead is not an issue.

-- Wendy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-30 13:38                       ` J. Bruce Fields
  0 siblings, 0 replies; 82+ messages in thread
From: J. Bruce Fields @ 2013-04-30 13:38 UTC (permalink / raw)
  To: Yan Burman
  Cc: Wendy Cheng, Atchley, Scott, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On Sun, Apr 28, 2013 at 10:42:48AM -0400, J. Bruce Fields wrote:
> On Sun, Apr 28, 2013 at 06:28:16AM +0000, Yan Burman wrote:
> > > > > > > > >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
> > > > > > > > >> <yanb@mellanox.com>
> > > > > > > > >>> I've been trying to do some benchmarks for NFS over RDMA
> > > > > > > > >>> and I seem to
> > > > > > > > only get about half of the bandwidth that the HW can give me.
> > > > > > > > >>> My setup consists of 2 servers each with 16 cores, 32Gb of
> > > > > > > > >>> memory, and
> > > > > > > > Mellanox ConnectX3 QDR card over PCI-e gen3.
> > > > > > > > >>> These servers are connected to a QDR IB switch. The
> > > > > > > > >>> backing storage on
> > > > > > > > the server is tmpfs mounted with noatime.
> > > > > > > > >>> I am running kernel 3.5.7.
> > > > > > > > >>>
> > > > > > > > >>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-
> > > 512K.
> > > > > > > > >>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec
> > > > > > > > >>> for the
> > > > > > > > same block sizes (4-512K). running over IPoIB-CM, I get 200-
> > > 980MB/sec.
> ...
> > > > > > > I am trying to get maximum performance from a single server - I
> > > > > > > used 2
> > > > > > processes in fio test - more than 2 did not show any performance boost.
> > > > > > > I tried running fio from 2 different PCs on 2 different files,
> > > > > > > but the sum of
> > > > > > the two is more or less the same as running from single client PC.
> > > > > > >
> > > > > > > What I did see is that server is sweating a lot more than the
> > > > > > > clients and
> > > > > > more than that, it has 1 core (CPU5) in 100% softirq tasklet:
> > > > > > > cat /proc/softirqs
> ...
> > > > > Perf top for the CPU with high tasklet count gives:
> > > > >
> > > > >              samples  pcnt         RIP        function                    DSO
> ...
> > > > >              2787.00 24.1% ffffffff81062a00 mutex_spin_on_owner
> > > /root/vmlinux
> ...
> > > Googling around....  I think we want:
> > > 
> > > 	perf record -a --call-graph
> > > 	(give it a chance to collect some samples, then ^C)
> > > 	perf report --call-graph --stdio
> > > 
> > 
> > Sorry it took me a while to get perf to show the call trace (did not enable frame pointers in kernel and struggled with perf options...), but what I get is:
> >     36.18%          nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
> >                     |
> >                     --- mutex_spin_on_owner
> >                        |
> >                        |--99.99%-- __mutex_lock_slowpath
> >                        |          mutex_lock
> >                        |          |
> >                        |          |--85.30%-- generic_file_aio_write
> 
> That's the inode i_mutex.

Looking at the code....  With CONFIG_MUTEX_SPIN_ON_OWNER it spins
(instead of sleeping) as long as the lock owner's still running.  So
this is just a lot of contention on the i_mutex, I guess.  Not sure what
to do about that.

--b.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-19  2:27   ` Peng Tao
  0 siblings, 0 replies; 82+ messages in thread
From: Peng Tao @ 2013-04-19  2:27 UTC (permalink / raw)
  To: Yan Burman
  Cc: J. Bruce Fields, Tom Tucker, linux-rdma@vger.kernel.org,
	linux-nfs@vger.kernel.org

On Wed, Apr 17, 2013 at 10:36 PM, Yan Burman <yanb@mellanox.com> wrote:
> Hi.
>
> I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me.
> My setup consists of 2 servers each with 16 cores, 32Gb of memory, and Mellanox ConnectX3 QDR card over PCI-e gen3.
> These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime.
> I am running kernel 3.5.7.
>
> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
> When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
> I got to these results after the following optimizations:
> 1. Setting IRQ affinity to the CPUs that are part of the NUMA node the card is on
> 2. Increasing /proc/sys/sunrpc/svc_rdma/max_outbound_read_requests and /proc/sys/sunrpc/svc_rdma/max_requests to 256 on server
> 3. Increasing RPCNFSDCOUNT to 32 on server
Did you try to affine nfsd to the CPUs local to your IB card? Given
that you see a CPU bottleneck (as in your later email), it might be
worth trying.

> 4. FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128 --ioengine=libaio --size=100000k --prioclass=1 --prio=0 --cpumask=255 --loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap --group_reporting --exitall --buffered=0
>
On the client side, it may also help to affine the FIO processes and
nfsiod to the CPUs local to the IB card, in case the client is the
bottleneck.
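
A rough sketch of the server-side pinning, assuming the HCA shows up as mlx4_0 (a guess; check /sys/class/infiniband/ for the real name) and that its PCI device exposes local_cpulist in sysfs. SYSFS_ROOT, DRY_RUN and the function names are illustrative only. By default it just prints the taskset commands:

```shell
#!/bin/sh
# Sketch: pin nfsd threads to the CPUs local to an InfiniBand HCA.
# SYSFS_ROOT and the device name mlx4_0 are assumptions for illustration.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Print the CPU list (e.g. "0-7") local to the given IB device.
ib_local_cpulist() {
    cat "$SYSFS_ROOT/class/infiniband/$1/device/local_cpulist"
}

# Print (DRY_RUN=1, the default) or run one taskset command per nfsd thread.
pin_nfsd() {
    cpus=$1
    for pid in $(pgrep -x nfsd); do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "taskset -c -p $cpus $pid"
        else
            taskset -c -p "$cpus" "$pid"
        fi
    done
}

dev=${1:-mlx4_0}
if cpus=$(ib_local_cpulist "$dev" 2>/dev/null); then
    pin_nfsd "$cpus"
else
    echo "no local_cpulist for $dev (is the device name right?)" >&2
fi
```

Running it with DRY_RUN=0 as root would apply the affinity; whether that helps depends on where the real bottleneck is.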

--
Thanks,
Tao

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: NFS over RDMA benchmark
@ 2013-04-22 11:07     ` Yan Burman
  0 siblings, 0 replies; 82+ messages in thread
From: Yan Burman @ 2013-04-22 11:07 UTC (permalink / raw)
  To: Peng Tao
  Cc: J. Bruce Fields, Tom Tucker,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

> -----Original Message-----
> From: Peng Tao [mailto:bergwolf@gmail.com]
> Sent: Friday, April 19, 2013 05:28
> To: Yan Burman
> Cc: J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org; linux-
> nfs@vger.kernel.org
> Subject: Re: NFS over RDMA benchmark
> 
> On Wed, Apr 17, 2013 at 10:36 PM, Yan Burman <yanb@mellanox.com>
> wrote:
> > Hi.
> >
> > I've been trying to do some benchmarks for NFS over RDMA and I seem to
> only get about half of the bandwidth that the HW can give me.
> > My setup consists of 2 servers each with 16 cores, 32Gb of memory, and
> Mellanox ConnectX3 QDR card over PCI-e gen3.
> > These servers are connected to a QDR IB switch. The backing storage on the
> server is tmpfs mounted with noatime.
> > I am running kernel 3.5.7.
> >
> > When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
> > When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same
> block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
> > I got to these results after the following optimizations:
> > 1. Setting IRQ affinity to the CPUs that are part of the NUMA node the
> > card is on 2. Increasing
> > /proc/sys/sunrpc/svc_rdma/max_outbound_read_requests and
> > /proc/sys/sunrpc/svc_rdma/max_requests to 256 on server 3. Increasing
> > RPCNFSDCOUNT to 32 on server
> Did you try to affine nfsd to corresponding CPUs where your IB card locates?
> Given that you see a bottleneck on CPU (as in your later email), it might be
> worth trying.

I tried to affine nfsd to CPUs on the NUMA node the IB card is on.
I also set tmpfs memory policy to allocate from the same NUMA node.
I did not see a big difference.
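
For reference, tmpfs takes its NUMA policy from the mpol= mount option. A small sketch that only builds the mount command; the node number, the mount point and the helper name are placeholders, not the values used in this test:

```shell
#!/bin/sh
# Sketch: build the mount command for a tmpfs bound to one NUMA node.
# The helper name, node number and mount point are assumptions.
tmpfs_mount_cmd() {
    node=$1
    mountpoint=$2
    echo "mount -t tmpfs -o noatime,mpol=bind:$node tmpfs $mountpoint"
}

# Print the command rather than running it; running it needs root.
tmpfs_mount_cmd 0 /mnt/nfs_export
```

The node to use can be read from the card's numa_node attribute under /sys/class/infiniband/<device>/device/.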

> 
> > 4. FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128
> > --ioengine=libaio --size=100000k --prioclass=1 --prio=0 --cpumask=255
> > --loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1
> > --norandommap --group_reporting --exitall --buffered=0
> >
> On client side, it may be good to affine FIO processes and nfsiod to CPUs
> where IB card locates as well, in case client is the bottleneck.
> 

I am doing that - cpumask=255 affines it to the NUMA node my card is on.
For some reason doing taskset on nfsiod fails.
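
To spell out the mask: fio's cpumask is a bit mask of allowed CPUs, so 255 (0xff) selects CPUs 0-7. A tiny helper to expand such a mask, with a made-up function name; whether CPUs 0-7 are the ones local to the card depends on the machine's topology:

```shell
#!/bin/sh
# Expand a decimal CPU bitmask into a comma-separated CPU list.
# The helper name is hypothetical.
mask_to_cpus() {
    mask=$1
    cpu=0
    list=
    while [ "$mask" -gt 0 ]; do
        # Bit N set in the mask means CPU N is allowed.
        if [ $((mask & 1)) -eq 1 ]; then
            list="${list:+$list,}$cpu"
        fi
        mask=$((mask >> 1))
        cpu=$((cpu + 1))
    done
    echo "$list"
}

mask_to_cpus 255   # prints 0,1,2,3,4,5,6,7
```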

> --
> Thanks,
> Tao

^ permalink raw reply	[flat|nested] 82+ messages in thread

[parent not found: <51703280.03e9440a.06a6.3f9f@mx.google.com>]

* Re: NFS over RDMA benchmark
@ 2013-04-18 19:15   ` Wendy Cheng
  0 siblings, 0 replies; 82+ messages in thread
From: Wendy Cheng @ 2013-04-18 19:15 UTC (permalink / raw)
  To: Spencer Shepler
  Cc: Yan Burman, Atchley, Scott, J. Bruce Fields, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On Thu, Apr 18, 2013 at 10:50 AM, Spencer Shepler
<spencer.shepler@gmail.com> wrote:
>
> Note that SPEC SFS does not support RDMA.
>

IIRC, the benchmark comes with source code. I wonder whether anyone has
modified it to run over RDMA? Or is there any real user who could share
their experience?

-- Wendy

> ________________________________
> From: Wendy Cheng
> Sent: 4/18/2013 9:16 AM
> To: Yan Burman
> Cc: Atchley, Scott; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org;
> linux-nfs@vger.kernel.org; Or Gerlitz
>
> Subject: Re: NFS over RDMA benchmark
>
> On Thu, Apr 18, 2013 at 5:47 AM, Yan Burman <yanb@mellanox.com> wrote:
>>
>>
>> What do you suggest for benchmarking NFS?
>>
>
> I believe SPECsfs has been widely used by NFS (server) vendors to
> position their product lines. Its workload was based on a real life
> NFS deployment. I think it is more toward office type of workload
> (large client/user count with smaller file sizes e.g. software
> development with build, compile, etc).
>
> BTW, we're experimenting a similar project and would be interested to
> know your findings.
>
> -- Wendy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-19  1:03     ` Atchley, Scott
  0 siblings, 0 replies; 82+ messages in thread
From: Atchley, Scott @ 2013-04-19  1:03 UTC (permalink / raw)
  To: Wendy Cheng
  Cc: Spencer Shepler, Yan Burman, J. Bruce Fields, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz

On Apr 18, 2013, at 3:15 PM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:

> On Thu, Apr 18, 2013 at 10:50 AM, Spencer Shepler
> <spencer.shepler@gmail.com> wrote:
>> 
>> Note that SPEC SFS does not support RDMA.
>> 
> 
> IIRC, the benchmark comes with source code - wondering anyone has
> modified it to run on RDMA ?  Or is there any real user to share the
> experience ?

I am not familiar with SpecSFS, but if it exercises the filesystem, it does not know which RPC layer NFS uses, no? Or does it implement its own client and directly access the RPC layer?

> 
> -- Wendy
> 
>> ________________________________
>> From: Wendy Cheng
>> Sent: 4/18/2013 9:16 AM
>> To: Yan Burman
>> Cc: Atchley, Scott; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org;
>> linux-nfs@vger.kernel.org; Or Gerlitz
>> 
>> Subject: Re: NFS over RDMA benchmark
>> 
>> On Thu, Apr 18, 2013 at 5:47 AM, Yan Burman <yanb@mellanox.com> wrote:
>>> 
>>> 
>>> What do you suggest for benchmarking NFS?
>>> 
>> 
>> I believe SPECsfs has been widely used by NFS (server) vendors to
>> position their product lines. Its workload was based on a real life
>> NFS deployment. I think it is more toward office type of workload
>> (large client/user count with smaller file sizes e.g. software
>> development with build, compile, etc).
>> 
>> BTW, we're experimenting a similar project and would be interested to
>> know your findings.
>> 
>> -- Wendy
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> 
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: NFS over RDMA benchmark
@ 2013-04-19  3:35       ` Spencer
  0 siblings, 0 replies; 82+ messages in thread
From: Spencer @ 2013-04-19  3:35 UTC (permalink / raw)
  To: Atchley, Scott
  Cc: Wendy Cheng, Yan Burman, J. Bruce Fields, Tom Tucker,
	linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org, Or Gerlitz



On Apr 18, 2013, at 6:03 PM, Atchley, Scott wrote:

> On Apr 18, 2013, at 3:15 PM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
> 
>> On Thu, Apr 18, 2013 at 10:50 AM, Spencer Shepler
>> <spencer.shepler@gmail.com> wrote:
>>> 
>>> Note that SPEC SFS does not support RDMA.
>>> 
>> 
>> IIRC, the benchmark comes with source code - wondering anyone has
>> modified it to run on RDMA ?  Or is there any real user to share the
>> experience ?
> 
> I am not familiar with SpecSFS, but if it exercises the filesystem, it does not know which RPC layer that NFS uses, no? Or does it implement its own client and directly access the RPC layer?


Yes, the SPEC SFS benchmark implements its own NFSv3 client, RPC layer, etc.

Spencer

> 
>> 
>> -- Wendy
>> 
>>> ________________________________
>>> From: Wendy Cheng
>>> Sent: 4/18/2013 9:16 AM
>>> To: Yan Burman
>>> Cc: Atchley, Scott; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org;
>>> linux-nfs@vger.kernel.org; Or Gerlitz
>>> 
>>> Subject: Re: NFS over RDMA benchmark
>>> 
>>> On Thu, Apr 18, 2013 at 5:47 AM, Yan Burman <yanb@mellanox.com> wrote:
>>>> 
>>>> 
>>>> What do you suggest for benchmarking NFS?
>>>> 
>>> 
>>> I believe SPECsfs has been widely used by NFS (server) vendors to
>>> position their product lines. Its workload was based on a real life
>>> NFS deployment. I think it is more toward office type of workload
>>> (large client/user count with smaller file sizes e.g. software
>>> development with build, compile, etc).
>>> 
>>> BTW, we're experimenting a similar project and would be interested to
>>> know your findings.
>>> 
>>> -- Wendy
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> 
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2013-06-20 14:56 UTC | newest]

Thread overview: 82+ messages
2013-04-17 14:36 NFS over RDMA benchmark Yan Burman
2013-04-17 17:15 ` Wendy Cheng
2013-04-17 17:32   ` Atchley, Scott
2013-04-17 18:06     ` Wendy Cheng
2013-04-18 12:47       ` Yan Burman
2013-04-18 16:16         ` Wendy Cheng
2013-04-23 21:06         ` J. Bruce Fields
2013-04-24 12:35           ` Yan Burman
2013-04-24 15:05             ` J. Bruce Fields
2013-04-24 15:26               ` J. Bruce Fields
2013-04-24 16:27                 ` Wendy Cheng
2013-04-24 18:04                   ` Wendy Cheng
2013-04-24 18:26                     ` Tom Talpey
2013-04-25 17:18                       ` Wendy Cheng
2013-04-25 19:01                         ` Phil Pishioneri
2013-04-25 20:14                           ` Tom Talpey
2013-04-25 20:04                         ` Tom Talpey
2013-04-25 21:17                           ` Tom Tucker
2013-04-25 21:58                             ` Wendy Cheng
2013-04-25 22:26                               ` Wendy Cheng
2013-04-28  6:28                 ` Yan Burman
2013-04-28 14:42                   ` J. Bruce Fields
2013-04-29  5:34                     ` Wendy Cheng
2013-04-29 12:16                       ` Yan Burman
2013-04-29 13:05                         ` Tom Tucker
2013-04-29 13:07                           ` Tom Tucker
2013-04-30  5:09                     ` Yan Burman
2013-04-30 13:05                       ` Tom Talpey
2013-04-30 14:23                         ` Yan Burman
2013-04-30 14:44                           ` Tom Talpey
2013-04-30 14:20                       ` Tom Talpey
2013-04-30 14:38                         ` Yan Burman
2013-04-30 18:58                           ` Tom Tucker
     [not found]                             ` <CALsNU1MsjH5=p4Wtj2aJ5+odC7y7-5oTGhrzOL-=15pXaYYUZw@mail.gmail.com>
     [not found]                               ` <CABgxfbFhZTBO81WC5BcRRfQB_YBjE4N=sfS+G9eAzaFHYC_dWw@mail.gmail.com>
2013-06-20 14:56                                 ` Or Gerlitz
2013-04-30 16:24                       ` Wendy Cheng
2013-04-30 13:38                     ` J. Bruce Fields
2013-04-19  2:27 ` Peng Tao
2013-04-22 11:07   ` Yan Burman
     [not found] <51703280.03e9440a.06a6.3f9f@mx.google.com>
2013-04-18 19:15 ` Wendy Cheng
2013-04-19  1:03   ` Atchley, Scott
2013-04-19  3:35     ` Spencer
