Re: NFS over RDMA benchmark

From: "J. Bruce Fields" <bfields@fieldses.org>
To: Yan Burman <yanb@mellanox.com>
Cc: Wendy Cheng <s.wendy.cheng@gmail.com>,
	"Atchley, Scott" <atchleyes@ornl.gov>,
	Tom Tucker <tom@opengridcomputing.com>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	Or Gerlitz <ogerlitz@mellanox.com>
Subject: Re: NFS over RDMA benchmark
Date: Tue, 23 Apr 2013 17:06:07 -0400
Message-ID: <20130423210607.GJ3676@fieldses.org>
In-Reply-To: <0EE9A1CDC8D6434DB00095CD7DB873462CF9715B@MTLDAG01.mtl.com>

On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
> 
> 
> > -----Original Message-----
> > From: Wendy Cheng [mailto:s.wendy.cheng@gmail.com]
> > Sent: Wednesday, April 17, 2013 21:06
> > To: Atchley, Scott
> > Cc: Yan Burman; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org;
> > linux-nfs@vger.kernel.org
> > Subject: Re: NFS over RDMA benchmark
> > 
> > On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott <atchleyes@ornl.gov> wrote:
> > > On Apr 17, 2013, at 1:15 PM, Wendy Cheng <s.wendy.cheng@gmail.com> wrote:
> > >
> > >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <yanb@mellanox.com> wrote:
> > >>> Hi.
> > >>>
> > >>> I've been trying to do some benchmarks for NFS over RDMA and I seem to
> > >>> only get about half of the bandwidth that the HW can give me.
> > >>> My setup consists of 2 servers, each with 16 cores, 32 GB of memory, and
> > >>> a Mellanox ConnectX3 QDR card over PCI-e gen3.
> > >>> These servers are connected to a QDR IB switch. The backing storage on
> > >>> the server is tmpfs mounted with noatime.
> > >>> I am running kernel 3.5.7.
> > >>>
> > >>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
> > >>> When I run fio over rdma mounted nfs, I get 260-2200 MB/sec for the
> > >>> same block sizes (4-512K). Running over IPoIB-CM, I get 200-980 MB/sec.
> > >
> > > Yan,
> > >
> > > Are you trying to optimize single client performance or server performance
> > > with multiple clients?
> > >
> 
> I am trying to get maximum performance from a single server. I used 2 processes in the fio test; more than 2 did not show any performance boost.
> I tried running fio from 2 different PCs on 2 different files, but the sum of the two is more or less the same as running from a single client PC.
> 
> What I did see is that the server is sweating a lot more than the clients and, more than that, it has 1 core (CPU5) at 100% in softirq tasklet:
> cat /proc/softirqs

Would any profiling help figure out which code it's spending time in?
(E.g. something as simple as "perf top" might have useful output.)

--b.

>                     CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15
>           HI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
>        TIMER:     418767      46596      43515      44547      50099      34815      40634      40337      39551      93442      73733      42631      42509      41592      40351      61793
>       NET_TX:      28719        309       1421       1294       1730       1243        832        937         11         44         41         20         26         19         15         29
>       NET_RX:     612070         19         22         21          6        235          3          2          9          6         17         16         20         13         16         10
>        BLOCK:       5941          0          0          0          0          0          0          0        519        259       1238        272        253        174        215       2618
> BLOCK_IOPOLL:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
>      TASKLET:         28          1          1          1          1    1540653          1          1         29          1          1          1          1          1          1          2
>        SCHED:     364965      26547      16807      18403      22919       8678      14358      14091      16981      64903      47141      18517      19179      18036      17037      38261
>      HRTIMER:         13          0          1          1          0          0          0          0          0          0          0          0          1          1          0          1
>          RCU:     945823     841546     715281     892762     823564      42663     863063     841622     333577     389013     393501     239103     221524     258159     313426     234030
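The skew in the table above can also be picked out mechanically. What follows is a minimal Python sketch, assuming the quoted /proc/softirqs layout (a header row of CPU names, then one "NAME: count ..." row per softirq). The helper names parse_softirqs and hottest_cpu are made up for illustration, and the two sample rows are copied from the table above. The counters are cumulative since boot, so on a live system the difference between two snapshots is probably more telling than a single read.

```python
def parse_softirqs(text):
    """Parse /proc/softirqs-style text into {softirq_name: [count per CPU]}."""
    table = {}
    for line in text.splitlines():
        line = line.lstrip("> ").strip()   # tolerate email quote markers
        if ":" not in line:
            continue                       # header row (CPU0 CPU1 ...) or blank
        name, _, rest = line.partition(":")
        table[name.strip()] = [int(tok) for tok in rest.split()]
    return table


def hottest_cpu(counts):
    """Return (cpu_index, share_of_total) for the CPU with the largest count."""
    total = sum(counts)
    if total == 0:
        return None, 0.0
    cpu = max(range(len(counts)), key=counts.__getitem__)
    return cpu, counts[cpu] / total


# Two rows copied from the quoted table; on a real system one would read
# open("/proc/softirqs").read() instead.
SAMPLE = """
>       NET_RX:     612070         19         22         21          6        235          3          2          9          6         17         16         20         13         16         10
>      TASKLET:         28          1          1          1          1    1540653          1          1         29          1          1          1          1          1          1          2
"""

table = parse_softirqs(SAMPLE)
for name, counts in table.items():
    cpu, share = hottest_cpu(counts)
    print(f"{name}: CPU{cpu} handled {share:.1%} of events")
```

With these rows the TASKLET work lands almost entirely on CPU5 and NET_RX on CPU0, which matches the single busy core described above.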
> > >
> > >> Remember there are always gaps between wire speed (that ib_send_bw
> > >> measures) and real world applications.
> 
> I realize that, but I don't expect the difference to be more than a factor of two.
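For reference, the gap under discussion works out as follows. This is only arithmetic on the figures quoted in this thread (ib_send_bw at 4.3-4.5 GB/sec, a best case of 2200 MB/sec for NFS over RDMA and 980 MB/sec for IPoIB-CM). It assumes decimal units (1 GB = 1000 MB), and the function name fraction_of_wire is invented for the sketch.

```python
MB_PER_GB = 1000  # assumption: decimal units, as the thread does not say

WIRE_GB_S = (4.3, 4.5)      # ib_send_bw range quoted in the thread
BEST_NFS_RDMA_MB_S = 2200   # best fio result over NFS/RDMA
BEST_IPOIB_MB_S = 980       # best fio result over IPoIB-CM


def fraction_of_wire(app_mb_s, wire_gb_s):
    """Application throughput as a fraction of the measured wire throughput."""
    return app_mb_s / (wire_gb_s * MB_PER_GB)


# Compare against both ends of the ib_send_bw range.
rdma_low = fraction_of_wire(BEST_NFS_RDMA_MB_S, max(WIRE_GB_S))
rdma_high = fraction_of_wire(BEST_NFS_RDMA_MB_S, min(WIRE_GB_S))
ipoib_low = fraction_of_wire(BEST_IPOIB_MB_S, max(WIRE_GB_S))
ipoib_high = fraction_of_wire(BEST_IPOIB_MB_S, min(WIRE_GB_S))

print(f"NFS/RDMA best case: {rdma_low:.0%} to {rdma_high:.0%} of ib_send_bw")
print(f"IPoIB-CM best case: {ipoib_low:.0%} to {ipoib_high:.0%} of ib_send_bw")
```

So the best NFS over RDMA number is roughly half of what ib_send_bw reports, and the best IPoIB-CM number is a bit under a quarter.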
> 
> > >>
> > >> That being said, does your server use the default export (sync) option?
> > >> Exporting the share with the "async" option can bring you closer to wire
> > >> speed. However, the practice (async) is generally not recommended in
> > >> a real production system, as it can cause data integrity issues, e.g.
> > >> you have more chances to lose data when the boxes crash.
> 
> I am running with the async export option, but that should not matter too much, since my backing storage is tmpfs mounted with noatime.
> 
> > >>
> > >> -- Wendy
> > >
> > >
> > > Wendy,
> > >
> > > It has been a few years since I looked at RPCRDMA, but I seem to
> > > remember that RPCs were limited to 32KB, which means that you have to
> > > pipeline them to get line rate. In addition to requiring pipelining, the
> > > argument from the authors was that the goal was to maximize server
> > > performance and not single client performance.
> > >
> 
> What I see is that performance increases almost linearly up to block size 256K and falls a little at block size 512K.
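The pipelining point can be made concrete with Little's law: the number of bytes that must be in flight equals bandwidth times round-trip time. The sketch below takes the 32KB RPC limit as recalled above (it is a recollection, not a verified figure) and uses a purely hypothetical 50 microsecond per-RPC round trip, since no latency was measured in this thread. The function names are invented for illustration.

```python
import math


def rpcs_in_flight(target_bytes_per_s, rpc_bytes, round_trip_s):
    """Little's law: bytes in flight = bandwidth * latency, in units of RPCs."""
    return math.ceil(target_bytes_per_s * round_trip_s / rpc_bytes)


def single_rpc_throughput(rpc_bytes, round_trip_s):
    """Throughput in bytes/sec if only one RPC is outstanding at a time."""
    return rpc_bytes / round_trip_s


RPC_BYTES = 32 * 1024   # RPC payload limit as recalled in the thread
ROUND_TRIP_S = 50e-6    # hypothetical round trip per RPC, NOT a measured value
TARGET_B_S = 4.3e9      # lower end of the quoted ib_send_bw range

needed = rpcs_in_flight(TARGET_B_S, RPC_BYTES, ROUND_TRIP_S)
serial = single_rpc_throughput(RPC_BYTES, ROUND_TRIP_S)

print(f"one RPC at a time: about {serial / 1e6:.0f} MB/sec")
print(f"RPCs in flight needed for {TARGET_B_S / 1e9:.1f} GB/sec: {needed}")
```

Under those assumed numbers a single outstanding RPC cannot get near line rate, and several have to be kept in flight. A different round-trip time changes the count proportionally.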
> 
> > > Scott
> > >
> > 
> > That (client count) brings up a good point ...
> > 
> > FIO is really not a good benchmark for NFS. Does anyone have SPECsfs
> > numbers on NFS over RDMA to share ?
> > 
> > -- Wendy
> 
> What do you suggest for benchmarking NFS?
> 
> Yan
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
