From: "J. Bruce Fields" <bfields@fieldses.org>
To: Yan Burman <yanb@mellanox.com>
Cc: Wendy Cheng <s.wendy.cheng@gmail.com>,
	"Atchley, Scott" <atchleyes@ornl.gov>,
	Tom Tucker <tom@opengridcomputing.com>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	Or Gerlitz <ogerlitz@mellanox.com>
Subject: Re: NFS over RDMA benchmark
Date: Sun, 28 Apr 2013 10:42:48 -0400
Message-ID: <20130428144248.GA2037@fieldses.org>
In-Reply-To: <0EE9A1CDC8D6434DB00095CD7DB873462CF9A820@MTLDAG01.mtl.com>

On Sun, Apr 28, 2013 at 06:28:16AM +0000, Yan Burman wrote:
> > > > > > > >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
> > > > > > > >> <yanb@mellanox.com>
> > > > > > > >>> I've been trying to do some benchmarks for NFS over RDMA
> > > > > > > >>> and I seem to only get about half of the bandwidth that
> > > > > > > >>> the HW can give me.
> > > > > > > >>> My setup consists of 2 servers each with 16 cores, 32Gb of
> > > > > > > >>> memory, and Mellanox ConnectX3 QDR card over PCI-e gen3.
> > > > > > > >>> These servers are connected to a QDR IB switch. The
> > > > > > > >>> backing storage on the server is tmpfs mounted with noatime.
> > > > > > > >>> I am running kernel 3.5.7.
> > > > > > > >>>
> > > > > > > >>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block
> > > > > > > >>> sizes 4-512K.
> > > > > > > >>> When I run fio over rdma mounted nfs, I get 260-2200MB/sec
> > > > > > > >>> for the same block sizes (4-512K). running over IPoIB-CM,
> > > > > > > >>> I get 200-980MB/sec.
...
> > > > > > I am trying to get maximum performance from a single server - I
> > > > > > used 2 processes in fio test - more than 2 did not show any
> > > > > > performance boost.
> > > > > > I tried running fio from 2 different PCs on 2 different files,
> > > > > > but the sum of the two is more or less the same as running from
> > > > > > single client PC.
> > > > > >
> > > > > > What I did see is that server is sweating a lot more than the
> > > > > > clients and more than that, it has 1 core (CPU5) in 100% softirq
> > > > > > tasklet:
> > > > > > cat /proc/softirqs
...
> > > > Perf top for the CPU with high tasklet count gives:
> > > >
> > > >              samples  pcnt         RIP        function                    DSO
...
> > > >              2787.00 24.1% ffffffff81062a00 mutex_spin_on_owner         /root/vmlinux
...
> > Googling around....  I think we want:
> > 
> > 	perf record -a --call-graph
> > 	(give it a chance to collect some samples, then ^C)
> > 	perf report --call-graph --stdio
> > 
> 
> Sorry it took me a while to get perf to show the call trace (did not enable frame pointers in kernel and struggled with perf options...), but what I get is:
>     36.18%          nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
>                     |
>                     --- mutex_spin_on_owner
>                        |
>                        |--99.99%-- __mutex_lock_slowpath
>                        |          mutex_lock
>                        |          |
>                        |          |--85.30%-- generic_file_aio_write

That's the inode i_mutex.

>                        |          |          do_sync_readv_writev
>                        |          |          do_readv_writev
>                        |          |          vfs_writev
>                        |          |          nfsd_vfs_write
>                        |          |          nfsd_write
>                        |          |          nfsd3_proc_write
>                        |          |          nfsd_dispatch
>                        |          |          svc_process_common
>                        |          |          svc_process
>                        |          |          nfsd
>                        |          |          kthread
>                        |          |          kernel_thread_helper
>                        |          |
>                        |           --14.70%-- svc_send

That's the xpt_mutex (ensuring rpc replies aren't interleaved).
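
To make the interleaving concern concrete, here is a toy user-space
model in Python. It is only a sketch, not the sunrpc code: the
Transport class, its send_lock, and the helper names are hypothetical
stand-ins, with send_lock playing the role of xpt_mutex. Each sender
holds the lock for a whole reply, so the pieces of one reply never mix
with another's on the same connection.

```python
import threading


class Transport:
    """Hypothetical stand-in for one client connection (not svc_xprt)."""

    def __init__(self):
        self.wire = []                      # pieces in the order they were sent
        self.send_lock = threading.Lock()   # plays the role of xpt_mutex

    def send_reply(self, xid, npieces):
        # Hold the lock for the whole reply so its pieces stay contiguous.
        with self.send_lock:
            for i in range(npieces):
                self.wire.append((xid, i))


def run(nthreads=4, npieces=100):
    """Send one reply per thread over a single shared transport."""
    xprt = Transport()
    workers = [threading.Thread(target=xprt.send_reply, args=(xid, npieces))
               for xid in range(nthreads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return xprt.wire


def is_contiguous(wire, npieces):
    """True if every reply occupies one unbroken, in-order run of pieces."""
    for start in range(0, len(wire), npieces):
        chunk = wire[start:start + npieces]
        xid = chunk[0][0]
        if chunk != [(xid, i) for i in range(npieces)]:
            return False
    return True
```

The price of that guarantee is that every thread replying on the same
connection queues behind one lock, which would be consistent with the
svc_send branch showing up under mutex_spin_on_owner.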

>                        |                     svc_process
>                        |                     nfsd
>                        |                     kthread
>                        |                     kernel_thread_helper
>                         --0.01%-- [...]
> 
>      9.63%          nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
>                     |
>                     --- _raw_spin_lock_irqsave
>                        |
>                        |--43.97%-- alloc_iova

And that (and __free_iova below) looks like iova_rbtree_lock.
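
For a rough sense of scale, the nested percentages can be multiplied
out. The sketch below assumes perf printed the call graph in its
relative ("fractal") mode, where each branch is a share of its parent
rather than of all samples; if the report was generated in absolute
mode the multiplication does not apply. The absolute_share helper and
the dictionary labels are mine, the figures are copied from the trace.

```python
def absolute_share(*percentages):
    """Multiply a chain of relative percentages into one absolute percentage."""
    share = 100.0
    for pct in percentages:
        share *= pct / 100.0
    return share


# Top-level entries from the quoted report (share of all samples).
spin = 36.18       # mutex_spin_on_owner
irqsave = 9.63     # _raw_spin_lock_irqsave

breakdown = {
    "i_mutex (generic_file_aio_write)": absolute_share(spin, 99.99, 85.30),
    "xpt_mutex (svc_send)": absolute_share(spin, 99.99, 14.70),
    "iova_rbtree_lock (alloc_iova)": absolute_share(irqsave, 43.97),
    "iova_rbtree_lock (__free_iova)": absolute_share(irqsave, 37.52),
}

for name, pct in breakdown.items():
    print(f"{name:36s} {pct:5.2f}%")
```

Under that assumption, spinning on i_mutex comes to roughly 31% of all
samples, xpt_mutex to about 5%, and the two iova paths together to
just under 8%.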

--b.

>                        |          intel_alloc_iova
>                        |          __intel_map_single
>                        |          intel_map_page
>                        |          |
>                        |          |--60.47%-- svc_rdma_sendto
>                        |          |          svc_send
>                        |          |          svc_process
>                        |          |          nfsd
>                        |          |          kthread
>                        |          |          kernel_thread_helper
>                        |          |
>                        |          |--30.10%-- rdma_read_xdr
>                        |          |          svc_rdma_recvfrom
>                        |          |          svc_recv
>                        |          |          nfsd
>                        |          |          kthread
>                        |          |          kernel_thread_helper
>                        |          |
>                        |          |--6.69%-- svc_rdma_post_recv
>                        |          |          send_reply
>                        |          |          svc_rdma_sendto
>                        |          |          svc_send
>                        |          |          svc_process
>                        |          |          nfsd
>                        |          |          kthread
>                        |          |          kernel_thread_helper
>                        |          |
>                        |           --2.74%-- send_reply
>                        |                     svc_rdma_sendto
>                        |                     svc_send
>                        |                     svc_process
>                        |                     nfsd
>                        |                     kthread
>                        |                     kernel_thread_helper
>                        |
>                        |--37.52%-- __free_iova
>                        |          flush_unmaps
>                        |          add_unmap
>                        |          intel_unmap_page
>                        |          |
>                        |          |--97.18%-- svc_rdma_put_frmr
>                        |          |          sq_cq_reap
>                        |          |          dto_tasklet_func
>                        |          |          tasklet_action
>                        |          |          __do_softirq
>                        |          |          call_softirq
>                        |          |          do_softirq
>                        |          |          |
>                        |          |          |--97.40%-- irq_exit
>                        |          |          |          |
>                        |          |          |          |--99.85%-- do_IRQ
>                        |          |          |          |          ret_from_intr
>                        |          |          |          |          |
>                        |          |          |          |          |--40.74%-- generic_file_buffered_write
>                        |          |          |          |          |          __generic_file_aio_write
>                        |          |          |          |          |          generic_file_aio_write
>                        |          |          |          |          |          do_sync_readv_writev
>                        |          |          |          |          |          do_readv_writev
>                        |          |          |          |          |          vfs_writev
>                        |          |          |          |          |          nfsd_vfs_write
>                        |          |          |          |          |          nfsd_write
>                        |          |          |          |          |          nfsd3_proc_write
>                        |          |          |          |          |          nfsd_dispatch
>                        |          |          |          |          |          svc_process_common
>                        |          |          |          |          |          svc_process
>                        |          |          |          |          |          nfsd
>                        |          |          |          |          |          kthread
>                        |          |          |          |          |          kernel_thread_helper
>                        |          |          |          |          |
>                        |          |          |          |          |--25.21%-- __mutex_lock_slowpath
>                        |          |          |          |          |          mutex_lock
>                        |          |          |          |          |          |
>                        |          |          |          |          |          |--94.84%-- generic_file_aio_write
>                        |          |          |          |          |          |          do_sync_readv_writev
>                        |          |          |          |          |          |          do_readv_writev
>                        |          |          |          |          |          |          vfs_writev
>                        |          |          |          |          |          |          nfsd_vfs_write
>                        |          |          |          |          |          |          nfsd_write
>                        |          |          |          |          |          |          nfsd3_proc_write
>                        |          |          |          |          |          |          nfsd_dispatch
>                        |          |          |          |          |          |          svc_process_common
>                        |          |          |          |          |          |          svc_process
>                        |          |          |          |          |          |          nfsd
>                        |          |          |          |          |          |          kthread
>                        |          |          |          |          |          |          kernel_thread_helper
>                        |          |          |          |          |          |
> 
> The entire trace is almost 1MB, so send me an off-list message if you want it.
> 
> Yan
> 
