* [Lustre-devel] COS performance issues
@ 2008-10-08 11:44 Alexander Zarochentsev
2008-10-08 11:48 ` Alex Zhuravlev
2008-10-11 16:00 ` Andreas Dilger
0 siblings, 2 replies; 13+ messages in thread
From: Alexander Zarochentsev @ 2008-10-08 11:44 UTC (permalink / raw)
To: lustre-devel
sending to lustre-recovery@ as well
Hello,
COS performance testing has uncovered performance problems in tests that
were not expected to be problematic.
The tests are:
(3a) parallel file creation from several clients, each client creates files in own directory.
(3b) parallel file creation from several clients in one shared directory.
The test results are published in Bugzilla bug 17235. The slowdown depends on hardware; on the test clusters
Washie1 and Washie3 it is quite large: the difference between COS=0 and COS=1 was about 20%
initially, and setting PTLDEBUG=0 reduces the slowdown to 12-13% (the COS patch adds debugging
instrumentation such as ENTRY/EXIT to ptlrpc routines).
The COS code finds no inter-client dependencies and adds no extra commits
during these tests: there are about 60 commits in each 300-second test
with both COS=0 and COS=1, i.e. one ldiskfs commit every 5 seconds.
However, the oprofile results are terrible for COS=1.
COS code profiling results:
== COS=1 ==
376924 4.5556 ptlrpc.ko ptlrpc ptlrpc_server_handle_reply
288668 3.4889 vmlinux vmlinux .text.task_rq_lock
281528 3.4026 ptlrpc.ko ptlrpc ldlm_resource_find
180107 2.1768 vmlinux vmlinux .text.__wake_up_common
173166 2.0929 obdclass.ko obdclass htable_lookup
166829 2.0163 e1000e.ko e1000e e1000_irq_enable
150765 1.8222 vmlinux vmlinux schedule
127059 1.5357 e1000e.ko e1000e e1000_intr_msi
119343 1.4424 ptlrpc.ko ptlrpc target_send_reply
118733 1.4350 lvfs.ko lvfs lprocfs_counter_add
99568 1.2034 vmlinux vmlinux __find_get_block
93356 1.1283 vmlinux vmlinux __switch_to
91259 1.1030 oprofiled oprofiled (no symbols)
91147 1.1016 vmlinux vmlinux try_to_wake_up
89665 1.0837 vmlinux vmlinux kfree
87774 1.0609 vmlinux vmlinux module_refcount
85149 1.0291 vmlinux vmlinux memset
84822 1.0252 vmlinux vmlinux mwait_idle
78850 0.9530 vmlinux vmlinux __kmalloc
70324 0.8500 ptlrpc.ko ptlrpc ptlrpc_main
69333 0.8380 vmlinux vmlinux memmove
== COS=0 ==
237314 2.7642 e1000e.ko e1000e e1000_irq_enable
211304 2.4613 obdclass.ko obdclass htable_lookup
197515 2.3007 vmlinux vmlinux schedule
180431 2.1017 e1000e.ko e1000e e1000_intr_msi
128113 1.4923 vmlinux vmlinux __switch_to
127683 1.4873 lvfs.ko lvfs lprocfs_counter_add
122777 1.4301 vmlinux vmlinux kfree
122584 1.4279 vmlinux vmlinux module_refcount
116800 1.3605 vmlinux vmlinux __find_get_block
110418 1.2862 vmlinux vmlinux mwait_idle
101909 1.1870 vmlinux vmlinux memset
101100 1.1776 oprofiled oprofiled (no symbols)
96397 1.1228 vmlinux vmlinux __kmalloc
91745 1.0686 ptlrpc.ko ptlrpc ptlrpc_main
80605 0.9389 ksocklnd.ko ksocklnd ksocknal_scheduler
79782 0.9293 vmlinux vmlinux kmem_cache_free
76968 0.8965 mds.ko mds mds_lov_dump_objids
75094 0.8747 vmlinux vmlinux .text.task_rq_lock
75087 0.8746 vmlinux vmlinux memmove
72627 0.8460 vmlinux vmlinux find_get_page
67242 0.7832 vmlinux vmlinux kmem_cache_alloc
== event distribution in ptlrpc_server_handle_reply, the most CPU-consuming function ==
CPU: P4 / Xeon, speed 2800.2 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 100000
vma samples % image name app name symbol name
00000000000498f0 376924 4.5556 ptlrpc.ko ptlrpc ptlrpc_server_handle_reply
00000000000498f0 1 2.7e-04
00000000000498f4 10 0.0027
0000000000049905 3 8.0e-04
0000000000049912 22 0.0058
0000000000049972 1 2.7e-04
0000000000049977 658 0.1746
000000000004997e 24 0.0064
0000000000049a23 5 0.0013
0000000000049a35 21 0.0056
0000000000049a80 1 2.7e-04
0000000000049a85 4 0.0011
0000000000049a92 2 5.3e-04
0000000000049a99 22 0.0058
0000000000049aa9 1 2.7e-04
0000000000049ab5 371741 98.6249
... ^^^^^^ spin_lock (&obd->obd_uncommitted_replies_lock);
I have a patch that avoids the use of obd_uncommitted_replies_lock
in ptlrpc_server_handle_reply, but it has minimal effect:
ptlrpc_server_handle_reply is still the most CPU-consuming function
because of svc->srv_lock contention.
I think the problem is that COS defers processing of replies to transaction commit time.
When a commit happens, the MDS has to process thousands of replies (about 14k replies per
commit in test 3a) in a short period of time. I guess the mdt service threads are
all woken up and spin trying to take the service srv_lock. Processing of new requests may
also suffer from this.
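The deferral described above can be modeled in miniature. The following is a toy user-space sketch in Python, not the actual ptlrpc code (all names are hypothetical): replies accumulate on an uncommitted list while transactions are open, and the commit callback releases the whole batch at once, re-taking the single service lock for every reply.

```python
import threading

class Service:
    """Toy model of COS-style deferred reply processing (hypothetical names)."""
    def __init__(self):
        self.lock = threading.Lock()   # stands in for svc->srv_lock
        self.uncommitted = []          # reply states parked until commit
        self.processed = 0

    def queue_reply(self, reply):
        # Under COS a reply's state is not finalized when the reply is
        # sent; it is parked until the transaction it depends on commits.
        with self.lock:
            self.uncommitted.append(reply)

    def commit_callback(self):
        # On commit, *all* parked replies become processable at once --
        # thousands of them in the tests described in this thread.
        with self.lock:
            batch, self.uncommitted = self.uncommitted, []
        for reply in batch:
            with self.lock:            # every reply re-takes the one
                self.processed += 1    # service-wide lock

svc = Service()
for i in range(14000):                 # ~14k replies per commit (test 3a)
    svc.queue_reply(i)
svc.commit_callback()
print(svc.processed)                   # -> 14000
```

With COS=0 each reply would instead be finalized as its repACK arrives, so the lock is taken at a steady rate rather than 14k times in a single burst.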
I ran the tests with CONFIG_DEBUG_SPINLOCK_SLEEP debugging compiled into the kernel;
it found no sleep-under-spinlock bugs.
Further optimizations may include:
1. per-reply spin locks.
2. per-CPU structures and threads to process reply queues.
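Option 1 could look roughly like the following user-space Python sketch (hypothetical names, not the real kernel patch): give each reply state its own lock, so finalizing one reply no longer serializes against every other reply on the service.

```python
import threading

class ReplyState:
    """One reply's state, protected by its own lock (proposal 1)."""
    def __init__(self, xid):
        self.lock = threading.Lock()   # per-reply lock instead of srv_lock
        self.xid = xid
        self.done = False

    def finalize(self):
        with self.lock:                # touches only this reply's state
            self.done = True

def commit_callback(replies, nthreads=4):
    # Worker threads can now finalize disjoint replies concurrently
    # instead of convoying on a single service-wide spinlock.
    def worker(chunk):
        for r in chunk:
            r.finalize()
    threads = [threading.Thread(target=worker, args=(replies[i::nthreads],))
               for i in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

replies = [ReplyState(x) for x in range(1000)]
commit_callback(replies)
print(sum(r.done for r in replies))    # -> 1000
```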
Any comments?
Thanks.
PS. The test results are much better when the MDS server is the sata20 machine with 4 cores
(the MDS from Washie1 has 2 cores); COS=0 and COS=1 differ by only 3%:
COS=1
Rate: 3101.77 creates/sec (total: 2 threads 930530 creates 300 secs)
Rate: 3096.94 creates/sec (total: 2 threads 929083 creates 300 secs)
COS=0
Rate: 3184.01 creates/sec (total: 2 threads 958388 creates 301 secs)
Rate: 3152.89 creates/sec (total: 2 threads 945868 creates 300 secs)
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Lustre-devel] COS performance issues
2008-10-08 11:44 [Lustre-devel] COS performance issues Alexander Zarochentsev
@ 2008-10-08 11:48 ` Alex Zhuravlev
2008-10-12 18:41 ` Alexander Zarochentsev
2008-10-11 16:00 ` Andreas Dilger
1 sibling, 1 reply; 13+ messages in thread
From: Alex Zhuravlev @ 2008-10-08 11:48 UTC (permalink / raw)
To: lustre-devel
Could you try to profile with a single CPU? You'll probably get an idea of
how the "per-cpu" approach can help.
thanks, Alex
Alexander Zarochentsev wrote:
> I have a patch to avoid using of obd_uncommitted_replies_lock
> in ptlrpc_server_handle_reply but it has minimal effect,
> ptlrpc_server_handle_reply still the most cpu consuming function
> because of svc->srv_lock contention.
>
> I think the problem is that COS defers processing of replies to transaction commit time.
> When commit happens, MDS has to process thousands of replies (about 14k replies per
> commit in the test 3.a) in short period of time. I guess the mdt service threads
> all woken up and spin trying to get the service svr_lock. Processing of new requests may
> also suffer of this.
>
> I ran the tests with with CONFIG_DEBUG_SPINLOCK_SLEEP debugging compiled into a kernel, it found no
> sleep under spinlock bugs.
>
> Further optimization may include
> 1. per-reply spin locks.
> 2. per-cpu structures and threads to process reply queues.
>
> Any comments?
>
> Thanks.
>
> PS. the test results are much better when MDS server is sata20 machine with 4 cores
> (the MDS from Washie1 has 2 cores), COS=0 and COS=1 have only %3 difference:
>
> COS=1
> Rate: 3101.77 creates/sec (total: 2 threads 930530 creates 300 secs)
> Rate: 3096.94 creates/sec (total: 2 threads 929083 creates 300 secs)
>
> COS=0
> Rate: 3184.01 creates/sec (total: 2 threads 958388 creates 301 secs)
> Rate: 3152.89 creates/sec (total: 2 threads 945868 creates 300 secs)
>
* [Lustre-devel] COS performance issues
2008-10-08 11:44 [Lustre-devel] COS performance issues Alexander Zarochentsev
2008-10-08 11:48 ` Alex Zhuravlev
@ 2008-10-11 16:00 ` Andreas Dilger
2008-10-11 20:18 ` Eric Barton
2008-10-12 15:51 ` Alexander Zarochentsev
1 sibling, 2 replies; 13+ messages in thread
From: Andreas Dilger @ 2008-10-11 16:00 UTC (permalink / raw)
To: lustre-devel
On Oct 08, 2008 15:44 +0400, Alexander Zarochentsev wrote:
> I think the problem is that COS defers processing of replies to
> transaction commit time. When commit happens, MDS has to process
> thousands of replies (about 14k replies per commit in the test 3.a)
> in short period of time. I guess the mdt service threads all woken
> up and spin trying to get the service svr_lock. Processing of new
> requests may also suffer of this.
Can you please explain what replies are being blocked? It can't be the
create replies or the clients would be blocked waiting after starting a
single create each.
I think the thread and lock contention is only part of the issue - if all
of these replies are blocked until transaction commit this wastes all of
the bandwidth on the network while the replies are being held.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* [Lustre-devel] COS performance issues
2008-10-11 16:00 ` Andreas Dilger
@ 2008-10-11 20:18 ` Eric Barton
2008-10-12 15:51 ` Alexander Zarochentsev
1 sibling, 0 replies; 13+ messages in thread
From: Eric Barton @ 2008-10-11 20:18 UTC (permalink / raw)
To: lustre-devel
Andreas,
Zam doesn't mean the reply messages being blocked - he means
the reply state that is holding lock references. Previously
they got cleaned up one at a time as the repACKs came in. Now
they get scheduled by the 1000s on the commit callback. I'm
betting that we're getting thundering herds of MDS threads
at this time and that we should have dedicated 1-per-CPU threads
doing the cleanup to minimize interference with normal request
processing. Bzzz thinks it's purely spinlock contention that's
the source of the inefficiency. So Zam is going to run an
experiment with a single CPU MDS to see if the performance issue
goes away. I'm betting it won't and Bzzz is betting it will.
We should see soon...
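The dedicated 1-per-CPU cleanup threads proposed here can be sketched as a toy user-space model (Python; the queue layout and names are hypothetical, not actual Lustre code): the commit callback only partitions the reply states onto per-CPU queues, and each dedicated thread drains its own queue, so the pool of service threads is never woken all at once.

```python
import queue
import threading

NCPUS = 2
cleanup_queues = [queue.Queue() for _ in range(NCPUS)]
done = [0] * NCPUS

def cleanup_thread(cpu):
    # One dedicated thread per CPU drains only its own queue; the common
    # path shares no lock with the other CPUs, so no thundering herd.
    while True:
        reply = cleanup_queues[cpu].get()
        if reply is None:              # shutdown sentinel
            break
        done[cpu] += 1                 # stand-in for reply state cleanup

def commit_callback(replies):
    # The commit callback just partitions replies across the queues;
    # it never wakes the whole pool of service threads.
    for i, r in enumerate(replies):
        cleanup_queues[i % NCPUS].put(r)

workers = [threading.Thread(target=cleanup_thread, args=(c,))
           for c in range(NCPUS)]
for w in workers:
    w.start()
commit_callback(range(10000))          # one commit's worth of replies
for q in cleanup_queues:
    q.put(None)
for w in workers:
    w.join()
print(sum(done))                       # -> 10000
```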
Cheers,
Eric
>
>
> -----Original Message-----
> From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM] On Behalf Of Andreas Dilger
> Sent: 11 October 2008 5:00 PM
> To: Alexander Zarochentsev
> Cc: lustre-devel at lists.lustre.org; lustre-recovery at Sun.COM
> Subject: Re: [Lustre-devel] COS performance issues
>
> On Oct 08, 2008 15:44 +0400, Alexander Zarochentsev wrote:
> > I think the problem is that COS defers processing of replies to
> > transaction commit time. When commit happens, MDS has to process
> > thousands of replies (about 14k replies per commit in the test 3.a)
> > in short period of time. I guess the mdt service threads all woken
> > up and spin trying to get the service svr_lock. Processing of new
> > requests may also suffer of this.
>
> Can you please explain what replies are being blocked? It can't be the
> create replies or the clients would be blocked waiting after starting a
> single create each.
>
> I think the thread and lock contention is only part of the issue - if all
> of these replies are blocked until transaction commit this wastes all of
> the bandwidth on the network while the replies are being held.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
* [Lustre-devel] COS performance issues
2008-10-11 16:00 ` Andreas Dilger
2008-10-11 20:18 ` Eric Barton
@ 2008-10-12 15:51 ` Alexander Zarochentsev
1 sibling, 0 replies; 13+ messages in thread
From: Alexander Zarochentsev @ 2008-10-12 15:51 UTC (permalink / raw)
To: lustre-devel
On 11 October 2008 20:00:10 Andreas Dilger wrote:
> On Oct 08, 2008 15:44 +0400, Alexander Zarochentsev wrote:
> > I think the problem is that COS defers processing of replies to
> > transaction commit time. When commit happens, MDS has to process
> > thousands of replies (about 14k replies per commit in the test 3.a)
> > in short period of time. I guess the mdt service threads all woken
> > up and spin trying to get the service svr_lock. Processing of new
> > requests may also suffer of this.
>
> Can you please explain what replies are being blocked? It can't be
> the create replies or the clients would be blocked waiting after
> starting a single create each.
>
> I think the thread and lock contention is only part of the issue - if
> all of these replies are blocked until transaction commit this wastes
> all of the bandwidth on the network while the replies are being held.
As Eric already explained, the replies themselves are sent immediately, but the
reply state objects remain in memory. Their cleanup is almost completely
serialized by a per-service spinlock. It may be only part of the problem, but it
is the first task to solve.
> Cheers, Andreas
WBR,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
* [Lustre-devel] COS performance issues
2008-10-08 11:48 ` Alex Zhuravlev
@ 2008-10-12 18:41 ` Alexander Zarochentsev
2008-10-12 19:12 ` Alex Zhuravlev
0 siblings, 1 reply; 13+ messages in thread
From: Alexander Zarochentsev @ 2008-10-12 18:41 UTC (permalink / raw)
To: lustre-devel
On 8 October 2008 15:48:50 Alex Zhuravlev wrote:
> try to profile with single CPU? you'll probably get an idea how
> "per-cpu" approach can help.
I booted the MDS server with the maxcpus=1 kernel parameter, and here are the
results:
cos=0
2039.31 creates/sec (total: 2 threads 611794 creates 300 secs)
2037.80 creates/sec (total: 2 threads 611341 creates 300 secs)
2076.21 creates/sec (total: 2 threads 622864 creates 300 secs)
cos=1
1874.93 creates/sec (total: 2 threads 564354 creates 301 secs)
1923.97 creates/sec (total: 2 threads 577191 creates 300 secs)
1892.61 creates/sec (total: 2 threads 567783 creates 300 secs)
1874.74 creates/sec (total: 2 threads 562421 creates 300 secs)
Unfortunately, profiling info isn't available yet: the results were obtained
with SLES10, which can boot with maxcpus=1 but has no oprofile
installed.
> Alexander Zarochentsev wrote:
> > I have a patch to avoid using of obd_uncommitted_replies_lock
> > in ptlrpc_server_handle_reply but it has minimal effect,
> > ptlrpc_server_handle_reply still the most cpu consuming function
> > because of svc->srv_lock contention.
> >
> > I think the problem is that COS defers processing of replies to
> > transaction commit time. When commit happens, MDS has to process
> > thousands of replies (about 14k replies per commit in the test 3.a)
> > in short period of time. I guess the mdt service threads all woken
> > up and spin trying to get the service svr_lock. Processing of new
> > requests may also suffer of this.
> >
> > I ran the tests with with CONFIG_DEBUG_SPINLOCK_SLEEP debugging
> > compiled into a kernel, it found no sleep under spinlock bugs.
> >
> > Further optimization may include
> > 1. per-reply spin locks.
> > 2. per-cpu structures and threads to process reply queues.
> >
> > Any comments?
> >
> > Thanks.
> >
> > PS. the test results are much better when MDS server is sata20
> > machine with 4 cores (the MDS from Washie1 has 2 cores), COS=0 and
> > COS=1 have only %3 difference:
> >
> > COS=1
> > Rate: 3101.77 creates/sec (total: 2 threads 930530 creates 300
> > secs) Rate: 3096.94 creates/sec (total: 2 threads 929083 creates
> > 300 secs)
> >
> > COS=0
> > Rate: 3184.01 creates/sec (total: 2 threads 958388 creates 301
> > secs) Rate: 3152.89 creates/sec (total: 2 threads 945868 creates
> > 300 secs)
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
* [Lustre-devel] COS performance issues
2008-10-12 18:41 ` Alexander Zarochentsev
@ 2008-10-12 19:12 ` Alex Zhuravlev
2008-10-13 14:36 ` Alexander Zarochentsev
0 siblings, 1 reply; 13+ messages in thread
From: Alex Zhuravlev @ 2008-10-12 19:12 UTC (permalink / raw)
To: lustre-devel
It would be good to look at the profiles, since the next hottest function was ldlm_resource_get().
thanks, Alex
Alexander Zarochentsev wrote:
> On 8 October 2008 15:48:50 Alex Zhuravlev wrote:
>> try to profile with single CPU? you'll probably get an idea how
>> "per-cpu" approach can help.
>
> I booted the MDS server with maxcpus=1 kernel parameter and here are the
> results:
>
> cos=0
> 2039.31 creates/sec (total: 2 threads 611794 creates 300 secs)
> 2037.80 creates/sec (total: 2 threads 611341 creates 300 secs)
> 2076.21 creates/sec (total: 2 threads 622864 creates 300 secs)
>
> cos=1
> 1874.93 creates/sec (total: 2 threads 564354 creates 301 secs)
> 1923.97 creates/sec (total: 2 threads 577191 creates 300 secs)
> 1892.61 creates/sec (total: 2 threads 567783 creates 300 secs)
> 1874.74 creates/sec (total: 2 threads 562421 creates 300 secs)
>
> unfortunately profiling info isn't available yet, the results are done
> with SLES10 which can boot with maxcpus=1 but has no oprofile
> installed.
>
>> Alexander Zarochentsev wrote:
>>> I have a patch to avoid using of obd_uncommitted_replies_lock
>>> in ptlrpc_server_handle_reply but it has minimal effect,
>>> ptlrpc_server_handle_reply still the most cpu consuming function
>>> because of svc->srv_lock contention.
>>>
>>> I think the problem is that COS defers processing of replies to
>>> transaction commit time. When commit happens, MDS has to process
>>> thousands of replies (about 14k replies per commit in the test 3.a)
>>> in short period of time. I guess the mdt service threads all woken
>>> up and spin trying to get the service svr_lock. Processing of new
>>> requests may also suffer of this.
>>>
>>> I ran the tests with with CONFIG_DEBUG_SPINLOCK_SLEEP debugging
>>> compiled into a kernel, it found no sleep under spinlock bugs.
>>>
>>> Further optimization may include
>>> 1. per-reply spin locks.
>>> 2. per-cpu structures and threads to process reply queues.
>>>
>>> Any comments?
>>>
>>> Thanks.
>>>
>>> PS. the test results are much better when MDS server is sata20
>>> machine with 4 cores (the MDS from Washie1 has 2 cores), COS=0 and
>>> COS=1 have only %3 difference:
>>>
>>> COS=1
>>> Rate: 3101.77 creates/sec (total: 2 threads 930530 creates 300
>>> secs) Rate: 3096.94 creates/sec (total: 2 threads 929083 creates
>>> 300 secs)
>>>
>>> COS=0
>>> Rate: 3184.01 creates/sec (total: 2 threads 958388 creates 301
>>> secs) Rate: 3152.89 creates/sec (total: 2 threads 945868 creates
>>> 300 secs)
>
* [Lustre-devel] COS performance issues
2008-10-12 19:12 ` Alex Zhuravlev
@ 2008-10-13 14:36 ` Alexander Zarochentsev
2008-10-13 15:04 ` Alex Zhuravlev
0 siblings, 1 reply; 13+ messages in thread
From: Alexander Zarochentsev @ 2008-10-13 14:36 UTC (permalink / raw)
To: lustre-devel
On 12 October 2008 23:12:10 Alex Zhuravlev wrote:
> would be good to look at profiles as the next one was
> ldlm_resource_get()
Alex, look: ptlrpc_server_handle_reply is gone:
CPU: P4 / Xeon, speed 2800.25 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 100000
samples % image name app name symbol name
312318 4.7857 vmlinux vmlinux try_to_wake_up
259690 3.9792 ptlrpc.ko ptlrpc ldlm_resource_find
175916 2.6956 vmlinux vmlinux __wake_up_common
160912 2.4657 e1000.ko e1000 e1000_irq_enable
157031 2.4062 obdclass.ko obdclass htable_lookup
134309 2.0580 e1000.ko e1000 e1000_intr_msi
121261 1.8581 lvfs.ko lvfs lprocfs_counter_add
87613 1.3425 vmlinux vmlinux __find_get_block
87562 1.3417 oprofiled oprofiled (no symbols)
77926 1.1941 vmlinux vmlinux memset
74561 1.1425 vmlinux vmlinux __kmalloc
70952 1.0872 vmlinux vmlinux __switch_to
67288 1.0311 mds.ko mds mds_lov_dump_objids
67085 1.0279 vmlinux vmlinux memmove
66521 1.0193 vmlinux vmlinux kfree
55112 0.8445 ptlrpc.ko ptlrpc ptlrpc_main
54357 0.8329 vmlinux vmlinux schedule
53501 0.8198 vmlinux vmlinux find_get_page
> thanks, Alex
>
> Alexander Zarochentsev wrote:
> > On 8 October 2008 15:48:50 Alex Zhuravlev wrote:
> >> try to profile with single CPU? you'll probably get an idea how
> >> "per-cpu" approach can help.
> >
> > I booted the MDS server with maxcpus=1 kernel parameter and here
> > are the results:
> >
> > cos=0
> > 2039.31 creates/sec (total: 2 threads 611794 creates 300 secs)
> > 2037.80 creates/sec (total: 2 threads 611341 creates 300 secs)
> > 2076.21 creates/sec (total: 2 threads 622864 creates 300 secs)
> >
> > cos=1
> > 1874.93 creates/sec (total: 2 threads 564354 creates 301 secs)
> > 1923.97 creates/sec (total: 2 threads 577191 creates 300 secs)
> > 1892.61 creates/sec (total: 2 threads 567783 creates 300 secs)
> > 1874.74 creates/sec (total: 2 threads 562421 creates 300 secs)
> >
> > unfortunately profiling info isn't available yet, the results are
> > done with SLES10 which can boot with maxcpus=1 but has no oprofile
> > installed.
> >
> >> Alexander Zarochentsev wrote:
> >>> I have a patch to avoid using of obd_uncommitted_replies_lock
> >>> in ptlrpc_server_handle_reply but it has minimal effect,
> >>> ptlrpc_server_handle_reply still the most cpu consuming function
> >>> because of svc->srv_lock contention.
> >>>
> >>> I think the problem is that COS defers processing of replies to
> >>> transaction commit time. When commit happens, MDS has to process
> >>> thousands of replies (about 14k replies per commit in the test
> >>> 3.a) in short period of time. I guess the mdt service threads all
> >>> woken up and spin trying to get the service svr_lock. Processing
> >>> of new requests may also suffer of this.
> >>>
> >>> I ran the tests with with CONFIG_DEBUG_SPINLOCK_SLEEP debugging
> >>> compiled into a kernel, it found no sleep under spinlock bugs.
> >>>
> >>> Further optimization may include
> >>> 1. per-reply spin locks.
> >>> 2. per-cpu structures and threads to process reply queues.
> >>>
> >>> Any comments?
> >>>
> >>> Thanks.
> >>>
> >>> PS. the test results are much better when MDS server is sata20
> >>> machine with 4 cores (the MDS from Washie1 has 2 cores), COS=0
> >>> and COS=1 have only %3 difference:
> >>>
> >>> COS=1
> >>> Rate: 3101.77 creates/sec (total: 2 threads 930530 creates 300
> >>> secs) Rate: 3096.94 creates/sec (total: 2 threads 929083 creates
> >>> 300 secs)
> >>>
> >>> COS=0
> >>> Rate: 3184.01 creates/sec (total: 2 threads 958388 creates 301
> >>> secs) Rate: 3152.89 creates/sec (total: 2 threads 945868 creates
> >>> 300 secs)
> >>
> >> _______________________________________________
> >> Lustre-devel mailing list
> >> Lustre-devel at lists.lustre.org
> >> http://lists.lustre.org/mailman/listinfo/lustre-devel
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
* [Lustre-devel] COS performance issues
2008-10-13 14:36 ` Alexander Zarochentsev
@ 2008-10-13 15:04 ` Alex Zhuravlev
2008-10-17 8:07 ` Alexander Zarochentsev
0 siblings, 1 reply; 13+ messages in thread
From: Alex Zhuravlev @ 2008-10-13 15:04 UTC (permalink / raw)
To: lustre-devel
Cool! Do you have a profile with COS disabled? How is it different?
I guess ldlm_resource_find() is a result of COS, since we've now got 10K times
more locks and resources, but what about try_to_wake_up() and __wake_up_common()?
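The "10K times more locks and resources" guess can be illustrated with a toy comparison count (a hypothetical model, not the actual ldlm code): if lock references are pinned until commit, a linear scan such as the one behind ldlm_resource_find has proportionally more entries to walk per lookup.

```python
def find_lock(granted_list, wanted):
    # Linear scan, as in a list-based lock lookup; returns how many
    # entries were examined before the match.
    comparisons = 0
    for lk in granted_list:
        comparisons += 1
        if lk == wanted:
            break
    return comparisons

# Without COS: lock references are dropped as repACKs arrive, so the
# scanned list stays short.
short_list = list(range(10))
# With COS: references are pinned until commit; here the list is 1000x
# longer (an arbitrary illustrative factor).
long_list = list(range(10_000))

cost_cos0 = sum(find_lock(short_list, w) for w in short_list)
cost_cos1 = sum(find_lock(long_list, w) for w in long_list)
print(cost_cos0, cost_cos1)   # total scan cost grows with list length
```

The point is only that the scan cost grows with the number of live lock references, which is exactly what deferring cleanup to commit time inflates.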
thanks, Alex
Alexander Zarochentsev wrote:
> On 12 October 2008 23:12:10 Alex Zhuravlev wrote:
>> would be good to look at profiles as the next one was
>> ldlm_resource_get()
>
> Alex, look, ptlrpc_server_handle_reply has gone:
>
> CPU: P4 / Xeon, speed 2800.25 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped
> with a unit mask of 0x01 (mandatory) count
> 100000
> samples % image name app name symbol name
> 312318 4.7857 vmlinux vmlinux try_to_wake_up
> 259690 3.9792 ptlrpc.ko ptlrpc ldlm_resource_find
> 175916 2.6956 vmlinux vmlinux __wake_up_common
> 160912 2.4657 e1000.ko e1000 e1000_irq_enable
> 157031 2.4062 obdclass.ko obdclass htable_lookup
> 134309 2.0580 e1000.ko e1000 e1000_intr_msi
> 121261 1.8581 lvfs.ko lvfs lprocfs_counter_add
> 87613 1.3425 vmlinux vmlinux __find_get_block
> 87562 1.3417 oprofiled oprofiled (no symbols)
> 77926 1.1941 vmlinux vmlinux memset
> 74561 1.1425 vmlinux vmlinux __kmalloc
> 70952 1.0872 vmlinux vmlinux __switch_to
> 67288 1.0311 mds.ko mds mds_lov_dump_objids
> 67085 1.0279 vmlinux vmlinux memmove
> 66521 1.0193 vmlinux vmlinux kfree
> 55112 0.8445 ptlrpc.ko ptlrpc ptlrpc_main
> 54357 0.8329 vmlinux vmlinux schedule
> 53501 0.8198 vmlinux vmlinux find_get_page
>
>> thanks, Alex
>>
>> Alexander Zarochentsev wrote:
>>> On 8 October 2008 15:48:50 Alex Zhuravlev wrote:
>>>> try to profile with single CPU? you'll probably get an idea how
>>>> "per-cpu" approach can help.
>>> I booted the MDS server with maxcpus=1 kernel parameter and here
>>> are the results:
>>>
>>> cos=0
>>> 2039.31 creates/sec (total: 2 threads 611794 creates 300 secs)
>>> 2037.80 creates/sec (total: 2 threads 611341 creates 300 secs)
>>> 2076.21 creates/sec (total: 2 threads 622864 creates 300 secs)
>>>
>>> cos=1
>>> 1874.93 creates/sec (total: 2 threads 564354 creates 301 secs)
>>> 1923.97 creates/sec (total: 2 threads 577191 creates 300 secs)
>>> 1892.61 creates/sec (total: 2 threads 567783 creates 300 secs)
>>> 1874.74 creates/sec (total: 2 threads 562421 creates 300 secs)
>>>
>>> unfortunately profiling info isn't available yet, the results are
>>> done with SLES10 which can boot with maxcpus=1 but has no oprofile
>>> installed.
>>>
>>>> Alexander Zarochentsev wrote:
>>>>> I have a patch to avoid using of obd_uncommitted_replies_lock
>>>>> in ptlrpc_server_handle_reply but it has minimal effect,
>>>>> ptlrpc_server_handle_reply still the most cpu consuming function
>>>>> because of svc->srv_lock contention.
>>>>>
>>>>> I think the problem is that COS defers processing of replies to
>>>>> transaction commit time. When commit happens, MDS has to process
>>>>> thousands of replies (about 14k replies per commit in the test
>>>>> 3.a) in short period of time. I guess the mdt service threads all
>>>>> woken up and spin trying to get the service svr_lock. Processing
>>>>> of new requests may also suffer of this.
>>>>>
>>>>> I ran the tests with with CONFIG_DEBUG_SPINLOCK_SLEEP debugging
>>>>> compiled into a kernel, it found no sleep under spinlock bugs.
>>>>>
>>>>> Further optimization may include
>>>>> 1. per-reply spin locks.
>>>>> 2. per-cpu structures and threads to process reply queues.
>>>>>
>>>>> Any comments?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> PS. the test results are much better when MDS server is sata20
>>>>> machine with 4 cores (the MDS from Washie1 has 2 cores), COS=0
>>>>> and COS=1 have only %3 difference:
>>>>>
>>>>> COS=1
>>>>> Rate: 3101.77 creates/sec (total: 2 threads 930530 creates 300 secs)
>>>>> Rate: 3096.94 creates/sec (total: 2 threads 929083 creates 300 secs)
>>>>>
>>>>> COS=0
>>>>> Rate: 3184.01 creates/sec (total: 2 threads 958388 creates 301 secs)
>>>>> Rate: 3152.89 creates/sec (total: 2 threads 945868 creates 300 secs)
>>>> _______________________________________________
>>>> Lustre-devel mailing list
>>>> Lustre-devel at lists.lustre.org
>>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Lustre-devel] COS performance issues
2008-10-13 15:04 ` Alex Zhuravlev
@ 2008-10-17 8:07 ` Alexander Zarochentsev
2008-10-17 20:55 ` Robert Read
0 siblings, 1 reply; 13+ messages in thread
From: Alexander Zarochentsev @ 2008-10-17 8:07 UTC (permalink / raw)
To: lustre-devel
On 13 October 2008 19:04:51 Alex Zhuravlev wrote:
> cool! do you have a profile with COS disabled? how is it different?
> I guess ldlm_resource_find() is a result of COS as now we've got 10K
> times more locks and resources, but what about try_to_wake_up() and
> __wake_up_common() ?
What I got after a number of performance optimizations:
332019 4.4010 ptlrpc.ko ptlrpc ldlm_resource_find
190465 2.5247 obdclass.ko obdclass htable_lookup
175260 2.3231 e1000.ko e1000 e1000_irq_enable
146324 1.9396 vmlinux vmlinux schedule
139799 1.8531 e1000.ko e1000 e1000_intr_msi
121729 1.6136 lvfs.ko lvfs lprocfs_counter_add
105164 1.3940 vmlinux vmlinux __find_get_block
102296 1.3560 vmlinux vmlinux memset
96728 1.2822 vmlinux vmlinux __switch_to
92123 1.2211 vmlinux vmlinux kfree
90847 1.2042 oprofiled oprofiled (no symbols)
84152 1.1155 vmlinux vmlinux mwait_idle
82378 1.0919 vmlinux vmlinux __kmalloc
80368 1.0653 vmlinux vmlinux try_to_wake_up
79110 1.0486 mds.ko mds mds_lov_dump_objids
71730 0.9508 vmlinux vmlinux memmove
ptlrpc_server_handle_reply is gone thanks to the per-reply spin lock,
and try_to_wake_up is gone thanks to the dedicated reply-handling threads;
only ldlm_resource_find and htable_lookup are left, which I hope to fix with the next patch.
The massive create tests show the following results:
COS=1
Rate: 2085.73 creates/sec (total: 2 threads 625719 creates 300 secs)
Rate: 2072.38 creates/sec (total: 2 threads 621713 creates 300 secs)
Rate: 2053.80 creates/sec (total: 2 threads 616141 creates 300 secs)
COS=0
Rate: 2128.22 creates/sec (total: 2 threads 638465 creates 300 secs)
Rate: 2151.05 creates/sec (total: 2 threads 645316 creates 300 secs)
Rate: 2169.91 creates/sec (total: 2 threads 650974 creates 300 secs)
Clean HEAD
Rate: 2199.67 creates/sec (total: 2 threads 659900 creates 300 secs)
Rate: 2213.95 creates/sec (total: 2 threads 664184 creates 300 secs)
Rate: 2186.74 creates/sec (total: 2 threads 658209 creates 301 secs)
Rate: 2206.10 creates/sec (total: 2 threads 661831 creates 300 secs)
Rate: 2217.49 creates/sec (total: 2 threads 665247 creates 300 secs)
Rate: 2230.86 creates/sec (total: 2 threads 669258 creates 300 secs)
WBR,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
* [Lustre-devel] COS performance issues
2008-10-17 8:07 ` Alexander Zarochentsev
@ 2008-10-17 20:55 ` Robert Read
2008-10-18 10:34 ` Alexander Zarochentsev
0 siblings, 1 reply; 13+ messages in thread
From: Robert Read @ 2008-10-17 20:55 UTC (permalink / raw)
To: lustre-devel
On Oct 17, 2008, at 01:07 , Alexander Zarochentsev wrote:
> The massive create tests show the following results:
>
> COS=1
> Rate: 2085.73 creates/sec (total: 2 threads 625719 creates 300 secs)
> Rate: 2072.38 creates/sec (total: 2 threads 621713 creates 300 secs)
> Rate: 2053.80 creates/sec (total: 2 threads 616141 creates 300 secs)
>
> COS=0
> Rate: 2128.22 creates/sec (total: 2 threads 638465 creates 300 secs)
> Rate: 2151.05 creates/sec (total: 2 threads 645316 creates 300 secs)
> Rate: 2169.91 creates/sec (total: 2 threads 650974 creates 300 secs)
>
> Clean HEAD
>
> Rate: 2199.67 creates/sec (total: 2 threads 659900 creates 300 secs)
> Rate: 2213.95 creates/sec (total: 2 threads 664184 creates 300 secs)
> Rate: 2186.74 creates/sec (total: 2 threads 658209 creates 301 secs)
> Rate: 2206.10 creates/sec (total: 2 threads 661831 creates 300 secs)
> Rate: 2217.49 creates/sec (total: 2 threads 665247 creates 300 secs)
> Rate: 2230.86 creates/sec (total: 2 threads 669258 creates 300 secs)
>
Any idea why COS=0 was consistently a bit slower than HEAD?
robert
* [Lustre-devel] COS performance issues
2008-10-17 20:55 ` Robert Read
@ 2008-10-18 10:34 ` Alexander Zarochentsev
0 siblings, 0 replies; 13+ messages in thread
From: Alexander Zarochentsev @ 2008-10-18 10:34 UTC (permalink / raw)
To: lustre-devel
On 18 October 2008 00:55:03 Robert Read wrote:
> On Oct 17, 2008, at 01:07 , Alexander Zarochentsev wrote:
> > The massive create tests show the following results:
> >
> > COS=1
> > Rate: 2085.73 creates/sec (total: 2 threads 625719 creates 300 secs)
> > Rate: 2072.38 creates/sec (total: 2 threads 621713 creates 300 secs)
> > Rate: 2053.80 creates/sec (total: 2 threads 616141 creates 300 secs)
> >
> > COS=0
> > Rate: 2128.22 creates/sec (total: 2 threads 638465 creates 300 secs)
> > Rate: 2151.05 creates/sec (total: 2 threads 645316 creates 300 secs)
> > Rate: 2169.91 creates/sec (total: 2 threads 650974 creates 300 secs)
> >
> > Clean HEAD
> >
> > Rate: 2199.67 creates/sec (total: 2 threads 659900 creates 300 secs)
> > Rate: 2213.95 creates/sec (total: 2 threads 664184 creates 300 secs)
> > Rate: 2186.74 creates/sec (total: 2 threads 658209 creates 301 secs)
> > Rate: 2206.10 creates/sec (total: 2 threads 661831 creates 300 secs)
> > Rate: 2217.49 creates/sec (total: 2 threads 665247 creates 300 secs)
> > Rate: 2230.86 creates/sec (total: 2 threads 669258 creates 300 secs)
>
> Any idea why COS=0 was consistently a bit slower than HEAD?
I only suspect the ENTRY/EXIT stuff I added to several ptlrpc functions.
Next time I will try to build Lustre with CDEBUG_ENTRY_EXIT=0.
> robert
The latest results I got after increasing the ldlm resource
hash table size are:
zam at lts-head:~$ zgrep Rate: m-ht-improved-WASHIE3.4-20081017.log
COS = 1
Rate: 1887.57 creates/sec (total: 2 threads 566271 creates 300 secs)
Rate: 1980.18 creates/sec (total: 2 threads 594053 creates 300 secs)
Rate: 1923.76 creates/sec (total: 2 threads 577129 creates 300 secs)
COS = 0
Rate: 1804.92 creates/sec (total: 2 threads 541475 creates 300 secs)
Rate: 1915.29 creates/sec (total: 2 threads 574587 creates 300 secs)
Rate: 2025.70 creates/sec (total: 2 threads 607711 creates 300 secs)
CLEAN HEAD:
zam at lts-head:~$ zgrep Rate: m-cleanhead-WASHIE3.4-20081017.log
Rate: 1986.13 creates/sec (total: 2 threads 595839 creates 300 secs)
Rate: 2009.84 creates/sec (total: 2 threads 602952 creates 300 secs)
Rate: 1887.05 creates/sec (total: 2 threads 566115 creates 300 secs)
Rate: 2060.59 creates/sec (total: 2 threads 618178 creates 300 secs)
Rate: 1868.89 creates/sec (total: 2 threads 560666 creates 300 secs)
Rate: 1976.95 creates/sec (total: 2 threads 594084 creates 301 secs)
zam at lts-head:~$
average values:
COS=1 : 1930.5
COS=0 : 1915.3
HEAD : 1964.9
These results were obtained on another test cluster,
so they are not directly comparable with the previous ones.
The profiling chart for COS=1; ldlm_resource_find no longer appears in the top 20:
198152 2.9052 e1000.ko e1000 e1000_irq_enable
158535 2.3243 e1000.ko e1000 e1000_intr_msi
152375 2.2340 vmlinux vmlinux schedule
148709 2.1803 obdclass.ko obdclass htable_lookup
109440 1.6045 lvfs.ko lvfs lprocfs_counter_add
102773 1.5068 vmlinux vmlinux __switch_to
95607 1.4017 vmlinux vmlinux __find_get_block
90447 1.3261 vmlinux vmlinux kfree
88713 1.3006 vmlinux vmlinux mwait_idle
84511 1.2390 vmlinux vmlinux try_to_wake_up
83801 1.2286 vmlinux vmlinux memset
79735 1.1690 vmlinux vmlinux __kmalloc
78594 1.1523 oprofiled oprofiled (no symbols)
71184 1.0437 ksocklnd.ko ksocklnd ksocknal_scheduler
65613 0.9620 vmlinux vmlinux kmem_cache_free
64525 0.9460 mds.ko mds mds_lov_dump_objids
64208 0.9414 ptlrpc.ko ptlrpc ptlrpc_main
63260 0.9275 vmlinux vmlinux memmove
56148 0.8232 vmlinux vmlinux find_get_page
55011 0.8065 vmlinux vmlinux kmem_cache_alloc
I attached all the performance enhancement patches
(except the latest one, the ldlm resource hash table size increase)
to the COS bug (Bug 15393).
Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
Thread overview: 13+ messages
2008-10-08 11:44 [Lustre-devel] COS performance issues Alexander Zarochentsev
2008-10-08 11:48 ` Alex Zhuravlev
2008-10-12 18:41 ` Alexander Zarochentsev
2008-10-12 19:12 ` Alex Zhuravlev
2008-10-13 14:36 ` Alexander Zarochentsev
2008-10-13 15:04 ` Alex Zhuravlev
2008-10-17 8:07 ` Alexander Zarochentsev
2008-10-17 20:55 ` Robert Read
2008-10-18 10:34 ` Alexander Zarochentsev
2008-10-11 16:00 ` Andreas Dilger
2008-10-11 20:18 ` Eric Barton
2008-10-12 15:51 ` Alexander Zarochentsev