From mboxrd@z Thu Jan 1 00:00:00 1970
From: Stefan Priebe - Profihost AG
Subject: Re: speedup ceph / scaling / find the bottleneck
Date: Fri, 29 Jun 2012 15:22:43 +0200
Message-ID: <4FEDAC23.1000705@profihost.ag>
References: <4FED8792.1090905@profihost.ag> <4FED964D.3080201@inktank.com> <4FEDA777.1060309@profihost.ag> <4FEDA978.3050106@profihost.ag> <4FEDAA9B.4040101@profihost.ag>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail.profihost.ag ([85.158.179.208]:50998 "EHLO mail.profihost.ag" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754038Ab2F2NWu (ORCPT ); Fri, 29 Jun 2012 09:22:50 -0400
In-Reply-To: <4FEDAA9B.4040101@profihost.ag>
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Mark Nelson
Cc: "ceph-devel@vger.kernel.org", Alexandre DERUMIER

iostat output (iostat -x -t 5) during 4k random writes:

06/29/2012 03:20:55 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          31,63    0,00   52,64    0,78    0,00   14,95

Device:  rrqm/s   wrqm/s    r/s      w/s  rsec/s    wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdb        0,00   690,40   0,00  3143,60    0,00  33958,80    10,80     2,68   0,85   0,08  24,08
sdc        0,00  1069,80   0,00  5151,60    0,00  54693,00    10,62     8,31   1,61   0,06  29,68
sdd        0,00   581,00   0,00  2762,80    0,00  27809,00    10,07     2,45   0,89   0,08  21,12
sde        0,00   820,00   0,00  4208,20    0,00  43457,40    10,33     4,00   0,95   0,07  28,56
sda        0,00     0,00   0,00     0,40    0,00      9,60    24,00     0,00   0,00   0,00   0,00

06/29/2012 03:21:00 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29,68    0,00   52,89    0,98    0,00   16,45

Device:  rrqm/s   wrqm/s    r/s      w/s  rsec/s    wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdb        0,00  1046,60   0,00  5544,20    0,00  57938,00    10,45     6,08   1,10   0,06  32,08
sdc        0,00   115,60   0,00  3483,60    0,00  29368,00     8,43     3,45   0,99   0,06  21,36
sdd        0,00  1143,20   0,00  5991,00    0,00  62607,40    10,45     6,03   1,01   0,06  35,20
sde        0,00  1070,00   0,00  5561,60    0,00  58207,20    10,47     5,76   1,04   0,07  38,08
sda        0,00     0,00   0,00     0,00    0,00      0,00     0,00     0,00   0,00   0,00   0,00

06/29/2012 03:21:05 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29,69    0,00   53,06    0,60    0,00   16,65

Device:  rrqm/s   wrqm/s    r/s      w/s  rsec/s    wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdb        0,00   199,60   0,00  4484,40    0,00  41338,20     9,22     1,96   0,44   0,07  30,56
sdc        0,00   766,60   0,00  3616,20    0,00  38829,00    10,74     3,62   1,00   0,07  25,68
sdd        0,00   149,20   0,00  5066,60    0,00  45793,60     9,04     4,48   0,89   0,06  28,48
sde        0,00   150,00   0,00  4328,80    0,00  36496,00     8,43     2,96   0,68   0,07  32,40
sda        0,00     0,00   0,00     0,40    0,00     35,20    88,00     0,00   0,00   0,00   0,00

06/29/2012 03:21:10 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29,11    0,00   46,58    0,50    0,00   23,81

Device:  rrqm/s   wrqm/s    r/s      w/s  rsec/s    wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sdb        0,00   881,20   0,00  3077,20    0,00  33382,80    10,85     3,44   1,12   0,06  18,16
sdc        0,00   867,60   0,00  5098,40    0,00  52056,20    10,21     5,65   1,11   0,05  24,32
sdd        0,00   864,40   0,00  2759,00    0,00  30321,60    10,99     3,39   1,23   0,06  17,36
sde        0,00   846,20   0,00  3193,40    0,00  36795,60    11,52     3,48   1,09   0,06  19,92
sda        0,00     0,00   0,00     1,40    0,00     11,20     8,00     0,01   4,57   2,29   0,32

On 29.06.2012 15:16, Stefan Priebe - Profihost AG wrote:
> Big sorry. ceph was scrubbing during my last test. I didn't notice this.
>
> When I redo the test I see writes between 20MB/s and 100MB/s. That is
> OK. Sorry.
>
> Stefan
>
> On 29.06.2012 15:11, Stefan Priebe - Profihost AG wrote:
>> Another BIG hint.
>>
>> While doing random 4k I/O from one VM I achieve 14k IOPS. This is
>> around 54MB/s. But EACH ceph-osd machine is writing between 500MB/s and
>> 750MB/s. What do they write?!?!
>>
>> Just an idea:
>> Do they completely rewrite EACH 4MB block for each 4k write?
>>
>> Stefan
>>
>> On 29.06.2012 15:02, Stefan Priebe - Profihost AG wrote:
>>> On 29.06.2012 13:49, Mark Nelson wrote:
>>>> I'll try to replicate your findings in house. I've got some other
>>>> things I have to do today, but hopefully I can take a look next
>>>> week.
>>>> If I recall correctly, in the other thread you said that sequential writes
>>>> are using much less CPU time on your systems?
>>>
>>> Random 4k writes: 10% idle
>>> Seq 4k writes: !! 99,7% !! idle
>>> Seq 4M writes: 90% idle
>>>
>>>
>>> > Do you see better scaling in that case?
>>>
>>> 3 osd nodes:
>>>   1 VM:
>>>     Rand 4k writes: 7000 iops
>>>     Seq 4k writes: 19900 iops
>>>
>>>   2 VMs:
>>>     Rand 4k writes: 6000 iops each
>>>     Seq 4k writes: 4000 iops (VM 1)
>>>     Seq 4k writes: 18500 iops (VM 2)
>>>
>>>
>>> 4 osd nodes:
>>>   1 VM:
>>>     Rand 4k writes: 14400 iops
>>>     Seq 4k writes: 19000 iops
>>>
>>>   2 VMs:
>>>     Rand 4k writes: 7000 iops each
>>>     Seq 4k writes: 18000 iops each
>>>
>>>
>>>
>>>> To figure out where CPU is being used, you could try various options:
>>>> oprofile, perf, valgrind, strace. Each has its own advantages.
>>>>
>>>> Here's how you can create a simple callgraph with perf:
>>>>
>>>> http://lwn.net/Articles/340010/
>>> 10s perf data output while doing random 4k writes:
>>> https://raw.github.com/gist/2c16136faebec381ae35/09e6de68a5461a198430a9ec19dfd5392f276706/gistfile1.txt
>>>
>>>
>>>
>>>
>>> Stefan
>>
>
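
The throughput figures in this thread can be cross-checked with a little arithmetic. The sketch below is only an illustration and was not part of the original test. It assumes that the wsec/s column of iostat -x counts 512-byte sectors, that sdb through sde are the OSD data disks (sda looks like the system disk), and that "4k" means 4096 bytes. The function names are hypothetical. It parses the decimal-comma values from the 03:20:55 PM sample and compares the summed disk write rate with the rate implied by 14k IOPS of 4k writes.

```python
SECTOR_BYTES = 512   # assumption: iostat -x reports rsec/s and wsec/s in 512-byte sectors
MIB = 1024 * 1024


def parse_decimal_comma(text):
    """Convert a number printed with a decimal comma (e.g. '33958,80') to a float."""
    return float(text.replace(",", "."))


def sectors_to_mib_per_s(sectors_per_s):
    """Convert a sectors-per-second rate into MiB/s."""
    return sectors_per_s * SECTOR_BYTES / MIB


def iops_to_mib_per_s(iops, block_bytes=4096):
    """Throughput implied by a given IOPS figure at a fixed block size."""
    return iops * block_bytes / MIB


# wsec/s values copied from the 03:20:55 PM iostat sample.
# Assumption: sdb..sde are the OSD data disks on this node.
wsec_per_s = {
    "sdb": "33958,80",
    "sdc": "54693,00",
    "sdd": "27809,00",
    "sde": "43457,40",
}

disk_mib = {
    dev: sectors_to_mib_per_s(parse_decimal_comma(value))
    for dev, value in wsec_per_s.items()
}
total_disk_mib = sum(disk_mib.values())
client_mib = iops_to_mib_per_s(14000)

for dev in sorted(disk_mib):
    print(f"{dev}: {disk_mib[dev]:.1f} MiB/s")
print(f"all data disks on this node: {total_disk_mib:.1f} MiB/s")
print(f"14k IOPS at 4 KiB: {client_mib:.1f} MiB/s")
```

Under these assumptions, 14k IOPS at 4 KiB works out to about 54.7 MiB/s, which matches the "around 54MB/s" quoted above, and the four data disks in that sample together write about 78 MiB/s. The calculation only converts units; it says nothing about where any additional writes come from.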